Re: [lxc-devel] cgroup management daemon
Hello, Tim.

On Tue, Dec 03, 2013 at 08:53:21PM -0800, Tim Hockin wrote:
> If this daemon works as advertised, we will explore moving all write
> traffic to use it. I still have concerns that this can't handle read
> traffic at the scale we need.

At least from the kernel side, cgroup doesn't and won't have any
problem with direct reads.

> Tejun, I am not sure why chown came back into the conversation. This
> is a replacement for that.

I guess I'm just confused because of the mentions of chown. If it
isn't about giving unmoderated write access to untrusted domains,
everything should be fine.

Thanks!

-- tejun
Re: [lxc-devel] cgroup management daemon
Quoting Tim Hockin (thoc...@google.com):
> If this daemon works as advertised, we will explore moving all write
> traffic to use it. I still have concerns that this can't handle read
> traffic at the scale we need.
>
> Tejun, I am not sure why chown came back into the conversation. This
> is a replacement for that.

Because the daemon is chowning directories and files. That's how the
daemon decides whether clients have access.

-serge
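A minimal C sketch of the chown-based bookkeeping Serge describes: the
manager creates the cgroup directory and chowns it (and the tasks
file) to the requestor, so later access checks reduce to ordinary
filesystem ownership. make_cgroup() and the paths are illustrative,
not cgmanager's actual API.

#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

static int make_cgroup(const char *mnt, const char *name,
                       uid_t uid, gid_t gid)
{
    char path[4096];

    snprintf(path, sizeof(path), "%s/%s", mnt, name);
    if (mkdir(path, 0755) < 0)
        return -1;
    /* Record the owner; later checks are just stat() + uid compare. */
    if (chown(path, uid, gid) < 0)
        return -1;

    snprintf(path, sizeof(path), "%s/%s/tasks", mnt, name);
    return chown(path, uid, gid); /* let the owner move its own tasks */
}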
Re: [lxc-devel] cgroup management daemon
Quoting Victor Marmol (vmar...@google.com):
> I thought we were going to use chown in the initial version to
> enforce the ownership/permissions on the hierarchy. Only the cgroup
> manager has access to the hierarchy, but it tries to access the
> hierarchy as the user that sent the request. It was only meant to be
> a "for now" solution while the real one rolls out. It may also have
> gotten thrown out since last I heard :)

Actually that part wasn't meant as a "for now" solution. It can of
course be thrown away in favor of having the daemon store all this
information, but I'm seeing no advantages to that right now. There are
other things which the daemon can eventually try to keep track of, if
we don't decide they belong in a higher layer.

-serge
Re: [lxc-devel] cgroup management daemon
On Wed, Dec 04, 2013 at 09:54:37AM -0600, Serge Hallyn wrote:
> Quoting Tim Hockin (thoc...@google.com):
>> If this daemon works as advertised, we will explore moving all write
>> traffic to use it. I still have concerns that this can't handle read
>> traffic at the scale we need.
>>
>> Tejun, I am not sure why chown came back into the conversation.
>> This is a replacement for that.
>
> Because the daemon is chowning directories and files. That's how the
> daemon decides whether clients have access.

Ah, okay, so the manager is just using filesystem metadata for
bookkeeping. That should be fine. Please note that the cgroup
filesystem also supports xattr, and AFAIK systemd is already making
use of it.

Thanks.

-- tejun
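A minimal sketch of stashing manager bookkeeping in cgroupfs extended
attributes, as Tejun suggests. cgroupfs accepts xattrs in the
"trusted." namespace (which is what systemd uses); the attribute name,
value, and cgroup path below are made up for illustration, and
trusted.* xattrs require CAP_SYS_ADMIN (fine for the daemon).

#include <stdio.h>
#include <string.h>
#include <sys/xattr.h>

int main(void)
{
    const char *cg = "/sys/fs/cgroup/memory/mygroup"; /* assumed path */
    const char *owner = "100000";                     /* host uid, as text */
    char buf[64];
    ssize_t n;

    if (setxattr(cg, "trusted.cgmanager.owner", owner,
                 strlen(owner), 0) < 0) {
        perror("setxattr");
        return 1;
    }
    n = getxattr(cg, "trusted.cgmanager.owner", buf, sizeof(buf) - 1);
    if (n >= 0) {
        buf[n] = '\0';
        printf("owner xattr: %s\n", buf);
    }
    return 0;
}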
Re: [lxc-devel] cgroup management daemon
Hello, guys.

Sorry about the delay.

On Mon, Nov 25, 2013 at 10:43:35PM +0000, Serge E. Hallyn wrote:
> Additionally, Tejun has specified that we do not want users to be too
> closely tied to the cgroupfs implementation. Therefore commands will
> be just a hair more general than specifying cgroupfs filenames and
> values. I may go so far as to avoid specifying specific controllers,
> as AFAIK there should be no redundancy in features. On the other
> hand, I don't want to get too general. So I'm basing the API loosely
> on the lmctfy command line API.

One of the reasons for not exposing knobs as-is is that the knobs we
currently have aren't consistent. The weight values have different
ranges, some combinations of values don't make much sense, and so on.
The user can cope with it, but it'd probably be better to expose
something which doesn't lead to mistakes too easily.

> The above addresses
>   * creating cgroups
>   * chowning cgroups
>   * setting cgroup limits
>   * moving tasks into cgroups
> ... but does not address a 'cgexec <group> -- command' type of
> behavior. To handle that (specifically for upstart), recommend that r
> do:
>
>     if (!pid) {
>         request_reclassify(cgroup, getpid());
>         do_execve();
>     }
>
> ... alternatively, the daemon could, if the kernel is new enough,
> setns to the requestor's namespaces to execute a command in a new
> cgroup. The new command would be daemonized to that pid namespace's
> pid 1.

So, IIUC, cgroup hierarchy management - creation and removal of
cgroups and assignments of tasks - will go through, while configuring
control knobs will be delegated to the cgroup owner, right?

Hmmm... the plan is to allow delegating task assignments in the
sub-hierarchy but require CAP_X for writes to knobs (not reads). This
stems from the fact that, especially with unified hierarchy, those
operations will be cgroup-core proper operations which are gonna be
relatively safer, and that task organization in the sub-hierarchy and
monitoring knobs are likely to be higher-frequency operations than
enabling and configuring controllers.

As I communicated multiple times before, delegating write access to
control knobs to an untrusted domain has always been a security risk
and is likely to continue to remain so. Also, organizationally, a
cgroup's control knobs belong to the parent, not the cgroup itself.
That probably is why you were thinking about putting an extra cgroup
inbetween for isolation, but the root problem there is that those
knobs belong to the parent, not the directory itself.

Security is in most part logistics - it's about getting all the
details right - and we don't either design or implement each knob with
security in mind, and DoSing them has always been pretty easy, so I
don't think delegating write access to knobs is a good idea. If you,
for whatever reason, can trust the delegatee, which I believe is the
case for google, it's fine. If you're trying to delegate to a
container which you don't have any control over, it isn't a good idea.

Another thing to consider is that, due to both the fundamental
characteristics of hierarchy and implementation issues, things will
become expensive if nesting gets beyond several layers (if controllers
are enabled, that is), and the controllers in general will be
implemented and optimized with a limited level of nesting in mind.
IOW, building, say, an 8-level-deep hierarchy in the host and then
doing the same thing inside the container with controllers enabled
won't make a very happy system. It probably is something to keep in
mind when laying out how the whole thing eventually would look like.
> Long-term we will want the cgroup manager to become more intelligent
> - to place its own limits on clients, to address cpu and device
> hotplug, etc. Since we will not be doing that in the first prototype,
> the daemon will not keep any state about the clients.

Isn't the above conflicting with chowning control knobs?

Thanks.

-- tejun
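The fork/reclassify/exec flow quoted in the proposal above, expanded
into a compilable C sketch. request_reclassify() is the hypothetical
client call from the proposal, not an existing library function.

#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

extern int request_reclassify(const char *cgroup, pid_t pid); /* assumed */

static int cgexec(const char *cgroup, char *const argv[])
{
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* Child: enter the cgroup first, so the exec'd command is
         * confined from its very first instruction. */
        if (request_reclassify(cgroup, getpid()) < 0)
            _exit(126);
        execvp(argv[0], argv);
        _exit(127);            /* exec failed */
    }
    /* Parent: wait, as a plain 'cgexec' front-end would. */
    int status;
    waitpid(pid, &status, 0);
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}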
Re: [lxc-devel] cgroup management daemon
Ooh, can you also please cc Li Zefan (lize...@huawei.com) when
replying?

Thanks.

-- tejun
Re: [lxc-devel] cgroup management daemon
Hello, Tim.

On Mon, Nov 25, 2013 at 08:58:09PM -0800, Tim Hockin wrote:
> Thanks for this! I think it helps a lot to discuss now, rather than
> over nearly-done code.
>
> On Mon, Nov 25, 2013 at 2:43 PM, Serge E. Hallyn <se...@hallyn.com> wrote:
>> Additionally, Tejun has specified that we do not want users to be
>> too closely tied to the cgroupfs implementation. Therefore commands
>> will be just a hair more general than specifying cgroupfs filenames
>> and values. I may go so far as to avoid specifying specific
>> controllers, as AFAIK there should be no redundancy in features. On
>> the other hand, I don't want to get too general. So I'm basing the
>> API loosely on the lmctfy command line API.
>
> I'm torn here. While I agree in principle with Tejun, I am concerned
> that this agent will always lag new kernel features or that the thin
> abstraction you want to provide here does not easily accommodate some
> of the more ... oddball features of one cgroup interface or another.

Yeah, that's the trade-off, but cgroupfs is a kernel API. It shouldn't
change or grow rapidly once things settle down. As long as there's a
not-too-crazy way to step aside when such a rare case arises, I think
the pros outweigh the cons.

> This agent is the very bottom of the stack, and should probably not
> do much by way of abstraction. I think I'd rather let something like
> lmctfy provide the abstraction more holistically, and relegate this
> agent to very simple plumbing and policy. It could be as simple as
> providing read/write/etc ops to specific control files. It needs to
> handle event_fd, too, I guess. This has the nice side-effect of
> always being current on kernel features :)

The level of abstraction is definitely something debatable. Please
note that the existing event_fd based mechanism won't grow any new
users (BTW, event_control is one of the DoS vectors if you give write
access to it) and all new notifications will be using inotify.

Thanks.

-- tejun
Re: [lxc-devel] cgroup management daemon
Hello,

On Tue, Nov 26, 2013 at 09:19:18AM -0800, Victor Marmol wrote:
>>> From my discussions with Tejun, he wanted to move to using inotify
>>> so it may still be an fd we pass around.
>>
>> Hm, would that just be inotify on the memory.max_usage_in_bytes
>> file, or inotify on a specific fd you've created which is associated
>> with any threshold you specify? The former seems less ideal.
>
> Tejun can comment more, but I think it is still TBD.

It's likely the former, with configurable cadence or per-knob (not
per-opener) configurable thresholds. max_usage_in_bytes is a special
case here, as all other knobs can simply generate an event on each
transition. If event (de)muxing is necessary, it probably should be
done from userland.

Thanks.

-- tejun
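A minimal sketch of the inotify-based notification Tejun describes:
watch a cgroup control file for modification events. Whether the
kernel generates IN_MODIFY on memory.max_usage_in_bytes (and at what
cadence) was still TBD in this thread, so treat the path and event
choice as assumptions.

#include <stdio.h>
#include <sys/inotify.h>
#include <unistd.h>

int main(void)
{
    const char *knob =
        "/sys/fs/cgroup/memory/mygroup/memory.max_usage_in_bytes";
    char buf[4096];
    int fd, wd;

    fd = inotify_init1(IN_CLOEXEC);
    if (fd < 0) {
        perror("inotify_init1");
        return 1;
    }
    wd = inotify_add_watch(fd, knob, IN_MODIFY);
    if (wd < 0) {
        perror("inotify_add_watch");
        return 1;
    }
    /* Block until the kernel signals a change on the knob. */
    if (read(fd, buf, sizeof(buf)) > 0)
        printf("knob changed; re-read %s for the new value\n", knob);
    close(fd);
    return 0;
}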
Re: [lxc-devel] cgroup management daemon
Quoting Tejun Heo (t...@kernel.org):
> Hello, guys.
>
> Sorry about the delay.
>
> On Mon, Nov 25, 2013 at 10:43:35PM +0000, Serge E. Hallyn wrote:
>> Additionally, Tejun has specified that we do not want users to be
>> too closely tied to the cgroupfs implementation. Therefore commands
>> will be just a hair more general than specifying cgroupfs filenames
>> and values. I may go so far as to avoid specifying specific
>> controllers, as AFAIK there should be no redundancy in features. On
>> the other hand, I don't want to get too general. So I'm basing the
>> API loosely on the lmctfy command line API.
>
> One of the reasons for not exposing knobs as-is is that the knobs we
> currently have aren't consistent. The weight values have different
> ranges, some combinations of values don't make much sense, and so on.
> The user can cope with it but it'd probably be better to expose
> something which doesn't lead to mistakes too easily.

For the moment, for the prototype (github.com/hallyn/cgmanager), I'm
just going with filenames/values. When the bulk of the work is done,
we can either (or both) (a) introduce a thin abstraction layer over
the key/values, and/or (b) whitelist some of the filenames and filter
some values. I know the upstart folks don't want to have to wait long
for a specification... I'll hopefully make a final decision on this
next week.

>> The above addresses
>>   * creating cgroups
>>   * chowning cgroups
>>   * setting cgroup limits
>>   * moving tasks into cgroups
>> ... but does not address a 'cgexec <group> -- command' type of
>> behavior. To handle that (specifically for upstart), recommend that
>> r do:
>>
>>     if (!pid) {
>>         request_reclassify(cgroup, getpid());
>>         do_execve();
>>     }
>>
>> ... alternatively, the daemon could, if the kernel is new enough,
>> setns to the requestor's namespaces to execute a command in a new
>> cgroup. The new command would be daemonized to that pid namespace's
>> pid 1.
>
> So, IIUC, cgroup hierarchy management - creation and removal of
> cgroups and assignments of tasks - will go through, while configuring
> control knobs will be delegated to the cgroup owner, right?

Not sure what you mean, but I think the answer is no. Everything goes
through the manager. The manager doesn't try to enforce that, but by
default the cgroup filesystems will only be mounted in the manager's
private mnt_ns, and containers at least will not be allowed to mount
the cgroup fstype.

> Hmmm... the plan is to allow delegating task assignments in the
> sub-hierarchy but require CAP_X for writes to knobs (not reads). This
> stems from the fact that, especially with unified hierarchy, those
> operations will be cgroup-core proper operations which are gonna be
> relatively safer, and that task organization in the sub-hierarchy and
> monitoring knobs are likely to be higher-frequency operations than
> enabling and configuring controllers.

Should be ok for this.

> As I communicated multiple times before, delegating write access to
> control knobs to an untrusted domain has always been a security risk
> and is likely to continue to remain so. Also, organizationally, a

Then that will need to be addressed with per-key blacklisting and/or
per-value filtering in the manager. Which is my way of saying: can we
please have a list of the security issues so we can handle them? :)
(I've asked several times before but haven't seen a list or anyone
offering to make one)

> cgroup's control knobs belong to the parent, not the cgroup itself.

After thinking awhile I think this makes perfect sense. I haven't
implemented set_value yet, and when I do I think I'll implement this
guideline.
> That probably is why you were thinking about putting an extra cgroup
> inbetween for isolation, but the root problem there is that those
> knobs belong to the parent, not the directory itself.

Yup.

> Security is in most part logistics - it's about getting all the
> details right - and we don't either design or implement each knob
> with security in mind, and DoSing them has always been pretty easy,
> so I don't think delegating write access to knobs is a good idea. If
> you, for whatever reason, can trust the delegatee, which I believe is
> the case for google, it's fine. If you're trying to delegate to a
> container which you don't have any control over, it isn't a good
> idea.
>
> Another thing to consider is that, due to both the fundamental
> characteristics of hierarchy and implementation issues, things will
> become expensive if nesting gets beyond several layers (if
> controllers are enabled, that is), and the controllers in general
> will be implemented and optimized with a limited level of nesting in
> mind. IOW, building, say, an 8-level-deep hierarchy in the host and
> then doing the same thing inside the container with controllers
> enabled won't make a very happy

Yes, I very much want to avoid that.

> system. It probably is something to keep in mind when laying out how
> the whole thing eventually would look like.

>> Long-term we will want
Re: [lxc-devel] cgroup management daemon
Hello, Serge.

On Tue, Dec 03, 2013 at 06:03:44PM -0600, Serge Hallyn wrote:
>> As I communicated multiple times before, delegating write access to
>> control knobs to an untrusted domain has always been a security risk
>> and is likely to continue to remain so. Also, organizationally, a
>
> Then that will need to be addressed with per-key blacklisting and/or
> per-value filtering in the manager. Which is my way of saying: can we
> please have a list of the security issues so we can handle them? :)
> (I've asked several times before but haven't seen a list or anyone
> offering to make one)

Unfortunately, for now, please consider everything blacklisted. Yes,
it is true that some knobs should be mostly safe, but given the level
of changes we're going through and the difficulty of properly auditing
anything for delegation to an untrusted environment, I don't feel
comfortable at all about delegating through chown. It is an accidental
feature which happened just because cgroup uses the filesystem as its
interface, and it is nowhere near the top of the todo list. It has
never worked properly and won't in any foreseeable future.

>> cgroup's control knobs belong to the parent, not the cgroup itself.
>
> After thinking awhile I think this makes perfect sense. I haven't
> implemented set_value yet, and when I do I think I'll implement this
> guideline.

I'm kinda confused here. You say *everything* is gonna go through the
manager and then talk about chowning directories. Don't the two
conflict?

>>> Long-term we will want the cgroup manager to become more
>>> intelligent - to place its own limits on clients, to address cpu
>>> and device hotplug, etc. Since we will not be doing that in the
>>> first prototype, the daemon will not keep any state about the
>>> clients.
>>
>> Isn't the above conflicting with chowning control knobs?
>
> Not sure what you mean by this. To be clear, what I'm talking about
> is having the client be able to say "grant 50% of cpus", and then
> when more cpus are added, the actual cpuset gets recalculated. This
> may well forever stay outside of the cgmanager scope. It may be more
> appropriate to put that logic into the lmctfy layer.

Yes, something like that would be nice, but if you give out raw access
to the control knobs by chowning them, I just don't see how that would
be implementable. What am I missing here?

Thanks.

-- tejun
Re: [lxc-devel] cgroup management daemon
And can somebody please fix up lxc-devel so that it doesn't generate a
"your message awaits moderator approval" notification on *each*
message? :(

-- tejun
Re: [lxc-devel] cgroup management daemon
Quoting Tejun Heo (t...@kernel.org):
> Hello, Serge.
>
> On Tue, Dec 03, 2013 at 06:03:44PM -0600, Serge Hallyn wrote:
>>> As I communicated multiple times before, delegating write access to
>>> control knobs to an untrusted domain has always been a security
>>> risk and is likely to continue to remain so. [...]
>
> Unfortunately, for now, please consider everything blacklisted. [...]
> It has never worked properly and won't in any foreseeable future.
>
> I'm kinda confused here. You say *everything* is gonna go through the
> manager and then talk about chowning directories. Don't the two
> conflict?

No. I expect the user - except in the google case - to either have
access to no cgroupfs mounts, or readonly mounts.

-serge
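A sketch of the "readonly mounts" option Serge mentions: bind-mount a
cgroup hierarchy into a container's rootfs and remount it read-only.
Note that read-only only takes effect on a second, bind-remount step;
the paths are illustrative.

#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
    const char *src = "/sys/fs/cgroup/memory";  /* host's view */
    const char *dst = "/var/lib/lxc/c1/rootfs/sys/fs/cgroup/memory";

    if (mount(src, dst, NULL, MS_BIND, NULL) < 0) {
        perror("bind mount");
        return 1;
    }
    /* Read-only must be applied as a bind remount. */
    if (mount(src, dst, NULL, MS_BIND | MS_REMOUNT | MS_RDONLY, NULL) < 0) {
        perror("ro remount");
        return 1;
    }
    return 0;
}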
Re: [lxc-devel] cgroup management daemon
If this daemon works as advertised, we will explore moving all write
traffic to use it. I still have concerns that this can't handle read
traffic at the scale we need.

Tejun, I am not sure why chown came back into the conversation. This
is a replacement for that.

On Tue, Dec 3, 2013 at 6:31 PM, Serge Hallyn <serge.hal...@ubuntu.com> wrote:
> Quoting Tejun Heo (t...@kernel.org):
>> [...]
>> I'm kinda confused here. You say *everything* is gonna go through
>> the manager and then talk about chowning directories. Don't the two
>> conflict?
>
> No. I expect the user - except in the google case - to either have
> access to no cgroupfs mounts, or readonly mounts.
>
> -serge
Re: [lxc-devel] cgroup management daemon
I thought we were going to use chown in the initial version to enforce
the ownership/permissions on the hierarchy. Only the cgroup manager
has access to the hierarchy, but it tries to access the hierarchy as
the user that sent the request. It was only meant to be a "for now"
solution while the real one rolls out. It may also have gotten thrown
out since last I heard :)

On Tue, Dec 3, 2013 at 8:53 PM, Tim Hockin <thoc...@google.com> wrote:
> If this daemon works as advertised, we will explore moving all write
> traffic to use it. I still have concerns that this can't handle read
> traffic at the scale we need.
>
> Tejun, I am not sure why chown came back into the conversation. This
> is a replacement for that.
>
> [...]
Re: [lxc-devel] cgroup management daemon
At the start of this discussion, some months ago, we offered to
co-devel this with Lennart et al. They did not seem keen on the idea.
If they have an established DBUS protocol spec, we should consider
adopting it instead of a new one, but we CAN'T just play
follow-the-leader and do whatever they do, changing whenever they feel
like changing.

It would be best if we could get a common DBUS api specc'ed and all
agree to it. Serge, do you feel up to that?

On Mon, Nov 25, 2013 at 6:18 PM, Michael H. Warfield <m...@wittsend.com> wrote:
> Serge...
>
> You have no idea how much I dread mentioning this (well, after
> LinuxPlumbers, maybe you can) but... You do realize that some of this
> is EXACTLY what the systemd crowd was talking about there in NOLA
> back then. I sat in those sessions grinding my teeth and listening to
> comments from some others around me about when systemd might subsume
> bash or even vi or quake.
>
> Somehow, you and others have tagged me as a systemd expert, but I am
> far from it, and even you noted that Lennart and I were on the edge
> of a physical discussion when I made some off-the-cuff remarks there
> about systemd design during my talk. I personally rank systemd in the
> same category as NetworkMangler (err, NetworkManager) in its
> propensity for committing inexplicable random acts of terrorism and
> changing its behavior from release to release to release. I'm not a
> fan and I'm not an expert, but I have to be involved with it and
> watch the damned thing like a trapped rat, like it or not.
>
> Like it or not, we can not go off on divergent designs. As much as
> they have delusions of taking over the Linux world, they are still
> going to be a major factor, and this sort of thing needs to be
> coordinated. We are going to need exactly what you are proposing
> whether we have systemd in play or not. IF we CAN kick it to the
> curb, when we need to, we still need to know how to without tearing
> shit up and breaking shit that thinks it's there. Ideally, it
> shouldn't matter whether systemd were in play or not.
>
> All I ask is that we not get so far off track that we have a major
> architectural divergence here. The risk is there.
>
> Mike
>
> On Mon, 2013-11-25 at 22:43 +0000, Serge E. Hallyn wrote:
>> Hi,
>>
>> As I've mentioned several times, I want to write a standalone cgroup
>> management daemon. Basic requirements are that it be a standalone
>> program; that a single instance running on the host be usable from
>> containers nested at any depth; that it not allow escaping one's
>> assigned limits; that it not allow subjugating tasks which do not
>> belong to you; and that, within your limits, you be able to parcel
>> those limits to your tasks as you like.
>>
>> Additionally, Tejun has specified that we do not want users to be
>> too closely tied to the cgroupfs implementation. Therefore commands
>> will be just a hair more general than specifying cgroupfs filenames
>> and values. I may go so far as to avoid specifying specific
>> controllers, as AFAIK there should be no redundancy in features. On
>> the other hand, I don't want to get too general. So I'm basing the
>> API loosely on the lmctfy command line API.
>>
>> One of the driving goals is to enable nested lxc as simply and
>> safely as possible. If this project is a success, then a large chunk
>> of code can be removed from lxc. I'm considering this project a part
>> of the larger lxc project, but given how central it is to systems
>> management, that doesn't mean that I'll consider anyone else's needs
>> as less important than our own.
>>
>> This document consists of two parts.
>> The first describes how I intend the daemon (cgmanager) to be
>> structured and how it will enforce the safety requirements. The
>> second describes the commands which clients will be able to send to
>> the manager. The list of controller keys which can be set is very
>> incomplete at this point, serving mainly to show the approach I was
>> thinking of taking.
>>
>> Summary
>>
>> Each 'host' (identified by a separate instance of the linux kernel)
>> will have exactly one running daemon to manage control groups. This
>> daemon will answer cgroup management requests over a dbus socket,
>> located at /sys/fs/cgroup/manager. This socket can be bind-mounted
>> into various containers, so that one daemon can support the whole
>> system.
>>
>> Programs will be able to make cgroup requests using dbus calls, or
>> indirectly by linking against lmctfy, which will be modified to use
>> the dbus calls if available.
>>
>> Outline:
>>   . A single manager, cgmanager, is started on the host, very early
>>     during boot. It has very few dependencies, and requires only
>>     /proc, /run, and /sys to be mounted, with /etc ro. It will mount
>>     the cgroup hierarchies in a private namespace and set defaults
>>     (clone_children, use_hierarchy, sane_behavior, release_agent?).
>>     It will open a socket at /sys/fs/cgroup/cgmanager (in a small
>>     tmpfs).
>>   . A client (requestor 'r') can make cgroup requests over
>>     /sys/fs/cgroup/manager using dbus calls. Detailed privilege
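The startup sequence from the outline above, expanded into a rough C
sketch: private mount namespace, per-controller cgroupfs mounts in a
small tmpfs, and the manager socket. The controller list, defaults
handling, and error paths are simplified illustration, not the actual
cgmanager code.

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <sys/mount.h>
#include <sys/socket.h>
#include <sys/stat.h>
#include <sys/un.h>

int main(void)
{
    const char *controllers[] = { "cpuset", "memory", "devices", "freezer" };
    char path[256];
    unsigned i;

    /* Private mount ns so containers never see these cgroupfs mounts. */
    if (unshare(CLONE_NEWNS) < 0 ||
        mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL) < 0)
        return 1;

    /* Small tmpfs to hold per-controller mount points and the socket. */
    if (mount("cgmanager", "/sys/fs/cgroup", "tmpfs", 0, "size=4096") < 0)
        return 1;

    for (i = 0; i < sizeof(controllers) / sizeof(controllers[0]); i++) {
        snprintf(path, sizeof(path), "/sys/fs/cgroup/%s", controllers[i]);
        if (mkdir(path, 0755) < 0 ||
            mount("cgroup", path, "cgroup", 0, controllers[i]) < 0)
            return 1;
        /* The real daemon would also set clone_children etc. here. */
    }

    /* The socket clients (and containers, via bind mount) talk to. */
    struct sockaddr_un sa = { .sun_family = AF_UNIX };
    strcpy(sa.sun_path, "/sys/fs/cgroup/cgmanager");
    int sk = socket(AF_UNIX, SOCK_SEQPACKET, 0);
    if (sk < 0 || bind(sk, (struct sockaddr *)&sa, sizeof(sa)) < 0)
        return 1;
    listen(sk, 16);
    /* ... accept loop, credential checks, etc. elided ... */
    return 0;
}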
Re: [lxc-devel] cgroup management daemon
Thanks for this! I think it helps a lot to discuss now, rather than
over nearly-done code.

On Mon, Nov 25, 2013 at 2:43 PM, Serge E. Hallyn <se...@hallyn.com> wrote:
> Additionally, Tejun has specified that we do not want users to be too
> closely tied to the cgroupfs implementation. Therefore commands will
> be just a hair more general than specifying cgroupfs filenames and
> values. I may go so far as to avoid specifying specific controllers,
> as AFAIK there should be no redundancy in features. On the other
> hand, I don't want to get too general. So I'm basing the API loosely
> on the lmctfy command line API.

I'm torn here. While I agree in principle with Tejun, I am concerned
that this agent will always lag new kernel features or that the thin
abstraction you want to provide here does not easily accommodate some
of the more ... oddball features of one cgroup interface or another.

This agent is the very bottom of the stack, and should probably not do
much by way of abstraction. I think I'd rather let something like
lmctfy provide the abstraction more holistically, and relegate this
agent to very simple plumbing and policy. It could be as simple as
providing read/write/etc ops to specific control files. It needs to
handle event_fd, too, I guess. This has the nice side-effect of always
being current on kernel features :)

> Summary
>
> Each 'host' (identified by a separate instance of the linux kernel)
> will have exactly one running daemon to manage control groups. This
> daemon will answer cgroup management requests over a dbus socket,
> located at /sys/fs/cgroup/manager. This socket can be bind-mounted
> into various containers, so that one daemon can support the whole
> system.
>
> Programs will be able to make cgroup requests using dbus calls, or
> indirectly by linking against lmctfy, which will be modified to use
> the dbus calls if available.
>
> Outline:
>   . A single manager, cgmanager, is started on the host, very early
>     during boot. It has very few dependencies, and requires only
>     /proc, /run, and /sys to be mounted, with /etc ro. It will mount
>     the cgroup hierarchies in a private namespace and set defaults
>     (clone_children, use_hierarchy, sane_behavior, release_agent?).
>     It will open a socket at /sys/fs/cgroup/cgmanager (in a small
>     tmpfs).

Where does the config come from? How do I specify which hierarchies I
want and where, and which flags?

>   . A client (requestor 'r') can make cgroup requests over
>     /sys/fs/cgroup/manager using dbus calls. Detailed privilege
>     requirements for r are listed below.
>   . The client request will pertain to an existing or new cgroup A.
>     r's privilege over the cgroup must be checked. r is said to have
>     privilege over A if A is owned by r's uid, or if A's owner is
>     mapped into r's user namespace, and r is root in that user
>     namespace.

Problem with this definition. Being owned-by is not the same as
has-root-in. Specifically, I may choose to give you root in your own
namespace, but you sure as heck can not increase your own memory
limit.

>   . The client request may pertain to a victim task v, which may be
>     moved to a new cgroup. In that case r's privilege over both the
>     cgroup and v must be checked. r is said to have privilege over v
>     if v is mapped in r's pid namespace, v's uid is mapped into r's
>     user ns, and r is root in its userns. Or if r and v have the same
>     uid and v is mapped in r's pid namespace.
>   . r's credentials will be taken from the socket's peercred,
>     ensuring that pid and uid are translated.
>   . r passes PID(v) as an SCM_CREDENTIAL, so that cgmanager receives
>     the translated global pid.
>     It will then read UID(v) from /proc/PID(v)/status, which is the
>     global uid, and check /proc/PID(r)/uid_map to see whether UID is
>     mapped there.
>   . dbus-send can be enhanced to send a pid as SCM_CREDENTIAL to have
>     the kernel translate it for the reader. Only 'move task v to
>     cgroup A' will require an SCM_CREDENTIAL to be sent.
>
> Privilege requirements by action:
>   * Requestor of an action (r) over a socket may only make changes to
>     cgroups over which it has privilege.
>   * Requestors may be limited to a certain #/depth of cgroups (to
>     limit memory usage) - DEFER?
>   * Cgroup hierarchy is responsible for resource limits.
>   * A requestor must either be uid 0 in its userns with the victim
>     mapped into its userns, or the same uid and in the same/ancestor
>     pidns as the victim.
>   * If r requests creation of cgroup '/x', /x will be interpreted as
>     relative to r's cgroup. r cannot make changes to cgroups not
>     under its own current cgroup.

Does this imply that r in a lower level (farther from root) of the
hierarchy can not make requests of higher levels of the hierarchy
(closer to root), even though they have permissions as per the
definition of privilege?

How do we reconcile this pseudo-virtualization with /proc/self/cgroup,
which DOES expose raw paths?
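A sketch of the credential plumbing described in the proposal: the
manager enables SO_PASSCRED on the client socket, receives the
victim's pid as SCM_CREDENTIALS (kernel-translated into the manager's
pidns), and scans /proc/<requestor>/uid_map to see whether a given
host uid is mapped into the requestor's user namespace. Helper names
are illustrative, not cgmanager's actual functions.

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Receive the victim's (translated) credentials from a connected fd. */
static int recv_victim_cred(int fd, struct ucred *out)
{
    char dummy, cbuf[CMSG_SPACE(sizeof(struct ucred))];
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    struct msghdr msg = {
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = cbuf, .msg_controllen = sizeof(cbuf),
    };
    int one = 1;

    setsockopt(fd, SOL_SOCKET, SO_PASSCRED, &one, sizeof(one));
    if (recvmsg(fd, &msg, 0) < 0)
        return -1;
    for (struct cmsghdr *c = CMSG_FIRSTHDR(&msg); c; c = CMSG_NXTHDR(&msg, c))
        if (c->cmsg_level == SOL_SOCKET && c->cmsg_type == SCM_CREDENTIALS) {
            memcpy(out, CMSG_DATA(c), sizeof(*out));
            return 0;
        }
    return -1;
}

/* Is host uid 'uid' mapped into the userns of requestor pid 'r'?
 * Each uid_map line is "<ns-first> <host-first> <count>". */
static int uid_mapped_into(pid_t r, unsigned uid)
{
    char path[64];
    unsigned ns_first, host_first, count;
    FILE *f;

    snprintf(path, sizeof(path), "/proc/%d/uid_map", (int)r);
    if (!(f = fopen(path, "r")))
        return 0;
    while (fscanf(f, "%u %u %u", &ns_first, &host_first, &count) == 3)
        if (uid >= host_first && uid < host_first + count) {
            fclose(f);
            return 1;
        }
    fclose(f);
    return 0;
}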
Re: [lxc-devel] cgroup management daemon
Quoting Tim Hockin (thoc...@google.com):
> What are the requirements/goals around performance and concurrency?
> Do you expect this to be a single-threaded thing, or can we handle
> some number of concurrent operations? Do you expect to use threads or
> processes?

The cgmanager should be pretty dumb, so I would expect it to be quite
fast. I don't have any specific perf goals though. If you have
requirements I'm very interested to hear them. I should be able to
tell pretty soon how far short I fall.

By default I'd expect to run with a single thread, but I don't imagine
one thread can serve a busy 1024-cpu system very well. Unless you have
guidance right now, I think I'd like to get started with the basic
functionality and see how it measures up to your requirements. I
should add perf counters from the start so we can figure out where
bottlenecks (if any) are and how to handle them. Otherwise I could
start out with a basic numcpus/10 threadpool and have the main thread
do socket i/o and parcel access verification and vfs work out to the
threadpool, but I'd rather first know where the problems lie.

> Can you talk about logging - what and where?

When started under upstart, anything we print out goes to
/var/log/upstart/cgmanager.log. Would be nice to keep it that simple.
We could log requests by r to do something it is not allowed to do,
but it seems to me the failed attempts cause no harm, while the
potential for overflowing logs can. Did you have anything in mind? Did
you want logging to help detect certain conditions for system
optimization, or just for failure notices and security violations?

> How will we handle event_fd? Pass a file-descriptor back to the
> caller?

The only thing currently supporting eventfd is the memory threshold,
right? I haven't tested whether this will work or not, but ideally the
caller would open the eventfd fd and pass it, the cgroup name, the
controller file to be watched, and the args to cgmanager; cgmanager
confirms read access, opens the controller fd, makes the request over
cgroup.event_control, then passes the controller fd back to the caller
and closes its own copy.

I'm also not sure whether the cgroup interface is going to be offering
a new feature to replace eventfd, since it wants people to stop using
cgroupfs... Tejun?

That's all I can come up with for now.
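A minimal sketch of the memcg eventfd threshold registration Serge
describes (the cgroup-v1 cgroup.event_control protocol): write
"<event_fd> <control_fd> <threshold>" to cgroup.event_control, then
read() the eventfd to block until the threshold is crossed. The cgroup
path and threshold value are illustrative.

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/eventfd.h>
#include <unistd.h>

int main(void)
{
    const char *cg = "/sys/fs/cgroup/memory/mygroup"; /* assumed path */
    char buf[128];
    uint64_t hits;

    int efd = eventfd(0, 0);
    snprintf(buf, sizeof(buf), "%s/memory.usage_in_bytes", cg);
    int cfd = open(buf, O_RDONLY);
    snprintf(buf, sizeof(buf), "%s/cgroup.event_control", cg);
    int ctl = open(buf, O_WRONLY);
    if (efd < 0 || cfd < 0 || ctl < 0) {
        perror("open");
        return 1;
    }

    /* Arm a 100MB usage threshold on this group. */
    snprintf(buf, sizeof(buf), "%d %d %llu", efd, cfd, 100ULL << 20);
    if (write(ctl, buf, strlen(buf)) < 0) {
        perror("event_control");
        return 1;
    }

    /* In the cgmanager design, efd/cfd would now be passed back to the
     * caller over the socket via SCM_RIGHTS; here we just wait. */
    if (read(efd, &hits, sizeof(hits)) == sizeof(hits))
        printf("memory threshold crossed %llu time(s)\n",
               (unsigned long long)hits);
    return 0;
}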
Re: [lxc-devel] cgroup management daemon
On Tue, Nov 26, 2013 at 8:12 AM, Serge E. Hallyn <se...@hallyn.com> wrote:
> Quoting Tim Hockin (thoc...@google.com):
>> What are the requirements/goals around performance and concurrency?
>> Do you expect this to be a single-threaded thing, or can we handle
>> some number of concurrent operations? Do you expect to use threads
>> or processes?
>
> The cgmanager should be pretty dumb, so I would expect it to be quite
> fast. I don't have any specific perf goals though. If you have
> requirements I'm very interested to hear them. I should be able to
> tell pretty soon how far short I fall.
>
> By default I'd expect to run with a single thread, but I don't
> imagine one thread can serve a busy 1024-cpu system very well. [...]
> I'd rather first know where the problems lie.

From Rohit's talk at Linux plumbers:
http://www.linuxplumbersconf.net/2013/ocw//system/presentations/1239/original/lmctfy%20(1).pdf

The goal is O(1000) reads and O(100) writes per second.

>> Can you talk about logging - what and where?
>
> [...]
>
>> How will we handle event_fd? Pass a file-descriptor back to the
>> caller?
>
> The only thing currently supporting eventfd is the memory threshold,
> right? [...] I'm also not sure whether the cgroup interface is going
> to be offering a new feature to replace eventfd, since it wants
> people to stop using cgroupfs... Tejun?

From my discussions with Tejun, he wanted to move to using inotify so
it may still be an fd we pass around.

> That's all I can come up with for now.
Re: [lxc-devel] cgroup management daemon
Quoting Tim Hockin (thoc...@google.com):
> At the start of this discussion, some months ago, we offered to
> co-devel this with Lennart et al. They did not seem keen on the idea.
> If they have an established DBUS protocol spec,

See
http://www.freedesktop.org/wiki/Software/systemd/ControlGroupInterface/
and http://man7.org/linux/man-pages/man5/systemd.cgroup.5.html

> we should consider adopting it instead of a new one, but we CAN'T
> just play follow-the-leader and do whatever they do, changing
> whenever they feel like changing.

Right. And if we suspect that the APIs will always be at least subtly
different, then keeping them obviously visually different seems to
have some benefit, i.e.

    systemctl set-property httpd.service CPUShares=500 MemoryLimit=500M

vs

    dbus-send cgmanager set-value http.server cpushares:500 memorylimit:500M swaplimit:1G

rather than have admins try to remember "now why did that not work
here? oh yeah, MemoryLimit over here should be Memorylimit" or
whatever. Then again, if lmctfy is the layer which admins will use,
then it doesn't matter as much.

> It would be best if we could get a common DBUS api specc'ed and all
> agree to it. Serge, do you feel up to that?

Not sure what you mean - I'll certainly send the API to these lists as
the code is developed, and will accept all feedback that I get. My
only requirements are that the requirements I've listed in the
document be feasible, and be feasible back to, say, 3.2 kernels. That
is why we must send an scm-cred for the pid to move into a cgroup.
(With 3.12 we may have alternatives, accepting a vpid as a simple dbus
message and setns()ing into the requestor's pidns to echo the pid into
the cgroup tasks file.)

-serge
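A sketch of the 3.12-era alternative Serge mentions: the manager
setns()es into the requestor's pid namespace so that a vpid received
as a plain dbus integer names the right task, then writes it into the
tasks file. setns(CLONE_NEWPID) only affects children, hence the fork;
real code would also restore the manager's own pidns-for-children
afterwards. Paths and helper names are illustrative.

#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Move 'vpid' (a pid in requestor r's pidns) into the given tasks file. */
static int move_vpid(pid_t requestor, pid_t vpid, const char *tasks_path)
{
    char nspath[64], buf[32];
    int nsfd, status;
    pid_t child;

    snprintf(nspath, sizeof(nspath), "/proc/%d/ns/pid", (int)requestor);
    nsfd = open(nspath, O_RDONLY);
    if (nsfd < 0 || setns(nsfd, CLONE_NEWPID) < 0)
        return -1;
    close(nsfd);

    child = fork();           /* child is born inside r's pidns */
    if (child < 0)
        return -1;
    if (child == 0) {
        /* tasks_path is in the manager's private cgroupfs mount, which
         * is still visible since we only changed the pid namespace.
         * The kernel interprets the written pid in the writer's pidns,
         * so vpid now names the intended task. */
        int fd = open(tasks_path, O_WRONLY);
        if (fd < 0)
            _exit(1);
        snprintf(buf, sizeof(buf), "%d", (int)vpid);
        _exit(write(fd, buf, strlen(buf)) < 0 ? 1 : 0);
    }
    waitpid(child, &status, 0);
    return WIFEXITED(status) && WEXITSTATUS(status) == 0 ? 0 : -1;
}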
Re: [lxc-devel] cgroup management daemon
On Tue, Nov 26, 2013 at 8:41 AM, Serge E. Hallyn <se...@hallyn.com> wrote:
> Quoting Victor Marmol (vmar...@google.com):
>> On Tue, Nov 26, 2013 at 8:12 AM, Serge E. Hallyn <se...@hallyn.com> wrote:
>>> [...]
>>
>> From Rohit's talk at Linux plumbers:
>> http://www.linuxplumbersconf.net/2013/ocw//system/presentations/1239/original/lmctfy%20(1).pdf
>>
>> The goal is O(1000) reads and O(100) writes per second.
>
> Cool, thanks. I can try and get a sense next week of how far off the
> mark I am for reads.
>
>> [...]
>>
>> From my discussions with Tejun, he wanted to move to using inotify
>> so it may still be an fd we pass around.
>
> Hm, would that just be inotify on the memory.max_usage_in_bytes file,
> or inotify on a specific fd you've created which is associated with
> any threshold you specify? The former seems less ideal.

Tejun can comment more, but I think it is still TBD.

> -serge
Re: [lxc-devel] cgroup management daemon
On Mon, Nov 25, 2013 at 9:47 PM, Serge E. Hallyn <se...@hallyn.com> wrote:
> Quoting Tim Hockin (thoc...@google.com):
>> Thanks for this! I think it helps a lot to discuss now, rather than
>> over nearly-done code.
>>
>> On Mon, Nov 25, 2013 at 2:43 PM, Serge E. Hallyn <se...@hallyn.com> wrote:
>>> Additionally, Tejun has specified that we do not want users to be
>>> too closely tied to the cgroupfs implementation. [...] So I'm
>>> basing the API loosely on the lmctfy command line API.
>>
>> I'm torn here. While I agree in principle with Tejun, I am concerned
>> that this agent will always lag new kernel features or that the thin
>> abstraction you want to provide here does not easily accommodate
>> some of the more ... oddball features of one cgroup interface or
>> another. This agent is the very bottom of the stack, and should
>> probably not do much by way of abstraction. I think I'd rather let
>> something like lmctfy provide the abstraction more holistically, and
>> relegate this
>
> If lmctfy is an abstraction layer, that should keep Tejun happy, and
> it could keep me out of the resource naming game, which makes me
> happy :)
>
>> agent to very simple plumbing and policy. It could be as simple as
>> providing read/write/etc ops to specific control files. It needs to
>> handle event_fd, too, I guess. This has the nice side-effect of
>> always being current on kernel features :)
>>
>>> Summary
>>>
>>> Each 'host' (identified by a separate instance of the linux kernel)
>>> will have exactly one running daemon to manage control groups.
>>> [...]
>>>
>>> Outline:
>>>   . A single manager, cgmanager, is started on the host, very early
>>>     during boot. [...] It will open a socket at
>>>     /sys/fs/cgroup/cgmanager (in a small tmpfs).
>>
>> Where does the config come from? How do I specify which hierarchies
>> I want and where, and which flags?
>
> That'll have to be in a file in /etc (which can be mounted readonly).
> There should be no surprises there, so I've not thought about the
> format.
>
>>>   . A client (requestor 'r') can make cgroup requests over
>>>     /sys/fs/cgroup/manager using dbus calls. Detailed privilege
>>>     requirements for r are listed below.
>>>   . The client request will pertain to an existing or new cgroup A.
>>>     r's privilege over the cgroup must be checked. r is said to
>>>     have privilege over A if A is owned by r's uid, or if A's owner
>>>     is mapped into r's user namespace, and r is root in that user
>>>     namespace.
>>
>> Problem with this definition. Being owned-by is not the same as
>> has-root-in. Specifically, I may choose to give you root in your own
>> namespace, but you sure as heck can not increase your own memory
>> limit.
>
> 1. If you don't want me to change the value at all, then just don't
>    map A's owner into the namespace.
>    I'm uid 100000, which is root in my namespace, but I only have
>    privilege over other uids mapped into my namespace.

I think I understand this, but it is subtle. Maybe some examples would
help?

> 2. I've considered never allowing changes to your own cgroup. So if
>    you're in /a/b, you can create /a/b/c and modify c's settings, but
>    you can't modify b's. OTOH, that isn't strictly necessary - if we
>    did allow it, then you could simply clamp /a/b's memory to what
>    you want, and stick me in /a/b/c, so I can't escape the memory
>    limit you wanted.

This is different from what we do internally, but it's an interesting
semantic. I'm wary of how much we want to make this API about
enforcement of policy vs simple enactment. In other words, semantics
that diverge from UNIX ownership might be more complicated to
understand than they are worth.

> 3. I've not considered having the daemon track resource limits -
>    i.e. creating a cgroup and saying "give it 100M swap, and if it
>    asks, let it increase that to 200M". I'd prefer that be done
>    incidentally through (1) and (2). Do you feel that would be
>    insufficient?

I think this is a higher-level issue that should not be addressed
here.

> Or maybe your question is something different and I'm missing it?

My point was that I, as machine admin, create a memory cgroup of 100
MB for you and put you in it. I also
Re: [lxc-devel] cgroup management daemon
On Tue, Nov 26, 2013 at 8:12 AM, Serge E. Hallyn <se...@hallyn.com> wrote:
> Quoting Tim Hockin (thoc...@google.com):
>> What are the requirements/goals around performance and concurrency?
>> Do you expect this to be a single-threaded thing, or can we handle
>> some number of concurrent operations? Do you expect to use threads
>> or processes?
>
> The cgmanager should be pretty dumb, so I would expect it to be quite
> fast. I don't have any specific perf goals though. If you have
> requirements I'm very interested to hear them. I should be able to
> tell pretty soon how far short I fall.

If we're limiting this to write traffic only, I think our perf goals
are fairly relaxed. As long as you don't develop it in a way that
precludes threading or multi-processing, we can adapt later. I would
like to see at least a mention to this effect. We also need to beware
DoS (accidental or otherwise) - perhaps we should force round-robin
service of pending requests, or something.

> By default I'd expect to run with a single thread, but I don't
> imagine one thread can serve a busy 1024-cpu system very well. Unless
> you have guidance right now, I think I'd like to get started with the
> basic functionality and see how it measures up to your requirements.
> I should add perf counters from the start so we can figure out where
> bottlenecks (if any) are and how to handle them. Otherwise I could
> start out with a basic numcpus/10 threadpool and have the main thread
> do socket i/o and parcel access verification and vfs work out to the
> threadpool, but I'd rather first know where the problems lie.

Agree. Correct first, then fast :)

>> Can you talk about logging - what and where?
>
> When started under upstart, anything we print out goes to
> /var/log/upstart/cgmanager.log. Would be nice to keep it that simple.
> We could log requests by r to do something it is not allowed to do,
> but it seems to me the failed attempts cause no harm, while the
> potential for overflowing logs can.

I agree that we don't want to overflow logs.

> Did you have anything in mind? Did you want logging to help detect
> certain conditions for system optimization, or just for failure
> notices and security violations?

When something goes amiss, we have to try to figure out what happened
- how far did a request get? Logging every change is probably
important. Logging failures could be downsampled and rate-limited,
something like 1 failure log per second or something.

>> How will we handle event_fd? Pass a file-descriptor back to the
>> caller?
>
> The only thing currently supporting eventfd is the memory threshold,
> right? [...] I'm also not sure whether the cgroup interface is going
> to be offering a new feature to replace eventfd, since it wants
> people to stop using cgroupfs... Tejun?
>
> That's all I can come up with for now.
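The fd-passing flow sketched above maps onto the standard SCM_RIGHTS mechanism over a Unix socket. A minimal sketch of the client side follows; the helper name and the request string format are illustrative assumptions, not actual cgmanager code, and error handling is trimmed:

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Send a request string plus an eventfd over an already-connected
 * Unix socket; the kernel duplicates the fd into the receiver. */
int send_eventfd_request(int sock, int efd, const char *request)
{
    struct msghdr msg;
    struct iovec iov;
    char cbuf[CMSG_SPACE(sizeof(int))];
    struct cmsghdr *cmsg;

    memset(&msg, 0, sizeof(msg));
    memset(cbuf, 0, sizeof(cbuf));
    iov.iov_base = (void *)request;   /* e.g. cgroup name + controller file */
    iov.iov_len = strlen(request) + 1;
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = cbuf;
    msg.msg_controllen = sizeof(cbuf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;     /* pass a file descriptor */
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &efd, sizeof(int));

    return sendmsg(sock, &msg, 0) < 0 ? -1 : 0;
}

The manager would receive the fd with recvmsg(), open the controller file itself, write both fds to cgroup.event_control, then pass the controller fd back (again via SCM_RIGHTS) and close its own copies, as described above.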
Re: [lxc-devel] cgroup management daemon
On Tue, Nov 26, 2013 at 8:37 AM, Serge E. Hallyn se...@hallyn.com wrote: Quoting Tim Hockin (thoc...@google.com): At the start of this discussion, some months ago, we offered to co-devel this with Lennart et al. They did not seem keen on the idea. If they have an established DBUS protocol spec, see http://www.freedesktop.org/wiki/Software/systemd/ControlGroupInterface/ and http://man7.org/linux/man-pages/man5/systemd.cgroup.5.html we should consider adopting it instead of a new one, but we CAN'T just play follow-the-leader and do whatever they do, changing whenever they feel like changing. Right. And if we suspect that the APIs will always be at least subtly different, then keeping them obviously visually different seems to have some benefit (e.g. systemctl set-property httpd.service CPUShares=500 MemoryLimit=500M vs dbus-send cgmanager set-value http.server cpushares:500 memorylimit:500M swaplimit:1G), rather than have admins try to remember "now why did that not work here - oh yeah, MemoryLimit over here should be Memorylimit" or whatever. Then again, if lmctfy is the layer which admins will use, then it doesn't matter as much. It would be best if we could get a common DBUS api spec'd and all agree to it. Serge, do you feel up to that? Not sure what you mean - I'll certainly send the API to these lists as the code is developed, and will accept all feedback that I get. (What I meant was whether it is worth opening a discussion with the systemd folks on a common lowest-level DBUS interface. But it looks like their work is already a bit higher level, so it's probably moot.) My only requirements are that the requirements I've listed in the document be feasible, and be feasible back to, say, 3.2 kernels. That is why we must send an scm-cred for the pid to move into a cgroup. (With 3.12 we may have alternatives, accepting a vpid as a simple dbus message and setns()ing into the requestor's pidns to echo the pid into the cgroup's tasks file.) -serge
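The scm-cred mechanism referred to above is the standard SCM_CREDENTIALS ancillary message: the kernel validates the credentials and translates the pid into the receiver's pid namespace. A rough sketch of the sending side, with a hypothetical helper name (note the kernel requires privilege to send a pid other than your own, and the receiving socket must have SO_PASSCRED set):

#define _GNU_SOURCE        /* for struct ucred and SCM_CREDENTIALS */
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Send victim pid v's credentials over a Unix socket so that the
 * manager receives v's pid as seen in its own pid namespace. */
int send_victim_pid(int sock, pid_t victim_pid, uid_t uid, gid_t gid)
{
    struct msghdr msg;
    struct iovec iov;
    char cbuf[CMSG_SPACE(sizeof(struct ucred))];
    struct cmsghdr *cmsg;
    struct ucred cred = { .pid = victim_pid, .uid = uid, .gid = gid };
    char dummy = 'v';

    memset(&msg, 0, sizeof(msg));
    iov.iov_base = &dummy;            /* must send at least one data byte */
    iov.iov_len = 1;
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = cbuf;
    msg.msg_controllen = sizeof(cbuf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_CREDENTIALS; /* kernel checks and translates these */
    cmsg->cmsg_len = CMSG_LEN(sizeof(struct ucred));
    memcpy(CMSG_DATA(cmsg), &cred, sizeof(cred));

    return sendmsg(sock, &msg, 0) < 0 ? -1 : 0;
}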
Re: [lxc-devel] cgroup management daemon
Quoting Tim Hockin (thoc...@google.com): On Mon, Nov 25, 2013 at 9:47 PM, Serge E. Hallyn se...@hallyn.com wrote: Quoting Tim Hockin (thoc...@google.com): ... . A client (requestor 'r') can make cgroup requests over /sys/fs/cgroup/manager using dbus calls. Detailed privilege requirements for r are listed below. . The client request will pertain to an existing or new cgroup A. r's privilege over the cgroup must be checked. r is said to have privilege over A if A is owned by r's uid, or if A's owner is mapped into r's user namespace, and r is root in that user namespace. Problem with this definition. Being owned-by is not the same as has-root-in. Specifically, I may choose to give you root in your own namespace, but you sure as heck can not increase your own memory limit. 1. If you don't want me to change the value at all, then just don't map A's owner into the namespace. I'm uid 100000, which is root in my namespace, but I only have privilege over other uids mapped into my namespace. I think I understand this, but it is subtle. Maybe some examples would help? When you create a user namespace, at first it is empty, and you are 'nobody' (-1). Then magically some uids from the host, say 100000-101999, are mapped into your namespace, to uids 0-1999. Now assume you're uid 0 inside that namespace. You have privilege over your uids, 0-1999, which are 100000-101999 on the host. If cgroup file A is owned by host uid 0, then the owner is not mapped into the user namespace. uid 0 inside the namespace only gets the world access rights to that file. If cgroup file A is owned by host uid 100100, then uid 0 in the namespace has access to that file by virtue of being root, and uid 100 in the namespace (100100 on the host) has access to the file by virtue of being the owner. 2. I've considered never allowing changes to your own cgroup. So if you're in /a/b, you can create /a/b/c and modify c's settings, but you can't modify b's. OTOH, that isn't strictly necessary - if we did allow it, then you could simply clamp /a/b's memory to what you want, and stick me in /a/b/c, so I can't escape the memory limit you wanted. This is different from what we do internally, but it's an interesting semantic. I'm wary of how much we want to make this API about enforcement of policy vs simple enactment. In other words, semantics that diverge from UNIX ownership might be more complicated to understand than they are worth. The semantics I gave are exactly the user namespace semantics. If you're not using a user namespace then they simply do not apply, and you are back to the strict UNIX ownership semantics that you want. But allowing 'root' in a user namespace to have privilege over its uids, without having any privilege outside its own namespace, must be honored for this to be usable by lxc. Like I said, on the bright side, if you don't want to care about user namespaces, then everything falls back to strict unix semantics - so if you don't want to care, you don't have to care. 3. I've not considered having the daemon track resource limits - i.e. creating a cgroup and saying give it 100M swap, and if it asks, let it increase that to 200M. I'd prefer that be done incidentally through (1) and (2). Do you feel that would be insufficient? I think this is a higher-level issue that should not be addressed here. Or maybe your question is something different and I'm missing it? My point was that I, as machine admin, create a memory cgroup of 100 MB for you and put you in it. I also give you root-in-namespace. You must not be able to change 100 MB to 200 MB.
From your (1) you are saying that system UID 0 owns the cgroup and is NOT mapped into your namespace. Therefore your definition holds. I think I can buy that. . The client request may pertain to a victim task v, which may be moved to a new cgroup. In that case r's privilege over both the cgroup and v must be checked. r is said to have privilege over v if v is mapped in r's pid namespace, v's uid is mapped into r's user ns, and r is root in its userns. Or if r and v have the same uid and v is mapped in r's pid namespace. . r's credentials will be taken from the socket's peercred, ensuring that pid and uid are translated. . r passes PID(v) as a SCM_CREDENTIAL, so that cgmanager receives the translated global pid. It will then read UID(v) from /proc/PID(v)/status, which is the global uid, and check /proc/PID(r)/uid_map to see whether UID is mapped there. . dbus-send can be enhanced to send a pid as SCM_CREDENTIAL to have the kernel translate it for the reader. Only 'move task v to cgroup A' will require a SCM_CREDENTIAL to be sent. Privilege requirements by action: * Requestor of an action (r) over a socket may only make changes to cgroups over which it has privilege.
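The uid_map check described above could look roughly like the following sketch. The helper is hypothetical; a real daemon would want caching, hardening, and proper error reporting:

#include <stdio.h>
#include <sys/types.h>

/* Return 1 if host uid 'target' is mapped into requestor's user
 * namespace, by scanning /proc/PID(requestor)/uid_map; else 0.
 * Each uid_map line is "ns_start host_start count". */
int uid_mapped(pid_t requestor, uid_t target)
{
    char path[64];
    unsigned long ns_start, host_start, count;
    FILE *f;
    int found = 0;

    snprintf(path, sizeof(path), "/proc/%d/uid_map", (int)requestor);
    f = fopen(path, "r");
    if (!f)
        return 0;
    while (fscanf(f, "%lu %lu %lu", &ns_start, &host_start, &count) == 3) {
        if ((unsigned long)target >= host_start &&
            (unsigned long)target < host_start + count) {
            found = 1;
            break;
        }
    }
    fclose(f);
    return found;
}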
Re: [lxc-devel] cgroup management daemon
lmctfy literally supports .. as a container name :)
Re: [lxc-devel] cgroup management daemon
Quoting Tim Hockin (thoc...@google.com): lmctfy literally supports .. as a container name :) So is ../.. ever used, or does no one ever do anything beyond ..?
Re: [lxc-devel] cgroup management daemon
I think most of our use cases have only wanted to know about the parent, but I can see people wanting to go further. Would it be much different to support both? I feel like it'll be simpler to support all of them if we go that route. On Tue, Nov 26, 2013 at 1:28 PM, Serge E. Hallyn se...@hallyn.com wrote: Quoting Tim Hockin (thoc...@google.com): lmctfy literally supports .. as a container name :) So is ../.. ever used, or does no one ever do anything beyond ..?
Re: [lxc-devel] cgroup management daemon
I see three models: 1) Don't virtualize the cgroup path. This is what lmctfy does, though we have discussed changing to: 2) Virtualize to an administrative root - I get to tell you where your root is, and you can't see anything higher than that. 3) Virtualize to CWD root - you can never go up, just down. #1 seems easy, but exposes a lot. #3 is restrictive and fairly easy - could we live with that? #2 seems ideal, but it's not clear to me how to actually implement it. On Tue, Nov 26, 2013 at 1:31 PM, Victor Marmol vmar...@google.com wrote: I think most of our use cases have only wanted to know about the parent, but I can see people wanting to go further. Would it be much different to support both? I feel like it'll be simpler to support all of them if we go that route. On Tue, Nov 26, 2013 at 1:28 PM, Serge E. Hallyn se...@hallyn.com wrote: Quoting Tim Hockin (thoc...@google.com): lmctfy literally supports .. as a container name :) So is ../.. ever used, or does no one ever do anything beyond ..?
Re: [lxc-devel] cgroup management daemon
I was planning on doing #3, but since you guys need to access .., my plan is to have 'a/b' refer to $cwd/a/b while /a/b is the absolute path, and to allow read and eventfd, but not write, access to any parent dirs. Quoting Tim Hockin (thoc...@google.com): I see three models: 1) Don't virtualize the cgroup path. This is what lmctfy does, though we have discussed changing to: 2) Virtualize to an administrative root - I get to tell you where your root is, and you can't see anything higher than that. 3) Virtualize to CWD root - you can never go up, just down. #1 seems easy, but exposes a lot. #3 is restrictive and fairly easy - could we live with that? #2 seems ideal, but it's not clear to me how to actually implement it. On Tue, Nov 26, 2013 at 1:31 PM, Victor Marmol vmar...@google.com wrote: I think most of our use cases have only wanted to know about the parent, but I can see people wanting to go further. Would it be much different to support both? I feel like it'll be simpler to support all of them if we go that route. On Tue, Nov 26, 2013 at 1:28 PM, Serge E. Hallyn se...@hallyn.com wrote: Quoting Tim Hockin (thoc...@google.com): lmctfy literally supports .. as a container name :) So is ../.. ever used, or does no one ever do anything beyond ..?
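A sketch of how that rule ('a/b' relative to the requestor's cgroup, with .. allowed for read/eventfd but never for writes above your own cgroup) might be enforced. The helper and the negative-depth policy split are illustrative assumptions, not actual cgmanager code:

#include <stdio.h>
#include <string.h>

/* Return the requested path's depth relative to the requestor's own
 * cgroup; -1 means the path walks above it at some point, so it could
 * be allowed for read/eventfd but must be refused for writes. */
int relative_depth(const char *path)
{
    char buf[4096];
    char *tok, *save;
    int depth = 0;

    snprintf(buf, sizeof(buf), "%s", path);
    for (tok = strtok_r(buf, "/", &save); tok;
         tok = strtok_r(NULL, "/", &save)) {
        if (strcmp(tok, "..") == 0)
            depth--;
        else if (strcmp(tok, ".") != 0)
            depth++;
        if (depth < 0)
            return -1;   /* escaped above the requestor's own cgroup */
    }
    return depth;
}

For example, relative_depth("a/b") is 2 (writable candidate), while relative_depth("../..") is -1 (parent access only).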
Re: [lxc-devel] cgroup management daemon
On 11/26/2013 12:43 AM, Serge E. Hallyn wrote: Hi, as I've mentioned several times, I want to write a standalone cgroup management daemon. Basic requirements are that it be a standalone program; that a single instance running on the host be usable from containers nested at any depth; that it not allow escaping one's assigned limits; that it not allow subjugating tasks which do not belong to you; and that, within your limits, you be able to parcel those limits to your tasks as you like. Additionally, Tejun has specified that we do not want users to be too closely tied to the cgroupfs implementation. Therefore commands will be just a hair more general than specifying cgroupfs filenames and values. I may go so far as to avoid specifying specific controllers, as AFAIK there should be no redundancy in features. On the other hand, I don't want to get too general. So I'm basing the API loosely on the lmctfy command line API. One of the driving goals is to enable nested lxc as simply and safely as possible. If this project is a success, then a large chunk of code can be removed from lxc. I'm considering this project a part of the larger lxc project, but given how central it is to systems management, that doesn't mean that I'll consider anyone else's needs as less important than our own. This document consists of two parts. The first describes how I intend the daemon (cgmanager) to be structured and how it will enforce the safety requirements. The second describes the commands which clients will be able to send to the manager. The list of controller keys which can be set is very incomplete at this point, serving mainly to show the approach I was thinking of taking. Summary Each 'host' (identified by a separate instance of the linux kernel) will have exactly one running daemon to manage control groups. This daemon will answer cgroup management requests over a dbus socket, located at /sys/fs/cgroup/manager. This socket can be bind-mounted into various containers, so that one daemon can support the whole system. Programs will be able to make cgroup requests using dbus calls, or indirectly by linking against lmctfy, which will be modified to use the dbus calls if available. Outline: . A single manager, cgmanager, is started on the host, very early during boot. It has very few dependencies, and requires only /proc, /run, and /sys to be mounted, with /etc ro. It will mount the cgroup hierarchies in a private namespace and set defaults (clone_children, use_hierarchy, sane_behavior, release_agent?) It will open a socket at /sys/fs/cgroup/cgmanager (in a small tmpfs). . A client (requestor 'r') can make cgroup requests over /sys/fs/cgroup/manager using dbus calls. Detailed privilege requirements for r are listed below. . The client request will pertain to an existing or new cgroup A. r's privilege over the cgroup must be checked. r is said to have privilege over A if A is owned by r's uid, or if A's owner is mapped into r's user namespace, and r is root in that user namespace. . The client request may pertain to a victim task v, which may be moved to a new cgroup. In that case r's privilege over both the cgroup and v must be checked. r is said to have privilege over v if v is mapped in r's pid namespace, v's uid is mapped into r's user ns, and r is root in its userns. Or if r and v have the same uid and v is mapped in r's pid namespace. . r's credentials will be taken from the socket's peercred, ensuring that pid and uid are translated. . r passes PID(v) as a SCM_CREDENTIAL, so that cgmanager receives the translated global pid. It will then read UID(v) from /proc/PID(v)/status, which is the global uid, and check /proc/PID(r)/uid_map to see whether UID is mapped there. . dbus-send can be enhanced to send a pid as SCM_CREDENTIAL to have the kernel translate it for the reader. Only 'move task v to cgroup A' will require a SCM_CREDENTIAL to be sent. Privilege requirements by action: * Requestor of an action (r) over a socket may only make changes to cgroups over which it has privilege. * Requestors may be limited to a certain #/depth of cgroups (to limit memory usage) - DEFER? * Cgroup hierarchy is responsible for resource limits. * A requestor must either be uid 0 in its userns with the victim mapped into its userns, or the same uid and in the same/ancestor pidns as the victim. * If r requests creation of cgroup '/x', /x will be interpreted as relative to r's cgroup. r cannot make changes to cgroups not under its own current cgroup. * If r is not in the initial user_ns, then it may not change settings in its own cgroup, only descendants. (Not strictly necessary - we could require the use of extra cgroups when wanted, as lxc does.)
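The private-namespace mount setup in the outline corresponds roughly to the following sketch. The mount point under /run/cgmanager and the choice of the memory controller are illustrative assumptions, not the paths cgmanager necessarily uses:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/mount.h>

/* Early-boot setup: unshare a mount namespace and mount one cgroup
 * hierarchy in it, keeping it invisible to the rest of the system.
 * Assumes the target directory already exists. */
int setup_private_hierarchy(void)
{
    if (unshare(CLONE_NEWNS) < 0) {
        perror("unshare");
        return -1;
    }
    /* Make mounts private so they don't propagate back to the host. */
    if (mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL) < 0) {
        perror("mount private");
        return -1;
    }
    /* cgmanager would do this once per controller, then set defaults
     * such as clone_children on the root. */
    if (mount("cgroup", "/run/cgmanager/memory", "cgroup", 0, "memory") < 0) {
        perror("mount cgroup");
        return -1;
    }
    return 0;
}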
Re: [lxc-devel] cgroup management daemon
Serge... You have no idea how much I dread mentioning this (well, after LinuxPlumbers, maybe you can) but... You do realize that some of this is EXACTLY what the systemd crowd was talking about there in NOLA back then. I sat in those sessions grinding my teeth and listening to comments from some others around me about when systemd might subsume bash or even vi or quake. Somehow, you and others have tagged me as a systemd expert but I am far from it, and even you noted that Lennart and I were on the edge of a physical discussion when I made some off-the-cuff remarks there about systemd design during my talk. I personally rank systemd in the same category as NetworkMangler (err, NetworkManager) in its propensity for committing inexplicable random acts of terrorism and changing its behavior from release to release to release. I'm not a fan and I'm not an expert, but I have to be involved with it and watch the damned thing like a trapped rat, like it or not. Like it or not, we can not go off on divergent designs. As much as they have delusions of taking over the Linux world, they are still going to be a major factor, and this sort of thing needs to be coordinated. We are going to need exactly what you are proposing whether we have systemd in play or not. IF we CAN kick it to the curb, when we need to, we still need to know how to without tearing shit up and breaking shit that thinks it's there. Ideally, it shouldn't matter if systemd were in play or not. All I ask is that we not get so far off track that we have a major architectural divergence here. The risk is there. Mike
Re: [lxc-devel] cgroup management daemon
Haha, I was wondering how long it'd take before we got the first comment about systemd's own cgroup manager :) To try and keep this short, there are a lot of cases where systemd's plan of having an in-pid1 manager, as practical as it is for them, just isn't going to work for us. I believe our design makes things a bit cleaner by not having it tied to any specific init system or feature, and by having a relatively low-level, very simple API that people can use as a building block for anything that wants to manage cgroups. At this point in time, there's no hard limitation against having one or more processes writing to the cgroup hierarchy, as much as some people may want this to change. I very much doubt it'll happen any time soon, and until then, even if not perfectly adequate, there won't be any problem running both systemd's manager and our own. There's also the possibility, if someone felt sufficiently strongly about this to contribute patches, to have our manager talk to systemd's if present and go through their manager instead of accessing cgroupfs itself. That's assuming systemd offers a sufficiently low-level API that could be used for that without bringing an unreasonable amount of dependencies into our code. I don't want this thread to turn into some kind of flamewar or similarly overheated discussion about systemd vs everyone else, so I'll just state that from my point of view (and I suspect that of the group who worked on this early draft), systemd's manager, while perfect for grouping and resource allocation for systemd units and user sessions, doesn't quite fit our bill with regard to supporting multiple levels of full distro-agnostic containers using nesting and mixing user namespaces. It also has what, as a non-systemd person, I consider a big drawback: being built into an init system which quite a few major distributions don't use (specifically those distros that account for the majority of LXC's users). I think there's room for two implementations, and competition (even if we have slightly different goals) is a good thing and will undoubtedly help both projects consider use cases they didn't think of, leading to a better solution for everyone. And if some day one of the two wins or we can somehow converge into a solution that works for everyone, that'd be great. But our discussions at Linux Plumbers and other conferences have shown that this isn't going to happen now, so it's best to stop arguing and instead get some stuff done.
Re: [lxc-devel] cgroup management daemon
On Mon, 2013-11-25 at 21:43 -0500, Stéphane Graber wrote: ... I think there's room for two implementations, and competition (even if we have slightly different goals) is a good thing ... so it's best to stop arguing and instead get some stuff done. Concur. And, as you know, I'm not a fan or supporter of that camp. I just want to make sure everyone is aware of all the gorillas in the room before the fecal flakes hit the rapidly whirling blades. That being said, I think this is a laudable goal. If we do it right, it may well become the standard they have to adhere to. Regards, Mike
Re: [lxc-devel] cgroup management daemon
Quoting Tim Hockin (thoc...@google.com): Thanks for this! I think it helps a lot to discuss now, rather than over nearly-done code. On Mon, Nov 25, 2013 at 2:43 PM, Serge E. Hallyn se...@hallyn.com wrote: Additionally, Tejun has specified that we do not want users to be too closely tied to the cgroupfs implementation. Therefore commands will be just a hair more general than specifying cgroupfs filenames and values. I may go so far as to avoid specifying specific controllers, as AFAIK there should be no redundancy in features. On the other hand, I don't want to get too general. So I'm basing the API loosely on the lmctfy command line API. I'm torn here. While I agree in principle with Tejun, I am concerned that this agent will always lag new kernel features, or that the thin abstraction you want to provide here does not easily accommodate some of the more... oddball features of one cgroup interface or another. This agent is the very bottom of the stack, and should probably not do much by way of abstraction. I think I'd rather let something like lmctfy provide the abstraction more holistically, and relegate this agent to very simple plumbing and policy. (If lmctfy is an abstraction layer, that should keep Tejun happy, and it could keep me out of the resource naming game, which makes me happy :)) It could be as simple as providing read/write/etc ops to specific control files. It needs to handle event_fd, too, I guess. This has the nice side-effect of always being current on kernel features :) Summary Each 'host' (identified by a separate instance of the linux kernel) will have exactly one running daemon to manage control groups. This daemon will answer cgroup management requests over a dbus socket, located at /sys/fs/cgroup/manager. This socket can be bind-mounted into various containers, so that one daemon can support the whole system. Programs will be able to make cgroup requests using dbus calls, or indirectly by linking against lmctfy, which will be modified to use the dbus calls if available. Outline: . A single manager, cgmanager, is started on the host, very early during boot. It has very few dependencies, and requires only /proc, /run, and /sys to be mounted, with /etc ro. It will mount the cgroup hierarchies in a private namespace and set defaults (clone_children, use_hierarchy, sane_behavior, release_agent?) It will open a socket at /sys/fs/cgroup/cgmanager (in a small tmpfs). Where does the config come from? How do I specify which hierarchies I want and where, and which flags? That'll have to be in a file in /etc (which can be mounted read-only). There should be no surprises there, so I've not thought about the format. . A client (requestor 'r') can make cgroup requests over /sys/fs/cgroup/manager using dbus calls. Detailed privilege requirements for r are listed below. . The client request will pertain to an existing or new cgroup A. r's privilege over the cgroup must be checked. r is said to have privilege over A if A is owned by r's uid, or if A's owner is mapped into r's user namespace, and r is root in that user namespace. Problem with this definition. Being owned-by is not the same as has-root-in. Specifically, I may choose to give you root in your own namespace, but you sure as heck can not increase your own memory limit. 1. If you don't want me to change the value at all, then just don't map A's owner into the namespace. I'm uid 100000, which is root in my namespace, but I only have privilege over other uids mapped into my namespace. 2.
I've considered never allowing changes to your own cgroup. So if you're in /a/b, you can create /a/b/c and modify c's settings, but you can't modify b's. OTOH, that isn't strictly necessary - if we did allow it, then you could simply clamp /a/b's memory to what you want, and stick me in /a/b/c, so I can't escape the memory limit you wanted. 3. I've not considered having the daemon track resource limits - i.e. creating a cgroup and saying give it 100M swap, and if it asks, let it increase that to 200M. I'd prefer that be done incidentally through (1) and (2). Do you feel that would be insufficient? Or maybe your question is something different and I'm missing it? . The client request may pertain to a victim task v, which may be moved to a new cgroup. In that case r's privilege over both the cgroup and v must be checked. r is said to have privilege over v if v is mapped in r's pid namespace, v's uid is mapped into r's user ns, and r is root in its userns. Or if r and v have the same uid and v is mapped in r's pid namespace. . r's credentials will be taken from the socket's peercred, ensuring that pid and uid are translated. . r passes PID(v) as a SCM_CREDENTIAL, so that cgmanager receives the translated global pid. It will then read UID(v) from /proc/PID(v)/status, which is the global uid, and check /proc/PID(r)/uid_map to see whether UID is mapped there.
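The "very simple plumbing" Tim describes - read/write ops on specific control files - could be as small as the following sketch. Names and paths are illustrative only, not an actual cgmanager API; a real manager would resolve the cgroup under its own private mount and check privilege first:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Write 'value' to a control file under a cgroup directory,
 * e.g. mnt="/run/cgmanager/memory", cgroup="a/b",
 * file="memory.limit_in_bytes", value="100M". */
int set_value(const char *mnt, const char *cgroup,
              const char *file, const char *value)
{
    char path[4096];
    int fd, ret = 0;

    snprintf(path, sizeof(path), "%s/%s/%s", mnt, cgroup, file);
    fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;
    if (write(fd, value, strlen(value)) < 0)
        ret = -1;
    close(fd);
    return ret;
}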