Re: [v8 0/4] cgroup-aware OOM killer
In the example above:

    root
    /  \
   A    D
  / \
 B   C

Does oom_group allow me to express "compare A and D; if A is chosen compare B and C; kill the loser"? As I understand the proposal (from reading thread, not patch) it does not. On Mon, Oct 2, 2017 at 12:56 PM, Michal Hocko wrote: > On Mon 02-10-17 12:45:18, Shakeel Butt wrote: >> > I am sorry to cut the rest of your proposal because it simply goes over >> > the scope of the proposed solution while the usecase you are mentioning >> > is still possible. If we want to compare intermediate nodes (which seems >> > to be the case) then we can always provide a knob to opt-in - be it your >> > oom_gang or others. >> >> In the Roman's proposed solution we can already force the comparison >> of intermediate nodes using 'oom_group', I am just requesting to >> separate the killall semantics from it. > > oom_group _is_ about killall semantic. And comparing killable entities > is just a natural thing to do. So I am not sure what you mean > > -- > Michal Hocko > SUSE Labs
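For readers trying to follow the semantics, here is an illustrative sketch (Python writing cgroup control files) of what the knob in question would and would not express for the hierarchy above. The paths are hypothetical, and memory.oom_group is the name used by this patch series; the interface that eventually landed upstream is spelled memory.oom.group.

    # Illustrative sketch only: hypothetical paths, knob name from this series.
    from pathlib import Path

    CGROOT = Path("/sys/fs/cgroup")   # assumed cgroup v2 mount point

    def mark_oom_group(memcg: str) -> None:
        """If this memcg is chosen as the OOM victim, kill every task inside it."""
        (CGROOT / memcg / "memory.oom_group").write_text("1\n")

    # Marking A ties two things together: A is compared as a single unit
    # (against D), and if A is chosen, everything under A is killed. The
    # policy asked about above -- compare A vs D, then B vs C within A, and
    # kill only the loser -- is not expressible, because the comparison
    # behavior and the kill-all behavior come from the same knob.
    mark_oom_group("A")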
Re: [v8 0/4] cgroup-aware OOM killer
On Wed, Sep 27, 2017 at 9:23 AM, Roman Gushchin <g...@fb.com> wrote: > On Wed, Sep 27, 2017 at 08:35:50AM -0700, Tim Hockin wrote: >> On Wed, Sep 27, 2017 at 12:43 AM, Michal Hocko <mho...@kernel.org> wrote: >> > On Tue 26-09-17 20:37:37, Tim Hockin wrote: >> > [...] >> >> I feel like David has offered examples here, and many of us at Google >> >> have offered examples as long ago as 2013 (if I recall) of cases where >> >> the proposed heuristic is EXACTLY WRONG. >> > >> > I do not think we have discussed anything resembling the current >> > approach. And I would really appreciate some more examples where >> > decisions based on leaf nodes would be EXACTLY WRONG. >> > >> >> We need OOM behavior to kill in a deterministic order configured by >> >> policy. >> > >> > And nobody is objecting to this usecase. I think we can build a priority >> > policy on top of leaf-based decision as well. The main point we are >> > trying to sort out here is a reasonable semantic that would work for >> > most workloads. Sibling based selection will simply not work on those >> > that have to use deeper hierarchies for organizational purposes. I >> > haven't heard a counter argument for that example yet. >> > > Hi, Tim! > >> We have a priority-based, multi-user cluster. That cluster runs a >> variety of work, including critical things like search and gmail, as >> well as non-critical things like batch work. We try to offer our >> users an SLA around how often they will be killed by factors outside >> themselves, but we also want to get higher utilization. We know for a >> fact (data, lots of data) that most jobs have spare memory capacity, >> set aside for spikes or simply because accurate sizing is hard. We >> can sell "guaranteed" resources to critical jobs, with a high SLA. We >> can sell "best effort" resources to non-critical jobs with a low SLA. >> We achieve much better overall utilization this way. > > This is well understood. > >> >> I need to represent the priority of these tasks in a way that gives me >> a very strong promise that, in case of system OOM, the non-critical >> jobs will be chosen before the critical jobs. Regardless of size. >> Regardless of how many non-critical jobs have to die. I'd rather kill >> *all* of the non-critical jobs than a single critical job. Size of >> the process or cgroup is simply not a factor, and honestly given 2 >> options of equal priority I'd say age matters more than size. >> >> So concretely I have 2 first-level cgroups, one for "guaranteed" and >> one for "best effort" classes. I always want to kill from "best >> effort", even if that means killing 100 small cgroups, before touching >> "guaranteed". >> >> I apologize if this is not as thorough as the rest of the thread - I >> am somewhat out of touch with the guts of it all these days. I just >> feel compelled to indicate that, as a historical user (via Google >> systems) and current user (via Kubernetes), some of the assertions >> being made here do not ring true for our very real use cases. I >> desperately want cgroup-aware OOM handing, but it has to be >> policy-based or it is just not useful to us. > > A policy-based approach was suggested by Michal at a very beginning of > this discussion. Although nobody had any strong objections against it, > we've agreed that this is out of scope of this patchset. > > The idea of this patchset is to introduce an ability to select a memcg > as an OOM victim with the following optional killing of all belonging tasks. 
> I believe, it's absolutely mandatory for _any_ further development > of the OOM killer, which wants to deal with memory cgroups as OOM entities. > > If you think that it makes impossible to support some use cases in the future, > let's discuss it. Otherwise, I'd prefer to finish this part of the work, > and proceed to the following improvements on top of it. > > Thank you! I am 100% in favor of killing whole groups. We want that too. I just needed to express disagreement with statements that size-based decisions could not produce bad results. They can and do.
Re: [v8 0/4] cgroup-aware OOM killer
On Wed, Sep 27, 2017 at 12:43 AM, Michal Hocko <mho...@kernel.org> wrote: > On Tue 26-09-17 20:37:37, Tim Hockin wrote: > [...] >> I feel like David has offered examples here, and many of us at Google >> have offered examples as long ago as 2013 (if I recall) of cases where >> the proposed heuristic is EXACTLY WRONG. > > I do not think we have discussed anything resembling the current > approach. And I would really appreciate some more examples where > decisions based on leaf nodes would be EXACTLY WRONG. > >> We need OOM behavior to kill in a deterministic order configured by >> policy. > > And nobody is objecting to this usecase. I think we can build a priority > policy on top of leaf-based decision as well. The main point we are > trying to sort out here is a reasonable semantic that would work for > most workloads. Sibling based selection will simply not work on those > that have to use deeper hierarchies for organizational purposes. I > haven't heard a counter argument for that example yet. We have a priority-based, multi-user cluster. That cluster runs a variety of work, including critical things like search and gmail, as well as non-critical things like batch work. We try to offer our users an SLA around how often they will be killed by factors outside themselves, but we also want to get higher utilization. We know for a fact (data, lots of data) that most jobs have spare memory capacity, set aside for spikes or simply because accurate sizing is hard. We can sell "guaranteed" resources to critical jobs, with a high SLA. We can sell "best effort" resources to non-critical jobs with a low SLA. We achieve much better overall utilization this way. I need to represent the priority of these tasks in a way that gives me a very strong promise that, in case of system OOM, the non-critical jobs will be chosen before the critical jobs. Regardless of size. Regardless of how many non-critical jobs have to die. I'd rather kill *all* of the non-critical jobs than a single critical job. Size of the process or cgroup is simply not a factor, and honestly given 2 options of equal priority I'd say age matters more than size. So concretely I have 2 first-level cgroups, one for "guaranteed" and one for "best effort" classes. I always want to kill from "best effort", even if that means killing 100 small cgroups, before touching "guaranteed". I apologize if this is not as thorough as the rest of the thread - I am somewhat out of touch with the guts of it all these days. I just feel compelled to indicate that, as a historical user (via Google systems) and current user (via Kubernetes), some of the assertions being made here do not ring true for our very real use cases. I desperately want cgroup-aware OOM handling, but it has to be policy-based or it is just not useful to us. Thanks. Tim
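The two-class layout described above maps directly onto two top-level cgroups; what is missing is any kernel knob that says "always pick victims from this class first." The sketch below shows the layout in ordinary cgroup v2 terms, plus a purely hypothetical priority file standing in for the policy interface being requested; nothing like it exists in this patch series.

    # Sketch of the "guaranteed" vs "best effort" split. Creating the cgroups
    # is ordinary cgroup v2 usage; the memory.oom_priority file is hypothetical
    # and only illustrates the kind of policy knob being asked for.
    from pathlib import Path

    root = Path("/sys/fs/cgroup")
    for klass in ("guaranteed", "best-effort"):
        (root / klass).mkdir(exist_ok=True)

    # Hypothetical policy: on system OOM, the lower-priority class is always
    # drained first, regardless of size, so "best-effort" is exhausted before
    # "guaranteed" is ever touched.
    # (root / "guaranteed"  / "memory.oom_priority").write_text("1000\n")
    # (root / "best-effort" / "memory.oom_priority").write_text("0\n")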
Re: [v8 0/4] cgroup-aware OOM killer
I'm excited to see this being discussed again - it's been years since the last attempt. I've tried to stay out of the conversation, but I feel obligated to say something and then go back to lurking. On Tue, Sep 26, 2017 at 10:26 AM, Johannes Weiner wrote: > On Tue, Sep 26, 2017 at 03:30:40PM +0200, Michal Hocko wrote: >> On Tue 26-09-17 13:13:00, Roman Gushchin wrote: >> > On Tue, Sep 26, 2017 at 01:21:34PM +0200, Michal Hocko wrote: >> > > On Tue 26-09-17 11:59:25, Roman Gushchin wrote: >> > > > On Mon, Sep 25, 2017 at 10:25:21PM +0200, Michal Hocko wrote: >> > > > > On Mon 25-09-17 19:15:33, Roman Gushchin wrote: >> > > > > [...] >> > > > > > I'm not against this model, as I've said before. It feels logical, >> > > > > > and will work fine in most cases. >> > > > > > >> > > > > > In this case we can drop any mount/boot options, because it >> > > > > > preserves >> > > > > > the existing behavior in the default configuration. A big >> > > > > > advantage. >> > > > > >> > > > > I am not sure about this. We still need an opt-in, ragardless, >> > > > > because >> > > > > selecting the largest process from the largest memcg != selecting the >> > > > > largest task (just consider memcgs with many processes example). >> > > > >> > > > As I understand Johannes, he suggested to compare individual processes >> > > > with >> > > > group_oom mem cgroups. In other words, always select a killable entity >> > > > with >> > > > the biggest memory footprint. >> > > > >> > > > This is slightly different from my v8 approach, where I treat leaf >> > > > memcgs >> > > > as indivisible memory consumers independent on group_oom setting, so >> > > > by default I'm selecting the biggest task in the biggest memcg. >> > > >> > > My reading is that he is actually proposing the same thing I've been >> > > mentioning. Simply select the biggest killable entity (leaf memcg or >> > > group_oom hierarchy) and either kill the largest task in that entity >> > > (for !group_oom) or the whole memcg/hierarchy otherwise. >> > >> > He wrote the following: >> > "So I'm leaning toward the second model: compare all oomgroups and >> > standalone tasks in the system with each other, independent of the >> > failed hierarchical control structure. Then kill the biggest of them." >> >> I will let Johannes to comment but I believe this is just a >> misunderstanding. If we compared only the biggest task from each memcg >> then we are basically losing our fairness objective, aren't we? > > Sorry about the confusion. > > Yeah I was making the case for what Michal proposed, to kill the > biggest terminal consumer, which is either a task or an oomgroup. > > You'd basically iterate through all the tasks and cgroups in the > system and pick the biggest task that isn't in an oom group or the > biggest oom group and then kill that. > > Yeah, you'd have to compare the memory footprints of tasks with the > memory footprints of cgroups. These aren't defined identically, and > tasks don't get attributed every type of allocation that a cgroup > would. But it should get us in the ballpark, and I cannot picture a > scenario where this would lead to a completely undesirable outcome. That last sentence: > I cannot picture a scenario where this would lead to a completely undesirable > outcome. I feel like David has offered examples here, and many of us at Google have offered examples as long ago as 2013 (if I recall) of cases where the proposed heuristic is EXACTLY WRONG. We need OOM behavior to kill in a deterministic order configured by policy.
Sometimes, I would literally prefer to kill every other cgroup before killing "the big one". The policy is *all* that matters for shared clusters of varying users and priorities. We did this in Borg, and it works REALLY well. Has for years. Now that the world is adopting Kubernetes we need it again, only it's much harder to carry a kernel patch in this case.
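For readers skimming the thread, the selection rule Johannes describes in the quoted text can be written down in a few lines; this is a userspace pseudo-model, not the kernel patch, and the footprint numbers are assumed to be precomputed. It also makes the objection concrete: the only criterion in the final max() is size.

    # Sketch of the "biggest terminal consumer" rule quoted above: every
    # oom_group memcg and every task outside such a group is one killable
    # entity; kill the largest. Inputs are assumed precomputed footprints.
    def pick_victim(standalone_tasks, oom_group_memcgs):
        # standalone_tasks:  [("task:1234", bytes_used), ...]   tasks not under any oom_group
        # oom_group_memcgs:  [("memcg:/best-effort/job7", bytes_used), ...]
        candidates = [(size, name) for name, size in standalone_tasks]
        candidates += [(size, name) for name, size in oom_group_memcgs]
        if not candidates:
            return None
        size, victim = max(candidates)   # size is the only criterion
        return victim                    # a task dies alone; a memcg dies whole

Nothing in that comparison lets a small "guaranteed" job outrank a large "best effort" one, which is exactly the gap being argued about here.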
Re: [PATCH RFC 0/2] add nproc cgroup subsystem
On Feb 28, 2015 2:50 PM, "Tejun Heo" wrote: > > On Sat, Feb 28, 2015 at 02:26:58PM -0800, Tim Hockin wrote: > > Wow, so much anger. I'm not even sure how to respond, so I'll just > > say this and sign off. All I want is a better, friendlier, more > > useful system overall. We clearly have different ways of looking at > > the problem. > > Can you communicate anything w/o passive aggression? If you have a > technical point, just state that. Can you at least agree that we > shouldn't be making design decisions based on 16bit pid_t? Hmm, I have screwed this thread up, I think. I've made some remarks that did not come through with the proper tongue-in-cheek slant. I'm not being passive aggressive - we DO look at this problem differently. OF COURSE we should not make decisions based on ancient artifacts of history. My point was that there are secondary considerations here - PIDs are more than just the memory that backs them. They _ARE_ a constrained resource, and you shouldn't assume the constraint is just physical memory. It is a piece of policy that is outside the control of the kernel proper - we handed those keys to userspace along time ago. Given that, I believe and have believed that the solution should model the problem as the user perceives it - limiting PIDs - rather than attaching to a solution-by-proxy. Yes a solution here partially overlaps with kmemcg, but I don't think that is a significant problem. They are different policies governing behavior that may result in the same condition, but for very different reasons. I do not think that is particularly bad for overall comprehension, and I think the fact that this popped up yet again indicates the existence of some nugget of user experience that is worth paying consideration to. I appreciate your promised consideration through a slightly refocused lens. I will go back to my cave and do something I hope is more productive and less antagonistic. I did not mean to bring out so much vitriol. Tim -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH RFC 0/2] add nproc cgroup subsystem
On Sat, Feb 28, 2015 at 8:57 AM, Tejun Heo wrote: > > On Sat, Feb 28, 2015 at 08:48:12AM -0800, Tim Hockin wrote: > > I am sorry that real-user problems are not perceived as substantial. This > > was/is a real issue for us. Being in limbo for years on end might not be a > > technical point, but I do think it matters, and that was my point. > > It's a problem which is localized to you and caused by the specific > problems of your setup. This isn't a wide-spread problem at all and > the world doesn't revolve around you. If your setup is so messed up > as to require sticking to 16bit pids, handle that locally. If > something at larger scale eases that handling, you get lucky. If not, > it's *your* predicament to deal with. The rest of the world doesn't > exist to wipe your ass. Wow, so much anger. I'm not even sure how to respond, so I'll just say this and sign off. All I want is a better, friendlier, more useful system overall. We clearly have different ways of looking at the problem. No antagonism intended Tim
Re: [PATCH RFC 0/2] add nproc cgroup subsystem
On Fri, Feb 27, 2015 at 9:45 AM, Tejun Heo wrote: > On Fri, Feb 27, 2015 at 09:25:10AM -0800, Tim Hockin wrote: >> > In general, I'm pretty strongly against adding controllers for things >> > which aren't fundamental resources in the system. What's next? Open >> > files? Pipe buffer? Number of flocks? Number of session leaders or >> > program groups? >> >> Yes to some or all of those. We do exactly this internally and it has >> greatly added to the stability of our overall container management >> system. and while you have been telling everyone to wait for kmemcg, >> we have had an extra 3+ years of stability. > > Yeah, good job. I totally get why kernel part of memory consumption > needs protection. I'm not arguing against that at all. You keep shifting the focus to be about memory, but that's not what people are asking for. You're letting the desire for a perfect solution (which is years late) block good solutions that exist NOW. >> > If you want to prevent a certain class of jobs from exhausting a given >> > resource, protecting that resource is the obvious thing to do. >> >> I don't follow your argument - isn't this exactly what this patch set >> is doing - protecting resources? > > If you have proper protection over kernel memory consumption, this is > completely covered because memory is the fundamental resource here. > Controlling distribution of those fundamental resources is what > cgroups are primarily about. You say that's what cgroups are about, but it's not at all obvious that you are right. What users, admins, systems people want is building blocks that are usable and make sense. Limiting kernel memory is NOT the logical building block, here. It's not something people can reason about or quantify easily. If you need to implement the interfaces in terms of memory, go nuts, but making users think like that is just not right. >> > Wasn't it like a year ago? Yeah, it's taking longer than everybody >> > hoped but seriously kmemcg reclaimer just got merged and also did the >> > new memcg interface which will tie kmemcg and memcg together. >> >> By my email it was almost 2 years ago, and that was the second or >> third incarnation of this patch. > > Again, I agree this is taking a while. Memory people had to retool > the whole reclamation path to make this work, which is the pattern > being repeated across the different controllers - we're refactoring a > lot of infrastructure code so that resource control can integrate with > the regular operation of the kernel, which BTW is what we should have > been doing from the beginning. > > If your complaint is that this is taking too long, I hear you, and > there's a certain amount of validity in arguing that upstreaming a > temporary measure is the better trade-off, but the rationale for nproc > (or nfds, or virtual memory, whatever) has been pretty weak otherwise. At least 3 or 4 people have INDEPENDENTLY decided this is what is causing them pain, tried to fix it, and invested the time to send a patch - that says it is actually a thing. There exists a problem that you are disallowing to be fixed. Do you recognize that users are experiencing pain? Why do you hate your users? :) > And as for the different incarnations of this patchset. Reposting the > same stuff repeatedly doesn't really change anything. Why would it? Because reasonable people might survey the ecosystem and say "humm, things have changed over the years - isolation has become a pretty serious topic". Or maybe they hope that you'll finally agree that fixing the problem NOW is worthwhile, even if the solution is imperfect, and that a more perfect solution will arrive. >> >> Something like this is long overdue, IMO, and is still more >> appropriate and obvious than kmemcg anyway. >> > >> > Thanks for chiming in again but if you aren't bringing out anything >> > new to the table (I don't remember you doing that last time either), >> > I'm not sure why the decision would be different this time. >> >> I'm just vocalizing my support for this idea in defense of practical >> solutions that work NOW instead of "engineering ideals" that never >> actually arrive. >> >> As containers take the server world by storm, stuff like this gets >> more and more important. > > Again, protection of kernel side memory consumption is important. > There's no question about that. As for the never-arriving part, well, > it is arriving. If you still can't believe, just take a look at the code. Are you willing to put a drop-dead date on it? If we don't have kmemcg working well enough to _actually_ bound PID usage and FD usage by, say, June 1st, will you then accept a patch to this effect? If the answer is no, then I have zero faith that it's coming any time soon - I heard this 2 years ago. I believed you then. I see further downthread that you said you'll think about it. Thank you. Just because our use cases are not normal does not mean we're not valid :) Tim
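As a concrete footnote: the capability being argued for here is what the pids cgroup controller that was merged later provides, via pids.max and pids.current. A minimal usage sketch, assuming a cgroup v2 mount at /sys/fs/cgroup and a hypothetical cgroup name and pid:

    # Per-cgroup process-count limiting with the pids controller.
    from pathlib import Path

    root = Path("/sys/fs/cgroup")
    (root / "cgroup.subtree_control").write_text("+pids\n")  # enable it for children

    job = root / "batch-job"                 # hypothetical cgroup
    job.mkdir(exist_ok=True)
    (job / "pids.max").write_text("512\n")   # forks beyond 512 tasks fail with EAGAIN

    (job / "cgroup.procs").write_text("1234\n")  # move a (hypothetical) process in
    print((job / "pids.current").read_text().strip())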
Re: [PATCH RFC 0/2] add nproc cgroup subsystem
On Fri, Feb 27, 2015 at 9:06 AM, Tejun Heo wrote: > Hello, > > On Fri, Feb 27, 2015 at 11:42:10AM -0500, Austin S Hemmelgarn wrote: >> Kernel memory consumption isn't the only valid reason to want to limit the >> number of processes in a cgroup. Limiting the number of processes is very >> useful to ensure that a program is working correctly (for example, the NTP >> daemon should (usually) have an _exact_ number of children if it is >> functioning correctly, and rpcbind shouldn't (AFAIK) ever have _any_ >> children), to prevent PID number exhaustion, to head off DoS attacks against >> forking network servers before they get to the point of causing kmem >> exhaustion, and to limit the number of processes in a cgroup that uses lots >> of kernel memory very infrequently. > > All the use cases you're listing are extremely niche and can be > trivially achieved without introducing another cgroup controller. Not > only that, they're actually pretty silly. Let's say NTP daemon is > misbehaving (or its code changed w/o you knowing or there are corner > cases which trigger extremely infrequently). What do you exactly > achieve by rejecting its fork call? It's just adding another > variation to the misbehavior. It was misbehaving before and would now > be continuing to misbehave after a failed fork. > > In general, I'm pretty strongly against adding controllers for things > which aren't fundamental resources in the system. What's next? Open > files? Pipe buffer? Number of flocks? Number of session leaders or > program groups? Yes to some or all of those. We do exactly this internally and it has greatly added to the stability of our overall container management system. and while you have been telling everyone to wait for kmemcg, we have had an extra 3+ years of stability. > If you want to prevent a certain class of jobs from exhausting a given > resource, protecting that resource is the obvious thing to do. I don't follow your argument - isn't this exactly what this patch set is doing - protecting resources? > Wasn't it like a year ago? Yeah, it's taking longer than everybody > hoped but seriously kmemcg reclaimer just got merged and also did the > new memcg interface which will tie kmemcg and memcg together. By my email it was almost 2 years ago, and that was the second or third incarnation of this patch. >> Something like this is long overdue, IMO, and is still more >> appropriate and obvious than kmemcg anyway. > > Thanks for chiming in again but if you aren't bringing out anything > new to the table (I don't remember you doing that last time either), > I'm not sure why the decision would be different this time. I'm just vocalizing my support for this idea in defense of practical solutions that work NOW instead of "engineering ideals" that never actually arrive. As containers take the server world by storm, stuff like this gets more and more important. Tim -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH RFC 0/2] add nproc cgroup subsystem
On Fri, Feb 27, 2015 at 8:42 AM, Austin S Hemmelgarn wrote: > On 2015-02-27 06:49, Tejun Heo wrote: >> >> Hello, >> >> On Mon, Feb 23, 2015 at 02:08:09PM +1100, Aleksa Sarai wrote: >>> >>> The current state of resource limitation for the number of open >>> processes (as well as the number of open file descriptors) requires you >>> to use setrlimit(2), which means that you are limited to resource >>> limiting process trees rather than resource limiting cgroups (which is >>> the point of cgroups). >>> >>> There was a patch to implement this in 2011[1], but that was rejected >>> because it implemented a general-purpose rlimit subsystem -- which meant >>> that you couldn't control distinct resource limits in different >>> heirarchies. This patch implements a resource controller *specifically* >>> for the number of processes in a cgroup, overcoming this issue. >>> >>> There has been a similar attempt to implement a resource controller for >>> the number of open file descriptors[2], which has not been merged >>> becasue the reasons were dubious. Merely from a "sane interface" >>> perspective, it should be possible to utilise cgroups to do such >>> rudimentary resource management (which currently only exists for process >>> trees). >> >> >> This isn't a proper resource to control. kmemcg just grew proper >> reclaim support and will be useable to control kernel side of memory >> consumption. I was told that the plan was to use kmemcg - but I was told that YEARS AGO. In the mean time we all either do our own thing or we do nothing and suffer. Something like this is long overdue, IMO, and is still more appropriate and obvious than kmemcg anyway. >> Thanks. >> > Kernel memory consumption isn't the only valid reason to want to limit the > number of processes in a cgroup. Limiting the number of processes is very > useful to ensure that a program is working correctly (for example, the NTP > daemon should (usually) have an _exact_ number of children if it is > functioning correctly, and rpcbind shouldn't (AFAIK) ever have _any_ > children), to prevent PID number exhaustion, to head off DoS attacks against > forking network servers before they get to the point of causing kmem > exhaustion, and to limit the number of processes in a cgroup that uses lots > of kernel memory very infrequently. > > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Cache Allocation Technology Design
On Thu, Oct 30, 2014 at 10:12 AM, Tejun Heo wrote: > On Thu, Oct 30, 2014 at 07:58:34AM -0700, Tim Hockin wrote: >> Another reason unified hierarchy is a bad model. > > Things wrong with this message. > > 1. Top posted. It isn't clear which part you're referring to and this > was pointed out to you multiple times in the past. I occasionally fall victim to gmail's defaults. I apologize for that. > 2. No real thoughts or technical details. Maybe you had some in your > head but nothing was elaborated. This forces me to guess what you > had on mind when you produced the above sentence and of course me > not being you this takes a considerable amount of brain cycle and > I'd still end up with multiple alternative scenarios that I'll have > to cover. I think the conversation is well enough understood by the people for whom this bit of snark was intended that reading my mind was not that hard. That said, it was overly snark-tastic, and sent in haste. My point, of course, was that here is an example of something which maps very well to the idea of cgroups (a set of processes that share some controller) but DOES NOT map well to the unified hierarchy model. It must be managed more carefully than arbitrary hierarchy can enforce. The result is the mish-mash of workarounds proposed in this thread to force it into arbitrary hierarchy mode, including this no-win situation of running out of hardware resources - it is going to fail. Will it fail at cgroup creation time (doesn't scale to arbitrary hierarchy) or will it fail when you add processes to it (awkward at best) or will it fail when you flip some control file to enable the feature? I know the unified hierarchy ship has sailed, so there's no non-snarky way to argue the point any further, but this is such an obvious case, to me, that I had to say something. Tim
Re: [PATCH RFC 0/5] Virtual Memory Resource Controller for cgroups
How is this different from RLIMIT_AS? You specifically mentioned it earlier but you don't explain how this is different. From my perspective, this is pointless. There's plenty of perfectly correct software that mmaps files without concern for VSIZE, because they never fault most of those pages in. From my observations it is not generally possible to predict an average VSIZE limit that would satisfy your concerns *and* not kill lots of valid apps. It sounds like what you want is to limit or even disable swap usage. Given your example, your hypothetical user would probably be better off getting an OOM kill early so she can fix her job spec to request more memory. On Wed, Jul 9, 2014 at 12:52 AM, Vladimir Davydov wrote: > On Thu, Jul 03, 2014 at 04:48:16PM +0400, Vladimir Davydov wrote: >> Hi, >> >> Typically, when a process calls mmap, it isn't given all the memory pages it >> requested immediately. Instead, only its address space is grown, while the >> memory pages will be actually allocated on the first use. If the system fails >> to allocate a page, it will have no choice except invoking the OOM killer, >> which may kill this or any other process. Obviously, it isn't the best way of >> telling the user that the system is unable to handle his request. It would be >> much better to fail mmap with ENOMEM instead. >> >> That's why Linux has the memory overcommit control feature, which accounts >> and >> limits VM size that may contribute to mem+swap, i.e. private writable >> mappings >> and shared memory areas. However, currently it's only available system-wide, >> and there's no way of avoiding OOM in cgroups. >> >> This patch set is an attempt to fill the gap. It implements the resource >> controller for cgroups that accounts and limits address space allocations >> that >> may contribute to mem+swap. >> >> The interface is similar to the one of the memory cgroup except it controls >> virtual memory usage, not actual memory allocation: >> >> vm.usage_in_bytes current vm usage of processes inside cgroup >> (read-only) >> >> vm.max_usage_in_bytes max vm.usage_in_bytes, can be reset by >> writing 0 >> >> vm.limit_in_bytes vm.usage_in_bytes must be <= >> vm.limite_in_bytes; >> allocations that hit the limit will be failed >> with ENOMEM >> >> vm.failcnt number of times the limit was hit, can be >> reset >> by writing 0 >> >> In future, the controller can be easily extended to account for locked pages >> and shmem. > > Any thoughts on this? > > Thanks.
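To make the comparison concrete, this is the existing per-process mechanism being pointed at: RLIMIT_AS caps one process's virtual address space, so allocations past the limit fail with ENOMEM rather than waking the OOM killer, whereas the proposed vm.* files would cap the summed address space of an entire cgroup. The limit value below is an arbitrary example.

    # Per-process address-space cap via RLIMIT_AS; the proposed per-cgroup
    # vm.* interface differs in scope, not in mechanism.
    import resource

    ONE_GIB = 1 << 30
    resource.setrlimit(resource.RLIMIT_AS, (ONE_GIB, ONE_GIB))  # (soft, hard)

    soft, hard = resource.getrlimit(resource.RLIMIT_AS)
    print("address-space cap:", soft, hard)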
Re: [PATCH 2/2] net: Implement SO_PEERCGROUP
I don't buy that it is not practical. Not convenient, maybe. Not clean, sure. But it is practical - it uses mechanisms that exist on all kernels today. That is a win, to me. On Thu, Mar 13, 2014 at 10:58 AM, Simo Sorce wrote: > On Thu, 2014-03-13 at 10:55 -0700, Andy Lutomirski wrote: >> >> So give each container its own unix socket. Problem solved, no? > > Not really practical if you have hundreds of containers. > > Simo. > > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
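The "mechanisms that exist on all kernels today" are per-container listening sockets combined with ordinary peer credentials; SO_PEERCRED already tells the server who connected. A sketch of that existing lookup on an accepted AF_UNIX connection (illustrative, not from the SO_PEERCGROUP patch):

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/socket.h>

/* Print the pid/uid/gid of the peer on an accepted AF_UNIX connection.
 * These credentials are captured at connect() time by the kernel. */
static int log_peer_creds(int connfd)
{
        struct ucred cred;
        socklen_t len = sizeof(cred);

        if (getsockopt(connfd, SOL_SOCKET, SO_PEERCRED, &cred, &len) < 0)
                return -1;

        printf("peer pid=%ld uid=%u gid=%u\n",
               (long)cred.pid, (unsigned)cred.uid, (unsigned)cred.gid);
        return 0;
}

With one socket per container, the socket a request arrives on identifies the container and SO_PEERCRED identifies the process - which is the point being made above.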
Re: [PATCH 2/2] net: Implement SO_PEERCGROUP
In some sense a cgroup is a pgrp that mere mortals can't escape. Why not just do something like that? root can set this "container id" or "job id" on your process when it first starts (e.g. docker sets it on your container process) or even make a cgroup that sets this for all processes in that cgroup. ints are better than strings anyway. On Thu, Mar 13, 2014 at 10:25 AM, Andy Lutomirski wrote: > On Thu, Mar 13, 2014 at 9:33 AM, Simo Sorce wrote: >> On Thu, 2014-03-13 at 11:00 -0400, Vivek Goyal wrote: >>> On Thu, Mar 13, 2014 at 10:55:34AM -0400, Simo Sorce wrote: >>> >>> [..] >>> > > > This might not be quite as awful as I thought. At least you're >>> > > > looking up the cgroup at connection time instead of at send time. >>> > > > >>> > > > OTOH, this is still racy -- the socket could easily outlive the cgroup >>> > > > that created it. >>> > > >>> > > That's a good point. What guarantees that previous cgroup was not >>> > > reassigned to a different container. >>> > > >>> > > What if a process A opens the connection with sssd. Process A passes the >>> > > file descriptor to a different process B in a differnt container. >>> > >>> > Stop right here. >>> > If the process passes the fd it is not my problem anymore. >>> > The process can as well just 'proxy' all the information to another >>> > process. >>> > >>> > We just care to properly identify the 'original' container, we are not >>> > in the business of detecting malicious behavior. That's something other >>> > mechanism need to protect against (SELinux or other LSMs, normal >>> > permissions, capabilities, etc...). >>> > >>> > > Process A exits. Container gets removed from system and new one gets >>> > > launched which uses same cgroup as old one. Now process B sends a new >>> > > request and SSSD will serve it based on policy of newly launched >>> > > container. >>> > > >>> > > This sounds very similar to pid race where socket/connection will >>> > > outlive >>> > > the pid. >>> > >>> > Nope, completely different. >>> > >>> >>> I think you missed my point. Passing file descriptor is not the problem. >>> Problem is reuse of same cgroup name for a different container while >>> socket lives on. And it is same race as reuse of a pid for a different >>> process. >> >> The cgroup name should not be reused of course, if userspace does that, >> it is userspace's issue. cgroup names are not a constrained namespace >> like pids which force the kernel to reuse them for processes of a >> different nature. >> > > You're proposing a feature that will enshrine cgroups into the API use > by non-cgroup-controlling applications. I don't think that anyone > thinks that cgroups are pretty, so this is an unfortunate thing to > have to do. > > I've suggested three different ways that your goal could be achieved > without using cgroups at all. You haven't really addressed any of > them. > > In order for something like this to go into the kernel, I would expect > a real use case and a justification for why this is the right way to > do it. > > "Docker containers can be identified by cgroup path" is completely > unconvincing to me. 
> > --Andy > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
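For context, the pure-userspace alternative debated in this subthread is to map the peer's pid (obtained via SO_PEERCRED) to its cgroup path through /proc/<pid>/cgroup. A sketch of that lookup follows; it is exactly the step that races with pid and cgroup-name reuse once the peer exits, which is the concern raised above:

#include <stdio.h>
#include <sys/types.h>

/* Print the cgroup membership of a peer process.  Each line of the proc
 * file has the form "hierarchy-id:controllers:/cgroup/path". */
static int print_peer_cgroups(pid_t pid)
{
        char path[64], line[4096];
        FILE *f;

        snprintf(path, sizeof(path), "/proc/%ld/cgroup", (long)pid);
        f = fopen(path, "r");
        if (!f)
                return -1;      /* the peer may already have exited */

        while (fgets(line, sizeof(line), f))
                fputs(line, stdout);

        fclose(f);
        return 0;
}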
Re: [patch 7/8] mm, memcg: allow processes handling oom notifications to access reserves
On Thu, Dec 12, 2013 at 11:23 AM, Tejun Heo wrote: > Hello, Tim. > > On Thu, Dec 12, 2013 at 10:42:20AM -0800, Tim Hockin wrote: >> Yeah sorry. Replying from my phone is awkward at best. I know better :) > > Heh, sorry about being bitchy. :) > >> In my mind, the ONLY point of pulling system-OOM handling into >> userspace is to make it easier for crazy people (Google) to implement >> bizarre system-OOM policies. Example: > > I think that's one of the places where we largely disagree. If at all Just to be clear - I say this because it doesn't feel right to impose my craziness on others, and it sucks when we try and are met with "you're crazy, go away". And you have to admit that happens to Google. :) Punching an escape valve that allows us to be crazy without hurting anyone else sounds ideal, IF and ONLY IF that escape valve is itself maintainable. If the escape valve is userspace it's REALLY easy to iterate on our craziness. If it is kernel space, it's somewhat less easy, but not impossible. > possible, I'd much prefer google's workload to be supported inside the > general boundaries of the upstream kernel without having to punch a > large hole in it. To me, the general development history of memcg in > general and this thread in particular seem to epitomize why it is a > bad idea to have isolated, large and deep "crazy" use cases. Punching > the initial hole is the easy part; however, we all are quite limited > in anticpating future needs and sooner or later that crazy use case is > bound to evolve further towards the isolated extreme it departed > towards and require more and larger holes and further contortions to > accomodate such progress. > > The concern I have with the suggested solution is not necessarily that > it's technically complex than it looks like on the surface - I'm sure > it can be made to work one way or the other - but that it's a fairly > large step toward an isolated extreme which memcg as a project > probably should not head toward. > > There sure are cases where such exceptions can't be avoided and are > good trade-offs but, here, we're talking about a major architectural > decision which not only affects memcg but mm in general. I'm afraid > this doesn't sound like an no-brainer flexibility we can afford. > >> When we have a system OOM we want to do a walk of the administrative >> memcg tree (which is only a couple levels deep, users can make >> non-admin sub-memcgs), selecting the lowest priority entity at each >> step (where both tasks and memcgs have a priority and the priority >> range is much wider than the current OOM scores, and where memcg >> priority is sometimes a function of memcg usage), until we reach a >> leaf. >> >> Once we reach a leaf, I want to log some info about the memcg doing >> the allocation, the memcg being terminated, and maybe some other bits >> about the system (depending on the priority of the selected victim, >> this may or may not be an "acceptable" situation). Then I want to >> kill *everything* under that memcg. Then I want to "publish" some >> information through a sane API (e.g. not dmesg scraping). >> >> This is basically our policy as we understand it today. This is >> notably different than it was a year ago, and it will probably evolve >> further in the next year. > > I think per-memcg score and killing is something which makes > fundamental sense. 
In fact, killing a single process has never made > much sense to me as that is a unit which ultimately is only meaningful > to the kernel itself and not necessraily to userland, so no matter > what I think we're gonna gain per-memcg behavior and it seems most, > albeit not all, of what you described above should be implementable > through that. Well that's an awesome start. We have or had patches to do a lot of this. I don't know how well scrubbed they are for pushing or whether they apply at all to current head, though. > Ultimately, if the use case calls for very fine level of control, I > think the right thing to do is making nesting work properly which is > likely to take some time. In the meantime, even if such use case > requires modifying the kernel to tailor the OOM behavior, I think > sticking to kernel OOM provides a lot easier way to eventual > convergence. Userland system OOM basically means giving up and would > lessen the motivation towards improving the shared infrastructures > while adding significant pressure towards schizophreic diversion. > >> We have a long tail of kernel memory usage. If we provision machines >> so that the "do work here" first-level memcg excludes the average >> kernel usage, we have a huge number
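A rough sketch of the walk described above, as a userspace OOM handler might implement it. The tree and priority fields are hypothetical bookkeeping kept by the handler itself, not kernel interfaces, and the kill/logging steps are left out:

/* Administrative memcg tree as tracked by the handler; the kernel does
 * not expose these priorities, they are the handler's own metadata. */
struct memcg_node {
        const char *path;               /* cgroupfs path of this memcg  */
        int priority;                   /* lower value == killed sooner */
        struct memcg_node **children;
        int nr_children;
};

/* Descend the tree, picking the lowest-priority entity at each level,
 * until a leaf is reached; everything under that leaf is then killed. */
static struct memcg_node *pick_victim(struct memcg_node *node)
{
        while (node->nr_children > 0) {
                struct memcg_node *lowest = node->children[0];
                int i;

                for (i = 1; i < node->nr_children; i++)
                        if (node->children[i]->priority < lowest->priority)
                                lowest = node->children[i];
                node = lowest;
        }
        return node;
}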
Re: [patch 7/8] mm, memcg: allow processes handling oom notifications to access reserves
On Thu, Dec 12, 2013 at 6:21 AM, Tejun Heo wrote: > Hey, Tim. > > Sidenote: Please don't top-post with the whole body quoted below > unless you're adding new cc's. Please selectively quote the original > message's body to remind the readers of the context and reply below > it. It's a basic lkml etiquette and one with good reasons. If you > have to top-post for whatever reason - say you're typing from a > machine which doesn't allow easy editing of the original message, > explain so at the top of the message, or better yet, wait till you can > unless it's urgent. Yeah sorry. Replying from my phone is awkward at best. I know better :) > On Wed, Dec 11, 2013 at 09:37:46PM -0800, Tim Hockin wrote: >> The immediate problem I see with setting aside reserves "off the top" >> is that we don't really know a priori how much memory the kernel >> itself is going to use, which could still land us in an overcommitted >> state. >> >> In other words, if I have your 128 MB machine, and I set aside 8 MB >> for OOM handling, and give 120 MB for jobs, I have not accounted for >> the kernel. So I set aside 8 MB for OOM and 100 MB for jobs, leaving >> 20 MB for jobs. That should be enough right? Hell if I know, and >> nothing ensures that. > > Yes, sure thing, that's the reason why I mentioned "with some slack" > in the original message and also that it might not be completely the > same. It doesn't allow you to aggressively use system level OOM > handling as the sizing estimator for the root cgroup; however, it's > more of an implementation details than something which should guide > the overall architecture - it's a problem which lessens in severity as > [k]memcg improves and its coverage becomes more complete, which is the > direction we should be headed no matter what. In my mind, the ONLY point of pulling system-OOM handling into userspace is to make it easier for crazy people (Google) to implement bizarre system-OOM policies. Example: When we have a system OOM we want to do a walk of the administrative memcg tree (which is only a couple levels deep, users can make non-admin sub-memcgs), selecting the lowest priority entity at each step (where both tasks and memcgs have a priority and the priority range is much wider than the current OOM scores, and where memcg priority is sometimes a function of memcg usage), until we reach a leaf. Once we reach a leaf, I want to log some info about the memcg doing the allocation, the memcg being terminated, and maybe some other bits about the system (depending on the priority of the selected victim, this may or may not be an "acceptable" situation). Then I want to kill *everything* under that memcg. Then I want to "publish" some information through a sane API (e.g. not dmesg scraping). This is basically our policy as we understand it today. This is notably different than it was a year ago, and it will probably evolve further in the next year. Teaching the kernel all of this stuff has proven to be sort of difficult to maintain and forward-port, and has been very slow to evolve because of how painful it is to test and deploy new kernels. Maybe we can find a way to push this level of policy down to the kernel OOM killer? When this was mentioned internally I got shot down (gently, but shot down none the less). Assuming we had nearly-reliable (it doesn't have to be 100% guaranteed to be useful) OOM-in-userspace, I can keep the adminstrative memcg metadata in memory, implement killing as cruelly as I need, and do all of the logging and publication after the OOM kill is done. 
Most importantly I can test and deploy new policy changes pretty trivially. Handling per-memcg OOM is a different discussion. Here is where we want to be able to extract things like heap profiles or take stats snapshots, grow memcgs (if so configured) etc. Allowing our users to have a moment of mercy before we put a bullet in their brain enables a whole new realm of debugging, as well as a lot of valuable features. > It'd depend on the workload but with memcg fully configured it > shouldn't fluctuate wildly. If it does, we need to hunt down whatever > is causing such fluctuatation and include it in kmemcg, right? That > way, memcg as a whole improves for all use cases not just your niche > one and I strongly believe that aligning as many use cases as possible > along the same axis, rather than creating a large hole to stow away > the exceptions, is vastly more beneficial to *everyone* in the long > term. We have a long tail of kernel memory usage. If we provision machines so that the "do work here" first-level memcg excludes the average kernel usage, we have a huge number of machines that will fail to apply OOM policy because of actual overcommitment. If we provision for 95th or 99th perce
Re: [patch 7/8] mm, memcg: allow processes handling oom notifications to access reserves
On Thu, Dec 12, 2013 at 6:21 AM, Tejun Heo t...@kernel.org wrote: Hey, Tim. Sidenote: Please don't top-post with the whole body quoted below unless you're adding new cc's. Please selectively quote the original message's body to remind the readers of the context and reply below it. It's a basic lkml etiquette and one with good reasons. If you have to top-post for whatever reason - say you're typing from a machine which doesn't allow easy editing of the original message, explain so at the top of the message, or better yet, wait till you can unless it's urgent. Yeah sorry. Replying from my phone is awkward at best. I know better :) On Wed, Dec 11, 2013 at 09:37:46PM -0800, Tim Hockin wrote: The immediate problem I see with setting aside reserves off the top is that we don't really know a priori how much memory the kernel itself is going to use, which could still land us in an overcommitted state. In other words, if I have your 128 MB machine, and I set aside 8 MB for OOM handling, and give 120 MB for jobs, I have not accounted for the kernel. So I set aside 8 MB for OOM and 100 MB for jobs, leaving 20 MB for jobs. That should be enough right? Hell if I know, and nothing ensures that. Yes, sure thing, that's the reason why I mentioned with some slack in the original message and also that it might not be completely the same. It doesn't allow you to aggressively use system level OOM handling as the sizing estimator for the root cgroup; however, it's more of an implementation details than something which should guide the overall architecture - it's a problem which lessens in severity as [k]memcg improves and its coverage becomes more complete, which is the direction we should be headed no matter what. In my mind, the ONLY point of pulling system-OOM handling into userspace is to make it easier for crazy people (Google) to implement bizarre system-OOM policies. Example: When we have a system OOM we want to do a walk of the administrative memcg tree (which is only a couple levels deep, users can make non-admin sub-memcgs), selecting the lowest priority entity at each step (where both tasks and memcgs have a priority and the priority range is much wider than the current OOM scores, and where memcg priority is sometimes a function of memcg usage), until we reach a leaf. Once we reach a leaf, I want to log some info about the memcg doing the allocation, the memcg being terminated, and maybe some other bits about the system (depending on the priority of the selected victim, this may or may not be an acceptable situation). Then I want to kill *everything* under that memcg. Then I want to publish some information through a sane API (e.g. not dmesg scraping). This is basically our policy as we understand it today. This is notably different than it was a year ago, and it will probably evolve further in the next year. Teaching the kernel all of this stuff has proven to be sort of difficult to maintain and forward-port, and has been very slow to evolve because of how painful it is to test and deploy new kernels. Maybe we can find a way to push this level of policy down to the kernel OOM killer? When this was mentioned internally I got shot down (gently, but shot down none the less). Assuming we had nearly-reliable (it doesn't have to be 100% guaranteed to be useful) OOM-in-userspace, I can keep the adminstrative memcg metadata in memory, implement killing as cruelly as I need, and do all of the logging and publication after the OOM kill is done. 
Most importantly I can test and deploy new policy changes pretty trivially. Handling per-memcg OOM is a different discussion. Here is where we want to be able to extract things like heap profiles or take stats snapshots, grow memcgs (if so configured) etc. Allowing our users to have a moment of mercy before we put a bullet in their brain enables a whole new realm of debugging, as well as a lot of valuable features. It'd depend on the workload but with memcg fully configured it shouldn't fluctuate wildly. If it does, we need to hunt down whatever is causing such fluctuatation and include it in kmemcg, right? That way, memcg as a whole improves for all use cases not just your niche one and I strongly believe that aligning as many use cases as possible along the same axis, rather than creating a large hole to stow away the exceptions, is vastly more beneficial to *everyone* in the long term. We have a long tail of kernel memory usage. If we provision machines so that the do work here first-level memcg excludes the average kernel usage, we have a huge number of machines that will fail to apply OOM policy because of actual overcommitment. If we provision for 95th or 99th percentile kernel usage, we're wasting large amounts of memory that could be used to schedule jobs. This is the fundamental problem we face with static apportionment (and we face it in a dozen other situations, too). Expressing this set-aside memory
Re: [patch 7/8] mm, memcg: allow processes handling oom notifications to access reserves
On Thu, Dec 12, 2013 at 11:23 AM, Tejun Heo t...@kernel.org wrote: Hello, Tim. On Thu, Dec 12, 2013 at 10:42:20AM -0800, Tim Hockin wrote: Yeah sorry. Replying from my phone is awkward at best. I know better :) Heh, sorry about being bitchy. :) In my mind, the ONLY point of pulling system-OOM handling into userspace is to make it easier for crazy people (Google) to implement bizarre system-OOM policies. Example: I think that's one of the places where we largely disagree. If at all Just to be clear - I say this because it doesn't feel right to impose my craziness on others, and it sucks when we try and are met with you're crazy, go away. And you have to admit that happens to Google. :) Punching an escape valve that allows us to be crazy without hurting anyone else sounds ideal, IF and ONLY IF that escape valve is itself maintainable. If the escape valve is userspace it's REALLY easy to iterate on our craziness. If it is kernel space, it's somewhat less easy, but not impossible. possible, I'd much prefer google's workload to be supported inside the general boundaries of the upstream kernel without having to punch a large hole in it. To me, the general development history of memcg in general and this thread in particular seem to epitomize why it is a bad idea to have isolated, large and deep crazy use cases. Punching the initial hole is the easy part; however, we all are quite limited in anticpating future needs and sooner or later that crazy use case is bound to evolve further towards the isolated extreme it departed towards and require more and larger holes and further contortions to accomodate such progress. The concern I have with the suggested solution is not necessarily that it's technically complex than it looks like on the surface - I'm sure it can be made to work one way or the other - but that it's a fairly large step toward an isolated extreme which memcg as a project probably should not head toward. There sure are cases where such exceptions can't be avoided and are good trade-offs but, here, we're talking about a major architectural decision which not only affects memcg but mm in general. I'm afraid this doesn't sound like an no-brainer flexibility we can afford. When we have a system OOM we want to do a walk of the administrative memcg tree (which is only a couple levels deep, users can make non-admin sub-memcgs), selecting the lowest priority entity at each step (where both tasks and memcgs have a priority and the priority range is much wider than the current OOM scores, and where memcg priority is sometimes a function of memcg usage), until we reach a leaf. Once we reach a leaf, I want to log some info about the memcg doing the allocation, the memcg being terminated, and maybe some other bits about the system (depending on the priority of the selected victim, this may or may not be an acceptable situation). Then I want to kill *everything* under that memcg. Then I want to publish some information through a sane API (e.g. not dmesg scraping). This is basically our policy as we understand it today. This is notably different than it was a year ago, and it will probably evolve further in the next year. I think per-memcg score and killing is something which makes fundamental sense. 
In fact, killing a single process has never made much sense to me as that is a unit which ultimately is only meaningful to the kernel itself and not necessraily to userland, so no matter what I think we're gonna gain per-memcg behavior and it seems most, albeit not all, of what you described above should be implementable through that. Well that's an awesome start. We have or had patches to do a lot of this. I don't know how well scrubbed they are for pushing or whether they apply at all to current head, though. Ultimately, if the use case calls for very fine level of control, I think the right thing to do is making nesting work properly which is likely to take some time. In the meantime, even if such use case requires modifying the kernel to tailor the OOM behavior, I think sticking to kernel OOM provides a lot easier way to eventual convergence. Userland system OOM basically means giving up and would lessen the motivation towards improving the shared infrastructures while adding significant pressure towards schizophreic diversion. We have a long tail of kernel memory usage. If we provision machines so that the do work here first-level memcg excludes the average kernel usage, we have a huge number of machines that will fail to apply OOM policy because of actual overcommitment. If we provision for 95th or 99th percentile kernel usage, we're wasting large amounts of memory that could be used to schedule jobs. This is the fundamental problem we face with static apportionment (and we face it in a dozen other situations, too). Expressing this set-aside memory as off-the-top rather than absolute limits makes the whole
Re: [patch 7/8] mm, memcg: allow processes handling oom notifications to access reserves
The immediate problem I see with setting aside reserves "off the top" is that we don't really know a priori how much memory the kernel itself is going to use, which could still land us in an overcommitted state. In other words, if I have your 128 MB machine, and I set aside 8 MB for OOM handling, and give 120 MB for jobs, I have not accounted for the kernel. So I set aside 8 MB for OOM and 100 MB for jobs, leaving 20 MB for jobs. That should be enough right? Hell if I know, and nothing ensures that. On Wed, Dec 11, 2013 at 4:42 AM, Tejun Heo wrote: > Yo, > > On Tue, Dec 10, 2013 at 03:55:48PM -0800, David Rientjes wrote: >> > Well, the gotcha there is that you won't be able to do that with >> > system level OOM handler either unless you create a separately >> > reserved memory, which, again, can be achieved using hierarchical >> > memcg setup already. Am I missing something here? >> >> System oom conditions would only arise when the usage of memcgs A + B >> above cause the page allocator to not be able to allocate memory without >> oom killing something even though the limits of both A and B may not have >> been reached yet. No userspace oom handler can allocate memory with >> access to memory reserves in the page allocator in such a context; it's >> vital that if we are to handle system oom conditions in userspace that we >> given them access to memory that other processes can't allocate. You >> could attach a userspace system oom handler to any memcg in this scenario >> with memory.oom_reserve_in_bytes and since it has PF_OOM_HANDLER it would >> be able to allocate in reserves in the page allocator and overcharge in >> its memcg to handle it. This isn't possible only with a hierarchical >> memcg setup unless you ensure the sum of the limits of the top level >> memcgs do not equal or exceed the sum of the min watermarks of all memory >> zones, and we exceed that. > > Yes, exactly. If system memory is 128M, create top level memcgs w/ > 120M and 8M each (well, with some slack of course) and then overcommit > the descendants of 120M while putting OOM handlers and friends under > 8M without overcommitting. > > ... >> The stronger rationale is that you can't handle system oom in userspace >> without this functionality and we need to do so. > > You're giving yourself an unreasonable precondition - overcommitting > at root level and handling system OOM from userland - and then trying > to contort everything to fit that. How can possibly "overcommitting > at root level" be a goal of and in itself? Please take a step back > and look at and explain the *problem* you're trying to solve. You > haven't explained why that *need*s to be the case at all. > > I wrote this at the start of the thread but you're still doing the > same thing. You're trying to create a hidden memcg level inside a > memcg. At the beginning of this thread, you were trying to do that > for !root memcgs and now you're arguing that you *need* that for root > memcg. Because there's no other limit we can make use of, you're > suggesting the use of kernel reserve memory for that purpose. It > seems like an absurd thing to do to me. It could be that you might > not be able to achieve exactly the same thing that way, but the right > thing to do would be improving memcg in general so that it can instead > of adding yet more layer of half-baked complexity, right? 
> > Even if there are some inherent advantages of system userland OOM > handling with a separate physical memory reserve, which AFAICS you > haven't succeeded at showing yet, this is a very invasive change and, > as you said before, something with an *extremely* narrow use case. > Wouldn't it be a better idea to improve the existing mechanisms - be > that memcg in general or kernel OOM handling - to fit the niche use > case better? I mean, just think about all the corner cases. How are > you gonna handle priority inversion through locked pages or > allocations given out to other tasks through slab? You're suggesting > opening a giant can of worms for extremely narrow benefit which > doesn't even seem like actually needing opening the said can. > > Thanks. > > -- > tejun > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
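Tejun's 120M/8M example corresponds to two top-level memcgs with memory.limit_in_bytes set accordingly; a minimal setup sketch using the memcg v1 files of that era (paths and sizes are illustrative, error handling trimmed):

#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

static int write_str(const char *path, const char *val)
{
        int fd = open(path, O_WRONLY);

        if (fd < 0)
                return -1;
        if (write(fd, val, strlen(val)) < 0) {
                close(fd);
                return -1;
        }
        return close(fd);
}

int main(void)
{
        /* Jobs live under a 120M limit and may be overcommitted below it. */
        mkdir("/sys/fs/cgroup/memory/jobs", 0755);
        write_str("/sys/fs/cgroup/memory/jobs/memory.limit_in_bytes",
                  "125829120");

        /* OOM handlers and friends get a non-overcommitted 8M (plus slack). */
        mkdir("/sys/fs/cgroup/memory/oom-handlers", 0755);
        write_str("/sys/fs/cgroup/memory/oom-handlers/memory.limit_in_bytes",
                  "8388608");
        return 0;
}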
Re: cgroup: status-quo and userland efforts
On Sun, Jun 30, 2013 at 12:39 PM, Lennart Poettering wrote: > Heya, > > > On 29.06.2013 05:05, Tim Hockin wrote: >> >> Come on, now, Lennart. You put a lot of words in my mouth. > > >>> I for sure am not going to make the PID 1 a client of another daemon. >>> That's >>> just wrong. If you have a daemon that is both conceptually the manager of >>> another service and the client of that other service, then that's bad >>> design >>> and you will easily run into deadlocks and such. Just think about it: if >>> you >>> have some external daemon for managing cgroups, and you need cgroups for >>> running external daemons, how are you going to start the external daemon >>> for >>> managing cgroups? Sure, you can hack around this, make that daemon >>> special, >>> and magic, and stuff -- or you can just not do such nonsense. There's no >>> reason to repeat the fuckup that cgroup became in kernelspace a second >>> time, >>> but this time in userspace, with multiple manager daemons all with >>> different >>> and slightly incompatible definitions what a unit to manage actualy is... >> >> >> I forgot about the tautology of systemd. systemd is monolithic. > > > systemd is certainly not monolithic for almost any definition of that term. > I am not sure where you are taking that from, and I am not sure I want to > discuss on that level. This just sounds like FUD you picked up somewhere and > are repeating carelessly... It does a number of sort-of-related things. Maybe it does them better by doing them together. I can't say, really. We don't use it at work, and I am on Ubuntu elsewhere, for now. >> But that's not my point. It seems pretty easy to make this cgroup >> management (in "native mode") a library that can have either a thin >> veneer of a main() function, while also being usable by systemd. The >> point is to solve all of the problems ONCE. I'm trying to make the >> case that systemd itself should be focusing on features and policies >> and awesome APIs. > > You know, getting this all right isn't easy. If you want to do things > properly, then you need to propagate attribute changes between the units you > manage. You also need something like a scheduler, since a number of > controllers can only be configured under certain external conditions (for > example: the blkio or devices controller use major/minor parameters for > configuring per-device limits. Since major/minor assignments are pretty much > unpredictable these days -- and users probably want to configure things with > friendly and stable /dev/disk/by-id/* symlinks anyway -- this requires us to > wait for devices to show up before we can configure the parameters.) Soo... > you need a graph of units, where you can propagate things, and schedule > things based on some execution/event queue. And the propagation and > scheduling are closely intermingled. I'm really just talking about the most basic low-level substrate of writing to cgroupfs. Again, we don't use udev (yet?) so we don't have these problems. It seems to me that it's possible to formulate a bottom layer that is usable by both systemd and non-systemd systems. But, you know, maybe I am wrong and our internal universe is so much simpler (and behind the times) than the rest of the world that layering can work for us and not you. > Now, that's pretty much exactly what systemd actually *is*. It implements a > graph of units with a scheduler. 
And if you rip that part out of systemd to > make this an "easy cgroup management library", then you simply turn what > systemd is into a library without leaving anything. Which is just bogus. > > So no, if you say "seems pretty easy to make this cgroup management a > library" then well, I have to disagree with you. > > >>> We want to run fewer, simpler things on our systems, we want to reuse as >> >> >> Fewer and simpler are not compatible, unless you are losing >> functionality. Systemd is fewer, but NOT simpler. > > > Oh, certainly it is. If we'd split up the cgroup fs access into separate > daemon of some kind, then we'd need some kind of IPC for that, and so you > have more daemons and you have some complex IPC between the processes. So > yeah, the systemd approach is certainly both simpler and uses fewer daemons > then your hypothetical one. Well, it SOUNDS like Serge is trying to develop this to demonstrate that a standalone daemon works. That's what I am keen to help with (or else we have to invent ourselves). I am not really afraid of IPC or of "more daemons". I much prefer simple agents doing one thing and interacting
Re: cgroup: status-quo and userland efforts
On Sun, Jun 30, 2013 at 12:39 PM, Lennart Poettering lpoet...@redhat.com wrote: Heya, On 29.06.2013 05:05, Tim Hockin wrote: Come on, now, Lennart. You put a lot of words in my mouth. I for sure am not going to make the PID 1 a client of another daemon. That's just wrong. If you have a daemon that is both conceptually the manager of another service and the client of that other service, then that's bad design and you will easily run into deadlocks and such. Just think about it: if you have some external daemon for managing cgroups, and you need cgroups for running external daemons, how are you going to start the external daemon for managing cgroups? Sure, you can hack around this, make that daemon special, and magic, and stuff -- or you can just not do such nonsense. There's no reason to repeat the fuckup that cgroup became in kernelspace a second time, but this time in userspace, with multiple manager daemons all with different and slightly incompatible definitions what a unit to manage actualy is... I forgot about the tautology of systemd. systemd is monolithic. systemd is certainly not monolithic for almost any definition of that term. I am not sure where you are taking that from, and I am not sure I want to discuss on that level. This just sounds like FUD you picked up somewhere and are repeating carelessly... It does a number of sort-of-related things. Maybe it does them better by doing them together. I can't say, really. We don't use it at work, and I am on Ubuntu elsewhere, for now. But that's not my point. It seems pretty easy to make this cgroup management (in native mode) a library that can have either a thin veneer of a main() function, while also being usable by systemd. The point is to solve all of the problems ONCE. I'm trying to make the case that systemd itself should be focusing on features and policies and awesome APIs. You know, getting this all right isn't easy. If you want to do things properly, then you need to propagate attribute changes between the units you manage. You also need something like a scheduler, since a number of controllers can only be configured under certain external conditions (for example: the blkio or devices controller use major/minor parameters for configuring per-device limits. Since major/minor assignments are pretty much unpredictable these days -- and users probably want to configure things with friendly and stable /dev/disk/by-id/* symlinks anyway -- this requires us to wait for devices to show up before we can configure the parameters.) Soo... you need a graph of units, where you can propagate things, and schedule things based on some execution/event queue. And the propagation and scheduling are closely intermingled. I'm really just talking about the most basic low-level substrate of writing to cgroupfs. Again, we don't use udev (yet?) so we don't have these problems. It seems to me that it's possible to formulate a bottom layer that is usable by both systemd and non-systemd systems. But, you know, maybe I am wrong and our internal universe is so much simpler (and behind the times) than the rest of the world that layering can work for us and not you. Now, that's pretty much exactly what systemd actually *is*. It implements a graph of units with a scheduler. And if you rip that part out of systemd to make this an easy cgroup management library, then you simply turn what systemd is into a library without leaving anything. Which is just bogus. 
So no, if you say seems pretty easy to make this cgroup management a library then well, I have to disagree with you. We want to run fewer, simpler things on our systems, we want to reuse as Fewer and simpler are not compatible, unless you are losing functionality. Systemd is fewer, but NOT simpler. Oh, certainly it is. If we'd split up the cgroup fs access into separate daemon of some kind, then we'd need some kind of IPC for that, and so you have more daemons and you have some complex IPC between the processes. So yeah, the systemd approach is certainly both simpler and uses fewer daemons then your hypothetical one. Well, it SOUNDS like Serge is trying to develop this to demonstrate that a standalone daemon works. That's what I am keen to help with (or else we have to invent ourselves). I am not really afraid of IPC or of more daemons. I much prefer simple agents doing one thing and interacting with each other in simple ways. But that's me. much of the code as we can. You don't achieve that by running yet another daemon that does worse what systemd can anyway do simpler, easier and better. Considering this is all hypothetical, I find this to be a funny debate. My hypothetical idea is better than your hypothetical idea. Well, systemd is pretty real, and the code to do the unified cgroup management within systemd is pretty complete. systemd is certainly not hypothetical. Fair enough - I did not realize you had
Re: cgroup: status-quo and userland efforts
Come on, now, Lennart. You put a lot of words in my mouth. On Fri, Jun 28, 2013 at 6:48 PM, Lennart Poettering wrote: > On 28.06.2013 20:53, Tim Hockin wrote: > >> a single-agent, we should make a kick-ass implementation that is >> flexible and scalable, and full-featured enough to not require >> divergence at the lowest layer of the stack. Then build systemd on >> top of that. Let systemd offer more features and policies and >> "semantic" APIs. > > > Well, what if systemd is already kick-ass? I mean, if you have a problem > with systemd, then that's your own problem, but I really don't think why I > should bother? I didn't say it wasn't. I said that we can build a common substrate that systemd can build on *and* non-systemd systems can use *and* Google can participate in. > I for sure am not going to make the PID 1 a client of another daemon. That's > just wrong. If you have a daemon that is both conceptually the manager of > another service and the client of that other service, then that's bad design > and you will easily run into deadlocks and such. Just think about it: if you > have some external daemon for managing cgroups, and you need cgroups for > running external daemons, how are you going to start the external daemon for > managing cgroups? Sure, you can hack around this, make that daemon special, > and magic, and stuff -- or you can just not do such nonsense. There's no > reason to repeat the fuckup that cgroup became in kernelspace a second time, > but this time in userspace, with multiple manager daemons all with different > and slightly incompatible definitions what a unit to manage actualy is... I forgot about the tautology of systemd. systemd is monolithic. Therefore it can not have any external dependencies. Therefore it must absorb anything it depends on. Therefore systemd continues to grow in size and scope. Up next: systemd manages your X sessions! But that's not my point. It seems pretty easy to make this cgroup management (in "native mode") a library that can have either a thin veneer of a main() function, while also being usable by systemd. The point is to solve all of the problems ONCE. I'm trying to make the case that systemd itself should be focusing on features and policies and awesome APIs. > We want to run fewer, simpler things on our systems, we want to reuse as Fewer and simpler are not compatible, unless you are losing functionality. Systemd is fewer, but NOT simpler. > much of the code as we can. You don't achieve that by running yet another > daemon that does worse what systemd can anyway do simpler, easier and > better. Considering this is all hypothetical, I find this to be a funny debate. My hypothetical idea is better than your hypothetical idea. > The least you could grant us is to have a look at the final APIs we will > have to offer before you already imply that systemd cannot be a valid > implementation of any API people could ever agree on. Whoah, don't get defensive. I said nothing of the sort. The fact of the matter is that we do not run systemd, at least in part because of the monolithic nature. That's unlikely to change in this timescale. What I said was that it would be a shame if we had to invent our own low-level cgroup daemon just because the "upstream" daemons was too tightly coupled with systemd. I think we have a lot of experience to offer to this project, and a vested interest in seeing it done well. But if it is purely targetting systemd, we have little incentive to devote resources to it. 
Please note that I am strictly talking about the lowest layer of the API. Just the thing that guards cgroupfs against mere mortals. The higher layers - where abstractions exist, that are actually USEFUL to end users - are not really in scope right now. We already have our own higher level APIs. This is supposed to be collaborative, not combative. Tim -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
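The "lowest layer of the API" being asked for amounts to little more than create/set/attach primitives over cgroupfs that either systemd or a standalone daemon could sit on top of. A sketch of such a substrate; the function names are invented for illustration and are not an existing library's API:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#define CGROOT "/sys/fs/cgroup"

/* Create a cgroup under the given controller hierarchy. */
static int cg_create(const char *ctrl, const char *cg)
{
        char path[256];

        snprintf(path, sizeof(path), CGROOT "/%s/%s", ctrl, cg);
        return mkdir(path, 0755);
}

/* Write a value into one control file of a cgroup. */
static int cg_set(const char *ctrl, const char *cg,
                  const char *file, const char *val)
{
        char path[256];
        int fd, ret;

        snprintf(path, sizeof(path), CGROOT "/%s/%s/%s", ctrl, cg, file);
        fd = open(path, O_WRONLY);
        if (fd < 0)
                return -1;
        ret = write(fd, val, strlen(val)) < 0 ? -1 : 0;
        close(fd);
        return ret;
}

/* Move a task into a cgroup. */
static int cg_attach(const char *ctrl, const char *cg, pid_t pid)
{
        char buf[32];

        snprintf(buf, sizeof(buf), "%ld", (long)pid);
        return cg_set(ctrl, cg, "cgroup.procs", buf);
}

Everything policy-related - who may call these, and on which subtrees - would live above this layer, which is the split being argued for.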
Re: cgroup access daemon
On Fri, Jun 28, 2013 at 12:21 PM, Serge Hallyn wrote: > Quoting Tim Hockin (thoc...@hockin.org): >> On Fri, Jun 28, 2013 at 9:31 AM, Serge Hallyn >> wrote: >> > Quoting Tim Hockin (thoc...@hockin.org): >> >> On Thu, Jun 27, 2013 at 11:11 AM, Serge Hallyn >> >> wrote: >> >> > Quoting Tim Hockin (thoc...@hockin.org): >> > Could you give examples? >> > >> > If you have a white/academic paper I should go read, that'd be great. >> >> We don't have anything on this, but examples may help. >> >> Someone running as root should be able to connect to the "native" >> daemon and read or write any cgroup file they want, right? You could >> argue that root should be able to do this to a child-daemon, too, but >> let's ignore that. >> >> But inside a container, I don't want the users to be able to write to >> anything in their own container. I do want them to be able to make >> sub-cgroups, but only 5 levels deep. For sub-cgroups, they should be >> able to write to memory.limit_in_bytes, to read but not write >> memory.soft_limit_in_bytes, and not be able to read memory.stat. >> >> To get even fancier, a user should be able to create a sub-cgroup and >> then designate that sub-cgroup as "final" - no further sub-sub-cgroups >> allowed under it. They should also be able to designate that a >> sub-cgroup is "one-way" - once a process enters it, it can not leave. >> >> These are real(ish) examples based on what people want to do today. >> In particular, the last couple are things that we want to do, but >> don't do today. >> >> The particular policy can differ per-container. Production jobs might >> be allowed to create sub-cgroups, but batch jobs are not. Some user >> jobs are designated "trusted" in one facet or another and get more >> (but still not full) access. > > Interesting, thanks. > > I'll think a bit on how to best address these. > >> > At the moment I'm going on the naive belief that proper hierarchy >> > controls will be enforced (eventually) by the kernel - i.e. if >> > a task in cgroup /lxc/c1 is not allowed to mknod /dev/sda1, then it >> > won't be possible for /lxc/c1/lxc/c2 to take that access. >> > >> > The native cgroup manager (the one using cgroupfs) will be checking >> > the credentials of the requesting child manager for access(2) to >> > the cgroup files. >> >> This might be sufficient, or the basis for a sufficient access control >> system for users. The problem comes that we have multiple jobs on a >> single machine running as the same user. We need to ensure that the >> jobs can not modify each other. > > Would running them each in user namespaces with different mappings (all > jobs running as uid 1000, but uid 1000 mapped to different host uids > for each job) would be (long-term) feasible? Possibly. It's a largish imposition to make on the caller (we don't use user namespaces today, though we are evaluating how to start using them) but perhaps not terrible. >> > It is a named socket. >> >> So anyone can connect? even with SO_PEERCRED, how do you know which >> branches of the cgroup tree I am allowed to modify if the same user >> owns more than one? > > I was assuming that any process requesting management of > /c1/c2/c3 would have to be in one of its ancestor cgroups (i.e. 
/c1) > > So if you have two jobs running as uid 1000, one under /c1 and one > under /c2, and one as uid 1001 running under /c3 (with the uids owning > the cgroups), then the file permissions will prevent 1000 and 1001 > from walk over each other, while the cgroup manager will not allow > a process (child manager or otherwise) under /c1 to manage cgroups > under /c2 and vice versa. > >> >> Do you have a design spec, or a requirements list, or even a prototype >> >> that we can look at? >> > >> > The readme at https://github.com/hallyn/cgroup-mgr/blob/master/README >> > shows what I have in mind. It (and the sloppy code next to it) >> > represent a few hours' work over the last few days while waiting >> > for compiles and in between emails... >> >> Awesome. Do you mind if we look? > > No, but it might not be worth it (other than the readme) :) - so far > it's only served to help me think through what I want and need from > the mgr. > >> > But again, it is completely predicated on my goal to have libvirt >> > and lxc (and other cgroup users) be able to use the same library >> > or API to make their requests whether they are on host or in a container, >> > and regardless of the distro they're running under. >> >> I think that is a good goal. We'd like to not be different, if possible. >> Obviously, we can't impose our needs on you if you don't want to handle them. >> It sounds like what you are building is the bottom layer in a stack - we >> (Google) should use that same bottom layer. But that can only happen iff >> you're open to hearing our requirements. Otherwise we have to strike out on >> our own or build more layers in-between. > > I'm definately open to your requirements - whether providing what you need > for another layer on top, or building it right in. Great. That's a good place to start :)
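For illustration only: the ancestor-cgroup check Serge describes could be done with SO_PEERCRED plus /proc/<pid>/cgroup. The sketch below is hypothetical - the function name, buffer sizes and the loose controller match are invented - but it shows the kind of check a manager would perform before honoring a request to manage /c1/c2/c3.

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical sketch: return 1 if 'target' (e.g. "/c1/c2/c3") is equal to
 * or below the connecting client's own cgroup for 'controller' (e.g.
 * "memory"), else 0. The client's pid comes from SO_PEERCRED on the unix
 * socket; its cgroup comes from /proc/<pid>/cgroup. Error handling is
 * abbreviated. */
static int client_may_manage(int conn_fd, const char *controller, const char *target)
{
    struct ucred cred;
    socklen_t len = sizeof(cred);
    char path[256], line[512], client_cg[256] = "";
    FILE *f;

    if (getsockopt(conn_fd, SOL_SOCKET, SO_PEERCRED, &cred, &len) < 0)
        return 0;

    snprintf(path, sizeof(path), "/proc/%d/cgroup", (int)cred.pid);
    f = fopen(path, "r");
    if (!f)
        return 0;

    /* Each line looks like "4:memory:/c1"; pick the controller we care about
     * (strstr is a deliberately loose match for this sketch). */
    while (fgets(line, sizeof(line), f)) {
        char *subsys = strchr(line, ':');
        char *cg = subsys ? strchr(subsys + 1, ':') : NULL;
        if (!subsys || !cg)
            continue;
        *cg++ = '\0';
        cg[strcspn(cg, "\n")] = '\0';
        if (strstr(subsys + 1, controller)) {
            snprintf(client_cg, sizeof(client_cg), "%s", cg);
            break;
        }
    }
    fclose(f);

    /* The client may act on its own cgroup or anything underneath it. */
    size_t n = strlen(client_cg);
    return n && strncmp(target, client_cg, n) == 0 &&
           (target[n] == '\0' || target[n] == '/' || strcmp(client_cg, "/") == 0);
}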
Re: [Workman-devel] cgroup: status-quo and userland efforts
On Fri, Jun 28, 2013 at 8:53 AM, Serge Hallyn wrote: > Quoting Daniel P. Berrange (berra...@redhat.com): >> Are you also planning to actually write a new cgroup parent manager >> daemon too ? Currently my plan for libvirt is to just talk directly > > I'm toying with the idea, yes. (Right now my toy runs in either native > mode, using cgroupfs, or child mode, talking to a parent manager) I'd > love if someone else does it, but it needs to be done. > > As I've said elsewhere in the thread, I see 2 problems to be addressed: > > 1. The ability to nest the cgroup manager daemons, so that a daemon > running in a container can talk to a daemon running on the host. This > is the problem my current toy is aiming to address. But the API it > exports is just a thin layer over cgroupfs. > > 2. Abstract away the kernel/cgroupfs details so that userspace can > explain its cgroup needs generically. This is IIUC what systemd is > addressing with slices and scopes. > > (2) is where I'd really like to have a well thought out, community > designed API that everyone can agree on, and it might be worth getting > together (with Tejun) at plumbers or something to lay something out. We're also working on (2) (well, we HAVE it, but we're dis-integrating it so we can hopefully publish more widely). But our (2) depends on direct cgroupfs access. If that is to change, we need a really robust (1). It's OK (desireable, in fact) that (1) be a very thin layer of abstraction. > In the end, something like libvirt or lxc should not need to care > what is running underneat it. It should be able to make its requests > the same way regardless of whether it running in fedora or ubuntu, > and whether it is running on the host or in a tightly bound container. > That's my goal anyway :) > >> to systemd's new DBus APIs for all management of cgroups, and then >> fall back to writing to cgroupfs directly for cases where systemd >> is not around. Having a library to abstract these two possible >> alternatives isn't all that compelling unless we think there will >> be multiple cgroups manager daemons. I've been somewhat assuming that >> even Ubuntu will eventually see the benefits & switch to systemd, > > So far I've seen no indication of that :) > > If the systemd code to manage slices could be made separately > compileable as a standalone library or daemon, then I'd advocate > using that. But I don't see a lot of incentive for systemd to do > that, so I'd feel like a heel even asking. I want to say "let the best API win", but I know that systemd is a giant katamari ball, and it's absorbing subsystems so it may win by default. That isn't going to stop us from trying to do what we do, and share that with the world. >> then the issue of multiple manager daemons wouldn't really exist. > > True. But I'm running under the assumption that Ubuntu will stick with > upstart, and therefore yes I'll need a separate (perhaps pair of) > management daemons. > > Even if we were to switch to systemd, I'd like the API for userspace > programs to configure and use cgroups to be as generic as possible, > so that anyone who wanted to write their own daemon could do so. > > -serge -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: cgroup: status-quo and userland efforts
On Fri, Jun 28, 2013 at 8:05 AM, Michal Hocko wrote: > On Thu 27-06-13 22:01:38, Tejun Heo wrote: >> Oh, that in itself is not bad. I mean, if you're root, it's pretty >> easy to play with and that part is fine. But combined with the >> hierarchical nature of cgroup and file permissions, it encourages >> people to "deligate" subdirectories to less previledged domains, > > OK, this really depends on what you expose to non-root users. I have > seen use cases where admin prepares top-level which is root-only but > it allows creating sub-groups which are under _full_ control of the > subdomain. This worked nicely for memcg for example because hard limit, > oom handling and other knobs are hierarchical so the subdomain cannot > overwrite what admin has said. bingo > And the systemd, with its history of eating projects and not caring much > about their previous users who are not willing to jump in to the systemd > car, doesn't sound like a good place where to place the new interface to > me. +1 If systemd is the only upstream implementation of this single-agent idea, we will have to invent our own, and continue to diverge rather than converge. I think that, if we are going to pursue this model of a single-agent, we should make a kick-ass implementation that is flexible and scalable, and full-featured enough to not require divergence at the lowest layer of the stack. Then build systemd on top of that. Let systemd offer more features and policies and "semantic" APIs. We will build our own semantic APIs that are, necessarily, different from systemd. But we can all use the same low-level mechanism. Tim -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
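As an aside, the delegation pattern Michal describes needs no new machinery beyond ordinary file permissions. A minimal sketch under cgroup v1 semantics - the /sys/fs/cgroup/memory/app path, uid 1000 and the 512M limit are illustrative, and error handling is omitted:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Sketch of cgroup-v1 style delegation: root creates a memory cgroup,
 * pins the hard limit, then hands the directory to an unprivileged uid.
 * The user can create sub-groups and tune their knobs, but because the
 * hard limit is hierarchical, nothing below can exceed what root set. */
int main(void)
{
    const char *cg = "/sys/fs/cgroup/memory/app";   /* illustrative path */
    uid_t user = 1000;                              /* the delegated uid */
    char path[256];
    int fd;

    mkdir(cg, 0755);

    /* Root-owned hard limit; the subdomain cannot raise it. */
    snprintf(path, sizeof(path), "%s/memory.limit_in_bytes", cg);
    fd = open(path, O_WRONLY);
    if (fd >= 0) {
        write(fd, "536870912", strlen("536870912")); /* 512M */
        close(fd);
    }

    /* Hand the directory (and the task file) to the user so it can
     * create sub-groups and move its own tasks around. */
    chown(cg, user, user);
    snprintf(path, sizeof(path), "%s/cgroup.procs", cg);
    chown(path, user, user);

    return 0;
}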
Re: cgroup: status-quo and userland efforts
On Thu, Jun 27, 2013 at 2:04 PM, Tejun Heo wrote: > Hello, > > On Thu, Jun 27, 2013 at 01:46:18PM -0700, Tim Hockin wrote: >> So what you're saying is that you don't care that this new thing is >> less capable than the old thing, despite it having real impact. > > Sort of. I'm saying, at least up until now, moving away from > orthogonal hierarchy support seems to be the right trade-off. It all > depends on how you measure how much things are simplified and how > heavy the "real impacts" are. It's not like these things can be > determined white and black. Given the current situation, I think it's > the right call. I totally understand where you're coming from - trying to get back to a stable feature set. But it sucks to be on the losing end of that battle - you're cutting things that REALLY matter to us, and without a really viable alternative. So we'll keep fighting. >> If controller C is enabled at level X but disabled at level X/Y, does >> that mean that X/Y uses the limits set in X? How about X/Y/Z? > > Y and Y/Z wouldn't make any difference. Tasks belonging to them would > behave as if they belong to X as far as C is concerened. OK, that *sounds* sane. It doesn't solve all our problems, but it alleviates some of them. >> So take away some of the flexibility that has minimal impact and >> maximum return. Splitting threads across cgroups - we use it, but we >> could get off that. Force all-or-nothing joining of an aggregate > > Please do so. Splitting threads is sort of important for some cgroups, like CPU. I wonder if pjt is paying attention to this thread. >> construct (a container vs N cgroups). >> >> But perform surgery with a scalpel, not a hatchet. > > As anything else, it's drawing a line in a continuous spectrum of > grey. Right now, given that maintaining multiple orthogonal > hierarchies while introducing a proper concept of resource container > involves addition of completely new constructs and complexity, I don't > think that's a good option. If there are problems which can't be > resolved / worked around in a reasonable manner, please bring them up > along with their contexts. Let's examine them and see whether there > are other ways to accomodate them. You're arguing that the abstraction you want is that of a "container" but that it's easier to remove options than to actually build a better API. I think this is wrong. Take the opportunity to define the RIGHT interface that you WANT - a container. Implement it in terms of cgroups (and maybe other stuff!). Make that API so compelling that people want to use it, and your war of attrition on direct cgroup madness will be won, but with net progress rather than regress. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: cgroup access daemon
On Fri, Jun 28, 2013 at 9:31 AM, Serge Hallyn wrote: > Quoting Tim Hockin (thoc...@hockin.org): >> On Thu, Jun 27, 2013 at 11:11 AM, Serge Hallyn >> wrote: >> > Quoting Tim Hockin (thoc...@hockin.org): >> > >> >> For our use case this is a huge problem. We have people who access >> >> cgroup files in a fairly tight loops, polling for information. We >> >> have literally hundeds of jobs running on sub-second frequencies - >> >> plumbing all of that through a daemon is going to be a disaster. >> >> Either your daemon becomes a bottleneck, or we have to build something >> >> far more scalable than you really want to. Not to mention the >> >> inefficiency of inserting a layer. >> > >> > Currently you can trivially create a container which has the >> > container's cgroups bind-mounted to the expected places >> > (/sys/fs/cgroup/$controller) by uncommenting two lines in the >> > configuration file, and handle cgroups through cgroupfs there. >> > (This is what the management agent wants to be an alternative >> > for) The main deficiency there is that /proc/self/cgroups is >> > not filtered, so it will show /lxc/c1 for init's cgroup, while >> > the host's /sys/fs/cgroup/devices/lxc/c1/c1.real will be what >> > is seen under the container's /sys/fs/cgroup/devices (for >> > instance). Not ideal. >> >> I'm really saying that if your daemon is to provide a replacement for >> cgroupfs direct access, it needs to be designed to be scalable. If >> we're going to get away from bind mounting cgroupfs into user >> namespaces, then let's try to solve ALL the problems. >> >> >> We also need the ability to set up eventfds for users or to let them >> >> poll() on the socket from this daemon. >> > >> > So you'd want to be able to request updates when any cgroup value >> > is changed, right? >> >> Not necessarily ANY, but that's the terminus of this API facet. >> >> > That's currently not in my very limited set of commands, but I can >> > certainly add it, and yes it would be a simple unix sock so you can >> > set up eventfd, select/poll, etc. >> >> Assuming the protocol is basically a pass-through to basic filesystem >> ops, it should be pretty easy. You just need to add it to your >> protocol. >> >> But it brings up another point - access control. How do you decide >> which files a child agent should have access to? Does that ever >> change based on the child's configuration? In our world, the answer is >> almost certainly yes. > > Could you give examples? > > If you have a white/academic paper I should go read, that'd be great. We don't have anything on this, but examples may help. Someone running as root should be able to connect to the "native" daemon and read or write any cgroup file they want, right? You could argue that root should be able to do this to a child-daemon, too, but let's ignore that. But inside a container, I don't want the users to be able to write to anything in their own container. I do want them to be able to make sub-cgroups, but only 5 levels deep. For sub-cgroups, they should be able to write to memory.limit_in_bytes, to read but not write memory.soft_limit_in_bytes, and not be able to read memory.stat. To get even fancier, a user should be able to create a sub-cgroup and then designate that sub-cgroup as "final" - no further sub-sub-cgroups allowed under it. They should also be able to designate that a sub-cgroup is "one-way" - once a process enters it, it can not leave. These are real(ish) examples based on what people want to do today. 
In particular, the last couple are things that we want to do, but don't do today. The particular policy can differ per-container. Production jobs might be allowed to create sub-cgroups, but batch jobs are not. Some user jobs are designated "trusted" in one facet or another and get more (but still not full) access. > At the moment I'm going on the naive belief that proper hierarchy > controls will be enforced (eventually) by the kernel - i.e. if > a task in cgroup /lxc/c1 is not allowed to mknod /dev/sda1, then it > won't be possible for /lxc/c1/lxc/c2 to take that access. > > The native cgroup manager (the one using cgroupfs) will be checking > the credentials of the requesting child manager for access(2) to > the cgroup files. This might be sufficient, or the basis for a sufficient access control system for users. The problem comes that we have multiple jobs on a single machine running as the same user. We need to ensure that the jobs can not modify each other. >>
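To make the shape of such a policy concrete, here is a purely hypothetical sketch of the per-container data a manager daemon would need to hold - none of these structures or names exist anywhere; they only restate the examples above in code form:

/* Hypothetical per-container cgroup policy, restating the examples above.
 * Nothing here is an existing API; it only illustrates what a manager
 * daemon would have to be able to express. */
enum file_access { ACCESS_NONE, ACCESS_READ, ACCESS_READ_WRITE };

struct file_rule {
    const char *file;          /* e.g. "memory.limit_in_bytes" */
    enum file_access access;
};

struct cgroup_policy {
    int max_depth;             /* "sub-cgroups, but only 5 levels deep" */
    int may_create_subgroups;  /* production jobs: yes, batch jobs: no  */
    int may_mark_final;        /* "no further sub-sub-cgroups below"    */
    int may_mark_one_way;      /* "tasks may enter but never leave"     */
    const struct file_rule *rules;
    int nrules;
};

static const struct file_rule example_rules[] = {
    { "memory.limit_in_bytes",      ACCESS_READ_WRITE },
    { "memory.soft_limit_in_bytes", ACCESS_READ       },
    { "memory.stat",                ACCESS_NONE       },
};

static const struct cgroup_policy example_container_policy = {
    .max_depth            = 5,
    .may_create_subgroups = 1,
    .may_mark_final       = 1,
    .may_mark_one_way     = 1,
    .rules                = example_rules,
    .nrules               = 3,
};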
Re: cgroup: status-quo and userland efforts
On Thu, Jun 27, 2013 at 11:14 AM, Serge Hallyn wrote: > Quoting Tejun Heo (t...@kernel.org): >> Hello, Serge. >> >> On Thu, Jun 27, 2013 at 08:22:06AM -0500, Serge Hallyn wrote: >> > At some point (probably soon) we might want to talk about a standard API >> > for these things. However I think it will have to come in the form of >> > a standard library, which knows to either send requests over dbus to >> > systemd, or over /dev/cgroup sock to the manager. >> >> Yeah, eventually, I think we'll have a standardized way to configure >> resource distribution in the system. Maybe we'll agree on a >> standardized dbus protocol or there will be library, I don't know; >> however, whatever form it may be in, it abstraction level should be >> way higher than that of direct cgroupfs access. It's way too low >> level and very easy to end up in a complete nonsense configuration. >> >> e.g. enabling "cpu" on a cgroup whlie leaving other cgroups alone >> wouldn't enable fair scheduling on that cgroup but drastically reduce >> the amount of cpu share it gets as it now gets treated as single >> entity competing with all tasks at the parent level. > > Right. I *think* this can be offered as a daemon which sits as the > sole consumer of my agent's API, and offers a higher level "do what I > want" API. But designing that API is going to be interesting. This is something we have, partially, and are working to be able to open-source. We have a LOT of experience feeding into the semantics that actually make users happy. Today it leverages split-hierarchies, but that is not required in the generic case (only if you want to offer the semantics we do). It explicitly delegates some aspects of sub-cgroup control to users, but that could go away if your lowest-level agency can handle it. > I should find a good, up-to-date summary of the current behaviors of > each controller so I can talk more intelligently about it. (I'll > start by looking at the kernel Documentation/cgroups, but don't > feel too confident that they'll be uptodate :) > >> At the moment, I'm not sure what the eventual abstraction would look >> like. systemd is extending its basic constructs by adding slices and >> scopes and it does make sense to integrate the general organization of >> the system (services, user sessions, VMs and so on) with resource >> management. Given some time, I'm hoping we'll be able to come up with >> and agree on some common constructs so that each workload can indicate >> its resource requirements in a unified way. >> >> That said, I really think we should experiment for a while before >> trying to settle down on things. We've now just started exploring how >> system-wide resource managment can be made widely available to systems >> without requiring extremely specialized hand-crafted configurations >> and I'm pretty sure we're getting and gonna get quite a few details >> wrong, so I don't think it'd be a good idea to try to agree on things >> right now. As far as such integration goes, I think it's time to play >> with things and observe the results. > > Right, I'm not attached to my toy implementation at all - except for > the ability, in some fashion, to have nested agents which don't have > cgroupfs access but talk to another agent to get the job done. > > -serge -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
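The cpu example quoted above is easy to quantify. In this sketch the task counts are made up: a cgroup left at the default cpu.shares of 1024 competes as a single entity against every individual task at the parent level, so all of its tasks together receive roughly what one ungrouped task receives.

#include <stdio.h>

/* Illustrative numbers only: one cgroup with the default cpu.shares of
 * 1024 competing against N ungrouped tasks at the same level. All of the
 * cgroup's tasks combined get shares / (shares + N*1024) of the CPU,
 * no matter how many tasks the cgroup contains. */
int main(void)
{
    int ungrouped_tasks = 50;
    int cgroup_shares = 1024;      /* default cpu.shares */
    double total = cgroup_shares + ungrouped_tasks * 1024.0;

    printf("cgroup (all of its tasks combined): %.1f%% of CPU\n",
           100.0 * cgroup_shares / total);
    printf("each ungrouped task:                %.1f%% of CPU\n",
           100.0 * 1024.0 / total);
    return 0;
}

With 50 ungrouped tasks both lines print about 2%, which is the "drastically reduce the amount of cpu share" effect being described.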
Re: cgroup: status-quo and userland efforts
On Thu, Jun 27, 2013 at 10:38 AM, Tejun Heo wrote: > Hello, Tim. > > On Wed, Jun 26, 2013 at 08:42:21PM -0700, Tim Hockin wrote: >> OK, then what I don't know is what is the new interface? A new cgroupfs? > > It's gonna be a new mount option for cgroupfs. > >> DTF and CPU and cpuset all have "default" groups for some tasks (and >> not others) in our world today. DTF actually has default, prio, and >> "normal". I was simplifying before. I really wish it were as simple >> as you think it is. But if it were, do you think I'd still be >> arguing? > > How am I supposed to know when you don't communicate it but just wave > your hands saying it's all very complicated? The cpuset / blkcg > example is pretty bad because you can enforce any cpuset rules at the > leaves. Modifying hundreds of cgroups is really painful, and yes, we do it often enough to be able to see it. >> This really doesn't scale when I have thousands of jobs running. >> Being able to disable at some levels on some controllers probably >> helps some, but I can't say for sure without knowing the new interface > > How does the number of jobs affect it? Does each job create a new > cgroup? Well, in your model it does... >> We tried it in unified hierarchy. We had our Top People on the >> problem. The best we could get was bad enough that we embarked on a >> LITERAL 2 year transition to make it better. > > What didn't work? What part was so bad? I find it pretty difficult > to believe that multiple orthogonal hierarchies is the only possible > solution, so please elaborate the issues that you guys have > experienced. I'm looping in more Google people. > The hierarchy is for organization and enforcement of dynamic > hierarchical resource distribution and that's it. If its expressive > power is lacking, take compromise or tune the configuration according > to the workloads. The latter is necessary in workloads which have > clear distinction of foreground and background anyway - anything which > interacts with human beings including androids. So what you're saying is that you don't care that this new thing is less capable than the old thing, despite it having real impact. >> In other words, define a container as a set of cgroups, one under each >> each active controller type. A TID enters the container atomically, >> joining all of the cgroups or none of the cgroups. >> >> container C1 = { /cgroup/cpu/foo, /cgroup/memory/bar, >> /cgroup/io/default/foo/bar, /cgroup/cpuset/ >> >> This is an abstraction that we maintain in userspace (more or less) >> and we do actually have headaches from split hierarchies here >> (handling partial failures, non-atomic joins, etc) > > That'd separate out task organization from controllre config > hierarchies. Kay had a similar idea some time ago. I think it makes > things even more complex than it is right now. I'll continue on this > below. > >> I'm still a bit fuzzy - is all of this written somewhere? > > If you dig through cgroup ML, most are there. There'll be > "cgroup.controllers" file with which you can enable / disable > controllers. Enabling a controller in a cgroup implies that the > controller is enabled in all ancestors. Implies or requires? Cause or predicate? If controller C is enabled at level X but disabled at level X/Y, does that mean that X/Y uses the limits set in X? How about X/Y/Z? This will get rid of the bulk of the cpuset scaling problem, but not all of it. I think we still have the same problems with cpu as we do with io. Perhaps that should have been the example. 
>> It sounds like you're missing a layer of abstraction. Why not add the >> abstraction you want to expose on top of powerful primitives, instead >> of dumbing down the primitives? > > It sure would be possible build more and try to address the issues > we're seeing now; however, after looking at cgroups for some time now, > the underlying theme is failure to take reasonable trade-offs and > going for maximum flexibility in making each choice - the choice of > interface, multiple hierarchies, no restriction on hierarchical > behavior, splitting threads of the same process into separate cgroups, > semi-encouraging delegation through file permission without actually > pondering the consequences and so on. And each choice probably made > sense trying to serve each immediate requirement at the time but added > up it's a giant pile of mess which developed without direction. I am very sympathetic to this problem. You could have just described some of our internal problems too. The difference is that we are trying to make changes that provide more structure and boundaries in ways
Re: cgroup access daemon
On Thu, Jun 27, 2013 at 11:11 AM, Serge Hallyn wrote: > Quoting Tim Hockin (thoc...@hockin.org): > >> For our use case this is a huge problem. We have people who access >> cgroup files in a fairly tight loops, polling for information. We >> have literally hundeds of jobs running on sub-second frequencies - >> plumbing all of that through a daemon is going to be a disaster. >> Either your daemon becomes a bottleneck, or we have to build something >> far more scalable than you really want to. Not to mention the >> inefficiency of inserting a layer. > > Currently you can trivially create a container which has the > container's cgroups bind-mounted to the expected places > (/sys/fs/cgroup/$controller) by uncommenting two lines in the > configuration file, and handle cgroups through cgroupfs there. > (This is what the management agent wants to be an alternative > for) The main deficiency there is that /proc/self/cgroups is > not filtered, so it will show /lxc/c1 for init's cgroup, while > the host's /sys/fs/cgroup/devices/lxc/c1/c1.real will be what > is seen under the container's /sys/fs/cgroup/devices (for > instance). Not ideal. I'm really saying that if your daemon is to provide a replacement for cgroupfs direct access, it needs to be designed to be scalable. If we're going to get away from bind mounting cgroupfs into user namespaces, then let's try to solve ALL the problems. >> We also need the ability to set up eventfds for users or to let them >> poll() on the socket from this daemon. > > So you'd want to be able to request updates when any cgroup value > is changed, right? Not necessarily ANY, but that's the terminus of this API facet. > That's currently not in my very limited set of commands, but I can > certainly add it, and yes it would be a simple unix sock so you can > set up eventfd, select/poll, etc. Assuming the protocol is basically a pass-through to basic filesystem ops, it should be pretty easy. You just need to add it to your protocol. But it brings up another point - access control. How do you decide which files a child agent should have access to? Does that ever change based on the child's configuration? In our world, the answer is almost certainly yes. >> >> > So then the idea would be that userspace (like libvirt and lxc) would >> >> > talk over /dev/cgroup to its manager. Userspace inside a container >> >> > (which can't actually mount cgroups itself) would talk to its own >> >> > manager which is talking over a passed-in socket to the host manager, >> >> > which in turn runs natively (uses cgroupfs, and nests "create /c1" under >> >> > the requestor's cgroup). >> >> >> >> How do you handle updates of this agent? Suppose I have hundreds of >> >> running containers, and I want to release a new version of the cgroupd >> >> ? >> > >> > This may change (which is part of what I want to investigate with some >> > POC), but right now I'm building any controller-aware smarts into it. I >> > think that's what you're asking about? The agent doesn't do "slices" >> > etc. This may turn out to be insufficient, we'll see. >> >> No, what I am asking is a release-engineering problem. Suppose we >> need to roll out a new version of this daemon (some new feature or a >> bug or something). We have hundreds of these "child" agents running >> in the job containers. > > When I say "container" I mean an lxc container, with it's own isolated > rootfs and mntns. I'm not sure what your "containers" are, but I if > they're not that, then they shouldn't need to run a child agent. 
They > can just talk over the host cgroup agent's socket. If they talk over the host agent's socket, where is the access control and restriction done? Who decides how deep I can nest groups? Who says which files I may access? Who stops me from modifying someone else's container? Our containers are somewhat thinner and more managed than LXC, but not that much. If we're running a system agent in a user container, we need to manage that software. We can't just start up a version and leave it running until the user decides to upgrade - we force upgrades. >> How do I bring down all these children, and then bring them back up on >> a new version in a way that does not disrupt user jobs (much)? >> >> Similarly, what happens when one of these child agents crashes? Does >> someone restart it? Do user jobs just stop working? > > An upstart^W$init_system job will restart it... What happens when the main agent crashes? All those children on UNIX sockets need to reconnect, I guess.
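For reference, the eventfd facility Tim mentions already exists for cgroup v1 memory thresholds via cgroup.event_control. A sketch of that flow follows - the cgroup path and the 256M threshold are illustrative, and error handling is abbreviated:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Sketch of cgroup-v1 memory threshold notification: register an eventfd
 * against memory.usage_in_bytes through cgroup.event_control, then block
 * on the eventfd instead of polling the file in a tight loop. */
int main(void)
{
    const char *cg = "/sys/fs/cgroup/memory/mygroup";  /* illustrative */
    char buf[256], ctrl_path[256], usage_path[256];
    uint64_t count;
    int efd, ufd, cfd;

    efd = eventfd(0, 0);
    snprintf(usage_path, sizeof(usage_path), "%s/memory.usage_in_bytes", cg);
    ufd = open(usage_path, O_RDONLY);
    snprintf(ctrl_path, sizeof(ctrl_path), "%s/cgroup.event_control", cg);
    cfd = open(ctrl_path, O_WRONLY);
    if (efd < 0 || ufd < 0 || cfd < 0)
        return 1;

    /* Format: "<eventfd> <fd of memory.usage_in_bytes> <threshold in bytes>" */
    snprintf(buf, sizeof(buf), "%d %d %llu", efd, ufd, 256ULL * 1024 * 1024);
    write(cfd, buf, strlen(buf));

    /* Blocks until usage crosses the 256M threshold. */
    read(efd, &count, sizeof(count));
    printf("threshold crossed\n");
    return 0;
}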
cgroup access daemon
Changing the subject, so as not to mix two discussions On Thu, Jun 27, 2013 at 9:18 AM, Serge Hallyn wrote: > >> > FWIW, the code is too embarassing yet to see daylight, but I'm playing >> > with a very lowlevel cgroup manager which supports nesting itself. >> > Access in this POC is low-level ("set freezer.state to THAWED for cgroup >> > /c1/c2", "Create /c3"), but the key feature is that it can run in two >> > modes - native mode in which it uses cgroupfs, and child mode where it >> > talks to a parent manager to make the changes. >> >> In this world, are users able to read cgroup files, or do they have to >> go through a central agent, too? > > The agent won't itself do anything to stop access through cgroupfs, but > the idea would be that cgroupfs would only be mounted in the agent's > mntns. My hope would be that the libcgroup commands (like cgexec, > cgcreate, etc) would know to talk to the agent when possible, and users > would use those. For our use case this is a huge problem. We have people who access cgroup files in a fairly tight loops, polling for information. We have literally hundeds of jobs running on sub-second frequencies - plumbing all of that through a daemon is going to be a disaster. Either your daemon becomes a bottleneck, or we have to build something far more scalable than you really want to. Not to mention the inefficiency of inserting a layer. We also need the ability to set up eventfds for users or to let them poll() on the socket from this daemon. >> > So then the idea would be that userspace (like libvirt and lxc) would >> > talk over /dev/cgroup to its manager. Userspace inside a container >> > (which can't actually mount cgroups itself) would talk to its own >> > manager which is talking over a passed-in socket to the host manager, >> > which in turn runs natively (uses cgroupfs, and nests "create /c1" under >> > the requestor's cgroup). >> >> How do you handle updates of this agent? Suppose I have hundreds of >> running containers, and I want to release a new version of the cgroupd >> ? > > This may change (which is part of what I want to investigate with some > POC), but right now I'm building any controller-aware smarts into it. I > think that's what you're asking about? The agent doesn't do "slices" > etc. This may turn out to be insufficient, we'll see. No, what I am asking is a release-engineering problem. Suppose we need to roll out a new version of this daemon (some new feature or a bug or something). We have hundreds of these "child" agents running in the job containers. How do I bring down all these children, and then bring them back up on a new version in a way that does not disrupt user jobs (much)? Similarly, what happens when one of these child agents crashes? Does someone restart it? Do user jobs just stop working? > > So the only state which the agent stores is a list of cgroup mounts (if > in native mode) or an open socket to the parent (if in child mode), and a > list of connected children sockets. > > HUPping the agent will cause it to reload the cgroupfs mounts (in case > you've mounted a new controller, living in "the old world" :). If you > just kill it and start a new one, it shouldn't matter. > >> (note: inquiries about the implementation do not denote acceptance of >> the model :) > > To put it another way, the problem I'm solving (for now) is not the "I > want a daemon to ensure that requested guarantees are correctly > implemented." In that sense I'm maintaining the status quo, i.e. 
the > admin needs to architect the layout correctly. > > The problem I'm solving is really that I want containers to be able to > handle cgroups even if they can't mount cgroupfs, and I want all > userspace to be able to behave the same whether they are in a container > or not. > > This isn't meant as a poke in the eye of anyone who wants to address the > other problem. If it turns out that we (meaning "the community of > cgroup users") really want such an agent, then we can add that. I'm not > convinced. > > What would probably be a better design, then, would be that the agent > I'm working on can plug into a resource allocation agent. Or, I > suppose, the other way around. > >> > At some point (probably soon) we might want to talk about a standard API >> > for these things. However I think it will have to come in the form of >> > a standard library, which knows to either send requests over dbus to >> > systemd, or over /dev/cgroup sock to the manager. >> > >> > -serge -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
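Since the agent's wire protocol is only hinted at above, the following client is purely hypothetical - the /dev/cgroup socket path and the "SET <cgroup> <file> <value>" line format are invented for illustration - but it shows how thin a pass-through for a request like "set freezer.state to THAWED for cgroup /c1/c2" can stay:

#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Hypothetical client for a low-level cgroup manager socket. The request
 * format and socket path are invented; the point is only that the layer
 * remains a thin pass-through to cgroupfs operations. */
static int cgroup_mgr_set(const char *cgroup, const char *file, const char *value)
{
    struct sockaddr_un addr = { .sun_family = AF_UNIX };
    char req[512], reply[128];
    int fd, n;

    strncpy(addr.sun_path, "/dev/cgroup", sizeof(addr.sun_path) - 1);
    fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;
    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        close(fd);
        return -1;
    }

    snprintf(req, sizeof(req), "SET %s %s %s\n", cgroup, file, value);
    write(fd, req, strlen(req));

    n = read(fd, reply, sizeof(reply) - 1);   /* expect e.g. "OK\n" */
    close(fd);
    return (n > 0 && strncmp(reply, "OK", 2) == 0) ? 0 : -1;
}

int main(void)
{
    /* e.g. "set freezer.state to THAWED for cgroup /c1/c2" */
    return cgroup_mgr_set("/c1/c2", "freezer.state", "THAWED");
}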
Re: cgroup: status-quo and userland efforts
On Thu, Jun 27, 2013 at 6:22 AM, Serge Hallyn wrote: > Quoting Mike Galbraith (bitbuc...@online.de): >> On Wed, 2013-06-26 at 14:20 -0700, Tejun Heo wrote: >> > Hello, Tim. >> > >> > On Mon, Jun 24, 2013 at 09:07:47PM -0700, Tim Hockin wrote: >> > > I really want to understand why this is SO IMPORTANT that you have to >> > > break userspace compatibility? I mean, isn't Linux supposed to be the >> > > OS with the stable kernel interface? I've seen Linus rant time and >> > > time again about this - why is it OK now? >> > >> > What the hell are you talking about? Nobody is breaking userland >> > interface. A new version of interface is being phased in and the old >> > one will stay there for the foreseeable future. It will be phased out >> > eventually but that's gonna take a long time and it will have to be >> > something hardly noticeable. Of course new features will only be >> > available with the new interface and there will be efforts to nudge >> > people away from the old one but the existing interface will keep >> > working it does. >> >> I can understand some alarm. When I saw the below I started frothing at >> the face and howling at the moon, and I don't even use the things much. >> >> http://lists.freedesktop.org/archives/systemd-devel/2013-June/011521.html >> >> Hierarchy layout aside, that "private property" bit says that the folks >> who currently own and use the cgroups interface will lose direct access >> to it. I can imagine folks who have become dependent upon an on the fly >> management agents of their own design becoming a tad alarmed. > > FWIW, the code is too embarassing yet to see daylight, but I'm playing > with a very lowlevel cgroup manager which supports nesting itself. > Access in this POC is low-level ("set freezer.state to THAWED for cgroup > /c1/c2", "Create /c3"), but the key feature is that it can run in two > modes - native mode in which it uses cgroupfs, and child mode where it > talks to a parent manager to make the changes. In this world, are users able to read cgroup files, or do they have to go through a central agent, too? > So then the idea would be that userspace (like libvirt and lxc) would > talk over /dev/cgroup to its manager. Userspace inside a container > (which can't actually mount cgroups itself) would talk to its own > manager which is talking over a passed-in socket to the host manager, > which in turn runs natively (uses cgroupfs, and nests "create /c1" under > the requestor's cgroup). How do you handle updates of this agent? Suppose I have hundreds of running containers, and I want to release a new version of the cgroupd ? (note: inquiries about the implementation do not denote acceptance of the model :) > At some point (probably soon) we might want to talk about a standard API > for these things. However I think it will have to come in the form of > a standard library, which knows to either send requests over dbus to > systemd, or over /dev/cgroup sock to the manager. > > -serge -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: cgroup: status-quo and userland efforts
On Thu, Jun 27, 2013 at 6:22 AM, Serge Hallyn serge.hal...@ubuntu.com wrote: Quoting Mike Galbraith (bitbuc...@online.de): On Wed, 2013-06-26 at 14:20 -0700, Tejun Heo wrote: Hello, Tim. On Mon, Jun 24, 2013 at 09:07:47PM -0700, Tim Hockin wrote: I really want to understand why this is SO IMPORTANT that you have to break userspace compatibility? I mean, isn't Linux supposed to be the OS with the stable kernel interface? I've seen Linus rant time and time again about this - why is it OK now? What the hell are you talking about? Nobody is breaking userland interface. A new version of interface is being phased in and the old one will stay there for the foreseeable future. It will be phased out eventually but that's gonna take a long time and it will have to be something hardly noticeable. Of course new features will only be available with the new interface and there will be efforts to nudge people away from the old one but the existing interface will keep working it does. I can understand some alarm. When I saw the below I started frothing at the face and howling at the moon, and I don't even use the things much. http://lists.freedesktop.org/archives/systemd-devel/2013-June/011521.html Hierarchy layout aside, that private property bit says that the folks who currently own and use the cgroups interface will lose direct access to it. I can imagine folks who have become dependent upon an on the fly management agents of their own design becoming a tad alarmed. FWIW, the code is too embarassing yet to see daylight, but I'm playing with a very lowlevel cgroup manager which supports nesting itself. Access in this POC is low-level (set freezer.state to THAWED for cgroup /c1/c2, Create /c3), but the key feature is that it can run in two modes - native mode in which it uses cgroupfs, and child mode where it talks to a parent manager to make the changes. In this world, are users able to read cgroup files, or do they have to go through a central agent, too? So then the idea would be that userspace (like libvirt and lxc) would talk over /dev/cgroup to its manager. Userspace inside a container (which can't actually mount cgroups itself) would talk to its own manager which is talking over a passed-in socket to the host manager, which in turn runs natively (uses cgroupfs, and nests create /c1 under the requestor's cgroup). How do you handle updates of this agent? Suppose I have hundreds of running containers, and I want to release a new version of the cgroupd ? (note: inquiries about the implementation do not denote acceptance of the model :) At some point (probably soon) we might want to talk about a standard API for these things. However I think it will have to come in the form of a standard library, which knows to either send requests over dbus to systemd, or over /dev/cgroup sock to the manager. -serge -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
cgroup access daemon
Changing the subject, so as not to mix two discussions On Thu, Jun 27, 2013 at 9:18 AM, Serge Hallyn serge.hal...@ubuntu.com wrote: FWIW, the code is too embarassing yet to see daylight, but I'm playing with a very lowlevel cgroup manager which supports nesting itself. Access in this POC is low-level (set freezer.state to THAWED for cgroup /c1/c2, Create /c3), but the key feature is that it can run in two modes - native mode in which it uses cgroupfs, and child mode where it talks to a parent manager to make the changes. In this world, are users able to read cgroup files, or do they have to go through a central agent, too? The agent won't itself do anything to stop access through cgroupfs, but the idea would be that cgroupfs would only be mounted in the agent's mntns. My hope would be that the libcgroup commands (like cgexec, cgcreate, etc) would know to talk to the agent when possible, and users would use those. For our use case this is a huge problem. We have people who access cgroup files in a fairly tight loops, polling for information. We have literally hundeds of jobs running on sub-second frequencies - plumbing all of that through a daemon is going to be a disaster. Either your daemon becomes a bottleneck, or we have to build something far more scalable than you really want to. Not to mention the inefficiency of inserting a layer. We also need the ability to set up eventfds for users or to let them poll() on the socket from this daemon. So then the idea would be that userspace (like libvirt and lxc) would talk over /dev/cgroup to its manager. Userspace inside a container (which can't actually mount cgroups itself) would talk to its own manager which is talking over a passed-in socket to the host manager, which in turn runs natively (uses cgroupfs, and nests create /c1 under the requestor's cgroup). How do you handle updates of this agent? Suppose I have hundreds of running containers, and I want to release a new version of the cgroupd ? This may change (which is part of what I want to investigate with some POC), but right now I'm building any controller-aware smarts into it. I think that's what you're asking about? The agent doesn't do slices etc. This may turn out to be insufficient, we'll see. No, what I am asking is a release-engineering problem. Suppose we need to roll out a new version of this daemon (some new feature or a bug or something). We have hundreds of these child agents running in the job containers. How do I bring down all these children, and then bring them back up on a new version in a way that does not disrupt user jobs (much)? Similarly, what happens when one of these child agents crashes? Does someone restart it? Do user jobs just stop working? So the only state which the agent stores is a list of cgroup mounts (if in native mode) or an open socket to the parent (if in child mode), and a list of connected children sockets. HUPping the agent will cause it to reload the cgroupfs mounts (in case you've mounted a new controller, living in the old world :). If you just kill it and start a new one, it shouldn't matter. (note: inquiries about the implementation do not denote acceptance of the model :) To put it another way, the problem I'm solving (for now) is not the I want a daemon to ensure that requested guarantees are correctly implemented. In that sense I'm maintaining the status quo, i.e. the admin needs to architect the layout correctly. 
The problem I'm solving is really that I want containers to be able to handle cgroups even if they can't mount cgroupfs, and I want all userspace to be able to behave the same whether they are in a container or not. This isn't meant as a poke in the eye of anyone who wants to address the other problem. If it turns out that we (meaning the community of cgroup users) really want such an agent, then we can add that. I'm not convinced. What would probably be a better design, then, would be that the agent I'm working on can plug into a resource allocation agent. Or, I suppose, the other way around. At some point (probably soon) we might want to talk about a standard API for these things. However I think it will have to come in the form of a standard library, which knows to either send requests over dbus to systemd, or over /dev/cgroup sock to the manager. -serge
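For scale, the monitoring pattern Tim is worried about is nothing more than a cheap read of a cgroupfs file in a loop; any daemon-mediated design would have to match that cost or replace each sample with a socket round trip. A rough sketch, with illustrative paths:

    # Sub-second polling against cgroupfs directly; a mediating daemon would
    # turn the open/read into a request/response per sample.
    import time

    def poll_memory_usage(cg="/sys/fs/cgroup/memory/jobs/job42", period=0.1):
        while True:
            with open(cg + "/memory.usage_in_bytes") as f:
                usage = int(f.read())
            # ... feed usage into the job's own control loop ...
            time.sleep(period)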
Re: cgroup access daemon
On Thu, Jun 27, 2013 at 11:11 AM, Serge Hallyn serge.hal...@ubuntu.com wrote: Quoting Tim Hockin (thoc...@hockin.org): For our use case this is a huge problem. We have people who access cgroup files in a fairly tight loops, polling for information. We have literally hundeds of jobs running on sub-second frequencies - plumbing all of that through a daemon is going to be a disaster. Either your daemon becomes a bottleneck, or we have to build something far more scalable than you really want to. Not to mention the inefficiency of inserting a layer. Currently you can trivially create a container which has the container's cgroups bind-mounted to the expected places (/sys/fs/cgroup/$controller) by uncommenting two lines in the configuration file, and handle cgroups through cgroupfs there. (This is what the management agent wants to be an alternative for) The main deficiency there is that /proc/self/cgroups is not filtered, so it will show /lxc/c1 for init's cgroup, while the host's /sys/fs/cgroup/devices/lxc/c1/c1.real will be what is seen under the container's /sys/fs/cgroup/devices (for instance). Not ideal. I'm really saying that if your daemon is to provide a replacement for cgroupfs direct access, it needs to be designed to be scalable. If we're going to get away from bind mounting cgroupfs into user namespaces, then let's try to solve ALL the problems. We also need the ability to set up eventfds for users or to let them poll() on the socket from this daemon. So you'd want to be able to request updates when any cgroup value is changed, right? Not necessarily ANY, but that's the terminus of this API facet. That's currently not in my very limited set of commands, but I can certainly add it, and yes it would be a simple unix sock so you can set up eventfd, select/poll, etc. Assuming the protocol is basically a pass-through to basic filesystem ops, it should be pretty easy. You just need to add it to your protocol. But it brings up another point - access control. How do you decide which files a child agent should have access to? Does that ever change based on the child's configuration? In our world, the answer is almost certainly yes. So then the idea would be that userspace (like libvirt and lxc) would talk over /dev/cgroup to its manager. Userspace inside a container (which can't actually mount cgroups itself) would talk to its own manager which is talking over a passed-in socket to the host manager, which in turn runs natively (uses cgroupfs, and nests create /c1 under the requestor's cgroup). How do you handle updates of this agent? Suppose I have hundreds of running containers, and I want to release a new version of the cgroupd ? This may change (which is part of what I want to investigate with some POC), but right now I'm building any controller-aware smarts into it. I think that's what you're asking about? The agent doesn't do slices etc. This may turn out to be insufficient, we'll see. No, what I am asking is a release-engineering problem. Suppose we need to roll out a new version of this daemon (some new feature or a bug or something). We have hundreds of these child agents running in the job containers. When I say container I mean an lxc container, with it's own isolated rootfs and mntns. I'm not sure what your containers are, but I if they're not that, then they shouldn't need to run a child agent. They can just talk over the host cgroup agent's socket. If they talk over the host agent's socket, where is the access control and restriction done? Who decides how deep I can nest groups? 
Who says which files I may access? Who stops me from modifying someone else's container? Our containers are somewhat thinner and more managed than LXC, but not that much. If we're running a system agent in a user container, we need to manage that software. We can't just start up a version and leave it running until the user decides to upgrade - we force upgrades. How do I bring down all these children, and then bring them back up on a new version in a way that does not disrupt user jobs (much)? Similarly, what happens when one of these child agents crashes? Does someone restart it? Do user jobs just stop working? An upstart^W$init_system job will restart it... What happens when the main agent crashes? All those children on UNIX sockets need to reconnect, I guess. This means your UNIX socket needs to be a named socket, not just a socketpair(), making your auth model more complicated. What happens when the main agent hangs? Is someone health-checking it? How about all the child daemons? I guess my main point is that this SOUNDS like a simple project, but if you just do the simple obvious things, it will be woefully inadequate for anything but simple use-cases. If we get forced into such a model (and there are some good reasons to do it, even disregarding all the other chatter), we'd rather use
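The eventfd plumbing Tim wants preserved already exists for direct cgroupfs users in the v1 memory controller (see Documentation/cgroups/memory.txt): an eventfd is registered against a control file through cgroup.event_control and then blocked on. A minimal sketch, assuming a v1 memory cgroup at an illustrative path; os.eventfd() needs Python 3.10+ and the caller needs write access to the cgroup.

    # Register for OOM notifications on one cgroup and wait for an event.
    import os

    cg = "/sys/fs/cgroup/memory/jobs/job42"            # hypothetical cgroup
    efd = os.eventfd(0)
    oom_fd = os.open(os.path.join(cg, "memory.oom_control"), os.O_RDONLY)
    with open(os.path.join(cg, "cgroup.event_control"), "w") as f:
        f.write(f"{efd} {oom_fd}")                     # "<event_fd> <control_fd>"
    os.read(efd, 8)                                    # blocks until an OOM event fires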
Re: cgroup: status-quo and userland efforts
On Thu, Jun 27, 2013 at 10:38 AM, Tejun Heo t...@kernel.org wrote: Hello, Tim. On Wed, Jun 26, 2013 at 08:42:21PM -0700, Tim Hockin wrote: OK, then what I don't know is what is the new interface? A new cgroupfs? It's gonna be a new mount option for cgroupfs. DTF and CPU and cpuset all have default groups for some tasks (and not others) in our world today. DTF actually has default, prio, and normal. I was simplifying before. I really wish it were as simple as you think it is. But if it were, do you think I'd still be arguing? How am I supposed to know when you don't communicate it but just wave your hands saying it's all very complicated? The cpuset / blkcg example is pretty bad because you can enforce any cpuset rules at the leaves. Modifying hundreds of cgroups is really painful, and yes, we do it often enough to be able to see it. This really doesn't scale when I have thousands of jobs running. Being able to disable at some levels on some controllers probably helps some, but I can't say for sure without knowing the new interface How does the number of jobs affect it? Does each job create a new cgroup? Well, in your model it does... We tried it in unified hierarchy. We had our Top People on the problem. The best we could get was bad enough that we embarked on a LITERAL 2 year transition to make it better. What didn't work? What part was so bad? I find it pretty difficult to believe that multiple orthogonal hierarchies is the only possible solution, so please elaborate the issues that you guys have experienced. I'm looping in more Google people. The hierarchy is for organization and enforcement of dynamic hierarchical resource distribution and that's it. If its expressive power is lacking, take compromise or tune the configuration according to the workloads. The latter is necessary in workloads which have clear distinction of foreground and background anyway - anything which interacts with human beings including androids. So what you're saying is that you don't care that this new thing is less capable than the old thing, despite it having real impact. In other words, define a container as a set of cgroups, one under each each active controller type. A TID enters the container atomically, joining all of the cgroups or none of the cgroups. container C1 = { /cgroup/cpu/foo, /cgroup/memory/bar, /cgroup/io/default/foo/bar, /cgroup/cpuset/ This is an abstraction that we maintain in userspace (more or less) and we do actually have headaches from split hierarchies here (handling partial failures, non-atomic joins, etc) That'd separate out task organization from controllre config hierarchies. Kay had a similar idea some time ago. I think it makes things even more complex than it is right now. I'll continue on this below. I'm still a bit fuzzy - is all of this written somewhere? If you dig through cgroup ML, most are there. There'll be cgroup.controllers file with which you can enable / disable controllers. Enabling a controller in a cgroup implies that the controller is enabled in all ancestors. Implies or requires? Cause or predicate? If controller C is enabled at level X but disabled at level X/Y, does that mean that X/Y uses the limits set in X? How about X/Y/Z? This will get rid of the bulk of the cpuset scaling problem, but not all of it. I think we still have the same problems with cpu as we do with io. Perhaps that should have been the example. It sounds like you're missing a layer of abstraction. 
Why not add the abstraction you want to expose on top of powerful primitives, instead of dumbing down the primitives? It sure would be possible to build more and try to address the issues we're seeing now; however, after looking at cgroups for some time now, the underlying theme is failure to take reasonable trade-offs and going for maximum flexibility in making each choice - the choice of interface, multiple hierarchies, no restriction on hierarchical behavior, splitting threads of the same process into separate cgroups, semi-encouraging delegation through file permission without actually pondering the consequences and so on. And each choice probably made sense trying to serve each immediate requirement at the time but added up it's a giant pile of mess which developed without direction. I am very sympathetic to this problem. You could have just described some of our internal problems too. The difference is that we are trying to make changes that provide more structure and boundaries in ways that retain the fundamental power, without tossing out the baby with the bathwater. So, at this point, I'm very skeptical about adding more flexibility. Once the basics are settled, we sure can look into the missing pieces but I don't think that's what we should be doing right now. Another thing is that the unified hierarchy can be implemented by using most of the constructs cgroup core already has in more
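The split-hierarchy headache Tim alludes to (partial failures, non-atomic joins) is easy to see in a sketch of that userspace container abstraction: joining means writing the TID into one tasks file per hierarchy, and the kernel gives no way to make the set of writes atomic. Paths and names below are illustrative, not anyone's actual implementation.

    # Userspace "container" = one cgroup per mounted hierarchy; enter() joins
    # them all or rolls back by hand, since there is no atomic multi-hierarchy join.
    import os

    class Container:
        def __init__(self, cgroups):
            # e.g. {"cpu": "/sys/fs/cgroup/cpu/foo", "memory": "/sys/fs/cgroup/memory/bar"}
            self.cgroups = cgroups

        def enter(self, tid, old_paths):
            joined = []
            try:
                for ctrl, path in self.cgroups.items():
                    with open(os.path.join(path, "tasks"), "w") as f:
                        f.write(str(tid))
                    joined.append(ctrl)
            except OSError:
                # Best-effort rollback; the half-joined window is still visible
                # to every controller, which is exactly the headache described.
                for ctrl in joined:
                    with open(os.path.join(old_paths[ctrl], "tasks"), "w") as f:
                        f.write(str(tid))
                raise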
Re: cgroup: status-quo and userland efforts
On Thu, Jun 27, 2013 at 11:14 AM, Serge Hallyn serge.hal...@ubuntu.com wrote: Quoting Tejun Heo (t...@kernel.org): Hello, Serge. On Thu, Jun 27, 2013 at 08:22:06AM -0500, Serge Hallyn wrote: At some point (probably soon) we might want to talk about a standard API for these things. However I think it will have to come in the form of a standard library, which knows to either send requests over dbus to systemd, or over /dev/cgroup sock to the manager. Yeah, eventually, I think we'll have a standardized way to configure resource distribution in the system. Maybe we'll agree on a standardized dbus protocol or there will be a library, I don't know; however, whatever form it may be in, its abstraction level should be way higher than that of direct cgroupfs access. It's way too low level and very easy to end up in a complete nonsense configuration. e.g. enabling cpu on a cgroup while leaving other cgroups alone wouldn't enable fair scheduling on that cgroup but drastically reduce the amount of cpu share it gets as it now gets treated as a single entity competing with all tasks at the parent level. Right. I *think* this can be offered as a daemon which sits as the sole consumer of my agent's API, and offers a higher-level "do what I want" API. But designing that API is going to be interesting. This is something we have, partially, and are working to be able to open-source. We have a LOT of experience feeding into the semantics that actually make users happy. Today it leverages split hierarchies, but that is not required in the generic case (only if you want to offer the semantics we do). It explicitly delegates some aspects of sub-cgroup control to users, but that could go away if your lowest-level agency can handle it. I should find a good, up-to-date summary of the current behaviors of each controller so I can talk more intelligently about it. (I'll start by looking at the kernel Documentation/cgroups, but don't feel too confident that they'll be up to date :) At the moment, I'm not sure what the eventual abstraction would look like. systemd is extending its basic constructs by adding slices and scopes and it does make sense to integrate the general organization of the system (services, user sessions, VMs and so on) with resource management. Given some time, I'm hoping we'll be able to come up with and agree on some common constructs so that each workload can indicate its resource requirements in a unified way. That said, I really think we should experiment for a while before trying to settle down on things. We've now just started exploring how system-wide resource management can be made widely available to systems without requiring extremely specialized hand-crafted configurations and I'm pretty sure we're getting and gonna get quite a few details wrong, so I don't think it'd be a good idea to try to agree on things right now. As far as such integration goes, I think it's time to play with things and observe the results. Right, I'm not attached to my toy implementation at all - except for the ability, in some fashion, to have nested agents which don't have cgroupfs access but talk to another agent to get the job done. -serge
Re: cgroup: status-quo and userland efforts
On Wed, Jun 26, 2013 at 6:04 PM, Tejun Heo wrote: > Hello, > > On Wed, Jun 26, 2013 at 05:06:02PM -0700, Tim Hockin wrote: >> The first assertion, as I understood, was that (eventually) cgroupfs >> will not allow split hierarchies - that unified hierarchy would be the >> only mode. Is that not the case? > > No, unified hierarchy would be an optional thing for quite a while. > >> The second assertion, as I understood, was that (eventually) cgroupfs >> would not support granting access to some cgroup control files to >> users (through chown/chmod). Is that not the case? > > Again, it'll be an opt-in thing. The hierarchy controller would be > able to notice that and issue warnings if it wants to. > >> Hmm, so what exactly is changing then? If, as you say here, the >> existing interfaces will keep working - what is changing? > > New interface is being added and new features will be added only for > the new interface. The old one will eventually be deprecated and > removed, but that *years* away. OK, then what I don't know is what is the new interface? A new cgroupfs? >> As I said, it's controlled delegated access. And we have some patches >> that we carry to prevent some of these DoS situations. > > I don't know. You can probably hack around some of the most serious > problems but the whole thing isn't built for proper delgation and > that's not the direction the upstream kernel is headed at the moment. > >> I actually can not speak to the details of the default IO problem, as >> it happened before I really got involved. But just think through it. >> If one half of the split has 5 processes running and the other half >> has 200, the processes in the 200 set each get FAR less spindle time >> than those in the 5 set. That is NOT the semantic we need. We're >> trying to offer ~equal access for users of the non-DTF class of jobs. >> >> This is not the tail doing the wagging. This is your assertion that >> something should work, when it just doesn't. We have two, totally >> orthogonal classes of applications on two totally disjoint sets of >> resources. Conjoining them is the wrong answer. > > As I've said multiple times, there sure are things that you cannot > achieve without orthogonal multiple hierarchies, but given the options > we have at hands, compromising inside a unified hierarchy seems like > the best trade-off. Please take a step back from the immediate detail > and think of the general hierarchical organization of workloads. If > DTF / non-DTF is a fundamental part of your workload classfication, > that should go above. DTF and CPU and cpuset all have "default" groups for some tasks (and not others) in our world today. DTF actually has default, prio, and "normal". I was simplifying before. I really wish it were as simple as you think it is. But if it were, do you think I'd still be arguing? > I don't really understand your example anyway because you can classify > by DTF / non-DTF first and then just propagate cpuset settings along. > You won't lose anything that way, right? This really doesn't scale when I have thousands of jobs running. Being able to disable at some levels on some controllers probably helps some, but I can't say for sure without knowing the new interface > Again, in general, you might not be able to achieve *exactly* what > you've been doing, but, an acceptable compromise should be possible > and not doing so leads to complete mess. We tried it in unified hierarchy. We had our Top People on the problem. 
The best we could get was bad enough that we embarked on a LITERAL 2 year transition to make it better. >> > But I don't follow the conclusion here. For short term workaround, >> > sure, but having that dictate the whole architecture decision seems >> > completely backwards to me. >> >> My point is that the orthogonality of resources is intrinsic. Letting >> "it's hard to make it work" dictate the architecture is what's >> backwards. > > No, it's not "it's hard to make it work". It's more "it's > fundamentally broken". You can't identify a resource to be belonging > to a cgroup independent of who's looking at the resource. What if you could ensure that for a given TID (or PID if required) in dir X of controller C, all of the other TIDs in that cgroup were in the same group, but maybe not the same sub-path, under every controller? This gives you what it sounds like you wanted elsewhere - a container abstraction. In other words, define a container as a set of cgroups, one under each each active controller type. A TID enters the container atomically, joining all of the cgroups or none of the cgroups. cont
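The "ignore subtrees depending on controllers" idea being discussed here is the per-subtree granularity that eventually shipped, long after this thread, as cgroup.subtree_control in the cgroup v2 unified hierarchy: where a controller is not enabled, the whole subtree is accounted to its nearest enabled ancestor as a single entity. Shown only as a hedged illustration of the shape of the knob; none of this existed at the time of the discussion, and paths are illustrative.

    # Illustrative v2-style per-subtree controller enablement (postdates this thread).
    with open("/sys/fs/cgroup/jobs/cgroup.subtree_control", "w") as f:
        f.write("+memory +io")     # children of /jobs get memory and io knobs
    with open("/sys/fs/cgroup/jobs/batch/cgroup.subtree_control", "w") as f:
        f.write("-io")             # below batch/, io treats the subtree as one entity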
Re: cgroup: status-quo and userland efforts
On Wed, Jun 26, 2013 at 2:20 PM, Tejun Heo wrote: > Hello, Tim. > > On Mon, Jun 24, 2013 at 09:07:47PM -0700, Tim Hockin wrote: >> I really want to understand why this is SO IMPORTANT that you have to >> break userspace compatibility? I mean, isn't Linux supposed to be the >> OS with the stable kernel interface? I've seen Linus rant time and >> time again about this - why is it OK now? > > What the hell are you talking about? Nobody is breaking userland > interface. A new version of interface is being phased in and the old The first assertion, as I understood, was that (eventually) cgroupfs will not allow split hierarchies - that unified hierarchy would be the only mode. Is that not the case? The second assertion, as I understood, was that (eventually) cgroupfs would not support granting access to some cgroup control files to users (through chown/chmod). Is that not the case? > one will stay there for the foreseeable future. It will be phased out > eventually but that's gonna take a long time and it will have to be > something hardly noticeable. Of course new features will only be > available with the new interface and there will be efforts to nudge > people away from the old one but the existing interface will keep > working it does. Hmm, so what exactly is changing then? If, as you say here, the existing interfaces will keep working - what is changing? >> Examples? we obviously don't grant full access, but our kernel gang >> and security gang seem to trust the bits we're enabling well enough... > > Then the security gang doesn't have any clue what's going on, or at > least operating on very different assumptions (ie. the workloads are > trusted by default). You can OOM the whole kernel by creating many > cgroups, completely mess up controllers by creating deep hierarchies, > affect your siblings by adjusting your weight and so on. It's really > easy to DoS the whole system if you have write access to a cgroup > directory. As I said, it's controlled delegated access. And we have some patches that we carry to prevent some of these DoS situations. >> The non-DTF jobs have a combined share that is small but non-trivial. >> If we cut that share in half, giving one slice to prod and one slice >> to batch, we get bad sharing under contention. We tried this. We > > Why is that tho? It *should* work fine and I can't think of a reason > why that would behave particularly badly off the top of my head. > Maybe I forgot too much of the iosched modification used in google. > Anyways, if there's a problem, that should be fixable, right? And > controller-specific issues like that should really dictate the > architectural design too much. I actually can not speak to the details of the default IO problem, as it happened before I really got involved. But just think through it. If one half of the split has 5 processes running and the other half has 200, the processes in the 200 set each get FAR less spindle time than those in the 5 set. That is NOT the semantic we need. We're trying to offer ~equal access for users of the non-DTF class of jobs. This is not the tail doing the wagging. This is your assertion that something should work, when it just doesn't. We have two, totally orthogonal classes of applications on two totally disjoint sets of resources. Conjoining them is the wrong answer. >> could add control loops in userspace code which try to balance the >> shares in proportion to the load. We did that with CPU, and it's sort > > Yeah, that is horrible. 
Yeah, I would love to explain some of the really nasty things we have done and are moving away from. I am not sure I am allowed to, though :) >> of horrible. We're moving AWAY from all this craziness in favor of >> well-defined hierarchical behaviors. > > But I don't follow the conclusion here. For short term workaround, > sure, but having that dictate the whole architecture decision seems > completely backwards to me. My point is that the orthogonality of resources is intrinsic. Letting "it's hard to make it work" dictate the architecture is what's backwards. >> It's a bit naive to think that this is some absolute truth, don't you >> think? It just isn't so. You should know better than most what >> craziness our users do, and what (legit) rationales they can produce. >> I have $large_number of machines running $huge_number of jobs from >> thousands of developers running for years upon years backing up my >> worldview. > > If so, you aren't communicating it very well. I've talked with quite > a few people about multiple orthogonal hierarchies including people > inside google. Sure, some are using it as it is there but I couldn't > find strong enough rationale to continue that way given
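The arithmetic behind the 5-versus-200 objection is worth spelling out: with group-based fairness and two equal-weight siblings, per-process service diverges by the ratio of the group sizes. A toy calculation using the numbers from the example above:

    # Two sibling groups with equal weight; fairness is applied per group.
    group_share = 0.5
    per_task_small = group_share / 5       # each of the 5 tasks: 10% of the disk
    per_task_large = group_share / 200     # each of the 200 tasks: 0.25% of the disk
    print(per_task_small / per_task_large) # 40x gap, not the ~equal access wanted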
Re: cgroup: status-quo and userland efforts
On Mon, Jun 24, 2013 at 5:01 PM, Tejun Heo wrote: > Hello, Tim. > > On Sat, Jun 22, 2013 at 04:13:41PM -0700, Tim Hockin wrote: >> I'm very sorry I let this fall off my plate. I was pointed at a >> systemd-devel message indicating that this is done. Is it so? It > > It's progressing pretty fast. > >> seems so completely ass-backwards to me. Below is one of our use-cases >> that I just don't see how we can reproduce in a single-heierarchy. > > Configurations which depend on orthogonal multiple hierarchies of > course won't be replicated under unified hierarchy. It's unfortunate > but those just have to go. More on this later. I really want to understand why this is SO IMPORTANT that you have to break userspace compatibility? I mean, isn't Linux supposed to be the OS with the stable kernel interface? I've seen Linus rant time and time again about this - why is it OK now? >> We're also long into the model that users can control their own >> sub-cgroups (moderated by permissions decided by admin SW up front). > > If you're in control of the base system, nothing prevents you from > doing so. It's utterly broken security and policy-enforcement point > of view but if you can trust each software running on your system to > do the right thing, it's gonna be fine. Examples? we obviously don't grant full access, but our kernel gang and security gang seem to trust the bits we're enabling well enough... >> This gives us 4 combinations: >> 1) { production, DTF } >> 2) { production, non-DTF } >> 3) { batch, DTF } >> 4) { batch non-DTF } >> >> Of these, (3) is sort of nonsense, but the others are actually used >> and needed. This is only >> possible because of split hierarchies. In fact, we undertook a very painful >> process to move from a unified cgroup hierarchy to split hierarchies in large >> part _because of_ these examples. > > You can create three sibling cgroups and configure cpuset and blkio > accordingly. For cpuset, the setup wouldn't make any different. For > blkio, the two non-DTFs would now belong to different cgroups and > compete with each other as two groups, which won't matter at all as > non-DTFs are given what's left over after serving DTFs anyway, IIRC. The non-DTF jobs have a combined share that is small but non-trivial. If we cut that share in half, giving one slice to prod and one slice to batch, we get bad sharing under contention. We tried this. We could add control loops in userspace code which try to balance the shares in proportion to the load. We did that with CPU, and it's sort of horrible. We're moving AWAY from all this craziness in favor of well-defined hierarchical behaviors. >> Making cgroups composable allows us to build a higher level abstraction that >> is very powerful and flexible. Moving back to unified hierarchies goes >> against everything that we're doing here, and will cause us REAL pain. > > Categorizing processes into hierarchical groups of tasks is a > fundamental idea and a fundamental idea is something to base things on > top of as it's something people can agree upon relatively easily and > establish a structure by. I'd go as far as saying that it's the > failure on the part of workload design if they in general can't be > categorized hierarchically. It's a bit naive to think that this is some absolute truth, don't you think? It just isn't so. You should know better than most what craziness our users do, and what (legit) rationales they can produce. 
I have $large_number of machines running $huge_number of jobs from thousands of developers running for years upon years backing up my worldview. > Even at the practical level, the orthogonal hierarchy encouraged, at > the very least, the blkcg writeback support which can't be upstreamed > in any reasonable manner because it is impossible to say that a > resource can't be said to belong to a cgroup irrespective of who's > looking at it. I'm not sure I really grok that statement. I'm OK with defining new rules that bring some order to the chaos. Give us new rules to live by. All-or-nothing would be fine. What if mounting cgroupfs gives me N sub-dirs, one for each compiled-in controller? You could make THAT the mount option - you can have either a unified hierarchy of all controllers or fully disjoint hierarchies. Or some other rule. > It's something fundamentally broken and I have very difficult time > believing google's workload is so different that it can't be > categorized in a single hierarchy for the purpose of resource > distribution. I'm sure there are cases where some compromises are > necessary but the laternative is much worse here. As I wrote multiple > times now, multiple orthogonal hierarchy support is gonna be around > for some t
Re: cgroup: status-quo and userland efforts
I'm very sorry I let this fall off my plate. I was pointed at a systemd-devel message indicating that this is done. Is it so? It seems so completely ass-backwards to me. Below is one of our use-cases that I just don't see how we can reproduce in a single-heierarchy. We're also long into the model that users can control their own sub-cgroups (moderated by permissions decided by admin SW up front). We have classes of jobs which can run together on shared machines. This is VERY important to us, and is a key part of how we run things. Over the years we have evolved from very little isolation to fairly strong isolation, and cgroups are a large part of that. We have experienced and adapted to a number of problems around isolation over time. I won't go into the history of all of these, because it's not so relevant, but here is how we set things up today. >From a CPU perspective, we have two classes of jobs: production and batch. Production jobs can (but don't always) ask for exclusive cores, which ensures that no batch work runs on those CPUs. We manage this with the cpuset cgroup. Batch jobs are relegated to the set of CPUs that are "left-over" after exclusivity rules are applied. This is implemented with a shared subdirectory of the cpuset cgroup called "batch". Production jobs get their own subdirectories under cpuset. >From an IO perspective we also have two classes of jobs: normal and DTF-approved. Normal jobs do not get strong isolation for IO, whereas DTF-enabled jobs do. The vast majority of jobs are NOT DTF-enabled, and they share a nominal amount of IO bandwidth. This is implemented with a shared subdirectory of the io cgroup called "default". Jobs that are DTF-enabled get their own subdirectories under IO. This gives us 4 combinations: 1) { production, DTF } 2) { production, non-DTF } 3) { batch, DTF } 4) { batch non-DTF } Of these, (3) is sort of nonsense, but the others are actually used and needed. This is only possible because of split hierarchies. In fact, we undertook a very painful process to move from a unified cgroup hierarchy to split hierarchies in large part _because of_ these examples. And for more fun, I am simplifying this all. Batch jobs are actually bound to NUMA-node specific cpuset cgroups when possible. And we have a similar concept for the cpu cgroup as for cpuset. And we have a third tier of IO jobs. We don't do all of this for fun - it is in direct response to REAL problems we have experienced. Making cgroups composable allows us to build a higher level abstraction that is very powerful and flexible. Moving back to unified hierarchies goes against everything that we're doing here, and will cause us REAL pain. On Mon, Apr 22, 2013 at 3:33 PM, Tim Hockin wrote: > On Mon, Apr 22, 2013 at 11:41 PM, Tejun Heo wrote: >> Hello, Tim. >> >> On Mon, Apr 22, 2013 at 11:26:48PM +0200, Tim Hockin wrote: >>> We absolutely depend on the ability to split cgroup hierarchies. It >>> pretty much saved our fleet from imploding, in a way that a unified >>> hierarchy just could not do. A mandated unified hierarchy is madness. >>> Please step away from the ledge. >> >> You need to be a lot more specific about why unified hierarchy can't >> be implemented. The last time I asked around blk/memcg people in >> google, while they said that they'll need different levels of >> granularities for different controllers, google's use of cgroup >> doesn't require multiple orthogonal classifications of the same group >> of tasks. > > I'll pull some concrete examples together. 
I don't have them on hand, > and I am out of country this week. I have looped in the gang at work > (though some are here with me). > >> Also, cgroup isn't dropping multiple hierarchy support over-night. >> What has been working till now will continue to work for very long >> time. If there is no fundamental conflict with the future changes, >> there should be enough time to migrate gradually as desired. >> >>> More, going towards a unified hierarchy really limits what we can >>> delegate, and that is the word of the day. We've got a central >>> authority agent running which manages cgroups, and we want out of this >>> business. At least, we want to be able to grant users a set of >>> constraints, and then let them run wild within those constraints. >>> Forcing all such work to go through a daemon has proven to be very >>> problematic, and it has been great now that users can have DIY >>> sub-cgroups. >> >> Sorry, but that doesn't work properly now. It gives you the illusion >> of proper delegation but it's inherently dangerous. If that sort of >> illusion has been / is good enough for your setup, fine. Delegate at &g
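A sketch of the layout described above helps make the orthogonality concrete: cpuset answers "production or batch" while blkio answers "DTF or default", and a job is attached to one directory in each hierarchy. Controller names, paths and the helper below are illustrative, not the actual production tooling.

    # Split-hierarchy placement: the two classifications vary independently.
    import os

    CG = "/sys/fs/cgroup"

    def place(job, production, dtf):
        cpuset = os.path.join(CG, "cpuset", job if production else "batch")
        blkio  = os.path.join(CG, "blkio",  job if dtf else "default")
        for path in (cpuset, blkio):
            os.makedirs(path, exist_ok=True)
        return cpuset, blkio            # the job's tasks get written into both

    place("websearch",  production=True,  dtf=True)    # { production, DTF }
    place("logs-batch", production=False, dtf=False)   # { batch, non-DTF }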
Re: cgroup: status-quo and userland efforts
On Mon, Apr 22, 2013 at 11:41 PM, Tejun Heo wrote: > Hello, Tim. > > On Mon, Apr 22, 2013 at 11:26:48PM +0200, Tim Hockin wrote: >> We absolutely depend on the ability to split cgroup hierarchies. It >> pretty much saved our fleet from imploding, in a way that a unified >> hierarchy just could not do. A mandated unified hierarchy is madness. >> Please step away from the ledge. > > You need to be a lot more specific about why unified hierarchy can't > be implemented. The last time I asked around blk/memcg people in > google, while they said that they'll need different levels of > granularities for different controllers, google's use of cgroup > doesn't require multiple orthogonal classifications of the same group > of tasks. I'll pull some concrete examples together. I don't have them on hand, and I am out of country this week. I have looped in the gang at work (though some are here with me). > Also, cgroup isn't dropping multiple hierarchy support over-night. > What has been working till now will continue to work for very long > time. If there is no fundamental conflict with the future changes, > there should be enough time to migrate gradually as desired. > >> More, going towards a unified hierarchy really limits what we can >> delegate, and that is the word of the day. We've got a central >> authority agent running which manages cgroups, and we want out of this >> business. At least, we want to be able to grant users a set of >> constraints, and then let them run wild within those constraints. >> Forcing all such work to go through a daemon has proven to be very >> problematic, and it has been great now that users can have DIY >> sub-cgroups. > > Sorry, but that doesn't work properly now. It gives you the illusion > of proper delegation but it's inherently dangerous. If that sort of > illusion has been / is good enough for your setup, fine. Delegate at > your own risks, but cgroup in itself doesn't support delegation to > lesser security domains and it won't in the foreseeable future. We've had great success letting users create sub-cgroups in a few specific controller types (cpu, cpuacct, memory). This is, of course, with some restrictions. We do not just give them blanket access to all knobs. We don't need ALL cgroups, just the important ones. For a simple example, letting users create sub-groups in freezer or job (we have a job group that we've been carrying) lets them launch sub-tasks and manage them in a very clean way. We've been doing a LOT of development internally to make user-defined sub-memcgs work in our cluster scheduling system, and it's made some of our biggest, more insane users very happy. And for some cgroups, like cpuset, hierarchy just doesn't really make sense to me. I just don't care if that never works, though I have no problem with others wanting it. :) Aside: if the last CPU in your cpuset goes offline, you should go into a state akin to freezer. Running on any other CPU is an overt violation of policy that the user, or worse - the admin, set up. Just my 2cents. >> Strong disagreement, here. We use split hierarchies to great effect. >> Containment should be composable. If your users or abstractions can't >> handle it, please feel free to co-mount the universe, but please >> PLEASE don't force us to. >> >> I'm happy to talk more about what we do and why. > > Please do so. Why do you need multiple orthogonal hierarchies? Look for this in the next few days/weeks. 
From our point of view, cgroups are the ideal match for how we want to manage things (no surprise, really, since Mr. Menage worked on both). Tim -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
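To make the delegation model above concrete: a minimal sketch of handing a capped sub-cgroup to an unprivileged user, assuming a v1-style memory controller mounted at /sys/fs/cgroup/memory; the group name, uid, and limit are illustrative, and this is not the actual tooling described in the thread.

/* Sketch: a privileged manager creates a memory sub-cgroup, caps it, and
 * delegates it by chowning the directory and the tasks file, after which
 * the user can mkdir their own sub-groups under it. Paths/uid illustrative. */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

static void write_file(const char *path, const char *val)
{
    int fd = open(path, O_WRONLY);
    if (fd < 0 || write(fd, val, strlen(val)) < 0) {
        perror(path);
        exit(1);
    }
    close(fd);
}

int main(void)
{
    const char *grp = "/sys/fs/cgroup/memory/user1";   /* assumed v1 mount */

    if (mkdir(grp, 0755) && errno != EEXIST) {
        perror("mkdir");
        return 1;
    }

    /* The manager sets the outer constraint for this user... */
    write_file("/sys/fs/cgroup/memory/user1/memory.limit_in_bytes",
               "1073741824");                          /* 1 GiB budget */

    /* ...then delegates: once the user owns the directory and the tasks
     * file, they can create and manage their own sub-groups within it. */
    if (chown(grp, 1000, 1000) ||
        chown("/sys/fs/cgroup/memory/user1/tasks", 1000, 1000)) {
        perror("chown");
        return 1;
    }
    return 0;
}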
Re: cgroup: status-quo and userland efforts
Hi Tejun, This email worries me. A lot. It sounds very much like retrograde motion from our (Google's) point of view. We absolutely depend on the ability to split cgroup hierarchies. It pretty much saved our fleet from imploding, in a way that a unified hierarchy just could not do. A mandated unified hierarchy is madness. Please step away from the ledge. More, going towards a unified hierarchy really limits what we can delegate, and that is the word of the day. We've got a central authority agent running which manages cgroups, and we want out of this business. At least, we want to be able to grant users a set of constraints, and then let them run wild within those constraints. Forcing all such work to go through a daemon has proven to be very problematic, and it has been great now that users can have DIY sub-cgroups. berra...@redhat.com said, downthread: > We ultimately do need the ability to delegate hierarchy creation to > unprivileged users / programs, in order to allow containerized OS to > have the ability to use cgroups. Requiring any applications inside a > container to talk to a cgroups "authority" existing on the host OS is > not a satisfactory architecture. We need to allow for a container to > be self-contained in its usage of cgroups. This! A thousand times, this! > At the same time, we don't need/want to give them unrestricted ability > to create arbitarily complex hiearchies - we need some limits on it > to avoid them exposing pathelogically bad kernel behaviour. > > This could be as simple as saying that each cgroup controller directory > has a tunable "cgroups.max_children" and/or "cgroups.max_depth" which > allow limits to be placed when delegating administration of part of a >cgroups tree to an unprivileged user. We've been bitten by this, and more limitations would be great. We've got some less-than-perfect patches that impose limits for us now. > I've no disagreement that we need a unified hiearchy. The workman > app explicitly does /not/ expose the concept of differing hiearchies > per controller. Likewise libvirt will not allow the user to configure > non-unified hiearchies. Strong disagreement, here. We use split hierarchies to great effect. Containment should be composable. If your users or abstractions can't handle it, please feel free to co-mount the universe, but please PLEASE don't force us to. I'm happy to talk more about what we do and why. Tim On Sat, Apr 6, 2013 at 3:21 AM, Tejun Heo wrote: > Hello, guys. > > Status-quo > == > > It's been about a year since I wrote up a summary on cgroup status quo > and future plans. We're not there yet but much closer than we were > before. At least the locking and object life-time management aren't > crazy anymore and most controllers now support proper hierarchy > although not all of them agree on how to treat inheritance. > > IIRC, the yet-to-be-converted ones are blk-throttle and perf. cpu > needs to be updated so that it at least supports a similar mechanism > as cfq-iosched for configuring ratio between tasks on an internal > cgroup and its children. Also, we really should update how cpuset > handles a cgroup becoming empty (no cpus or memory node left due to > hot-unplug). It currently transfers all its tasks to the nearest > ancestor with executing resources, which is an irreversible process > which would affect all other co-mounted controllers. 
We probably want > it to just take on the masks of the ancestor until its own executing > resources become online again, and the new behavior should be gated > behind a switch (Li, can you please look into this?). > > While we have still ways to go, I feel relatively confident saying > that we aren't too far out now, well, except for the writeback mess > that still needs to be tackled. Anyways, once the remaining bits are > settled, we can proceed to implement the unified hierarchy mode I've > been talking about forever. I can't think of any fundamental > roadblocks at the moment but who knows? The devil usually is in the > details. Let's hope it goes okay. > > So, while we aren't moving as fast as we wish we were, the kernel side > of things are falling into places. At least, that's how I see it. > From now on, I think how to make it actually useable to userland > deserves a bit more focus, and by "useable to userland", I don't mean > some group hacking up an elaborate, manual configuration which is > tailored to the point of being eccentric to suit the needs of the said > group. There's nothing wrong with that and they can continue to do > so, but it just isn't generically useable or useful. It should be > possible to generically and automatically split resources among, say, > several servers and a couple users sharing a system without resorting > to indecipherable ad-hoc shell script running off rc.local. > > > Userland efforts > > > There are currently a few userland efforts trying to make interfacing > with cgroup
Re: [PATCH 00/10] cgroups: Task counter subsystem v8
On Mon, Apr 1, 2013 at 4:18 PM, Tejun Heo wrote: > Hello, > > On Mon, Apr 01, 2013 at 03:57:46PM -0700, Tim Hockin wrote: >> I am not limited by kernel memory, I am limited by PIDs, and I need to >> be able to manage them. memory.kmem.usage_in_bytes seems to be far >> too noisy to be useful for this purpose. It may work fine for "just >> stop a fork bomb" but not for any sort of finer-grained control. > > So, why are you limited by PIDs other than the arcane / weird > limitation that you have whereever that limitation is? Does anyone anywhere actually set PID_MAX > 64K? As far as I can tell, distros default it to 32K or 64K because there's a lot of stuff out there that assumes this to be true. This is the problem we have - deep down in the bowels of code that is taking literally years to overhaul, we have identified a bad assumption that PIDs are always 5 characters long. I can't fix it any faster. That said, we also identified other software that make similar assumptions, though they are less critical to us. >> > If you think you can tilt it the other way, please feel free to try. >> >> Just because others caved, doesn't make it less of a hack. And I will >> cave, too, because I don't have time to bang my head against a wall, >> especially when I can see the remnants of other people who have tried. >> >> We'll work around it, or we'll hack around it, or we'll carry this >> patch in our own tree and just grumble about ridiculous hacks every >> time we have to forward port it. >> >> I was just hoping that things had worked themselves out in the last year. > > It's kinda weird getting this response, as I don't think it has been > particularly walley. The arguments were pretty sound from what I > recall and Frederic's use case was actually better covered by kmemcg, > so where's the said wall? And I asked you why your use case is > different and the only reason you gave me is some arbitrary PID > limitation on whatever thing you're using, which you gotta agree is a > pretty hard sell. So, if you think you have a valid case, please just > explain it. Why go passive agressive on it? If you don't have a > valid case for pushing it, yes, you'll have to hack around it - carry > the patches in your tree, whatever, or better, fix the weird PID > problem. Sorry Tejun, you're being very reasonable, I was not. The history of this patch is what makes me frustrated. It seems like such an obvious thing to support that it blows my mind that people argue it. You know our environment. Users can use their memory budgets however they like - kernel or userspace. We have good accounting, but we are PID limited. We've even implemented some hacks of our own to make that hurt less because the previously-mentioned assumptions are just NOT going away any time soon. I literally have user bugs every week on this. Hopefully the hacks we have put in place will make the users stop hurting. But we're left with some residual problems, some of which are because the only limits we can apply are per-user rather than per-container. >From our POV building the cluster, cgroups are strictly superior to most other control interfaces because they work at the same granularity that we do. I want more things to support cgroup control. This particular one was double-tasty because the ability to set the limit to 0 would actually solve a different problem we have in teardown. But as I said, we can mostly work around that. 
So I am frustrated because I don't think my use case will convince you (at the root of it, it is a problem of our own making, but it LONG predates me), despite my belief that it is obviously a good feature. I find myself hoping that someone else comes along and says "me too" rather than using a totally different hack for this. Oh well. Thanks for the update. Off to do our own thing again. > -- > tejun -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
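For readers wondering how a "PIDs are 5 characters" assumption actually breaks things, here is a contrived sketch of the class of bug being described; the buffer size and key format are invented for illustration and are not taken from any real code base. The machine-wide ceiling itself is just the kernel.pid_max sysctl (/proc/sys/kernel/pid_max), whose common default of 32768 is what keeps this kind of code "working".

/* Contrived example of the "PIDs are always 5 characters" assumption.
 * With pid_max at its usual 32768 this never trips; raise pid_max past
 * 99999 and the fixed-width key silently truncates (123456 -> "12345"). */
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
    char key[6];                 /* 5 digits + NUL: the bad assumption */
    pid_t pid = getpid();

    int n = snprintf(key, sizeof(key), "%d", (int)pid);
    if (n >= (int)sizeof(key))
        fprintf(stderr, "pid %d does not fit in a 5-char key\n", (int)pid);
    else
        printf("key=%s\n", key);
    return 0;
}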
Re: [PATCH 00/10] cgroups: Task counter subsystem v8
On Mon, Apr 1, 2013 at 3:35 PM, Tejun Heo wrote: > Hey, > > On Mon, Apr 01, 2013 at 03:20:47PM -0700, Tim Hockin wrote: >> > U so that's why you guys can't use kernel memory limit? :( >> >> Because it is completely non-obvious how to map between the two in a >> way that is safe across kernel versions and not likely to blow up in >> our faces. It's a hack, in other words. > > Now we're repeating the argument Frederic and Johannes had, so I'd > suggest going back the thread and reading the discussion and if you > still think using kmemcg is a bad idea, please explain why that is so. > For the specific point that you just raised, the scale tilted toward > thread/process count is a hacky and unreliable representation of > kernel memory resource than the other way around, at least back then. I am not limited by kernel memory, I am limited by PIDs, and I need to be able to manage them. memory.kmem.usage_in_bytes seems to be far too noisy to be useful for this purpose. It may work fine for "just stop a fork bomb" but not for any sort of finer-grained control. > If you think you can tilt it the other way, please feel free to try. Just because others caved, doesn't make it less of a hack. And I will cave, too, because I don't have time to bang my head against a wall, especially when I can see the remnants of other people who have tried. We'll work around it, or we'll hack around it, or we'll carry this patch in our own tree and just grumble about ridiculous hacks every time we have to forward port it. I was just hoping that things had worked themselves out in the last year. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
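To make the "too noisy" complaint concrete, here is a sketch of what using kmemcg as a stand-in for a task limit looks like: read the group's kernel-memory counters and divide by a guessed per-task kernel cost. The cgroup path is assumed, and the 16 KiB per-task figure is pure guesswork, which is the point: dentries, sockets, and every other kernel allocation the job makes land in the same counter, so the derived "task headroom" number moves with I/O and fd churn as much as with task count.

/* Sketch: estimating task headroom from cgroup v1 kmem accounting
 * (memory.kmem.*). Path and per-task cost estimate are assumptions. */
#include <stdio.h>
#include <stdlib.h>

#define CG "/sys/fs/cgroup/memory/user1/"
#define EST_BYTES_PER_TASK (16 * 1024)    /* rough guess: the "hack" part */

static long long read_ll(const char *path)
{
    long long v = -1;
    FILE *f = fopen(path, "r");
    if (!f || fscanf(f, "%lld", &v) != 1)
        perror(path);
    if (f)
        fclose(f);
    return v;
}

int main(void)
{
    long long usage = read_ll(CG "memory.kmem.usage_in_bytes");
    long long limit = read_ll(CG "memory.kmem.limit_in_bytes");

    if (usage < 0 || limit < 0)
        return 1;

    /* All kernel allocations share this counter, so the result is only a
     * loose upper bound on "how many more tasks fit", and it fluctuates. */
    printf("estimated task headroom: %lld\n",
           (limit - usage) / EST_BYTES_PER_TASK);
    return 0;
}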
Re: [PATCH 00/10] cgroups: Task counter subsystem v8
On Mon, Apr 1, 2013 at 3:03 PM, Tejun Heo wrote: > Hello, Tim. > > On Mon, Apr 01, 2013 at 02:02:06PM -0700, Tim Hockin wrote: >> We run dozens of jobs from dozens users on a single machine. We >> regularly experience users who leak threads, running into the tens of >> thousands. We are unable to raise the PID_MAX significantly due to >> some bad, but really thoroughly baked-in decisions that were made a >> long time ago. What we experience on a daily basis is users > > U so that's why you guys can't use kernel memory limit? :( Because it is completely non-obvious how to map between the two in a way that is safe across kernel versions and not likely to blow up in our faces. It's a hack, in other words. >> complaining about getting a "pthread_create(): resource unavailable" >> error because someone on the machine has leaked. > ... >> What I really don't understand is why so much push back? We have this >> nicely structured cgroup system. Each cgroup controller's code is >> pretty well partitioned - why would we not want more complete >> functionality built around it? We accept device drivers for the most >> random, useless crap on the assertion that "if you don't need it, >> don't compile it in". I can think of a half dozen more really useful, >> cool things we can do with cgroups, but I know the pushback will be >> tremendous, and I just don't grok why. > > In general, because it adds to maintenance overhead. e.g. We've been > trying to make all cgroups follow consistent nesting rules. We're now > almost there with a couple controllers left. This one would have been > another thing to do, which is fine if it's necessary but if it isn't > we're just adding up work for no good reason. > > More importantly, because cgroup is already plagued with so many bad > design decisions - some from core design decisions - e.g. not being > able to actually identify a resource outside of a context of a task. > Others are added on by each controller going out doing whatever it > wants without thinking about how the whole thing would come together > afterwards - e.g. double accounting between cpu and cpuacct, > completely illogical and unusable hierarchy implementations in > anything other than cpu controllers (they're getting better), and so > on. Right now it's in a state where there's not many things coherent > about it. Sure, every controller and feature supports the ones their > makers intended them to but when collected together it's just a mess, > which is one of the common complaints against cgroup. > > So, no free-for-all, please. > > Thanks. > > -- > tejun -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 00/10] cgroups: Task counter subsystem v8
On Mon, Apr 1, 2013 at 1:29 PM, Tejun Heo wrote: > On Mon, Apr 01, 2013 at 01:09:09PM -0700, Tim Hockin wrote: >> Pardon my ignorance, but... what? Use kernel memory limits as a proxy >> for process/thread counts? That sounds terrible - I hope I am > > Well, the argument was that process / thread counts were a poor and > unnecessary proxy for kernel memory consumption limit. IIRC, Johannes > put it as (I'm paraphrasing) "you can't go to Fry's and buy 4k thread > worth of component". > >> misunderstanding? This task counter patch had several properties that >> mapped very well to what we want. >> >> Is it dead in the water? > > After some discussion, Frederic agreed that at least his use case can > be served well by kmemcg, maybe even better - IIRC it was container > fork bomb scenario, so you'll have to argue your way in why kmemcg > isn't a suitable solution for your use case if you wanna revive this. We run dozens of jobs from dozens users on a single machine. We regularly experience users who leak threads, running into the tens of thousands. We are unable to raise the PID_MAX significantly due to some bad, but really thoroughly baked-in decisions that were made a long time ago. What we experience on a daily basis is users complaining about getting a "pthread_create(): resource unavailable" error because someone on the machine has leaked. Today we use RLIMIT_NPROC to lock most users down to a smaller max. But this is a per-user setting, not a per-container setting, and users do not control where their jobs land. Scheduling decisions often put multiple thread-heavy but non-leaking jobs from one user onto the same machine, which again causes problems. Further, it does not help for some of our use cases where a logical job can run as multiple UIDs for different processes within. >From the end-user point of view this is an isolation leak which is totally non-deterministic for them. They can not know what to plan for. Getting cgroup-level control of this limit is important for a saner SLA for our users. In addition, the behavior around locking-out new tasks seems like a nice way to simplify and clean up end-life work for the administrative system. Admittedly, we can mostly work around this with freezer instead. What I really don't understand is why so much push back? We have this nicely structured cgroup system. Each cgroup controller's code is pretty well partitioned - why would we not want more complete functionality built around it? We accept device drivers for the most random, useless crap on the assertion that "if you don't need it, don't compile it in". I can think of a half dozen more really useful, cool things we can do with cgroups, but I know the pushback will be tremendous, and I just don't grok why. Tim -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
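For reference, the per-user workaround mentioned above is the plain setrlimit() interface; a minimal sketch follows, with an illustrative cap. The limitation Tim is describing is visible in the API itself: RLIMIT_NPROC is accounted per real user ID across the whole machine, so two well-behaved jobs from the same user on one host still draw from a single budget, and a job that spans several UIDs is not covered at all.

/* Sketch: capping a user's process/thread count with RLIMIT_NPROC.
 * Per-UID and machine-wide, not per-cgroup, which is the complaint above. */
#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NPROC, &rl)) {
        perror("getrlimit");
        return 1;
    }
    printf("NPROC soft=%llu hard=%llu\n",
           (unsigned long long)rl.rlim_cur,
           (unsigned long long)rl.rlim_max);

    rl.rlim_cur = 4096;    /* illustrative cap; must not exceed rl.rlim_max */
    if (setrlimit(RLIMIT_NPROC, &rl)) {
        perror("setrlimit");
        return 1;
    }
    return 0;
}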
Re: [PATCH 00/10] cgroups: Task counter subsystem v8
On Mon, Apr 1, 2013 at 11:46 AM, Tejun Heo wrote: > On Mon, Apr 01, 2013 at 11:43:03AM -0700, Tim Hockin wrote: >> A year later - what ever happened with this? I want it more than ever >> for Google's use. > > I think the conclusion was "use kmemcg instead". Pardon my ignorance, but... what? Use kernel memory limits as a proxy for process/thread counts? That sounds terrible - I hope I am misunderstanding? This task counter patch had several properties that mapped very well to what we want. Is it dead in the water? -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 00/10] cgroups: Task counter subsystem v8
A year later - what ever happened with this? I want it more than ever for Google's use. On Tue, Jan 31, 2012 at 7:37 PM, Frederic Weisbecker wrote: > Hi, > > Changes In this version: > > - Split 32/64 bits version of res_counter_write_u64() [1/10] > Courtesy of Kirill A. Shutemov > > - Added Kirill's ack [8/10] > > - Added selftests [9/10], [10/10] > > Please consider for merging. At least two users want this feature: > https://lkml.org/lkml/2011/12/13/309 > https://lkml.org/lkml/2011/12/13/364 > > More general details provided in the last version posting: > https://lkml.org/lkml/2012/1/13/230 > > Thanks! > > > Frederic Weisbecker (9): > cgroups: add res_counter_write_u64() API > cgroups: new resource counter inheritance API > cgroups: ability to stop res charge propagation on bounded ancestor > res_counter: allow charge failure pointer to be null > cgroups: pull up res counter charge failure interpretation to caller > cgroups: allow subsystems to cancel a fork > cgroups: Add a task counter subsystem > selftests: Enter each directories before executing selftests > selftests: Add a new task counter selftest > > Kirill A. Shutemov (1): > cgroups: add res counter common ancestor searching > > Documentation/cgroups/resource_counter.txt | 20 ++- > Documentation/cgroups/task_counter.txt | 153 +++ > include/linux/cgroup.h | 20 +- > include/linux/cgroup_subsys.h |5 + > include/linux/res_counter.h| 27 ++- > init/Kconfig |9 + > kernel/Makefile|1 + > kernel/cgroup.c| 23 ++- > kernel/cgroup_freezer.c|6 +- > kernel/cgroup_task_counter.c | 272 > > kernel/exit.c |2 +- > kernel/fork.c |7 +- > kernel/res_counter.c | 103 +++- > tools/testing/selftests/Makefile |2 +- > tools/testing/selftests/run_tests |6 +- > tools/testing/selftests/task_counter/Makefile |8 + > tools/testing/selftests/task_counter/fork.c| 40 +++ > tools/testing/selftests/task_counter/forkbomb.c| 40 +++ > tools/testing/selftests/task_counter/multithread.c | 68 + > tools/testing/selftests/task_counter/run_test | 198 ++ > .../selftests/task_counter/spread_thread_group.c | 82 ++ > 21 files changed, 1056 insertions(+), 36 deletions(-) > create mode 100644 Documentation/cgroups/task_counter.txt > create mode 100644 kernel/cgroup_task_counter.c > create mode 100644 tools/testing/selftests/task_counter/Makefile > create mode 100644 tools/testing/selftests/task_counter/fork.c > create mode 100644 tools/testing/selftests/task_counter/forkbomb.c > create mode 100644 tools/testing/selftests/task_counter/multithread.c > create mode 100755 tools/testing/selftests/task_counter/run_test > create mode 100644 tools/testing/selftests/task_counter/spread_thread_group.c > > -- > 1.7.5.4 > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
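For readers who have not read the series, the intended interface is a per-cgroup task counter with a settable ceiling, where writing 0 also fences off new forks (the teardown use mentioned earlier in the thread). A sketch of that usage pattern follows; the mount point and the tasks.limit knob name are assumptions patterned on the series' description and should be treated as illustrative, since the subsystem was never merged.

/* Sketch of the proposed per-cgroup task limit: cap a job's task count,
 * then drop the cap to 0 during teardown so nothing can fork its way back
 * in while the manager reaps the group. Paths and knob names illustrative. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define JOB "/sys/fs/cgroup/tasks/job42/"      /* assumed mount and layout */

static int write_str(const char *path, const char *val)
{
    int fd = open(path, O_WRONLY);
    if (fd < 0) {
        perror(path);
        return -1;
    }
    if (write(fd, val, strlen(val)) < 0) {
        perror(path);
        close(fd);
        return -1;
    }
    return close(fd);
}

int main(void)
{
    /* Normal operation: this job may hold at most 1000 tasks. */
    write_str(JOB "tasks.limit", "1000");

    /* Teardown: forbid any further fork() in the group, then kill and reap. */
    write_str(JOB "tasks.limit", "0");
    return 0;
}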
Re: kernel panic on resume from S3 - stumped
On Sun, Dec 30, 2012 at 2:55 PM, Rafael J. Wysocki wrote: > On Saturday, December 29, 2012 11:17:11 PM Tim Hockin wrote: >> Best guess: >> >> With 'noapic', I see the "irq 5: nobody cared" message on resume, >> along with 1 IRQ5 counts in /proc/interrupts (the devices claiming >> that IRQ are quiescent). >> >> Without 'noapic' that must be triggering something else to go haywire, >> perhaps the AER logic (though that is all MSI, so probably not). I'm >> flying blind on those boots. >> >> I bet that, if I can recall how to re-enable IRQ5, I'll see it >> continuously asserting. Chipset or BIOS bug maybe. I don't know if I >> had AER enabled under Lucid, so that might be the difference. >> >> I'll try a vanilla kernel next, maybe hack on AER a bit, to see if I >> can make it progress. > > I wonder what happens if you simply disable AER for starters? > > There is the pci=noaer kernel command line switch for that. That still panics on resume. Damn. I really think it is down to that interrupt storm at resume. Something somewhere is getting stuck asserting, and we don't know how to EOI it. PIC vs APIC is just changing the operating mode. Now the question is whether I am going to track through Intel errata (more than I have already) and through chipset docs to figure out what it could be, or just leave it at noapic. I've already got one new PCI quirk to code up. > Thanks, > Rafael > > >> On Sat, Dec 29, 2012 at 10:19 PM, Tim Hockin wrote: >> > Quick update: booting with 'noapic' on the commandline seems to make >> > it resume successfully. >> > >> > The main dmesg diffs, other than the obvious "Skipping IOAPIC probe" >> > and IRG number diffs) are: >> > >> > -nr_irqs_gsi: 40 >> > +nr_irqs_gsi: 16 >> > >> > -NR_IRQS:16640 nr_irqs:776 16 >> > +NR_IRQS:16640 nr_irqs:368 16 >> > >> > -system 00:0a: [mem 0xfec0-0xfec00fff] could not be reserved >> > +system 00:0a: [mem 0xfec0-0xfec00fff] has been reserved >> > >> > and a new warning about irq 5: nobody cared (try booting with the >> > "irqpoll" option) >> > >> > I'll see if I can sort out further differences, but I thought it was >> > worth sending this new info along, anyway. >> > >> > It did not require 'noapic' on the Lucid (2.6.32?) kernel >> > >> > >> > On Sat, Dec 29, 2012 at 9:34 PM, Tim Hockin wrote: >> >> Running a suspend with pm_trace set, I get: >> >> >> >> aer :00:03.0:pcie02: hash matches >> >> >> >> I don't know what magic might be needed here, though. >> >> >> >> I guess next step is to try to build a non-distro kernel. >> >> >> >> On Sat, Dec 29, 2012 at 1:57 PM, Rafael J. Wysocki wrote: >> >>> On Saturday, December 29, 2012 12:03:13 PM Tim Hockin wrote: >> >>>> 4 days ago I had Ubuntu Lucid running on this computer. Suspend and >> >>>> resume worked flawlessly every time. >> >>>> >> >>>> Then I upgraded to Ubuntu Precise. >> >>> >> >>> Well, do you use a distro kernel or a kernel.org kernel? >> >>> >> >>>> Suspend seems to work, but resume >> >>>> fails every time. The video never initializes. By the flashing >> >>>> keyboard lights, I guess it's a kernel panic. It fails from the Live >> >>>> CD and from a fresh install. >> >>>> >> >>>> Here is my debug so far. 
>> >>>> >> >>>> Install all updates (3.2 kernel, nouveau driver) >> >>>> Reboot >> >>>> Try suspend = fails >> >>>> >> >>>> Install Ubuntu's linux-generic-lts-quantal (3.5 kernel, nouveau driver) >> >>>> Reboot >> >>>> Try suspend = fails >> >>>> >> >>>> Install nVidia's 304 driver >> >>>> Reboot >> >>>> Try suspend = fails >> >>>> >> >>>> From within X: >> >>>> echo core > /sys/power/pm_test >> >>>> echo mem > /sys/power/state >> >>>> The system acts like it is going to sleep, and then wakes up a few >> >>>> seconds later. dmesg shows: >> >>>> >> >>>> [ 1230.083404] [ cut here ] >> >>>> [ 1230.083410] WARNING: at >> >>>&g
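For anyone retracing this kind of debugging, the knobs exercised in the thread are ordinary sysfs files: /sys/power/pm_test for dry-run suspends, /sys/power/state to actually suspend, and /sys/power/pm_trace to leave a breadcrumb that shows up as "hash matches" in dmesg after the next boot (pm_trace is only present when the kernel is built with it, and it stores its hash in the RTC, so the clock will be wrong after the reboot). A condensed sketch, which must run as root:

/* Sketch of the suspend-debug workflow used in this thread:
 *   dry run:  write a test level to /sys/power/pm_test, then suspend;
 *   real run: clear pm_test, arm pm_trace, suspend, and after the failed
 *             resume plus reboot look for "hash matches" in dmesg. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int poke(const char *path, const char *val)
{
    int fd = open(path, O_WRONLY);
    if (fd < 0) {
        perror(path);
        return -1;
    }
    ssize_t n = write(fd, val, strlen(val));
    if (n < 0)
        perror(path);
    close(fd);
    return n < 0 ? -1 : 0;
}

int main(int argc, char **argv)
{
    int dry_run = (argc > 1 && !strcmp(argv[1], "--dry-run"));

    if (dry_run) {
        /* "core" is the deepest test level; "freezer", "devices", "platform"
         * and "processors" stop at earlier points in the suspend sequence. */
        poke("/sys/power/pm_test", "core");
    } else {
        poke("/sys/power/pm_test", "none");
        poke("/sys/power/pm_trace", "1");   /* breadcrumb for the next boot */
    }
    return poke("/sys/power/state", "mem") ? 1 : 0;   /* suspend to RAM */
}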
Re: kernel panic on resume from S3 - stumped
Best guess: With 'noapic', I see the "irq 5: nobody cared" message on resume, along with 1 IRQ5 counts in /proc/interrupts (the devices claiming that IRQ are quiescent). Without 'noapic' that must be triggering something else to go haywire, perhaps the AER logic (though that is all MSI, so probably not). I'm flying blind on those boots. I bet that, if I can recall how to re-enable IRQ5, I'll see it continuously asserting. Chipset or BIOS bug maybe. I don't know if I had AER enabled under Lucid, so that might be the difference. I'll try a vanilla kernel next, maybe hack on AER a bit, to see if I can make it progress. On Sat, Dec 29, 2012 at 10:19 PM, Tim Hockin wrote: > Quick update: booting with 'noapic' on the commandline seems to make > it resume successfully. > > The main dmesg diffs, other than the obvious "Skipping IOAPIC probe" > and IRG number diffs) are: > > -nr_irqs_gsi: 40 > +nr_irqs_gsi: 16 > > -NR_IRQS:16640 nr_irqs:776 16 > +NR_IRQS:16640 nr_irqs:368 16 > > -system 00:0a: [mem 0xfec0-0xfec00fff] could not be reserved > +system 00:0a: [mem 0xfec0-0xfec00fff] has been reserved > > and a new warning about irq 5: nobody cared (try booting with the > "irqpoll" option) > > I'll see if I can sort out further differences, but I thought it was > worth sending this new info along, anyway. > > It did not require 'noapic' on the Lucid (2.6.32?) kernel > > > On Sat, Dec 29, 2012 at 9:34 PM, Tim Hockin wrote: >> Running a suspend with pm_trace set, I get: >> >> aer :00:03.0:pcie02: hash matches >> >> I don't know what magic might be needed here, though. >> >> I guess next step is to try to build a non-distro kernel. >> >> On Sat, Dec 29, 2012 at 1:57 PM, Rafael J. Wysocki wrote: >>> On Saturday, December 29, 2012 12:03:13 PM Tim Hockin wrote: >>>> 4 days ago I had Ubuntu Lucid running on this computer. Suspend and >>>> resume worked flawlessly every time. >>>> >>>> Then I upgraded to Ubuntu Precise. >>> >>> Well, do you use a distro kernel or a kernel.org kernel? >>> >>>> Suspend seems to work, but resume >>>> fails every time. The video never initializes. By the flashing >>>> keyboard lights, I guess it's a kernel panic. It fails from the Live >>>> CD and from a fresh install. >>>> >>>> Here is my debug so far. >>>> >>>> Install all updates (3.2 kernel, nouveau driver) >>>> Reboot >>>> Try suspend = fails >>>> >>>> Install Ubuntu's linux-generic-lts-quantal (3.5 kernel, nouveau driver) >>>> Reboot >>>> Try suspend = fails >>>> >>>> Install nVidia's 304 driver >>>> Reboot >>>> Try suspend = fails >>>> >>>> From within X: >>>> echo core > /sys/power/pm_test >>>> echo mem > /sys/power/state >>>> The system acts like it is going to sleep, and then wakes up a few >>>> seconds later. dmesg shows: >>>> >>>> [ 1230.083404] [ cut here ] >>>> [ 1230.083410] WARNING: at >>>> /build/buildd/linux-lts-quantal-3.5.0/kernel/power/suspend_test.c:53 >>>> suspend_test_finish+0x86/0x90() >>>> [ 1230.083411] Hardware name: To Be Filled By O.E.M. 
>>>> [ 1230.083412] Component: resume devices, time: 14424 >>>> [ 1230.083412] Modules linked in: snd_emu10k1_synth snd_emux_synth >>>> snd_seq_virmidi snd_seq_midi_emul bnep rfcomm parport_pc ppdev >>>> nvidia(PO) snd_emu10k1 snd_ac97_codec ac97_bus snd_pcm snd_page_alloc >>>> snd_util_mem snd_hwdep snd_seq_midi snd_rawmidi snd_seq_midi_event >>>> snd_seq snd_timer coretemp snd_seq_device kvm_intel kvm snd >>>> ghash_clmulni_intel soundcore aesni_intel btusb cryptd aes_x86_64 >>>> bluetooth i7core_edac edac_core microcode mac_hid lpc_ich mxm_wmi >>>> shpchp serio_raw wmi hid_generic lp parport usbhid hid r8169 >>>> pata_marvell >>>> [ 1230.083445] Pid: 3329, comm: bash Tainted: P O 3.5.0-21-generic >>>> #32~precise1-Ubuntu >>>> [ 1230.083446] Call Trace: >>>> [ 1230.083448] [] warn_slowpath_common+0x7f/0xc0 >>>> [ 1230.083452] [] warn_slowpath_fmt+0x46/0x50 >>>> [ 1230.083455] [] suspend_test_finish+0x86/0x90 >>>> [ 1230.083457] [] suspend_devices_and_enter+0x10b/0x200 >>>> [ 1230.083460] [] enter_state+0xd1/0x100 >>>> [ 1230.083463] [] pm_suspend+0x1b/0x60 >>>> [ 1230.083465] [] state_store+0x4
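The "continuously asserting" suspicion is easy to check from userspace by sampling /proc/interrupts twice and diffing the counts for the suspect line; a small sketch follows (IRQ number on the command line, parsing deliberately crude, and the 10000-per-second threshold is an arbitrary illustration of "storm"):

/* Sketch: detect an interrupt storm by sampling /proc/interrupts twice
 * and comparing the summed per-CPU count for one IRQ (e.g. "./irqwatch 5"). */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

static long long irq_count(int irq)
{
    char line[1024], want[16];
    long long total = 0;
    FILE *f = fopen("/proc/interrupts", "r");

    if (!f) {
        perror("/proc/interrupts");
        return -1;
    }
    snprintf(want, sizeof(want), "%d:", irq);
    while (fgets(line, sizeof(line), f)) {
        char *p = line;
        while (*p == ' ')
            p++;
        if (strncmp(p, want, strlen(want)))
            continue;
        /* Sum the per-CPU columns that follow the "N:" label; parsing
         * stops at the first non-numeric column (the chip name). */
        for (p += strlen(want); *p; ) {
            char *end;
            long long v = strtoll(p, &end, 10);
            if (end == p)
                break;
            total += v;
            p = end;
        }
        break;
    }
    fclose(f);
    return total;
}

int main(int argc, char **argv)
{
    int irq = argc > 1 ? atoi(argv[1]) : 5;
    long long before = irq_count(irq);
    sleep(1);
    long long after = irq_count(irq);

    printf("irq %d: %lld interrupts in 1s%s\n", irq, after - before,
           (after - before) > 10000 ? "  (possible storm)" : "");
    return 0;
}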
Re: kernel panic on resume from S3 - stumped
Quick update: booting with 'noapic' on the commandline seems to make it resume successfully. The main dmesg diffs (other than the obvious "Skipping IOAPIC probe" and IRQ number diffs) are: -nr_irqs_gsi: 40 +nr_irqs_gsi: 16 -NR_IRQS:16640 nr_irqs:776 16 +NR_IRQS:16640 nr_irqs:368 16 -system 00:0a: [mem 0xfec0-0xfec00fff] could not be reserved +system 00:0a: [mem 0xfec0-0xfec00fff] has been reserved and a new warning about irq 5: nobody cared (try booting with the "irqpoll" option) I'll see if I can sort out further differences, but I thought it was worth sending this new info along, anyway. It did not require 'noapic' on the Lucid (2.6.32?) kernel On Sat, Dec 29, 2012 at 9:34 PM, Tim Hockin wrote: > Running a suspend with pm_trace set, I get: > > aer :00:03.0:pcie02: hash matches > > I don't know what magic might be needed here, though. > > I guess next step is to try to build a non-distro kernel. > > On Sat, Dec 29, 2012 at 1:57 PM, Rafael J. Wysocki wrote: >> On Saturday, December 29, 2012 12:03:13 PM Tim Hockin wrote: >>> 4 days ago I had Ubuntu Lucid running on this computer. Suspend and >>> resume worked flawlessly every time. >>> >>> Then I upgraded to Ubuntu Precise. >> >> Well, do you use a distro kernel or a kernel.org kernel? >> >>> Suspend seems to work, but resume >>> fails every time. The video never initializes. By the flashing >>> keyboard lights, I guess it's a kernel panic. It fails from the Live >>> CD and from a fresh install. >>> >>> Here is my debug so far. >>> >>> Install all updates (3.2 kernel, nouveau driver) >>> Reboot >>> Try suspend = fails >>> >>> Install Ubuntu's linux-generic-lts-quantal (3.5 kernel, nouveau driver) >>> Reboot >>> Try suspend = fails >>> >>> Install nVidia's 304 driver >>> Reboot >>> Try suspend = fails >>> >>> From within X: >>> echo core > /sys/power/pm_test >>> echo mem > /sys/power/state >>> The system acts like it is going to sleep, and then wakes up a few >>> seconds later. dmesg shows: >>> >>> [ 1230.083404] [ cut here ] >>> [ 1230.083410] WARNING: at >>> /build/buildd/linux-lts-quantal-3.5.0/kernel/power/suspend_test.c:53 >>> suspend_test_finish+0x86/0x90() >>> [ 1230.083411] Hardware name: To Be Filled By O.E.M.
>>> [ 1230.083412] Component: resume devices, time: 14424 >>> [ 1230.083412] Modules linked in: snd_emu10k1_synth snd_emux_synth >>> snd_seq_virmidi snd_seq_midi_emul bnep rfcomm parport_pc ppdev >>> nvidia(PO) snd_emu10k1 snd_ac97_codec ac97_bus snd_pcm snd_page_alloc >>> snd_util_mem snd_hwdep snd_seq_midi snd_rawmidi snd_seq_midi_event >>> snd_seq snd_timer coretemp snd_seq_device kvm_intel kvm snd >>> ghash_clmulni_intel soundcore aesni_intel btusb cryptd aes_x86_64 >>> bluetooth i7core_edac edac_core microcode mac_hid lpc_ich mxm_wmi >>> shpchp serio_raw wmi hid_generic lp parport usbhid hid r8169 >>> pata_marvell >>> [ 1230.083445] Pid: 3329, comm: bash Tainted: P O 3.5.0-21-generic >>> #32~precise1-Ubuntu >>> [ 1230.083446] Call Trace: >>> [ 1230.083448] [] warn_slowpath_common+0x7f/0xc0 >>> [ 1230.083452] [] warn_slowpath_fmt+0x46/0x50 >>> [ 1230.083455] [] suspend_test_finish+0x86/0x90 >>> [ 1230.083457] [] suspend_devices_and_enter+0x10b/0x200 >>> [ 1230.083460] [] enter_state+0xd1/0x100 >>> [ 1230.083463] [] pm_suspend+0x1b/0x60 >>> [ 1230.083465] [] state_store+0x45/0x70 >>> [ 1230.083467] [] kobj_attr_store+0xf/0x30 >>> [ 1230.083471] [] sysfs_write_file+0xef/0x170 >>> [ 1230.083476] [] vfs_write+0xb3/0x180 >>> [ 1230.083480] [] sys_write+0x4a/0x90 >>> [ 1230.083483] [] system_call_fastpath+0x16/0x1b >>> [ 1230.083488] ---[ end trace 839cdd0078b3ce03 ]--- >>> >>> Boot with init=/bin/bash >>> unload all modules except USBHID >>> echo core > /sys/power/pm_test >>> echo mem > /sys/power/state >>> system acts like it is going to sleep, and then wakes up a few seconds later >>> echo none > /sys/power/pm_test >>> echo mem > /sys/power/state >>> system goes to sleep >>> press power to resume = fails >>> >>> At this point I am stumped on how to debug. This is a "modern" >>> computer with no serial ports. It worked under Lucid, so I know it is >>> POSSIBLE. >>> >>> Mobo: ASRock X58 single-socket >>> CPU: West
Re: kernel panic on resume from S3 - stumped
Running a suspend with pm_trace set, I get: aer :00:03.0:pcie02: hash matches I don't know what magic might be needed here, though. I guess next step is to try to build a non-distro kernel. On Sat, Dec 29, 2012 at 1:57 PM, Rafael J. Wysocki wrote: > On Saturday, December 29, 2012 12:03:13 PM Tim Hockin wrote: >> 4 days ago I had Ubuntu Lucid running on this computer. Suspend and >> resume worked flawlessly every time. >> >> Then I upgraded to Ubuntu Precise. > > Well, do you use a distro kernel or a kernel.org kernel? > >> Suspend seems to work, but resume >> fails every time. The video never initializes. By the flashing >> keyboard lights, I guess it's a kernel panic. It fails from the Live >> CD and from a fresh install. >> >> Here is my debug so far. >> >> Install all updates (3.2 kernel, nouveau driver) >> Reboot >> Try suspend = fails >> >> Install Ubuntu's linux-generic-lts-quantal (3.5 kernel, nouveau driver) >> Reboot >> Try suspend = fails >> >> Install nVidia's 304 driver >> Reboot >> Try suspend = fails >> >> From within X: >> echo core > /sys/power/pm_test >> echo mem > /sys/power/state >> The system acts like it is going to sleep, and then wakes up a few >> seconds later. dmesg shows: >> >> [ 1230.083404] [ cut here ] >> [ 1230.083410] WARNING: at >> /build/buildd/linux-lts-quantal-3.5.0/kernel/power/suspend_test.c:53 >> suspend_test_finish+0x86/0x90() >> [ 1230.083411] Hardware name: To Be Filled By O.E.M. >> [ 1230.083412] Component: resume devices, time: 14424 >> [ 1230.083412] Modules linked in: snd_emu10k1_synth snd_emux_synth >> snd_seq_virmidi snd_seq_midi_emul bnep rfcomm parport_pc ppdev >> nvidia(PO) snd_emu10k1 snd_ac97_codec ac97_bus snd_pcm snd_page_alloc >> snd_util_mem snd_hwdep snd_seq_midi snd_rawmidi snd_seq_midi_event >> snd_seq snd_timer coretemp snd_seq_device kvm_intel kvm snd >> ghash_clmulni_intel soundcore aesni_intel btusb cryptd aes_x86_64 >> bluetooth i7core_edac edac_core microcode mac_hid lpc_ich mxm_wmi >> shpchp serio_raw wmi hid_generic lp parport usbhid hid r8169 >> pata_marvell >> [ 1230.083445] Pid: 3329, comm: bash Tainted: P O 3.5.0-21-generic >> #32~precise1-Ubuntu >> [ 1230.083446] Call Trace: >> [ 1230.083448] [] warn_slowpath_common+0x7f/0xc0 >> [ 1230.083452] [] warn_slowpath_fmt+0x46/0x50 >> [ 1230.083455] [] suspend_test_finish+0x86/0x90 >> [ 1230.083457] [] suspend_devices_and_enter+0x10b/0x200 >> [ 1230.083460] [] enter_state+0xd1/0x100 >> [ 1230.083463] [] pm_suspend+0x1b/0x60 >> [ 1230.083465] [] state_store+0x45/0x70 >> [ 1230.083467] [] kobj_attr_store+0xf/0x30 >> [ 1230.083471] [] sysfs_write_file+0xef/0x170 >> [ 1230.083476] [] vfs_write+0xb3/0x180 >> [ 1230.083480] [] sys_write+0x4a/0x90 >> [ 1230.083483] [] system_call_fastpath+0x16/0x1b >> [ 1230.083488] ---[ end trace 839cdd0078b3ce03 ]--- >> >> Boot with init=/bin/bash >> unload all modules except USBHID >> echo core > /sys/power/pm_test >> echo mem > /sys/power/state >> system acts like it is going to sleep, and then wakes up a few seconds later >> echo none > /sys/power/pm_test >> echo mem > /sys/power/state >> system goes to sleep >> press power to resume = fails >> >> At this point I am stumped on how to debug. This is a "modern" >> computer with no serial ports. It worked under Lucid, so I know it is >> POSSIBLE. 
>> >> Mobo: ASRock X58 single-socket >> CPU: Westmere 6 core (12 hyperthreads) 3.2 GHz >> RAM: 12 GB ECC >> Disk: sda = Intel SSD, mounted on / >> Disk: sdb = Intel SSD, not mounted >> Disk: sdc = Seagate HDD, not mounted >> Disk: sdd = Seagate HDD, not mounted >> NIC = Onboard RTL8168e/8111e >> Sound = EMU1212 (emu10k1, not even configured yet) >> Video = nVidia GeForce 7600 GT >> KB = PS2 (also tried USB) >> Mouse = USB >> >> I have not updated to a more current kernel than 3.5, but I will if >> there's evidence that this is resolved. Any other clever trick to >> try? > > There is no evidence and there won't be if you don't try a newer kernel. > > Thanks, > Rafael > > > -- > I speak only for myself. > Rafael J. Wysocki, Intel Open Source Technology Center. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
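A note on the pm_trace line quoted above: when /sys/power/pm_trace is set, the kernel stores a small hash derived from the device it is about to resume into the RTC before touching that device, so after a hang and a reboot it can print "hash matches" for every device whose name hashes to the saved value; the AER port at 00:03.0 is therefore the prime suspect (or a hash collision). The following is a rough, self-contained sketch of that idea only; the hash function, the storage variable, and the device names are invented for illustration and are not taken from drivers/base/power/trace.c.

#include <stdio.h>

/* Illustrative only: pm_trace really folds the hash into the RTC/CMOS
 * clock registers; this variable just models "a few bits survive reboot". */
static unsigned int saved_hash;

static unsigned int hash_name(const char *name)
{
    unsigned int h = 0;
    while (*name)
        h = h * 31 + (unsigned char)*name++;
    return h % 1024;              /* only a small value can be preserved */
}

/* Called just before each device is resumed (conceptually TRACE_RESUME). */
static void pm_trace_store(const char *dev_name)
{
    saved_hash = hash_name(dev_name);
}

/* Called on the next boot: report every device whose hash matches. */
static void pm_trace_check(const char **devs, int n)
{
    for (int i = 0; i < n; i++)
        if (hash_name(devs[i]) == saved_hash)
            printf("%s: hash matches\n", devs[i]);
}

int main(void)
{
    const char *devs[] = { "0000:00:03.0:pcie02", "0000:00:1a.0", "sda" };

    pm_trace_store(devs[0]);      /* resume wedges while handling this one */
    /* ...machine hangs, user reboots... */
    pm_trace_check(devs, 3);      /* prints the suspect device(s) */
    return 0;
}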
kernel panic on resume from S3 - stumped
4 days ago I had Ubuntu Lucid running on this computer. Suspend and resume worked flawlessly every time. Then I upgraded to Ubuntu Precise. Suspend seems to work, but resume fails every time. The video never initializes. By the flashing keyboard lights, I guess it's a kernel panic. It fails from the Live CD and from a fresh install. Here is my debug so far. Install all updates (3.2 kernel, nouveau driver) Reboot Try suspend = fails Install Ubuntu's linux-generic-lts-quantal (3.5 kernel, nouveau driver) Reboot Try suspend = fails Install nVidia's 304 driver Reboot Try suspend = fails >From within X: echo core > /sys/power/pm_test echo mem > /sys/power/state The system acts like it is going to sleep, and then wakes up a few seconds later. dmesg shows: [ 1230.083404] [ cut here ] [ 1230.083410] WARNING: at /build/buildd/linux-lts-quantal-3.5.0/kernel/power/suspend_test.c:53 suspend_test_finish+0x86/0x90() [ 1230.083411] Hardware name: To Be Filled By O.E.M. [ 1230.083412] Component: resume devices, time: 14424 [ 1230.083412] Modules linked in: snd_emu10k1_synth snd_emux_synth snd_seq_virmidi snd_seq_midi_emul bnep rfcomm parport_pc ppdev nvidia(PO) snd_emu10k1 snd_ac97_codec ac97_bus snd_pcm snd_page_alloc snd_util_mem snd_hwdep snd_seq_midi snd_rawmidi snd_seq_midi_event snd_seq snd_timer coretemp snd_seq_device kvm_intel kvm snd ghash_clmulni_intel soundcore aesni_intel btusb cryptd aes_x86_64 bluetooth i7core_edac edac_core microcode mac_hid lpc_ich mxm_wmi shpchp serio_raw wmi hid_generic lp parport usbhid hid r8169 pata_marvell [ 1230.083445] Pid: 3329, comm: bash Tainted: P O 3.5.0-21-generic #32~precise1-Ubuntu [ 1230.083446] Call Trace: [ 1230.083448] [] warn_slowpath_common+0x7f/0xc0 [ 1230.083452] [] warn_slowpath_fmt+0x46/0x50 [ 1230.083455] [] suspend_test_finish+0x86/0x90 [ 1230.083457] [] suspend_devices_and_enter+0x10b/0x200 [ 1230.083460] [] enter_state+0xd1/0x100 [ 1230.083463] [] pm_suspend+0x1b/0x60 [ 1230.083465] [] state_store+0x45/0x70 [ 1230.083467] [] kobj_attr_store+0xf/0x30 [ 1230.083471] [] sysfs_write_file+0xef/0x170 [ 1230.083476] [] vfs_write+0xb3/0x180 [ 1230.083480] [] sys_write+0x4a/0x90 [ 1230.083483] [] system_call_fastpath+0x16/0x1b [ 1230.083488] ---[ end trace 839cdd0078b3ce03 ]--- Boot with init=/bin/bash unload all modules except USBHID echo core > /sys/power/pm_test echo mem > /sys/power/state system acts like it is going to sleep, and then wakes up a few seconds later echo none > /sys/power/pm_test echo mem > /sys/power/state system goes to sleep press power to resume = fails At this point I am stumped on how to debug. This is a "modern" computer with no serial ports. It worked under Lucid, so I know it is POSSIBLE. Mobo: ASRock X58 single-socket CPU: Westmere 6 core (12 hyperthreads) 3.2 GHz RAM: 12 GB ECC Disk: sda = Intel SSD, mounted on / Disk: sdb = Intel SSD, not mounted Disk: sdc = Seagate HDD, not mounted Disk: sdd = Seagate HDD, not mounted NIC = Onboard RTL8168e/8111e Sound = EMU1212 (emu10k1, not even configured yet) Video = nVidia GeForce 7600 GT KB = PS2 (also tried USB) Mouse = USB I have not updated to a more current kernel than 3.5, but I will if there's evidence that this is resolved. Any other clever trick to try? Tim -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [x86_64 MCE] [RFC] mce.c race condition (or: when evil hacks are the only options)
On 7/12/07, Andi Kleen <[EMAIL PROTECTED]> wrote: > -- there may be other edge cases other than > this one. I'm actually surprised that this wasn't a ring buffer to start > with -- it certainly seems like it wanted to be one. The problem with a ring buffer is that it would lose old entries; but for machine checks you really want the first entries because the later ones might be just junk. Couldn't the ring just have logic to detect an overrun and stop logging until that is alleviated? Similar to what is done now. Maybe I am underestimating it.. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
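The suggestion in that last paragraph, keep a ring but detect the overrun and stop accepting records rather than overwrite, can be sketched in a few lines of C under the constraint Andi raises (the earliest records are the trustworthy ones, so they must survive). This is only an illustration of the idea, not the actual mce.c code; a real implementation would also need NMI-safe index updates, which is part of what makes the existing scheme what it is.

#include <stdio.h>

#define MCE_LOG_LEN 32

struct mce_entry { unsigned long status; /* plus the other banks/fields */ };

struct mce_ring {
    struct mce_entry entry[MCE_LOG_LEN];
    unsigned int head;      /* next slot to write */
    unsigned int tail;      /* next slot the reader will consume */
    unsigned int overflow;  /* records dropped because the ring was full */
};

/* Add a record; on overrun, keep the old entries and count the drop
 * instead of overwriting, so the first (most meaningful) errors survive. */
static int mce_ring_add(struct mce_ring *r, struct mce_entry e)
{
    unsigned int next = (r->head + 1) % MCE_LOG_LEN;

    if (next == r->tail) {      /* full: stop logging, note the overrun */
        r->overflow++;
        return -1;
    }
    r->entry[r->head] = e;
    r->head = next;
    return 0;
}

/* Reader (e.g. the mcelog daemon) drains entries, making room again. */
static int mce_ring_get(struct mce_ring *r, struct mce_entry *out)
{
    if (r->tail == r->head)
        return 0;               /* empty */
    *out = r->entry[r->tail];
    r->tail = (r->tail + 1) % MCE_LOG_LEN;
    return 1;
}

int main(void)
{
    struct mce_ring ring = {0};
    struct mce_entry e;
    int kept = 0;

    for (unsigned long i = 0; i < 40; i++)
        mce_ring_add(&ring, (struct mce_entry){ .status = i });

    while (mce_ring_get(&ring, &e))
        kept++;
    printf("kept %d entries, dropped %u\n", kept, ring.overflow);
    return 0;
}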
Re: Kconfig variable "COBALT" is not defined anywhere
That sounds correct. On 6/3/07, Robert P. J. Day <[EMAIL PROTECTED]> wrote: On Sun, 3 Jun 2007, Tim Hockin wrote: > I think the nvram is the only place left that uses CONFIG_COBALT sure, but once you remove this snippet near the top of drivers/char/nvram.c: ... # if defined(CONFIG_COBALT) #include <linux/cobalt-nvram.h> #define MACH COBALT # else #define MACH PC # endif ... then everything else COBALT-related in that file should be tossed as well, which would include stuff conditional on: #if MACH == COBALT and so on. just making sure that what you're saying is that *all* COBALT-related content in that file can be thrown out. i'll submit a patch shortly and you can pass judgment. rday -- Robert P. J. Day Linux Consulting, Training and Annoying Kernel Pedantry Waterloo, Ontario, CANADA http://fsdev.net/wiki/index.php?title=Main_Page - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
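To make the dead-code argument concrete: once no Kconfig file can set CONFIG_COBALT, MACH always ends up defined as PC, so every #if MACH == COBALT branch in the file is unreachable and can be deleted along with the snippet quoted above. Below is a minimal stand-alone illustration of the pattern; the MACH_* macro values and the printed strings are invented so the example compiles outside the kernel, whereas the real nvram.c compares bare PC/COBALT tokens.

#include <stdio.h>

/* Stand-ins for the machine-type constants used by the driver. */
#define MACH_PC     1
#define MACH_COBALT 2

/* With no Kconfig entry defining CONFIG_COBALT, the #else always wins. */
#if defined(CONFIG_COBALT)
# define MACH MACH_COBALT
#else
# define MACH MACH_PC
#endif

int main(void)
{
#if MACH == MACH_COBALT
    puts("using the Cobalt nvram layout");   /* dead once CONFIG_COBALT is gone */
#else
    puts("using the PC nvram layout");
#endif
    return 0;
}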
Re: Kconfig variable "COBALT" is not defined anywhere
I think the nvram is the only place left that uses CONFIG_COBALT On 6/3/07, Robert P. J. Day <[EMAIL PROTECTED]> wrote: On Sun, 3 Jun 2007, Tim Hockin wrote: > There were other patches which added more COBALT support, but they > were dropped or lost or whatever. > > I would not balk at having that code yanked. I never got around to > doing proper Cobalt support for modern kernels. :( > > On 6/3/07, Roland Dreier <[EMAIL PROTECTED]> wrote: > > > > > there is no Kconfig file which defines the selectable option > > > > > "COBALT", which means that this snippet from drivers/char/nvram.c: > > > > > > > > > > # if defined(CONFIG_COBALT) > > > > > #include > > > > > #define MACH COBALT > > > > > # else > > > > > #define MACH PC > > > > > # endif > > > > > never evaluates to true, therefore making > > > > > fairly useless, at least under the circumstances. > > > > > > Maybe it should be MIPS_COBALT ? > > > > > that's the first thing that occurred to me, but that header file is > > > copyright sun microsystems and says nothing about MIPS, so that didn't > > > really settle the issue. that's why i'd rather someone else resolve > > > this one way or the other. > > > > Actually, looking through the old kernel history, it looks like this > > was added by Tim Hockin's (CCed) patch "Add Cobalt Networks support to > > nvram driver". Which added this to drivers/cobalt: > > > > +bool 'Support for Cobalt Networks x86 servers' CONFIG_COBALT > > > > I guess Tim can clear up what's intended... ok, that sounds like it might be a bigger issue than just a dead CONFIG variable. if that's all it is, i can submit a patch. if it's more than that, i'll leave it to someone higher up the food chain to figure out what cobalt-related stuff should be yanked. rday -- Robert P. J. Day Linux Consulting, Training and Annoying Kernel Pedantry Waterloo, Ontario, CANADA http://fsdev.net/wiki/index.php?title=Main_Page - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Kconfig variable "COBALT" is not defined anywhere
There were other patches which added more COBALT support, but they were dropped or lost or whatever. I would not balk at having that code yanked. I never got around to doing proper Cobalt support for modern kernels. :( On 6/3/07, Roland Dreier <[EMAIL PROTECTED]> wrote: > > > there is no Kconfig file which defines the selectable option > > > "COBALT", which means that this snippet from drivers/char/nvram.c: > > > > > > # if defined(CONFIG_COBALT) > > > #include > > > #define MACH COBALT > > > # else > > > #define MACH PC > > > # endif > > > never evaluates to true, therefore making > > > fairly useless, at least under the circumstances. > > Maybe it should be MIPS_COBALT ? > that's the first thing that occurred to me, but that header file is > copyright sun microsystems and says nothing about MIPS, so that didn't > really settle the issue. that's why i'd rather someone else resolve > this one way or the other. Actually, looking through the old kernel history, it looks like this was added by Tim Hockin's (CCed) patch "Add Cobalt Networks support to nvram driver". Which added this to drivers/cobalt: +bool 'Support for Cobalt Networks x86 servers' CONFIG_COBALT I guess Tim can clear up what's intended... - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] x86_64: mcelog tolerant level cleanup
On 5/18/07, Andi Kleen <[EMAIL PROTECTED]> wrote: > * If RIPV is not set it is not safe to restart, so set the 'no way out' > flag rather than the 'kill it' flag. Why? It is not PCC. We cannot return of course, but killing isn't returning. My understanding is that the absence of RIPV indicates that it is not safe to restart, period. Not that the running *task* is not safe, but that the IP on the stack is not valid to restart at all. > * Don't panic() on correctable MCEs. The idea behind this was that if you get an exception it is always a bit risky because there are a few potential deadlocks that cannot be avoided. And normally non UC is just polled which will never cause a panic. So I don't quite see the value of this change. It will still always panic when tolerant == 0, and of course you're right correctable errors would skip over the panic() path anyway. I can roll back the "<0" part, though I don't see the difference now :) > This patch also calls nonseekable_open() in mce_open (as suggested by akpm). That should be a separate patch Andrew already sucked it into -mm - do you want me to break it out, and re-submit? > + 0: always panic on uncorrected errors, log corrected errors > + 1: panic or SIGBUS on uncorrected errors, log corrected errors > + 2: SIGBUS or log uncorrected errors, log corrected errors Just saying SIGBUS is misleading because it isn't a catchable signal. Should I change that to "kill"? Why did you remove the idle special case? Because once the other tolerant rules are clarified, it's redundant for tolerant < 2, and I think it's a bad special case for tolerant == 2, and it's definitely wrong for tolerant == 3. Shall I re-roll? - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
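For readers following the thread, the tolerant levels under discussion amount to a small decision table. The sketch below encodes only the per-level policy quoted above (0 always panics on uncorrected errors, 1 kills the affected task where possible and panics otherwise, 2 kills or merely logs, 3 never panics or kills); it is an illustration of the proposed semantics, not the code in mce.c, which also has to account for RIPV, PCC, and the other "no way out" conditions.

#include <stdio.h>

enum mce_action { MCE_LOG_ONLY, MCE_KILL_TASK, MCE_PANIC };

/*
 * tolerant      : the sysfs knob being discussed (0..3)
 * uncorrected   : hardware could not correct the error
 * user_killable : the error can be pinned on a running user task
 */
static enum mce_action mce_decide(int tolerant, int uncorrected,
                                  int user_killable)
{
    if (!uncorrected)
        return MCE_LOG_ONLY;    /* corrected errors are only logged */

    switch (tolerant) {
    case 0:  return MCE_PANIC;                                    /* always panic */
    case 1:  return user_killable ? MCE_KILL_TASK : MCE_PANIC;    /* panic or kill */
    case 2:  return user_killable ? MCE_KILL_TASK : MCE_LOG_ONLY; /* kill or log */
    default: return MCE_LOG_ONLY;                                 /* 3: never panic or kill */
    }
}

int main(void)
{
    static const char *names[] = { "log", "kill", "panic" };

    /* Uncorrected error hitting a killable user task, at each level. */
    for (int t = 0; t <= 3; t++)
        printf("tolerant=%d -> %s\n", t, names[mce_decide(t, 1, 1)]);
    return 0;
}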