Re: [v8 0/4] cgroup-aware OOM killer
In the example above:

    root
    /  \
   A    D
  / \
 B   C

Does oom_group allow me to express "compare A and D; if A is chosen compare B and C; kill the loser"? As I understand the proposal (from reading thread, not patch) it does not. On Mon, Oct 2, 2017 at 12:56 PM, Michal Hocko wrote: > On Mon 02-10-17 12:45:18, Shakeel Butt wrote: >> > I am sorry to cut the rest of your proposal because it simply goes over >> > the scope of the proposed solution while the usecase you are mentioning >> > is still possible. If we want to compare intermediate nodes (which seems >> > to be the case) then we can always provide a knob to opt-in - be it your >> > oom_gang or others. >> >> In the Roman's proposed solution we can already force the comparison >> of intermediate nodes using 'oom_group', I am just requesting to >> separate the killall semantics from it. > > oom_group _is_ about killall semantic. And comparing killable entities > is just a natural thing to do. So I am not sure what you mean > > -- > Michal Hocko > SUSE Labs
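For readers trying to follow the semantics, here is an illustrative sketch (Python writing cgroup control files) of what the knob in question would and would not express for the hierarchy above. The paths are hypothetical, and memory.oom_group is the name used by this patch series; the interface that eventually landed upstream is spelled memory.oom.group.

    # Illustrative sketch only: hypothetical paths, knob name from this series.
    from pathlib import Path

    CGROOT = Path("/sys/fs/cgroup")   # assumed cgroup v2 mount point

    def mark_oom_group(memcg: str) -> None:
        """If this memcg is chosen as the OOM victim, kill every task inside it."""
        (CGROOT / memcg / "memory.oom_group").write_text("1\n")

    # Marking A ties two things together: A is compared as a single unit
    # (against D), and if A is chosen, everything under A is killed. The
    # policy asked about above -- compare A vs D, then B vs C within A, and
    # kill only the loser -- is not expressible, because the comparison
    # behavior and the kill-all behavior come from the same knob.
    mark_oom_group("A")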
Re: [v8 0/4] cgroup-aware OOM killer
On Wed, Sep 27, 2017 at 9:23 AM, Roman Gushchin <g...@fb.com> wrote: > On Wed, Sep 27, 2017 at 08:35:50AM -0700, Tim Hockin wrote: >> On Wed, Sep 27, 2017 at 12:43 AM, Michal Hocko <mho...@kernel.org> wrote: >> > On Tue 26-09-17 20:37:37, Tim Hockin wrote: >> > [...] >> >> I feel like David has offered examples here, and many of us at Google >> >> have offered examples as long ago as 2013 (if I recall) of cases where >> >> the proposed heuristic is EXACTLY WRONG. >> > >> > I do not think we have discussed anything resembling the current >> > approach. And I would really appreciate some more examples where >> > decisions based on leaf nodes would be EXACTLY WRONG. >> > >> >> We need OOM behavior to kill in a deterministic order configured by >> >> policy. >> > >> > And nobody is objecting to this usecase. I think we can build a priority >> > policy on top of leaf-based decision as well. The main point we are >> > trying to sort out here is a reasonable semantic that would work for >> > most workloads. Sibling based selection will simply not work on those >> > that have to use deeper hierarchies for organizational purposes. I >> > haven't heard a counter argument for that example yet. >> > > Hi, Tim! > >> We have a priority-based, multi-user cluster. That cluster runs a >> variety of work, including critical things like search and gmail, as >> well as non-critical things like batch work. We try to offer our >> users an SLA around how often they will be killed by factors outside >> themselves, but we also want to get higher utilization. We know for a >> fact (data, lots of data) that most jobs have spare memory capacity, >> set aside for spikes or simply because accurate sizing is hard. We >> can sell "guaranteed" resources to critical jobs, with a high SLA. We >> can sell "best effort" resources to non-critical jobs with a low SLA. >> We achieve much better overall utilization this way. > > This is well understood. > >> >> I need to represent the priority of these tasks in a way that gives me >> a very strong promise that, in case of system OOM, the non-critical >> jobs will be chosen before the critical jobs. Regardless of size. >> Regardless of how many non-critical jobs have to die. I'd rather kill >> *all* of the non-critical jobs than a single critical job. Size of >> the process or cgroup is simply not a factor, and honestly given 2 >> options of equal priority I'd say age matters more than size. >> >> So concretely I have 2 first-level cgroups, one for "guaranteed" and >> one for "best effort" classes. I always want to kill from "best >> effort", even if that means killing 100 small cgroups, before touching >> "guaranteed". >> >> I apologize if this is not as thorough as the rest of the thread - I >> am somewhat out of touch with the guts of it all these days. I just >> feel compelled to indicate that, as a historical user (via Google >> systems) and current user (via Kubernetes), some of the assertions >> being made here do not ring true for our very real use cases. I >> desperately want cgroup-aware OOM handing, but it has to be >> policy-based or it is just not useful to us. > > A policy-based approach was suggested by Michal at a very beginning of > this discussion. Although nobody had any strong objections against it, > we've agreed that this is out of scope of this patchset. > > The idea of this patchset is to introduce an ability to select a memcg > as an OOM victim with the following optional killing of all belonging tasks. 
> I believe, it's absolutely mandatory for _any_ further development > of the OOM killer, which wants to deal with memory cgroups as OOM entities. > > If you think that it makes impossible to support some use cases in the future, > let's discuss it. Otherwise, I'd prefer to finish this part of the work, > and proceed to the following improvements on top of it. > > Thank you! I am 100% in favor of killing whole groups. We want that too. I just needed to express disagreement with statements that size-based decisions could not produce bad results. They can and do.
Re: [v8 0/4] cgroup-aware OOM killer
On Wed, Sep 27, 2017 at 12:43 AM, Michal Hocko <mho...@kernel.org> wrote: > On Tue 26-09-17 20:37:37, Tim Hockin wrote: > [...] >> I feel like David has offered examples here, and many of us at Google >> have offered examples as long ago as 2013 (if I recall) of cases where >> the proposed heuristic is EXACTLY WRONG. > > I do not think we have discussed anything resembling the current > approach. And I would really appreciate some more examples where > decisions based on leaf nodes would be EXACTLY WRONG. > >> We need OOM behavior to kill in a deterministic order configured by >> policy. > > And nobody is objecting to this usecase. I think we can build a priority > policy on top of leaf-based decision as well. The main point we are > trying to sort out here is a reasonable semantic that would work for > most workloads. Sibling based selection will simply not work on those > that have to use deeper hierarchies for organizational purposes. I > haven't heard a counter argument for that example yet. We have a priority-based, multi-user cluster. That cluster runs a variety of work, including critical things like search and gmail, as well as non-critical things like batch work. We try to offer our users an SLA around how often they will be killed by factors outside themselves, but we also want to get higher utilization. We know for a fact (data, lots of data) that most jobs have spare memory capacity, set aside for spikes or simply because accurate sizing is hard. We can sell "guaranteed" resources to critical jobs, with a high SLA. We can sell "best effort" resources to non-critical jobs with a low SLA. We achieve much better overall utilization this way. I need to represent the priority of these tasks in a way that gives me a very strong promise that, in case of system OOM, the non-critical jobs will be chosen before the critical jobs. Regardless of size. Regardless of how many non-critical jobs have to die. I'd rather kill *all* of the non-critical jobs than a single critical job. Size of the process or cgroup is simply not a factor, and honestly given 2 options of equal priority I'd say age matters more than size. So concretely I have 2 first-level cgroups, one for "guaranteed" and one for "best effort" classes. I always want to kill from "best effort", even if that means killing 100 small cgroups, before touching "guaranteed". I apologize if this is not as thorough as the rest of the thread - I am somewhat out of touch with the guts of it all these days. I just feel compelled to indicate that, as a historical user (via Google systems) and current user (via Kubernetes), some of the assertions being made here do not ring true for our very real use cases. I desperately want cgroup-aware OOM handling, but it has to be policy-based or it is just not useful to us. Thanks. Tim
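The two-class layout described above maps directly onto two top-level cgroups; what is missing is any kernel knob that says "always pick victims from this class first." The sketch below shows the layout in ordinary cgroup v2 terms, plus a purely hypothetical priority file standing in for the policy interface being requested; nothing like it exists in this patch series.

    # Sketch of the "guaranteed" vs "best effort" split. Creating the cgroups
    # is ordinary cgroup v2 usage; the memory.oom_priority file is hypothetical
    # and only illustrates the kind of policy knob being asked for.
    from pathlib import Path

    root = Path("/sys/fs/cgroup")
    for klass in ("guaranteed", "best-effort"):
        (root / klass).mkdir(exist_ok=True)

    # Hypothetical policy: on system OOM, the lower-priority class is always
    # drained first, regardless of size, so "best-effort" is exhausted before
    # "guaranteed" is ever touched.
    # (root / "guaranteed"  / "memory.oom_priority").write_text("1000\n")
    # (root / "best-effort" / "memory.oom_priority").write_text("0\n")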
Re: [v8 0/4] cgroup-aware OOM killer
I'm excited to see this being discussed again - it's been years since the last attempt. I've tried to stay out of the conversation, but I feel obligated to say something and then go back to lurking. On Tue, Sep 26, 2017 at 10:26 AM, Johannes Weiner wrote: > On Tue, Sep 26, 2017 at 03:30:40PM +0200, Michal Hocko wrote: >> On Tue 26-09-17 13:13:00, Roman Gushchin wrote: >> > On Tue, Sep 26, 2017 at 01:21:34PM +0200, Michal Hocko wrote: >> > > On Tue 26-09-17 11:59:25, Roman Gushchin wrote: >> > > > On Mon, Sep 25, 2017 at 10:25:21PM +0200, Michal Hocko wrote: >> > > > > On Mon 25-09-17 19:15:33, Roman Gushchin wrote: >> > > > > [...] >> > > > > > I'm not against this model, as I've said before. It feels logical, >> > > > > > and will work fine in most cases. >> > > > > > >> > > > > > In this case we can drop any mount/boot options, because it >> > > > > > preserves >> > > > > > the existing behavior in the default configuration. A big >> > > > > > advantage. >> > > > > >> > > > > I am not sure about this. We still need an opt-in, ragardless, >> > > > > because >> > > > > selecting the largest process from the largest memcg != selecting the >> > > > > largest task (just consider memcgs with many processes example). >> > > > >> > > > As I understand Johannes, he suggested to compare individual processes >> > > > with >> > > > group_oom mem cgroups. In other words, always select a killable entity >> > > > with >> > > > the biggest memory footprint. >> > > > >> > > > This is slightly different from my v8 approach, where I treat leaf >> > > > memcgs >> > > > as indivisible memory consumers independent on group_oom setting, so >> > > > by default I'm selecting the biggest task in the biggest memcg. >> > > >> > > My reading is that he is actually proposing the same thing I've been >> > > mentioning. Simply select the biggest killable entity (leaf memcg or >> > > group_oom hierarchy) and either kill the largest task in that entity >> > > (for !group_oom) or the whole memcg/hierarchy otherwise. >> > >> > He wrote the following: >> > "So I'm leaning toward the second model: compare all oomgroups and >> > standalone tasks in the system with each other, independent of the >> > failed hierarchical control structure. Then kill the biggest of them." >> >> I will let Johannes to comment but I believe this is just a >> misunderstanding. If we compared only the biggest task from each memcg >> then we are basically losing our fairness objective, aren't we? > > Sorry about the confusion. > > Yeah I was making the case for what Michal proposed, to kill the > biggest terminal consumer, which is either a task or an oomgroup. > > You'd basically iterate through all the tasks and cgroups in the > system and pick the biggest task that isn't in an oom group or the > biggest oom group and then kill that. > > Yeah, you'd have to compare the memory footprints of tasks with the > memory footprints of cgroups. These aren't defined identically, and > tasks don't get attributed every type of allocation that a cgroup > would. But it should get us in the ballpark, and I cannot picture a > scenario where this would lead to a completely undesirable outcome. That last sentence: > I cannot picture a scenario where this would lead to a completely undesirable > outcome. I feel like David has offered examples here, and many of us at Google have offered examples as long ago as 2013 (if I recall) of cases where the proposed heuristic is EXACTLY WRONG. We need OOM behavior to kill in a deterministic order configured by policy.
Sometimes, I would literally prefer to kill every other cgroup before killing "the big one". The policy is *all* that matters for shared clusters of varying users and priorities. We did this in Borg, and it works REALLY well. Has for years. Now that the world is adopting Kubernetes we need it again, only it's much harder to carry a kernel patch in this case.
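For readers skimming the thread, the selection rule Johannes describes in the quoted text can be written down in a few lines; this is a userspace pseudo-model, not the kernel patch, and the footprint numbers are assumed to be precomputed. It also makes the objection concrete: the only criterion in the final max() is size.

    # Sketch of the "biggest terminal consumer" rule quoted above: every
    # oom_group memcg and every task outside such a group is one killable
    # entity; kill the largest. Inputs are assumed precomputed footprints.
    def pick_victim(standalone_tasks, oom_group_memcgs):
        # standalone_tasks:  [("task:1234", bytes_used), ...]   tasks not under any oom_group
        # oom_group_memcgs:  [("memcg:/best-effort/job7", bytes_used), ...]
        candidates = [(size, name) for name, size in standalone_tasks]
        candidates += [(size, name) for name, size in oom_group_memcgs]
        if not candidates:
            return None
        size, victim = max(candidates)   # size is the only criterion
        return victim                    # a task dies alone; a memcg dies whole

Nothing in that comparison lets a small "guaranteed" job outrank a large "best effort" one, which is exactly the gap being argued about here.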
Re: [PATCH RFC 0/2] add nproc cgroup subsystem
On Feb 28, 2015 2:50 PM, "Tejun Heo" wrote: > > On Sat, Feb 28, 2015 at 02:26:58PM -0800, Tim Hockin wrote: > > Wow, so much anger. I'm not even sure how to respond, so I'll just > > say this and sign off. All I want is a better, friendlier, more > > useful system overall. We clearly have different ways of looking at > > the problem. > > Can you communicate anything w/o passive aggression? If you have a > technical point, just state that. Can you at least agree that we > shouldn't be making design decisions based on 16bit pid_t? Hmm, I have screwed this thread up, I think. I've made some remarks that did not come through with the proper tongue-in-cheek slant. I'm not being passive aggressive - we DO look at this problem differently. OF COURSE we should not make decisions based on ancient artifacts of history. My point was that there are secondary considerations here - PIDs are more than just the memory that backs them. They _ARE_ a constrained resource, and you shouldn't assume the constraint is just physical memory. It is a piece of policy that is outside the control of the kernel proper - we handed those keys to userspace along time ago. Given that, I believe and have believed that the solution should model the problem as the user perceives it - limiting PIDs - rather than attaching to a solution-by-proxy. Yes a solution here partially overlaps with kmemcg, but I don't think that is a significant problem. They are different policies governing behavior that may result in the same condition, but for very different reasons. I do not think that is particularly bad for overall comprehension, and I think the fact that this popped up yet again indicates the existence of some nugget of user experience that is worth paying consideration to. I appreciate your promised consideration through a slightly refocused lens. I will go back to my cave and do something I hope is more productive and less antagonistic. I did not mean to bring out so much vitriol. Tim -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH RFC 0/2] add nproc cgroup subsystem
On Sat, Feb 28, 2015 at 8:57 AM, Tejun Heo wrote: > > On Sat, Feb 28, 2015 at 08:48:12AM -0800, Tim Hockin wrote: > > I am sorry that real-user problems are not perceived as substantial. This > > was/is a real issue for us. Being in limbo for years on end might not be a > > technical point, but I do think it matters, and that was my point. > > It's a problem which is localized to you and caused by the specific > problems of your setup. This isn't a wide-spread problem at all and > the world doesn't revolve around you. If your setup is so messed up > as to require sticking to 16bit pids, handle that locally. If > something at larger scale eases that handling, you get lucky. If not, > it's *your* predicament to deal with. The rest of the world doesn't > exist to wipe your ass. Wow, so much anger. I'm not even sure how to respond, so I'll just say this and sign off. All I want is a better, friendlier, more useful system overall. We clearly have different ways of looking at the problem. No antagonism intended Tim
Re: [PATCH RFC 0/2] add nproc cgroup subsystem
On Fri, Feb 27, 2015 at 9:45 AM, Tejun Heo wrote: > On Fri, Feb 27, 2015 at 09:25:10AM -0800, Tim Hockin wrote: >> > In general, I'm pretty strongly against adding controllers for things >> > which aren't fundamental resources in the system. What's next? Open >> > files? Pipe buffer? Number of flocks? Number of session leaders or >> > program groups? >> >> Yes to some or all of those. We do exactly this internally and it has >> greatly added to the stability of our overall container management >> system. and while you have been telling everyone to wait for kmemcg, >> we have had an extra 3+ years of stability. > > Yeah, good job. I totally get why kernel part of memory consumption > needs protection. I'm not arguing against that at all. You keep shifting the focus to be about memory, but that's not what people are asking for. You're letting the desire for a perfect solution (which is years late) block good solutions that exist NOW. >> > If you want to prevent a certain class of jobs from exhausting a given >> > resource, protecting that resource is the obvious thing to do. >> >> I don't follow your argument - isn't this exactly what this patch set >> is doing - protecting resources? > > If you have proper protection over kernel memory consumption, this is > completely covered because memory is the fundamental resource here. > Controlling distribution of those fundamental resources is what > cgroups are primarily about. You say that's what cgroups are about, but it's not at all obvious that you are right. What users, admins, systems people want is building blocks that are usable and make sense. Limiting kernel memory is NOT the logical building block, here. It's not something people can reason about or quantify easily. If you need to implement the interfaces in terms of memory, go nuts, but making users think like that is just not right. >> > Wasn't it like a year ago? Yeah, it's taking longer than everybody >> > hoped but seriously kmemcg reclaimer just got merged and also did the >> > new memcg interface which will tie kmemcg and memcg together. >> >> By my email it was almost 2 years ago, and that was the second or >> third incarnation of this patch. > > Again, I agree this is taking a while. Memory people had to retool > the whole reclamation path to make this work, which is the pattern > being repeated across the different controllers - we're refactoring a > lot of infrastructure code so that resource control can integrate with > the regular operation of the kernel, which BTW is what we should have > been doing from the beginning. > > If your complaint is that this is taking too long, I hear you, and > there's a certain amount of validity in arguing that upstreaming a > temporary measure is the better trade-off, but the rationale for nproc > (or nfds, or virtual memory, whatever) has been pretty weak otherwise. At least 3 or 4 people have INDEPENDENTLY decided this is what is causing them pain, tried to fix it, and invested the time to send a patch - that says it is actually a thing. There exists a problem that you are disallowing to be fixed. Do you recognize that users are experiencing pain? Why do you hate your users? :) > And as for the different incarnations of this patchset. Reposting the > same stuff repeatedly doesn't really change anything. Why would it? Because reasonable people might survey the ecosystem and say "humm, things have changed over the years - isolation has become a pretty serious topic". Or maybe they hope that you'll finally agree that fixing the problem NOW is worthwhile, even if the solution is imperfect, and that a more perfect solution will arrive. >> >> Something like this is long overdue, IMO, and is still more >> appropriate and obvious than kmemcg anyway. >> > >> > Thanks for chiming in again but if you aren't bringing out anything >> > new to the table (I don't remember you doing that last time either), >> > I'm not sure why the decision would be different this time. >> >> I'm just vocalizing my support for this idea in defense of practical >> solutions that work NOW instead of "engineering ideals" that never >> actually arrive. >> >> As containers take the server world by storm, stuff like this gets >> more and more important. > > Again, protection of kernel side memory consumption is important. > There's no question about that. As for the never-arriving part, well, > it is arriving. If you still can't believe, just take a look at the code. Are you willing to put a drop-dead date on it? If we don't have kmemcg working well enough to _actually_ bound PID usage and FD usage by, say, June 1st, will you then accept a patch to this effect? If the answer is no, then I have zero faith that it's coming any time soon - I heard this 2 years ago. I believed you then. I see further downthread that you said you'll think about it. Thank you. Just because our use cases are not normal does not mean we're not valid :) Tim
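As a concrete footnote: the capability being argued for here is what the pids cgroup controller that was merged later provides, via pids.max and pids.current. A minimal usage sketch, assuming a cgroup v2 mount at /sys/fs/cgroup and a hypothetical cgroup name and pid:

    # Per-cgroup process-count limiting with the pids controller.
    from pathlib import Path

    root = Path("/sys/fs/cgroup")
    (root / "cgroup.subtree_control").write_text("+pids\n")  # enable it for children

    job = root / "batch-job"                 # hypothetical cgroup
    job.mkdir(exist_ok=True)
    (job / "pids.max").write_text("512\n")   # forks beyond 512 tasks fail with EAGAIN

    (job / "cgroup.procs").write_text("1234\n")  # move a (hypothetical) process in
    print((job / "pids.current").read_text().strip())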
Re: [PATCH RFC 0/2] add nproc cgroup subsystem
On Fri, Feb 27, 2015 at 9:06 AM, Tejun Heo wrote: > Hello, > > On Fri, Feb 27, 2015 at 11:42:10AM -0500, Austin S Hemmelgarn wrote: >> Kernel memory consumption isn't the only valid reason to want to limit the >> number of processes in a cgroup. Limiting the number of processes is very >> useful to ensure that a program is working correctly (for example, the NTP >> daemon should (usually) have an _exact_ number of children if it is >> functioning correctly, and rpcbind shouldn't (AFAIK) ever have _any_ >> children), to prevent PID number exhaustion, to head off DoS attacks against >> forking network servers before they get to the point of causing kmem >> exhaustion, and to limit the number of processes in a cgroup that uses lots >> of kernel memory very infrequently. > > All the use cases you're listing are extremely niche and can be > trivially achieved without introducing another cgroup controller. Not > only that, they're actually pretty silly. Let's say NTP daemon is > misbehaving (or its code changed w/o you knowing or there are corner > cases which trigger extremely infrequently). What do you exactly > achieve by rejecting its fork call? It's just adding another > variation to the misbehavior. It was misbehaving before and would now > be continuing to misbehave after a failed fork. > > In general, I'm pretty strongly against adding controllers for things > which aren't fundamental resources in the system. What's next? Open > files? Pipe buffer? Number of flocks? Number of session leaders or > program groups? Yes to some or all of those. We do exactly this internally and it has greatly added to the stability of our overall container management system. and while you have been telling everyone to wait for kmemcg, we have had an extra 3+ years of stability. > If you want to prevent a certain class of jobs from exhausting a given > resource, protecting that resource is the obvious thing to do. I don't follow your argument - isn't this exactly what this patch set is doing - protecting resources? > Wasn't it like a year ago? Yeah, it's taking longer than everybody > hoped but seriously kmemcg reclaimer just got merged and also did the > new memcg interface which will tie kmemcg and memcg together. By my email it was almost 2 years ago, and that was the second or third incarnation of this patch. >> Something like this is long overdue, IMO, and is still more >> appropriate and obvious than kmemcg anyway. > > Thanks for chiming in again but if you aren't bringing out anything > new to the table (I don't remember you doing that last time either), > I'm not sure why the decision would be different this time. I'm just vocalizing my support for this idea in defense of practical solutions that work NOW instead of "engineering ideals" that never actually arrive. As containers take the server world by storm, stuff like this gets more and more important. Tim -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH RFC 0/2] add nproc cgroup subsystem
On Fri, Feb 27, 2015 at 8:42 AM, Austin S Hemmelgarn wrote: > On 2015-02-27 06:49, Tejun Heo wrote: >> >> Hello, >> >> On Mon, Feb 23, 2015 at 02:08:09PM +1100, Aleksa Sarai wrote: >>> >>> The current state of resource limitation for the number of open >>> processes (as well as the number of open file descriptors) requires you >>> to use setrlimit(2), which means that you are limited to resource >>> limiting process trees rather than resource limiting cgroups (which is >>> the point of cgroups). >>> >>> There was a patch to implement this in 2011[1], but that was rejected >>> because it implemented a general-purpose rlimit subsystem -- which meant >>> that you couldn't control distinct resource limits in different >>> heirarchies. This patch implements a resource controller *specifically* >>> for the number of processes in a cgroup, overcoming this issue. >>> >>> There has been a similar attempt to implement a resource controller for >>> the number of open file descriptors[2], which has not been merged >>> becasue the reasons were dubious. Merely from a "sane interface" >>> perspective, it should be possible to utilise cgroups to do such >>> rudimentary resource management (which currently only exists for process >>> trees). >> >> >> This isn't a proper resource to control. kmemcg just grew proper >> reclaim support and will be useable to control kernel side of memory >> consumption. I was told that the plan was to use kmemcg - but I was told that YEARS AGO. In the mean time we all either do our own thing or we do nothing and suffer. Something like this is long overdue, IMO, and is still more appropriate and obvious than kmemcg anyway. >> Thanks. >> > Kernel memory consumption isn't the only valid reason to want to limit the > number of processes in a cgroup. Limiting the number of processes is very > useful to ensure that a program is working correctly (for example, the NTP > daemon should (usually) have an _exact_ number of children if it is > functioning correctly, and rpcbind shouldn't (AFAIK) ever have _any_ > children), to prevent PID number exhaustion, to head off DoS attacks against > forking network servers before they get to the point of causing kmem > exhaustion, and to limit the number of processes in a cgroup that uses lots > of kernel memory very infrequently. > > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Cache Allocation Technology Design
On Thu, Oct 30, 2014 at 10:12 AM, Tejun Heo wrote: > On Thu, Oct 30, 2014 at 07:58:34AM -0700, Tim Hockin wrote: >> Another reason unified hierarchy is a bad model. > > Things wrong with this message. > > 1. Top posted. It isn't clear which part you're referring to and this > was pointed out to you multiple times in the past. I occasionally fall victim to gmail's defaults. I apologize for that. > 2. No real thoughts or technical details. Maybe you had some in your > head but nothing was elaborated. This forces me to guess what you > had on mind when you produced the above sentence and of course me > not being you this takes a considerable amount of brain cycle and > I'd still end up with multiple alternative scenarios that I'll have > to cover. I think the conversation is well enough understood by the people for whom this bit of snark was intended that reading my mind was not that hard. That said, it was overly snark-tastic, and sent in haste. My point, of course, was that here is an example of something which maps very well to the idea of cgroups (a set of processes that share some controller) but DOES NOT map well to the unified hierarchy model. It must be managed more carefully than arbitrary hierarchy can enforce. The result is the mish-mash of workarounds proposed in this thread to force it into arbitrary hierarchy mode, including this no-win situation of running out of hardware resources - it is going to fail. Will it fail at cgroup creation time (doesn't scale to arbitrary hierarchy) or will it fail when you add processes to it (awkward at best) or will it fail when you flip some control file to enable the feature? I know the unified hierarchy ship has sailed, so there's no non-snarky way to argue the point any further, but this is such an obvious case, to me, that I had to say something. Tim
Re: [PATCH RFC 0/5] Virtual Memory Resource Controller for cgroups
How is this different from RLIMIT_AS? You specifically mentioned it earlier but you don't explain how this is different. From my perspective, this is pointless. There's plenty of perfectly correct software that mmaps files without concern for VSIZE, because they never fault most of those pages in. From my observations it is not generally possible to predict an average VSIZE limit that would satisfy your concerns *and* not kill lots of valid apps. It sounds like what you want is to limit or even disable swap usage. Given your example, your hypothetical user would probably be better off getting an OOM kill early so she can fix her job spec to request more memory. On Wed, Jul 9, 2014 at 12:52 AM, Vladimir Davydov wrote: > On Thu, Jul 03, 2014 at 04:48:16PM +0400, Vladimir Davydov wrote: >> Hi, >> >> Typically, when a process calls mmap, it isn't given all the memory pages it >> requested immediately. Instead, only its address space is grown, while the >> memory pages will be actually allocated on the first use. If the system fails >> to allocate a page, it will have no choice except invoking the OOM killer, >> which may kill this or any other process. Obviously, it isn't the best way of >> telling the user that the system is unable to handle his request. It would be >> much better to fail mmap with ENOMEM instead. >> >> That's why Linux has the memory overcommit control feature, which accounts >> and >> limits VM size that may contribute to mem+swap, i.e. private writable >> mappings >> and shared memory areas. However, currently it's only available system-wide, >> and there's no way of avoiding OOM in cgroups. >> >> This patch set is an attempt to fill the gap. It implements the resource >> controller for cgroups that accounts and limits address space allocations >> that >> may contribute to mem+swap. >> >> The interface is similar to the one of the memory cgroup except it controls >> virtual memory usage, not actual memory allocation: >> >> vm.usage_in_bytes current vm usage of processes inside cgroup >> (read-only) >> >> vm.max_usage_in_bytes max vm.usage_in_bytes, can be reset by >> writing 0 >> >> vm.limit_in_bytes vm.usage_in_bytes must be <= >> vm.limite_in_bytes; >> allocations that hit the limit will be failed >> with ENOMEM >> >> vm.failcnt number of times the limit was hit, can be >> reset >> by writing 0 >> >> In future, the controller can be easily extended to account for locked pages >> and shmem. > > Any thoughts on this? > > Thanks.
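To make the comparison concrete, this is the existing per-process mechanism being pointed at: RLIMIT_AS caps one process's virtual address space, so allocations past the limit fail with ENOMEM rather than waking the OOM killer, whereas the proposed vm.* files would cap the summed address space of an entire cgroup. The limit value below is an arbitrary example.

    # Per-process address-space cap via RLIMIT_AS; the proposed per-cgroup
    # vm.* interface differs in scope, not in mechanism.
    import resource

    ONE_GIB = 1 << 30
    resource.setrlimit(resource.RLIMIT_AS, (ONE_GIB, ONE_GIB))  # (soft, hard)

    soft, hard = resource.getrlimit(resource.RLIMIT_AS)
    print("address-space cap:", soft, hard)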
Re: [PATCH 2/2] net: Implement SO_PEERCGROUP
I don't buy that it is not practical. Not convenient, maybe. Not clean, sure. But it is practical - it uses mechanisms that exist on all kernels today. That is a win, to me. On Thu, Mar 13, 2014 at 10:58 AM, Simo Sorce wrote: > On Thu, 2014-03-13 at 10:55 -0700, Andy Lutomirski wrote: >> >> So give each container its own unix socket. Problem solved, no? > > Not really practical if you have hundreds of containers. > > Simo. > > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
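The "mechanisms that exist on all kernels today" are per-container listening sockets combined with ordinary peer credentials; SO_PEERCRED already tells the server who connected. A sketch of that existing lookup on an accepted AF_UNIX connection (illustrative, not from the SO_PEERCGROUP patch):

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/socket.h>

/* Print the pid/uid/gid of the peer on an accepted AF_UNIX connection.
 * These credentials are captured at connect() time by the kernel. */
static int log_peer_creds(int connfd)
{
        struct ucred cred;
        socklen_t len = sizeof(cred);

        if (getsockopt(connfd, SOL_SOCKET, SO_PEERCRED, &cred, &len) < 0)
                return -1;

        printf("peer pid=%ld uid=%u gid=%u\n",
               (long)cred.pid, (unsigned)cred.uid, (unsigned)cred.gid);
        return 0;
}

With one socket per container, the socket a request arrives on identifies the container and SO_PEERCRED identifies the process - which is the point being made above.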
Re: [PATCH 2/2] net: Implement SO_PEERCGROUP
In some sense a cgroup is a pgrp that mere mortals can't escape. Why not just do something like that? root can set this "container id" or "job id" on your process when it first starts (e.g. docker sets it on your container process) or even make a cgroup that sets this for all processes in that cgroup. ints are better than strings anyway. On Thu, Mar 13, 2014 at 10:25 AM, Andy Lutomirski wrote: > On Thu, Mar 13, 2014 at 9:33 AM, Simo Sorce wrote: >> On Thu, 2014-03-13 at 11:00 -0400, Vivek Goyal wrote: >>> On Thu, Mar 13, 2014 at 10:55:34AM -0400, Simo Sorce wrote: >>> >>> [..] >>> > > > This might not be quite as awful as I thought. At least you're >>> > > > looking up the cgroup at connection time instead of at send time. >>> > > > >>> > > > OTOH, this is still racy -- the socket could easily outlive the cgroup >>> > > > that created it. >>> > > >>> > > That's a good point. What guarantees that previous cgroup was not >>> > > reassigned to a different container. >>> > > >>> > > What if a process A opens the connection with sssd. Process A passes the >>> > > file descriptor to a different process B in a differnt container. >>> > >>> > Stop right here. >>> > If the process passes the fd it is not my problem anymore. >>> > The process can as well just 'proxy' all the information to another >>> > process. >>> > >>> > We just care to properly identify the 'original' container, we are not >>> > in the business of detecting malicious behavior. That's something other >>> > mechanism need to protect against (SELinux or other LSMs, normal >>> > permissions, capabilities, etc...). >>> > >>> > > Process A exits. Container gets removed from system and new one gets >>> > > launched which uses same cgroup as old one. Now process B sends a new >>> > > request and SSSD will serve it based on policy of newly launched >>> > > container. >>> > > >>> > > This sounds very similar to pid race where socket/connection will >>> > > outlive >>> > > the pid. >>> > >>> > Nope, completely different. >>> > >>> >>> I think you missed my point. Passing file descriptor is not the problem. >>> Problem is reuse of same cgroup name for a different container while >>> socket lives on. And it is same race as reuse of a pid for a different >>> process. >> >> The cgroup name should not be reused of course, if userspace does that, >> it is userspace's issue. cgroup names are not a constrained namespace >> like pids which force the kernel to reuse them for processes of a >> different nature. >> > > You're proposing a feature that will enshrine cgroups into the API use > by non-cgroup-controlling applications. I don't think that anyone > thinks that cgroups are pretty, so this is an unfortunate thing to > have to do. > > I've suggested three different ways that your goal could be achieved > without using cgroups at all. You haven't really addressed any of > them. > > In order for something like this to go into the kernel, I would expect > a real use case and a justification for why this is the right way to > do it. > > "Docker containers can be identified by cgroup path" is completely > unconvincing to me. 
> > --Andy > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
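For context, the pure-userspace alternative debated in this subthread is to map the peer's pid (obtained via SO_PEERCRED) to its cgroup path through /proc/<pid>/cgroup. A sketch of that lookup follows; it is exactly the step that races with pid and cgroup-name reuse once the peer exits, which is the concern raised above:

#include <stdio.h>
#include <sys/types.h>

/* Print the cgroup membership of a peer process.  Each line of the proc
 * file has the form "hierarchy-id:controllers:/cgroup/path". */
static int print_peer_cgroups(pid_t pid)
{
        char path[64], line[4096];
        FILE *f;

        snprintf(path, sizeof(path), "/proc/%ld/cgroup", (long)pid);
        f = fopen(path, "r");
        if (!f)
                return -1;      /* the peer may already have exited */

        while (fgets(line, sizeof(line), f))
                fputs(line, stdout);

        fclose(f);
        return 0;
}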
Re: [patch 7/8] mm, memcg: allow processes handling oom notifications to access reserves
On Thu, Dec 12, 2013 at 11:23 AM, Tejun Heo wrote: > Hello, Tim. > > On Thu, Dec 12, 2013 at 10:42:20AM -0800, Tim Hockin wrote: >> Yeah sorry. Replying from my phone is awkward at best. I know better :) > > Heh, sorry about being bitchy. :) > >> In my mind, the ONLY point of pulling system-OOM handling into >> userspace is to make it easier for crazy people (Google) to implement >> bizarre system-OOM policies. Example: > > I think that's one of the places where we largely disagree. If at all Just to be clear - I say this because it doesn't feel right to impose my craziness on others, and it sucks when we try and are met with "you're crazy, go away". And you have to admit that happens to Google. :) Punching an escape valve that allows us to be crazy without hurting anyone else sounds ideal, IF and ONLY IF that escape valve is itself maintainable. If the escape valve is userspace it's REALLY easy to iterate on our craziness. If it is kernel space, it's somewhat less easy, but not impossible. > possible, I'd much prefer google's workload to be supported inside the > general boundaries of the upstream kernel without having to punch a > large hole in it. To me, the general development history of memcg in > general and this thread in particular seem to epitomize why it is a > bad idea to have isolated, large and deep "crazy" use cases. Punching > the initial hole is the easy part; however, we all are quite limited > in anticpating future needs and sooner or later that crazy use case is > bound to evolve further towards the isolated extreme it departed > towards and require more and larger holes and further contortions to > accomodate such progress. > > The concern I have with the suggested solution is not necessarily that > it's technically complex than it looks like on the surface - I'm sure > it can be made to work one way or the other - but that it's a fairly > large step toward an isolated extreme which memcg as a project > probably should not head toward. > > There sure are cases where such exceptions can't be avoided and are > good trade-offs but, here, we're talking about a major architectural > decision which not only affects memcg but mm in general. I'm afraid > this doesn't sound like an no-brainer flexibility we can afford. > >> When we have a system OOM we want to do a walk of the administrative >> memcg tree (which is only a couple levels deep, users can make >> non-admin sub-memcgs), selecting the lowest priority entity at each >> step (where both tasks and memcgs have a priority and the priority >> range is much wider than the current OOM scores, and where memcg >> priority is sometimes a function of memcg usage), until we reach a >> leaf. >> >> Once we reach a leaf, I want to log some info about the memcg doing >> the allocation, the memcg being terminated, and maybe some other bits >> about the system (depending on the priority of the selected victim, >> this may or may not be an "acceptable" situation). Then I want to >> kill *everything* under that memcg. Then I want to "publish" some >> information through a sane API (e.g. not dmesg scraping). >> >> This is basically our policy as we understand it today. This is >> notably different than it was a year ago, and it will probably evolve >> further in the next year. > > I think per-memcg score and killing is something which makes > fundamental sense. 
In fact, killing a single process has never made > much sense to me as that is a unit which ultimately is only meaningful > to the kernel itself and not necessraily to userland, so no matter > what I think we're gonna gain per-memcg behavior and it seems most, > albeit not all, of what you described above should be implementable > through that. Well that's an awesome start. We have or had patches to do a lot of this. I don't know how well scrubbed they are for pushing or whether they apply at all to current head, though. > Ultimately, if the use case calls for very fine level of control, I > think the right thing to do is making nesting work properly which is > likely to take some time. In the meantime, even if such use case > requires modifying the kernel to tailor the OOM behavior, I think > sticking to kernel OOM provides a lot easier way to eventual > convergence. Userland system OOM basically means giving up and would > lessen the motivation towards improving the shared infrastructures > while adding significant pressure towards schizophreic diversion. > >> We have a long tail of kernel memory usage. If we provision machines >> so that the "do work here" first-level memcg excludes the average >> kernel usage, we have a huge number
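A rough sketch of the walk described above, as a userspace OOM handler might implement it. The tree and priority fields are hypothetical bookkeeping kept by the handler itself, not kernel interfaces, and the kill/logging steps are left out:

/* Administrative memcg tree as tracked by the handler; the kernel does
 * not expose these priorities, they are the handler's own metadata. */
struct memcg_node {
        const char *path;               /* cgroupfs path of this memcg  */
        int priority;                   /* lower value == killed sooner */
        struct memcg_node **children;
        int nr_children;
};

/* Descend the tree, picking the lowest-priority entity at each level,
 * until a leaf is reached; everything under that leaf is then killed. */
static struct memcg_node *pick_victim(struct memcg_node *node)
{
        while (node->nr_children > 0) {
                struct memcg_node *lowest = node->children[0];
                int i;

                for (i = 1; i < node->nr_children; i++)
                        if (node->children[i]->priority < lowest->priority)
                                lowest = node->children[i];
                node = lowest;
        }
        return node;
}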
Re: [patch 7/8] mm, memcg: allow processes handling oom notifications to access reserves
On Thu, Dec 12, 2013 at 6:21 AM, Tejun Heo wrote: > Hey, Tim. > > Sidenote: Please don't top-post with the whole body quoted below > unless you're adding new cc's. Please selectively quote the original > message's body to remind the readers of the context and reply below > it. It's a basic lkml etiquette and one with good reasons. If you > have to top-post for whatever reason - say you're typing from a > machine which doesn't allow easy editing of the original message, > explain so at the top of the message, or better yet, wait till you can > unless it's urgent. Yeah sorry. Replying from my phone is awkward at best. I know better :) > On Wed, Dec 11, 2013 at 09:37:46PM -0800, Tim Hockin wrote: >> The immediate problem I see with setting aside reserves "off the top" >> is that we don't really know a priori how much memory the kernel >> itself is going to use, which could still land us in an overcommitted >> state. >> >> In other words, if I have your 128 MB machine, and I set aside 8 MB >> for OOM handling, and give 120 MB for jobs, I have not accounted for >> the kernel. So I set aside 8 MB for OOM and 100 MB for jobs, leaving >> 20 MB for jobs. That should be enough right? Hell if I know, and >> nothing ensures that. > > Yes, sure thing, that's the reason why I mentioned "with some slack" > in the original message and also that it might not be completely the > same. It doesn't allow you to aggressively use system level OOM > handling as the sizing estimator for the root cgroup; however, it's > more of an implementation details than something which should guide > the overall architecture - it's a problem which lessens in severity as > [k]memcg improves and its coverage becomes more complete, which is the > direction we should be headed no matter what. In my mind, the ONLY point of pulling system-OOM handling into userspace is to make it easier for crazy people (Google) to implement bizarre system-OOM policies. Example: When we have a system OOM we want to do a walk of the administrative memcg tree (which is only a couple levels deep, users can make non-admin sub-memcgs), selecting the lowest priority entity at each step (where both tasks and memcgs have a priority and the priority range is much wider than the current OOM scores, and where memcg priority is sometimes a function of memcg usage), until we reach a leaf. Once we reach a leaf, I want to log some info about the memcg doing the allocation, the memcg being terminated, and maybe some other bits about the system (depending on the priority of the selected victim, this may or may not be an "acceptable" situation). Then I want to kill *everything* under that memcg. Then I want to "publish" some information through a sane API (e.g. not dmesg scraping). This is basically our policy as we understand it today. This is notably different than it was a year ago, and it will probably evolve further in the next year. Teaching the kernel all of this stuff has proven to be sort of difficult to maintain and forward-port, and has been very slow to evolve because of how painful it is to test and deploy new kernels. Maybe we can find a way to push this level of policy down to the kernel OOM killer? When this was mentioned internally I got shot down (gently, but shot down none the less). Assuming we had nearly-reliable (it doesn't have to be 100% guaranteed to be useful) OOM-in-userspace, I can keep the adminstrative memcg metadata in memory, implement killing as cruelly as I need, and do all of the logging and publication after the OOM kill is done. 
Most importantly I can test and deploy new policy changes pretty trivially. Handling per-memcg OOM is a different discussion. Here is where we want to be able to extract things like heap profiles or take stats snapshots, grow memcgs (if so configured) etc. Allowing our users to have a moment of mercy before we put a bullet in their brain enables a whole new realm of debugging, as well as a lot of valuable features. > It'd depend on the workload but with memcg fully configured it > shouldn't fluctuate wildly. If it does, we need to hunt down whatever > is causing such fluctuatation and include it in kmemcg, right? That > way, memcg as a whole improves for all use cases not just your niche > one and I strongly believe that aligning as many use cases as possible > along the same axis, rather than creating a large hole to stow away > the exceptions, is vastly more beneficial to *everyone* in the long > term. We have a long tail of kernel memory usage. If we provision machines so that the "do work here" first-level memcg excludes the average kernel usage, we have a huge number of machines that will fail to apply OOM policy because of actual overcommitment. If we provision for 95th or 99th perce
Re: [patch 7/8] mm, memcg: allow processes handling oom notifications to access reserves
On Thu, Dec 12, 2013 at 6:21 AM, Tejun Heo t...@kernel.org wrote: Hey, Tim. Sidenote: Please don't top-post with the whole body quoted below unless you're adding new cc's. Please selectively quote the original message's body to remind the readers of the context and reply below it. It's a basic lkml etiquette and one with good reasons. If you have to top-post for whatever reason - say you're typing from a machine which doesn't allow easy editing of the original message, explain so at the top of the message, or better yet, wait till you can unless it's urgent. Yeah sorry. Replying from my phone is awkward at best. I know better :) On Wed, Dec 11, 2013 at 09:37:46PM -0800, Tim Hockin wrote: The immediate problem I see with setting aside reserves off the top is that we don't really know a priori how much memory the kernel itself is going to use, which could still land us in an overcommitted state. In other words, if I have your 128 MB machine, and I set aside 8 MB for OOM handling, and give 120 MB for jobs, I have not accounted for the kernel. So I set aside 8 MB for OOM and 100 MB for jobs, leaving 20 MB for jobs. That should be enough right? Hell if I know, and nothing ensures that. Yes, sure thing, that's the reason why I mentioned with some slack in the original message and also that it might not be completely the same. It doesn't allow you to aggressively use system level OOM handling as the sizing estimator for the root cgroup; however, it's more of an implementation details than something which should guide the overall architecture - it's a problem which lessens in severity as [k]memcg improves and its coverage becomes more complete, which is the direction we should be headed no matter what. In my mind, the ONLY point of pulling system-OOM handling into userspace is to make it easier for crazy people (Google) to implement bizarre system-OOM policies. Example: When we have a system OOM we want to do a walk of the administrative memcg tree (which is only a couple levels deep, users can make non-admin sub-memcgs), selecting the lowest priority entity at each step (where both tasks and memcgs have a priority and the priority range is much wider than the current OOM scores, and where memcg priority is sometimes a function of memcg usage), until we reach a leaf. Once we reach a leaf, I want to log some info about the memcg doing the allocation, the memcg being terminated, and maybe some other bits about the system (depending on the priority of the selected victim, this may or may not be an acceptable situation). Then I want to kill *everything* under that memcg. Then I want to publish some information through a sane API (e.g. not dmesg scraping). This is basically our policy as we understand it today. This is notably different than it was a year ago, and it will probably evolve further in the next year. Teaching the kernel all of this stuff has proven to be sort of difficult to maintain and forward-port, and has been very slow to evolve because of how painful it is to test and deploy new kernels. Maybe we can find a way to push this level of policy down to the kernel OOM killer? When this was mentioned internally I got shot down (gently, but shot down none the less). Assuming we had nearly-reliable (it doesn't have to be 100% guaranteed to be useful) OOM-in-userspace, I can keep the adminstrative memcg metadata in memory, implement killing as cruelly as I need, and do all of the logging and publication after the OOM kill is done. 
Most importantly I can test and deploy new policy changes pretty trivially. Handling per-memcg OOM is a different discussion. Here is where we want to be able to extract things like heap profiles or take stats snapshots, grow memcgs (if so configured) etc. Allowing our users to have a moment of mercy before we put a bullet in their brain enables a whole new realm of debugging, as well as a lot of valuable features. It'd depend on the workload but with memcg fully configured it shouldn't fluctuate wildly. If it does, we need to hunt down whatever is causing such fluctuatation and include it in kmemcg, right? That way, memcg as a whole improves for all use cases not just your niche one and I strongly believe that aligning as many use cases as possible along the same axis, rather than creating a large hole to stow away the exceptions, is vastly more beneficial to *everyone* in the long term. We have a long tail of kernel memory usage. If we provision machines so that the do work here first-level memcg excludes the average kernel usage, we have a huge number of machines that will fail to apply OOM policy because of actual overcommitment. If we provision for 95th or 99th percentile kernel usage, we're wasting large amounts of memory that could be used to schedule jobs. This is the fundamental problem we face with static apportionment (and we face it in a dozen other situations, too). Expressing this set-aside memory
Re: [patch 7/8] mm, memcg: allow processes handling oom notifications to access reserves
On Thu, Dec 12, 2013 at 11:23 AM, Tejun Heo t...@kernel.org wrote: Hello, Tim. On Thu, Dec 12, 2013 at 10:42:20AM -0800, Tim Hockin wrote: Yeah sorry. Replying from my phone is awkward at best. I know better :) Heh, sorry about being bitchy. :) In my mind, the ONLY point of pulling system-OOM handling into userspace is to make it easier for crazy people (Google) to implement bizarre system-OOM policies. Example: I think that's one of the places where we largely disagree. If at all Just to be clear - I say this because it doesn't feel right to impose my craziness on others, and it sucks when we try and are met with you're crazy, go away. And you have to admit that happens to Google. :) Punching an escape valve that allows us to be crazy without hurting anyone else sounds ideal, IF and ONLY IF that escape valve is itself maintainable. If the escape valve is userspace it's REALLY easy to iterate on our craziness. If it is kernel space, it's somewhat less easy, but not impossible. possible, I'd much prefer google's workload to be supported inside the general boundaries of the upstream kernel without having to punch a large hole in it. To me, the general development history of memcg in general and this thread in particular seem to epitomize why it is a bad idea to have isolated, large and deep crazy use cases. Punching the initial hole is the easy part; however, we all are quite limited in anticpating future needs and sooner or later that crazy use case is bound to evolve further towards the isolated extreme it departed towards and require more and larger holes and further contortions to accomodate such progress. The concern I have with the suggested solution is not necessarily that it's technically complex than it looks like on the surface - I'm sure it can be made to work one way or the other - but that it's a fairly large step toward an isolated extreme which memcg as a project probably should not head toward. There sure are cases where such exceptions can't be avoided and are good trade-offs but, here, we're talking about a major architectural decision which not only affects memcg but mm in general. I'm afraid this doesn't sound like an no-brainer flexibility we can afford. When we have a system OOM we want to do a walk of the administrative memcg tree (which is only a couple levels deep, users can make non-admin sub-memcgs), selecting the lowest priority entity at each step (where both tasks and memcgs have a priority and the priority range is much wider than the current OOM scores, and where memcg priority is sometimes a function of memcg usage), until we reach a leaf. Once we reach a leaf, I want to log some info about the memcg doing the allocation, the memcg being terminated, and maybe some other bits about the system (depending on the priority of the selected victim, this may or may not be an acceptable situation). Then I want to kill *everything* under that memcg. Then I want to publish some information through a sane API (e.g. not dmesg scraping). This is basically our policy as we understand it today. This is notably different than it was a year ago, and it will probably evolve further in the next year. I think per-memcg score and killing is something which makes fundamental sense. 
In fact, killing a single process has never made much sense to me as that is a unit which ultimately is only meaningful to the kernel itself and not necessraily to userland, so no matter what I think we're gonna gain per-memcg behavior and it seems most, albeit not all, of what you described above should be implementable through that. Well that's an awesome start. We have or had patches to do a lot of this. I don't know how well scrubbed they are for pushing or whether they apply at all to current head, though. Ultimately, if the use case calls for very fine level of control, I think the right thing to do is making nesting work properly which is likely to take some time. In the meantime, even if such use case requires modifying the kernel to tailor the OOM behavior, I think sticking to kernel OOM provides a lot easier way to eventual convergence. Userland system OOM basically means giving up and would lessen the motivation towards improving the shared infrastructures while adding significant pressure towards schizophreic diversion. We have a long tail of kernel memory usage. If we provision machines so that the do work here first-level memcg excludes the average kernel usage, we have a huge number of machines that will fail to apply OOM policy because of actual overcommitment. If we provision for 95th or 99th percentile kernel usage, we're wasting large amounts of memory that could be used to schedule jobs. This is the fundamental problem we face with static apportionment (and we face it in a dozen other situations, too). Expressing this set-aside memory as off-the-top rather than absolute limits makes the whole
Re: [patch 7/8] mm, memcg: allow processes handling oom notifications to access reserves
The immediate problem I see with setting aside reserves "off the top" is that we don't really know a priori how much memory the kernel itself is going to use, which could still land us in an overcommitted state. In other words, if I have your 128 MB machine, and I set aside 8 MB for OOM handling, and give 120 MB for jobs, I have not accounted for the kernel. So I set aside 8 MB for OOM and 100 MB for jobs, leaving 20 MB for jobs. That should be enough right? Hell if I know, and nothing ensures that. On Wed, Dec 11, 2013 at 4:42 AM, Tejun Heo wrote: > Yo, > > On Tue, Dec 10, 2013 at 03:55:48PM -0800, David Rientjes wrote: >> > Well, the gotcha there is that you won't be able to do that with >> > system level OOM handler either unless you create a separately >> > reserved memory, which, again, can be achieved using hierarchical >> > memcg setup already. Am I missing something here? >> >> System oom conditions would only arise when the usage of memcgs A + B >> above cause the page allocator to not be able to allocate memory without >> oom killing something even though the limits of both A and B may not have >> been reached yet. No userspace oom handler can allocate memory with >> access to memory reserves in the page allocator in such a context; it's >> vital that if we are to handle system oom conditions in userspace that we >> given them access to memory that other processes can't allocate. You >> could attach a userspace system oom handler to any memcg in this scenario >> with memory.oom_reserve_in_bytes and since it has PF_OOM_HANDLER it would >> be able to allocate in reserves in the page allocator and overcharge in >> its memcg to handle it. This isn't possible only with a hierarchical >> memcg setup unless you ensure the sum of the limits of the top level >> memcgs do not equal or exceed the sum of the min watermarks of all memory >> zones, and we exceed that. > > Yes, exactly. If system memory is 128M, create top level memcgs w/ > 120M and 8M each (well, with some slack of course) and then overcommit > the descendants of 120M while putting OOM handlers and friends under > 8M without overcommitting. > > ... >> The stronger rationale is that you can't handle system oom in userspace >> without this functionality and we need to do so. > > You're giving yourself an unreasonable precondition - overcommitting > at root level and handling system OOM from userland - and then trying > to contort everything to fit that. How can possibly "overcommitting > at root level" be a goal of and in itself? Please take a step back > and look at and explain the *problem* you're trying to solve. You > haven't explained why that *need*s to be the case at all. > > I wrote this at the start of the thread but you're still doing the > same thing. You're trying to create a hidden memcg level inside a > memcg. At the beginning of this thread, you were trying to do that > for !root memcgs and now you're arguing that you *need* that for root > memcg. Because there's no other limit we can make use of, you're > suggesting the use of kernel reserve memory for that purpose. It > seems like an absurd thing to do to me. It could be that you might > not be able to achieve exactly the same thing that way, but the right > thing to do would be improving memcg in general so that it can instead > of adding yet more layer of half-baked complexity, right? 
> > Even if there are some inherent advantages of system userland OOM > handling with a separate physical memory reserve, which AFAICS you > haven't succeeded at showing yet, this is a very invasive change and, > as you said before, something with an *extremely* narrow use case. > Wouldn't it be a better idea to improve the existing mechanisms - be > that memcg in general or kernel OOM handling - to fit the niche use > case better? I mean, just think about all the corner cases. How are > you gonna handle priority inversion through locked pages or > allocations given out to other tasks through slab? You're suggesting > opening a giant can of worms for extremely narrow benefit which > doesn't even seem like actually needing opening the said can. > > Thanks. > > -- > tejun > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
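Tejun's 120M/8M example corresponds to two top-level memcgs with memory.limit_in_bytes set accordingly; a minimal setup sketch using the memcg v1 files of that era (paths and sizes are illustrative, error handling trimmed):

#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

static int write_str(const char *path, const char *val)
{
        int fd = open(path, O_WRONLY);

        if (fd < 0)
                return -1;
        if (write(fd, val, strlen(val)) < 0) {
                close(fd);
                return -1;
        }
        return close(fd);
}

int main(void)
{
        /* Jobs live under a 120M limit and may be overcommitted below it. */
        mkdir("/sys/fs/cgroup/memory/jobs", 0755);
        write_str("/sys/fs/cgroup/memory/jobs/memory.limit_in_bytes",
                  "125829120");

        /* OOM handlers and friends get a non-overcommitted 8M (plus slack). */
        mkdir("/sys/fs/cgroup/memory/oom-handlers", 0755);
        write_str("/sys/fs/cgroup/memory/oom-handlers/memory.limit_in_bytes",
                  "8388608");
        return 0;
}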
Re: cgroup: status-quo and userland efforts
On Sun, Jun 30, 2013 at 12:39 PM, Lennart Poettering wrote: > Heya, > > > On 29.06.2013 05:05, Tim Hockin wrote: >> >> Come on, now, Lennart. You put a lot of words in my mouth. > > >>> I for sure am not going to make the PID 1 a client of another daemon. >>> That's >>> just wrong. If you have a daemon that is both conceptually the manager of >>> another service and the client of that other service, then that's bad >>> design >>> and you will easily run into deadlocks and such. Just think about it: if >>> you >>> have some external daemon for managing cgroups, and you need cgroups for >>> running external daemons, how are you going to start the external daemon >>> for >>> managing cgroups? Sure, you can hack around this, make that daemon >>> special, >>> and magic, and stuff -- or you can just not do such nonsense. There's no >>> reason to repeat the fuckup that cgroup became in kernelspace a second >>> time, >>> but this time in userspace, with multiple manager daemons all with >>> different >>> and slightly incompatible definitions what a unit to manage actualy is... >> >> >> I forgot about the tautology of systemd. systemd is monolithic. > > > systemd is certainly not monolithic for almost any definition of that term. > I am not sure where you are taking that from, and I am not sure I want to > discuss on that level. This just sounds like FUD you picked up somewhere and > are repeating carelessly... It does a number of sort-of-related things. Maybe it does them better by doing them together. I can't say, really. We don't use it at work, and I am on Ubuntu elsewhere, for now. >> But that's not my point. It seems pretty easy to make this cgroup >> management (in "native mode") a library that can have either a thin >> veneer of a main() function, while also being usable by systemd. The >> point is to solve all of the problems ONCE. I'm trying to make the >> case that systemd itself should be focusing on features and policies >> and awesome APIs. > > You know, getting this all right isn't easy. If you want to do things > properly, then you need to propagate attribute changes between the units you > manage. You also need something like a scheduler, since a number of > controllers can only be configured under certain external conditions (for > example: the blkio or devices controller use major/minor parameters for > configuring per-device limits. Since major/minor assignments are pretty much > unpredictable these days -- and users probably want to configure things with > friendly and stable /dev/disk/by-id/* symlinks anyway -- this requires us to > wait for devices to show up before we can configure the parameters.) Soo... > you need a graph of units, where you can propagate things, and schedule > things based on some execution/event queue. And the propagation and > scheduling are closely intermingled. I'm really just talking about the most basic low-level substrate of writing to cgroupfs. Again, we don't use udev (yet?) so we don't have these problems. It seems to me that it's possible to formulate a bottom layer that is usable by both systemd and non-systemd systems. But, you know, maybe I am wrong and our internal universe is so much simpler (and behind the times) than the rest of the world that layering can work for us and not you. > Now, that's pretty much exactly what systemd actually *is*. It implements a > graph of units with a scheduler. 
And if you rip that part out of systemd to > make this an "easy cgroup management library", then you simply turn what > systemd is into a library without leaving anything. Which is just bogus. > > So no, if you say "seems pretty easy to make this cgroup management a > library" then well, I have to disagree with you. > > >>> We want to run fewer, simpler things on our systems, we want to reuse as >> >> >> Fewer and simpler are not compatible, unless you are losing >> functionality. Systemd is fewer, but NOT simpler. > > > Oh, certainly it is. If we'd split up the cgroup fs access into separate > daemon of some kind, then we'd need some kind of IPC for that, and so you > have more daemons and you have some complex IPC between the processes. So > yeah, the systemd approach is certainly both simpler and uses fewer daemons > then your hypothetical one. Well, it SOUNDS like Serge is trying to develop this to demonstrate that a standalone daemon works. That's what I am keen to help with (or else we have to invent ourselves). I am not really afraid of IPC or of "more daemons". I much prefer simple agents doing one thing and interacting
Re: cgroup: status-quo and userland efforts
On Sun, Jun 30, 2013 at 12:39 PM, Lennart Poettering lpoet...@redhat.com wrote: Heya, On 29.06.2013 05:05, Tim Hockin wrote: Come on, now, Lennart. You put a lot of words in my mouth. I for sure am not going to make the PID 1 a client of another daemon. That's just wrong. If you have a daemon that is both conceptually the manager of another service and the client of that other service, then that's bad design and you will easily run into deadlocks and such. Just think about it: if you have some external daemon for managing cgroups, and you need cgroups for running external daemons, how are you going to start the external daemon for managing cgroups? Sure, you can hack around this, make that daemon special, and magic, and stuff -- or you can just not do such nonsense. There's no reason to repeat the fuckup that cgroup became in kernelspace a second time, but this time in userspace, with multiple manager daemons all with different and slightly incompatible definitions what a unit to manage actualy is... I forgot about the tautology of systemd. systemd is monolithic. systemd is certainly not monolithic for almost any definition of that term. I am not sure where you are taking that from, and I am not sure I want to discuss on that level. This just sounds like FUD you picked up somewhere and are repeating carelessly... It does a number of sort-of-related things. Maybe it does them better by doing them together. I can't say, really. We don't use it at work, and I am on Ubuntu elsewhere, for now. But that's not my point. It seems pretty easy to make this cgroup management (in native mode) a library that can have either a thin veneer of a main() function, while also being usable by systemd. The point is to solve all of the problems ONCE. I'm trying to make the case that systemd itself should be focusing on features and policies and awesome APIs. You know, getting this all right isn't easy. If you want to do things properly, then you need to propagate attribute changes between the units you manage. You also need something like a scheduler, since a number of controllers can only be configured under certain external conditions (for example: the blkio or devices controller use major/minor parameters for configuring per-device limits. Since major/minor assignments are pretty much unpredictable these days -- and users probably want to configure things with friendly and stable /dev/disk/by-id/* symlinks anyway -- this requires us to wait for devices to show up before we can configure the parameters.) Soo... you need a graph of units, where you can propagate things, and schedule things based on some execution/event queue. And the propagation and scheduling are closely intermingled. I'm really just talking about the most basic low-level substrate of writing to cgroupfs. Again, we don't use udev (yet?) so we don't have these problems. It seems to me that it's possible to formulate a bottom layer that is usable by both systemd and non-systemd systems. But, you know, maybe I am wrong and our internal universe is so much simpler (and behind the times) than the rest of the world that layering can work for us and not you. Now, that's pretty much exactly what systemd actually *is*. It implements a graph of units with a scheduler. And if you rip that part out of systemd to make this an easy cgroup management library, then you simply turn what systemd is into a library without leaving anything. Which is just bogus. 
So no, if you say seems pretty easy to make this cgroup management a library then well, I have to disagree with you. We want to run fewer, simpler things on our systems, we want to reuse as Fewer and simpler are not compatible, unless you are losing functionality. Systemd is fewer, but NOT simpler. Oh, certainly it is. If we'd split up the cgroup fs access into separate daemon of some kind, then we'd need some kind of IPC for that, and so you have more daemons and you have some complex IPC between the processes. So yeah, the systemd approach is certainly both simpler and uses fewer daemons then your hypothetical one. Well, it SOUNDS like Serge is trying to develop this to demonstrate that a standalone daemon works. That's what I am keen to help with (or else we have to invent ourselves). I am not really afraid of IPC or of more daemons. I much prefer simple agents doing one thing and interacting with each other in simple ways. But that's me. much of the code as we can. You don't achieve that by running yet another daemon that does worse what systemd can anyway do simpler, easier and better. Considering this is all hypothetical, I find this to be a funny debate. My hypothetical idea is better than your hypothetical idea. Well, systemd is pretty real, and the code to do the unified cgroup management within systemd is pretty complete. systemd is certainly not hypothetical. Fair enough - I did not realize you had
Re: cgroup: status-quo and userland efforts
Come on, now, Lennart. You put a lot of words in my mouth. On Fri, Jun 28, 2013 at 6:48 PM, Lennart Poettering wrote: > On 28.06.2013 20:53, Tim Hockin wrote: > >> a single-agent, we should make a kick-ass implementation that is >> flexible and scalable, and full-featured enough to not require >> divergence at the lowest layer of the stack. Then build systemd on >> top of that. Let systemd offer more features and policies and >> "semantic" APIs. > > > Well, what if systemd is already kick-ass? I mean, if you have a problem > with systemd, then that's your own problem, but I really don't think why I > should bother? I didn't say it wasn't. I said that we can build a common substrate that systemd can build on *and* non-systemd systems can use *and* Google can participate in. > I for sure am not going to make the PID 1 a client of another daemon. That's > just wrong. If you have a daemon that is both conceptually the manager of > another service and the client of that other service, then that's bad design > and you will easily run into deadlocks and such. Just think about it: if you > have some external daemon for managing cgroups, and you need cgroups for > running external daemons, how are you going to start the external daemon for > managing cgroups? Sure, you can hack around this, make that daemon special, > and magic, and stuff -- or you can just not do such nonsense. There's no > reason to repeat the fuckup that cgroup became in kernelspace a second time, > but this time in userspace, with multiple manager daemons all with different > and slightly incompatible definitions what a unit to manage actualy is... I forgot about the tautology of systemd. systemd is monolithic. Therefore it can not have any external dependencies. Therefore it must absorb anything it depends on. Therefore systemd continues to grow in size and scope. Up next: systemd manages your X sessions! But that's not my point. It seems pretty easy to make this cgroup management (in "native mode") a library that can have either a thin veneer of a main() function, while also being usable by systemd. The point is to solve all of the problems ONCE. I'm trying to make the case that systemd itself should be focusing on features and policies and awesome APIs. > We want to run fewer, simpler things on our systems, we want to reuse as Fewer and simpler are not compatible, unless you are losing functionality. Systemd is fewer, but NOT simpler. > much of the code as we can. You don't achieve that by running yet another > daemon that does worse what systemd can anyway do simpler, easier and > better. Considering this is all hypothetical, I find this to be a funny debate. My hypothetical idea is better than your hypothetical idea. > The least you could grant us is to have a look at the final APIs we will > have to offer before you already imply that systemd cannot be a valid > implementation of any API people could ever agree on. Whoah, don't get defensive. I said nothing of the sort. The fact of the matter is that we do not run systemd, at least in part because of the monolithic nature. That's unlikely to change in this timescale. What I said was that it would be a shame if we had to invent our own low-level cgroup daemon just because the "upstream" daemons was too tightly coupled with systemd. I think we have a lot of experience to offer to this project, and a vested interest in seeing it done well. But if it is purely targetting systemd, we have little incentive to devote resources to it. 
Please note that I am strictly talking about the lowest layer of the API. Just the thing that guards cgroupfs against mere mortals. The higher layers - where abstractions exist, that are actually USEFUL to end users - are not really in scope right now. We already have our own higher level APIs. This is supposed to be collaborative, not combative. Tim -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
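The "lowest layer of the API" being asked for amounts to little more than create/set/attach primitives over cgroupfs that either systemd or a standalone daemon could sit on top of. A sketch of such a substrate; the function names are invented for illustration and are not an existing library's API:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#define CGROOT "/sys/fs/cgroup"

/* Create a cgroup under the given controller hierarchy. */
static int cg_create(const char *ctrl, const char *cg)
{
        char path[256];

        snprintf(path, sizeof(path), CGROOT "/%s/%s", ctrl, cg);
        return mkdir(path, 0755);
}

/* Write a value into one control file of a cgroup. */
static int cg_set(const char *ctrl, const char *cg,
                  const char *file, const char *val)
{
        char path[256];
        int fd, ret;

        snprintf(path, sizeof(path), CGROOT "/%s/%s/%s", ctrl, cg, file);
        fd = open(path, O_WRONLY);
        if (fd < 0)
                return -1;
        ret = write(fd, val, strlen(val)) < 0 ? -1 : 0;
        close(fd);
        return ret;
}

/* Move a task into a cgroup. */
static int cg_attach(const char *ctrl, const char *cg, pid_t pid)
{
        char buf[32];

        snprintf(buf, sizeof(buf), "%ld", (long)pid);
        return cg_set(ctrl, cg, "cgroup.procs", buf);
}

Everything policy-related - who may call these, and on which subtrees - would live above this layer, which is the split being argued for.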
Re: cgroup access daemon
On Fri, Jun 28, 2013 at 12:21 PM, Serge Hallyn wrote: > Quoting Tim Hockin (thoc...@hockin.org): >> On Fri, Jun 28, 2013 at 9:31 AM, Serge Hallyn >> wrote: >> > Quoting Tim Hockin (thoc...@hockin.org): >> >> On Thu, Jun 27, 2013 at 11:11 AM, Serge Hallyn >> >> wrote: >> >> > Quoting Tim Hockin (thoc...@hockin.org): >> > Could you give examples? >> > >> > If you have a white/academic paper I should go read, that'd be great. >> >> We don't have anything on this, but examples may help. >> >> Someone running as root should be able to connect to the "native" >> daemon and read or write any cgroup file they want, right? You could >> argue that root should be able to do this to a child-daemon, too, but >> let's ignore that. >> >> But inside a container, I don't want the users to be able to write to >> anything in their own container. I do want them to be able to make >> sub-cgroups, but only 5 levels deep. For sub-cgroups, they should be >> able to write to memory.limit_in_bytes, to read but not write >> memory.soft_limit_in_bytes, and not be able to read memory.stat. >> >> To get even fancier, a user should be able to create a sub-cgroup and >> then designate that sub-cgroup as "final" - no further sub-sub-cgroups >> allowed under it. They should also be able to designate that a >> sub-cgroup is "one-way" - once a process enters it, it can not leave. >> >> These are real(ish) examples based on what people want to do today. >> In particular, the last couple are things that we want to do, but >> don't do today. >> >> The particular policy can differ per-container. Production jobs might >> be allowed to create sub-cgroups, but batch jobs are not. Some user >> jobs are designated "trusted" in one facet or another and get more >> (but still not full) access. > > Interesting, thanks. > > I'll think a bit on how to best address these. > >> > At the moment I'm going on the naive belief that proper hierarchy >> > controls will be enforced (eventually) by the kernel - i.e. if >> > a task in cgroup /lxc/c1 is not allowed to mknod /dev/sda1, then it >> > won't be possible for /lxc/c1/lxc/c2 to take that access. >> > >> > The native cgroup manager (the one using cgroupfs) will be checking >> > the credentials of the requesting child manager for access(2) to >> > the cgroup files. >> >> This might be sufficient, or the basis for a sufficient access control >> system for users. The problem comes that we have multiple jobs on a >> single machine running as the same user. We need to ensure that the >> jobs can not modify each other. > > Would running them each in user namespaces with different mappings (all > jobs running as uid 1000, but uid 1000 mapped to different host uids > for each job) would be (long-term) feasible? Possibly. It's a largish imposition to make on the caller (we don't use user namespaces today, though we are evaluating how to start using them) but perhaps not terrible. >> > It is a named socket. >> >> So anyone can connect? even with SO_PEERCRED, how do you know which >> branches of the cgroup tree I am allowed to modify if the same user >> owns more than one? > > I was assuming that any process requesting management of > /c1/c2/c3 would have to be in one of its ancestor cgroups (i.e. 
/c1) > > So if you have two jobs running as uid 1000, one under /c1 and one > under /c2, and one as uid 1001 running under /c3 (with the uids owning > the cgroups), then the file permissions will prevent 1000 and 1001 > from walk over each other, while the cgroup manager will not allow > a process (child manager or otherwise) under /c1 to manage cgroups > under /c2 and vice versa. > >> >> Do you have a design spec, or a requirements list, or even a prototype >> >> that we can look at? >> > >> > The readme at https://github.com/hallyn/cgroup-mgr/blob/master/README >> > shows what I have in mind. It (and the sloppy code next to it) >> > represent a few hours' work over the last few days while waiting >> > for compiles and in between emails... >> >> Awesome. Do you mind if we look? > > No, but it might not be worth it (other than the readme) :) - so far > it's only served to help me think through what I want and need from > the mgr. > >> > But again, it is completely predicated on my goal to have libvirt >> > and lxc (and other cgroup users) be able to use the same library >> > or API to make their requests whether they are on host or in a container, >> > and regardless of the distro they're running under. >> >> I think that is a good goal. We'd like to not be different, if possible. >> Obviously, we can't impose our needs on you if you don't want to handle them. >> It sounds like what you are building is the bottom layer in a stack - we >> (Google) should use that same bottom layer. But that can only happen iff >> you're open to hearing our requirements. Otherwise we have to strike out on >> our own or build more layers in-between. > > I'm definately open to your requirements - whether providing what you need > for another layer on top, or building it right in. Great. That's a good place to start :)
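For illustration only: the ancestor-cgroup check Serge describes could be done with SO_PEERCRED plus /proc/<pid>/cgroup. The sketch below is hypothetical - the function name, buffer sizes and the loose controller match are invented - but it shows the kind of check a manager would perform before honoring a request to manage /c1/c2/c3.

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical sketch: return 1 if 'target' (e.g. "/c1/c2/c3") is equal to
 * or below the connecting client's own cgroup for 'controller' (e.g.
 * "memory"), else 0. The client's pid comes from SO_PEERCRED on the unix
 * socket; its cgroup comes from /proc/<pid>/cgroup. Error handling is
 * abbreviated. */
static int client_may_manage(int conn_fd, const char *controller, const char *target)
{
    struct ucred cred;
    socklen_t len = sizeof(cred);
    char path[256], line[512], client_cg[256] = "";
    FILE *f;

    if (getsockopt(conn_fd, SOL_SOCKET, SO_PEERCRED, &cred, &len) < 0)
        return 0;

    snprintf(path, sizeof(path), "/proc/%d/cgroup", (int)cred.pid);
    f = fopen(path, "r");
    if (!f)
        return 0;

    /* Each line looks like "4:memory:/c1"; pick the controller we care about
     * (strstr is a deliberately loose match for this sketch). */
    while (fgets(line, sizeof(line), f)) {
        char *subsys = strchr(line, ':');
        char *cg = subsys ? strchr(subsys + 1, ':') : NULL;
        if (!subsys || !cg)
            continue;
        *cg++ = '\0';
        cg[strcspn(cg, "\n")] = '\0';
        if (strstr(subsys + 1, controller)) {
            snprintf(client_cg, sizeof(client_cg), "%s", cg);
            break;
        }
    }
    fclose(f);

    /* The client may act on its own cgroup or anything underneath it. */
    size_t n = strlen(client_cg);
    return n && strncmp(target, client_cg, n) == 0 &&
           (target[n] == '\0' || target[n] == '/' || strcmp(client_cg, "/") == 0);
}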
Re: [Workman-devel] cgroup: status-quo and userland efforts
On Fri, Jun 28, 2013 at 8:53 AM, Serge Hallyn wrote: > Quoting Daniel P. Berrange (berra...@redhat.com): >> Are you also planning to actually write a new cgroup parent manager >> daemon too ? Currently my plan for libvirt is to just talk directly > > I'm toying with the idea, yes. (Right now my toy runs in either native > mode, using cgroupfs, or child mode, talking to a parent manager) I'd > love if someone else does it, but it needs to be done. > > As I've said elsewhere in the thread, I see 2 problems to be addressed: > > 1. The ability to nest the cgroup manager daemons, so that a daemon > running in a container can talk to a daemon running on the host. This > is the problem my current toy is aiming to address. But the API it > exports is just a thin layer over cgroupfs. > > 2. Abstract away the kernel/cgroupfs details so that userspace can > explain its cgroup needs generically. This is IIUC what systemd is > addressing with slices and scopes. > > (2) is where I'd really like to have a well thought out, community > designed API that everyone can agree on, and it might be worth getting > together (with Tejun) at plumbers or something to lay something out. We're also working on (2) (well, we HAVE it, but we're dis-integrating it so we can hopefully publish more widely). But our (2) depends on direct cgroupfs access. If that is to change, we need a really robust (1). It's OK (desireable, in fact) that (1) be a very thin layer of abstraction. > In the end, something like libvirt or lxc should not need to care > what is running underneat it. It should be able to make its requests > the same way regardless of whether it running in fedora or ubuntu, > and whether it is running on the host or in a tightly bound container. > That's my goal anyway :) > >> to systemd's new DBus APIs for all management of cgroups, and then >> fall back to writing to cgroupfs directly for cases where systemd >> is not around. Having a library to abstract these two possible >> alternatives isn't all that compelling unless we think there will >> be multiple cgroups manager daemons. I've been somewhat assuming that >> even Ubuntu will eventually see the benefits & switch to systemd, > > So far I've seen no indication of that :) > > If the systemd code to manage slices could be made separately > compileable as a standalone library or daemon, then I'd advocate > using that. But I don't see a lot of incentive for systemd to do > that, so I'd feel like a heel even asking. I want to say "let the best API win", but I know that systemd is a giant katamari ball, and it's absorbing subsystems so it may win by default. That isn't going to stop us from trying to do what we do, and share that with the world. >> then the issue of multiple manager daemons wouldn't really exist. > > True. But I'm running under the assumption that Ubuntu will stick with > upstart, and therefore yes I'll need a separate (perhaps pair of) > management daemons. > > Even if we were to switch to systemd, I'd like the API for userspace > programs to configure and use cgroups to be as generic as possible, > so that anyone who wanted to write their own daemon could do so. > > -serge -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: cgroup: status-quo and userland efforts
On Fri, Jun 28, 2013 at 8:05 AM, Michal Hocko wrote: > On Thu 27-06-13 22:01:38, Tejun Heo wrote: >> Oh, that in itself is not bad. I mean, if you're root, it's pretty >> easy to play with and that part is fine. But combined with the >> hierarchical nature of cgroup and file permissions, it encourages >> people to "deligate" subdirectories to less previledged domains, > > OK, this really depends on what you expose to non-root users. I have > seen use cases where admin prepares top-level which is root-only but > it allows creating sub-groups which are under _full_ control of the > subdomain. This worked nicely for memcg for example because hard limit, > oom handling and other knobs are hierarchical so the subdomain cannot > overwrite what admin has said. bingo > And the systemd, with its history of eating projects and not caring much > about their previous users who are not willing to jump in to the systemd > car, doesn't sound like a good place where to place the new interface to > me. +1 If systemd is the only upstream implementation of this single-agent idea, we will have to invent our own, and continue to diverge rather than converge. I think that, if we are going to pursue this model of a single-agent, we should make a kick-ass implementation that is flexible and scalable, and full-featured enough to not require divergence at the lowest layer of the stack. Then build systemd on top of that. Let systemd offer more features and policies and "semantic" APIs. We will build our own semantic APIs that are, necessarily, different from systemd. But we can all use the same low-level mechanism. Tim -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
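As an aside, the delegation pattern Michal describes needs no new machinery beyond ordinary file permissions. A minimal sketch under cgroup v1 semantics - the /sys/fs/cgroup/memory/app path, uid 1000 and the 512M limit are illustrative, and error handling is omitted:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Sketch of cgroup-v1 style delegation: root creates a memory cgroup,
 * pins the hard limit, then hands the directory to an unprivileged uid.
 * The user can create sub-groups and tune their knobs, but because the
 * hard limit is hierarchical, nothing below can exceed what root set. */
int main(void)
{
    const char *cg = "/sys/fs/cgroup/memory/app";   /* illustrative path */
    uid_t user = 1000;                              /* the delegated uid */
    char path[256];
    int fd;

    mkdir(cg, 0755);

    /* Root-owned hard limit; the subdomain cannot raise it. */
    snprintf(path, sizeof(path), "%s/memory.limit_in_bytes", cg);
    fd = open(path, O_WRONLY);
    if (fd >= 0) {
        write(fd, "536870912", strlen("536870912")); /* 512M */
        close(fd);
    }

    /* Hand the directory (and the task file) to the user so it can
     * create sub-groups and move its own tasks around. */
    chown(cg, user, user);
    snprintf(path, sizeof(path), "%s/cgroup.procs", cg);
    chown(path, user, user);

    return 0;
}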
Re: cgroup: status-quo and userland efforts
On Thu, Jun 27, 2013 at 2:04 PM, Tejun Heo wrote: > Hello, > > On Thu, Jun 27, 2013 at 01:46:18PM -0700, Tim Hockin wrote: >> So what you're saying is that you don't care that this new thing is >> less capable than the old thing, despite it having real impact. > > Sort of. I'm saying, at least up until now, moving away from > orthogonal hierarchy support seems to be the right trade-off. It all > depends on how you measure how much things are simplified and how > heavy the "real impacts" are. It's not like these things can be > determined white and black. Given the current situation, I think it's > the right call. I totally understand where you're coming from - trying to get back to a stable feature set. But it sucks to be on the losing end of that battle - you're cutting things that REALLY matter to us, and without a really viable alternative. So we'll keep fighting. >> If controller C is enabled at level X but disabled at level X/Y, does >> that mean that X/Y uses the limits set in X? How about X/Y/Z? > > Y and Y/Z wouldn't make any difference. Tasks belonging to them would > behave as if they belong to X as far as C is concerened. OK, that *sounds* sane. It doesn't solve all our problems, but it alleviates some of them. >> So take away some of the flexibility that has minimal impact and >> maximum return. Splitting threads across cgroups - we use it, but we >> could get off that. Force all-or-nothing joining of an aggregate > > Please do so. Splitting threads is sort of important for some cgroups, like CPU. I wonder if pjt is paying attention to this thread. >> construct (a container vs N cgroups). >> >> But perform surgery with a scalpel, not a hatchet. > > As anything else, it's drawing a line in a continuous spectrum of > grey. Right now, given that maintaining multiple orthogonal > hierarchies while introducing a proper concept of resource container > involves addition of completely new constructs and complexity, I don't > think that's a good option. If there are problems which can't be > resolved / worked around in a reasonable manner, please bring them up > along with their contexts. Let's examine them and see whether there > are other ways to accomodate them. You're arguing that the abstraction you want is that of a "container" but that it's easier to remove options than to actually build a better API. I think this is wrong. Take the opportunity to define the RIGHT interface that you WANT - a container. Implement it in terms of cgroups (and maybe other stuff!). Make that API so compelling that people want to use it, and your war of attrition on direct cgroup madness will be won, but with net progress rather than regress. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: cgroup access daemon
On Fri, Jun 28, 2013 at 9:31 AM, Serge Hallyn wrote: > Quoting Tim Hockin (thoc...@hockin.org): >> On Thu, Jun 27, 2013 at 11:11 AM, Serge Hallyn >> wrote: >> > Quoting Tim Hockin (thoc...@hockin.org): >> > >> >> For our use case this is a huge problem. We have people who access >> >> cgroup files in a fairly tight loops, polling for information. We >> >> have literally hundeds of jobs running on sub-second frequencies - >> >> plumbing all of that through a daemon is going to be a disaster. >> >> Either your daemon becomes a bottleneck, or we have to build something >> >> far more scalable than you really want to. Not to mention the >> >> inefficiency of inserting a layer. >> > >> > Currently you can trivially create a container which has the >> > container's cgroups bind-mounted to the expected places >> > (/sys/fs/cgroup/$controller) by uncommenting two lines in the >> > configuration file, and handle cgroups through cgroupfs there. >> > (This is what the management agent wants to be an alternative >> > for) The main deficiency there is that /proc/self/cgroups is >> > not filtered, so it will show /lxc/c1 for init's cgroup, while >> > the host's /sys/fs/cgroup/devices/lxc/c1/c1.real will be what >> > is seen under the container's /sys/fs/cgroup/devices (for >> > instance). Not ideal. >> >> I'm really saying that if your daemon is to provide a replacement for >> cgroupfs direct access, it needs to be designed to be scalable. If >> we're going to get away from bind mounting cgroupfs into user >> namespaces, then let's try to solve ALL the problems. >> >> >> We also need the ability to set up eventfds for users or to let them >> >> poll() on the socket from this daemon. >> > >> > So you'd want to be able to request updates when any cgroup value >> > is changed, right? >> >> Not necessarily ANY, but that's the terminus of this API facet. >> >> > That's currently not in my very limited set of commands, but I can >> > certainly add it, and yes it would be a simple unix sock so you can >> > set up eventfd, select/poll, etc. >> >> Assuming the protocol is basically a pass-through to basic filesystem >> ops, it should be pretty easy. You just need to add it to your >> protocol. >> >> But it brings up another point - access control. How do you decide >> which files a child agent should have access to? Does that ever >> change based on the child's configuration? In our world, the answer is >> almost certainly yes. > > Could you give examples? > > If you have a white/academic paper I should go read, that'd be great. We don't have anything on this, but examples may help. Someone running as root should be able to connect to the "native" daemon and read or write any cgroup file they want, right? You could argue that root should be able to do this to a child-daemon, too, but let's ignore that. But inside a container, I don't want the users to be able to write to anything in their own container. I do want them to be able to make sub-cgroups, but only 5 levels deep. For sub-cgroups, they should be able to write to memory.limit_in_bytes, to read but not write memory.soft_limit_in_bytes, and not be able to read memory.stat. To get even fancier, a user should be able to create a sub-cgroup and then designate that sub-cgroup as "final" - no further sub-sub-cgroups allowed under it. They should also be able to designate that a sub-cgroup is "one-way" - once a process enters it, it can not leave. These are real(ish) examples based on what people want to do today. 
In particular, the last couple are things that we want to do, but don't do today. The particular policy can differ per-container. Production jobs might be allowed to create sub-cgroups, but batch jobs are not. Some user jobs are designated "trusted" in one facet or another and get more (but still not full) access. > At the moment I'm going on the naive belief that proper hierarchy > controls will be enforced (eventually) by the kernel - i.e. if > a task in cgroup /lxc/c1 is not allowed to mknod /dev/sda1, then it > won't be possible for /lxc/c1/lxc/c2 to take that access. > > The native cgroup manager (the one using cgroupfs) will be checking > the credentials of the requesting child manager for access(2) to > the cgroup files. This might be sufficient, or the basis for a sufficient access control system for users. The problem comes that we have multiple jobs on a single machine running as the same user. We need to ensure that the jobs can not modify each other. >>
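To make the shape of such a policy concrete, here is a purely hypothetical sketch of the per-container data a manager daemon would need to hold - none of these structures or names exist anywhere; they only restate the examples above in code form:

/* Hypothetical per-container cgroup policy, restating the examples above.
 * Nothing here is an existing API; it only illustrates what a manager
 * daemon would have to be able to express. */
enum file_access { ACCESS_NONE, ACCESS_READ, ACCESS_READ_WRITE };

struct file_rule {
    const char *file;          /* e.g. "memory.limit_in_bytes" */
    enum file_access access;
};

struct cgroup_policy {
    int max_depth;             /* "sub-cgroups, but only 5 levels deep" */
    int may_create_subgroups;  /* production jobs: yes, batch jobs: no  */
    int may_mark_final;        /* "no further sub-sub-cgroups below"    */
    int may_mark_one_way;      /* "tasks may enter but never leave"     */
    const struct file_rule *rules;
    int nrules;
};

static const struct file_rule example_rules[] = {
    { "memory.limit_in_bytes",      ACCESS_READ_WRITE },
    { "memory.soft_limit_in_bytes", ACCESS_READ       },
    { "memory.stat",                ACCESS_NONE       },
};

static const struct cgroup_policy example_container_policy = {
    .max_depth            = 5,
    .may_create_subgroups = 1,
    .may_mark_final       = 1,
    .may_mark_one_way     = 1,
    .rules                = example_rules,
    .nrules               = 3,
};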
Re: cgroup: status-quo and userland efforts
On Thu, Jun 27, 2013 at 11:14 AM, Serge Hallyn wrote: > Quoting Tejun Heo (t...@kernel.org): >> Hello, Serge. >> >> On Thu, Jun 27, 2013 at 08:22:06AM -0500, Serge Hallyn wrote: >> > At some point (probably soon) we might want to talk about a standard API >> > for these things. However I think it will have to come in the form of >> > a standard library, which knows to either send requests over dbus to >> > systemd, or over /dev/cgroup sock to the manager. >> >> Yeah, eventually, I think we'll have a standardized way to configure >> resource distribution in the system. Maybe we'll agree on a >> standardized dbus protocol or there will be library, I don't know; >> however, whatever form it may be in, it abstraction level should be >> way higher than that of direct cgroupfs access. It's way too low >> level and very easy to end up in a complete nonsense configuration. >> >> e.g. enabling "cpu" on a cgroup whlie leaving other cgroups alone >> wouldn't enable fair scheduling on that cgroup but drastically reduce >> the amount of cpu share it gets as it now gets treated as single >> entity competing with all tasks at the parent level. > > Right. I *think* this can be offered as a daemon which sits as the > sole consumer of my agent's API, and offers a higher level "do what I > want" API. But designing that API is going to be interesting. This is something we have, partially, and are working to be able to open-source. We have a LOT of experience feeding into the semantics that actually make users happy. Today it leverages split-hierarchies, but that is not required in the generic case (only if you want to offer the semantics we do). It explicitly delegates some aspects of sub-cgroup control to users, but that could go away if your lowest-level agency can handle it. > I should find a good, up-to-date summary of the current behaviors of > each controller so I can talk more intelligently about it. (I'll > start by looking at the kernel Documentation/cgroups, but don't > feel too confident that they'll be uptodate :) > >> At the moment, I'm not sure what the eventual abstraction would look >> like. systemd is extending its basic constructs by adding slices and >> scopes and it does make sense to integrate the general organization of >> the system (services, user sessions, VMs and so on) with resource >> management. Given some time, I'm hoping we'll be able to come up with >> and agree on some common constructs so that each workload can indicate >> its resource requirements in a unified way. >> >> That said, I really think we should experiment for a while before >> trying to settle down on things. We've now just started exploring how >> system-wide resource managment can be made widely available to systems >> without requiring extremely specialized hand-crafted configurations >> and I'm pretty sure we're getting and gonna get quite a few details >> wrong, so I don't think it'd be a good idea to try to agree on things >> right now. As far as such integration goes, I think it's time to play >> with things and observe the results. > > Right, I'm not attached to my toy implementation at all - except for > the ability, in some fashion, to have nested agents which don't have > cgroupfs access but talk to another agent to get the job done. > > -serge -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
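The cpu example quoted above is easy to quantify. In this sketch the task counts are made up: a cgroup left at the default cpu.shares of 1024 competes as a single entity against every individual task at the parent level, so all of its tasks together receive roughly what one ungrouped task receives.

#include <stdio.h>

/* Illustrative numbers only: one cgroup with the default cpu.shares of
 * 1024 competing against N ungrouped tasks at the same level. All of the
 * cgroup's tasks combined get shares / (shares + N*1024) of the CPU,
 * no matter how many tasks the cgroup contains. */
int main(void)
{
    int ungrouped_tasks = 50;
    int cgroup_shares = 1024;      /* default cpu.shares */
    double total = cgroup_shares + ungrouped_tasks * 1024.0;

    printf("cgroup (all of its tasks combined): %.1f%% of CPU\n",
           100.0 * cgroup_shares / total);
    printf("each ungrouped task:                %.1f%% of CPU\n",
           100.0 * 1024.0 / total);
    return 0;
}

With 50 ungrouped tasks both lines print about 2%, which is the "drastically reduce the amount of cpu share" effect being described.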
Re: cgroup: status-quo and userland efforts
On Thu, Jun 27, 2013 at 10:38 AM, Tejun Heo wrote: > Hello, Tim. > > On Wed, Jun 26, 2013 at 08:42:21PM -0700, Tim Hockin wrote: >> OK, then what I don't know is what is the new interface? A new cgroupfs? > > It's gonna be a new mount option for cgroupfs. > >> DTF and CPU and cpuset all have "default" groups for some tasks (and >> not others) in our world today. DTF actually has default, prio, and >> "normal". I was simplifying before. I really wish it were as simple >> as you think it is. But if it were, do you think I'd still be >> arguing? > > How am I supposed to know when you don't communicate it but just wave > your hands saying it's all very complicated? The cpuset / blkcg > example is pretty bad because you can enforce any cpuset rules at the > leaves. Modifying hundreds of cgroups is really painful, and yes, we do it often enough to be able to see it. >> This really doesn't scale when I have thousands of jobs running. >> Being able to disable at some levels on some controllers probably >> helps some, but I can't say for sure without knowing the new interface > > How does the number of jobs affect it? Does each job create a new > cgroup? Well, in your model it does... >> We tried it in unified hierarchy. We had our Top People on the >> problem. The best we could get was bad enough that we embarked on a >> LITERAL 2 year transition to make it better. > > What didn't work? What part was so bad? I find it pretty difficult > to believe that multiple orthogonal hierarchies is the only possible > solution, so please elaborate the issues that you guys have > experienced. I'm looping in more Google people. > The hierarchy is for organization and enforcement of dynamic > hierarchical resource distribution and that's it. If its expressive > power is lacking, take compromise or tune the configuration according > to the workloads. The latter is necessary in workloads which have > clear distinction of foreground and background anyway - anything which > interacts with human beings including androids. So what you're saying is that you don't care that this new thing is less capable than the old thing, despite it having real impact. >> In other words, define a container as a set of cgroups, one under each >> each active controller type. A TID enters the container atomically, >> joining all of the cgroups or none of the cgroups. >> >> container C1 = { /cgroup/cpu/foo, /cgroup/memory/bar, >> /cgroup/io/default/foo/bar, /cgroup/cpuset/ >> >> This is an abstraction that we maintain in userspace (more or less) >> and we do actually have headaches from split hierarchies here >> (handling partial failures, non-atomic joins, etc) > > That'd separate out task organization from controllre config > hierarchies. Kay had a similar idea some time ago. I think it makes > things even more complex than it is right now. I'll continue on this > below. > >> I'm still a bit fuzzy - is all of this written somewhere? > > If you dig through cgroup ML, most are there. There'll be > "cgroup.controllers" file with which you can enable / disable > controllers. Enabling a controller in a cgroup implies that the > controller is enabled in all ancestors. Implies or requires? Cause or predicate? If controller C is enabled at level X but disabled at level X/Y, does that mean that X/Y uses the limits set in X? How about X/Y/Z? This will get rid of the bulk of the cpuset scaling problem, but not all of it. I think we still have the same problems with cpu as we do with io. Perhaps that should have been the example. 
>> It sounds like you're missing a layer of abstraction. Why not add the >> abstraction you want to expose on top of powerful primitives, instead >> of dumbing down the primitives? > > It sure would be possible build more and try to address the issues > we're seeing now; however, after looking at cgroups for some time now, > the underlying theme is failure to take reasonable trade-offs and > going for maximum flexibility in making each choice - the choice of > interface, multiple hierarchies, no restriction on hierarchical > behavior, splitting threads of the same process into separate cgroups, > semi-encouraging delegation through file permission without actually > pondering the consequences and so on. And each choice probably made > sense trying to serve each immediate requirement at the time but added > up it's a giant pile of mess which developed without direction. I am very sympathetic to this problem. You could have just described some of our internal problems too. The difference is that we are trying to make changes that provide more structure and boundaries in ways
Re: cgroup access daemon
On Thu, Jun 27, 2013 at 11:11 AM, Serge Hallyn wrote: > Quoting Tim Hockin (thoc...@hockin.org): > >> For our use case this is a huge problem. We have people who access >> cgroup files in a fairly tight loops, polling for information. We >> have literally hundeds of jobs running on sub-second frequencies - >> plumbing all of that through a daemon is going to be a disaster. >> Either your daemon becomes a bottleneck, or we have to build something >> far more scalable than you really want to. Not to mention the >> inefficiency of inserting a layer. > > Currently you can trivially create a container which has the > container's cgroups bind-mounted to the expected places > (/sys/fs/cgroup/$controller) by uncommenting two lines in the > configuration file, and handle cgroups through cgroupfs there. > (This is what the management agent wants to be an alternative > for) The main deficiency there is that /proc/self/cgroups is > not filtered, so it will show /lxc/c1 for init's cgroup, while > the host's /sys/fs/cgroup/devices/lxc/c1/c1.real will be what > is seen under the container's /sys/fs/cgroup/devices (for > instance). Not ideal. I'm really saying that if your daemon is to provide a replacement for cgroupfs direct access, it needs to be designed to be scalable. If we're going to get away from bind mounting cgroupfs into user namespaces, then let's try to solve ALL the problems. >> We also need the ability to set up eventfds for users or to let them >> poll() on the socket from this daemon. > > So you'd want to be able to request updates when any cgroup value > is changed, right? Not necessarily ANY, but that's the terminus of this API facet. > That's currently not in my very limited set of commands, but I can > certainly add it, and yes it would be a simple unix sock so you can > set up eventfd, select/poll, etc. Assuming the protocol is basically a pass-through to basic filesystem ops, it should be pretty easy. You just need to add it to your protocol. But it brings up another point - access control. How do you decide which files a child agent should have access to? Does that ever change based on the child's configuration? In our world, the answer is almost certainly yes. >> >> > So then the idea would be that userspace (like libvirt and lxc) would >> >> > talk over /dev/cgroup to its manager. Userspace inside a container >> >> > (which can't actually mount cgroups itself) would talk to its own >> >> > manager which is talking over a passed-in socket to the host manager, >> >> > which in turn runs natively (uses cgroupfs, and nests "create /c1" under >> >> > the requestor's cgroup). >> >> >> >> How do you handle updates of this agent? Suppose I have hundreds of >> >> running containers, and I want to release a new version of the cgroupd >> >> ? >> > >> > This may change (which is part of what I want to investigate with some >> > POC), but right now I'm building any controller-aware smarts into it. I >> > think that's what you're asking about? The agent doesn't do "slices" >> > etc. This may turn out to be insufficient, we'll see. >> >> No, what I am asking is a release-engineering problem. Suppose we >> need to roll out a new version of this daemon (some new feature or a >> bug or something). We have hundreds of these "child" agents running >> in the job containers. > > When I say "container" I mean an lxc container, with it's own isolated > rootfs and mntns. I'm not sure what your "containers" are, but I if > they're not that, then they shouldn't need to run a child agent. 
They > can just talk over the host cgroup agent's socket. If they talk over the host agent's socket, where is the access control and restriction done? Who decides how deep I can nest groups? Who says which files I may access? Who stops me from modifying someone else's container? Our containers are somewhat thinner and more managed than LXC, but not that much. If we're running a system agent in a user container, we need to manage that software. We can't just start up a version and leave it running until the user decides to upgrade - we force upgrades. >> How do I bring down all these children, and then bring them back up on >> a new version in a way that does not disrupt user jobs (much)? >> >> Similarly, what happens when one of these child agents crashes? Does >> someone restart it? Do user jobs just stop working? > > An upstart^W$init_system job will restart it... What happens when the main agent crashes? All those children on UNIX sockets need to reconnect, I guess.
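For reference, the eventfd facility Tim mentions already exists for cgroup v1 memory thresholds via cgroup.event_control. A sketch of that flow follows - the cgroup path and the 256M threshold are illustrative, and error handling is abbreviated:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Sketch of cgroup-v1 memory threshold notification: register an eventfd
 * against memory.usage_in_bytes through cgroup.event_control, then block
 * on the eventfd instead of polling the file in a tight loop. */
int main(void)
{
    const char *cg = "/sys/fs/cgroup/memory/mygroup";  /* illustrative */
    char buf[256], ctrl_path[256], usage_path[256];
    uint64_t count;
    int efd, ufd, cfd;

    efd = eventfd(0, 0);
    snprintf(usage_path, sizeof(usage_path), "%s/memory.usage_in_bytes", cg);
    ufd = open(usage_path, O_RDONLY);
    snprintf(ctrl_path, sizeof(ctrl_path), "%s/cgroup.event_control", cg);
    cfd = open(ctrl_path, O_WRONLY);
    if (efd < 0 || ufd < 0 || cfd < 0)
        return 1;

    /* Format: "<eventfd> <fd of memory.usage_in_bytes> <threshold in bytes>" */
    snprintf(buf, sizeof(buf), "%d %d %llu", efd, ufd, 256ULL * 1024 * 1024);
    write(cfd, buf, strlen(buf));

    /* Blocks until usage crosses the 256M threshold. */
    read(efd, &count, sizeof(count));
    printf("threshold crossed\n");
    return 0;
}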
cgroup access daemon
Changing the subject, so as not to mix two discussions On Thu, Jun 27, 2013 at 9:18 AM, Serge Hallyn wrote: > >> > FWIW, the code is too embarassing yet to see daylight, but I'm playing >> > with a very lowlevel cgroup manager which supports nesting itself. >> > Access in this POC is low-level ("set freezer.state to THAWED for cgroup >> > /c1/c2", "Create /c3"), but the key feature is that it can run in two >> > modes - native mode in which it uses cgroupfs, and child mode where it >> > talks to a parent manager to make the changes. >> >> In this world, are users able to read cgroup files, or do they have to >> go through a central agent, too? > > The agent won't itself do anything to stop access through cgroupfs, but > the idea would be that cgroupfs would only be mounted in the agent's > mntns. My hope would be that the libcgroup commands (like cgexec, > cgcreate, etc) would know to talk to the agent when possible, and users > would use those. For our use case this is a huge problem. We have people who access cgroup files in a fairly tight loops, polling for information. We have literally hundeds of jobs running on sub-second frequencies - plumbing all of that through a daemon is going to be a disaster. Either your daemon becomes a bottleneck, or we have to build something far more scalable than you really want to. Not to mention the inefficiency of inserting a layer. We also need the ability to set up eventfds for users or to let them poll() on the socket from this daemon. >> > So then the idea would be that userspace (like libvirt and lxc) would >> > talk over /dev/cgroup to its manager. Userspace inside a container >> > (which can't actually mount cgroups itself) would talk to its own >> > manager which is talking over a passed-in socket to the host manager, >> > which in turn runs natively (uses cgroupfs, and nests "create /c1" under >> > the requestor's cgroup). >> >> How do you handle updates of this agent? Suppose I have hundreds of >> running containers, and I want to release a new version of the cgroupd >> ? > > This may change (which is part of what I want to investigate with some > POC), but right now I'm building any controller-aware smarts into it. I > think that's what you're asking about? The agent doesn't do "slices" > etc. This may turn out to be insufficient, we'll see. No, what I am asking is a release-engineering problem. Suppose we need to roll out a new version of this daemon (some new feature or a bug or something). We have hundreds of these "child" agents running in the job containers. How do I bring down all these children, and then bring them back up on a new version in a way that does not disrupt user jobs (much)? Similarly, what happens when one of these child agents crashes? Does someone restart it? Do user jobs just stop working? > > So the only state which the agent stores is a list of cgroup mounts (if > in native mode) or an open socket to the parent (if in child mode), and a > list of connected children sockets. > > HUPping the agent will cause it to reload the cgroupfs mounts (in case > you've mounted a new controller, living in "the old world" :). If you > just kill it and start a new one, it shouldn't matter. > >> (note: inquiries about the implementation do not denote acceptance of >> the model :) > > To put it another way, the problem I'm solving (for now) is not the "I > want a daemon to ensure that requested guarantees are correctly > implemented." In that sense I'm maintaining the status quo, i.e. 
the > admin needs to architect the layout correctly. > > The problem I'm solving is really that I want containers to be able to > handle cgroups even if they can't mount cgroupfs, and I want all > userspace to be able to behave the same whether they are in a container > or not. > > This isn't meant as a poke in the eye of anyone who wants to address the > other problem. If it turns out that we (meaning "the community of > cgroup users") really want such an agent, then we can add that. I'm not > convinced. > > What would probably be a better design, then, would be that the agent > I'm working on can plug into a resource allocation agent. Or, I > suppose, the other way around. > >> > At some point (probably soon) we might want to talk about a standard API >> > for these things. However I think it will have to come in the form of >> > a standard library, which knows to either send requests over dbus to >> > systemd, or over /dev/cgroup sock to the manager. >> > >> > -serge -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
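Since the agent's wire protocol is only hinted at above, the following client is purely hypothetical - the /dev/cgroup socket path and the "SET <cgroup> <file> <value>" line format are invented for illustration - but it shows how thin a pass-through for a request like "set freezer.state to THAWED for cgroup /c1/c2" can stay:

#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Hypothetical client for a low-level cgroup manager socket. The request
 * format and socket path are invented; the point is only that the layer
 * remains a thin pass-through to cgroupfs operations. */
static int cgroup_mgr_set(const char *cgroup, const char *file, const char *value)
{
    struct sockaddr_un addr = { .sun_family = AF_UNIX };
    char req[512], reply[128];
    int fd, n;

    strncpy(addr.sun_path, "/dev/cgroup", sizeof(addr.sun_path) - 1);
    fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;
    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        close(fd);
        return -1;
    }

    snprintf(req, sizeof(req), "SET %s %s %s\n", cgroup, file, value);
    write(fd, req, strlen(req));

    n = read(fd, reply, sizeof(reply) - 1);   /* expect e.g. "OK\n" */
    close(fd);
    return (n > 0 && strncmp(reply, "OK", 2) == 0) ? 0 : -1;
}

int main(void)
{
    /* e.g. "set freezer.state to THAWED for cgroup /c1/c2" */
    return cgroup_mgr_set("/c1/c2", "freezer.state", "THAWED");
}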
Re: cgroup: status-quo and userland efforts
On Thu, Jun 27, 2013 at 6:22 AM, Serge Hallyn wrote: > Quoting Mike Galbraith (bitbuc...@online.de): >> On Wed, 2013-06-26 at 14:20 -0700, Tejun Heo wrote: >> > Hello, Tim. >> > >> > On Mon, Jun 24, 2013 at 09:07:47PM -0700, Tim Hockin wrote: >> > > I really want to understand why this is SO IMPORTANT that you have to >> > > break userspace compatibility? I mean, isn't Linux supposed to be the >> > > OS with the stable kernel interface? I've seen Linus rant time and >> > > time again about this - why is it OK now? >> > >> > What the hell are you talking about? Nobody is breaking userland >> > interface. A new version of interface is being phased in and the old >> > one will stay there for the foreseeable future. It will be phased out >> > eventually but that's gonna take a long time and it will have to be >> > something hardly noticeable. Of course new features will only be >> > available with the new interface and there will be efforts to nudge >> > people away from the old one but the existing interface will keep >> > working it does. >> >> I can understand some alarm. When I saw the below I started frothing at >> the face and howling at the moon, and I don't even use the things much. >> >> http://lists.freedesktop.org/archives/systemd-devel/2013-June/011521.html >> >> Hierarchy layout aside, that "private property" bit says that the folks >> who currently own and use the cgroups interface will lose direct access >> to it. I can imagine folks who have become dependent upon an on the fly >> management agents of their own design becoming a tad alarmed. > > FWIW, the code is too embarassing yet to see daylight, but I'm playing > with a very lowlevel cgroup manager which supports nesting itself. > Access in this POC is low-level ("set freezer.state to THAWED for cgroup > /c1/c2", "Create /c3"), but the key feature is that it can run in two > modes - native mode in which it uses cgroupfs, and child mode where it > talks to a parent manager to make the changes. In this world, are users able to read cgroup files, or do they have to go through a central agent, too? > So then the idea would be that userspace (like libvirt and lxc) would > talk over /dev/cgroup to its manager. Userspace inside a container > (which can't actually mount cgroups itself) would talk to its own > manager which is talking over a passed-in socket to the host manager, > which in turn runs natively (uses cgroupfs, and nests "create /c1" under > the requestor's cgroup). How do you handle updates of this agent? Suppose I have hundreds of running containers, and I want to release a new version of the cgroupd ? (note: inquiries about the implementation do not denote acceptance of the model :) > At some point (probably soon) we might want to talk about a standard API > for these things. However I think it will have to come in the form of > a standard library, which knows to either send requests over dbus to > systemd, or over /dev/cgroup sock to the manager. > > -serge -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: cgroup: status-quo and userland efforts
On Thu, Jun 27, 2013 at 6:22 AM, Serge Hallyn serge.hal...@ubuntu.com wrote: Quoting Mike Galbraith (bitbuc...@online.de): On Wed, 2013-06-26 at 14:20 -0700, Tejun Heo wrote: Hello, Tim. On Mon, Jun 24, 2013 at 09:07:47PM -0700, Tim Hockin wrote: I really want to understand why this is SO IMPORTANT that you have to break userspace compatibility? I mean, isn't Linux supposed to be the OS with the stable kernel interface? I've seen Linus rant time and time again about this - why is it OK now? What the hell are you talking about? Nobody is breaking userland interface. A new version of interface is being phased in and the old one will stay there for the foreseeable future. It will be phased out eventually but that's gonna take a long time and it will have to be something hardly noticeable. Of course new features will only be available with the new interface and there will be efforts to nudge people away from the old one but the existing interface will keep working it does. I can understand some alarm. When I saw the below I started frothing at the face and howling at the moon, and I don't even use the things much. http://lists.freedesktop.org/archives/systemd-devel/2013-June/011521.html Hierarchy layout aside, that private property bit says that the folks who currently own and use the cgroups interface will lose direct access to it. I can imagine folks who have become dependent upon an on the fly management agents of their own design becoming a tad alarmed. FWIW, the code is too embarassing yet to see daylight, but I'm playing with a very lowlevel cgroup manager which supports nesting itself. Access in this POC is low-level (set freezer.state to THAWED for cgroup /c1/c2, Create /c3), but the key feature is that it can run in two modes - native mode in which it uses cgroupfs, and child mode where it talks to a parent manager to make the changes. In this world, are users able to read cgroup files, or do they have to go through a central agent, too? So then the idea would be that userspace (like libvirt and lxc) would talk over /dev/cgroup to its manager. Userspace inside a container (which can't actually mount cgroups itself) would talk to its own manager which is talking over a passed-in socket to the host manager, which in turn runs natively (uses cgroupfs, and nests create /c1 under the requestor's cgroup). How do you handle updates of this agent? Suppose I have hundreds of running containers, and I want to release a new version of the cgroupd ? (note: inquiries about the implementation do not denote acceptance of the model :) At some point (probably soon) we might want to talk about a standard API for these things. However I think it will have to come in the form of a standard library, which knows to either send requests over dbus to systemd, or over /dev/cgroup sock to the manager. -serge -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
cgroup access daemon
Changing the subject, so as not to mix two discussions On Thu, Jun 27, 2013 at 9:18 AM, Serge Hallyn serge.hal...@ubuntu.com wrote: FWIW, the code is too embarassing yet to see daylight, but I'm playing with a very lowlevel cgroup manager which supports nesting itself. Access in this POC is low-level (set freezer.state to THAWED for cgroup /c1/c2, Create /c3), but the key feature is that it can run in two modes - native mode in which it uses cgroupfs, and child mode where it talks to a parent manager to make the changes. In this world, are users able to read cgroup files, or do they have to go through a central agent, too? The agent won't itself do anything to stop access through cgroupfs, but the idea would be that cgroupfs would only be mounted in the agent's mntns. My hope would be that the libcgroup commands (like cgexec, cgcreate, etc) would know to talk to the agent when possible, and users would use those. For our use case this is a huge problem. We have people who access cgroup files in a fairly tight loops, polling for information. We have literally hundeds of jobs running on sub-second frequencies - plumbing all of that through a daemon is going to be a disaster. Either your daemon becomes a bottleneck, or we have to build something far more scalable than you really want to. Not to mention the inefficiency of inserting a layer. We also need the ability to set up eventfds for users or to let them poll() on the socket from this daemon. So then the idea would be that userspace (like libvirt and lxc) would talk over /dev/cgroup to its manager. Userspace inside a container (which can't actually mount cgroups itself) would talk to its own manager which is talking over a passed-in socket to the host manager, which in turn runs natively (uses cgroupfs, and nests create /c1 under the requestor's cgroup). How do you handle updates of this agent? Suppose I have hundreds of running containers, and I want to release a new version of the cgroupd ? This may change (which is part of what I want to investigate with some POC), but right now I'm building any controller-aware smarts into it. I think that's what you're asking about? The agent doesn't do slices etc. This may turn out to be insufficient, we'll see. No, what I am asking is a release-engineering problem. Suppose we need to roll out a new version of this daemon (some new feature or a bug or something). We have hundreds of these child agents running in the job containers. How do I bring down all these children, and then bring them back up on a new version in a way that does not disrupt user jobs (much)? Similarly, what happens when one of these child agents crashes? Does someone restart it? Do user jobs just stop working? So the only state which the agent stores is a list of cgroup mounts (if in native mode) or an open socket to the parent (if in child mode), and a list of connected children sockets. HUPping the agent will cause it to reload the cgroupfs mounts (in case you've mounted a new controller, living in the old world :). If you just kill it and start a new one, it shouldn't matter. (note: inquiries about the implementation do not denote acceptance of the model :) To put it another way, the problem I'm solving (for now) is not the I want a daemon to ensure that requested guarantees are correctly implemented. In that sense I'm maintaining the status quo, i.e. the admin needs to architect the layout correctly. 
The problem I'm solving is really that I want containers to be able to handle cgroups even if they can't mount cgroupfs, and I want all userspace to be able to behave the same whether they are in a container or not. This isn't meant as a poke in the eye of anyone who wants to address the other problem. If it turns out that we (meaning the community of cgroup users) really want such an agent, then we can add that. I'm not convinced. What would probably be a better design, then, would be that the agent I'm working on can plug into a resource allocation agent. Or, I suppose, the other way around. At some point (probably soon) we might want to talk about a standard API for these things. However I think it will have to come in the form of a standard library, which knows to either send requests over dbus to systemd, or over /dev/cgroup sock to the manager. -serge
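For scale, the monitoring pattern Tim is worried about is nothing more than a cheap read of a cgroupfs file in a loop; any daemon-mediated design would have to match that cost or replace each sample with a socket round trip. A rough sketch, with illustrative paths:

    # Sub-second polling against cgroupfs directly; a mediating daemon would
    # turn the open/read into a request/response per sample.
    import time

    def poll_memory_usage(cg="/sys/fs/cgroup/memory/jobs/job42", period=0.1):
        while True:
            with open(cg + "/memory.usage_in_bytes") as f:
                usage = int(f.read())
            # ... feed usage into the job's own control loop ...
            time.sleep(period)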
Re: cgroup access daemon
On Thu, Jun 27, 2013 at 11:11 AM, Serge Hallyn serge.hal...@ubuntu.com wrote: Quoting Tim Hockin (thoc...@hockin.org): For our use case this is a huge problem. We have people who access cgroup files in a fairly tight loops, polling for information. We have literally hundeds of jobs running on sub-second frequencies - plumbing all of that through a daemon is going to be a disaster. Either your daemon becomes a bottleneck, or we have to build something far more scalable than you really want to. Not to mention the inefficiency of inserting a layer. Currently you can trivially create a container which has the container's cgroups bind-mounted to the expected places (/sys/fs/cgroup/$controller) by uncommenting two lines in the configuration file, and handle cgroups through cgroupfs there. (This is what the management agent wants to be an alternative for) The main deficiency there is that /proc/self/cgroups is not filtered, so it will show /lxc/c1 for init's cgroup, while the host's /sys/fs/cgroup/devices/lxc/c1/c1.real will be what is seen under the container's /sys/fs/cgroup/devices (for instance). Not ideal. I'm really saying that if your daemon is to provide a replacement for cgroupfs direct access, it needs to be designed to be scalable. If we're going to get away from bind mounting cgroupfs into user namespaces, then let's try to solve ALL the problems. We also need the ability to set up eventfds for users or to let them poll() on the socket from this daemon. So you'd want to be able to request updates when any cgroup value is changed, right? Not necessarily ANY, but that's the terminus of this API facet. That's currently not in my very limited set of commands, but I can certainly add it, and yes it would be a simple unix sock so you can set up eventfd, select/poll, etc. Assuming the protocol is basically a pass-through to basic filesystem ops, it should be pretty easy. You just need to add it to your protocol. But it brings up another point - access control. How do you decide which files a child agent should have access to? Does that ever change based on the child's configuration? In our world, the answer is almost certainly yes. So then the idea would be that userspace (like libvirt and lxc) would talk over /dev/cgroup to its manager. Userspace inside a container (which can't actually mount cgroups itself) would talk to its own manager which is talking over a passed-in socket to the host manager, which in turn runs natively (uses cgroupfs, and nests create /c1 under the requestor's cgroup). How do you handle updates of this agent? Suppose I have hundreds of running containers, and I want to release a new version of the cgroupd ? This may change (which is part of what I want to investigate with some POC), but right now I'm building any controller-aware smarts into it. I think that's what you're asking about? The agent doesn't do slices etc. This may turn out to be insufficient, we'll see. No, what I am asking is a release-engineering problem. Suppose we need to roll out a new version of this daemon (some new feature or a bug or something). We have hundreds of these child agents running in the job containers. When I say container I mean an lxc container, with it's own isolated rootfs and mntns. I'm not sure what your containers are, but I if they're not that, then they shouldn't need to run a child agent. They can just talk over the host cgroup agent's socket. If they talk over the host agent's socket, where is the access control and restriction done? Who decides how deep I can nest groups? 
Who says which files I may access? Who stops me from modifying someone else's container? Our containers are somewhat thinner and more managed than LXC, but not that much. If we're running a system agent in a user container, we need to manage that software. We can't just start up a version and leave it running until the user decides to upgrade - we force upgrades. How do I bring down all these children, and then bring them back up on a new version in a way that does not disrupt user jobs (much)? Similarly, what happens when one of these child agents crashes? Does someone restart it? Do user jobs just stop working? An upstart^W$init_system job will restart it... What happens when the main agent crashes? All those children on UNIX sockets need to reconnect, I guess. This means your UNIX socket needs to be a named socket, not just a socketpair(), making your auth model more complicated. What happens when the main agent hangs? Is someone health-checking it? How about all the child daemons? I guess my main point is that this SOUNDS like a simple project, but if you just do the simple obvious things, it will be woefully inadequate for anything but simple use-cases. If we get forced into such a model (and there are some good reasons to do it, even disregarding all the other chatter), we'd rather use
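The eventfd plumbing Tim wants preserved already exists for direct cgroupfs users in the v1 memory controller (see Documentation/cgroups/memory.txt): an eventfd is registered against a control file through cgroup.event_control and then blocked on. A minimal sketch, assuming a v1 memory cgroup at an illustrative path; os.eventfd() needs Python 3.10+ and the caller needs write access to the cgroup.

    # Register for OOM notifications on one cgroup and wait for an event.
    import os

    cg = "/sys/fs/cgroup/memory/jobs/job42"            # hypothetical cgroup
    efd = os.eventfd(0)
    oom_fd = os.open(os.path.join(cg, "memory.oom_control"), os.O_RDONLY)
    with open(os.path.join(cg, "cgroup.event_control"), "w") as f:
        f.write(f"{efd} {oom_fd}")                     # "<event_fd> <control_fd>"
    os.read(efd, 8)                                    # blocks until an OOM event fires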
Re: cgroup: status-quo and userland efforts
On Thu, Jun 27, 2013 at 10:38 AM, Tejun Heo t...@kernel.org wrote: Hello, Tim. On Wed, Jun 26, 2013 at 08:42:21PM -0700, Tim Hockin wrote: OK, then what I don't know is what is the new interface? A new cgroupfs? It's gonna be a new mount option for cgroupfs. DTF and CPU and cpuset all have default groups for some tasks (and not others) in our world today. DTF actually has default, prio, and normal. I was simplifying before. I really wish it were as simple as you think it is. But if it were, do you think I'd still be arguing? How am I supposed to know when you don't communicate it but just wave your hands saying it's all very complicated? The cpuset / blkcg example is pretty bad because you can enforce any cpuset rules at the leaves. Modifying hundreds of cgroups is really painful, and yes, we do it often enough to be able to see it. This really doesn't scale when I have thousands of jobs running. Being able to disable at some levels on some controllers probably helps some, but I can't say for sure without knowing the new interface How does the number of jobs affect it? Does each job create a new cgroup? Well, in your model it does... We tried it in unified hierarchy. We had our Top People on the problem. The best we could get was bad enough that we embarked on a LITERAL 2 year transition to make it better. What didn't work? What part was so bad? I find it pretty difficult to believe that multiple orthogonal hierarchies is the only possible solution, so please elaborate the issues that you guys have experienced. I'm looping in more Google people. The hierarchy is for organization and enforcement of dynamic hierarchical resource distribution and that's it. If its expressive power is lacking, take compromise or tune the configuration according to the workloads. The latter is necessary in workloads which have clear distinction of foreground and background anyway - anything which interacts with human beings including androids. So what you're saying is that you don't care that this new thing is less capable than the old thing, despite it having real impact. In other words, define a container as a set of cgroups, one under each each active controller type. A TID enters the container atomically, joining all of the cgroups or none of the cgroups. container C1 = { /cgroup/cpu/foo, /cgroup/memory/bar, /cgroup/io/default/foo/bar, /cgroup/cpuset/ This is an abstraction that we maintain in userspace (more or less) and we do actually have headaches from split hierarchies here (handling partial failures, non-atomic joins, etc) That'd separate out task organization from controllre config hierarchies. Kay had a similar idea some time ago. I think it makes things even more complex than it is right now. I'll continue on this below. I'm still a bit fuzzy - is all of this written somewhere? If you dig through cgroup ML, most are there. There'll be cgroup.controllers file with which you can enable / disable controllers. Enabling a controller in a cgroup implies that the controller is enabled in all ancestors. Implies or requires? Cause or predicate? If controller C is enabled at level X but disabled at level X/Y, does that mean that X/Y uses the limits set in X? How about X/Y/Z? This will get rid of the bulk of the cpuset scaling problem, but not all of it. I think we still have the same problems with cpu as we do with io. Perhaps that should have been the example. It sounds like you're missing a layer of abstraction. 
Why not add the abstraction you want to expose on top of powerful primitives, instead of dumbing down the primitives? It sure would be possible to build more and try to address the issues we're seeing now; however, after looking at cgroups for some time now, the underlying theme is failure to take reasonable trade-offs and going for maximum flexibility in making each choice - the choice of interface, multiple hierarchies, no restriction on hierarchical behavior, splitting threads of the same process into separate cgroups, semi-encouraging delegation through file permission without actually pondering the consequences and so on. And each choice probably made sense trying to serve each immediate requirement at the time but added up it's a giant pile of mess which developed without direction. I am very sympathetic to this problem. You could have just described some of our internal problems too. The difference is that we are trying to make changes that provide more structure and boundaries in ways that retain the fundamental power, without tossing out the baby with the bathwater. So, at this point, I'm very skeptical about adding more flexibility. Once the basics are settled, we sure can look into the missing pieces but I don't think that's what we should be doing right now. Another thing is that the unified hierarchy can be implemented by using most of the constructs cgroup core already has in more
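The split-hierarchy headache Tim alludes to (partial failures, non-atomic joins) is easy to see in a sketch of that userspace container abstraction: joining means writing the TID into one tasks file per hierarchy, and the kernel gives no way to make the set of writes atomic. Paths and names below are illustrative, not anyone's actual implementation.

    # Userspace "container" = one cgroup per mounted hierarchy; enter() joins
    # them all or rolls back by hand, since there is no atomic multi-hierarchy join.
    import os

    class Container:
        def __init__(self, cgroups):
            # e.g. {"cpu": "/sys/fs/cgroup/cpu/foo", "memory": "/sys/fs/cgroup/memory/bar"}
            self.cgroups = cgroups

        def enter(self, tid, old_paths):
            joined = []
            try:
                for ctrl, path in self.cgroups.items():
                    with open(os.path.join(path, "tasks"), "w") as f:
                        f.write(str(tid))
                    joined.append(ctrl)
            except OSError:
                # Best-effort rollback; the half-joined window is still visible
                # to every controller, which is exactly the headache described.
                for ctrl in joined:
                    with open(os.path.join(old_paths[ctrl], "tasks"), "w") as f:
                        f.write(str(tid))
                raise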
Re: cgroup: status-quo and userland efforts
On Thu, Jun 27, 2013 at 11:14 AM, Serge Hallyn serge.hal...@ubuntu.com wrote: Quoting Tejun Heo (t...@kernel.org): Hello, Serge. On Thu, Jun 27, 2013 at 08:22:06AM -0500, Serge Hallyn wrote: At some point (probably soon) we might want to talk about a standard API for these things. However I think it will have to come in the form of a standard library, which knows to either send requests over dbus to systemd, or over /dev/cgroup sock to the manager. Yeah, eventually, I think we'll have a standardized way to configure resource distribution in the system. Maybe we'll agree on a standardized dbus protocol or there will be a library, I don't know; however, whatever form it may be in, its abstraction level should be way higher than that of direct cgroupfs access. It's way too low level and very easy to end up in a complete nonsense configuration. e.g. enabling cpu on a cgroup while leaving other cgroups alone wouldn't enable fair scheduling on that cgroup but drastically reduce the amount of cpu share it gets as it now gets treated as a single entity competing with all tasks at the parent level. Right. I *think* this can be offered as a daemon which sits as the sole consumer of my agent's API, and offers a higher-level "do what I want" API. But designing that API is going to be interesting. This is something we have, partially, and are working to be able to open-source. We have a LOT of experience feeding into the semantics that actually make users happy. Today it leverages split hierarchies, but that is not required in the generic case (only if you want to offer the semantics we do). It explicitly delegates some aspects of sub-cgroup control to users, but that could go away if your lowest-level agency can handle it. I should find a good, up-to-date summary of the current behaviors of each controller so I can talk more intelligently about it. (I'll start by looking at the kernel Documentation/cgroups, but don't feel too confident that they'll be up to date :) At the moment, I'm not sure what the eventual abstraction would look like. systemd is extending its basic constructs by adding slices and scopes and it does make sense to integrate the general organization of the system (services, user sessions, VMs and so on) with resource management. Given some time, I'm hoping we'll be able to come up with and agree on some common constructs so that each workload can indicate its resource requirements in a unified way. That said, I really think we should experiment for a while before trying to settle down on things. We've now just started exploring how system-wide resource management can be made widely available to systems without requiring extremely specialized hand-crafted configurations and I'm pretty sure we're getting and gonna get quite a few details wrong, so I don't think it'd be a good idea to try to agree on things right now. As far as such integration goes, I think it's time to play with things and observe the results. Right, I'm not attached to my toy implementation at all - except for the ability, in some fashion, to have nested agents which don't have cgroupfs access but talk to another agent to get the job done. -serge
Re: cgroup: status-quo and userland efforts
On Wed, Jun 26, 2013 at 6:04 PM, Tejun Heo wrote: > Hello, > > On Wed, Jun 26, 2013 at 05:06:02PM -0700, Tim Hockin wrote: >> The first assertion, as I understood, was that (eventually) cgroupfs >> will not allow split hierarchies - that unified hierarchy would be the >> only mode. Is that not the case? > > No, unified hierarchy would be an optional thing for quite a while. > >> The second assertion, as I understood, was that (eventually) cgroupfs >> would not support granting access to some cgroup control files to >> users (through chown/chmod). Is that not the case? > > Again, it'll be an opt-in thing. The hierarchy controller would be > able to notice that and issue warnings if it wants to. > >> Hmm, so what exactly is changing then? If, as you say here, the >> existing interfaces will keep working - what is changing? > > New interface is being added and new features will be added only for > the new interface. The old one will eventually be deprecated and > removed, but that *years* away. OK, then what I don't know is what is the new interface? A new cgroupfs? >> As I said, it's controlled delegated access. And we have some patches >> that we carry to prevent some of these DoS situations. > > I don't know. You can probably hack around some of the most serious > problems but the whole thing isn't built for proper delgation and > that's not the direction the upstream kernel is headed at the moment. > >> I actually can not speak to the details of the default IO problem, as >> it happened before I really got involved. But just think through it. >> If one half of the split has 5 processes running and the other half >> has 200, the processes in the 200 set each get FAR less spindle time >> than those in the 5 set. That is NOT the semantic we need. We're >> trying to offer ~equal access for users of the non-DTF class of jobs. >> >> This is not the tail doing the wagging. This is your assertion that >> something should work, when it just doesn't. We have two, totally >> orthogonal classes of applications on two totally disjoint sets of >> resources. Conjoining them is the wrong answer. > > As I've said multiple times, there sure are things that you cannot > achieve without orthogonal multiple hierarchies, but given the options > we have at hands, compromising inside a unified hierarchy seems like > the best trade-off. Please take a step back from the immediate detail > and think of the general hierarchical organization of workloads. If > DTF / non-DTF is a fundamental part of your workload classfication, > that should go above. DTF and CPU and cpuset all have "default" groups for some tasks (and not others) in our world today. DTF actually has default, prio, and "normal". I was simplifying before. I really wish it were as simple as you think it is. But if it were, do you think I'd still be arguing? > I don't really understand your example anyway because you can classify > by DTF / non-DTF first and then just propagate cpuset settings along. > You won't lose anything that way, right? This really doesn't scale when I have thousands of jobs running. Being able to disable at some levels on some controllers probably helps some, but I can't say for sure without knowing the new interface > Again, in general, you might not be able to achieve *exactly* what > you've been doing, but, an acceptable compromise should be possible > and not doing so leads to complete mess. We tried it in unified hierarchy. We had our Top People on the problem. 
The best we could get was bad enough that we embarked on a LITERAL 2 year transition to make it better. >> > But I don't follow the conclusion here. For short term workaround, >> > sure, but having that dictate the whole architecture decision seems >> > completely backwards to me. >> >> My point is that the orthogonality of resources is intrinsic. Letting >> "it's hard to make it work" dictate the architecture is what's >> backwards. > > No, it's not "it's hard to make it work". It's more "it's > fundamentally broken". You can't identify a resource to be belonging > to a cgroup independent of who's looking at the resource. What if you could ensure that for a given TID (or PID if required) in dir X of controller C, all of the other TIDs in that cgroup were in the same group, but maybe not the same sub-path, under every controller? This gives you what it sounds like you wanted elsewhere - a container abstraction. In other words, define a container as a set of cgroups, one under each each active controller type. A TID enters the container atomically, joining all of the cgroups or none of the cgroups. cont
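The "ignore subtrees depending on controllers" idea being discussed here is the per-subtree granularity that eventually shipped, long after this thread, as cgroup.subtree_control in the cgroup v2 unified hierarchy: where a controller is not enabled, the whole subtree is accounted to its nearest enabled ancestor as a single entity. Shown only as a hedged illustration of the shape of the knob; none of this existed at the time of the discussion, and paths are illustrative.

    # Illustrative v2-style per-subtree controller enablement (postdates this thread).
    with open("/sys/fs/cgroup/jobs/cgroup.subtree_control", "w") as f:
        f.write("+memory +io")     # children of /jobs get memory and io knobs
    with open("/sys/fs/cgroup/jobs/batch/cgroup.subtree_control", "w") as f:
        f.write("-io")             # below batch/, io treats the subtree as one entity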
Re: cgroup: status-quo and userland efforts
On Wed, Jun 26, 2013 at 2:20 PM, Tejun Heo wrote: > Hello, Tim. > > On Mon, Jun 24, 2013 at 09:07:47PM -0700, Tim Hockin wrote: >> I really want to understand why this is SO IMPORTANT that you have to >> break userspace compatibility? I mean, isn't Linux supposed to be the >> OS with the stable kernel interface? I've seen Linus rant time and >> time again about this - why is it OK now? > > What the hell are you talking about? Nobody is breaking userland > interface. A new version of interface is being phased in and the old The first assertion, as I understood, was that (eventually) cgroupfs will not allow split hierarchies - that unified hierarchy would be the only mode. Is that not the case? The second assertion, as I understood, was that (eventually) cgroupfs would not support granting access to some cgroup control files to users (through chown/chmod). Is that not the case? > one will stay there for the foreseeable future. It will be phased out > eventually but that's gonna take a long time and it will have to be > something hardly noticeable. Of course new features will only be > available with the new interface and there will be efforts to nudge > people away from the old one but the existing interface will keep > working it does. Hmm, so what exactly is changing then? If, as you say here, the existing interfaces will keep working - what is changing? >> Examples? we obviously don't grant full access, but our kernel gang >> and security gang seem to trust the bits we're enabling well enough... > > Then the security gang doesn't have any clue what's going on, or at > least operating on very different assumptions (ie. the workloads are > trusted by default). You can OOM the whole kernel by creating many > cgroups, completely mess up controllers by creating deep hierarchies, > affect your siblings by adjusting your weight and so on. It's really > easy to DoS the whole system if you have write access to a cgroup > directory. As I said, it's controlled delegated access. And we have some patches that we carry to prevent some of these DoS situations. >> The non-DTF jobs have a combined share that is small but non-trivial. >> If we cut that share in half, giving one slice to prod and one slice >> to batch, we get bad sharing under contention. We tried this. We > > Why is that tho? It *should* work fine and I can't think of a reason > why that would behave particularly badly off the top of my head. > Maybe I forgot too much of the iosched modification used in google. > Anyways, if there's a problem, that should be fixable, right? And > controller-specific issues like that should really dictate the > architectural design too much. I actually can not speak to the details of the default IO problem, as it happened before I really got involved. But just think through it. If one half of the split has 5 processes running and the other half has 200, the processes in the 200 set each get FAR less spindle time than those in the 5 set. That is NOT the semantic we need. We're trying to offer ~equal access for users of the non-DTF class of jobs. This is not the tail doing the wagging. This is your assertion that something should work, when it just doesn't. We have two, totally orthogonal classes of applications on two totally disjoint sets of resources. Conjoining them is the wrong answer. >> could add control loops in userspace code which try to balance the >> shares in proportion to the load. We did that with CPU, and it's sort > > Yeah, that is horrible. 
Yeah, I would love to explain some of the really nasty things we have done and are moving away from. I am not sure I am allowed to, though :) >> of horrible. We're moving AWAY from all this craziness in favor of >> well-defined hierarchical behaviors. > > But I don't follow the conclusion here. For short term workaround, > sure, but having that dictate the whole architecture decision seems > completely backwards to me. My point is that the orthogonality of resources is intrinsic. Letting "it's hard to make it work" dictate the architecture is what's backwards. >> It's a bit naive to think that this is some absolute truth, don't you >> think? It just isn't so. You should know better than most what >> craziness our users do, and what (legit) rationales they can produce. >> I have $large_number of machines running $huge_number of jobs from >> thousands of developers running for years upon years backing up my >> worldview. > > If so, you aren't communicating it very well. I've talked with quite > a few people about multiple orthogonal hierarchies including people > inside google. Sure, some are using it as it is there but I couldn't > find strong enough rationale to continue that way given
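The arithmetic behind the 5-versus-200 objection is worth spelling out: with group-based fairness and two equal-weight siblings, per-process service diverges by the ratio of the group sizes. A toy calculation using the numbers from the example above:

    # Two sibling groups with equal weight; fairness is applied per group.
    group_share = 0.5
    per_task_small = group_share / 5       # each of the 5 tasks: 10% of the disk
    per_task_large = group_share / 200     # each of the 200 tasks: 0.25% of the disk
    print(per_task_small / per_task_large) # 40x gap, not the ~equal access wanted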
Re: cgroup: status-quo and userland efforts
On Mon, Jun 24, 2013 at 5:01 PM, Tejun Heo wrote: > Hello, Tim. > > On Sat, Jun 22, 2013 at 04:13:41PM -0700, Tim Hockin wrote: >> I'm very sorry I let this fall off my plate. I was pointed at a >> systemd-devel message indicating that this is done. Is it so? It > > It's progressing pretty fast. > >> seems so completely ass-backwards to me. Below is one of our use-cases >> that I just don't see how we can reproduce in a single-heierarchy. > > Configurations which depend on orthogonal multiple hierarchies of > course won't be replicated under unified hierarchy. It's unfortunate > but those just have to go. More on this later. I really want to understand why this is SO IMPORTANT that you have to break userspace compatibility? I mean, isn't Linux supposed to be the OS with the stable kernel interface? I've seen Linus rant time and time again about this - why is it OK now? >> We're also long into the model that users can control their own >> sub-cgroups (moderated by permissions decided by admin SW up front). > > If you're in control of the base system, nothing prevents you from > doing so. It's utterly broken security and policy-enforcement point > of view but if you can trust each software running on your system to > do the right thing, it's gonna be fine. Examples? we obviously don't grant full access, but our kernel gang and security gang seem to trust the bits we're enabling well enough... >> This gives us 4 combinations: >> 1) { production, DTF } >> 2) { production, non-DTF } >> 3) { batch, DTF } >> 4) { batch non-DTF } >> >> Of these, (3) is sort of nonsense, but the others are actually used >> and needed. This is only >> possible because of split hierarchies. In fact, we undertook a very painful >> process to move from a unified cgroup hierarchy to split hierarchies in large >> part _because of_ these examples. > > You can create three sibling cgroups and configure cpuset and blkio > accordingly. For cpuset, the setup wouldn't make any different. For > blkio, the two non-DTFs would now belong to different cgroups and > compete with each other as two groups, which won't matter at all as > non-DTFs are given what's left over after serving DTFs anyway, IIRC. The non-DTF jobs have a combined share that is small but non-trivial. If we cut that share in half, giving one slice to prod and one slice to batch, we get bad sharing under contention. We tried this. We could add control loops in userspace code which try to balance the shares in proportion to the load. We did that with CPU, and it's sort of horrible. We're moving AWAY from all this craziness in favor of well-defined hierarchical behaviors. >> Making cgroups composable allows us to build a higher level abstraction that >> is very powerful and flexible. Moving back to unified hierarchies goes >> against everything that we're doing here, and will cause us REAL pain. > > Categorizing processes into hierarchical groups of tasks is a > fundamental idea and a fundamental idea is something to base things on > top of as it's something people can agree upon relatively easily and > establish a structure by. I'd go as far as saying that it's the > failure on the part of workload design if they in general can't be > categorized hierarchically. It's a bit naive to think that this is some absolute truth, don't you think? It just isn't so. You should know better than most what craziness our users do, and what (legit) rationales they can produce. 
I have $large_number of machines running $huge_number of jobs from thousands of developers running for years upon years backing up my worldview. > Even at the practical level, the orthogonal hierarchy encouraged, at > the very least, the blkcg writeback support which can't be upstreamed > in any reasonable manner because it is impossible to say that a > resource can't be said to belong to a cgroup irrespective of who's > looking at it. I'm not sure I really grok that statement. I'm OK with defining new rules that bring some order to the chaos. Give us new rules to live by. All-or-nothing would be fine. What if mounting cgroupfs gives me N sub-dirs, one for each compiled-in controller? You could make THAT the mount option - you can have either a unified hierarchy of all controllers or fully disjoint hierarchies. Or some other rule. > It's something fundamentally broken and I have very difficult time > believing google's workload is so different that it can't be > categorized in a single hierarchy for the purpose of resource > distribution. I'm sure there are cases where some compromises are > necessary but the laternative is much worse here. As I wrote multiple > times now, multiple orthogonal hierarchy support is gonna be around > for some t
Re: cgroup: status-quo and userland efforts
I'm very sorry I let this fall off my plate. I was pointed at a systemd-devel message indicating that this is done. Is it so? It seems so completely ass-backwards to me. Below is one of our use-cases that I just don't see how we can reproduce in a single-heierarchy. We're also long into the model that users can control their own sub-cgroups (moderated by permissions decided by admin SW up front). We have classes of jobs which can run together on shared machines. This is VERY important to us, and is a key part of how we run things. Over the years we have evolved from very little isolation to fairly strong isolation, and cgroups are a large part of that. We have experienced and adapted to a number of problems around isolation over time. I won't go into the history of all of these, because it's not so relevant, but here is how we set things up today. >From a CPU perspective, we have two classes of jobs: production and batch. Production jobs can (but don't always) ask for exclusive cores, which ensures that no batch work runs on those CPUs. We manage this with the cpuset cgroup. Batch jobs are relegated to the set of CPUs that are "left-over" after exclusivity rules are applied. This is implemented with a shared subdirectory of the cpuset cgroup called "batch". Production jobs get their own subdirectories under cpuset. >From an IO perspective we also have two classes of jobs: normal and DTF-approved. Normal jobs do not get strong isolation for IO, whereas DTF-enabled jobs do. The vast majority of jobs are NOT DTF-enabled, and they share a nominal amount of IO bandwidth. This is implemented with a shared subdirectory of the io cgroup called "default". Jobs that are DTF-enabled get their own subdirectories under IO. This gives us 4 combinations: 1) { production, DTF } 2) { production, non-DTF } 3) { batch, DTF } 4) { batch non-DTF } Of these, (3) is sort of nonsense, but the others are actually used and needed. This is only possible because of split hierarchies. In fact, we undertook a very painful process to move from a unified cgroup hierarchy to split hierarchies in large part _because of_ these examples. And for more fun, I am simplifying this all. Batch jobs are actually bound to NUMA-node specific cpuset cgroups when possible. And we have a similar concept for the cpu cgroup as for cpuset. And we have a third tier of IO jobs. We don't do all of this for fun - it is in direct response to REAL problems we have experienced. Making cgroups composable allows us to build a higher level abstraction that is very powerful and flexible. Moving back to unified hierarchies goes against everything that we're doing here, and will cause us REAL pain. On Mon, Apr 22, 2013 at 3:33 PM, Tim Hockin wrote: > On Mon, Apr 22, 2013 at 11:41 PM, Tejun Heo wrote: >> Hello, Tim. >> >> On Mon, Apr 22, 2013 at 11:26:48PM +0200, Tim Hockin wrote: >>> We absolutely depend on the ability to split cgroup hierarchies. It >>> pretty much saved our fleet from imploding, in a way that a unified >>> hierarchy just could not do. A mandated unified hierarchy is madness. >>> Please step away from the ledge. >> >> You need to be a lot more specific about why unified hierarchy can't >> be implemented. The last time I asked around blk/memcg people in >> google, while they said that they'll need different levels of >> granularities for different controllers, google's use of cgroup >> doesn't require multiple orthogonal classifications of the same group >> of tasks. > > I'll pull some concrete examples together. 
I don't have them on hand, > and I am out of country this week. I have looped in the gang at work > (though some are here with me). > >> Also, cgroup isn't dropping multiple hierarchy support over-night. >> What has been working till now will continue to work for very long >> time. If there is no fundamental conflict with the future changes, >> there should be enough time to migrate gradually as desired. >> >>> More, going towards a unified hierarchy really limits what we can >>> delegate, and that is the word of the day. We've got a central >>> authority agent running which manages cgroups, and we want out of this >>> business. At least, we want to be able to grant users a set of >>> constraints, and then let them run wild within those constraints. >>> Forcing all such work to go through a daemon has proven to be very >>> problematic, and it has been great now that users can have DIY >>> sub-cgroups. >> >> Sorry, but that doesn't work properly now. It gives you the illusion >> of proper delegation but it's inherently dangerous. If that sort of >> illusion has been / is good enough for your setup, fine. Delegate at &g
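A sketch of the layout described above helps make the orthogonality concrete: cpuset answers "production or batch" while blkio answers "DTF or default", and a job is attached to one directory in each hierarchy. Controller names, paths and the helper below are illustrative, not the actual production tooling.

    # Split-hierarchy placement: the two classifications vary independently.
    import os

    CG = "/sys/fs/cgroup"

    def place(job, production, dtf):
        cpuset = os.path.join(CG, "cpuset", job if production else "batch")
        blkio  = os.path.join(CG, "blkio",  job if dtf else "default")
        for path in (cpuset, blkio):
            os.makedirs(path, exist_ok=True)
        return cpuset, blkio            # the job's tasks get written into both

    place("websearch",  production=True,  dtf=True)    # { production, DTF }
    place("logs-batch", production=False, dtf=False)   # { batch, non-DTF }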
Re: cgroup: status-quo and userland efforts
On Mon, Apr 22, 2013 at 11:41 PM, Tejun Heo wrote: > Hello, Tim. > > On Mon, Apr 22, 2013 at 11:26:48PM +0200, Tim Hockin wrote: >> We absolutely depend on the ability to split cgroup hierarchies. It >> pretty much saved our fleet from imploding, in a way that a unified >> hierarchy just could not do. A mandated unified hierarchy is madness. >> Please step away from the ledge. > > You need to be a lot more specific about why unified hierarchy can't > be implemented. The last time I asked around blk/memcg people in > google, while they said that they'll need different levels of > granularities for different controllers, google's use of cgroup > doesn't require multiple orthogonal classifications of the same group > of tasks. I'll pull some concrete examples together. I don't have them on hand, and I am out of country this week. I have looped in the gang at work (though some are here with me). > Also, cgroup isn't dropping multiple hierarchy support over-night. > What has been working till now will continue to work for very long > time. If there is no fundamental conflict with the future changes, > there should be enough time to migrate gradually as desired. > >> More, going towards a unified hierarchy really limits what we can >> delegate, and that is the word of the day. We've got a central >> authority agent running which manages cgroups, and we want out of this >> business. At least, we want to be able to grant users a set of >> constraints, and then let them run wild within those constraints. >> Forcing all such work to go through a daemon has proven to be very >> problematic, and it has been great now that users can have DIY >> sub-cgroups. > > Sorry, but that doesn't work properly now. It gives you the illusion > of proper delegation but it's inherently dangerous. If that sort of > illusion has been / is good enough for your setup, fine. Delegate at > your own risks, but cgroup in itself doesn't support delegation to > lesser security domains and it won't in the foreseeable future. We've had great success letting users create sub-cgroups in a few specific controller types (cpu, cpuacct, memory). This is, of course, with some restrictions. We do not just give them blanket access to all knobs. We don't need ALL cgroups, just the important ones. For a simple example, letting users create sub-groups in freezer or job (we have a job group that we've been carrying) lets them launch sub-tasks and manage them in a very clean way. We've been doing a LOT of development internally to make user-defined sub-memcgs work in our cluster scheduling system, and it's made some of our biggest, more insane users very happy. And for some cgroups, like cpuset, hierarchy just doesn't really make sense to me. I just don't care if that never works, though I have no problem with others wanting it. :) Aside: if the last CPU in your cpuset goes offline, you should go into a state akin to freezer. Running on any other CPU is an overt violation of policy that the user, or worse - the admin, set up. Just my 2cents. >> Strong disagreement, here. We use split hierarchies to great effect. >> Containment should be composable. If your users or abstractions can't >> handle it, please feel free to co-mount the universe, but please >> PLEASE don't force us to. >> >> I'm happy to talk more about what we do and why. > > Please do so. Why do you need multiple orthogonal hierarchies? Look for this in the next few days/weeks. 
From our point of view, cgroups are the ideal match for how we want to manage things (no surprise, really, since Mr. Menage worked on both). Tim -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
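To make the delegation model above concrete: a minimal sketch of handing a capped sub-cgroup to an unprivileged user, assuming a v1-style memory controller mounted at /sys/fs/cgroup/memory; the group name, uid, and limit are illustrative, and this is not the actual tooling described in the thread.

/* Sketch: a privileged manager creates a memory sub-cgroup, caps it, and
 * delegates it by chowning the directory and the tasks file, after which
 * the user can mkdir their own sub-groups under it. Paths/uid illustrative. */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

static void write_file(const char *path, const char *val)
{
    int fd = open(path, O_WRONLY);
    if (fd < 0 || write(fd, val, strlen(val)) < 0) {
        perror(path);
        exit(1);
    }
    close(fd);
}

int main(void)
{
    const char *grp = "/sys/fs/cgroup/memory/user1";   /* assumed v1 mount */

    if (mkdir(grp, 0755) && errno != EEXIST) {
        perror("mkdir");
        return 1;
    }

    /* The manager sets the outer constraint for this user... */
    write_file("/sys/fs/cgroup/memory/user1/memory.limit_in_bytes",
               "1073741824");                          /* 1 GiB budget */

    /* ...then delegates: once the user owns the directory and the tasks
     * file, they can create and manage their own sub-groups within it. */
    if (chown(grp, 1000, 1000) ||
        chown("/sys/fs/cgroup/memory/user1/tasks", 1000, 1000)) {
        perror("chown");
        return 1;
    }
    return 0;
}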
Re: cgroup: status-quo and userland efforts
Hi Tejun, This email worries me. A lot. It sounds very much like retrograde motion from our (Google's) point of view. We absolutely depend on the ability to split cgroup hierarchies. It pretty much saved our fleet from imploding, in a way that a unified hierarchy just could not do. A mandated unified hierarchy is madness. Please step away from the ledge. More, going towards a unified hierarchy really limits what we can delegate, and that is the word of the day. We've got a central authority agent running which manages cgroups, and we want out of this business. At least, we want to be able to grant users a set of constraints, and then let them run wild within those constraints. Forcing all such work to go through a daemon has proven to be very problematic, and it has been great now that users can have DIY sub-cgroups. berra...@redhat.com said, downthread: > We ultimately do need the ability to delegate hierarchy creation to > unprivileged users / programs, in order to allow containerized OS to > have the ability to use cgroups. Requiring any applications inside a > container to talk to a cgroups "authority" existing on the host OS is > not a satisfactory architecture. We need to allow for a container to > be self-contained in its usage of cgroups. This! A thousand times, this! > At the same time, we don't need/want to give them unrestricted ability > to create arbitarily complex hiearchies - we need some limits on it > to avoid them exposing pathelogically bad kernel behaviour. > > This could be as simple as saying that each cgroup controller directory > has a tunable "cgroups.max_children" and/or "cgroups.max_depth" which > allow limits to be placed when delegating administration of part of a >cgroups tree to an unprivileged user. We've been bitten by this, and more limitations would be great. We've got some less-than-perfect patches that impose limits for us now. > I've no disagreement that we need a unified hiearchy. The workman > app explicitly does /not/ expose the concept of differing hiearchies > per controller. Likewise libvirt will not allow the user to configure > non-unified hiearchies. Strong disagreement, here. We use split hierarchies to great effect. Containment should be composable. If your users or abstractions can't handle it, please feel free to co-mount the universe, but please PLEASE don't force us to. I'm happy to talk more about what we do and why. Tim On Sat, Apr 6, 2013 at 3:21 AM, Tejun Heo wrote: > Hello, guys. > > Status-quo > == > > It's been about a year since I wrote up a summary on cgroup status quo > and future plans. We're not there yet but much closer than we were > before. At least the locking and object life-time management aren't > crazy anymore and most controllers now support proper hierarchy > although not all of them agree on how to treat inheritance. > > IIRC, the yet-to-be-converted ones are blk-throttle and perf. cpu > needs to be updated so that it at least supports a similar mechanism > as cfq-iosched for configuring ratio between tasks on an internal > cgroup and its children. Also, we really should update how cpuset > handles a cgroup becoming empty (no cpus or memory node left due to > hot-unplug). It currently transfers all its tasks to the nearest > ancestor with executing resources, which is an irreversible process > which would affect all other co-mounted controllers. 
We probably want > it to just take on the masks of the ancestor until its own executing > resources become online again, and the new behavior should be gated > behind a switch (Li, can you please look into this?). > > While we have still ways to go, I feel relatively confident saying > that we aren't too far out now, well, except for the writeback mess > that still needs to be tackled. Anyways, once the remaining bits are > settled, we can proceed to implement the unified hierarchy mode I've > been talking about forever. I can't think of any fundamental > roadblocks at the moment but who knows? The devil usually is in the > details. Let's hope it goes okay. > > So, while we aren't moving as fast as we wish we were, the kernel side > of things are falling into places. At least, that's how I see it. > From now on, I think how to make it actually useable to userland > deserves a bit more focus, and by "useable to userland", I don't mean > some group hacking up an elaborate, manual configuration which is > tailored to the point of being eccentric to suit the needs of the said > group. There's nothing wrong with that and they can continue to do > so, but it just isn't generically useable or useful. It should be > possible to generically and automatically split resources among, say, > several servers and a couple users sharing a system without resorting > to indecipherable ad-hoc shell script running off rc.local. > > > Userland efforts > > > There are currently a few userland efforts trying to make interfacing > with cgroup
Re: [PATCH 00/10] cgroups: Task counter subsystem v8
On Mon, Apr 1, 2013 at 4:18 PM, Tejun Heo wrote: > Hello, > > On Mon, Apr 01, 2013 at 03:57:46PM -0700, Tim Hockin wrote: >> I am not limited by kernel memory, I am limited by PIDs, and I need to >> be able to manage them. memory.kmem.usage_in_bytes seems to be far >> too noisy to be useful for this purpose. It may work fine for "just >> stop a fork bomb" but not for any sort of finer-grained control. > > So, why are you limited by PIDs other than the arcane / weird > limitation that you have whereever that limitation is? Does anyone anywhere actually set PID_MAX > 64K? As far as I can tell, distros default it to 32K or 64K because there's a lot of stuff out there that assumes this to be true. This is the problem we have - deep down in the bowels of code that is taking literally years to overhaul, we have identified a bad assumption that PIDs are always 5 characters long. I can't fix it any faster. That said, we also identified other software that make similar assumptions, though they are less critical to us. >> > If you think you can tilt it the other way, please feel free to try. >> >> Just because others caved, doesn't make it less of a hack. And I will >> cave, too, because I don't have time to bang my head against a wall, >> especially when I can see the remnants of other people who have tried. >> >> We'll work around it, or we'll hack around it, or we'll carry this >> patch in our own tree and just grumble about ridiculous hacks every >> time we have to forward port it. >> >> I was just hoping that things had worked themselves out in the last year. > > It's kinda weird getting this response, as I don't think it has been > particularly walley. The arguments were pretty sound from what I > recall and Frederic's use case was actually better covered by kmemcg, > so where's the said wall? And I asked you why your use case is > different and the only reason you gave me is some arbitrary PID > limitation on whatever thing you're using, which you gotta agree is a > pretty hard sell. So, if you think you have a valid case, please just > explain it. Why go passive agressive on it? If you don't have a > valid case for pushing it, yes, you'll have to hack around it - carry > the patches in your tree, whatever, or better, fix the weird PID > problem. Sorry Tejun, you're being very reasonable, I was not. The history of this patch is what makes me frustrated. It seems like such an obvious thing to support that it blows my mind that people argue it. You know our environment. Users can use their memory budgets however they like - kernel or userspace. We have good accounting, but we are PID limited. We've even implemented some hacks of our own to make that hurt less because the previously-mentioned assumptions are just NOT going away any time soon. I literally have user bugs every week on this. Hopefully the hacks we have put in place will make the users stop hurting. But we're left with some residual problems, some of which are because the only limits we can apply are per-user rather than per-container. >From our POV building the cluster, cgroups are strictly superior to most other control interfaces because they work at the same granularity that we do. I want more things to support cgroup control. This particular one was double-tasty because the ability to set the limit to 0 would actually solve a different problem we have in teardown. But as I said, we can mostly work around that. 
So I am frustrated because I don't think my use case will convince you (at the root of it, it is a problem of our own making, but it LONG predates me), despite my belief that it is obviously a good feature. I find myself hoping that someone else comes along and says "me too" rather than using a totally different hack for this. Oh well. Thanks for the update. Off to do our own thing again. > -- > tejun -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
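For readers wondering how a "PIDs are 5 characters" assumption actually breaks things, here is a contrived sketch of the class of bug being described; the buffer size and key format are invented for illustration and are not taken from any real code base. The machine-wide ceiling itself is just the kernel.pid_max sysctl (/proc/sys/kernel/pid_max), whose common default of 32768 is what keeps this kind of code "working".

/* Contrived example of the "PIDs are always 5 characters" assumption.
 * With pid_max at its usual 32768 this never trips; raise pid_max past
 * 99999 and the fixed-width key silently truncates (123456 -> "12345"). */
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
    char key[6];                 /* 5 digits + NUL: the bad assumption */
    pid_t pid = getpid();

    int n = snprintf(key, sizeof(key), "%d", (int)pid);
    if (n >= (int)sizeof(key))
        fprintf(stderr, "pid %d does not fit in a 5-char key\n", (int)pid);
    else
        printf("key=%s\n", key);
    return 0;
}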
Re: [PATCH 00/10] cgroups: Task counter subsystem v8
On Mon, Apr 1, 2013 at 3:35 PM, Tejun Heo wrote: > Hey, > > On Mon, Apr 01, 2013 at 03:20:47PM -0700, Tim Hockin wrote: >> > U so that's why you guys can't use kernel memory limit? :( >> >> Because it is completely non-obvious how to map between the two in a >> way that is safe across kernel versions and not likely to blow up in >> our faces. It's a hack, in other words. > > Now we're repeating the argument Frederic and Johannes had, so I'd > suggest going back the thread and reading the discussion and if you > still think using kmemcg is a bad idea, please explain why that is so. > For the specific point that you just raised, the scale tilted toward > thread/process count is a hacky and unreliable representation of > kernel memory resource than the other way around, at least back then. I am not limited by kernel memory, I am limited by PIDs, and I need to be able to manage them. memory.kmem.usage_in_bytes seems to be far too noisy to be useful for this purpose. It may work fine for "just stop a fork bomb" but not for any sort of finer-grained control. > If you think you can tilt it the other way, please feel free to try. Just because others caved, doesn't make it less of a hack. And I will cave, too, because I don't have time to bang my head against a wall, especially when I can see the remnants of other people who have tried. We'll work around it, or we'll hack around it, or we'll carry this patch in our own tree and just grumble about ridiculous hacks every time we have to forward port it. I was just hoping that things had worked themselves out in the last year. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
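To make the "too noisy" complaint concrete, here is a sketch of what using kmemcg as a stand-in for a task limit looks like: read the group's kernel-memory counters and divide by a guessed per-task kernel cost. The cgroup path is assumed, and the 16 KiB per-task figure is pure guesswork, which is the point: dentries, sockets, and every other kernel allocation the job makes land in the same counter, so the derived "task headroom" number moves with I/O and fd churn as much as with task count.

/* Sketch: estimating task headroom from cgroup v1 kmem accounting
 * (memory.kmem.*). Path and per-task cost estimate are assumptions. */
#include <stdio.h>
#include <stdlib.h>

#define CG "/sys/fs/cgroup/memory/user1/"
#define EST_BYTES_PER_TASK (16 * 1024)    /* rough guess: the "hack" part */

static long long read_ll(const char *path)
{
    long long v = -1;
    FILE *f = fopen(path, "r");
    if (!f || fscanf(f, "%lld", &v) != 1)
        perror(path);
    if (f)
        fclose(f);
    return v;
}

int main(void)
{
    long long usage = read_ll(CG "memory.kmem.usage_in_bytes");
    long long limit = read_ll(CG "memory.kmem.limit_in_bytes");

    if (usage < 0 || limit < 0)
        return 1;

    /* All kernel allocations share this counter, so the result is only a
     * loose upper bound on "how many more tasks fit", and it fluctuates. */
    printf("estimated task headroom: %lld\n",
           (limit - usage) / EST_BYTES_PER_TASK);
    return 0;
}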
Re: [PATCH 00/10] cgroups: Task counter subsystem v8
On Mon, Apr 1, 2013 at 3:03 PM, Tejun Heo wrote: > Hello, Tim. > > On Mon, Apr 01, 2013 at 02:02:06PM -0700, Tim Hockin wrote: >> We run dozens of jobs from dozens users on a single machine. We >> regularly experience users who leak threads, running into the tens of >> thousands. We are unable to raise the PID_MAX significantly due to >> some bad, but really thoroughly baked-in decisions that were made a >> long time ago. What we experience on a daily basis is users > > U so that's why you guys can't use kernel memory limit? :( Because it is completely non-obvious how to map between the two in a way that is safe across kernel versions and not likely to blow up in our faces. It's a hack, in other words. >> complaining about getting a "pthread_create(): resource unavailable" >> error because someone on the machine has leaked. > ... >> What I really don't understand is why so much push back? We have this >> nicely structured cgroup system. Each cgroup controller's code is >> pretty well partitioned - why would we not want more complete >> functionality built around it? We accept device drivers for the most >> random, useless crap on the assertion that "if you don't need it, >> don't compile it in". I can think of a half dozen more really useful, >> cool things we can do with cgroups, but I know the pushback will be >> tremendous, and I just don't grok why. > > In general, because it adds to maintenance overhead. e.g. We've been > trying to make all cgroups follow consistent nesting rules. We're now > almost there with a couple controllers left. This one would have been > another thing to do, which is fine if it's necessary but if it isn't > we're just adding up work for no good reason. > > More importantly, because cgroup is already plagued with so many bad > design decisions - some from core design decisions - e.g. not being > able to actually identify a resource outside of a context of a task. > Others are added on by each controller going out doing whatever it > wants without thinking about how the whole thing would come together > afterwards - e.g. double accounting between cpu and cpuacct, > completely illogical and unusable hierarchy implementations in > anything other than cpu controllers (they're getting better), and so > on. Right now it's in a state where there's not many things coherent > about it. Sure, every controller and feature supports the ones their > makers intended them to but when collected together it's just a mess, > which is one of the common complaints against cgroup. > > So, no free-for-all, please. > > Thanks. > > -- > tejun -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 00/10] cgroups: Task counter subsystem v8
On Mon, Apr 1, 2013 at 1:29 PM, Tejun Heo wrote: > On Mon, Apr 01, 2013 at 01:09:09PM -0700, Tim Hockin wrote: >> Pardon my ignorance, but... what? Use kernel memory limits as a proxy >> for process/thread counts? That sounds terrible - I hope I am > > Well, the argument was that process / thread counts were a poor and > unnecessary proxy for kernel memory consumption limit. IIRC, Johannes > put it as (I'm paraphrasing) "you can't go to Fry's and buy 4k thread > worth of component". > >> misunderstanding? This task counter patch had several properties that >> mapped very well to what we want. >> >> Is it dead in the water? > > After some discussion, Frederic agreed that at least his use case can > be served well by kmemcg, maybe even better - IIRC it was container > fork bomb scenario, so you'll have to argue your way in why kmemcg > isn't a suitable solution for your use case if you wanna revive this. We run dozens of jobs from dozens users on a single machine. We regularly experience users who leak threads, running into the tens of thousands. We are unable to raise the PID_MAX significantly due to some bad, but really thoroughly baked-in decisions that were made a long time ago. What we experience on a daily basis is users complaining about getting a "pthread_create(): resource unavailable" error because someone on the machine has leaked. Today we use RLIMIT_NPROC to lock most users down to a smaller max. But this is a per-user setting, not a per-container setting, and users do not control where their jobs land. Scheduling decisions often put multiple thread-heavy but non-leaking jobs from one user onto the same machine, which again causes problems. Further, it does not help for some of our use cases where a logical job can run as multiple UIDs for different processes within. >From the end-user point of view this is an isolation leak which is totally non-deterministic for them. They can not know what to plan for. Getting cgroup-level control of this limit is important for a saner SLA for our users. In addition, the behavior around locking-out new tasks seems like a nice way to simplify and clean up end-life work for the administrative system. Admittedly, we can mostly work around this with freezer instead. What I really don't understand is why so much push back? We have this nicely structured cgroup system. Each cgroup controller's code is pretty well partitioned - why would we not want more complete functionality built around it? We accept device drivers for the most random, useless crap on the assertion that "if you don't need it, don't compile it in". I can think of a half dozen more really useful, cool things we can do with cgroups, but I know the pushback will be tremendous, and I just don't grok why. Tim -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
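For reference, the per-user workaround mentioned above is the plain setrlimit() interface; a minimal sketch follows, with an illustrative cap. The limitation Tim is describing is visible in the API itself: RLIMIT_NPROC is accounted per real user ID across the whole machine, so two well-behaved jobs from the same user on one host still draw from a single budget, and a job that spans several UIDs is not covered at all.

/* Sketch: capping a user's process/thread count with RLIMIT_NPROC.
 * Per-UID and machine-wide, not per-cgroup, which is the complaint above. */
#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NPROC, &rl)) {
        perror("getrlimit");
        return 1;
    }
    printf("NPROC soft=%llu hard=%llu\n",
           (unsigned long long)rl.rlim_cur,
           (unsigned long long)rl.rlim_max);

    rl.rlim_cur = 4096;    /* illustrative cap; must not exceed rl.rlim_max */
    if (setrlimit(RLIMIT_NPROC, &rl)) {
        perror("setrlimit");
        return 1;
    }
    return 0;
}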
Re: [PATCH 00/10] cgroups: Task counter subsystem v8
On Mon, Apr 1, 2013 at 11:46 AM, Tejun Heo wrote: > On Mon, Apr 01, 2013 at 11:43:03AM -0700, Tim Hockin wrote: >> A year later - what ever happened with this? I want it more than ever >> for Google's use. > > I think the conclusion was "use kmemcg instead". Pardon my ignorance, but... what? Use kernel memory limits as a proxy for process/thread counts? That sounds terrible - I hope I am misunderstanding? This task counter patch had several properties that mapped very well to what we want. Is it dead in the water? -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 00/10] cgroups: Task counter subsystem v8
A year later - what ever happened with this? I want it more than ever for Google's use. On Tue, Jan 31, 2012 at 7:37 PM, Frederic Weisbecker wrote: > Hi, > > Changes In this version: > > - Split 32/64 bits version of res_counter_write_u64() [1/10] > Courtesy of Kirill A. Shutemov > > - Added Kirill's ack [8/10] > > - Added selftests [9/10], [10/10] > > Please consider for merging. At least two users want this feature: > https://lkml.org/lkml/2011/12/13/309 > https://lkml.org/lkml/2011/12/13/364 > > More general details provided in the last version posting: > https://lkml.org/lkml/2012/1/13/230 > > Thanks! > > > Frederic Weisbecker (9): > cgroups: add res_counter_write_u64() API > cgroups: new resource counter inheritance API > cgroups: ability to stop res charge propagation on bounded ancestor > res_counter: allow charge failure pointer to be null > cgroups: pull up res counter charge failure interpretation to caller > cgroups: allow subsystems to cancel a fork > cgroups: Add a task counter subsystem > selftests: Enter each directories before executing selftests > selftests: Add a new task counter selftest > > Kirill A. Shutemov (1): > cgroups: add res counter common ancestor searching > > Documentation/cgroups/resource_counter.txt | 20 ++- > Documentation/cgroups/task_counter.txt | 153 +++ > include/linux/cgroup.h | 20 +- > include/linux/cgroup_subsys.h |5 + > include/linux/res_counter.h| 27 ++- > init/Kconfig |9 + > kernel/Makefile|1 + > kernel/cgroup.c| 23 ++- > kernel/cgroup_freezer.c|6 +- > kernel/cgroup_task_counter.c | 272 > > kernel/exit.c |2 +- > kernel/fork.c |7 +- > kernel/res_counter.c | 103 +++- > tools/testing/selftests/Makefile |2 +- > tools/testing/selftests/run_tests |6 +- > tools/testing/selftests/task_counter/Makefile |8 + > tools/testing/selftests/task_counter/fork.c| 40 +++ > tools/testing/selftests/task_counter/forkbomb.c| 40 +++ > tools/testing/selftests/task_counter/multithread.c | 68 + > tools/testing/selftests/task_counter/run_test | 198 ++ > .../selftests/task_counter/spread_thread_group.c | 82 ++ > 21 files changed, 1056 insertions(+), 36 deletions(-) > create mode 100644 Documentation/cgroups/task_counter.txt > create mode 100644 kernel/cgroup_task_counter.c > create mode 100644 tools/testing/selftests/task_counter/Makefile > create mode 100644 tools/testing/selftests/task_counter/fork.c > create mode 100644 tools/testing/selftests/task_counter/forkbomb.c > create mode 100644 tools/testing/selftests/task_counter/multithread.c > create mode 100755 tools/testing/selftests/task_counter/run_test > create mode 100644 tools/testing/selftests/task_counter/spread_thread_group.c > > -- > 1.7.5.4 > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
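For readers who have not read the series, the intended interface is a per-cgroup task counter with a settable ceiling, where writing 0 also fences off new forks (the teardown use mentioned earlier in the thread). A sketch of that usage pattern follows; the mount point and the tasks.limit knob name are assumptions patterned on the series' description and should be treated as illustrative, since the subsystem was never merged.

/* Sketch of the proposed per-cgroup task limit: cap a job's task count,
 * then drop the cap to 0 during teardown so nothing can fork its way back
 * in while the manager reaps the group. Paths and knob names illustrative. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define JOB "/sys/fs/cgroup/tasks/job42/"      /* assumed mount and layout */

static int write_str(const char *path, const char *val)
{
    int fd = open(path, O_WRONLY);
    if (fd < 0) {
        perror(path);
        return -1;
    }
    if (write(fd, val, strlen(val)) < 0) {
        perror(path);
        close(fd);
        return -1;
    }
    return close(fd);
}

int main(void)
{
    /* Normal operation: this job may hold at most 1000 tasks. */
    write_str(JOB "tasks.limit", "1000");

    /* Teardown: forbid any further fork() in the group, then kill and reap. */
    write_str(JOB "tasks.limit", "0");
    return 0;
}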
Re: kernel panic on resume from S3 - stumped
On Sun, Dec 30, 2012 at 2:55 PM, Rafael J. Wysocki wrote: > On Saturday, December 29, 2012 11:17:11 PM Tim Hockin wrote: >> Best guess: >> >> With 'noapic', I see the "irq 5: nobody cared" message on resume, >> along with 1 IRQ5 counts in /proc/interrupts (the devices claiming >> that IRQ are quiescent). >> >> Without 'noapic' that must be triggering something else to go haywire, >> perhaps the AER logic (though that is all MSI, so probably not). I'm >> flying blind on those boots. >> >> I bet that, if I can recall how to re-enable IRQ5, I'll see it >> continuously asserting. Chipset or BIOS bug maybe. I don't know if I >> had AER enabled under Lucid, so that might be the difference. >> >> I'll try a vanilla kernel next, maybe hack on AER a bit, to see if I >> can make it progress. > > I wonder what happens if you simply disable AER for starters? > > There is the pci=noaer kernel command line switch for that. That still panics on resume. Damn. I really think it is down to that interrupt storm at resume. Something somewhere is getting stuck asserting, and we don't know how to EOI it. PIC vs APIC is just changing the operating mode. Now the question is whether I am going to track through Intel errata (more than I have already) and through chipset docs to figure out what it could be, or just leave it at noapic. I've already got one new PCI quirk to code up. > Thanks, > Rafael > > >> On Sat, Dec 29, 2012 at 10:19 PM, Tim Hockin wrote: >> > Quick update: booting with 'noapic' on the commandline seems to make >> > it resume successfully. >> > >> > The main dmesg diffs, other than the obvious "Skipping IOAPIC probe" >> > and IRG number diffs) are: >> > >> > -nr_irqs_gsi: 40 >> > +nr_irqs_gsi: 16 >> > >> > -NR_IRQS:16640 nr_irqs:776 16 >> > +NR_IRQS:16640 nr_irqs:368 16 >> > >> > -system 00:0a: [mem 0xfec0-0xfec00fff] could not be reserved >> > +system 00:0a: [mem 0xfec0-0xfec00fff] has been reserved >> > >> > and a new warning about irq 5: nobody cared (try booting with the >> > "irqpoll" option) >> > >> > I'll see if I can sort out further differences, but I thought it was >> > worth sending this new info along, anyway. >> > >> > It did not require 'noapic' on the Lucid (2.6.32?) kernel >> > >> > >> > On Sat, Dec 29, 2012 at 9:34 PM, Tim Hockin wrote: >> >> Running a suspend with pm_trace set, I get: >> >> >> >> aer :00:03.0:pcie02: hash matches >> >> >> >> I don't know what magic might be needed here, though. >> >> >> >> I guess next step is to try to build a non-distro kernel. >> >> >> >> On Sat, Dec 29, 2012 at 1:57 PM, Rafael J. Wysocki wrote: >> >>> On Saturday, December 29, 2012 12:03:13 PM Tim Hockin wrote: >> >>>> 4 days ago I had Ubuntu Lucid running on this computer. Suspend and >> >>>> resume worked flawlessly every time. >> >>>> >> >>>> Then I upgraded to Ubuntu Precise. >> >>> >> >>> Well, do you use a distro kernel or a kernel.org kernel? >> >>> >> >>>> Suspend seems to work, but resume >> >>>> fails every time. The video never initializes. By the flashing >> >>>> keyboard lights, I guess it's a kernel panic. It fails from the Live >> >>>> CD and from a fresh install. >> >>>> >> >>>> Here is my debug so far. 
>> >>>> >> >>>> Install all updates (3.2 kernel, nouveau driver) >> >>>> Reboot >> >>>> Try suspend = fails >> >>>> >> >>>> Install Ubuntu's linux-generic-lts-quantal (3.5 kernel, nouveau driver) >> >>>> Reboot >> >>>> Try suspend = fails >> >>>> >> >>>> Install nVidia's 304 driver >> >>>> Reboot >> >>>> Try suspend = fails >> >>>> >> >>>> From within X: >> >>>> echo core > /sys/power/pm_test >> >>>> echo mem > /sys/power/state >> >>>> The system acts like it is going to sleep, and then wakes up a few >> >>>> seconds later. dmesg shows: >> >>>> >> >>>> [ 1230.083404] [ cut here ] >> >>>> [ 1230.083410] WARNING: at >> >>>&g
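For anyone retracing this kind of debugging, the knobs exercised in the thread are ordinary sysfs files: /sys/power/pm_test for dry-run suspends, /sys/power/state to actually suspend, and /sys/power/pm_trace to leave a breadcrumb that shows up as "hash matches" in dmesg after the next boot (pm_trace is only present when the kernel is built with it, and it stores its hash in the RTC, so the clock will be wrong after the reboot). A condensed sketch, which must run as root:

/* Sketch of the suspend-debug workflow used in this thread:
 *   dry run:  write a test level to /sys/power/pm_test, then suspend;
 *   real run: clear pm_test, arm pm_trace, suspend, and after the failed
 *             resume plus reboot look for "hash matches" in dmesg. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int poke(const char *path, const char *val)
{
    int fd = open(path, O_WRONLY);
    if (fd < 0) {
        perror(path);
        return -1;
    }
    ssize_t n = write(fd, val, strlen(val));
    if (n < 0)
        perror(path);
    close(fd);
    return n < 0 ? -1 : 0;
}

int main(int argc, char **argv)
{
    int dry_run = (argc > 1 && !strcmp(argv[1], "--dry-run"));

    if (dry_run) {
        /* "core" is the deepest test level; "freezer", "devices", "platform"
         * and "processors" stop at earlier points in the suspend sequence. */
        poke("/sys/power/pm_test", "core");
    } else {
        poke("/sys/power/pm_test", "none");
        poke("/sys/power/pm_trace", "1");   /* breadcrumb for the next boot */
    }
    return poke("/sys/power/state", "mem") ? 1 : 0;   /* suspend to RAM */
}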
Re: kernel panic on resume from S3 - stumped
Best guess: With 'noapic', I see the "irq 5: nobody cared" message on resume, along with 1 IRQ5 counts in /proc/interrupts (the devices claiming that IRQ are quiescent). Without 'noapic' that must be triggering something else to go haywire, perhaps the AER logic (though that is all MSI, so probably not). I'm flying blind on those boots. I bet that, if I can recall how to re-enable IRQ5, I'll see it continuously asserting. Chipset or BIOS bug maybe. I don't know if I had AER enabled under Lucid, so that might be the difference. I'll try a vanilla kernel next, maybe hack on AER a bit, to see if I can make it progress. On Sat, Dec 29, 2012 at 10:19 PM, Tim Hockin wrote: > Quick update: booting with 'noapic' on the commandline seems to make > it resume successfully. > > The main dmesg diffs, other than the obvious "Skipping IOAPIC probe" > and IRG number diffs) are: > > -nr_irqs_gsi: 40 > +nr_irqs_gsi: 16 > > -NR_IRQS:16640 nr_irqs:776 16 > +NR_IRQS:16640 nr_irqs:368 16 > > -system 00:0a: [mem 0xfec0-0xfec00fff] could not be reserved > +system 00:0a: [mem 0xfec0-0xfec00fff] has been reserved > > and a new warning about irq 5: nobody cared (try booting with the > "irqpoll" option) > > I'll see if I can sort out further differences, but I thought it was > worth sending this new info along, anyway. > > It did not require 'noapic' on the Lucid (2.6.32?) kernel > > > On Sat, Dec 29, 2012 at 9:34 PM, Tim Hockin wrote: >> Running a suspend with pm_trace set, I get: >> >> aer :00:03.0:pcie02: hash matches >> >> I don't know what magic might be needed here, though. >> >> I guess next step is to try to build a non-distro kernel. >> >> On Sat, Dec 29, 2012 at 1:57 PM, Rafael J. Wysocki wrote: >>> On Saturday, December 29, 2012 12:03:13 PM Tim Hockin wrote: >>>> 4 days ago I had Ubuntu Lucid running on this computer. Suspend and >>>> resume worked flawlessly every time. >>>> >>>> Then I upgraded to Ubuntu Precise. >>> >>> Well, do you use a distro kernel or a kernel.org kernel? >>> >>>> Suspend seems to work, but resume >>>> fails every time. The video never initializes. By the flashing >>>> keyboard lights, I guess it's a kernel panic. It fails from the Live >>>> CD and from a fresh install. >>>> >>>> Here is my debug so far. >>>> >>>> Install all updates (3.2 kernel, nouveau driver) >>>> Reboot >>>> Try suspend = fails >>>> >>>> Install Ubuntu's linux-generic-lts-quantal (3.5 kernel, nouveau driver) >>>> Reboot >>>> Try suspend = fails >>>> >>>> Install nVidia's 304 driver >>>> Reboot >>>> Try suspend = fails >>>> >>>> From within X: >>>> echo core > /sys/power/pm_test >>>> echo mem > /sys/power/state >>>> The system acts like it is going to sleep, and then wakes up a few >>>> seconds later. dmesg shows: >>>> >>>> [ 1230.083404] [ cut here ] >>>> [ 1230.083410] WARNING: at >>>> /build/buildd/linux-lts-quantal-3.5.0/kernel/power/suspend_test.c:53 >>>> suspend_test_finish+0x86/0x90() >>>> [ 1230.083411] Hardware name: To Be Filled By O.E.M. 
>>>> [ 1230.083412] Component: resume devices, time: 14424 >>>> [ 1230.083412] Modules linked in: snd_emu10k1_synth snd_emux_synth >>>> snd_seq_virmidi snd_seq_midi_emul bnep rfcomm parport_pc ppdev >>>> nvidia(PO) snd_emu10k1 snd_ac97_codec ac97_bus snd_pcm snd_page_alloc >>>> snd_util_mem snd_hwdep snd_seq_midi snd_rawmidi snd_seq_midi_event >>>> snd_seq snd_timer coretemp snd_seq_device kvm_intel kvm snd >>>> ghash_clmulni_intel soundcore aesni_intel btusb cryptd aes_x86_64 >>>> bluetooth i7core_edac edac_core microcode mac_hid lpc_ich mxm_wmi >>>> shpchp serio_raw wmi hid_generic lp parport usbhid hid r8169 >>>> pata_marvell >>>> [ 1230.083445] Pid: 3329, comm: bash Tainted: P O 3.5.0-21-generic >>>> #32~precise1-Ubuntu >>>> [ 1230.083446] Call Trace: >>>> [ 1230.083448] [] warn_slowpath_common+0x7f/0xc0 >>>> [ 1230.083452] [] warn_slowpath_fmt+0x46/0x50 >>>> [ 1230.083455] [] suspend_test_finish+0x86/0x90 >>>> [ 1230.083457] [] suspend_devices_and_enter+0x10b/0x200 >>>> [ 1230.083460] [] enter_state+0xd1/0x100 >>>> [ 1230.083463] [] pm_suspend+0x1b/0x60 >>>> [ 1230.083465] [] state_store+0x4
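The "continuously asserting" suspicion is easy to check from userspace by sampling /proc/interrupts twice and diffing the counts for the suspect line; a small sketch follows (IRQ number on the command line, parsing deliberately crude, and the 10000-per-second threshold is an arbitrary illustration of "storm"):

/* Sketch: detect an interrupt storm by sampling /proc/interrupts twice
 * and comparing the summed per-CPU count for one IRQ (e.g. "./irqwatch 5"). */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

static long long irq_count(int irq)
{
    char line[1024], want[16];
    long long total = 0;
    FILE *f = fopen("/proc/interrupts", "r");

    if (!f) {
        perror("/proc/interrupts");
        return -1;
    }
    snprintf(want, sizeof(want), "%d:", irq);
    while (fgets(line, sizeof(line), f)) {
        char *p = line;
        while (*p == ' ')
            p++;
        if (strncmp(p, want, strlen(want)))
            continue;
        /* Sum the per-CPU columns that follow the "N:" label; parsing
         * stops at the first non-numeric column (the chip name). */
        for (p += strlen(want); *p; ) {
            char *end;
            long long v = strtoll(p, &end, 10);
            if (end == p)
                break;
            total += v;
            p = end;
        }
        break;
    }
    fclose(f);
    return total;
}

int main(int argc, char **argv)
{
    int irq = argc > 1 ? atoi(argv[1]) : 5;
    long long before = irq_count(irq);
    sleep(1);
    long long after = irq_count(irq);

    printf("irq %d: %lld interrupts in 1s%s\n", irq, after - before,
           (after - before) > 10000 ? "  (possible storm)" : "");
    return 0;
}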
Re: kernel panic on resume from S3 - stumped
Quick update: booting with 'noapic' on the commandline seems to make it resume successfully. The main dmesg diffs (other than the obvious "Skipping IOAPIC probe" and IRQ number diffs) are: -nr_irqs_gsi: 40 +nr_irqs_gsi: 16 -NR_IRQS:16640 nr_irqs:776 16 +NR_IRQS:16640 nr_irqs:368 16 -system 00:0a: [mem 0xfec0-0xfec00fff] could not be reserved +system 00:0a: [mem 0xfec0-0xfec00fff] has been reserved and a new warning about irq 5: nobody cared (try booting with the "irqpoll" option) I'll see if I can sort out further differences, but I thought it was worth sending this new info along, anyway. It did not require 'noapic' on the Lucid (2.6.32?) kernel On Sat, Dec 29, 2012 at 9:34 PM, Tim Hockin wrote: > Running a suspend with pm_trace set, I get: > > aer :00:03.0:pcie02: hash matches > > I don't know what magic might be needed here, though. > > I guess next step is to try to build a non-distro kernel. > > On Sat, Dec 29, 2012 at 1:57 PM, Rafael J. Wysocki wrote: >> On Saturday, December 29, 2012 12:03:13 PM Tim Hockin wrote: >>> 4 days ago I had Ubuntu Lucid running on this computer. Suspend and >>> resume worked flawlessly every time. >>> >>> Then I upgraded to Ubuntu Precise. >> >> Well, do you use a distro kernel or a kernel.org kernel? >> >>> Suspend seems to work, but resume >>> fails every time. The video never initializes. By the flashing >>> keyboard lights, I guess it's a kernel panic. It fails from the Live >>> CD and from a fresh install. >>> >>> Here is my debug so far. >>> >>> Install all updates (3.2 kernel, nouveau driver) >>> Reboot >>> Try suspend = fails >>> >>> Install Ubuntu's linux-generic-lts-quantal (3.5 kernel, nouveau driver) >>> Reboot >>> Try suspend = fails >>> >>> Install nVidia's 304 driver >>> Reboot >>> Try suspend = fails >>> >>> From within X: >>> echo core > /sys/power/pm_test >>> echo mem > /sys/power/state >>> The system acts like it is going to sleep, and then wakes up a few >>> seconds later. dmesg shows: >>> >>> [ 1230.083404] [ cut here ] >>> [ 1230.083410] WARNING: at >>> /build/buildd/linux-lts-quantal-3.5.0/kernel/power/suspend_test.c:53 >>> suspend_test_finish+0x86/0x90() >>> [ 1230.083411] Hardware name: To Be Filled By O.E.M.
>>> [ 1230.083412] Component: resume devices, time: 14424 >>> [ 1230.083412] Modules linked in: snd_emu10k1_synth snd_emux_synth >>> snd_seq_virmidi snd_seq_midi_emul bnep rfcomm parport_pc ppdev >>> nvidia(PO) snd_emu10k1 snd_ac97_codec ac97_bus snd_pcm snd_page_alloc >>> snd_util_mem snd_hwdep snd_seq_midi snd_rawmidi snd_seq_midi_event >>> snd_seq snd_timer coretemp snd_seq_device kvm_intel kvm snd >>> ghash_clmulni_intel soundcore aesni_intel btusb cryptd aes_x86_64 >>> bluetooth i7core_edac edac_core microcode mac_hid lpc_ich mxm_wmi >>> shpchp serio_raw wmi hid_generic lp parport usbhid hid r8169 >>> pata_marvell >>> [ 1230.083445] Pid: 3329, comm: bash Tainted: P O 3.5.0-21-generic >>> #32~precise1-Ubuntu >>> [ 1230.083446] Call Trace: >>> [ 1230.083448] [] warn_slowpath_common+0x7f/0xc0 >>> [ 1230.083452] [] warn_slowpath_fmt+0x46/0x50 >>> [ 1230.083455] [] suspend_test_finish+0x86/0x90 >>> [ 1230.083457] [] suspend_devices_and_enter+0x10b/0x200 >>> [ 1230.083460] [] enter_state+0xd1/0x100 >>> [ 1230.083463] [] pm_suspend+0x1b/0x60 >>> [ 1230.083465] [] state_store+0x45/0x70 >>> [ 1230.083467] [] kobj_attr_store+0xf/0x30 >>> [ 1230.083471] [] sysfs_write_file+0xef/0x170 >>> [ 1230.083476] [] vfs_write+0xb3/0x180 >>> [ 1230.083480] [] sys_write+0x4a/0x90 >>> [ 1230.083483] [] system_call_fastpath+0x16/0x1b >>> [ 1230.083488] ---[ end trace 839cdd0078b3ce03 ]--- >>> >>> Boot with init=/bin/bash >>> unload all modules except USBHID >>> echo core > /sys/power/pm_test >>> echo mem > /sys/power/state >>> system acts like it is going to sleep, and then wakes up a few seconds later >>> echo none > /sys/power/pm_test >>> echo mem > /sys/power/state >>> system goes to sleep >>> press power to resume = fails >>> >>> At this point I am stumped on how to debug. This is a "modern" >>> computer with no serial ports. It worked under Lucid, so I know it is >>> POSSIBLE. >>> >>> Mobo: ASRock X58 single-socket >>> CPU: West
Re: kernel panic on resume from S3 - stumped
Running a suspend with pm_trace set, I get: aer :00:03.0:pcie02: hash matches I don't know what magic might be needed here, though. I guess next step is to try to build a non-distro kernel. On Sat, Dec 29, 2012 at 1:57 PM, Rafael J. Wysocki wrote: > On Saturday, December 29, 2012 12:03:13 PM Tim Hockin wrote: >> 4 days ago I had Ubuntu Lucid running on this computer. Suspend and >> resume worked flawlessly every time. >> >> Then I upgraded to Ubuntu Precise. > > Well, do you use a distro kernel or a kernel.org kernel? > >> Suspend seems to work, but resume >> fails every time. The video never initializes. By the flashing >> keyboard lights, I guess it's a kernel panic. It fails from the Live >> CD and from a fresh install. >> >> Here is my debug so far. >> >> Install all updates (3.2 kernel, nouveau driver) >> Reboot >> Try suspend = fails >> >> Install Ubuntu's linux-generic-lts-quantal (3.5 kernel, nouveau driver) >> Reboot >> Try suspend = fails >> >> Install nVidia's 304 driver >> Reboot >> Try suspend = fails >> >> From within X: >> echo core > /sys/power/pm_test >> echo mem > /sys/power/state >> The system acts like it is going to sleep, and then wakes up a few >> seconds later. dmesg shows: >> >> [ 1230.083404] [ cut here ] >> [ 1230.083410] WARNING: at >> /build/buildd/linux-lts-quantal-3.5.0/kernel/power/suspend_test.c:53 >> suspend_test_finish+0x86/0x90() >> [ 1230.083411] Hardware name: To Be Filled By O.E.M. >> [ 1230.083412] Component: resume devices, time: 14424 >> [ 1230.083412] Modules linked in: snd_emu10k1_synth snd_emux_synth >> snd_seq_virmidi snd_seq_midi_emul bnep rfcomm parport_pc ppdev >> nvidia(PO) snd_emu10k1 snd_ac97_codec ac97_bus snd_pcm snd_page_alloc >> snd_util_mem snd_hwdep snd_seq_midi snd_rawmidi snd_seq_midi_event >> snd_seq snd_timer coretemp snd_seq_device kvm_intel kvm snd >> ghash_clmulni_intel soundcore aesni_intel btusb cryptd aes_x86_64 >> bluetooth i7core_edac edac_core microcode mac_hid lpc_ich mxm_wmi >> shpchp serio_raw wmi hid_generic lp parport usbhid hid r8169 >> pata_marvell >> [ 1230.083445] Pid: 3329, comm: bash Tainted: P O 3.5.0-21-generic >> #32~precise1-Ubuntu >> [ 1230.083446] Call Trace: >> [ 1230.083448] [] warn_slowpath_common+0x7f/0xc0 >> [ 1230.083452] [] warn_slowpath_fmt+0x46/0x50 >> [ 1230.083455] [] suspend_test_finish+0x86/0x90 >> [ 1230.083457] [] suspend_devices_and_enter+0x10b/0x200 >> [ 1230.083460] [] enter_state+0xd1/0x100 >> [ 1230.083463] [] pm_suspend+0x1b/0x60 >> [ 1230.083465] [] state_store+0x45/0x70 >> [ 1230.083467] [] kobj_attr_store+0xf/0x30 >> [ 1230.083471] [] sysfs_write_file+0xef/0x170 >> [ 1230.083476] [] vfs_write+0xb3/0x180 >> [ 1230.083480] [] sys_write+0x4a/0x90 >> [ 1230.083483] [] system_call_fastpath+0x16/0x1b >> [ 1230.083488] ---[ end trace 839cdd0078b3ce03 ]--- >> >> Boot with init=/bin/bash >> unload all modules except USBHID >> echo core > /sys/power/pm_test >> echo mem > /sys/power/state >> system acts like it is going to sleep, and then wakes up a few seconds later >> echo none > /sys/power/pm_test >> echo mem > /sys/power/state >> system goes to sleep >> press power to resume = fails >> >> At this point I am stumped on how to debug. This is a "modern" >> computer with no serial ports. It worked under Lucid, so I know it is >> POSSIBLE. 
>> >> Mobo: ASRock X58 single-socket >> CPU: Westmere 6 core (12 hyperthreads) 3.2 GHz >> RAM: 12 GB ECC >> Disk: sda = Intel SSD, mounted on / >> Disk: sdb = Intel SSD, not mounted >> Disk: sdc = Seagate HDD, not mounted >> Disk: sdd = Seagate HDD, not mounted >> NIC = Onboard RTL8168e/8111e >> Sound = EMU1212 (emu10k1, not even configured yet) >> Video = nVidia GeForce 7600 GT >> KB = PS2 (also tried USB) >> Mouse = USB >> >> I have not updated to a more current kernel than 3.5, but I will if >> there's evidence that this is resolved. Any other clever trick to >> try? > > There is no evidence and there won't be if you don't try a newer kernel. > > Thanks, > Rafael > > > -- > I speak only for myself. > Rafael J. Wysocki, Intel Open Source Technology Center. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
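A note on the pm_trace line quoted above: when /sys/power/pm_trace is set, the kernel stores a small hash derived from the device it is about to resume into the RTC before touching that device, so after a hang and a reboot it can print "hash matches" for every device whose name hashes to the saved value; the AER port at 00:03.0 is therefore the prime suspect (or a hash collision). The following is a rough, self-contained sketch of that idea only; the hash function, the storage variable, and the device names are invented for illustration and are not taken from drivers/base/power/trace.c.

#include <stdio.h>

/* Illustrative only: pm_trace really folds the hash into the RTC/CMOS
 * clock registers; this variable just models "a few bits survive reboot". */
static unsigned int saved_hash;

static unsigned int hash_name(const char *name)
{
    unsigned int h = 0;
    while (*name)
        h = h * 31 + (unsigned char)*name++;
    return h % 1024;              /* only a small value can be preserved */
}

/* Called just before each device is resumed (conceptually TRACE_RESUME). */
static void pm_trace_store(const char *dev_name)
{
    saved_hash = hash_name(dev_name);
}

/* Called on the next boot: report every device whose hash matches. */
static void pm_trace_check(const char **devs, int n)
{
    for (int i = 0; i < n; i++)
        if (hash_name(devs[i]) == saved_hash)
            printf("%s: hash matches\n", devs[i]);
}

int main(void)
{
    const char *devs[] = { "0000:00:03.0:pcie02", "0000:00:1a.0", "sda" };

    pm_trace_store(devs[0]);      /* resume wedges while handling this one */
    /* ...machine hangs, user reboots... */
    pm_trace_check(devs, 3);      /* prints the suspect device(s) */
    return 0;
}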
kernel panic on resume from S3 - stumped
4 days ago I had Ubuntu Lucid running on this computer. Suspend and resume worked flawlessly every time. Then I upgraded to Ubuntu Precise. Suspend seems to work, but resume fails every time. The video never initializes. By the flashing keyboard lights, I guess it's a kernel panic. It fails from the Live CD and from a fresh install. Here is my debug so far. Install all updates (3.2 kernel, nouveau driver) Reboot Try suspend = fails Install Ubuntu's linux-generic-lts-quantal (3.5 kernel, nouveau driver) Reboot Try suspend = fails Install nVidia's 304 driver Reboot Try suspend = fails >From within X: echo core > /sys/power/pm_test echo mem > /sys/power/state The system acts like it is going to sleep, and then wakes up a few seconds later. dmesg shows: [ 1230.083404] [ cut here ] [ 1230.083410] WARNING: at /build/buildd/linux-lts-quantal-3.5.0/kernel/power/suspend_test.c:53 suspend_test_finish+0x86/0x90() [ 1230.083411] Hardware name: To Be Filled By O.E.M. [ 1230.083412] Component: resume devices, time: 14424 [ 1230.083412] Modules linked in: snd_emu10k1_synth snd_emux_synth snd_seq_virmidi snd_seq_midi_emul bnep rfcomm parport_pc ppdev nvidia(PO) snd_emu10k1 snd_ac97_codec ac97_bus snd_pcm snd_page_alloc snd_util_mem snd_hwdep snd_seq_midi snd_rawmidi snd_seq_midi_event snd_seq snd_timer coretemp snd_seq_device kvm_intel kvm snd ghash_clmulni_intel soundcore aesni_intel btusb cryptd aes_x86_64 bluetooth i7core_edac edac_core microcode mac_hid lpc_ich mxm_wmi shpchp serio_raw wmi hid_generic lp parport usbhid hid r8169 pata_marvell [ 1230.083445] Pid: 3329, comm: bash Tainted: P O 3.5.0-21-generic #32~precise1-Ubuntu [ 1230.083446] Call Trace: [ 1230.083448] [] warn_slowpath_common+0x7f/0xc0 [ 1230.083452] [] warn_slowpath_fmt+0x46/0x50 [ 1230.083455] [] suspend_test_finish+0x86/0x90 [ 1230.083457] [] suspend_devices_and_enter+0x10b/0x200 [ 1230.083460] [] enter_state+0xd1/0x100 [ 1230.083463] [] pm_suspend+0x1b/0x60 [ 1230.083465] [] state_store+0x45/0x70 [ 1230.083467] [] kobj_attr_store+0xf/0x30 [ 1230.083471] [] sysfs_write_file+0xef/0x170 [ 1230.083476] [] vfs_write+0xb3/0x180 [ 1230.083480] [] sys_write+0x4a/0x90 [ 1230.083483] [] system_call_fastpath+0x16/0x1b [ 1230.083488] ---[ end trace 839cdd0078b3ce03 ]--- Boot with init=/bin/bash unload all modules except USBHID echo core > /sys/power/pm_test echo mem > /sys/power/state system acts like it is going to sleep, and then wakes up a few seconds later echo none > /sys/power/pm_test echo mem > /sys/power/state system goes to sleep press power to resume = fails At this point I am stumped on how to debug. This is a "modern" computer with no serial ports. It worked under Lucid, so I know it is POSSIBLE. Mobo: ASRock X58 single-socket CPU: Westmere 6 core (12 hyperthreads) 3.2 GHz RAM: 12 GB ECC Disk: sda = Intel SSD, mounted on / Disk: sdb = Intel SSD, not mounted Disk: sdc = Seagate HDD, not mounted Disk: sdd = Seagate HDD, not mounted NIC = Onboard RTL8168e/8111e Sound = EMU1212 (emu10k1, not even configured yet) Video = nVidia GeForce 7600 GT KB = PS2 (also tried USB) Mouse = USB I have not updated to a more current kernel than 3.5, but I will if there's evidence that this is resolved. Any other clever trick to try? Tim -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [x86_64 MCE] [RFC] mce.c race condition (or: when evil hacks are the only options)
On 7/12/07, Andi Kleen <[EMAIL PROTECTED]> wrote: > -- there may be other edge cases other than > this one. I'm actually surprised that this wasn't a ring buffer to start > with -- it certainly seems like it wanted to be one. The problem with a ring buffer is that it would lose old entries; but for machine checks you really want the first entries because the later ones might be just junk. Couldn't the ring just have logic to detect an overrun and stop logging until that is alleviated? Similar to what is done now. Maybe I am underestimating it.. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
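The suggestion in that last paragraph, keep a ring but detect the overrun and stop accepting records rather than overwrite, can be sketched in a few lines of C under the constraint Andi raises (the earliest records are the trustworthy ones, so they must survive). This is only an illustration of the idea, not the actual mce.c code; a real implementation would also need NMI-safe index updates, which is part of what makes the existing scheme what it is.

#include <stdio.h>

#define MCE_LOG_LEN 32

struct mce_entry { unsigned long status; /* plus the other banks/fields */ };

struct mce_ring {
    struct mce_entry entry[MCE_LOG_LEN];
    unsigned int head;      /* next slot to write */
    unsigned int tail;      /* next slot the reader will consume */
    unsigned int overflow;  /* records dropped because the ring was full */
};

/* Add a record; on overrun, keep the old entries and count the drop
 * instead of overwriting, so the first (most meaningful) errors survive. */
static int mce_ring_add(struct mce_ring *r, struct mce_entry e)
{
    unsigned int next = (r->head + 1) % MCE_LOG_LEN;

    if (next == r->tail) {      /* full: stop logging, note the overrun */
        r->overflow++;
        return -1;
    }
    r->entry[r->head] = e;
    r->head = next;
    return 0;
}

/* Reader (e.g. the mcelog daemon) drains entries, making room again. */
static int mce_ring_get(struct mce_ring *r, struct mce_entry *out)
{
    if (r->tail == r->head)
        return 0;               /* empty */
    *out = r->entry[r->tail];
    r->tail = (r->tail + 1) % MCE_LOG_LEN;
    return 1;
}

int main(void)
{
    struct mce_ring ring = {0};
    struct mce_entry e;
    int kept = 0;

    for (unsigned long i = 0; i < 40; i++)
        mce_ring_add(&ring, (struct mce_entry){ .status = i });

    while (mce_ring_get(&ring, &e))
        kept++;
    printf("kept %d entries, dropped %u\n", kept, ring.overflow);
    return 0;
}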
Re: Kconfig variable "COBALT" is not defined anywhere
That sounds correct. On 6/3/07, Robert P. J. Day <[EMAIL PROTECTED]> wrote: On Sun, 3 Jun 2007, Tim Hockin wrote: > I think the nvram is the only place left that uses CONFIG_COBALT sure, but once you remove this snippet near the top of drivers/char/nvram.c: ... # if defined(CONFIG_COBALT) #include <linux/cobalt-nvram.h> #define MACH COBALT # else #define MACH PC # endif ... then everything else COBALT-related in that file should be tossed as well, which would include stuff conditional on: #if MACH == COBALT and so on. just making sure that what you're saying is that *all* COBALT-related content in that file can be thrown out. i'll submit a patch shortly and you can pass judgment. rday -- Robert P. J. Day Linux Consulting, Training and Annoying Kernel Pedantry Waterloo, Ontario, CANADA http://fsdev.net/wiki/index.php?title=Main_Page - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
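To make the dead-code argument concrete: once no Kconfig file can set CONFIG_COBALT, MACH always ends up defined as PC, so every #if MACH == COBALT branch in the file is unreachable and can be deleted along with the snippet quoted above. Below is a minimal stand-alone illustration of the pattern; the MACH_* macro values and the printed strings are invented so the example compiles outside the kernel, whereas the real nvram.c compares bare PC/COBALT tokens.

#include <stdio.h>

/* Stand-ins for the machine-type constants used by the driver. */
#define MACH_PC     1
#define MACH_COBALT 2

/* With no Kconfig entry defining CONFIG_COBALT, the #else always wins. */
#if defined(CONFIG_COBALT)
# define MACH MACH_COBALT
#else
# define MACH MACH_PC
#endif

int main(void)
{
#if MACH == MACH_COBALT
    puts("using the Cobalt nvram layout");   /* dead once CONFIG_COBALT is gone */
#else
    puts("using the PC nvram layout");
#endif
    return 0;
}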
Re: Kconfig variable "COBALT" is not defined anywhere
I think the nvram is the only place left that uses CONFIG_COBALT On 6/3/07, Robert P. J. Day <[EMAIL PROTECTED]> wrote: On Sun, 3 Jun 2007, Tim Hockin wrote: > There were other patches which added more COBALT support, but they > were dropped or lost or whatever. > > I would not balk at having that code yanked. I never got around to > doing proper Cobalt support for modern kernels. :( > > On 6/3/07, Roland Dreier <[EMAIL PROTECTED]> wrote: > > > > > there is no Kconfig file which defines the selectable option > > > > > "COBALT", which means that this snippet from drivers/char/nvram.c: > > > > > > > > > > # if defined(CONFIG_COBALT) > > > > > #include > > > > > #define MACH COBALT > > > > > # else > > > > > #define MACH PC > > > > > # endif > > > > > never evaluates to true, therefore making > > > > > fairly useless, at least under the circumstances. > > > > > > Maybe it should be MIPS_COBALT ? > > > > > that's the first thing that occurred to me, but that header file is > > > copyright sun microsystems and says nothing about MIPS, so that didn't > > > really settle the issue. that's why i'd rather someone else resolve > > > this one way or the other. > > > > Actually, looking through the old kernel history, it looks like this > > was added by Tim Hockin's (CCed) patch "Add Cobalt Networks support to > > nvram driver". Which added this to drivers/cobalt: > > > > +bool 'Support for Cobalt Networks x86 servers' CONFIG_COBALT > > > > I guess Tim can clear up what's intended... ok, that sounds like it might be a bigger issue than just a dead CONFIG variable. if that's all it is, i can submit a patch. if it's more than that, i'll leave it to someone higher up the food chain to figure out what cobalt-related stuff should be yanked. rday -- Robert P. J. Day Linux Consulting, Training and Annoying Kernel Pedantry Waterloo, Ontario, CANADA http://fsdev.net/wiki/index.php?title=Main_Page - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Kconfig variable "COBALT" is not defined anywhere
There were other patches which added more COBALT support, but they were dropped or lost or whatever. I would not balk at having that code yanked. I never got around to doing proper Cobalt support for modern kernels. :( On 6/3/07, Roland Dreier <[EMAIL PROTECTED]> wrote: > > > there is no Kconfig file which defines the selectable option > > > "COBALT", which means that this snippet from drivers/char/nvram.c: > > > > > > # if defined(CONFIG_COBALT) > > > #include > > > #define MACH COBALT > > > # else > > > #define MACH PC > > > # endif > > > never evaluates to true, therefore making > > > fairly useless, at least under the circumstances. > > Maybe it should be MIPS_COBALT ? > that's the first thing that occurred to me, but that header file is > copyright sun microsystems and says nothing about MIPS, so that didn't > really settle the issue. that's why i'd rather someone else resolve > this one way or the other. Actually, looking through the old kernel history, it looks like this was added by Tim Hockin's (CCed) patch "Add Cobalt Networks support to nvram driver". Which added this to drivers/cobalt: +bool 'Support for Cobalt Networks x86 servers' CONFIG_COBALT I guess Tim can clear up what's intended... - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] x86_64: mcelog tolerant level cleanup
On 5/18/07, Andi Kleen <[EMAIL PROTECTED]> wrote: > * If RIPV is not set it is not safe to restart, so set the 'no way out' > flag rather than the 'kill it' flag. Why? It is not PCC. We cannot return of course, but killing isn't returning. My understanding is that the absence of RIPV indicates that it is not safe to restart, period. Not that the running *task* is not safe, but that the IP on the stack is not valid to restart at all. > * Don't panic() on correctable MCEs. The idea behind this was that if you get an exception it is always a bit risky because there are a few potential deadlocks that cannot be avoided. And normally non UC is just polled which will never cause a panic. So I don't quite see the value of this change. It will still always panic when tolerant == 0, and of course you're right correctable errors would skip over the panic() path anyway. I can roll back the "<0" part, though I don't see the difference now :) > This patch also calls nonseekable_open() in mce_open (as suggested by akpm). That should be a separate patch Andrew already sucked it into -mm - do you want me to break it out, and re-submit? > + 0: always panic on uncorrected errors, log corrected errors > + 1: panic or SIGBUS on uncorrected errors, log corrected errors > + 2: SIGBUS or log uncorrected errors, log corrected errors Just saying SIGBUS is misleading because it isn't a catchable signal. Should I change that to "kill"? Why did you remove the idle special case? Because once the other tolerant rules are clarified, it's redundant for tolerant < 2, and I think it's a bad special case for tolerant == 2, and it's definitely wrong for tolerant == 3. Shall I re-roll? - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
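For readers following the thread, the tolerant levels under discussion amount to a small decision table. The sketch below encodes only the per-level policy quoted above (0 always panics on uncorrected errors, 1 kills the affected task where possible and panics otherwise, 2 kills or merely logs, 3 never panics or kills); it is an illustration of the proposed semantics, not the code in mce.c, which also has to account for RIPV, PCC, and the other "no way out" conditions.

#include <stdio.h>

enum mce_action { MCE_LOG_ONLY, MCE_KILL_TASK, MCE_PANIC };

/*
 * tolerant      : the sysfs knob being discussed (0..3)
 * uncorrected   : hardware could not correct the error
 * user_killable : the error can be pinned on a running user task
 */
static enum mce_action mce_decide(int tolerant, int uncorrected,
                                  int user_killable)
{
    if (!uncorrected)
        return MCE_LOG_ONLY;    /* corrected errors are only logged */

    switch (tolerant) {
    case 0:  return MCE_PANIC;                                    /* always panic */
    case 1:  return user_killable ? MCE_KILL_TASK : MCE_PANIC;    /* panic or kill */
    case 2:  return user_killable ? MCE_KILL_TASK : MCE_LOG_ONLY; /* kill or log */
    default: return MCE_LOG_ONLY;                                 /* 3: never panic or kill */
    }
}

int main(void)
{
    static const char *names[] = { "log", "kill", "panic" };

    /* Uncorrected error hitting a killable user task, at each level. */
    for (int t = 0; t <= 3; t++)
        printf("tolerant=%d -> %s\n", t, names[mce_decide(t, 1, 1)]);
    return 0;
}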