Re: [patch 7/8] mm, memcg: allow processes handling oom notifications to access reserves

2013-12-13 Thread Tejun Heo
Hello, Tim.

On Thu, Dec 12, 2013 at 04:23:18PM -0800, Tim Hockin wrote:
> Just to be clear - I say this because it doesn't feel right to impose
> my craziness on others, and it sucks when we try and are met with
> "you're crazy, go away".  And you have to admit that happens to
> Google. :)  Punching an escape valve that allows us to be crazy
> without hurting anyone else sounds ideal, IF and ONLY IF that escape
> valve is itself maintainable.

I don't think google being considered crazy is a good thing in
general, and it's highly likely not something to be proud of.  It sure
is partly indicative of the specialization that you guys need, but I
suspect it's a much stronger signal that there's room for better
engineering.

I'm fairly certain the blame is abundant for everybody to share.  The
point I'm trying to make is "let's please stop diverging".  It hurts
everybody.

> If the escape valve is userspace it's REALLY easy to iterate on our
> craziness.  If it is kernel space, it's somewhat less easy, but not
> impossible.

As I'm sure you've gathered from this thread, even punching the
initial hole is a sizable burden and contortion to the general memory
management and I'm sure as you guys develop further down the path
you'll encounter cases where you need further support or holes from
the kernel.  I can't anticipate the details but the fact that those
will follow is as evident as the day to me, especially given the
mindset leading to the current situation in the first place.

Please note that this part of the discussion is more abstract than
necessary for this particular patchset or hole.  I'm quite doubtful
that system-level OOM handling with a separate physical reserve is
likely to survive even just on the technical details.  The reason why I'm
keeping at this abstract point is because this seems to be a
continuing trend rather than a single occurrence and I really hope it
changes.

> Well that's an awesome start.  We have or had patches to do a lot of
> this.  I don't know how well scrubbed they are for pushing or whether
> they apply at all to current head, though.

Awesome, this looks like something everyone agrees on. :)

> As an aside: mucking about with extra nesting levels to achieve a
> stable OOM semantic sounds doable, but it certainly sucks in a unified
> hierarchy.  We'll end up with 1, 2, or 3 (or more in esoteric cases?
> not sure) extra nesting levels for every other resource dimension.
> And lawd help us if we ever need to do something similar in a
> different resource dimension - the cross product is mind-bending.
> What we do with split-hierarchies is this but on a smaller scale.

Yes, agreed, but I believe there are substantial benefits to having a
certain level of structural constraints.  It encourages people to
ponder the underlying issues and make active trade-offs.  Not that
going to that extreme would be good either, but we've gone too far
towards the other end.

This being a special issue with memcg, if this turns out to be a big
enough problem, I don't think having a provision to be able to handle
it without further nesting would be too crazy - e.g. the ability to
mark a single cgroup at the root level as the OOM handler or whatever
- as long as we stay within the boundaries of memcg and cgroup proper,
but we seem to have a ways to go before worrying about that one.

Thanks.

-- 
tejun


Re: [patch 7/8] mm, memcg: allow processes handling oom notifications to access reserves

2013-12-12 Thread Tim Hockin
On Thu, Dec 12, 2013 at 11:23 AM, Tejun Heo  wrote:
> Hello, Tim.
>
> On Thu, Dec 12, 2013 at 10:42:20AM -0800, Tim Hockin wrote:
>> Yeah sorry.  Replying from my phone is awkward at best.  I know better :)
>
> Heh, sorry about being bitchy. :)
>
>> In my mind, the ONLY point of pulling system-OOM handling into
>> userspace is to make it easier for crazy people (Google) to implement
>> bizarre system-OOM policies.  Example:
>
> I think that's one of the places where we largely disagree.  If at all

Just to be clear - I say this because it doesn't feel right to impose
my craziness on others, and it sucks when we try and are met with
"you're crazy, go away".  And you have to admit that happens to
Google. :)  Punching an escape valve that allows us to be crazy
without hurting anyone else sounds ideal, IF and ONLY IF that escape
valve is itself maintainable.

If the escape valve is userspace it's REALLY easy to iterate on our
craziness.  If it is kernel space, it's somewhat less easy, but not
impossible.

> possible, I'd much prefer google's workload to be supported inside the
> general boundaries of the upstream kernel without having to punch a
> large hole in it.  To me, the general development history of memcg in
> general and this thread in particular seem to epitomize why it is a
> bad idea to have isolated, large and deep "crazy" use cases.  Punching
> the initial hole is the easy part; however, we all are quite limited
> in anticipating future needs and sooner or later that crazy use case is
> bound to evolve further towards the isolated extreme it departed
> towards and require more and larger holes and further contortions to
> accommodate such progress.
>
> The concern I have with the suggested solution is not necessarily that
> it's technically more complex than it looks on the surface - I'm sure
> it can be made to work one way or the other - but that it's a fairly
> large step toward an isolated extreme which memcg as a project
> probably should not head toward.
>
> There sure are cases where such exceptions can't be avoided and are
> good trade-offs but, here, we're talking about a major architectural
> decision which not only affects memcg but mm in general.  I'm afraid
> this doesn't sound like a no-brainer flexibility we can afford.
>
>> When we have a system OOM we want to do a walk of the administrative
>> memcg tree (which is only a couple levels deep, users can make
>> non-admin sub-memcgs), selecting the lowest priority entity at each
>> step (where both tasks and memcgs have a priority and the priority
>> range is much wider than the current OOM scores, and where memcg
>> priority is sometimes a function of memcg usage), until we reach a
>> leaf.
>>
>> Once we reach a leaf, I want to log some info about the memcg doing
>> the allocation, the memcg being terminated, and maybe some other bits
>> about the system (depending on the priority of the selected victim,
>> this may or may not be an "acceptable" situation).  Then I want to
>> kill *everything* under that memcg.  Then I want to "publish" some
>> information through a sane API (e.g. not dmesg scraping).
>>
>> This is basically our policy as we understand it today.  This is
>> notably different than it was a year ago, and it will probably evolve
>> further in the next year.
>
> I think per-memcg score and killing is something which makes
> fundamental sense.  In fact, killing a single process has never made
> much sense to me as that is a unit which ultimately is only meaningful
> to the kernel itself and not necessarily to userland, so no matter
> what I think we're gonna gain per-memcg behavior and it seems most,
> albeit not all, of what you described above should be implementable
> through that.

Well that's an awesome start.  We have or had patches to do a lot of
this.  I don't know how well scrubbed they are for pushing or whether
they apply at all to current head, though.

> Ultimately, if the use case calls for very fine level of control, I
> think the right thing to do is making nesting work properly which is
> likely to take some time.  In the meantime, even if such use case
> requires modifying the kernel to tailor the OOM behavior, I think
> sticking to kernel OOM provides a lot easier way to eventual
> convergence.  Userland system OOM basically means giving up and would
> lessen the motivation towards improving the shared infrastructures
> while adding significant pressure towards schizophrenic diversion.
>
>> We have a long tail of kernel memory usage.  If we provision machines
>> so that the "do work here" first-level memcg excludes the average
>> kernel usage, we have a huge number of machines that will fail to
>> apply OOM policy because of actual overcommitment.  If we provision
>> for 95th or 99th percentile kernel usage, we're wasting large amounts
>> of memory that could be used to schedule jobs.  This is the
>> fundamental problem we face with static apportionment (and we face it
> in a dozen other situations, too).  Expressing this set-aside memory
> as "off-the-top" rather than absolute limits makes the whole system
> more flexible.

Re: [patch 7/8] mm, memcg: allow processes handling oom notifications to access reserves

2013-12-12 Thread Tejun Heo
Hello, Tim.

On Thu, Dec 12, 2013 at 10:42:20AM -0800, Tim Hockin wrote:
> Yeah sorry.  Replying from my phone is awkward at best.  I know better :)

Heh, sorry about being bitchy. :)

> In my mind, the ONLY point of pulling system-OOM handling into
> userspace is to make it easier for crazy people (Google) to implement
> bizarre system-OOM policies.  Example:

I think that's one of the places where we largely disagree.  If at all
possible, I'd much prefer google's workload to be supported inside the
general boundaries of the upstream kernel without having to punch a
large hole in it.  To me, the general development history of memcg in
general and this thread in particular seem to epitomize why it is a
bad idea to have isolated, large and deep "crazy" use cases.  Punching
the initial hole is the easy part; however, we all are quite limited
in anticipating future needs and sooner or later that crazy use case is
bound to evolve further towards the isolated extreme it departed
towards and require more and larger holes and further contortions to
accommodate such progress.

The concern I have with the suggested solution is not necessarily that
it's technically more complex than it looks on the surface - I'm sure
it can be made to work one way or the other - but that it's a fairly
large step toward an isolated extreme which memcg as a project
probably should not head toward.

There sure are cases where such exceptions can't be avoided and are
good trade-offs but, here, we're talking about a major architectural
decision which not only affects memcg but mm in general.  I'm afraid
this doesn't sound like a no-brainer flexibility we can afford.

> When we have a system OOM we want to do a walk of the administrative
> memcg tree (which is only a couple levels deep, users can make
> non-admin sub-memcgs), selecting the lowest priority entity at each
> step (where both tasks and memcgs have a priority and the priority
> range is much wider than the current OOM scores, and where memcg
> priority is sometimes a function of memcg usage), until we reach a
> leaf.
> 
> Once we reach a leaf, I want to log some info about the memcg doing
> the allocation, the memcg being terminated, and maybe some other bits
> about the system (depending on the priority of the selected victim,
> this may or may not be an "acceptable" situation).  Then I want to
> kill *everything* under that memcg.  Then I want to "publish" some
> information through a sane API (e.g. not dmesg scraping).
> 
> This is basically our policy as we understand it today.  This is
> notably different than it was a year ago, and it will probably evolve
> further in the next year.

I think per-memcg score and killing is something which makes
fundamental sense.  In fact, killing a single process has never made
much sense to me as that is a unit which ultimately is only meaningful
to the kernel itself and not necessarily to userland, so no matter
what I think we're gonna gain per-memcg behavior and it seems most,
albeit not all, of what you described above should be implementable
through that.

Ultimately, if the use case calls for very fine level of control, I
think the right thing to do is making nesting work properly which is
likely to take some time.  In the meantime, even if such use case
requires modifying the kernel to tailor the OOM behavior, I think
sticking to kernel OOM provides a lot easier way to eventual
convergence.  Userland system OOM basically means giving up and would
lessen the motivation towards improving the shared infrastructures
while adding significant pressure towards schizophrenic diversion.

> We have a long tail of kernel memory usage.  If we provision machines
> so that the "do work here" first-level memcg excludes the average
> kernel usage, we have a huge number of machines that will fail to
> apply OOM policy because of actual overcommitment.  If we provision
> for 95th or 99th percentile kernel usage, we're wasting large amounts
> of memory that could be used to schedule jobs.  This is the
> fundamental problem we face with static apportionment (and we face it
> in a dozen other situations, too).  Expressing this set-aside memory
> as "off-the-top" rather than absolute limits makes the whole system
> more flexible.

I agree that's pretty sad.  Maybe I shouldn't be surprised given the
far-from-perfect coverage of kmemcg at this point, but, again,
*everyone* wants [k]memcg coverage to be more complete and we have and
are still building the infrastructures to make that possible, so I'm
still of the opinion that making [k]memcg work better is the better
direction to pursue and given the short development history of kmemcg
I'm fairly sure there are quite a few low hanging fruits.

Another thing which *might* be relevant is the rigidity of the upper
limit and the vagueness of the soft limit in the current implementation.
I have a rather strong suspicion that the way memcg config knobs
behave now - one finicky, the other whatever - is likely hindering

Re: [patch 7/8] mm, memcg: allow processes handling oom notifications to access reserves

2013-12-12 Thread Tim Hockin
On Thu, Dec 12, 2013 at 6:21 AM, Tejun Heo  wrote:
> Hey, Tim.
>
> Sidenote: Please don't top-post with the whole body quoted below
> unless you're adding new cc's.  Please selectively quote the original
> message's body to remind the readers of the context and reply below
> it.  It's a basic lkml etiquette and one with good reasons.  If you
> have to top-post for whatever reason - say you're typing from a
> machine which doesn't allow easy editing of the original message,
> explain so at the top of the message, or better yet, wait till you can
> unless it's urgent.

Yeah sorry.  Replying from my phone is awkward at best.  I know better :)

> On Wed, Dec 11, 2013 at 09:37:46PM -0800, Tim Hockin wrote:
>> The immediate problem I see with setting aside reserves "off the top"
>> is that we don't really know a priori how much memory the kernel
>> itself is going to use, which could still land us in an overcommitted
>> state.
>>
>> In other words, if I have your 128 MB machine, and I set aside 8 MB
>> for OOM handling, and give 120 MB for jobs, I have not accounted for
>> the kernel.  So I set aside 8 MB for OOM and 100 MB for jobs, leaving
>> 20 MB for the kernel.  That should be enough, right?  Hell if I know, and
>> nothing ensures that.
>
> Yes, sure thing, that's the reason why I mentioned "with some slack"
> in the original message and also that it might not be completely the
> same.  It doesn't allow you to aggressively use system level OOM
> handling as the sizing estimator for the root cgroup; however, it's
> more of an implementation detail than something which should guide
> the overall architecture - it's a problem which lessens in severity as
> [k]memcg improves and its coverage becomes more complete, which is the
> direction we should be headed no matter what.

In my mind, the ONLY point of pulling system-OOM handling into
userspace is to make it easier for crazy people (Google) to implement
bizarre system-OOM policies.  Example:

When we have a system OOM we want to do a walk of the administrative
memcg tree (which is only a couple levels deep, users can make
non-admin sub-memcgs), selecting the lowest priority entity at each
step (where both tasks and memcgs have a priority and the priority
range is much wider than the current OOM scores, and where memcg
priority is sometimes a function of memcg usage), until we reach a
leaf.

Once we reach a leaf, I want to log some info about the memcg doing
the allocation, the memcg being terminated, and maybe some other bits
about the system (depending on the priority of the selected victim,
this may or may not be an "acceptable" situation).  Then I want to
kill *everything* under that memcg.  Then I want to "publish" some
information through a sane API (e.g. not dmesg scraping).

This is basically our policy as we understand it today.  This is
notably different than it was a year ago, and it will probably evolve
further in the next year.
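
To make the shape of that walk concrete, here is a rough sketch in
Python.  It assumes a cgroup-v1 style memory hierarchy rooted at an
administrative subtree and a hypothetical per-memcg "priority" file
maintained by the cluster manager; neither that file nor this scoring
scheme exists in the mainline kernel, so treat it as illustration only.

import os
import signal

MEMCG_ROOT = "/sys/fs/cgroup/memory/admin"   # administrative subtree (assumption)

def read_int(path, default=0):
    try:
        with open(path) as f:
            return int(f.read().strip() or default)
    except OSError:
        return default

def lowest_priority_child(memcg):
    """Return the child memcg with the lowest priority, or None at a leaf."""
    children = [os.path.join(memcg, d) for d in os.listdir(memcg)
                if os.path.isdir(os.path.join(memcg, d))]
    if not children:
        return None
    # priority may in practice be a function of usage; this reads a static value
    return min(children, key=lambda c: read_int(os.path.join(c, "priority")))

def select_victim():
    memcg = MEMCG_ROOT
    while True:
        child = lowest_priority_child(memcg)
        if child is None:
            return memcg          # reached a leaf
        memcg = child

def kill_memcg(memcg):
    """Kill everything under the victim memcg, then log what happened."""
    for root, _, _ in os.walk(memcg):
        with open(os.path.join(root, "tasks")) as f:
            pids = f.read().split()
        for pid in pids:
            try:
                os.kill(int(pid), signal.SIGKILL)
            except ProcessLookupError:
                pass
    print("system OOM: killed everything under", memcg)

if __name__ == "__main__":
    kill_memcg(select_victim())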

Teaching the kernel all of this stuff has proven to be sort of
difficult to maintain and forward-port, and has been very slow to
evolve because of how painful it is to test and deploy new kernels.

Maybe we can find a way to push this level of policy down to the
kernel OOM killer?  When this was mentioned internally I got shot down
(gently, but shot down none the less).  Assuming we had
nearly-reliable (it doesn't have to be 100% guaranteed to be useful)
OOM-in-userspace, I can keep the administrative memcg metadata in
memory, implement killing as cruelly as I need, and do all of the
logging and publication after the OOM kill is done.  Most importantly
I can test and deploy new policy changes pretty trivially.

Handling per-memcg OOM is a different discussion.  Here is where we
want to be able to extract things like heap profiles or take stats
snapshots, grow memcgs (if so configured) etc.  Allowing our users to
have a moment of mercy before we put a bullet in their brain enables a
whole new realm of debugging, as well as a lot of valuable features.
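
For the per-memcg case, the notification half already exists in cgroup
v1 via memory.oom_control and eventfd(2).  A minimal sketch of such a
handler follows (Python >= 3.10 for os.eventfd(); the memcg path and
the grow-the-limit policy are illustrative assumptions, not part of
this patchset):

import os

MEMCG = "/sys/fs/cgroup/memory/jobs/job0"     # illustrative path
GROW_BYTES = 64 * 1024 * 1024                 # illustrative policy knob

efd = os.eventfd(0)                           # Python >= 3.10
oom_fd = os.open(os.path.join(MEMCG, "memory.oom_control"), os.O_RDONLY)

# Register the eventfd for OOM notifications on this memcg (cgroup v1).
with open(os.path.join(MEMCG, "cgroup.event_control"), "w") as f:
    f.write(f"{efd} {oom_fd}")

# Disable the in-kernel OOM killer for this memcg so its tasks block in
# the charge path and the handler gets its "moment of mercy".
with open(os.path.join(MEMCG, "memory.oom_control"), "w") as f:
    f.write("1")

while True:
    os.read(efd, 8)                           # blocks until the memcg hits OOM
    # Take a stats snapshot before deciding what to do.
    with open(os.path.join(MEMCG, "memory.stat")) as f:
        print("memcg OOM in", MEMCG, "\n", f.read())
    # One possible policy: grow the limit so the blocked tasks can proceed.
    limit_path = os.path.join(MEMCG, "memory.limit_in_bytes")
    with open(limit_path) as f:
        limit = int(f.read())
    with open(limit_path, "w") as f:
        f.write(str(limit + GROW_BYTES))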

> It'd depend on the workload but with memcg fully configured it
> shouldn't fluctuate wildly.  If it does, we need to hunt down whatever
> is causing such fluctuation and include it in kmemcg, right?  That
> way, memcg as a whole improves for all use cases not just your niche
> one and I strongly believe that aligning as many use cases as possible
> along the same axis, rather than creating a large hole to stow away
> the exceptions, is vastly more beneficial to *everyone* in the long
> term.

We have a long tail of kernel memory usage.  If we provision machines
so that the "do work here" first-level memcg excludes the average
kernel usage, we have a huge number of machines that will fail to
apply OOM policy because of actual overcommitment.  If we provision
for 95th or 99th percentile kernel usage, we're wasting large amounts
of memory that could be used to schedule jobs.  This is the
fundamental problem we face with static apportionment (and we face it
in a dozen other situations, too).  Expressing this set-aside memory
as "off-the-top" rather than absolute limits makes the whole system
more flexible.

Re: [patch 7/8] mm, memcg: allow processes handling oom notifications to access reserves

2013-12-12 Thread Michal Hocko
On Thu 12-12-13 09:21:56, Tejun Heo wrote:
[...]
> There'd still be all the bells and whistles to configure and monitor
> system-level OOM and if there's justified need for improvements, we
> surely can and should do that;

You weren't on the CC of the original thread, which started here:
https://lkml.org/lkml/2013/11/19/191. And the original request for
discussion was more about user-defined _policies_ for the global
OOM rather than a userspace global OOM handler. I feel that there
are use cases where the current "kill a single task based on some
calculations" is far from optimal, which leads to hacks that try to
cope with the after-OOM condition somewhat gracefully.

I do agree with you that pulling OOM handling into userspace sounds
too dangerous even with all the code that it would need, and I feel we
should go a different path than (ab)using the memcg.oom_control
interface for that.  I still think we need to have a way to tell the
global OOM killer what to do.

[...]
-- 
Michal Hocko
SUSE Labs


Re: [patch 7/8] mm, memcg: allow processes handling oom notifications to access reserves

2013-12-12 Thread Tejun Heo
Hello, Michal.

On Thu, Dec 12, 2013 at 05:32:22PM +0100, Michal Hocko wrote:
> You weren't on the CC of the original thread, which started here:
> https://lkml.org/lkml/2013/11/19/191. And the original request for
> discussion was more about user-defined _policies_ for the global
> OOM rather than a userspace global OOM handler. I feel that there
> are use cases where the current "kill a single task based on some
> calculations" is far from optimal, which leads to hacks that try to
> cope with the after-OOM condition somewhat gracefully.
> 
> I do agree with you that pulling OOM handling into userspace sounds
> too dangerous even with all the code that it would need, and I feel we
> should go a different path than (ab)using the memcg.oom_control
> interface for that.  I still think we need to have a way to tell the
> global OOM killer what to do.

Oh yeah, sure, I have no fundamental objections against improving the
in-kernel system OOM handler, including making it cgroup-aware which
seems like a natural extension to me.

Thanks.

-- 
tejun


Re: [patch 7/8] mm, memcg: allow processes handling oom notifications to access reserves

2013-12-12 Thread Tejun Heo
Hey, Tim.

Sidenote: Please don't top-post with the whole body quoted below
unless you're adding new cc's.  Please selectively quote the original
message's body to remind the readers of the context and reply below
it.  It's a basic lkml etiquette and one with good reasons.  If you
have to top-post for whatever reason - say you're typing from a
machine which doesn't allow easy editing of the original message,
explain so at the top of the message, or better yet, wait till you can
unless it's urgent.

On Wed, Dec 11, 2013 at 09:37:46PM -0800, Tim Hockin wrote:
> The immediate problem I see with setting aside reserves "off the top"
> is that we don't really know a priori how much memory the kernel
> itself is going to use, which could still land us in an overcommitted
> state.
> 
> In other words, if I have your 128 MB machine, and I set aside 8 MB
> for OOM handling, and give 120 MB for jobs, I have not accounted for
> the kernel.  So I set aside 8 MB for OOM and 100 MB for jobs, leaving
> 20 MB for the kernel.  That should be enough, right?  Hell if I know, and
> nothing ensures that.

Yes, sure thing, that's the reason why I mentioned "with some slack"
in the original message and also that it might not be completely the
same.  It doesn't allow you to aggressively use system level OOM
handling as the sizing estimator for the root cgroup; however, it's
more of an implementation detail than something which should guide
the overall architecture - it's a problem which lessens in severity as
[k]memcg improves and its coverage becomes more complete, which is the
direction we should be headed no matter what.

It'd depend on the workload but with memcg fully configured it
shouldn't fluctuate wildly.  If it does, we need to hunt down whatever
is causing such fluctuation and include it in kmemcg, right?  That
way, memcg as a whole improves for all use cases not just your niche
one and I strongly believe that aligning as many use cases as possible
along the same axis, rather than creating a large hole to stow away
the exceptions, is vastly more beneficial to *everyone* in the long
term.

There'd still be all the bells and whistles to configure and monitor
system-level OOM and if there's justified need for improvements, we
surely can and should do that; however, with the heavy lifting / hot
path offloaded to the per-memcg userland OOM handlers, I believe it's
reasonable to expect the burden on the system OOM handler to be noticeably
less, which is the way it should be.  That's the last guard against
the whole system completely locking up and we can't extend its
capabilities beyond that easily and we most likely don't even want to.

If I take back a step and look at the two options and their pros and
cons, which path we should take is rather obvious to me.  I hope you
see it too.

Thanks.

-- 
tejun


Re: [patch 7/8] mm, memcg: allow processes handling oom notifications to access reserves

2013-12-11 Thread Tim Hockin
The immediate problem I see with setting aside reserves "off the top"
is that we don't really know a priori how much memory the kernel
itself is going to use, which could still land us in an overcommitted
state.

In other words, if I have your 128 MB machine, and I set aside 8 MB
for OOM handling, and give 120 MB for jobs, I have not accounted for
the kernel.  So I set aside 8 MB for OOM and 100 MB for jobs, leaving
20 MB for the kernel.  That should be enough, right?  Hell if I know, and
nothing ensures that.

On Wed, Dec 11, 2013 at 4:42 AM, Tejun Heo  wrote:
> Yo,
>
> On Tue, Dec 10, 2013 at 03:55:48PM -0800, David Rientjes wrote:
>> > Well, the gotcha there is that you won't be able to do that with
>> > system level OOM handler either unless you create a separately
>> > reserved memory, which, again, can be achieved using hierarchical
>> > memcg setup already.  Am I missing something here?
>>
>> System oom conditions would only arise when the usage of memcgs A + B
>> above cause the page allocator to not be able to allocate memory without
>> oom killing something even though the limits of both A and B may not have
>> been reached yet.  No userspace oom handler can allocate memory with
>> access to memory reserves in the page allocator in such a context; it's
>> vital that if we are to handle system oom conditions in userspace that we
>> give them access to memory that other processes can't allocate.  You
>> could attach a userspace system oom handler to any memcg in this scenario
>> with memory.oom_reserve_in_bytes and since it has PF_OOM_HANDLER it would
>> be able to allocate in reserves in the page allocator and overcharge in
>> its memcg to handle it.  This isn't possible with only a hierarchical
>> memcg setup unless you ensure the sum of the limits of the top level
>> memcgs do not equal or exceed the sum of the min watermarks of all memory
>> zones, and we exceed that.
>
> Yes, exactly.  If system memory is 128M, create top level memcgs w/
> 120M and 8M each (well, with some slack of course) and then overcommit
> the descendants of 120M while putting OOM handlers and friends under
> 8M without overcommitting.
>
> ...
>> The stronger rationale is that you can't handle system oom in userspace
>> without this functionality and we need to do so.
>
> You're giving yourself an unreasonable precondition - overcommitting
> at root level and handling system OOM from userland - and then trying
> to contort everything to fit that.  How can "overcommitting at root
> level" possibly be a goal in and of itself?  Please take a step back
> and look at and explain the *problem* you're trying to solve.  You
> haven't explained why that *need*s to be the case at all.
>
> I wrote this at the start of the thread but you're still doing the
> same thing.  You're trying to create a hidden memcg level inside a
> memcg.  At the beginning of this thread, you were trying to do that
> for !root memcgs and now you're arguing that you *need* that for root
> memcg.  Because there's no other limit we can make use of, you're
> suggesting the use of kernel reserve memory for that purpose.  It
> seems like an absurd thing to do to me.  It could be that you might
> not be able to achieve exactly the same thing that way, but the right
> thing to do would be improving memcg in general so that it can do that,
> instead of adding yet another layer of half-baked complexity, right?
>
> Even if there are some inherent advantages of system userland OOM
> handling with a separate physical memory reserve, which AFAICS you
> haven't succeeded at showing yet, this is a very invasive change and,
> as you said before, something with an *extremely* narrow use case.
> Wouldn't it be a better idea to improve the existing mechanisms - be
> that memcg in general or kernel OOM handling - to fit the niche use
> case better?  I mean, just think about all the corner cases.  How are
> you gonna handle priority inversion through locked pages or
> allocations given out to other tasks through slab?  You're suggesting
> opening a giant can of worms for extremely narrow benefit which
> doesn't even actually seem to need opening said can.
>
> Thanks.
>
> --
> tejun


Re: [patch 7/8] mm, memcg: allow processes handling oom notifications to access reserves

2013-12-11 Thread Tejun Heo
Yo,

On Tue, Dec 10, 2013 at 03:55:48PM -0800, David Rientjes wrote:
> > Well, the gotcha there is that you won't be able to do that with
> > system level OOM handler either unless you create a separately
> > reserved memory, which, again, can be achieved using hierarchical
> > memcg setup already.  Am I missing something here?
> 
> System oom conditions would only arise when the usage of memcgs A + B 
> above cause the page allocator to not be able to allocate memory without 
> oom killing something even though the limits of both A and B may not have 
> been reached yet.  No userspace oom handler can allocate memory with 
> access to memory reserves in the page allocator in such a context; it's 
> vital that if we are to handle system oom conditions in userspace that we 
> give them access to memory that other processes can't allocate.  You 
> could attach a userspace system oom handler to any memcg in this scenario 
> with memory.oom_reserve_in_bytes and since it has PF_OOM_HANDLER it would 
> be able to allocate in reserves in the page allocator and overcharge in 
> its memcg to handle it.  This isn't possible with only a hierarchical 
> memcg setup unless you ensure the sum of the limits of the top level 
> memcgs do not equal or exceed the sum of the min watermarks of all memory 
> zones, and we exceed that.

Yes, exactly.  If system memory is 128M, create top level memcgs w/
120M and 8M each (well, with some slack of course) and then overcommit
the descendants of 120M while putting OOM handlers and friends under
8M without overcommitting.
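
For concreteness, a sketch of that split using the existing cgroup-v1
memory controller; the mount point, group names and per-child limits
are illustrative assumptions:

import os

ROOT = "/sys/fs/cgroup/memory"      # cgroup-v1 memory controller mount (assumption)
MB = 1024 * 1024

def make_memcg(path, limit_bytes):
    os.makedirs(path, exist_ok=True)
    with open(os.path.join(path, "memory.limit_in_bytes"), "w") as f:
        f.write(str(limit_bytes))

# Work goes under "jobs" (120M); the OOM handlers live under "oomd" (8M)
# and are never overcommitted, so they keep making progress during a
# system OOM.
make_memcg(os.path.join(ROOT, "jobs"), 120 * MB)
make_memcg(os.path.join(ROOT, "oomd"), 8 * MB)

# Descendants of "jobs" may be overcommitted (90M + 90M > 120M); the
# parent's hard limit still protects the 8M set aside for the handlers.
make_memcg(os.path.join(ROOT, "jobs/a"), 90 * MB)
make_memcg(os.path.join(ROOT, "jobs/b"), 90 * MB)

# Put this process (the userspace OOM handler) into the reserved group.
with open(os.path.join(ROOT, "oomd", "tasks"), "w") as f:
    f.write(str(os.getpid()))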

...
> The stronger rationale is that you can't handle system oom in userspace 
> without this functionality and we need to do so.

You're giving yourself an unreasonable precondition - overcommitting
at root level and handling system OOM from userland - and then trying
to contort everything to fit that.  How can "overcommitting at root
level" possibly be a goal in and of itself?  Please take a step back
and look at and explain the *problem* you're trying to solve.  You
haven't explained why that *need*s to be the case at all.

I wrote this at the start of the thread but you're still doing the
same thing.  You're trying to create a hidden memcg level inside a
memcg.  At the beginning of this thread, you were trying to do that
for !root memcgs and now you're arguing that you *need* that for root
memcg.  Because there's no other limit we can make use of, you're
suggesting the use of kernel reserve memory for that purpose.  It
seems like an absurd thing to do to me.  It could be that you might
not be able to achieve exactly the same thing that way, but the right
thing to do would be improving memcg in general so that it can do that,
instead of adding yet another layer of half-baked complexity, right?

Even if there are some inherent advantages of system userland OOM
handling with a separate physical memory reserve, which AFAICS you
haven't succeeded at showing yet, this is a very invasive change and,
as you said before, something with an *extremely* narrow use case.
Wouldn't it be a better idea to improve the existing mechanisms - be
that memcg in general or kernel OOM handling - to fit the niche use
case better?  I mean, just think about all the corner cases.  How are
you gonna handle priority inversion through locked pages or
allocations given out to other tasks through slab?  You're suggesting
opening a giant can of worms for extremely narrow benefit which
doesn't even actually seem to need opening said can.

Thanks.

-- 
tejun


Re: [patch 7/8] mm, memcg: allow processes handling oom notifications to access reserves

2013-12-11 Thread Mel Gorman
On Tue, Dec 10, 2013 at 03:55:48PM -0800, David Rientjes wrote:
> > Okay, are you saying that userland OOM handlers will be able to dip
> > into kernel reserve memory?  Maybe I'm mistaken but you realize that
> > that reserve is there to make things like task exits work under OOM
> > conditions, right?  The only way userland OOM handlers as you describe
> > would work would be creating a separate reserve for them.
> > 
> 
> Yes, PF_OOM_HANDLER processes would be able to allocate this amount as 
> specified by memory.oom_reserve_in_bytes below the per-zone watermarks and 
> the amount of reserves can already be controlled via min_free_kbytes, 
> which we already increase internally for thp.

THP increased min_free_kbytes for external fragmentation control as
it reduces the amount of mixing of the different migrate types within
pageblocks. It was not about reserves, increasing reserves was just the
most straight forward way of handling the problem.

This discussion is closer to swap-over-network than to anything
THP did. Swap-over-network takes care to only allocate memory from
reserves if the allocation was required for swapping and rejects
all other allocation requests, to the extent that they can get throttled
in throttle_direct_reclaim. Once allocated from reserves for swapping,
care is taken that the allocations are not leaked to other users (e.g.
is_obj_pfmemalloc checks in slab).

It does not look like PF_OOM_HANDLER takes the same sort of care. Even
if it did, it's not quite the same. swap-over-network allocates from the
zone reserves *only* the memory required to writeback the pages. It can
be slow but it'll make forward progress. A userspace process with special
privileges could allocate any amount of memory for any reason so it would
need a pre-configured and limited reserve on top of the zone reserves or
run the risk of livelock.

-- 
Mel Gorman
SUSE Labs


Re: [patch 7/8] mm, memcg: allow processes handling oom notifications to access reserves

2013-12-10 Thread David Rientjes
On Tue, 10 Dec 2013, Tejun Heo wrote:

> > Indeed.  The setup I'm specifically trying to attack is where the sum of 
> > the limits of all non-oom handling memcgs (A/b in my model, A in yours) 
> > exceed the amount of RAM.  If the system has 256MB,
> > 
> > /=256MB
> > A=126MB A-oom=2MB   B=188MB B-oom=4MB
> > 
> > or
> > 
> > /=256MB
> > C=128MB D=192MB
> > C/a=126MB   D/a=188MB
> > 
> > then it's possible for A + B or C/a + D/a to cause a system oom condition 
> > and meanwhile A-oom/tasks, B-oom/tasks, C/tasks, and D/tasks cannot 
> > allocate memory to handle it.
> 
> "tasks"?  You mean that tasks can't be read reliably once system-OOM
> is hit regardless of memcg configuration?
> 

Not referring to the files themselves, rather the processes listed by 
those files, sorry.  Those processes would not be able to do a ps, ls, or 
anything useful even if they are mlocked into memory because they cannot 
allocate memory in oom conditions.

> > Right, and apologies for not discussing the system oom handling here since 
> > its notification on the root memcg is currently being debated as well.  
> > The idea is that admins and users aren't going to be concerned about 
> > memory allocation through the page allocator vs memory charging through 
> > the memory controller; they simply want memory for their userspace oom 
> > handling.  And since the notification would be tied to the root memcg, it 
> > makes sense to make the amount of memory allowed to allocate exclusively 
> > for these handlers a memcg interface.  So the cleanest solution, in my 
> > opinion, was to add the interface as part of memcg.
> 
> I'm still not quite following the reasoning.  Can you please elaborate
> on what the distinction between "page allocator" and "charges through
> memory controller" has to do with this interface?
> 

The interface would allow both access to memory reserves through the page 
allocator as well as charging above the memcg limit; it is the only way to 
guarantee that memory can be allocated by processes attached to the memcg 
in oom conditions.  We must be able to do both, otherwise no matter what 
overcharge we allow them via memcg, it is still possible for the 
allocation itself to fail in the page allocator before we even get to that 
point.

The confusion here is because the access to memory reserves in the page 
allocator is not presented here because there is another on-going 
discussion about when to notify processes waiting on the root memcg's 
memory.oom_control about system oom conditions.  I can certainly post that 
patch as well, but it wouldn't apply without resolving that side-thread 
first.

The high order bit is that we need to be able to address system oom 
conditions as well as memcg oom conditions in userspace, and system oom 
conditions require us to specify the processes that are allowed access to 
a special memory reserve.  We can't do that with sibling or parent memcgs 
without some new tunable like memory.allow_page_alloc_reserves, but we 
would also have to specify the amount of reserves allowed.  It seemed 
clean and straight-forward to specify this as both the system oom memory 
reserve amount and memcg limit overcharge amount within the same file, 
memory.oom_reserve_in_bytes as this patch does.
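
For illustration, this is roughly how the interface proposed by this
patchset would presumably be used; memory.oom_reserve_in_bytes is not
in mainline, and the path and size below are taken from the example
configuration above, so treat this strictly as a sketch of the
described semantics:

import os

HANDLER_MEMCG = "/sys/fs/cgroup/memory/B-oom"   # one of the handler memcgs above
OOM_RESERVE = 4 * 1024 * 1024                   # matches the 4MB B-oom example

# Per the description above, this single value would act both as the
# amount the memcg may overcharge past its limit and as the amount its
# tasks may allocate below the per-zone min watermarks during a system OOM.
with open(os.path.join(HANDLER_MEMCG, "memory.oom_reserve_in_bytes"), "w") as f:
    f.write(str(OOM_RESERVE))

# Attach the userspace OOM handler; per the patchset its tasks then get
# PF_OOM_HANDLER and the access described above.
with open(os.path.join(HANDLER_MEMCG, "tasks"), "w") as f:
    f.write(str(os.getpid()))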

> > It's really the same thing, though, from the user perspective.  They don't 
> > care about page allocation failure vs memcg charge failure, they simply 
> > want to ensure that the memory set aside for memory.oom_reserve_in_bytes 
> > is available in oom conditions.  With the suggested alternatives:
> > 
> > /=256MB
> > A=126MB A-oom=2MB   B=188MB B-oom=4MB
> > 
> > or
> > 
> > /=256MB
> > C=128MB D=192MB
> > C/a=126MB   D/a=188MB
> > 
> > we can't distinguish between what is able to allocate below per-zone min 
> > watermarks in the page allocator as the oom reserve.  The key point is 
> > that the root memcg is not the only memcg concerned with page allocator 
> > memory reserves, it's any oom reserve.  If A's usage is 124MB and B's 
> > usage is 132MB, we can't specify that processes attached to B-oom should 
> > be able to bypass per-zone min watermarks without an interface such as 
> > that being proposed.
> 
> Okay, are you saying that userland OOM handlers will be able to dip
> into kernel reserve memory?  Maybe I'm mistaken but you realize that
> that reserve is there to make things like task exits work under OOM
> conditions, right?  The only way userland OOM handlers as you describe
> would work would be creating a separate reserve for them.
> 

Yes, PF_OOM_HANDLER processes would be able to allocate this amount as 
specified by memory.oom_reserve_in_bytes below the per-zone watermarks and 
the amount of reserves can already be controlled via min_free_kbytes, 
which we already increase internally for thp.

Re: [patch 7/8] mm, memcg: allow processes handling oom notifications to access reserves

2013-12-10 Thread Tejun Heo
Hey, David.

On Mon, Dec 09, 2013 at 12:10:44PM -0800, David Rientjes wrote:
> Indeed.  The setup I'm specifically trying to attack is where the sum of 
> the limits of all non-oom handling memcgs (A/b in my model, A in yours) 
> exceed the amount of RAM.  If the system has 256MB,
> 
>   /=256MB
>   A=126MB A-oom=2MB   B=188MB B-oom=4MB
> 
> or
> 
>   /=256MB
>   C=128MB D=192MB
>   C/a=126MB   D/a=188MB
> 
> then it's possible for A + B or C/a + D/a to cause a system oom condition 
> and meanwhile A-oom/tasks, B-oom/tasks, C/tasks, and D/tasks cannot 
> allocate memory to handle it.

"tasks"?  You mean that tasks can't be read reliably once system-OOM
is hit regardless of memcg configuration?

> Right, and apologies for not discussing the system oom handling here since 
> its notification on the root memcg is currently being debated as well.  
> The idea is that admins and users aren't going to be concerned about 
> memory allocation through the page allocator vs memory charging through 
> the memory controller; they simply want memory for their userspace oom 
> handling.  And since the notification would be tied to the root memcg, it 
> makes sense to make the amount of memory allowed to allocate exclusively 
> for these handlers a memcg interface.  So the cleanest solution, in my 
> opinion, was to add the interface as part of memcg.

I'm still not quite following the reasoning.  Can you please elaborate
on what the distinction between "page allocator" and "charges through
memory controller" has to do with this interface?

> It's really the same thing, though, from the user perspective.  They don't 
> care about page allocation failure vs memcg charge failure, they simply 
> want to ensure that the memory set aside for memory.oom_reserve_in_bytes 
> is available in oom conditions.  With the suggested alternatives:
> 
>   /=256MB
>   A=126MB A-oom=2MB   B=188MB B-oom=4MB
> 
> or
> 
>   /=256MB
>   C=128MB D=192MB
>   C/a=126MB   D/a=188MB
> 
> we can't distinguish between what is able to allocate below per-zone min 
> watermarks in the page allocator as the oom reserve.  The key point is 
> that the root memcg is not the only memcg concerned with page allocator 
> memory reserves, it's any oom reserve.  If A's usage is 124MB and B's 
> usage is 132MB, we can't specify that processes attached to B-oom should 
> be able to bypass per-zone min watermarks without an interface such as 
> that being proposed.

Okay, are you saying that userland OOM handlers will be able to dip
into kernel reserve memory?  Maybe I'm mistaken but you realize that
that reserve is there to make things like task exits work under OOM
conditions, right?  The only way userland OOM handlers as you describe
would work would be creating a separate reserve for them.

Aren't you basically suggesting two memcg domains - one which is
overcommitted and the other which isn't?  But if you want to do that,
wouldn't that be something which is a natural fit for memcg hierarchy?
Not only that, such hierarchical setup would make sense for other
controllers too - you're really creating two fundamentally different
resource groups.

> It's much more powerful than that; you're referring to the mechanism to 
> guarantee future memory freeing so the system or memcg is no longer oom, 
> and that's only one case of possible handling.  I have a customer who 
> wants to save heap profiles at the time of oom as well, for example, and 
> their sole desire is to be able to capture memory statistics before the 
> oom kill takes place.  The sine qua non is that memory reserves allow 
> something to be done in such conditions: if you try to do a "ps" or "ls" 
> or cat a file in an oom memcg, you hang.  We need better functionality to 
> ensure that we can do some action prior to the oom kill itself, whether 
> that comes from userspace or the kernel.  We simply cannot rely on things 

Well, the gotcha there is that you won't be able to do that with
system level OOM handler either unless you create a separately
reserved memory, which, again, can be achieved using hierarchical
memcg setup already.  Am I missing something here?

> like memory thresholds or vmpressure to grab these heap profiles, there is 
> no guarantee that memory will not be exhausted and the oom kill would 
> already have taken place before the process handling the notification 
> wakes up.  (And any argument that it is possible by simply making the 
> threshold happen early enough is a non-starter: it does not guarantee the 
> heaps are collected for oom conditions and the oom kill can still occur 
> prematurely in machines that overcommit their memcg limits, as we do.)

I don't really follow your "guarantee" argument regarding OOM.  It's
not like we hav

Re: [patch 7/8] mm, memcg: allow processes handling oom notifications to access reserves

2013-12-09 Thread Johannes Weiner
On Mon, Dec 09, 2013 at 12:10:44PM -0800, David Rientjes wrote:
> On Fri, 6 Dec 2013, Tejun Heo wrote:
> 
> > > Tejun, how are you?
> > 
> > Doing pretty good.  How's yourself? :)
> > 
> 
> Not bad, busy with holidays and all that.
> 
> > > I agree that we wouldn't need such support if we are only addressing 
> > > memcg 
> > > oom conditions.  We could do things like A/memory.limit_in_bytes == 128M 
> > > and A/b/memory.limit_in_bytes == 126MB and then attach the process 
> > > waiting 
> > > on A/b/memory.oom_control to A and that would work perfect.
> > 
> > Or even just create a separate parallel cgroup A/memory.limit_in_bytes
> > == 126M A-oom/memory.limit_in_bytes = 2M and avoid the extra layer of
> > nesting.
> > 
> 
> Indeed.  The setup I'm specifically trying to attack is where the sum of 
> the limits of all non-oom handling memcgs (A/b in my model, A in yours) 
> exceed the amount of RAM.  If the system has 256MB,
> 
>   /=256MB
>   A=126MB A-oom=2MB   B=188MB B-oom=4MB
> 
> or
> 
>   /=256MB
>   C=128MB D=192MB
>   C/a=126MB   D/a=188MB
> 
> then it's possible for A + B or C/a + D/a to cause a system oom condition 
> and meanwhile A-oom/tasks, B-oom/tasks, C/tasks, and D/tasks cannot 
> allocate memory to handle it.

So your per-memcg handlers want access to PHYSICAL MEMORY reserves
during system-wide OOM, but this patch implements MEMORY CHARGE
reserves only, which are obviously meaningless during system-wide OOM.

In other words, this is an entirely different usecase than what this
patchset is really about.

You have to sell us on the problem first, then we can discuss a
solution.  Instead, you insist on the solution and keep changing the
problem whenever we find it no longer justifies your proposal.

> > > However, we also need to discuss system oom handling.  We have an 
> > > interest 
> > > in being able to allow userspace to handle system oom conditions since 
> > > the 
> > > policy will differ depending on machine and we can't encode every 
> > > possible 
> > > mechanism into the kernel.  For example, on system oom we want to kill a 
> > > process from the lowest priority top-level memcg.  We lack that ability 
> > > entirely in the kernel and since the sum of our top-level memcgs 
> > > memory.limit_in_bytes exceeds the amount of present RAM, we run into 
> > > these 
> > > oom conditions a _lot_.
> > > 
> > > So the first step, in my opinion, is to add a system oom notification on 
> > > the root memcg's memory.oom_control which currently allows registering an 
> > > eventfd() notification but never actually triggers.  I did that in a 
> > > patch 
> > > and it was merged into -mm but was pulled out for later discussion.
> > 
> > Hmmm... this seems to be a different topic.  You're saying that it'd
> > be beneficial to add userland oom handling at the system level and if
> > that happens having per-memcg oom reserve would be consistent with the
> > system-wide one, right?
> 
> Right, and apologies for not discussing the system oom handling here since 
> its notification on the root memcg is currently being debated as well.  
> The idea is that admins and users aren't going to be concerned about 
> memory allocation through the page allocator vs memory charging through 
> the memory controller; they simply want memory for their userspace oom 
> handling.  And since the notification would be tied to the root memcg, it 
> makes sense to make the amount of memory allowed to allocate exclusively 
> for these handlers a memcg interface.  So the cleanest solution, in my 
> opinion, was to add the interface as part of memcg.
> 
> > While I can see some merit in that argument,
> > the whole thing is predicated on system level userland oom handling
> > being justified && even then I'm not quite sure whether "consistent
> > interface" is enough to have oom reserve in all memory cgroups.  It
> > feels a bit backwards because, here, the root memcg is the exception,
> > not the other way around.  Root is the only one which can't put oom
> > handler in a separate cgroup, so it could make more sense to special
> > case that rather than spreading the interface for global userland oom
> > to everyone else.
> > 
> 
> It's really the same thing, though, from the user perspective.  They don't 
> care about page allocation failure vs memcg charge failure, they simply 
> want to ensure that the memory set aside for memory.oom_reserve_in_bytes 
> is available in oom conditions.  With the suggested alternatives:
> 
>   /=256MB
>   A=126MB A-oom=2MB   B=188MB B-oom=4MB
> 
> or
> 
>   /=256MB
>   C=128MB D=192MB
>   C/a=126MB   D/a=188MB
> 
> we can't distinguish between what is able to allocate below per-zone min 
> watermarks in the page allocator as the oom reserve.

Re: [patch 7/8] mm, memcg: allow processes handling oom notifications to access reserves

2013-12-09 Thread David Rientjes
On Fri, 6 Dec 2013, Tejun Heo wrote:

> > Tejun, how are you?
> 
> Doing pretty good.  How's yourself? :)
> 

Not bad, busy with holidays and all that.

> > I agree that we wouldn't need such support if we are only addressing memcg 
> > oom conditions.  We could do things like A/memory.limit_in_bytes == 128M 
> > and A/b/memory.limit_in_bytes == 126MB and then attach the process waiting 
> > on A/b/memory.oom_control to A and that would work perfect.
> 
> Or even just create a separate parallel cgroup A/memory.limit_in_bytes
> == 126M A-oom/memory.limit_in_bytes = 2M and avoid the extra layer of
> nesting.
> 

Indeed.  The setup I'm specifically trying to attack is where the sum of 
the limits of all non-oom handling memcgs (A/b in my model, A in yours) 
exceed the amount of RAM.  If the system has 256MB,

/=256MB
A=126MB A-oom=2MB   B=188MB B-oom=4MB

or

/=256MB
C=128MB D=192MB
C/a=126MB   D/a=188MB

then it's possible for A + B or C/a + D/a to cause a system oom condition 
and meanwhile A-oom/tasks, B-oom/tasks, C/tasks, and D/tasks cannot 
allocate memory to handle it.

> > However, we also need to discuss system oom handling.  We have an interest 
> > in being able to allow userspace to handle system oom conditions since the 
> > policy will differ depending on machine and we can't encode every possible 
> > mechanism into the kernel.  For example, on system oom we want to kill a 
> > process from the lowest priority top-level memcg.  We lack that ability 
> > entirely in the kernel and since the sum of our top-level memcgs 
> > memory.limit_in_bytes exceeds the amount of present RAM, we run into these 
> > oom conditions a _lot_.
> > 
> > So the first step, in my opinion, is to add a system oom notification on 
> > the root memcg's memory.oom_control which currently allows registering an 
> > eventfd() notification but never actually triggers.  I did that in a patch 
> > and it was merged into -mm but was pulled out for later discussion.
> 
> Hmmm... this seems to be a different topic.  You're saying that it'd
> be beneficial to add userland oom handling at the system level and if
> that happens having per-memcg oom reserve would be consistent with the
> system-wide one, right?

Right, and apologies for not discussing the system oom handling here since 
its notification on the root memcg is currently being debated as well.  
The idea is that admins and users aren't going to be concerned about 
memory allocation through the page allocator vs memory charging through 
the memory controller; they simply want memory for their userspace oom 
handling.  And since the notification would be tied to the root memcg, it 
makes sense to make the amount of memory allowed to allocate exclusively 
for these handlers a memcg interface.  So the cleanest solution, in my 
opinion, was to add the interface as part of memcg.

> While I can see some merit in that argument,
> the whole thing is predicated on system level userland oom handling
> being justified && even then I'm not quite sure whether "consistent
> interface" is enough to have oom reserve in all memory cgroups.  It
> feels a bit backwards because, here, the root memcg is the exception,
> not the other way around.  Root is the only one which can't put oom
> handler in a separate cgroup, so it could make more sense to special
> case that rather than spreading the interface for global userland oom
> to everyone else.
> 

It's really the same thing, though, from the user perspective.  They don't 
care about page allocation failure vs memcg charge failure, they simply 
want to ensure that the memory set aside for memory.oom_reserve_in_bytes 
is available in oom conditions.  With the suggested alternatives:

/=256MB
A=126MB A-oom=2MB   B=188MB B-oom=4MB

or

/=256MB
C=128MB D=192MB
C/a=126MB   D/a=188MB

we can't distinguish between what is able to allocate below per-zone min 
watermarks in the page allocator as the oom reserve.  The key point is 
that the root memcg is not the only memcg concerned with page allocator 
memory reserves, it's any oom reserve.  If A's usage is 124MB and B's 
usage is 132MB, we can't specify that processes attached to B-oom should 
be able to bypass per-zone min watermarks without an interface such as 
that being proposed.

> But, before that, system level userland OOM handling sounds scary to
> me.  I thought about userland OOM handling for memcgs and it does make
> some sense.  ie. there is a different action that userland oom handler
> can take which kernel oom handler can't - it can expand the limit of
> the offending cgroup, effectively using OOM handler as a sizing
> estimator.  I'm not sure whether that in itself is a good idea but
> then again

Re: [patch 7/8] mm, memcg: allow processes handling oom notifications to access reserves

2013-12-07 Thread Johannes Weiner
On Sat, Dec 07, 2013 at 10:12:19AM -0800, Tim Hockin wrote:
> You more or less described the fundamental change - a score per memcg, with
> a recursive OOM killer which evaluates scores between siblings at the same
> level.
> 
> It gets a bit complicated because we need wider scoring ranges than
> are provided by default

If so, I'm sure you can make a convincing case to widen the internal
per-task score ranges.  The per-memcg score ranges have not even been
defined, so this is even easier.

> and because we score PIDs against memcgs at a given scope.

You are describing bits of a solution, not a problem.  And I can't
possibly infer a problem from this.

> We also have some tiebreaker heuristic (age).

Either periodically update the per-memcg score from userspace or
implement this in the kernel.  We have considered CPU usage
history/runtime etc. in the past when picking an OOM victim task.

But I'm again just speculating what your problem is, so this may or
may not be a feasible solution.

> We also have a handful of features that depend on OOM handling like the
> aforementioned automatically growing and changing the actual OOM score
> depending on usage in relation to various thresholds (e.g. we sold you X,
> and we allow you to go over X but if you do, your likelihood of death in
> case of system OOM goes up).

You can trivially monitor threshold events from userspace with the
existing infrastructure and accordingly update the per-memcg score.
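
For illustration, a rough sketch of that approach using today's memcg v1 
event interface.  A per-memcg score file does not exist, so this falls 
back to raising oom_score_adj for the group's tasks once usage crosses a 
threshold; the cgroup path, the 100MB threshold, and the score value are 
assumptions.

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/eventfd.h>
#include <unistd.h>

#define MEMCG "/sys/fs/cgroup/memory/A"

int main(void)
{
        char buf[64];
        uint64_t count;
        int efd = eventfd(0, 0);
        int ufd = open(MEMCG "/memory.usage_in_bytes", O_RDONLY);
        int cfd = open(MEMCG "/cgroup.event_control", O_WRONLY);

        if (efd < 0 || ufd < 0 || cfd < 0)
                return 1;

        /* Fire the eventfd once the group's usage crosses 100MB. */
        snprintf(buf, sizeof(buf), "%d %d %llu", efd, ufd, 100ULL << 20);
        write(cfd, buf, strlen(buf));

        while (read(efd, &count, sizeof(count)) == sizeof(count)) {
                FILE *tasks = fopen(MEMCG "/tasks", "r");
                int pid;

                if (!tasks)
                        continue;
                /* Over the threshold: make this group's tasks preferred victims. */
                while (fscanf(tasks, "%d", &pid) == 1) {
                        FILE *f;

                        snprintf(buf, sizeof(buf), "/proc/%d/oom_score_adj", pid);
                        f = fopen(buf, "w");
                        if (f) {
                                fputs("500", f);
                                fclose(f);
                        }
                }
                fclose(tasks);
        }
        return 0;
}

A real handler would presumably also register a second threshold below the 
first to lower the score again once usage drops.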

> Do you really want us to teach the kernel policies like this?  It would be
> way easier to do and test in userspace.

Maybe.  Providing fragments of your solution is not an efficient way
to communicate the problem.  And you have to sell the problem before
anybody can be expected to even consider your proposal as one of the
possible solutions.


Re: [patch 7/8] mm, memcg: allow processes handling oom notifications to access reserves

2013-12-07 Thread Johannes Weiner
Hello Tim!

On Sat, Dec 07, 2013 at 08:38:20AM -0800, Tim Hockin wrote:
> We actually started with kernel patches along these lines - per-memcg
> scores and all of our crazy policy requirements.
> 
> It turns out that changing policies is hard.
>
> When David offered the opportunity to manage it all in user space it
> sounded like a great idea.
> 
> If this can be made to work as a high prio daemon with access to reserves,
> we would like it.

We can not talk solutions if you won't describe the problem.  It's
understandable that you can't talk about internal details, but it's
possible to describe a technical problem in a portable fashion such
that people can understand and evaluate it without knowing your whole
application.  Companies do this all the time.

"The way our blackbox works makes it really hard to hook it up to the
Linux kernel" is not a very convincing technical argument to change
the Linux kernel.

Thanks!


Re: [patch 7/8] mm, memcg: allow processes handling oom notifications to access reserves

2013-12-06 Thread Tejun Heo
Yo, David.

On Thu, Dec 05, 2013 at 03:49:57PM -0800, David Rientjes wrote:
> Tejun, how are you?

Doing pretty good.  How's yourself? :)

> > Umm.. without delving into details, aren't you basically creating a
> > memory cgroup inside a memory cgroup?  Doesn't sound like a
> > particularly well thought-out plan to me.
> 
> I agree that we wouldn't need such support if we are only addressing memcg 
> oom conditions.  We could do things like A/memory.limit_in_bytes == 128M 
> and A/b/memory.limit_in_bytes == 126MB and then attach the process waiting 
> on A/b/memory.oom_control to A and that would work perfectly.

Or even just create a separate parallel cgroup, A/memory.limit_in_bytes
== 126M and A-oom/memory.limit_in_bytes == 2M, and avoid the extra layer
of nesting.
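
Something like the following is all that layout takes with the existing 
cgroupfs interface; a sketch only, where the mount point and the handler's 
pid are assumptions.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#define MEMCG_ROOT "/sys/fs/cgroup/memory"

static void write_str(const char *path, const char *val)
{
        int fd = open(path, O_WRONLY);

        if (fd >= 0) {
                write(fd, val, strlen(val));
                close(fd);
        }
}

int main(void)
{
        char buf[32];
        pid_t handler_pid = 1234;       /* the oom handler process, assumed */

        /* The workload, and next to it a small cgroup for its oom handler. */
        mkdir(MEMCG_ROOT "/A", 0755);
        mkdir(MEMCG_ROOT "/A-oom", 0755);
        write_str(MEMCG_ROOT "/A/memory.limit_in_bytes", "126M");
        write_str(MEMCG_ROOT "/A-oom/memory.limit_in_bytes", "2M");

        /* Move the handler out of A so A hitting its limit can't starve it. */
        snprintf(buf, sizeof(buf), "%d", (int)handler_pid);
        write_str(MEMCG_ROOT "/A-oom/tasks", buf);
        return 0;
}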

> However, we also need to discuss system oom handling.  We have an interest 
> in being able to allow userspace to handle system oom conditions since the 
> policy will differ depending on machine and we can't encode every possible 
> mechanism into the kernel.  For example, on system oom we want to kill a 
> process from the lowest priority top-level memcg.  We lack that ability 
> entirely in the kernel and since the sum of our top-level memcgs 
> memory.limit_in_bytes exceeds the amount of present RAM, we run into these 
> oom conditions a _lot_.
> 
> So the first step, in my opinion, is to add a system oom notification on 
> the root memcg's memory.oom_control which currently allows registering an 
> eventfd() notification but never actually triggers.  I did that in a patch 
> and it was merged into -mm but was pulled out for later discussion.

Hmmm... this seems to be a different topic.  You're saying that it'd
be beneficial to add userland oom handling at the system level and, if
that happens, having per-memcg oom reserve would be consistent with the
system-wide one, right?  While I can see some merit in that argument,
the whole thing is predicated on system level userland oom handling
being justified && even then I'm not quite sure whether "consistent
interface" is enough to have oom reserve in all memory cgroups.  It
feels a bit backwards because, here, the root memcg is the exception,
not the other way around.  Root is the only one which can't put oom
handler in a separate cgroup, so it could make more sense to special
case that rather than spreading the interface for global userland oom
to everyone else.

But, before that, system level userland OOM handling sounds scary to
me.  I thought about userland OOM handling for memcgs and it does make
some sense.  ie. there is a different action that userland oom handler
can take which kernel oom handler can't - it can expand the limit of
the offending cgroup, effectively using OOM handler as a sizing
estimator.  I'm not sure whether that in itself is a good idea but
then again it might not be possible to clearly separate out sizing
from oom conditions.

Anyways, for system level OOM handling, there's no other action the
userland handler can take.  It's not like the OOM handler paging the
admin to install more memory is a reasonable mode of operation to
support.  The *only* action userland OOM handler can take is killing
something.  Now, if that's the case and we have kernel OOM handler
anyway, I think the best course of action is improving kernel OOM
handler and teach it to make the decisions that the userland handler
would consider good.  That should be doable, right?

The thing is OOM handling in userland is an inherently fragile thing
and it can *never* replace kernel OOM handling.  You may reserve any
amount of memory you want but there would still be cases that it may
fail.  It's not like we have owner-based allocation all through the
kernel or are willing to pay overhead for such a thing.  Even if that
part can be guaranteed somehow (no idea how), the kernel still can
NEVER trust the userland OOM handler.  No matter what we do, we need a
kernel OOM handler with no resource dependency.

So, there isn't anything userland OOM handler can inherently do better
and we can't do away with kernel handler no matter what.  On both
accounts, it seems like the best course of action is making
system-wide kernel OOM handler to make better decisions if possible at
all.  If that's impossible, let's first think about why that's the
case before hastily opening this new can of worms.

Thanks!

-- 
tejun


Re: [patch 7/8] mm, memcg: allow processes handling oom notifications to access reserves

2013-12-06 Thread Johannes Weiner
On Thu, Dec 05, 2013 at 03:49:57PM -0800, David Rientjes wrote:
> On Wed, 4 Dec 2013, Tejun Heo wrote:
> 
> > Hello,
> > 
> 
> Tejun, how are you?
> 
> > Umm.. without delving into details, aren't you basically creating a
> > memory cgroup inside a memory cgroup?  Doesn't sound like a
> > particularly well thought-out plan to me.
> > 
> 
> I agree that we wouldn't need such support if we are only addressing memcg 
> oom conditions.  We could do things like A/memory.limit_in_bytes == 128M 
> and A/b/memory.limit_in_bytes == 126MB and then attach the process waiting 
> > on A/b/memory.oom_control to A and that would work perfectly.
> 
> However, we also need to discuss system oom handling.  We have an interest 
> in being able to allow userspace to handle system oom conditions since the 
> policy will differ depending on machine and we can't encode every possible 
> mechanism into the kernel.  For example, on system oom we want to kill a 
> process from the lowest priority top-level memcg.  We lack that ability 
> entirely in the kernel and since the sum of our top-level memcgs 
> memory.limit_in_bytes exceeds the amount of present RAM, we run into these 
> oom conditions a _lot_.

A simple and natural solution to this is to have the global OOM killer
respect cgroups.  You go through all the effort of carefully grouping
tasks into bigger entities that you then arrange hierarchically.  The
global OOM killer should not just treat all tasks as equal peers.

We can add a per-cgroup OOM priority knob and have the global OOM
handler pick victim tasks from the one or more groups that have the
lowest priority.

Out of the box, every cgroup has the same priority, which means we can
add this feature without changing the default behavior.
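
As a toy model of those semantics, here is roughly what the selection step 
would look like.  The knob name, the "lower value is killed first" 
convention, and the tie-break by usage are all assumptions, not an 
existing kernel interface.

#include <stdio.h>

struct group {
        const char *name;
        int oom_priority;       /* proposed per-cgroup knob, lower dies first */
        unsigned long usage;    /* stand-in for the in-group victim heuristic */
};

static const struct group *pick_victim_group(const struct group *g, int n)
{
        const struct group *victim = NULL;
        int i;

        for (i = 0; i < n; i++) {
                if (!victim ||
                    g[i].oom_priority < victim->oom_priority ||
                    (g[i].oom_priority == victim->oom_priority &&
                     g[i].usage > victim->usage))
                        victim = &g[i];
        }
        return victim;
}

int main(void)
{
        const struct group groups[] = {
                { "batch",       0, 300UL << 20 },
                { "latency",    10, 500UL << 20 },
                { "monitoring", 10,  50UL << 20 },
        };

        /* Prints "kill from batch": the lowest-priority top-level group. */
        printf("kill from %s\n", pick_victim_group(groups, 3)->name);
        return 0;
}

With every group left at its default priority, the priority test never 
separates the groups, so the kernel can keep falling back to its existing 
per-task heuristics and the default behavior stays unchanged.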

> So the first step, in my opinion, is to add a system oom notification on 
> the root memcg's memory.oom_control which currently allows registering an 
> eventfd() notification but never actually triggers.  I did that in a patch 
> and it was merged into -mm but was pulled out for later discussion.
> 
> Then, we need to ensure that the userspace that is registered to handle 
> such events can actually run, and that is difficult to do when the system is oom.  The 
> proposal is to allow such processes, now marked as PF_OOM_HANDLER, to be 
> able to access pre-defined per-zone memory reserves in the page allocator.  
> The only special handling for PF_OOM_HANDLER in the page allocator itself 
> would be under such oom conditions (memcg oom conditions have no problem 
> allocating the memory, only charging it).  The amount of reserves would be 
> defined as memory.oom_reserve_in_bytes from within the root memcg as 
> defined by this patch, i.e. allow this amount of memory to be allocated in 
> the page allocator for PF_OOM_HANDLER below the per-zone min watermarks.
> 
> This, I believe, is the cleanest interface for users who choose to use a 
> non-default policy by setting memory.oom_reserve_in_bytes and constrains 
> all of the code to memcg which you have to configure for such support.
> 
> The system oom condition is not addressed in this patch series, although 
> the PF_OOM_HANDLER bit can be used for that purpose.  I didn't post that 
> patch because the notification on the root memcg's memory.oom_control in 
> such conditions is currently being debated, so we need to solve that issue 
> first.


Re: [patch 7/8] mm, memcg: allow processes handling oom notifications to access reserves

2013-12-05 Thread David Rientjes
On Wed, 4 Dec 2013, Tejun Heo wrote:

> Hello,
> 

Tejun, how are you?

> Umm.. without delving into details, aren't you basically creating a
> memory cgroup inside a memory cgroup?  Doesn't sound like a
> particularly well thought-out plan to me.
> 

I agree that we wouldn't need such support if we are only addressing memcg 
oom conditions.  We could do things like A/memory.limit_in_bytes == 128M 
and A/b/memory.limit_in_bytes == 126MB and then attach the process waiting 
on A/b/memory.oom_control to A and that would work perfectly.

However, we also need to discuss system oom handling.  We have an interest 
in being able to allow userspace to handle system oom conditions since the 
policy will differ depending on machine and we can't encode every possible 
mechanism into the kernel.  For example, on system oom we want to kill a 
process from the lowest priority top-level memcg.  We lack that ability 
entirely in the kernel and since the sum of our top-level memcgs 
memory.limit_in_bytes exceeds the amount of present RAM, we run into these 
oom conditions a _lot_.

So the first step, in my opinion, is to add a system oom notification on 
the root memcg's memory.oom_control which currently allows registering an 
eventfd() notification but never actually triggers.  I did that in a patch 
and it was merged into -mm but was pulled out for later discussion.
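
For reference, the registration itself already works through 
memory.oom_control and cgroup.event_control; the sketch below would run 
unmodified once the root memcg actually delivers the event, with the 
mount point being an assumption.

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/eventfd.h>
#include <unistd.h>

#define ROOT_MEMCG "/sys/fs/cgroup/memory"

int main(void)
{
        char buf[32];
        uint64_t count;
        int efd = eventfd(0, 0);
        int ofd = open(ROOT_MEMCG "/memory.oom_control", O_RDONLY);
        int cfd = open(ROOT_MEMCG "/cgroup.event_control", O_WRONLY);

        if (efd < 0 || ofd < 0 || cfd < 0)
                return 1;

        /* "<event_fd> <oom_control_fd>" registers for oom notifications. */
        snprintf(buf, sizeof(buf), "%d %d", efd, ofd);
        write(cfd, buf, strlen(buf));

        /* Blocks until an oom event is signalled on this memcg. */
        while (read(efd, &count, sizeof(count)) == sizeof(count))
                printf("oom event received\n");
        return 0;
}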

Then, we need to ensure that the userspace that is registered to handle 
such events can actually run, and that is difficult to do when the system is oom.  The 
proposal is to allow such processes, now marked as PF_OOM_HANDLER, to be 
able to access pre-defined per-zone memory reserves in the page allocator.  
The only special handling for PF_OOM_HANDLER in the page allocator itself 
would be under such oom conditions (memcg oom conditions have no problem 
allocating the memory, only charging it).  The amount of reserves would be 
defined as memory.oom_reserve_in_bytes from within the root memcg as 
defined by this patch, i.e. allow this amount of memory to be allocated in 
the page allocator for PF_OOM_HANDLER below the per-zone min watermarks.

This, I believe, is the cleanest interface for users who choose to use a 
non-default policy by setting memory.oom_reserve_in_bytes and constrains 
all of the code to memcg which you have to configure for such support.

The system oom condition is not addressed in this patch series, although 
the PF_OOM_HANDLER bit can be used for that purpose.  I didn't post that 
patch because the notification on the root memcg's memory.oom_control in 
such conditions is currently being debated, so we need to solve that issue 
first.

Your opinions and suggestions are more than helpful, thanks.


Re: [patch 7/8] mm, memcg: allow processes handling oom notifications to access reserves

2013-12-04 Thread Tejun Heo
Hello,

On Wed, Dec 04, 2013 at 05:49:04PM -0800, David Rientjes wrote:
> That's not what this series is addressing, though, and in fact it's quite 
> the opposite.  It acknowledges that userspace oom handlers need to 
> allocate and that anything else would be too difficult to maintain 
> (thereby agreeing with the above), so we must set aside memory that they 
> are exclusively allowed to access.  For the vast majority of users who 
> will not use userspace oom handlers, they can just use the default value 
> of memory.oom_reserve_in_bytes == 0 and they incur absolutely no side-
> effects as a result of this series.

Umm.. without delving into details, aren't you basically creating a
memory cgroup inside a memory cgroup?  Doesn't sound like a
particularly well thought-out plan to me.

> For those who do use userspace oom handlers, like Google, this allows us 
> to set aside memory to allow the userspace oom handlers to kill a process, 
> dump the heap, send a signal, drop caches, etc. when waking up.

Seems kinda obvious.  Put it in a separate cgroup?  You're basically
saying it doesn't want to be under the same memory limit as the
processes that it's looking over.  That's like the definition of being
in a different cgroup.

Thanks.

-- 
tejun


Re: [patch 7/8] mm, memcg: allow processes handling oom notifications to access reserves

2013-12-04 Thread David Rientjes
On Wed, 4 Dec 2013, Johannes Weiner wrote:

> > Now that a per-process flag is available, define it for processes that
> > handle userspace oom notifications.  This is an optimization to avoid
> > maintaining a list of such processes attached to a memcg at any given time
> > and iterating it at charge time.
> > 
> > This flag gets set whenever a process has registered for an oom
> > notification and is cleared whenever it unregisters.
> > 
> > When memcg reclaim has failed to free any memory, it is necessary for
> > userspace oom handlers to be able to dip into reserves to pagefault text,
> > allocate kernel memory to read the "tasks" file, allocate heap, etc.
> 
> The task handling the OOM of a memcg can obviously not be part of that
> same memcg.
> 

Not without memory.oom_reserve_in_bytes that this series adds, that's 
true.  Michal expressed interest in the idea of memcg oom reserves in the 
past, so I thought I'd share the series.

> On Tue, 3 Dec 2013 at 15:35:48 +0800, Li Zefan wrote:
> > On Mon, 2 Dec 2013 at 11:44:06 -0500, Johannes Weiner wrote:
> > > On Fri, Nov 29, 2013 at 03:05:25PM -0500, Tejun Heo wrote:
> > > > Whoa, so we support oom handler inside the memcg that it handles?
> > > > Does that work reliably?  Changing the above detail in this patch
> > > > isn't difficult (and we'll later need to update kernfs too) but
> > > > supporting such setup properly would be a *lot* of commitment and I'm
> > > > very doubtful we'd be able to achieve that by just carefully avoiding
> > > > memory allocation in the operations that userland oom handler uses -
> > > > that set is destined to expand over time, extremely fragile and will
> > > > be hellish to maintain.
> > > > 

It works reliably with this patch series, yes.  I'm not sure what change 
this is referring to that would avoid memory allocation for userspace oom 
handlers, and I'd agree that it would be difficult to maintain a 
no-allocation policy for a subset of processes that are destined to handle 
oom conditions.

That's not what this series is addressing, though, and in fact it's quite 
the opposite.  It acknowledges that userspace oom handlers need to 
allocate and that anything else would be too difficult to maintain 
(thereby agreeing with the above), so we must set aside memory that they 
are exclusively allowed to access.  For the vast majority of users who 
will not use userspace oom handlers, they can just use the default value 
of memory.oom_reserve_in_bytes == 0 and they incur absolutely no side-
effects as a result of this series.

For those who do use userspace oom handlers, like Google, this allows us 
to set aside memory to allow the userspace oom handlers to kill a process, 
dump the heap, send a signal, drop caches, etc. when waking up.
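
A rough sketch of that wake-up path, with the victim choice left to 
whatever policy the handler implements; the victim pid here is an 
assumption, and only kill() and /proc/sys/vm/drop_caches are standard 
mechanisms.

#include <fcntl.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Called after the oom eventfd fires (see the registration sketch above). */
static void handle_oom_event(pid_t victim)
{
        int fd = open("/proc/sys/vm/drop_caches", O_WRONLY);

        /* Reclaim clean page cache and slab caches first. */
        if (fd >= 0) {
                write(fd, "3", 1);
                close(fd);
        }

        /* If the condition persists, kill the victim chosen by our policy. */
        if (victim > 0)
                kill(victim, SIGKILL);
}

int main(int argc, char **argv)
{
        handle_oom_event(argc > 1 ? (pid_t)atoi(argv[1]) : -1);
        return 0;
}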

> > > > So, I'm not at all excited about committing to this guarantee.  This
> > > > one is an easy one but it looks like the first step onto dizzying
> > > > slippery slope.
> > > > 
> > > > Am I misunderstanding something here?  Are you and Johannes firm on
> > > > supporting this?
> > >
> > > Handling a memcg OOM from userspace running inside that OOM memcg is
> > > completely crazy.  I mean, think about this for just two seconds...
> > > Really?
> > >
> > > I get that people are doing it right now, and if you can get away with
> > > it for now, good for you.  But you have to be aware how crazy this is
> > > and if it breaks you get to keep the pieces and we are not going to
> > accommodate this in the kernel.  Fix your crazy userspace.
> > 

The rest of this email communicates only one thing: someone thinks it's 
crazy.  And I agree it would be crazy if we don't allow that class of 
process to have access to a pre-defined amount of memory to handle the 
situation, which this series adds.


Re: [patch 7/8] mm, memcg: allow processes handling oom notifications to access reserves

2013-12-03 Thread Johannes Weiner
On Tue, Dec 03, 2013 at 09:20:17PM -0800, David Rientjes wrote:
> Now that a per-process flag is available, define it for processes that
> handle userspace oom notifications.  This is an optimization to avoid
> maintaining a list of such processes attached to a memcg at any given time
> and iterating it at charge time.
> 
> This flag gets set whenever a process has registered for an oom
> notification and is cleared whenever it unregisters.
> 
> When memcg reclaim has failed to free any memory, it is necessary for
> userspace oom handlers to be able to dip into reserves to pagefault text,
> allocate kernel memory to read the "tasks" file, allocate heap, etc.

The task handling the OOM of a memcg can obviously not be part of that
same memcg.

I've said this many times in the past, but here is the most recent
thread from Tejun, me, and Li on this topic:

---

On Tue, 3 Dec 2013 at 15:35:48 +0800, Li Zefan wrote:
> On Mon, 2 Dec 2013 at 11:44:06 -0500, Johannes Weiner wrote:
> > On Fri, Nov 29, 2013 at 03:05:25PM -0500, Tejun Heo wrote:
> > > Whoa, so we support oom handler inside the memcg that it handles?
> > > Does that work reliably?  Changing the above detail in this patch
> > > isn't difficult (and we'll later need to update kernfs too) but
> > > supporting such setup properly would be a *lot* of commitment and I'm
> > > very doubtful we'd be able to achieve that by just carefully avoiding
> > > memory allocation in the operations that userland oom handler uses -
> > > that set is destined to expand over time, extremely fragile and will
> > > be hellish to maintain.
> > > 
> > > So, I'm not at all excited about committing to this guarantee.  This
> > > one is an easy one but it looks like the first step onto dizzying
> > > slippery slope.
> > > 
> > > Am I misunderstanding something here?  Are you and Johannes firm on
> > > supporting this?
> >
> > Handling a memcg OOM from userspace running inside that OOM memcg is
> > completely crazy.  I mean, think about this for just two seconds...
> > Really?
> >
> > I get that people are doing it right now, and if you can get away with
> > it for now, good for you.  But you have to be aware how crazy this is
> > and if it breaks you get to keep the pieces and we are not going to
> > accommodate this in the kernel.  Fix your crazy userspace.
> 
> +1

---
