Re: [PATCH 0/2] Fix memcg/memory.high in case kmem accounting is enabled

2015-09-04 Thread Tejun Heo
Hello, Vladimir.

On Fri, Sep 04, 2015 at 09:21:11PM +0300, Vladimir Davydov wrote:
> Now I think task_work reclaim initially proposed by Tejun would be a
> much better fix.

Cool, I'll update the patch.

> I'm terribly sorry for being so annoying and stubborn and want to thank
> you for all your feedback!

Heh, I'm not all that confident about my position.  A lot of it could
be from lack of experience and failing to see the gradients.  Please
keep me in check if I get lost.

Thanks a lot!

-- 
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/2] Fix memcg/memory.high in case kmem accounting is enabled

2015-09-04 Thread Vladimir Davydov
Hi Tejun, Michal,

On Fri, Sep 04, 2015 at 11:44:48AM -0400, Tejun Heo wrote:
...
> > I admit I may be mistaken, but if I'm right, we may end up with really
> > complex memcg reclaim logic trying to closely mimic behavior of buddy
> > alloc with all its historic peculiarities. That's why I don't want to
> > rush ahead "fixing" memcg reclaim before an agreement among all
> > interested people is reached...
> 
> I think that's a bit out of proportion.  I'm not suggesting bringing
> in all complexities of global reclaim.  There's no reason to and what
> memcg deals with is inherently way simpler than actual memory
> allocation.  The original patch was about fixing systematic failure
> around GFP_NOWAIT close to the high limit.  We might want to do
> background reclaim close to max but as long as high limit functions
> correctly, that's much less of a problem at least on the v2 interface.

Looking through this thread once again and weighing my arguments against
yours, I'm starting to see that I was wrong and these patches are not
proper fixes for the problem.

Having these patches in the kernel only helps when we are hitting the
hard limit, which shouldn't occur often if memory.high works properly.
Even if memory.high is not used, the only negative effect we would get
w/o them is allocating a slab from a wrong node or getting a low order
page where we could get a high order one. Both should be rare and both
aren't critical. I think I got carried away with all those obscure
"reclaimer peculiarities" at some point.

Now I think task_work reclaim initially proposed by Tejun would be a
much better fix.

I'm terribly sorry for being so annoying and stubborn and want to thank
you for all your feedback!

Thanks,
Vladimir


Re: [PATCH 0/2] Fix memcg/memory.high in case kmem accounting is enabled

2015-09-04 Thread Tejun Heo
Hello, Vladimir.

On Fri, Sep 04, 2015 at 02:15:50PM +0300, Vladimir Davydov wrote:
> Trying a high-order page before falling back on lower order is not
> something really common. It implicitly relies on the fact that
> reclaiming memory for a new continuous high-order page is much more
> expensive than getting the same amount of order-1 pages. This is true
> for buddy alloc, but not for memcg. That's why playing such a trick with
> try_charge is wrong IMO. If such a trick becomes common, I think we will
> have to introduce a helper for it, because otherwise a change in buddy
> alloc internal logic (e.g. a defrag optimization making high order pages
> cheaper) may affect its users.

I'm having trouble following why this matters.  The layering here is
pretty clear regardless of how slab is trespassing into page
allocator's role.  memcg of course doesn't care whether an allocation
is high-order or order-1.  All it does is impose extra restrictions
when allocating memory and all that's necessary is reasonably
satisfying the expectations expressed by the specified gfp mask.

> That said, I totally agree that memcg should handle GFP_NOWAIT, but I'm
> opposed to the idea that it should handle the tricks that rely on
> internal buddy alloc logic similar to those used by SLAB and SLUB. We'd
> better strive to hide these tricks in buddy alloc helpers and never use
> them directly.

All these don't really matter once memcg handles GFP_NOWAIT in a
reasonable manner, right?  memcg doesn't need all the fancy tricks of
the page allocator.  All it needs is honoring the intentions expressed
by the gfp mask in a reasonable way w/o systematic failures.
 
> That's why I think we need these patches and they aren't workarounds
> that can be reverted once try_charge has been taught to handle
> GFP_NOWAIT properly.

So, if this is separate slab improvements, I have no objections but
independent of that, we need to be able to handle back-to-back
GFP_NOWAIT cases and w/ the high limit punting to the return path
should work well enough.
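
The "punting to the return path" idea can be sketched in a few lines. This is a toy userspace model of the behavior under discussion, not kernel code; all names and numbers are made up for illustration:

```python
# Toy model: a GFP_NOWAIT-style charge never reclaims in place; it may
# overshoot memory.high, and the excess is reclaimed later from the
# return-to-userspace path (task_work in kernel terms).

class Memcg:
    def __init__(self, high):
        self.high = high          # memory.high analogue, in pages
        self.usage = 0            # pages currently charged

    def try_charge_nowait(self, nr_pages):
        """Atomic-context charge: never sleep, never reclaim here."""
        self.usage += nr_pages
        # Overshoot is tolerated; reclaim is punted to the return path.
        return True

    def return_to_userspace(self):
        """Stand-in for task_work-driven direct reclaim."""
        if self.usage > self.high:
            self.usage = self.high   # pretend reclaim succeeds fully

memcg = Memcg(high=100)
memcg.try_charge_nowait(80)
memcg.try_charge_nowait(40)      # back-to-back NOWAIT, overshoots high
assert memcg.usage == 120        # the charge succeeded despite the breach
memcg.return_to_userspace()
assert memcg.usage == 100        # excess reclaimed before returning
```

The point of the shape above is that NOWAIT charges never fail systematically near the high limit, yet the limit still converges because every overshoot is repaid before the task reaches userspace.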

> > You said elsewhere that GFP_NOWAIT happening back-to-back is unlikely.
> > I'm not sure how much we can commit to that statement.  GFP_KERNEL
> > allocating huge amount of memory in a single go is a kernel bug.
> > GFP_NOWAIT optimization in a hot path which is accessible to userland
> > isn't and we'll be growing more and more of them.  We need to be
> > protected against back-to-back GFP_NOWAIT allocations.
> 
> AFAIU if someone tries to allocate with GFP_NOWAIT (i.e. w/o
> __GFP_NOFAIL or __GFP_HIGH), he/she must be prepared for allocation
> failures, so there should be a safe fallback path, which fixes things
> in normal context. It doesn't mean we shouldn't do anything to satisfy
> such optimistic requests from memcg, but we may occasionally fail them.

Yes, it can fail under stress or if unlucky; however, it shouldn't
fail consistently under nominal conditions or be able to run over high
limit unchecked.

> OTOH if someone allocates with GFP_KERNEL, he/she should be prepared to
> get NULL, but in this case the whole operation will usually be aborted.
> Therefore with the possibility of all GFP_KERNEL being transformed to
> GFP_NOWAIT inside slab, memcg has to be extra cautious, because failing
> a usual GFP_NOWAIT in such a case may result not in falling back on slow
> path, but in user-visible effects like failing to open a file with
> ENOMEM. This is really difficult to achieve and I doubt it's worth
> complicating memcg code, because we can just fix SLAB/SLUB.

I'm not following you at all here.  slab too of course should fall
back to more robust gfp mask if NOWAIT fails and as long as those
failures are exceptions, it's fine.
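
The fallback pattern being described here is the usual opportunistic-allocation idiom: try a non-sleeping allocation first, fall back to a more robust mode if it fails. A minimal illustrative sketch (the flag names only mimic real gfp flags):

```python
# Opportunistic allocation with a robust fallback, simulated.
GFP_NOWAIT, GFP_KERNEL = "nowait", "kernel"

def sim_alloc(mode, under_pressure):
    """Pretend allocator: NOWAIT may fail under pressure; KERNEL may
    reclaim (sleep) and therefore succeeds in this toy model."""
    if mode == GFP_NOWAIT and under_pressure:
        return None
    return object()

def alloc_with_fallback(under_pressure):
    obj = sim_alloc(GFP_NOWAIT, under_pressure)   # hot path, optimistic
    if obj is None:                               # exceptional, not systematic
        obj = sim_alloc(GFP_KERNEL, under_pressure)
    return obj

assert alloc_with_fallback(under_pressure=False) is not None
assert alloc_with_fallback(under_pressure=True) is not None  # fallback saved it
```

As long as the NOWAIT failures stay exceptional, the slow path absorbs them and nothing is user-visible.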

> Regarding __GFP_NOFAIL and __GFP_HIGH, IMO we can let them go uncharged
> or charge them forcefully even if they breach the limit, because there
> shouldn't be many of them (if there were really a lot of them, they
> could deplete memory reserves and hang the system).
> 
> If all these assumptions are true, we don't need to do anything (apart
> from forcefully charging high prio allocations, maybe) for kmemcg to
> work satisfactorily. For optimizing optimistic GFP_NOWAIT callers one can
> use memory.high instead or along with memory.max. Reclaiming memory.high
> in kernel while holding various locks can result in prio inversions
> though, but that's a different story, which could be fixed by task_work
> reclaim.

GFP_NOWAIT has a systematic problem which needs to be fixed.

> I admit I may be mistaken, but if I'm right, we may end up with really
> complex memcg reclaim logic trying to closely mimic behavior of buddy
> alloc with all its historic peculiarities. That's why I don't want to
> rush ahead "fixing" memcg reclaim before an agreement among all
> interested people is reached...

I think that's a bit out of proportion.  I'm not suggesting bringing
in all complexities of global reclaim.  There's no reason to and what
memcg deals with is inherently way simpler than actual memory
allocation.  The original patch was about fixing systematic failure
around GFP_NOWAIT close to the high limit.  We might want to do
background reclaim close to max but as long as high limit functions
correctly, that's much less of a problem at least on the v2 interface.

Re: [PATCH 0/2] Fix memcg/memory.high in case kmem accounting is enabled

2015-09-04 Thread Michal Hocko
On Wed 02-09-15 12:30:39, Vladimir Davydov wrote:
> [
>   I'll try to summarize my point in one hunk instead of spreading it all
>   over the e-mail, because IMO it's becoming kind of difficult to
>   follow. If you think that there's a question I dodge, please let me
>   know and I'll try to address it separately.
> 
>   Also, adding Johannes to Cc (I noticed that I accidentally left him
>   out), because this discussion seems to be fundamental and may affect
>   our further steps dramatically.
> ]
> 
> On Tue, Sep 01, 2015 at 08:38:50PM +0200, Michal Hocko wrote:
[...]
> > I guess we are still not at the same page here. If the slab has a subtle
> > behavior (and from what you are saying it seems it has the same behavior
> > at the global scope) then we should strive to fix it rather than making
> > it more obscure just to not expose GFP_NOWAIT to memcg which is not
> > handled properly currently wrt. high limit (more on that below) which
> > was the primary motivation for the patch AFAIU.
> 
> Slab is a kind of abnormal alloc_pages user. By calling alloc_pages_node
> with __GFP_THISNODE and w/o __GFP_WAIT before falling back to
> alloc_pages with the caller's context, it does the job normally done by
> alloc_pages itself. It's not a common pattern.
> 
> Leaving slab charge path as is looks really ugly to me. Look, slab
> iterates over all nodes, inspecting if they have free pages and fails
> even if they do due to the memcg constraint...

Yes, I understand what you are saying. The way SLAB does its thing
is really subtle. The special combination of flags even prevents the
background reclaim which is weird. There was probably a good reason for
that but the point I've tried to make is that if the heuristic relies on
non-reclaiming behavior for the global case then the memcg should copy
that as much as possible. The allocator has to be prepared for the
non-sleeping allocation failure and the fact that memcg causes it sooner
is just natural because that is what the memcg is used for.

I see how you try to optimize around this subtle behavior but that only
makes it even more subtle long term.
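
For readers following along, the SLAB behavior being discussed can be modeled roughly like this. It is a toy illustration of "probe every node with a __GFP_THISNODE, non-waiting allocation before falling back to the caller's full context", not actual SLAB code:

```python
# Toy model of SLAB's two-pass page acquisition for growing a cache.

def alloc_from_node(node_free, node, allow_reclaim):
    """Pretend per-node allocator; node_free tracks free pages."""
    if node_free[node] > 0:
        node_free[node] -= 1
        return ("page", node)
    if allow_reclaim:
        return ("page", node)      # pretend reclaim freed a page
    return None                    # THISNODE + no reclaim: just fail

def slab_grow(node_free, preferred):
    # Pass 1: every node, THISNODE-style, without triggering reclaim.
    for node in range(len(node_free)):
        page = alloc_from_node(node_free, node, allow_reclaim=False)
        if page:
            return page
    # Pass 2: fall back to the caller's full context on the preferred node.
    return alloc_from_node(node_free, preferred, allow_reclaim=True)

free = [0, 2, 0]
assert slab_grow(free, preferred=0) == ("page", 1)   # found a node with memory
free = [0, 0, 0]
assert slab_grow(free, preferred=0) == ("page", 0)   # fell back with reclaim
```

The subtlety Michal points at is visible even in the toy: pass 1 deliberately suppresses reclaim, so anything sitting between `alloc_from_node` and the real allocator (memcg in the kernel case) can make it fail earlier than the node state alone would suggest.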

> My point is that what slab does is a pretty low level thing, normal
> users call alloc_pages or kmalloc with flags corresponding to their
> context. Of course, there may be special users trying optimistically
> GFP_NOWAIT, but they aren't numerous, and that simplifies things for
> memcg a lot.

memcg code _absolutely_ has to deal with NOWAIT requests somehow. I can
see more and more of them coming long term. Because it makes a lot of
sense to do an opportunistic allocation with a fallback. And that was
the whole point. You have started by tweaking SL.B whereas memcg is
where we should start seeing the resulting behavior and then think about
an SL.B-specific fix.

> I mean if we can rely on the fact that the number of
> GFP_NOWAIT allocations that can occur in a row is limited we can use
> direct reclaim (like memory.high) and/or task_work reclaim to fix
> GFP_NOWAIT failures. Otherwise, we have to mimic the global alloc with
> most its heuristics. I don't think that copying those heuristics is the
> right thing to do, because in memcg case the same problems may be
> resolved much easier, because we don't actually experience real memory
> shortage when hitting the limit.

I am not really sure I understand what you mean here. What kind of
heuristics do you have in mind? All that memcg code cares about is
keeping the high limit contained and converging as much as possible.
 
> Moreover, we already treat some flags not in the same way as in case of
> slab for simplicity. E.g. we let __GFP_NOFAIL allocations go uncharged
> instead of retrying infinitely.

Yes we rely on the global MM to handle those. Which is a reasonable
compromise IMO. Such a strong liability cannot realistically be handled
inside memcg without causing more problems.

> We ignore __GFP_THISNODE thing and we just cannot take it into account.

Yes, because it is an allocation mode, not a reclaim-related one. There is a
reason it is not part of GFP_RECLAIM_MASK.

> We ignore allocation order, because that makes no sense for memcg.

We are not ignoring it completely because we base our reclaim target on
it.
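
A minimal sketch of what "basing the reclaim target on the order" means. The SWAP_CLUSTER_MAX floor mirrors the kernel's reclaim batching constant, but treat the details as illustrative rather than a quote of the real code:

```python
# The order is not wholly ignored: the number of pages asked from
# reclaim scales with the size of the charge being made room for.

SWAP_CLUSTER_MAX = 32   # kernel's usual minimum reclaim batch

def reclaim_target(order):
    """Pages to ask reclaim for when charging a 2^order-page allocation."""
    return max(1 << order, SWAP_CLUSTER_MAX)

assert reclaim_target(0) == 32   # small charges still reclaim a batch
assert reclaim_target(6) == 64   # larger orders raise the target
```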

> To sum it up. Basically, there are two ways of handling kmemcg charges:
> 
>  1. Make the memcg try_charge mimic alloc_pages behavior.
>  2. Make API functions (kmalloc, etc) work in memcg as if they were
> called from the root cgroup, while keeping interactions between the
> low level subsys (slab) and memcg private.
> 
> Way 1 might look appealing at the first glance, but at the same time it
> is much more complex, because alloc_pages has grown over the years to
> handle a lot of subtle situations that may arise on global memory
> pressure, but impossible in memcg. What does way 1 give us then? We
> can't insert try_charge directly to alloc_pages and have to spread its
> calls all over the code anyway, so why is it better? Easier to use it in
> places where users depend on buddy allocator peculiarities? There are
> not many such users.

Re: [PATCH 0/2] Fix memcg/memory.high in case kmem accounting is enabled

2015-09-04 Thread Vladimir Davydov
On Thu, Sep 03, 2015 at 12:32:43PM -0400, Tejun Heo wrote:
> On Wed, Sep 02, 2015 at 12:30:39PM +0300, Vladimir Davydov wrote:
> ...
> > To sum it up. Basically, there are two ways of handling kmemcg charges:
> > 
> >  1. Make the memcg try_charge mimic alloc_pages behavior.
> >  2. Make API functions (kmalloc, etc) work in memcg as if they were
> > called from the root cgroup, while keeping interactions between the
> > low level subsys (slab) and memcg private.
> > 
> > Way 1 might look appealing at the first glance, but at the same time it
> > is much more complex, because alloc_pages has grown over the years to
> > handle a lot of subtle situations that may arise on global memory
> > pressure, but impossible in memcg. What does way 1 give us then? We
> > can't insert try_charge directly to alloc_pages and have to spread its
> > calls all over the code anyway, so why is it better? Easier to use it in
> > places where users depend on buddy allocator peculiarities? There are
> > not many such users.
> 
> Maybe this is from inexperience but wouldn't 1 also be simpler than
> the global case for the same reasons that doing 2 is simpler?  It's
> not like the fact that memory shortage inside memcg usually doesn't
> mean global shortage goes away depending on whether we take 1 or 2.
> 
> That said, it is true that slab is an integral part of kmemcg and I
> can't see how it can be made oblivious of memcg operations, so yeah
> one way or the other slab has to know the details and we may have to
> do some unusual things at that layer.
> 
> > I understand that the idea of way 1 is to provide a well-defined memcg
> > API independent of the rest of the code, but that's just impossible. You
> > need special casing anyway. E.g. you need those get/put_kmem_cache
> > helpers, which exist solely for SLAB/SLUB. You need all this special
> > stuff for growing per-memcg array in list_lru and kmem_cache, which
> > exists solely for memcg-vs-list_lru and memcg-vs-slab interactions. We
> > even handle kmem_cache destruction on memcg offline differently for SLAB
> > and SLUB for performance reasons.
> 
> It isn't a black or white thing.  Sure, slab should be involved in
> kmemcg but at the same time if we can keep the amount of exposure in
> check, that's the better way to go.
> 
> > Way 2 gives us more space to maneuver IMO. SLAB/SLUB may do weird tricks
> > for optimization, but their API is well defined, so we just make kmalloc
> > work as expected while providing inter-subsys calls, like
> > memcg_charge_slab, for SLAB/SLUB that have their own conventions. You
> > mentioned kmem users that allocate memory using alloc_pages. There is an
> > API function for them too, alloc_kmem_pages. Everything behind the API
> > is hidden and may be done in such a way to achieve optimal performance.
> 
> Ditto.  Nobody is arguing that we can get it out completely but at the
> same time handling of GFP_NOWAIT seems like a pretty fundamental
> property that we'd wanna maintain at the memcg boundary.

Agree, but SLAB/SLUB aren't just calling GFP_NOWAIT. They're doing
pretty low level tricks, which aren't common for the rest of the system.

Inspecting all nodes with __GFP_THISNODE and w/o __GFP_WAIT before
calling the reclaimer is something that can and should be done by the
buddy allocator.
I've never seen anyone doing things like this apart from SLAB (note SLUB
doesn't do this). SLAB does this for historical reasons. We could fix
it, but that would require rewriting SLAB code to a great extent, which
isn't preferable, because we can easily break something.

Trying a high-order page before falling back on lower order is not
something really common. It implicitly relies on the fact that
reclaiming memory for a new continuous high-order page is much more
expensive than getting the same amount of order-1 pages. This is true
for buddy alloc, but not for memcg. That's why playing such a trick with
try_charge is wrong IMO. If such a trick becomes common, I think we will
have to introduce a helper for it, because otherwise a change in buddy
alloc internal logic (e.g. a defrag optimization making high order pages
cheaper) may affect its users.
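
The trick in question, in miniature: opportunistically try the preferred high order without paying for reclaim, then fall back to the cheap minimum order. Purely illustrative (SLUB's slab-page allocation follows a similar shape):

```python
# Toy model of "try high order first, fall back to lower order".

def alloc_pages_sim(order, allow_reclaim, contiguous_available):
    """Pretend buddy allocator: a high-order page is only available
    cheaply if contiguous memory happens to exist."""
    if order > 0 and not contiguous_available and not allow_reclaim:
        return None                # no cheap high-order page around
    return ("pages", order)

def alloc_slab(pref_order, min_order, contiguous_available):
    page = alloc_pages_sim(pref_order, False, contiguous_available)
    if page is None and min_order < pref_order:
        # Don't pay for expensive compaction/reclaim; take a small page.
        page = alloc_pages_sim(min_order, True, contiguous_available)
    return page

assert alloc_slab(3, 0, contiguous_available=True) == ("pages", 3)
assert alloc_slab(3, 0, contiguous_available=False) == ("pages", 0)
```

Vladimir's objection maps onto the toy directly: the first call is only worth skipping reclaim for because high-order pages are expensive for the buddy allocator, an assumption that means nothing to memcg, where an order-3 charge costs the same as eight order-0 ones.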

That said, I totally agree that memcg should handle GFP_NOWAIT, but I'm
opposed to the idea that it should handle the tricks that rely on
internal buddy alloc logic similar to those used by SLAB and SLUB. We'd
better strive to hide these tricks in buddy alloc helpers and never use
them directly.

That's why I think we need these patches and they aren't workarounds
that can be reverted once try_charge has been taught to handle
GFP_NOWAIT properly.

> 
> You said elsewhere that GFP_NOWAIT happening back-to-back is unlikely.
> I'm not sure how much we can commit to that statement.  GFP_KERNEL
> allocating huge amount of memory in a single go is a kernel bug.
> GFP_NOWAIT optimization in a hot path which is accessible to userland
> isn't and we'll be growing more and more of them.  We need to be
> protected against back-to-back GFP_NOWAIT allocations.

Re: [PATCH 0/2] Fix memcg/memory.high in case kmem accounting is enabled

2015-09-04 Thread Vladimir Davydov
On Thu, Sep 03, 2015 at 12:32:43PM -0400, Tejun Heo wrote:
> On Wed, Sep 02, 2015 at 12:30:39PM +0300, Vladimir Davydov wrote:
> ...
> > To sum it up. Basically, there are two ways of handling kmemcg charges:
> > 
> >  1. Make the memcg try_charge mimic alloc_pages behavior.
> >  2. Make API functions (kmalloc, etc) work in memcg as if they were
> > called from the root cgroup, while keeping interactions between the
> > low level subsys (slab) and memcg private.
> > 
> > Way 1 might look appealing at the first glance, but at the same time it
> > is much more complex, because alloc_pages has grown over the years to
> > handle a lot of subtle situations that may arise on global memory
> > pressure, but impossible in memcg. What does way 1 give us then? We
> > can't insert try_charge directly to alloc_pages and have to spread its
> > calls all over the code anyway, so why is it better? Easier to use it in
> > places where users depend on buddy allocator peculiarities? There are
> > not many such users.
> 
> Maybe this is from inexperience but wouldn't 1 also be simpler than
> the global case for the same reasons that doing 2 is simpler?  It's
> not like the fact that memory shortage inside memcg usually doesn't
> mean global shortage goes away depending on whether we take 1 or 2.
> 
> That said, it is true that slab is an integral part of kmemcg and I
> can't see how it can be made oblivious of memcg operations, so yeah
> one way or the other slab has to know the details and we may have to
> do some unusual things at that layer.
> 
> > I understand that the idea of way 1 is to provide a well-defined memcg
> > API independent of the rest of the code, but that's just impossible. You
> > need special casing anyway. E.g. you need those get/put_kmem_cache
> > helpers, which exist solely for SLAB/SLUB. You need all this special
> > stuff for growing per-memcg array in list_lru and kmem_cache, which
> > exists solely for memcg-vs-list_lru and memcg-vs-slab interactions. We
> > even handle kmem_cache destruction on memcg offline differently for SLAB
> > and SLUB for performance reasons.
> 
> It isn't a black or white thing.  Sure, slab should be involved in
> kmemcg but at the same time if we can keep the amount of exposure in
> check, that's the better way to go.
> 
> > Way 2 gives us more space to maneuver IMO. SLAB/SLUB may do weird tricks
> > for optimization, but their API is well defined, so we just make kmalloc
> > work as expected while providing inter-subsys calls, like
> > memcg_charge_slab, for SLAB/SLUB that have their own conventions. You
> > mentioned kmem users that allocate memory using alloc_pages. There is an
> > API function for them too, alloc_kmem_pages. Everything behind the API
> > is hidden and may be done in such a way to achieve optimal performance.
> 
> Ditto.  Nobody is arguing that we can get it out completely but at the
> same time handling of GFP_NOWAIT seems like a pretty fundamental
> proprety that we'd wanna maintain at memcg boundary.

Agree, but SLAB/SLUB aren't just calling GFP_NOWAIT. They're doing
pretty low level tricks, which aren't common for the rest of the system.

Inspecting all nodes with __GFP_THISNODE and w/o __GFP_WAIT before
calling reclaimer is what can and should be done by buddy allocator.
I've never seen anyone doing things like this apart from SLAB (note SLUB
doesn't do this). SLAB does this for historical reasons. We could fix
it, but that would require rewriting SLAB code to a great extent, which
isn't preferable, because we can easily break something.

Trying a high-order page before falling back on lower order is not
something really common. It implicitly relies on the fact that
reclaiming memory for a new continuous high-order page is much more
expensive than getting the same amount of order-1 pages. This is true
for buddy alloc, but not for memcg. That's why playing such a trick with
try_charge is wrong IMO. If such a trick becomes common, I think we will
have to introduce a helper for it, because otherwise a change in buddy
alloc internal logic (e.g. a defrag optimization making high order pages
cheaper) may affect its users.

That said, I totally agree that memcg should handle GFP_NOWAIT, but I'm
opposed to the idea that it should handle the tricks that rely on
internal buddy alloc logic similar to those used by SLAB and SLUB. We'd
better strive to hide these tricks in buddy alloc helpers and never use
them directly.

That's why I think we need these patches and they aren't workarounds
that can be reverted once try_charge has been taught to handle
GFP_NOWAIT properly.

> 
> You said elsewhere that GFP_NOWAIT happening back-to-back is unlikely.
> I'm not sure how much we can commit to that statement.  GFP_KERNEL
> allocating huge amount of memory in a single go is a kernel bug.
> GFP_NOWAIT optimization in a hot path which is accessible to userland
> isn't and we'll be growing more and more of them.  We need to be
> protected against back-to-back 

Re: [PATCH 0/2] Fix memcg/memory.high in case kmem accounting is enabled

2015-09-04 Thread Michal Hocko
On Wed 02-09-15 12:30:39, Vladimir Davydov wrote:
> [
>   I'll try to summarize my point in one hunk instead of spreading it all
>   over the e-mail, because IMO it's becoming a kind of difficult to
>   follow. If you think that there's a question I dodge, please let me
>   now and I'll try to address it separately.
> 
>   Also, adding Johannes to Cc (I noticed that I accidentally left him
>   out), because this discussion seems to be fundamental and may affect
>   our further steps dramatically.
> ]
> 
> On Tue, Sep 01, 2015 at 08:38:50PM +0200, Michal Hocko wrote:
[...]
> > I guess we are still not at the same page here. If the slab has a subtle
> > behavior (and from what you are saying it seems it has the same behavior
> > at the global scope) then we should strive to fix it rather than making
> > it more obscure just to not expose GFP_NOWAIT to memcg which is not
> > handled properly currently wrt. high limit (more on that below) which
> > was the primary motivation for the patch AFAIU.
> 
> Slab is a kind of abnormal alloc_pages user. By calling alloc_pages_node
> with __GFP_THISNODE and w/o __GFP_WAIT before falling back to
> alloc_pages with the caller's context, it does the job normally done by
> alloc_pages itself. It's not what is done massively.
> 
> Leaving slab charge path as is looks really ugly to me. Look, slab
> iterates over all nodes, inspecting if they have free pages and fails
> even if they do due to the memcg constraint...

Yes, I understand what you are saying. The way how SLAB does its thing
is really subtle. The special combination of flags even prevents the
background reclaim which is weird. There was probably a good reason for
that but the point I've tried to make is that if the heuristic relies on
non-reclaiming behavior for the global case then the memcg should copy
that as much as possible. The allocator has to be prepared for the
non-sleeping allocation failure and the fact that memcg causes it sooner
is just natural because that is what the memcg is used for.

I see how you try to optimize around this subtle behavior but that only
makes it even more subtle long term.

> My point is that what slab does is a pretty low level thing, normal
> users call alloc_pages or kmalloc with flags corresponding to their
> context. Of course, there may be special users trying optimistically
> GFP_NOWAIT, but they aren't massive, and that simplifies things for
> memcg a lot.

memcg code _absolutely_ has to deal with NOWAIT requests somehow. I can
see more and more of them coming long term. Because it makes a lot of
sense to do an opportunistic allocation with a fallback. And that was
the whole point. You have started by tweaking SL.B whereas memcg is
where we should start see the resulting behavior and then think about
SL.B specific fix.

> I mean if we can rely on the fact that the number of
> GFP_NOWAIT allocations that can occur in a row is limited we can use
> direct reclaim (like memory.high) and/or task_work reclaim to fix
> GFP_NOWAIT failures. Otherwise, we have to mimic the global alloc with
> most its heuristics. I don't think that copying those heuristics is the
> right thing to do, because in memcg case the same problems may be
> resolved much easier, because we don't actually experience real memory
> shortage when hitting the limit.

I am not really sure I understand what you mean here. What kind of
heuristics you have in mind? All that memcg code cares about is the keep
high limit contained and converge as much as possible.
 
> Moreover, we already treat some flags not in the same way as in case of
> slab for simplicity. E.g. we let __GFP_NOFAIL allocations go uncharged
> instead of retrying infinitely.

Yes we rely on the global MM to handle those. Which is a reasonable
compromise IMO. Such a strong liability cannot realistically be handled
inside memcg without causing more problems.

> We ignore __GFP_THISNODE thing and we just cannot take it into account.

yes because it is allocation and not reclaim related mode. There is a
reason it is not part of GFP_RECLAIM_MASK.

> We ignore allocation order, because that makes no sense for memcg.

We are not ignoring it completely because we base our reclaim target on
it.

> To sum it up. Basically, there are two ways of handling kmemcg charges:
> 
>  1. Make the memcg try_charge mimic alloc_pages behavior.
>  2. Make API functions (kmalloc, etc) work in memcg as if they were
> called from the root cgroup, while keeping interactions between the
> low level subsys (slab) and memcg private.
> 
> Way 1 might look appealing at the first glance, but at the same time it
> is much more complex, because alloc_pages has grown over the years to
> handle a lot of subtle situations that may arise on global memory
> pressure, but impossible in memcg. What does way 1 give us then? We
> can't insert try_charge directly to alloc_pages and have to spread its
> calls all over the code anyway, so why is it better? Easier to use it in
> places 

Re: [PATCH 0/2] Fix memcg/memory.high in case kmem accounting is enabled

2015-09-04 Thread Tejun Heo
Hello, Vladimir.

On Fri, Sep 04, 2015 at 02:15:50PM +0300, Vladimir Davydov wrote:
> Trying a high-order page before falling back on lower order is not
> something really common. It implicitly relies on the fact that
> reclaiming memory for a new continuous high-order page is much more
> expensive than getting the same amount of order-1 pages. This is true
> for buddy alloc, but not for memcg. That's why playing such a trick with
> try_charge is wrong IMO. If such a trick becomes common, I think we will
> have to introduce a helper for it, because otherwise a change in buddy
> alloc internal logic (e.g. a defrag optimization making high order pages
> cheaper) may affect its users.

I'm having trouble following why this matters.  The layering here is
pretty clear regardless of how slab is trespassing into page
allocator's role.  memcg of course doesn't care whether an allocation
is high-order or order-1.  All it does is imposing extra restrictions
when allocating memory and all that's necessary is reasonably
satisfying the expectations expressed by the specified gfp mask.

> That said, I totally agree that memcg should handle GFP_NOWAIT, but I'm
> opposed to the idea that it should handle the tricks that rely on
> internal buddy alloc logic similar to those used by SLAB and SLUB. We'd
> better strive to hide these tricks in buddy alloc helpers and never use
> them directly.

All these don't really matter once memcg handles GFP_NOWAIT in a
reasonable manner, right?  memcg doesn't need all the fancy tricks of
the page allocator.  All it needs is honoring the intentions expressed
by the gfp mask in a reasonable way w/o systematic failures.
 
> That's why I think we need these patches and they aren't workarounds
> that can be reverted once try_charge has been taught to handle
> GFP_NOWAIT properly.

So, if this is separate slab improvements, I have no objections but
independent of that, we need to be able to handle back-to-back
GFP_NOWAIT cases and w/ the high limit punting to the return path
should work well enough.

> > You said elsewhere that GFP_NOWAIT happening back-to-back is unlikely.
> > I'm not sure how much we can commit to that statement.  GFP_KERNEL
> > allocating huge amount of memory in a single go is a kernel bug.
> > GFP_NOWAIT optimization in a hot path which is accessible to userland
> > isn't and we'll be growing more and more of them.  We need to be
> > protected against back-to-back GFP_NOWAIT allocations.
> 
> AFAIU if someone tries to allocate with GFP_NOWAIT (i.e. w/o
> __GFP_NOFAIL or __GFP_HIGH), he/she must be prepared to allocation
> failures, so there should be a safe fall back path, which fixes things
> in normal context. It doesn't mean we shouldn't do anything to satisfy
> such optimistic requests from memcg, but we may occasionally fail them.

Yes, it can fail under stress or if unluckly; however, it shouldn't
fail consistently under nominal conditions or be able to run over high
limit unchecked.

> OTOH if someone allocates with GFP_KERNEL, he/she should be prepared to
> get NULL, but in this case the whole operation will usually be aborted.
> Therefore with the possibility of all GFP_KERNEL being transformed to
> GFP_NOWAIT inside slab, memcg has to be extra cautious, because failing
> a usual GFP_NOWAIT in such a case may result not in falling back on slow
> path, but in user-visible effects like failing to open a file with
> ENOMEM. This is really difficult to achieve and I doubt it's worth
> complicating memcg code, because we can just fix SLAB/SLUB.

I'm not following you at all here.  slab too of course should fall
back to a more robust gfp mask if NOWAIT fails and as long as those
failures are exceptions, it's fine.
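That optimistic-then-fallback pattern can be sketched in a few lines. The helpers below (alloc_fast, alloc_slow, the freelist counter) are invented stand-ins, not real kernel APIs; they only model "try a cheap no-reclaim allocation first, then retry with the caller's full context":

```c
#include <assert.h>
#include <stddef.h>

static int freelist = 0;                 /* pages immediately available */

static void *alloc_fast(void)            /* ~ the GFP_NOWAIT attempt */
{
        static char page[4096];
        if (freelist > 0) {
                freelist--;
                return page;
        }
        return NULL;                     /* may fail; that's fine */
}

static void *alloc_slow(void)            /* ~ the caller's GFP_KERNEL context */
{
        static char page[4096];
        freelist += 8;                   /* stand-in for reclaim */
        freelist--;
        return page;
}

static void *alloc_page_like(void)
{
        void *p = alloc_fast();          /* optimistic, cheap */
        if (!p)
                p = alloc_slow();        /* robust fallback */
        return p;
}
```

As long as the fast path failing is the exception rather than the rule, the occasional trip through the slow path is exactly the intended behavior.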

> Regarding __GFP_NOFAIL and __GFP_HIGH, IMO we can let them go uncharged
> or charge them forcefully even if they breach the limit, because there
> shouldn't be many of them (if there were really a lot of them, they
> could deplete memory reserves and hang the system).
> 
> If all these assumptions are true, we don't need to do anything (apart
> from forcefully charging high prio allocations may be) for kmemcg to
> work satisfactory. For optimizing optimistic GFP_NOWAIT callers one can
> use memory.high instead or along with memory.max. Reclaiming memory.high
> in kernel while holding various locks can result in prio inversions
> though, but that's a different story, which could be fixed by task_work
> reclaim.

GFP_NOWAIT has a systematic problem which needs to be fixed.

> I admit I may be mistaken, but if I'm right, we may end up with really
> complex memcg reclaim logic trying to closely mimic behavior of buddy
> alloc with all its historic peculiarities. That's why I don't want to
> rush ahead "fixing" memcg reclaim before an agreement among all
> interested people is reached...

I think that's a bit out of proportion.  I'm not suggesting bringing
in all complexities of global reclaim.  There's no reason to and what
memcg deals with is inherently way simpler than actual memory
allocation.  The original patch was about fixing systematic failure
around GFP_NOWAIT close to the high limit.  We might want to do
background reclaim close to max but as long as high limit functions
correctly, that's much less of a problem at least on the v2 interface.

Re: [PATCH 0/2] Fix memcg/memory.high in case kmem accounting is enabled

2015-09-04 Thread Vladimir Davydov
Hi Tejun, Michal

On Fri, Sep 04, 2015 at 11:44:48AM -0400, Tejun Heo wrote:
...
> > I admit I may be mistaken, but if I'm right, we may end up with really
> > complex memcg reclaim logic trying to closely mimic behavior of buddy
> > alloc with all its historic peculiarities. That's why I don't want to
> > rush ahead "fixing" memcg reclaim before an agreement among all
> > interested people is reached...
> 
> I think that's a bit out of proportion.  I'm not suggesting bringing
> in all complexities of global reclaim.  There's no reason to and what
> memcg deals with is inherently way simpler than actual memory
> allocation.  The original patch was about fixing systematic failure
> around GFP_NOWAIT close to the high limit.  We might want to do
> background reclaim close to max but as long as high limit functions
> correctly, that's much less of a problem at least on the v2 interface.

Looking through this thread once again and weighing my arguments vs
yours, I start to understand that I'm totally wrong and these patches
are not proper fixes for the problem.

Having these patches in the kernel only helps when we are hitting the
hard limit, which shouldn't occur often if memory.high works properly.
Even if memory.high is not used, the only negative effect we would get
w/o them is allocating a slab from a wrong node or getting a low order
page where we could get a high order one. Both should be rare and both
aren't critical. I think I got carried away with all those obscure
"reclaimer peculiarities" at some point.

Now I think task_work reclaim initially proposed by Tejun would be a
much better fix.

I'm terribly sorry for being so annoying and stubborn and want to thank
you for all your feedback!

Thanks,
Vladimir
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [PATCH 0/2] Fix memcg/memory.high in case kmem accounting is enabled

2015-09-03 Thread Tejun Heo
Hello, Vladimir.

On Wed, Sep 02, 2015 at 12:30:39PM +0300, Vladimir Davydov wrote:
...
> To sum it up. Basically, there are two ways of handling kmemcg charges:
> 
>  1. Make the memcg try_charge mimic alloc_pages behavior.
>  2. Make API functions (kmalloc, etc) work in memcg as if they were
> called from the root cgroup, while keeping interactions between the
> low level subsys (slab) and memcg private.
> 
> Way 1 might look appealing at the first glance, but at the same time it
> is much more complex, because alloc_pages has grown over the years to
> handle a lot of subtle situations that may arise on global memory
> pressure, but impossible in memcg. What does way 1 give us then? We
> can't insert try_charge directly to alloc_pages and have to spread its
> calls all over the code anyway, so why is it better? Easier to use it in
> places where users depend on buddy allocator peculiarities? There are
> not many such users.

Maybe this is from inexperience but wouldn't 1 also be simpler than
the global case for the same reasons that doing 2 is simpler?  It's
not like the fact that memory shortage inside memcg usually doesn't
mean global shortage goes away depending on whether we take 1 or 2.

That said, it is true that slab is an integral part of kmemcg and I
can't see how it can be made oblivious of memcg operations, so yeah
one way or the other slab has to know the details and we may have to
do some unusual things at that layer.

> I understand that the idea of way 1 is to provide a well-defined memcg
> API independent of the rest of the code, but that's just impossible. You
> need special casing anyway. E.g. you need those get/put_kmem_cache
> helpers, which exist solely for SLAB/SLUB. You need all this special
> stuff for growing per-memcg array in list_lru and kmem_cache, which
> exists solely for memcg-vs-list_lru and memcg-vs-slab interactions. We
> even handle kmem_cache destruction on memcg offline differently for SLAB
> and SLUB for performance reasons.

It isn't a black or white thing.  Sure, slab should be involved in
kmemcg but at the same time if we can keep the amount of exposure in
check, that's the better way to go.

> Way 2 gives us more space to maneuver IMO. SLAB/SLUB may do weird tricks
> for optimization, but their API is well defined, so we just make kmalloc
> work as expected while providing inter-subsys calls, like
> memcg_charge_slab, for SLAB/SLUB that have their own conventions. You
> mentioned kmem users that allocate memory using alloc_pages. There is an
> API function for them too, alloc_kmem_pages. Everything behind the API
> is hidden and may be done in such a way to achieve optimal performance.

Ditto.  Nobody is arguing that we can get it out completely but at the
same time handling of GFP_NOWAIT seems like a pretty fundamental
property that we'd wanna maintain at the memcg boundary.

You said elsewhere that GFP_NOWAIT happening back-to-back is unlikely.
I'm not sure how much we can commit to that statement.  GFP_KERNEL
allocating huge amount of memory in a single go is a kernel bug.
GFP_NOWAIT optimization in a hot path which is accessible to userland
isn't and we'll be growing more and more of them.  We need to be
protected against back-to-back GFP_NOWAIT allocations.

Thanks.

-- 
tejun


Re: [PATCH 0/2] Fix memcg/memory.high in case kmem accounting is enabled

2015-09-03 Thread Vladimir Davydov
On Wed, Sep 02, 2015 at 01:16:47PM -0500, Christoph Lameter wrote:
> On Wed, 2 Sep 2015, Vladimir Davydov wrote:
> 
> > Slab is a kind of abnormal alloc_pages user. By calling alloc_pages_node
> > with __GFP_THISNODE and w/o __GFP_WAIT before falling back to
> > alloc_pages with the caller's context, it does the job normally done by
> > alloc_pages itself. It's not what is done massively.
> >
> > Leaving slab charge path as is looks really ugly to me. Look, slab
> > iterates over all nodes, inspecting if they have free pages and fails
> > even if they do due to the memcg constraint...
> 
> Well yes it needs to do that due to the way NUMA support was designed in.
> SLAB needs to check the per node caches if objects are present before
> going to more remote nodes. Sorry about this. I realized the design issue
> in 2006 and SLUB was the result in 2007 of an alternate design to let the
> page allocator do its proper job.

Yeah, SLUB is OK in this respect.
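The SLAB fallback being discussed can be modeled roughly as follows. This is a pure simulation (node_free[] and both helpers are invented; real SLAB consults per-node caches and real gfp flags): each node is first probed with a this-node-only, no-reclaim allocation, and only after the walk fails does the allocator retry with the caller's full context.

```c
#include <assert.h>

#define NR_NODES 4
static int node_free[NR_NODES];        /* free pages per NUMA node */

/* ~ alloc_pages_node(node, (flags | __GFP_THISNODE) & ~__GFP_WAIT) */
static int alloc_on_node_nowait(int node)
{
        if (node_free[node] > 0) {
                node_free[node]--;
                return node;           /* got a page on this node */
        }
        return -1;
}

/* ~ the final alloc_pages() call with the caller's context */
static int alloc_any_node_may_reclaim(void)
{
        node_free[0] += 1;             /* stand-in for reclaim */
        node_free[0]--;
        return 0;
}

/* Walk all nodes optimistically before paying for reclaim. */
static int slab_grow_pick_node(void)
{
        int node;

        for (node = 0; node < NR_NODES; node++) {
                int got = alloc_on_node_nowait(node);
                if (got >= 0)
                        return got;
        }
        return alloc_any_node_may_reclaim();
}
```

The objection earlier in the thread is visible here: if a memcg charge is attempted inside the no-reclaim probe, the walk can fail on every node even when free pages exist, purely because of the memcg constraint.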

> 
> > To sum it up. Basically, there are two ways of handling kmemcg charges:
> >
> >  1. Make the memcg try_charge mimic alloc_pages behavior.
> >  2. Make API functions (kmalloc, etc) work in memcg as if they were
> > called from the root cgroup, while keeping interactions between the
> > low level subsys (slab) and memcg private.
> >
> > Way 1 might look appealing at the first glance, but at the same time it
> > is much more complex, because alloc_pages has grown over the years to
> > handle a lot of subtle situations that may arise on global memory
> > pressure, but impossible in memcg. What does way 1 give us then? We
> > can't insert try_charge directly to alloc_pages and have to spread its
> > calls all over the code anyway, so why is it better? Easier to use it in
> > places where users depend on buddy allocator peculiarities? There are
> > not many such users.
> 
> Would it be possible to have a special alloc_pages_memcg with different
> semantics?
> 
> On the other hand alloc_pages() has grown to handle all the special cases.
> Why can't it also handle the special memcg case? There are numerous other

Because we don't want to place memcg handling in alloc_pages(). AFAIU
this is because memcg by its design works at a higher layer than buddy
alloc. We can't just charge a page on alloc and uncharge it on free.
Sometimes we need to charge a page to a memcg which is different from
the current one, sometimes we need to move a page charge between cgroups
adjusting lru in the meantime (e.g. for handling readahead or swapin).
Placing memcg charging in alloc_pages() would IMO only obscure memcg
logic, because handling of the same page would be spread over subsystems
at different layers. I may be completely wrong though.

> allocators that cache memory in the kernel from networking to
> the bizarre compressed swap approaches. How does memcg handle that? Isn't

Frontswap/zswap entries are accounted to memsw counter like conventional
swap. I don't think we need to charge them to mem, because zswap size is
limited. The user allows to use some RAM as swap transparently to
running processes, so charging them to mem would be unexpected IMO.

Skbs are charged to a different counter, but not charged to kmem for
now. It is to be fixed.

> that situation similar to what the slab allocators do?

I wouldn't say so. Other users just use kmalloc or alloc_pages to grow
their buffers. kmalloc is accounted. For those who work at page
granularity and hence call alloc_pages directly, there is
alloc_kmem_pages helper.

> 
> > exists solely for memcg-vs-list_lru and memcg-vs-slab interactions. We
> > even handle kmem_cache destruction on memcg offline differently for SLAB
> > and SLUB for performance reasons.
> 
> Ugly. Internal allocator design impacts container handling.

The point is that memcg charges pages, while kmalloc works at a finer
level of granularity. As a result, we have two orthogonal strategies for
charging kmalloc:

 1. Teach memcg to charge arbitrarily sized chunks and store info about
memcg near each active object in slab.
 2. Create per memcg copy of each kmem cache (this is the scheme that is
in use currently).

Whichever way we choose, memcg and slab have to cooperate and so slab
internal design impacts memcg handling.
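Strategy 2 above (the one in use) can be sketched as a per-memcg clone lookup. The types and names below are invented and the real kernel code differs substantially (lazy creation, RCU, array resizing); this only shows the shape: each root cache keeps an array of per-memcg copies indexed by a memcg id, and allocations from a cgroup are served, and thus accounted, from that cgroup's own copy.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_MEMCGS 8

struct kmem_cache {
        const char *name;
        struct kmem_cache *memcg_caches[MAX_MEMCGS]; /* per-memcg clones */
};

static struct kmem_cache clones[MAX_MEMCGS];         /* clone storage */

/* Return the cache to allocate from for the given memcg id (0 = root).
 * A real implementation creates clones lazily and handles growing the
 * array as new cgroups appear. */
static struct kmem_cache *cache_for_memcg(struct kmem_cache *root,
                                          int memcg_id)
{
        if (memcg_id == 0)
                return root;
        if (!root->memcg_caches[memcg_id]) {
                clones[memcg_id] = *root;            /* "create" the clone */
                root->memcg_caches[memcg_id] = &clones[memcg_id];
        }
        return root->memcg_caches[memcg_id];
}
```

Because every page in a clone belongs to exactly one memcg, page-granular charging suffices and no per-object memcg pointer is needed, which is the trade-off against strategy 1.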

> 
> > Way 2 gives us more space to maneuver IMO. SLAB/SLUB may do weird tricks
> > for optimization, but their API is well defined, so we just make kmalloc
> > work as expected while providing inter-subsys calls, like
> > memcg_charge_slab, for SLAB/SLUB that have their own conventions. You
> > mentioned kmem users that allocate memory using alloc_pages. There is an
> > API function for them too, alloc_kmem_pages. Everything behind the API
> > is hidden and may be done in such a way to achieve optimal performance.
> 
> Can we also hide cgroups memory handling behind the page based schemes
> without having extra handling for the slab allocators?
> 

I doubt it - see above.

Thanks,
Vladimir

Re: [PATCH 0/2] Fix memcg/memory.high in case kmem accounting is enabled

2015-09-02 Thread Christoph Lameter
On Wed, 2 Sep 2015, Vladimir Davydov wrote:

> Slab is a kind of abnormal alloc_pages user. By calling alloc_pages_node
> with __GFP_THISNODE and w/o __GFP_WAIT before falling back to
> alloc_pages with the caller's context, it does the job normally done by
> alloc_pages itself. It's not what is done massively.
>
> Leaving slab charge path as is looks really ugly to me. Look, slab
> iterates over all nodes, inspecting if they have free pages and fails
> even if they do due to the memcg constraint...

Well yes it needs to do that due to the way NUMA support was designed in.
SLAB needs to check the per node caches if objects are present before
going to more remote nodes. Sorry about this. I realized the design issue
in 2006 and SLUB was the result in 2007 of an alternate design to let the
page allocator do its proper job.

> To sum it up. Basically, there are two ways of handling kmemcg charges:
>
>  1. Make the memcg try_charge mimic alloc_pages behavior.
>  2. Make API functions (kmalloc, etc) work in memcg as if they were
> called from the root cgroup, while keeping interactions between the
> low level subsys (slab) and memcg private.
>
> Way 1 might look appealing at the first glance, but at the same time it
> is much more complex, because alloc_pages has grown over the years to
> handle a lot of subtle situations that may arise on global memory
> pressure, but impossible in memcg. What does way 1 give us then? We
> can't insert try_charge directly to alloc_pages and have to spread its
> calls all over the code anyway, so why is it better? Easier to use it in
> places where users depend on buddy allocator peculiarities? There are
> not many such users.

Would it be possible to have a special alloc_pages_memcg with different
semantics?

On the other hand alloc_pages() has grown to handle all the special cases.
Why can't it also handle the special memcg case? There are numerous other
allocators that cache memory in the kernel from networking to
the bizarre compressed swap approaches. How does memcg handle that? Isn't
that situation similar to what the slab allocators do?

> exists solely for memcg-vs-list_lru and memcg-vs-slab interactions. We
> even handle kmem_cache destruction on memcg offline differently for SLAB
> and SLUB for performance reasons.

Ugly. Internal allocator design impacts container handling.

> Way 2 gives us more space to maneuver IMO. SLAB/SLUB may do weird tricks
> for optimization, but their API is well defined, so we just make kmalloc
> work as expected while providing inter-subsys calls, like
> memcg_charge_slab, for SLAB/SLUB that have their own conventions. You
> mentioned kmem users that allocate memory using alloc_pages. There is an
> API function for them too, alloc_kmem_pages. Everything behind the API
> is hidden and may be done in such a way to achieve optimal performance.

Can we also hide cgroups memory handling behind the page based schemes
without having extra handling for the slab allocators?



Re: [PATCH 0/2] Fix memcg/memory.high in case kmem accounting is enabled

2015-09-02 Thread Vladimir Davydov
[
  I'll try to summarize my point in one hunk instead of spreading it all
  over the e-mail, because IMO it's becoming kind of difficult to
  follow. If you think that there's a question I dodge, please let me
  know and I'll try to address it separately.

  Also, adding Johannes to Cc (I noticed that I accidentally left him
  out), because this discussion seems to be fundamental and may affect
  our further steps dramatically.
]

On Tue, Sep 01, 2015 at 08:38:50PM +0200, Michal Hocko wrote:
> On Tue 01-09-15 19:55:54, Vladimir Davydov wrote:
> > On Tue, Sep 01, 2015 at 05:01:20PM +0200, Michal Hocko wrote:
> > > On Tue 01-09-15 16:40:03, Vladimir Davydov wrote:
> > > > On Tue, Sep 01, 2015 at 02:36:12PM +0200, Michal Hocko wrote:
> [...]
> > > > > How the fallback is implemented and whether trying other node before
> > > > > reclaiming from the preferred one is reasonable I dunno. This is for
> > > > > SLAB to decide. But ignoring GFP_NOWAIT for this path makes the 
> > > > > behavior
> > > > > for memcg enabled setups subtly different. And that is bad.
> > > > 
> > > > Quite the contrary. Trying to charge memcg w/o __GFP_WAIT while
> > > > inspecting if a NUMA node has free pages makes SLAB behave subtly
> > > > differently: SLAB will walk over all NUMA nodes for nothing instead of
> > > > invoking memcg reclaim once a free page is found.
> > > 
> > > So you are saying that the SLAB kmem accounting in this particular path
> > > is suboptimal because the fallback mode doesn't retry local node with
> > > the reclaim enabled before falling back to other nodes?
> > 
> > I'm just pointing out some subtle behavior changes in slab you were
> > opposed to.
> 
> I guess we are still not at the same page here. If the slab has a subtle
> behavior (and from what you are saying it seems it has the same behavior
> at the global scope) then we should strive to fix it rather than making
> it more obscure just to not expose GFP_NOWAIT to memcg which is not
> handled properly currently wrt. high limit (more on that below) which
> was the primary motivation for the patch AFAIU.

Slab is a kind of abnormal alloc_pages user. By calling alloc_pages_node
with __GFP_THISNODE and w/o __GFP_WAIT before falling back to
alloc_pages with the caller's context, it does the job normally done by
alloc_pages itself. It's not what is done massively.

Leaving slab charge path as is looks really ugly to me. Look, slab
iterates over all nodes, inspecting if they have free pages and fails
even if they do due to the memcg constraint...

My point is that what slab does is a pretty low level thing, normal
users call alloc_pages or kmalloc with flags corresponding to their
context. Of course, there may be special users trying optimistically
GFP_NOWAIT, but they aren't massive, and that simplifies things for
memcg a lot. I mean if we can rely on the fact that the number of
GFP_NOWAIT allocations that can occur in a row is limited we can use
direct reclaim (like memory.high) and/or task_work reclaim to fix
GFP_NOWAIT failures. Otherwise, we have to mimic the global alloc with
most its heuristics. I don't think that copying those heuristics is the
right thing to do, because in memcg case the same problems may be
resolved much easier, because we don't actually experience real memory
shortage when hitting the limit.

Moreover, we already treat some flags not in the same way as in case of
slab for simplicity. E.g. we let __GFP_NOFAIL allocations go uncharged
instead of retrying infinitely. We ignore __GFP_THISNODE thing and we
just cannot take it into account. We ignore allocation order, because
that makes no sense for memcg.
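The simplifications just listed can be made concrete with a toy charge function. The flag values and names below are invented (this is not the kernel's try_charge): node hints and allocation order play no role because a page counter only cares about amounts, and NOFAIL-like requests are let through even over the limit rather than retried forever.

```c
#include <assert.h>
#include <stdbool.h>

#define MY_GFP_NOFAIL   (1 << 0)
#define MY_GFP_THISNODE (1 << 1)   /* accepted but deliberately ignored */

static long usage, limit = 100;

/* Charge nr_pages against the memcg counter. Unlike the buddy
 * allocator, there is no per-node state and no order-dependent logic
 * here: only the total amount matters. */
static bool try_charge(long nr_pages, unsigned int flags)
{
        if (usage + nr_pages > limit) {
                if (flags & MY_GFP_NOFAIL)
                        return true;       /* let it through uncharged */
                return false;              /* ordinary request fails  */
        }
        usage += nr_pages;
        return true;
}
```

This is the sense in which the memcg side can stay much simpler than alloc_pages: the heuristics that exist to place and compact physical pages simply have no counterpart in a counter.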

To sum it up. Basically, there are two ways of handling kmemcg charges:

 1. Make the memcg try_charge mimic alloc_pages behavior.
 2. Make API functions (kmalloc, etc) work in memcg as if they were
called from the root cgroup, while keeping interactions between the
low level subsys (slab) and memcg private.

Way 1 might look appealing at the first glance, but at the same time it
is much more complex, because alloc_pages has grown over the years to
handle a lot of subtle situations that may arise on global memory
pressure, but impossible in memcg. What does way 1 give us then? We
can't insert try_charge directly to alloc_pages and have to spread its
calls all over the code anyway, so why is it better? Easier to use it in
places where users depend on buddy allocator peculiarities? There are
not many such users.

I understand that the idea of way 1 is to provide a well-defined memcg
API independent of the rest of the code, but that's just impossible. You
need special casing anyway. E.g. you need those get/put_kmem_cache
helpers, which exist solely for SLAB/SLUB. You need all this special
stuff for growing per-memcg array in list_lru and kmem_cache, which
exists solely for memcg-vs-list_lru and memcg-vs-slab interactions. We
even handle kmem_cache destruction on memcg offline differently for SLAB
and SLUB for performance reasons.

Re: [PATCH 0/2] Fix memcg/memory.high in case kmem accounting is enabled

2015-09-02 Thread Christoph Lameter
On Wed, 2 Sep 2015, Vladimir Davydov wrote:

> Slab is a kind of abnormal alloc_pages user. By calling alloc_pages_node
> with __GFP_THISNODE and w/o __GFP_WAIT before falling back to
> alloc_pages with the caller's context, it does the job normally done by
> alloc_pages itself. It's not what is done massively.
>
> Leaving slab charge path as is looks really ugly to me. Look, slab
> iterates over all nodes, inspecting if they have free pages and fails
> even if they do due to the memcg constraint...

Well yes it needs to do that due to the way NUMA support was designed in.
SLAB needs to check the per node caches if objects are present before
going to more remote nodes. Sorry about this. I realized the design issue
in 2006 and SLUB was the result in 2007 of an alternate design to let the
page allocator do its proper job.

> To sum it up. Basically, there are two ways of handling kmemcg charges:
>
>  1. Make the memcg try_charge mimic alloc_pages behavior.
>  2. Make API functions (kmalloc, etc) work in memcg as if they were
> called from the root cgroup, while keeping interactions between the
> low level subsys (slab) and memcg private.
>
> Way 1 might look appealing at first glance, but at the same time it
> is much more complex, because alloc_pages has grown over the years to
> handle a lot of subtle situations that may arise under global memory
> pressure but are impossible in memcg. What does way 1 give us then? We
> can't insert try_charge directly into alloc_pages and have to spread its
> calls all over the code anyway, so why is it better? Easier to use in
> places where users depend on buddy allocator peculiarities? There are
> not many such users.

Would it be possible to have a special alloc_pages_memcg with different
semantics?

On the other hand, alloc_pages() has grown to handle all the special cases.
Why can't it also handle the special memcg case? There are numerous other
allocators that cache memory in the kernel, from networking to
the bizarre compressed swap approaches. How does memcg handle those? Isn't
that situation similar to what the slab allocators do?

> exists solely for memcg-vs-list_lru and memcg-vs-slab interactions. We
> even handle kmem_cache destruction on memcg offline differently for SLAB
> and SLUB for performance reasons.

Ugly. Internal allocator design impacts container handling.

> Way 2 gives us more space to maneuver IMO. SLAB/SLUB may do weird tricks
> for optimization, but their API is well defined, so we just make kmalloc
> work as expected while providing inter-subsys calls, like
> memcg_charge_slab, for SLAB/SLUB that have their own conventions. You
> mentioned kmem users that allocate memory using alloc_pages. There is an
> API function for them too, alloc_kmem_pages. Everything behind the API
> is hidden and may be done in such a way to achieve optimal performance.

Can we also hide cgroups memory handling behind the page based schemes
without having extra handling for the slab allocators?



Re: [PATCH 0/2] Fix memcg/memory.high in case kmem accounting is enabled

2015-09-02 Thread Vladimir Davydov
[
  I'll try to summarize my point in one hunk instead of spreading it all
  over the e-mail, because IMO it's becoming kind of difficult to
  follow. If you think that there's a question I dodge, please let me
  know and I'll try to address it separately.

  Also, adding Johannes to Cc (I noticed that I accidentally left him
  out), because this discussion seems to be fundamental and may affect
  our further steps dramatically.
]

On Tue, Sep 01, 2015 at 08:38:50PM +0200, Michal Hocko wrote:
> On Tue 01-09-15 19:55:54, Vladimir Davydov wrote:
> > On Tue, Sep 01, 2015 at 05:01:20PM +0200, Michal Hocko wrote:
> > > On Tue 01-09-15 16:40:03, Vladimir Davydov wrote:
> > > > On Tue, Sep 01, 2015 at 02:36:12PM +0200, Michal Hocko wrote:
> [...]
> > > > > How the fallback is implemented and whether trying other node before
> > > > > reclaiming from the preferred one is reasonable I dunno. This is for
> > > > > SLAB to decide. But ignoring GFP_NOWAIT for this path makes the behavior
> > > > > for memcg enabled setups subtly different. And that is bad.
> > > > 
> > > > Quite the contrary. Trying to charge memcg w/o __GFP_WAIT while
> > > > inspecting if a NUMA node has free pages makes SLAB behave subtly
> > > > differently: SLAB will walk over all NUMA nodes for nothing instead of
> > > > invoking memcg reclaim once a free page is found.
> > > 
> > > So you are saying that the SLAB kmem accounting in this particular path
> > > is suboptimal because the fallback mode doesn't retry local node with
> > > the reclaim enabled before falling back to other nodes?
> > 
> > I'm just pointing out some subtle behavior changes in slab you were
> > opposed to.
> 
> I guess we are still not on the same page here. If the slab has a subtle
> behavior (and from what you are saying it seems it has the same behavior
> at the global scope) then we should strive to fix it rather than making
> it more obscure just to not expose GFP_NOWAIT to memcg which is not
> handled properly currently wrt. high limit (more on that below) which
> was the primary motivation for the patch AFAIU.

Slab is a kind of abnormal alloc_pages user. By calling alloc_pages_node
with __GFP_THISNODE and w/o __GFP_WAIT before falling back to
alloc_pages with the caller's context, it does the job normally done by
alloc_pages itself. That is not how allocations are done in general.
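
The two-phase pattern described above (an optimistic __GFP_THISNODE probe
without __GFP_WAIT, then a fallback with the caller's full context) can be
sketched as a userspace toy model; the flag bits, node array, and function
names below are invented for illustration and are not the kernel code:

```c
#include <assert.h>
#include <stdbool.h>

#define GFP_WAIT      0x1   /* may sleep and "reclaim" */
#define GFP_THISNODE  0x2   /* stay on the requested node */

#define NR_NODES 4
static int node_free_pages[NR_NODES] = { 0, 3, 0, 1 };

/* Pretend page allocator: an empty node fails unless GFP_WAIT allows
 * "reclaim", which in this toy model simply refills the node. */
static bool alloc_page_node(int node, unsigned int flags)
{
    if (node_free_pages[node] == 0) {
        if (!(flags & GFP_WAIT))
            return false;           /* optimistic probe fails fast */
        node_free_pages[node] = 8;  /* model direct reclaim succeeding */
    }
    node_free_pages[node]--;
    return true;
}

/* kmalloc_node-like entry: phase 1 probes the preferred node with
 * __GFP_THISNODE and without __GFP_WAIT; phase 2 retries the nodes
 * with the caller's full context. Returns the node used, or -1. */
static int slab_alloc_node(int preferred, unsigned int caller_flags)
{
    if (alloc_page_node(preferred, GFP_THISNODE))
        return preferred;               /* fast path, no reclaim */
    for (int n = 0; n < NR_NODES; n++)
        if (alloc_page_node(n, caller_flags))
            return n;                   /* slow fallback path */
    return -1;
}
```

In the toy model, an empty preferred node makes the optimistic probe fail
fast and only the fallback path may "reclaim"; the probe phase is exactly
where a no-reclaim memcg charge would change the outcome.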

Leaving slab charge path as is looks really ugly to me. Look, slab
iterates over all nodes, inspecting if they have free pages and fails
even if they do due to the memcg constraint...

My point is that what slab does is a pretty low level thing, normal
users call alloc_pages or kmalloc with flags corresponding to their
context. Of course, there may be special users optimistically trying
GFP_NOWAIT, but they are not numerous, and that simplifies things for
memcg a lot. I mean, if we can rely on the fact that the number of
GFP_NOWAIT allocations that can occur in a row is limited we can use
direct reclaim (like memory.high) and/or task_work reclaim to fix
GFP_NOWAIT failures. Otherwise, we have to mimic the global alloc with
most its heuristics. I don't think that copying those heuristics is the
right thing to do, because in memcg case the same problems may be
resolved much easier, because we don't actually experience real memory
shortage when hitting the limit.

Moreover, for simplicity we already treat some flags differently than
in the slab case. E.g. we let __GFP_NOFAIL allocations go uncharged
instead of retrying infinitely. We ignore the __GFP_THISNODE thing,
since we just cannot take it into account. We ignore the allocation
order, because it makes no sense for memcg.
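
A minimal userspace sketch of such a simplified charge path, under the
assumptions just listed (an uncharged __GFP_NOFAIL pass-through, no node
awareness, order only scaling the page count), might look like this;
toy_memcg, try_charge, and the flag bits are illustrative names, not the
kernel API:

```c
#include <assert.h>
#include <stdbool.h>

#define GFP_WAIT    0x1
#define GFP_NOFAIL  0x2

struct toy_memcg {
    long usage;   /* pages currently charged */
    long limit;   /* hard limit in pages */
};

/* Model of reclaim: free up to want pages, return how many were freed. */
static long toy_reclaim(struct toy_memcg *cg, long want)
{
    long freed = want < cg->usage ? want : cg->usage;
    cg->usage -= freed;
    return freed;
}

static bool try_charge(struct toy_memcg *cg, unsigned int flags, long nr_pages)
{
    if (cg->usage + nr_pages <= cg->limit) {
        cg->usage += nr_pages;   /* order only changes nr_pages; no
                                    compaction-style logic is needed */
        return true;
    }
    if (flags & GFP_NOFAIL)
        return true;             /* let it through uncharged */
    if (!(flags & GFP_WAIT))
        return false;            /* no reclaim allowed: fail */
    toy_reclaim(cg, cg->usage + nr_pages - cg->limit);
    if (cg->usage + nr_pages <= cg->limit) {
        cg->usage += nr_pages;
        return true;
    }
    return false;
}
```

Note how the no-fail case succeeds without charging rather than looping,
which is the simplification the text describes.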

To sum it up. Basically, there are two ways of handling kmemcg charges:

 1. Make the memcg try_charge mimic alloc_pages behavior.
 2. Make API functions (kmalloc, etc) work in memcg as if they were
called from the root cgroup, while keeping interactions between the
low level subsys (slab) and memcg private.

Way 1 might look appealing at first glance, but at the same time it
is much more complex, because alloc_pages has grown over the years to
handle a lot of subtle situations that may arise under global memory
pressure but are impossible in memcg. What does way 1 give us then? We
can't insert try_charge directly into alloc_pages and have to spread its
calls all over the code anyway, so why is it better? Easier to use in
places where users depend on buddy allocator peculiarities? There are
not many such users.

I understand that the idea of way 1 is to provide a well-defined memcg
API independent of the rest of the code, but that's just impossible. You
need special casing anyway. E.g. you need those get/put_kmem_cache
helpers, which exist solely for SLAB/SLUB. You need all this special
stuff for growing per-memcg array in list_lru and kmem_cache, which
exists solely for memcg-vs-list_lru and memcg-vs-slab interactions. We
even handle kmem_cache destruction on memcg offline differently for SLAB
and SLUB for performance reasons.

Re: [PATCH 0/2] Fix memcg/memory.high in case kmem accounting is enabled

2015-09-01 Thread Michal Hocko
On Tue 01-09-15 19:55:54, Vladimir Davydov wrote:
> On Tue, Sep 01, 2015 at 05:01:20PM +0200, Michal Hocko wrote:
> > On Tue 01-09-15 16:40:03, Vladimir Davydov wrote:
> > > On Tue, Sep 01, 2015 at 02:36:12PM +0200, Michal Hocko wrote:
[...]
> > > > How the fallback is implemented and whether trying other node before
> > > > reclaiming from the preferred one is reasonable I dunno. This is for
> > > > SLAB to decide. But ignoring GFP_NOWAIT for this path makes the behavior
> > > > for memcg enabled setups subtly different. And that is bad.
> > > 
> > > Quite the contrary. Trying to charge memcg w/o __GFP_WAIT while
> > > inspecting if a NUMA node has free pages makes SLAB behave subtly
> > > differently: SLAB will walk over all NUMA nodes for nothing instead of
> > > invoking memcg reclaim once a free page is found.
> > 
> > So you are saying that the SLAB kmem accounting in this particular path
> > is suboptimal because the fallback mode doesn't retry local node with
> > the reclaim enabled before falling back to other nodes?
> 
> I'm just pointing out some subtle behavior changes in slab you were
> opposed to.

I guess we are still not on the same page here. If the slab has a subtle
behavior (and from what you are saying it seems it has the same behavior
at the global scope) then we should strive to fix it rather than making
it more obscure just to not expose GFP_NOWAIT to memcg which is not
handled properly currently wrt. high limit (more on that below) which
was the primary motivation for the patch AFAIU.

> > I would consider it quite surprising as well even for the global case
> > because __GFP_THISNODE doesn't wake up kswapd to make room on that node.
> > 
> > > You are talking about memcg/kmem accounting as if it were done in the
> > > buddy allocator on top of which the slab layer is built knowing nothing
> > > about memcg accounting on the lower layer. That's not true and that
> > > simply can't be true. Kmem accounting is implemented at the slab layer.
> > > Memcg provides its memcg_charge_slab/uncharge methods solely for
> > > slab core, so it's OK to have some calling conventions between them.
> > > What we are really obliged to do is to preserve behavior of slab's
> > > external API, i.e. kmalloc and friends.
> > 
> > I guess I understand what you are saying here but it sounds like special
> > casing which tries to be clever because the current code understands
> > both the lower level allocator and kmem charge paths to decide how to
> 
> What do you mean by saying "it understands the lower level allocator"?

I mean it requires/abuses special behavior from the page allocator like
__GFP_THISNODE && !wait for the hot path. 

> AFAIK we have memcg callbacks only in special places, like page fault
> handler or kmalloc.

But anybody might opt in to being charged. I can see some other buffers
which are not even accounted for right now being charged in the future.

> > juggle with them. This is imho bad and hard to maintain long term.
> 
> We already juggle. Just grep where and how we insert
> mem_cgroup_try_charge.

We should always preserve the gfp context (at least its reclaim
part). If we are not then it is a bug.
 
> > > > >  2. SLUB. Someone calls kmalloc and there is enough free high order
> > > > > pages. If there is no memcg limit, we will allocate a high order
> > > > > slab page, which is in accordance with SLUB internal logic. With
> > > > > memcg limit set, we are likely to fail to charge high order page
> > > > > (because we currently try to charge high order pages w/o __GFP_WAIT)
> > > > > and fallback on a low order page. The latter is unexpected and
> > > > > unjustified.
> > > > 
> > > > And this case is very similar, and I even argue that it shows more
> > > > brokenness with your patch. The SLUB allocator has _explicitly_ asked
> > > > for an allocation _without_ reclaim because that would be unnecessarily
> > > > too costly and there is other less expensive fallback. But memcg would
> > > 
> > > You are ignoring the fact that, in contrast to alloc_pages, for memcg
> > > there is practically no difference between charging a 4-order page or a
> > > 1-order page.
> > 
> > But this is an implementation detail which might change anytime in
> > the future.
> 
> The fact that memcg reclaim does not invoke the compactor is indeed an
> implementation detail, but how can it change?

Compaction is indeed not something memcg reclaim cares about right now
or will care about in the foreseeable future. I meant something else.
order-1 vs. order-N allocations differ in the reclaim target, which then
controls the potential latency of the reclaim. The fact that order-1 and
order-4 do not really make any difference _right now_ because of the
large SWAP_CLUSTER_MAX is the implementation detail I was referring to.
 
> > > OTOH, using 1-order pages where we could go with 4-order
> > > pages increases page fragmentation at the global level. This subtly
> > > breaks internal SLUB optimization. Once again, kmem accounting is not
> > > something staying aside from slab core, it's a part of slab core.

Re: [PATCH 0/2] Fix memcg/memory.high in case kmem accounting is enabled

2015-09-01 Thread Vladimir Davydov
On Tue, Sep 01, 2015 at 05:01:20PM +0200, Michal Hocko wrote:
> On Tue 01-09-15 16:40:03, Vladimir Davydov wrote:
> > On Tue, Sep 01, 2015 at 02:36:12PM +0200, Michal Hocko wrote:
> > > On Mon 31-08-15 17:20:49, Vladimir Davydov wrote:
> {...}
> > > >  1. SLAB. Suppose someone calls kmalloc_node and there is enough free
> > > > memory on the preferred node. W/o memcg limit set, the allocation
> > > > will happen from the preferred node, which is OK. If there is memcg
> > > > limit, we can currently fail to allocate from the preferred node if
> > > > we are near the limit. We issue memcg reclaim and go to fallback
> > > > alloc then, which will most probably allocate from a different node,
> > > > although there is no reason for that. This is a bug.
> > > 
> > > I am not familiar with the SLAB internals much but how is it different
> > > from the global case. If the preferred node is full then __GFP_THISNODE
> > > request will make it fail early even without giving GFP_NOWAIT
> > > additional access to atomic memory reserves. The fact that memcg case
> > > fails earlier is perfectly expected because the restriction is tighter
> > > than the global case.
> > 
> > memcg restrictions are orthogonal to NUMA: failing an allocation from a
> > particular node does not mean failing memcg charge and vice versa.
> 
> Sure, memcg doesn't care about NUMA; it just puts an additional
> constraint on top of all existing ones. The point I've tried to make is
> that the page allocator (with the node restriction) and memcg (with the
> cumulative amount restriction) currently behave consistently: neither
> of them tries to reclaim in order to achieve its goals. How conservative
> memcg is about allowing GFP_NOWAIT allocations is a separate issue, and
> all those details belong to memcg proper, same as the allocation
> strategy for these allocations belongs to the page allocator.
>  
> > > How the fallback is implemented and whether trying other node before
> > > reclaiming from the preferred one is reasonable I dunno. This is for
> > > SLAB to decide. But ignoring GFP_NOWAIT for this path makes the behavior
> > > for memcg enabled setups subtly different. And that is bad.
> > 
> > Quite the contrary. Trying to charge memcg w/o __GFP_WAIT while
> > inspecting if a NUMA node has free pages makes SLAB behave subtly
> > differently: SLAB will walk over all NUMA nodes for nothing instead of
> > invoking memcg reclaim once a free page is found.
> 
> So you are saying that the SLAB kmem accounting in this particular path
> is suboptimal because the fallback mode doesn't retry local node with
> the reclaim enabled before falling back to other nodes?

I'm just pointing out some subtle behavior changes in slab you were
opposed to.

> I would consider it quite surprising as well even for the global case
> because __GFP_THISNODE doesn't wake up kswapd to make room on that node.
> 
> > You are talking about memcg/kmem accounting as if it were done in the
> > buddy allocator on top of which the slab layer is built knowing nothing
> > about memcg accounting on the lower layer. That's not true and that
> > simply can't be true. Kmem accounting is implemented at the slab layer.
> > Memcg provides its memcg_charge_slab/uncharge methods solely for
> > slab core, so it's OK to have some calling conventions between them.
> > What we are really obliged to do is to preserve behavior of slab's
> > external API, i.e. kmalloc and friends.
> 
> I guess I understand what you are saying here but it sounds like special
> casing which tries to be clever because the current code understands
> both the lower level allocator and kmem charge paths to decide how to

What do you mean by saying "it understands the lower level allocator"?
AFAIK we have memcg callbacks only in special places, like page fault
handler or kmalloc.

> juggle with them. This is imho bad and hard to maintain long term.

We already juggle. Just grep where and how we insert
mem_cgroup_try_charge.

> 
> > > >  2. SLUB. Someone calls kmalloc and there is enough free high order
> > > > pages. If there is no memcg limit, we will allocate a high order
> > > > slab page, which is in accordance with SLUB internal logic. With
> > > > memcg limit set, we are likely to fail to charge high order page
> > > > (because we currently try to charge high order pages w/o __GFP_WAIT)
> > > > and fallback on a low order page. The latter is unexpected and
> > > > unjustified.
> > > 
> > > And this case is very similar, and I even argue that it shows more
> > > brokenness with your patch. The SLUB allocator has _explicitly_ asked
> > > for an allocation _without_ reclaim because that would be unnecessarily
> > > too costly and there is other less expensive fallback. But memcg would
> > 
> > You are ignoring the fact that, in contrast to alloc_pages, for memcg
> > there is practically no difference between charging a 4-order page or a
> > 1-order page.

Re: [PATCH 0/2] Fix memcg/memory.high in case kmem accounting is enabled

2015-09-01 Thread Michal Hocko
On Tue 01-09-15 16:40:03, Vladimir Davydov wrote:
> On Tue, Sep 01, 2015 at 02:36:12PM +0200, Michal Hocko wrote:
> > On Mon 31-08-15 17:20:49, Vladimir Davydov wrote:
{...}
> > >  1. SLAB. Suppose someone calls kmalloc_node and there is enough free
> > > memory on the preferred node. W/o memcg limit set, the allocation
> > > will happen from the preferred node, which is OK. If there is memcg
> > > limit, we can currently fail to allocate from the preferred node if
> > > we are near the limit. We issue memcg reclaim and go to fallback
> > > alloc then, which will most probably allocate from a different node,
> > > although there is no reason for that. This is a bug.
> > 
> > I am not familiar with the SLAB internals much but how is it different
> > from the global case. If the preferred node is full then __GFP_THISNODE
> > request will make it fail early even without giving GFP_NOWAIT
> > additional access to atomic memory reserves. The fact that memcg case
> > fails earlier is perfectly expected because the restriction is tighter
> > than the global case.
> 
> memcg restrictions are orthogonal to NUMA: failing an allocation from a
> particular node does not mean failing memcg charge and vice versa.

Sure, memcg doesn't care about NUMA; it just puts an additional
constraint on top of all existing ones. The point I've tried to make is
that the page allocator (with the node restriction) and memcg (with the
cumulative amount restriction) currently behave consistently: neither
of them tries to reclaim in order to achieve its goals. How conservative
memcg is about allowing GFP_NOWAIT allocations is a separate issue, and
all those details belong to memcg proper, same as the allocation
strategy for these allocations belongs to the page allocator.
 
> > How the fallback is implemented and whether trying other node before
> > reclaiming from the preferred one is reasonable I dunno. This is for
> > SLAB to decide. But ignoring GFP_NOWAIT for this path makes the behavior
> > for memcg enabled setups subtly different. And that is bad.
> 
> Quite the contrary. Trying to charge memcg w/o __GFP_WAIT while
> inspecting if a NUMA node has free pages makes SLAB behave subtly
> differently: SLAB will walk over all NUMA nodes for nothing instead of
> invoking memcg reclaim once a free page is found.

So you are saying that the SLAB kmem accounting in this particular path
is suboptimal because the fallback mode doesn't retry local node with
the reclaim enabled before falling back to other nodes?
I would consider it quite surprising as well even for the global case
because __GFP_THISNODE doesn't wake up kswapd to make room on that node.

> You are talking about memcg/kmem accounting as if it were done in the
> buddy allocator on top of which the slab layer is built knowing nothing
> about memcg accounting on the lower layer. That's not true and that
> simply can't be true. Kmem accounting is implemented at the slab layer.
> Memcg provides its memcg_charge_slab/uncharge methods solely for
> slab core, so it's OK to have some calling conventions between them.
> What we are really obliged to do is to preserve behavior of slab's
> external API, i.e. kmalloc and friends.

I guess I understand what you are saying here but it sounds like special
casing which tries to be clever because the current code understands
both the lower level allocator and kmem charge paths to decide how to
juggle with them. This is imho bad and hard to maintain long term.

> > >  2. SLUB. Someone calls kmalloc and there is enough free high order
> > > pages. If there is no memcg limit, we will allocate a high order
> > > slab page, which is in accordance with SLUB internal logic. With
> > > memcg limit set, we are likely to fail to charge high order page
> > > (because we currently try to charge high order pages w/o __GFP_WAIT)
> > > and fallback on a low order page. The latter is unexpected and
> > > unjustified.
> > 
> > And this case is very similar, and I even argue that it shows more
> > brokenness with your patch. The SLUB allocator has _explicitly_ asked
> > for an allocation _without_ reclaim because that would be unnecessarily
> > too costly and there is other less expensive fallback. But memcg would
> 
> You are ignoring the fact that, in contrast to alloc_pages, for memcg
> there is practically no difference between charging a 4-order page or a
> 1-order page.

But this is an implementation detail which might change anytime in
the future.

> OTOH, using 1-order pages where we could go with 4-order
> pages increases page fragmentation at the global level. This subtly
> breaks internal SLUB optimization. Once again, kmem accounting is not
> something staying aside from slab core, it's a part of slab core.

This is certainly true and it is what you get when you put an additional
constraint on top of an existing one. You simply cannot get both the
great performance _and_ a local memory 

Re: [PATCH 0/2] Fix memcg/memory.high in case kmem accounting is enabled

2015-09-01 Thread Vladimir Davydov
On Tue, Sep 01, 2015 at 02:36:12PM +0200, Michal Hocko wrote:
> On Mon 31-08-15 17:20:49, Vladimir Davydov wrote:
> > On Mon, Aug 31, 2015 at 03:24:15PM +0200, Michal Hocko wrote:
> > > On Sun 30-08-15 22:02:16, Vladimir Davydov wrote:
> > 
> > > > Tejun reported that sometimes memcg/memory.high threshold seems to be
> > > > silently ignored if kmem accounting is enabled:
> > > > 
> > > >   http://www.spinics.net/lists/linux-mm/msg93613.html
> > > > 
> > > > It turned out that both SLAB and SLUB try to allocate without __GFP_WAIT
> > > > first. As a result, if there are enough free pages, memcg reclaim will
> > > > not get invoked on kmem allocations, which will lead to uncontrollable
> > > > growth of memory usage no matter what memory.high is set to.
> > > 
> > > Right but isn't that what the caller explicitly asked for?
> > 
> > No. If the caller of kmalloc() asked for a __GFP_WAIT allocation, we
> > might ignore that and charge memcg w/o __GFP_WAIT.
> 
> I was referring to the slab allocator as the caller. Sorry for not being
> clear about that.
> 
> > > Why should we ignore that for kmem accounting? It seems like a fix at
> > > a wrong layer to me.
> > 
> > Let's forget about memory.high for a minute.
> >
> >  1. SLAB. Suppose someone calls kmalloc_node and there is enough free
> > memory on the preferred node. W/o memcg limit set, the allocation
> > will happen from the preferred node, which is OK. If there is memcg
> > limit, we can currently fail to allocate from the preferred node if
> > we are near the limit. We issue memcg reclaim and go to fallback
> > alloc then, which will most probably allocate from a different node,
> > although there is no reason for that. This is a bug.
> 
> I am not familiar with the SLAB internals much but how is it different
> from the global case. If the preferred node is full then __GFP_THISNODE
> request will make it fail early even without giving GFP_NOWAIT
> additional access to atomic memory reserves. The fact that memcg case
> fails earlier is perfectly expected because the restriction is tighter
> than the global case.

memcg restrictions are orthogonal to NUMA: failing an allocation from a
particular node does not mean failing memcg charge and vice versa.

> 
> How the fallback is implemented and whether trying other node before
> reclaiming from the preferred one is reasonable I dunno. This is for
> SLAB to decide. But ignoring GFP_NOWAIT for this path makes the behavior
> for memcg enabled setups subtly different. And that is bad.

Quite the contrary. Trying to charge memcg w/o __GFP_WAIT while
inspecting if a NUMA node has free pages makes SLAB behave subtly
differently: SLAB will walk over all NUMA nodes for nothing instead of
invoking memcg reclaim once a free page is found.

You are talking about memcg/kmem accounting as if it were done in the
buddy allocator on top of which the slab layer is built knowing nothing
about memcg accounting on the lower layer. That's not true and that
simply can't be true. Kmem accounting is implemented at the slab layer.
Memcg provides its memcg_charge_slab/uncharge methods solely for
slab core, so it's OK to have some calling conventions between them.
What we are really obliged to do is to preserve behavior of slab's
external API, i.e. kmalloc and friends.

> 
>  2. SLUB. Someone calls kmalloc and there are enough free high-order
> > pages. If there is no memcg limit, we will allocate a high order
> > slab page, which is in accordance with SLUB internal logic. With
> > memcg limit set, we are likely to fail to charge high order page
> > (because we currently try to charge high order pages w/o __GFP_WAIT)
> > and fallback on a low order page. The latter is unexpected and
> > unjustified.
> 
> And this case is very similar, and I even argue that it shows more
> brokenness with your patch. The SLUB allocator has _explicitly_ asked
> for an allocation _without_ reclaim because that would be unnecessarily
> too costly and there is other less expensive fallback. But memcg would

You are ignoring the fact that, in contrast to alloc_pages, for memcg
there is practically no difference between charging a 4-order page or a
1-order page. OTOH, using 1-order pages where we could go with 4-order
pages increases page fragmentation at the global level. This subtly
breaks internal SLUB optimization. Once again, kmem accounting is not
something staying aside from slab core, it's a part of slab core.

> be ignoring this with your patch AFAIU and break the optimization. There
> are other cases like that. E.g. THP pages are allocated without GFP_WAIT
> when defrag is disabled.

It might be wrong. If we can't find a contiguous 2MB page, we should
probably give up instead of calling the compactor. For memcg it might be
better to reclaim some space for a 2MB page right now and map a 2MB page
instead of reclaiming space for 512 4KB pages a moment later, because in
memcg case there is absolutely no 

Re: [PATCH 0/2] Fix memcg/memory.high in case kmem accounting is enabled

2015-09-01 Thread Michal Hocko
On Mon 31-08-15 17:20:49, Vladimir Davydov wrote:
> On Mon, Aug 31, 2015 at 03:24:15PM +0200, Michal Hocko wrote:
> > On Sun 30-08-15 22:02:16, Vladimir Davydov wrote:
> 
> > > Tejun reported that sometimes memcg/memory.high threshold seems to be
> > > silently ignored if kmem accounting is enabled:
> > > 
> > >   http://www.spinics.net/lists/linux-mm/msg93613.html
> > > 
> > > It turned out that both SLAB and SLUB try to allocate without __GFP_WAIT
> > > first. As a result, if there is enough free pages, memcg reclaim will
> > > not get invoked on kmem allocations, which will lead to uncontrollable
> > > growth of memory usage no matter what memory.high is set to.
> > 
> > Right but isn't that what the caller explicitly asked for?
> 
> No. If the caller of kmalloc() asked for a __GFP_WAIT allocation, we
> might ignore that and charge memcg w/o __GFP_WAIT.

I was referring to the slab allocator as the caller. Sorry for not being
clear about that.

> > Why should we ignore that for kmem accounting? It seems like a fix at
> > a wrong layer to me.
> 
> Let's forget about memory.high for a minute.
>
>  1. SLAB. Suppose someone calls kmalloc_node and there is enough free
> memory on the preferred node. W/o memcg limit set, the allocation
> will happen from the preferred node, which is OK. If there is memcg
> limit, we can currently fail to allocate from the preferred node if
> we are near the limit. We issue memcg reclaim and go to fallback
> alloc then, which will most probably allocate from a different node,
> although there is no reason for that. This is a bug.

I am not familiar with the SLAB internals much, but how is this different
from the global case? If the preferred node is full then the __GFP_THISNODE
request will make it fail early even without giving GFP_NOWAIT
additional access to atomic memory reserves. The fact that the memcg case
fails earlier is perfectly expected because the restriction is tighter
than in the global case.

How the fallback is implemented and whether trying other node before
reclaiming from the preferred one is reasonable I dunno. This is for
SLAB to decide. But ignoring GFP_NOWAIT for this path makes the behavior
for memcg enabled setups subtly different. And that is bad.

>  2. SLUB. Someone calls kmalloc and there are enough free high-order
> pages. If there is no memcg limit, we will allocate a high order
> slab page, which is in accordance with SLUB internal logic. With
> memcg limit set, we are likely to fail to charge high order page
> (because we currently try to charge high order pages w/o __GFP_WAIT)
> and fallback on a low order page. The latter is unexpected and
> unjustified.

And this case is very similar, and I even argue that it shows more
brokenness with your patch. The SLUB allocator has _explicitly_ asked
for an allocation _without_ reclaim because that would be unnecessarily
too costly and there is other less expensive fallback. But memcg would
be ignoring this with your patch AFAIU and break the optimization. There
are other cases like that. E.g. THP pages are allocated without GFP_WAIT
when defrag is disabled.
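
For illustration, the high-order-optimistic allocation with a low-order
fallback that this exchange is about can be modelled in a few lines of
userspace C; charge, allocate_slab, memcg_room, and the flag bit are
invented stand-ins for the real SLUB/memcg paths:

```c
#include <assert.h>
#include <stdbool.h>

#define GFP_WAIT 0x1

static long memcg_room = 2;  /* pages the toy memcg can still charge */

/* Charge hook: without GFP_WAIT there is nothing to do but fail;
 * with it, pretend reclaim always frees enough room. */
static bool charge(int order, unsigned int flags)
{
    long nr = 1L << order;
    if (memcg_room < nr) {
        if (!(flags & GFP_WAIT))
            return false;
        memcg_room = nr;     /* model successful reclaim */
    }
    memcg_room -= nr;
    return true;
}

/* Returns the order actually used, or -1 on failure. */
static int allocate_slab(int oo_order, int min_order, unsigned int flags)
{
    /* Optimistic attempt: high order, reclaim stripped. */
    if (charge(oo_order, flags & ~GFP_WAIT))
        return oo_order;
    /* Fallback: minimum order with the caller's full context. */
    if (charge(min_order, flags))
        return min_order;
    return -1;
}
```

With only a little room left, the no-reclaim charge of the high-order
page fails and the allocation silently degrades to the minimum order,
which is exactly the behavior being debated.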

> That being said, this is the fix at the right layer.
> 
> > Either we should start failing GFP_NOWAIT charges when we are above
> > high wmark or deploy an additional catchup mechanism as suggested by
> > Tejun.
> 
> The mechanism proposed by Tejun won't help us to avoid allocation
> failures if we are hitting memory.max w/o __GFP_WAIT or __GFP_FS.

Why would that be a problem? The _hard_ limit is reached and reclaim
cannot make any progress. An allocation failure is to be expected.
GFP_NOWAIT will fail normally and GFP_NOFS will attempt to reclaim
before failing.
 
> To fix GFP_NOFS/GFP_NOWAIT failures we just need to start reclaim when
> the gap between limit and usage is getting too small. It may be done
> from a workqueue or from task_work, but currently I don't see any reason
> why complicate things and not just start reclaim directly, just like
> memory.high does.
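
A direct-reclaim-on-a-shrinking-gap scheme along these lines can be
sketched as follows; MIN_GAP, charge_high, and the reclaim model are
invented for the sketch and do not reflect the actual memory.high
implementation:

```c
#include <assert.h>

#define MIN_GAP 4   /* pages of headroom to keep below "high" */

struct toy_memcg {
    long usage;
    long high;   /* soft limit: shrinking headroom triggers reclaim */
};

/* Model of direct reclaim: free up to want pages. */
static void toy_reclaim(struct toy_memcg *cg, long want)
{
    cg->usage -= want < cg->usage ? want : cg->usage;
}

/* Charge unconditionally (like memory.high, this never fails), then
 * pull usage back so at least MIN_GAP pages of headroom remain for
 * subsequent GFP_NOWAIT charges. */
static void charge_high(struct toy_memcg *cg, long nr_pages)
{
    cg->usage += nr_pages;
    if (cg->usage > cg->high - MIN_GAP)
        toy_reclaim(cg, cg->usage - (cg->high - MIN_GAP));
}
```

The point of the sketch is only that reclaim runs as soon as the gap gets
too small, so a later no-reclaim charge still has room to succeed.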

Yes we can do better than we do right now. But that doesn't mean we
should put hacks all over the place and lie about the allocation
context.

> I mean, currently you can protect against GFP_NOWAIT failures by setting
> memory.high to be 1-2 MB lower than memory.max and this *will* work,
> because GFP_NOWAIT/GFP_NOFS allocations can't go on infinitely - they
> will alternate with normal GFP_KERNEL allocations sooner or later. It
> does not mean we should encourage users to set memory.high to protect
> against such failures, because, as pointed out by Tejun, logic behind
> memory.high is currently opaque and can change, but we can introduce
> memcg-internal watermarks that would work exactly as memory.high and
> hence help us against GFP_NOWAIT/GFP_NOFS failures.

I am not against something like watermarks and doing more pro-active
reclaim but this is far from easy to do - which is one of the reason we
do not have it yet. The idea from Tejun about the 

Re: [PATCH 0/2] Fix memcg/memory.high in case kmem accounting is enabled

2015-09-01 Thread Vladimir Davydov
On Mon, Aug 31, 2015 at 03:22:22PM -0500, Christoph Lameter wrote:
> On Mon, 31 Aug 2015, Vladimir Davydov wrote:
> 
> > I totally agree that we should strive to make a kmem user feel roughly
> > the same in memcg as if it were running on a host with equal amount of
> > RAM. There are two ways to achieve that:
> >
> >  1. Make the API functions, i.e. kmalloc and friends, behave inside
> > memcg roughly the same way as they do in the root cgroup.
> >  2. Make the internal memcg functions, i.e. try_charge and friends,
> > behave roughly the same way as alloc_pages.
> >
> > I find way 1 more flexible, because we don't have to blindly follow
> > heuristics used on global memory reclaim and therefore have more
> > opportunities to achieve the same goal.
> 
> The heuristics need to integrate well whether it's in a cgroup or not. In
> general make use of cgroups as transparent as possible to the rest of the
> code.

Half of the kmem accounting implementation resides in SLAB/SLUB. We
can't just make the use of cgroups transparent there. For the rest of
the code, which goes through kmalloc, cgroups are transparent.

Indeed, we could make memcg_charge_slab behave exactly like alloc_pages,
we could even put it into alloc_pages (where it used to be), but why do
that if the only user of memcg_charge_slab is the SLAB/SLUB core?

I think we'd have more room to manoeuvre if we just taught SLAB/SLUB to
use memcg_charge_slab wisely (as it used to until recently), because
memcg charge/reclaim is quite different from global alloc/reclaim:

 - it isn't aware of NUMA nodes, so trying to charge w/o __GFP_WAIT
   while inspecting nodes, as in the case of SLAB, is meaningless

 - it isn't aware of high order page allocations, so trying to charge
   w/o __GFP_WAIT while optimistically trying to get a high order page,
   as in the case of SLUB, is meaningless too

 - it can always let a high prio allocation go unaccounted, so IMO there
   is no point in introducing emergency reserves (__GFP_MEMALLOC
   handling)

 - it can always charge a GFP_NOWAIT allocation even if it exceeds the
   limit, issuing direct reclaim when a GFP_KERNEL allocation comes or
   from task_work, because there is no risk of depleting memory
   reserves; so it isn't obvious to me whether we really need an async
   thread handling memcg reclaim like kswapd
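
The last point can be sketched as a toy userspace model (plain Python
with made-up names, not kernel code): a no-reclaim charge is allowed to
overshoot the limit, and the debt is repaid by direct reclaim on the
next charge that is allowed to wait.

```python
class ToyMemcg:
    """Toy model of the proposed charging scheme; all names are
    hypothetical and nothing here is actual kernel code."""

    def __init__(self, limit):
        self.usage = 0
        self.limit = limit

    def reclaim(self, nr):
        # Pretend reclaim always frees what was asked for.
        freed = min(nr, self.usage)
        self.usage -= freed
        return freed

    def try_charge(self, nr, may_wait):
        # may_wait=False models GFP_NOWAIT, True models GFP_KERNEL.
        if self.usage + nr > self.limit:
            if not may_wait:
                self.usage += nr          # overshoot now, repay later
                return True
            self.reclaim(self.usage + nr - self.limit)
        self.usage += nr
        return True

cg = ToyMemcg(limit=100)
cg.usage = 95
assert cg.try_charge(10, may_wait=False)  # NOWAIT may overshoot...
print(cg.usage)                           # 105, temporarily above limit
assert cg.try_charge(5, may_wait=True)    # ...a waiting charge repays it
print(cg.usage)                           # 100, back at the limit
```

Because the overshoot only inflates a per-cgroup counter, not the
global free-page reserves, the charge can safely succeed where the page
allocator would have to fail.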

Thanks,
Vladimir


Re: [PATCH 0/2] Fix memcg/memory.high in case kmem accounting is enabled

2015-09-01 Thread Michal Hocko
On Tue 01-09-15 19:55:54, Vladimir Davydov wrote:
> On Tue, Sep 01, 2015 at 05:01:20PM +0200, Michal Hocko wrote:
> > On Tue 01-09-15 16:40:03, Vladimir Davydov wrote:
> > > On Tue, Sep 01, 2015 at 02:36:12PM +0200, Michal Hocko wrote:
[...]
> > > > How the fallback is implemented and whether trying other node before
> > > > reclaiming from the preferred one is reasonable I dunno. This is for
> > > > SLAB to decide. But ignoring GFP_NOWAIT for this path makes the behavior
> > > > for memcg enabled setups subtly different. And that is bad.
> > > 
> > > Quite the contrary. Trying to charge memcg w/o __GFP_WAIT while
> > inspecting if a NUMA node has free pages makes SLAB behave subtly
> > differently: SLAB will walk over all NUMA nodes for nothing instead of
> > > invoking memcg reclaim once a free page is found.
> > 
> > So you are saying that the SLAB kmem accounting in this particular path
> > is suboptimal because the fallback mode doesn't retry local node with
> > the reclaim enabled before falling back to other nodes?
> 
> I'm just pointing out some subtle behavior changes in slab you were
> opposed to.

I guess we are still not on the same page here. If the slab allocator
has a subtle behavior (and from what you are saying it seems it behaves
the same at the global scope), then we should strive to fix it rather
than make it more obscure just to avoid exposing GFP_NOWAIT to memcg.
GFP_NOWAIT is currently not handled properly wrt. the high limit (more
on that below), which AFAIU was the primary motivation for the patch.

> > I would consider it quite surprising as well even for the global case
> > because __GFP_THISNODE doesn't wake up kswapd to make room on that node.
> > 
> > > You are talking about memcg/kmem accounting as if it were done in the
> > > buddy allocator on top of which the slab layer is built knowing nothing
> > > about memcg accounting on the lower layer. That's not true and that
> > > simply can't be true. Kmem accounting is implemented at the slab layer.
> > > Memcg provides its memcg_charge_slab/uncharge methods solely for
> > > slab core, so it's OK to have some calling conventions between them.
> > > What we are really obliged to do is to preserve behavior of slab's
> > > external API, i.e. kmalloc and friends.
> > 
> > I guess I understand what you are saying here but it sounds like special
> > casing which tries to be clever because the current code understands
> > both the lower level allocator and kmem charge paths to decide how to
> 
> What do you mean by saying "it understands the lower level allocator"?

I mean it requires/abuses special behavior from the page allocator like
__GFP_THISNODE && !wait for the hot path. 

> AFAIK we have memcg callbacks only in special places, like page fault
> handler or kmalloc.

But anybody might opt in to being charged. I can see some other buffers
which are not even accounted for right now being charged in the future.

> > juggle with them. This is imho bad and hard to maintain long term.
> 
> We already juggle. Just grep where and how we insert
> mem_cgroup_try_charge.

We should always preserve the gfp context (at least its reclaim
part). If we do not, then it is a bug.
 
> > > > >  2. SLUB. Someone calls kmalloc and there is enough free high order
> > > > > pages. If there is no memcg limit, we will allocate a high order
> > > > > slab page, which is in accordance with SLUB internal logic. With
> > > > > memcg limit set, we are likely to fail to charge high order page
> > > > > (because we currently try to charge high order pages w/o 
> > > > > __GFP_WAIT)
> > > > > and fallback on a low order page. The latter is unexpected and
> > > > > unjustified.
> > > > 
> > > > And this case very similar and I even argue that it shows more
> > > > brokenness with your patch. The SLUB allocator has _explicitly_ asked
> > > > for an allocation _without_ reclaim because that would be unnecessarily
> > > > too costly and there is other less expensive fallback. But memcg would
> > > 
> > > You are ignoring the fact that, in contrast to alloc_pages, for memcg
> > > there is practically no difference between charging a 4-order page or a
> > > 1-order page.
> > 
> > But this is an implementation detail which might change anytime in
> > the future.
> 
> The fact that memcg reclaim does not invoke compactor is indeed an
> implementation detail, but how can it change?

Compaction is indeed not something memcg reclaim cares about right now
or will care about in the foreseeable future. I meant something else:
order-1 vs. order-N differ in the reclaim target, which then controls
the potential latency of the reclaim. The fact that order-1 and order-4
do not really make any difference _right now_ because of the large
SWAP_CLUSTER_MAX is the implementation detail I was referring to.
 
> > > OTOH, using 1-order pages where we could go with 4-order
> > > pages increases page fragmentation at the global level. This subtly
> > > breaks internal SLUB optimization. Once 


Re: [PATCH 0/2] Fix memcg/memory.high in case kmem accounting is enabled

2015-09-01 Thread Vladimir Davydov
On Tue, Sep 01, 2015 at 05:01:20PM +0200, Michal Hocko wrote:
> On Tue 01-09-15 16:40:03, Vladimir Davydov wrote:
> > On Tue, Sep 01, 2015 at 02:36:12PM +0200, Michal Hocko wrote:
> > > On Mon 31-08-15 17:20:49, Vladimir Davydov wrote:
> {...}
> > > >  1. SLAB. Suppose someone calls kmalloc_node and there is enough free
> > > > memory on the preferred node. W/o memcg limit set, the allocation
> > > > will happen from the preferred node, which is OK. If there is memcg
> > > > limit, we can currently fail to allocate from the preferred node if
> > > > we are near the limit. We issue memcg reclaim and go to fallback
> > > > alloc then, which will most probably allocate from a different node,
> > > > although there is no reason for that. This is a bug.
> > > 
> > > I am not familiar with the SLAB internals much but how is it different
> > > from the global case. If the preferred node is full then __GFP_THISNODE
> > > request will make it fail early even without giving GFP_NOWAIT
> > > additional access to atomic memory reserves. The fact that memcg case
> > > fails earlier is perfectly expected because the restriction is tighter
> > > than the global case.
> > 
> > memcg restrictions are orthogonal to NUMA: failing an allocation from a
> > particular node does not mean failing memcg charge and vice versa.
> 
> Sure, memcg doesn't care about NUMA; it just puts an additional constraint
> on top of all existing ones. The point I've tried to make is that the
> logic is currently same whether it is page allocator (with the node
> restriction) or memcg (cumulative amount restriction) are behaving
> consistently. Neither of them try to reclaim in order to achieve its
> goals. How conservative is memcg about allowing GFP_NOWAIT allocation
> is a separate issue and all those details belong to memcg proper same
> as the allocation strategy for these allocations belongs to the page
> allocator.
>  
> > > How the fallback is implemented and whether trying other node before
> > > reclaiming from the preferred one is reasonable I dunno. This is for
> > > SLAB to decide. But ignoring GFP_NOWAIT for this path makes the behavior
> > > for memcg enabled setups subtly different. And that is bad.
> > 
> > Quite the contrary. Trying to charge memcg w/o __GFP_WAIT while
> > inspecting if a NUMA node has free pages makes SLAB behave subtly
> > differently: SLAB will walk over all NUMA nodes for nothing instead of
> > invoking memcg reclaim once a free page is found.
> 
> So you are saying that the SLAB kmem accounting in this particular path
> is suboptimal because the fallback mode doesn't retry local node with
> the reclaim enabled before falling back to other nodes?

I'm just pointing out some subtle behavior changes in slab you were
opposed to.

> I would consider it quite surprising as well even for the global case
> because __GFP_THISNODE doesn't wake up kswapd to make room on that node.
> 
> > You are talking about memcg/kmem accounting as if it were done in the
> > buddy allocator on top of which the slab layer is built knowing nothing
> > about memcg accounting on the lower layer. That's not true and that
> > simply can't be true. Kmem accounting is implemented at the slab layer.
> > Memcg provides its memcg_charge_slab/uncharge methods solely for
> > slab core, so it's OK to have some calling conventions between them.
> > What we are really obliged to do is to preserve behavior of slab's
> > external API, i.e. kmalloc and friends.
> 
> I guess I understand what you are saying here but it sounds like special
> casing which tries to be clever because the current code understands
> both the lower level allocator and kmem charge paths to decide how to

What do you mean by saying "it understands the lower level allocator"?
AFAIK we have memcg callbacks only in special places, like page fault
handler or kmalloc.

> juggle with them. This is imho bad and hard to maintain long term.

We already juggle. Just grep where and how we insert
mem_cgroup_try_charge.

> 
> > > >  2. SLUB. Someone calls kmalloc and there is enough free high order
> > > > pages. If there is no memcg limit, we will allocate a high order
> > > > slab page, which is in accordance with SLUB internal logic. With
> > > > memcg limit set, we are likely to fail to charge high order page
> > > > (because we currently try to charge high order pages w/o __GFP_WAIT)
> > > > and fallback on a low order page. The latter is unexpected and
> > > > unjustified.
> > > 
> > > And this case very similar and I even argue that it shows more
> > > brokenness with your patch. The SLUB allocator has _explicitly_ asked
> > > for an allocation _without_ reclaim because that would be unnecessarily
> > > too costly and there is other less expensive fallback. But memcg would
> > 
> > You are ignoring the fact that, in contrast to alloc_pages, for memcg
> > there is practically no difference between charging a 4-order page or a
> > 1-order 

Re: [PATCH 0/2] Fix memcg/memory.high in case kmem accounting is enabled

2015-09-01 Thread Michal Hocko
On Tue 01-09-15 16:40:03, Vladimir Davydov wrote:
> On Tue, Sep 01, 2015 at 02:36:12PM +0200, Michal Hocko wrote:
> > On Mon 31-08-15 17:20:49, Vladimir Davydov wrote:
{...}
> > >  1. SLAB. Suppose someone calls kmalloc_node and there is enough free
> > > memory on the preferred node. W/o memcg limit set, the allocation
> > > will happen from the preferred node, which is OK. If there is memcg
> > > limit, we can currently fail to allocate from the preferred node if
> > > we are near the limit. We issue memcg reclaim and go to fallback
> > > alloc then, which will most probably allocate from a different node,
> > > although there is no reason for that. This is a bug.
> > 
> > I am not familiar with the SLAB internals much but how is it different
> > from the global case. If the preferred node is full then __GFP_THISNODE
> > request will make it fail early even without giving GFP_NOWAIT
> > additional access to atomic memory reserves. The fact that memcg case
> > fails earlier is perfectly expected because the restriction is tighter
> > than the global case.
> 
> memcg restrictions are orthogonal to NUMA: failing an allocation from a
> particular node does not mean failing memcg charge and vice versa.

Sure, memcg doesn't care about NUMA; it just puts an additional
constraint on top of all the existing ones. The point I've tried to make
is that the logic is currently the same: both the page allocator (with
the node restriction) and memcg (with the cumulative amount restriction)
behave consistently. Neither of them tries to reclaim in order to
achieve its goals. How conservative memcg is about allowing GFP_NOWAIT
allocations is a separate issue, and all those details belong to memcg
proper, the same as the allocation strategy for these allocations
belongs to the page allocator.
 
> > How the fallback is implemented and whether trying other node before
> > reclaiming from the preferred one is reasonable I dunno. This is for
> > SLAB to decide. But ignoring GFP_NOWAIT for this path makes the behavior
> > for memcg enabled setups subtly different. And that is bad.
> 
> Quite the contrary. Trying to charge memcg w/o __GFP_WAIT while
> inspecting if a NUMA node has free pages makes SLAB behave subtly
> differently: SLAB will walk over all NUMA nodes for nothing instead of
> invoking memcg reclaim once a free page is found.

So you are saying that the SLAB kmem accounting in this particular path
is suboptimal because the fallback mode doesn't retry local node with
the reclaim enabled before falling back to other nodes?
I would consider it quite surprising as well even for the global case
because __GFP_THISNODE doesn't wake up kswapd to make room on that node.

> You are talking about memcg/kmem accounting as if it were done in the
> buddy allocator on top of which the slab layer is built knowing nothing
> about memcg accounting on the lower layer. That's not true and that
> simply can't be true. Kmem accounting is implemented at the slab layer.
> Memcg provides its memcg_charge_slab/uncharge methods solely for
> slab core, so it's OK to have some calling conventions between them.
> What we are really obliged to do is to preserve behavior of slab's
> external API, i.e. kmalloc and friends.

I guess I understand what you are saying here but it sounds like special
casing which tries to be clever because the current code understands
both the lower level allocator and kmem charge paths to decide how to
juggle with them. This is imho bad and hard to maintain long term.

> > >  2. SLUB. Someone calls kmalloc and there is enough free high order
> > > pages. If there is no memcg limit, we will allocate a high order
> > > slab page, which is in accordance with SLUB internal logic. With
> > > memcg limit set, we are likely to fail to charge high order page
> > > (because we currently try to charge high order pages w/o __GFP_WAIT)
> > > and fallback on a low order page. The latter is unexpected and
> > > unjustified.
> > 
> > And this case very similar and I even argue that it shows more
> > brokenness with your patch. The SLUB allocator has _explicitly_ asked
> > for an allocation _without_ reclaim because that would be unnecessarily
> > too costly and there is other less expensive fallback. But memcg would
> 
> You are ignoring the fact that, in contrast to alloc_pages, for memcg
> there is practically no difference between charging a 4-order page or a
> 1-order page.

But this is an implementation detail which might change anytime in the
future.

> OTOH, using 1-order pages where we could go with 4-order
> pages increases page fragmentation at the global level. This subtly
> breaks internal SLUB optimization. Once again, kmem accounting is not
> something staying aside from slab core, it's a part of slab core.

This is certainly true and it is what you get when you put an additional
constraint on top of an existing one. You simply cannot get both the
great performance _and_ a local memory 

Re: [PATCH 0/2] Fix memcg/memory.high in case kmem accounting is enabled

2015-09-01 Thread Vladimir Davydov
On Tue, Sep 01, 2015 at 02:36:12PM +0200, Michal Hocko wrote:
> On Mon 31-08-15 17:20:49, Vladimir Davydov wrote:
> > On Mon, Aug 31, 2015 at 03:24:15PM +0200, Michal Hocko wrote:
> > > On Sun 30-08-15 22:02:16, Vladimir Davydov wrote:
> > 
> > > > Tejun reported that sometimes memcg/memory.high threshold seems to be
> > > > silently ignored if kmem accounting is enabled:
> > > > 
> > > >   http://www.spinics.net/lists/linux-mm/msg93613.html
> > > > 
> > > > It turned out that both SLAB and SLUB try to allocate without __GFP_WAIT
> > > > first. As a result, if there is enough free pages, memcg reclaim will
> > > > not get invoked on kmem allocations, which will lead to uncontrollable
> > > > growth of memory usage no matter what memory.high is set to.
> > > 
> > > Right but isn't that what the caller explicitly asked for?
> > 
> > No. If the caller of kmalloc() asked for a __GFP_WAIT allocation, we
> > might ignore that and charge memcg w/o __GFP_WAIT.
> 
> I was referring to the slab allocator as the caller. Sorry for not being
> clear about that.
> 
> > > Why should we ignore that for kmem accounting? It seems like a fix at
> > > a wrong layer to me.
> > 
> > Let's forget about memory.high for a minute.
> >
> >  1. SLAB. Suppose someone calls kmalloc_node and there is enough free
> > memory on the preferred node. W/o memcg limit set, the allocation
> > will happen from the preferred node, which is OK. If there is memcg
> > limit, we can currently fail to allocate from the preferred node if
> > we are near the limit. We issue memcg reclaim and go to fallback
> > alloc then, which will most probably allocate from a different node,
> > although there is no reason for that. This is a bug.
> 
> I am not familiar with the SLAB internals much but how is it different
> from the global case. If the preferred node is full then __GFP_THISNODE
> request will make it fail early even without giving GFP_NOWAIT
> additional access to atomic memory reserves. The fact that memcg case
> fails earlier is perfectly expected because the restriction is tighter
> than the global case.

memcg restrictions are orthogonal to NUMA: failing an allocation from a
particular node does not mean failing memcg charge and vice versa.

> 
> How the fallback is implemented and whether trying other node before
> reclaiming from the preferred one is reasonable I dunno. This is for
> SLAB to decide. But ignoring GFP_NOWAIT for this path makes the behavior
> for memcg enabled setups subtly different. And that is bad.

Quite the contrary. Trying to charge memcg w/o __GFP_WAIT while
inspecting whether a NUMA node has free pages makes SLAB behave subtly
differently: SLAB will walk over all NUMA nodes for nothing instead of
invoking memcg reclaim once a free page is found.
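
The node-walk argument can be illustrated with a toy model (plain
Python, hypothetical names, nothing kernel-specific): a memcg charge
does not depend on the NUMA node, so a failing no-reclaim charge fails
identically on every node and the whole walk is wasted.

```python
def try_charge(usage, limit, nr, may_reclaim):
    """Toy charge: return (ok, new_usage); reclaim is modeled as
    always succeeding.  Note the node is not a parameter at all."""
    if usage + nr > limit:
        if not may_reclaim:
            return False, usage
        usage = limit - nr          # pretend reclaim frees just enough
    return True, usage + nr

def alloc_walk_all_nodes(nodes_with_free_pages, usage, limit):
    """SLAB-like fallback: a no-reclaim charge is attempted while
    inspecting each node -- every attempt fails for the same
    node-independent reason."""
    attempts = 0
    for node in nodes_with_free_pages:
        attempts += 1
        ok, usage = try_charge(usage, limit, 1, may_reclaim=False)
        if ok:
            return node, attempts
    return None, attempts

def alloc_reclaim_once(nodes_with_free_pages, usage, limit):
    """Alternative: pick the first node with a free page, then charge
    with reclaim allowed -- one attempt instead of a full node walk."""
    node = nodes_with_free_pages[0]
    ok, usage = try_charge(usage, limit, 1, may_reclaim=True)
    return (node if ok else None), 1

nodes = [0, 1, 2, 3]                 # every node has free pages
usage, limit = 100, 100              # memcg is exactly at its limit

print(alloc_walk_all_nodes(nodes, usage, limit))  # (None, 4): wasted walk
print(alloc_reclaim_once(nodes, usage, limit))    # (0, 1): local node wins
```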

You are talking about memcg/kmem accounting as if it were done in the
buddy allocator on top of which the slab layer is built knowing nothing
about memcg accounting on the lower layer. That's not true and that
simply can't be true. Kmem accounting is implemented at the slab layer.
Memcg provides its memcg_charge_slab/uncharge methods solely for
slab core, so it's OK to have some calling conventions between them.
What we are really obliged to do is to preserve behavior of slab's
external API, i.e. kmalloc and friends.

> 
> >  2. SLUB. Someone calls kmalloc and there is enough free high order
> > pages. If there is no memcg limit, we will allocate a high order
> > slab page, which is in accordance with SLUB internal logic. With
> > memcg limit set, we are likely to fail to charge high order page
> > (because we currently try to charge high order pages w/o __GFP_WAIT)
> > and fallback on a low order page. The latter is unexpected and
> > unjustified.
> 
> And this case very similar and I even argue that it shows more
> brokenness with your patch. The SLUB allocator has _explicitly_ asked
> for an allocation _without_ reclaim because that would be unnecessarily
> too costly and there is other less expensive fallback. But memcg would

You are ignoring the fact that, in contrast to alloc_pages, for memcg
there is practically no difference between charging a 4-order page or a
1-order page. OTOH, using 1-order pages where we could go with 4-order
pages increases page fragmentation at the global level. This subtly
breaks internal SLUB optimization. Once again, kmem accounting is not
something staying aside from slab core, it's a part of slab core.
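
The asymmetry can be sketched in a toy model (illustrative Python, not
kernel code, all names made up): the memcg side of a charge is plain
counter arithmetic and costs the same for any order, while a buddy-style
allocator genuinely needs an aligned contiguous run, which is what makes
high orders expensive globally once memory is fragmented.

```python
def memcg_charge(usage, limit, order):
    """A charge is only arithmetic on a counter: the order changes the
    amount charged, not the difficulty of charging it."""
    nr = 1 << order
    return (usage + nr <= limit), usage + nr

def buddy_alloc(free_map, order):
    """Find an aligned run of 2**order free pages (True = free) and
    mark it allocated; return its base or None."""
    size = 1 << order
    for base in range(0, len(free_map), size):
        if all(free_map[base:base + size]):
            for i in range(base, base + size):
                free_map[i] = False
            return base
    return None

# Global memory: 32 pages with every other order-1 pair in use, so
# plenty of order-1 runs remain but no order-4 run exists.
free_map = [(i // 2) % 2 == 0 for i in range(32)]

print(buddy_alloc(list(free_map), 4))   # None: fragmentation blocks order-4
print(buddy_alloc(list(free_map), 1))   # 0: order-1 still succeeds

# For the memcg counter both orders are equally "easy" below the limit:
print(memcg_charge(0, 100, 4)[0], memcg_charge(0, 100, 1)[0])
```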

> be ignoring this with your patch AFAIU and break the optimization. There
> are other cases like that. E.g. THP pages are allocated without __GFP_WAIT
> when defrag is disabled.

It might be wrong. If we can't find a contiguous 2Mb page, we should
probably give up instead of calling the compactor. For memcg it might be
better to reclaim some space for a 2Mb page right now and map a 2Mb page
instead of reclaiming space for 512 4Kb pages a moment later, because in
the memcg case there is absolutely no 

Re: [PATCH 0/2] Fix memcg/memory.high in case kmem accounting is enabled

2015-08-31 Thread Christoph Lameter
On Mon, 31 Aug 2015, Vladimir Davydov wrote:

> I totally agree that we should strive to make a kmem user feel roughly
> the same in memcg as if it were running on a host with equal amount of
> RAM. There are two ways to achieve that:
>
>  1. Make the API functions, i.e. kmalloc and friends, behave inside
> memcg roughly the same way as they do in the root cgroup.
>  2. Make the internal memcg functions, i.e. try_charge and friends,
> behave roughly the same way as alloc_pages.
>
> I find way 1 more flexible, because we don't have to blindly follow
> heuristics used on global memory reclaim and therefore have more
> opportunities to achieve the same goal.

The heuristics need to integrate well whether it's in a cgroup or not. In
general, make the use of cgroups as transparent as possible to the rest
of the code.



Re: [PATCH 0/2] Fix memcg/memory.high in case kmem accounting is enabled

2015-08-31 Thread Vladimir Davydov
On Mon, Aug 31, 2015 at 01:03:09PM -0400, Tejun Heo wrote:
> On Mon, Aug 31, 2015 at 07:51:32PM +0300, Vladimir Davydov wrote:
> ...
> > If we want to allow slab/slub implementation to invoke try_charge
> > wherever it wants, we need to introduce an asynchronous thread doing
> reclaim when a memcg is approaching its limit (or teach kswapd to do that).
> 
> In the long term, I think this is the way to go.

Quite probably, or we can use task_work, or direct reclaim instead. It's
not that obvious to me yet which one is the best.

> 
> > That's a way to go, but what's the point to complicate things
> > prematurely while it seems we can fix the problem by using the technique
> > similar to the one behind memory.high?
> 
> Cuz we're now scattering workarounds to multiple places and I'm sure
> we'll add more try_charge() users (e.g. we want to fold in tcp memcg
> under the same knobs) and we'll have to worry about the same problem
> all over again and will inevitably miss some cases leading to subtle
> failures.

I don't think we will need to insert try_charge_kmem anywhere else,
because all kmem users either allocate memory using kmalloc and friends
or using alloc_pages. kmalloc is accounted. For those who prefer
alloc_pages, there is alloc_kmem_pages helper.

> 
> > Nevertheless, even if we introduced such a thread, it'd be just insane
> > to allow slab/slub blindly insert try_charge. Let me repeat the examples
> > of SLAB/SLUB sub-optimal behavior caused by thoughtless usage of
> > try_charge I gave above:
> > 
> >  - memcg knows nothing about NUMA nodes, so what's the point in failing
> >!__GFP_WAIT allocations used by SLAB while inspecting NUMA nodes?
> >  - memcg knows nothing about high order pages, so what's the point in
> >failing !__GFP_WAIT allocations used by SLUB to try to allocate a
> >high order page?
> 
> Both are optimistic speculative actions and as long as memcg can
> guarantee that those requests will succeed under normal circumstances,
> as the system-wide mm does, it isn't a problem.
> 
> In general, we want to make sure inside-cgroup behaviors are as close to
> system-wide behaviors as possible, scoped but equivalent in kind.
> Doing things differently, while inevitable in certain cases, is likely
> to get messy in the long term.

I totally agree that we should strive to make a kmem user feel roughly
the same in memcg as if it were running on a host with an equal amount of
RAM. There are two ways to achieve that:

 1. Make the API functions, i.e. kmalloc and friends, behave inside
memcg roughly the same way as they do in the root cgroup.
 2. Make the internal memcg functions, i.e. try_charge and friends,
behave roughly the same way as alloc_pages.

I find way 1 more flexible, because we don't have to blindly follow
heuristics used on global memory reclaim and therefore have more
opportunities to achieve the same goal.

Thanks,
Vladimir


Re: [PATCH 0/2] Fix memcg/memory.high in case kmem accounting is enabled

2015-08-31 Thread Tejun Heo
Hello,

On Mon, Aug 31, 2015 at 07:51:32PM +0300, Vladimir Davydov wrote:
...
> If we want to allow slab/slub implementation to invoke try_charge
> wherever it wants, we need to introduce an asynchronous thread doing
> reclaim when a memcg is approaching its limit (or teach kswapd to do that).

In the long term, I think this is the way to go.

> That's a way to go, but what's the point to complicate things
> prematurely while it seems we can fix the problem by using the technique
> similar to the one behind memory.high?

Cuz we're now scattering workarounds to multiple places and I'm sure
we'll add more try_charge() users (e.g. we want to fold in tcp memcg
under the same knobs) and we'll have to worry about the same problem
all over again and will inevitably miss some cases leading to subtle
failures.

> Nevertheless, even if we introduced such a thread, it'd be just insane
> to allow slab/slub blindly insert try_charge. Let me repeat the examples
> of SLAB/SLUB sub-optimal behavior caused by thoughtless usage of
> try_charge I gave above:
> 
>  - memcg knows nothing about NUMA nodes, so what's the point in failing
>!__GFP_WAIT allocations used by SLAB while inspecting NUMA nodes?
>  - memcg knows nothing about high order pages, so what's the point in
>failing !__GFP_WAIT allocations used by SLUB to try to allocate a
>high order page?

Both are optimistic speculative actions and as long as memcg can
guarantee that those requests will succeed under normal circumstances,
as the system-wide mm does, it isn't a problem.

In general, we want to make sure inside-cgroup behaviors are as close to
system-wide behaviors as possible, scoped but equivalent in kind.
Doing things differently, while inevitable in certain cases, is likely
to get messy in the long term.

Thanks.

-- 
tejun


Re: [PATCH 0/2] Fix memcg/memory.high in case kmem accounting is enabled

2015-08-31 Thread Vladimir Davydov
On Mon, Aug 31, 2015 at 11:47:56AM -0400, Tejun Heo wrote:
> On Mon, Aug 31, 2015 at 06:18:14PM +0300, Vladimir Davydov wrote:
> > We have to be cautious about placing memcg_charge in slab/slub. To
> > understand why, consider SLAB case, which first tries to allocate from
> > all nodes in the order of preference w/o __GFP_WAIT and only if it fails
> > falls back on an allocation from any node w/ __GFP_WAIT. This is its
> > internal algorithm. If we blindly put memcg_charge to alloc_slab method,
> > then, when we are near the memcg limit, we will go over all NUMA nodes
> > in vain, then finally fall back to __GFP_WAIT allocation, which will get
> > a slab from a random node. Not only do we do more work than necessary due
> > to walking over all NUMA nodes for nothing, but we also break SLAB
> > internal logic! And you just can't fix it in memcg, because memcg knows
> > nothing about the internal logic of SLAB, how it handles NUMA nodes.
> > 
> > SLUB has a different problem. It tries to avoid high-order allocations
> > if there is a risk of invoking costly memory compactor. It has nothing
> > to do with memcg, because memcg does not care if the charge is for a
> > high order page or not.
> 
> Maybe I'm missing something but aren't both issues caused by memcg
> failing to provide headroom for NOWAIT allocations when the
> consumption gets close to the max limit? 

That's correct.

> Regardless of the specific usage, !__GFP_WAIT means "give me memory if
> it can be spared w/o inducing direct time-consuming maintenance work"
> and the contract around it is that such requests will mostly succeed
> under nominal conditions.  Also, slab/slub might not stay as the only
> user of try_charge().

Indeed, there might be other users trying GFP_NOWAIT before falling back
to GFP_KERNEL, but they are not doing that constantly and hence cause no
problems. If SLAB/SLUB plays such tricks, the problem becomes massive:
under certain conditions *every* try_charge may be invoked w/o
__GFP_WAIT, resulting in memory.high being breached and memory.max being hit.

Generally speaking, handing over reclaim responsibility to task_work
won't help, because there might be cases when a process spends quite a
lot of time in the kernel issuing lots of GFP_KERNEL allocations before
returning to userspace. Without fixing slab/slub, such a process will
charge w/o __GFP_WAIT and can therefore exceed memory.high and reach
memory.max. If there are no other active processes in the cgroup, the
cgroup can stay above memory.high for a relatively long time (suppose
the process was throttled in the kernel), possibly hurting the rest
of the system. What is worse, if the process happens to issue a real
GFP_NOWAIT allocation when it's about to hit the limit, it will fail.

If we want to allow slab/slub implementation to invoke try_charge
wherever it wants, we need to introduce an asynchronous thread doing
reclaim when a memcg is approaching its limit (or teach kswapd to do that).
That's a way to go, but what's the point to complicate things
prematurely while it seems we can fix the problem by using the technique
similar to the one behind memory.high?

Nevertheless, even if we introduced such a thread, it'd be just insane
to allow slab/slub blindly insert try_charge. Let me repeat the examples
of SLAB/SLUB sub-optimal behavior caused by thoughtless usage of
try_charge I gave above:

 - memcg knows nothing about NUMA nodes, so what's the point in failing
   !__GFP_WAIT allocations used by SLAB while inspecting NUMA nodes?
 - memcg knows nothing about high order pages, so what's the point in
   failing !__GFP_WAIT allocations used by SLUB to try to allocate a
   high order page?

Thanks,
Vladimir

> I still think solving this from memcg side is the right direction.


Re: [PATCH 0/2] Fix memcg/memory.high in case kmem accounting is enabled

2015-08-31 Thread Tejun Heo
Hello,

On Mon, Aug 31, 2015 at 06:18:14PM +0300, Vladimir Davydov wrote:
> We have to be cautious about placing memcg_charge in slab/slub. To
> understand why, consider SLAB case, which first tries to allocate from
> all nodes in the order of preference w/o __GFP_WAIT and only if it fails
> falls back on an allocation from any node w/ __GFP_WAIT. This is its
> internal algorithm. If we blindly put memcg_charge to alloc_slab method,
> then, when we are near the memcg limit, we will go over all NUMA nodes
> in vain, then finally fall back to __GFP_WAIT allocation, which will get
> a slab from a random node. Not only do we do more work than necessary due
> to walking over all NUMA nodes for nothing, but we also break SLAB
> internal logic! And you just can't fix it in memcg, because memcg knows
> nothing about the internal logic of SLAB, how it handles NUMA nodes.
> 
> SLUB has a different problem. It tries to avoid high-order allocations
> if there is a risk of invoking costly memory compactor. It has nothing
> to do with memcg, because memcg does not care if the charge is for a
> high order page or not.

Maybe I'm missing something but aren't both issues caused by memcg
failing to provide headroom for NOWAIT allocations when the
consumption gets close to the max limit?  Regardless of the specific
usage, !__GFP_WAIT means "give me memory if it can be spared w/o
inducing direct time-consuming maintenance work" and the contract
around it is that such requests will mostly succeed under nominal
conditions.  Also, slab/slub might not stay as the only user of
try_charge().  I still think solving this from memcg side is the right
direction.

Thanks.

-- 
tejun


Re: [PATCH 0/2] Fix memcg/memory.high in case kmem accounting is enabled

2015-08-31 Thread Vladimir Davydov
On Mon, Aug 31, 2015 at 10:46:04AM -0400, Tejun Heo wrote:
> Hello, Vladimir.
> 
> On Mon, Aug 31, 2015 at 05:20:49PM +0300, Vladimir Davydov wrote:
> ...
> > That being said, this is the fix at the right layer.
> 
> While this *might* be a necessary workaround for the hard limit case
> right now, this is by no means the fix at the right layer.  The
> expectation is that mm keeps a reasonable amount of memory available
> for allocations which can't block.  These allocations may fail from
> time to time depending on luck and under extreme memory pressure but
> the caller should be able to depend on it as a speculative allocation
> mechanism which doesn't fail willy-nilly.
> 
> Hardlimit breaking GFP_NOWAIT behavior is a bug on memcg side, not
> slab or slub.

I never denied that there is GFP_NOWAIT/GFP_NOFS problem in memcg. I
even proposed ways to cope with it in one of the previous e-mails.

Nevertheless, we just can't allow slab/slub internals call memcg_charge
whenever they want as I pointed out in a parallel thread.

Thanks,
Vladimir


Re: [PATCH 0/2] Fix memcg/memory.high in case kmem accounting is enabled

2015-08-31 Thread Vladimir Davydov
On Mon, Aug 31, 2015 at 10:39:39AM -0400, Tejun Heo wrote:
> On Mon, Aug 31, 2015 at 05:30:08PM +0300, Vladimir Davydov wrote:
> > slab/slub can issue alloc_pages() any time with any flags they want and
> > it won't be accounted to memcg, because kmem is accounted at slab/slub
> > layer, not in buddy.
> 
> Hmmm?  I meant the eventual calling into try_charge w/ GFP_NOWAIT.
> Speculative usage of GFP_NOWAIT is bound to increase and we don't want
> to put on extra restrictions from memcg side. 

We already put restrictions on slab/slub from memcg side, because kmem
accounting is a part of slab/slub. They have to cooperate in order to
get things working. If slab/slub wants to make a speculative allocation
for some reason, it should just put memcg_charge out of this speculative
alloc section. This is what this patch set does.

We have to be cautious about placing memcg_charge in slab/slub. To
understand why, consider SLAB case, which first tries to allocate from
all nodes in the order of preference w/o __GFP_WAIT and only if it fails
falls back on an allocation from any node w/ __GFP_WAIT. This is its
internal algorithm. If we blindly put memcg_charge to alloc_slab method,
then, when we are near the memcg limit, we will go over all NUMA nodes
in vain, then finally fall back to __GFP_WAIT allocation, which will get
a slab from a random node. Not only do we do more work than necessary due
to walking over all NUMA nodes for nothing, but we also break SLAB
internal logic! And you just can't fix it in memcg, because memcg knows
nothing about the internal logic of SLAB, how it handles NUMA nodes.

SLUB has a different problem. It tries to avoid high-order allocations
if there is a risk of invoking costly memory compactor. It has nothing
to do with memcg, because memcg does not care if the charge is for a
high order page or not.

Thanks,
Vladimir

> For memory.high,
> punting to the return path is a pretty straightforward solution which
> should make the problem go away almost entirely.


Re: [PATCH 0/2] Fix memcg/memory.high in case kmem accounting is enabled

2015-08-31 Thread Tejun Heo
Hello, Vladimir.

On Mon, Aug 31, 2015 at 05:20:49PM +0300, Vladimir Davydov wrote:
...
> That being said, this is the fix at the right layer.

While this *might* be a necessary workaround for the hard limit case
right now, this is by no means the fix at the right layer.  The
expectation is that mm keeps a reasonable amount of memory available
for allocations which can't block.  These allocations may fail from
time to time depending on luck and under extreme memory pressure but
the caller should be able to depend on it as a speculative allocation
mechanism which doesn't fail willy-nilly.

Hardlimit breaking GFP_NOWAIT behavior is a bug on memcg side, not
slab or slub.

Thanks.

-- 
tejun


Re: [PATCH 0/2] Fix memcg/memory.high in case kmem accounting is enabled

2015-08-31 Thread Tejun Heo
On Mon, Aug 31, 2015 at 05:30:08PM +0300, Vladimir Davydov wrote:
> slab/slub can issue alloc_pages() any time with any flags they want and
> it won't be accounted to memcg, because kmem is accounted at slab/slub
> layer, not in buddy.

Hmmm?  I meant the eventual calling into try_charge w/ GFP_NOWAIT.
Speculative usage of GFP_NOWAIT is bound to increase and we don't want
to impose extra restrictions from the memcg side.  For memory.high,
punting to the return path is a pretty straightforward solution which
should make the problem go away almost entirely.

Thanks.

-- 
tejun


Re: [PATCH 0/2] Fix memcg/memory.high in case kmem accounting is enabled

2015-08-31 Thread Vladimir Davydov
On Mon, Aug 31, 2015 at 09:43:35AM -0400, Tejun Heo wrote:
> On Mon, Aug 31, 2015 at 03:24:15PM +0200, Michal Hocko wrote:
> > Right but isn't that what the caller explicitly asked for? Why should we
> > ignore that for kmem accounting? It seems like a fix at a wrong layer to
> > me. Either we should start failing GFP_NOWAIT charges when we are above
> > high wmark or deploy an additional catchup mechanism as suggested by
> Tejun. I like the latter more because it allows us to better handle GFP_NOFS
> > requests as well and there are many sources of these from kmem paths.
> 
> Yeah, this is beginning to look like we're trying to solve the problem
> at the wrong layer.  slab/slub or whatever else should be able to use
> GFP_NOWAIT in whatever frequency they want for speculative
> allocations.

slab/slub can issue alloc_pages() any time with any flags they want and
it won't be accounted to memcg, because kmem is accounted at slab/slub
layer, not in buddy.

Thanks,
Vladimir


Re: [PATCH 0/2] Fix memcg/memory.high in case kmem accounting is enabled

2015-08-31 Thread Vladimir Davydov
On Mon, Aug 31, 2015 at 03:24:15PM +0200, Michal Hocko wrote:
> On Sun 30-08-15 22:02:16, Vladimir Davydov wrote:

> > Tejun reported that sometimes memcg/memory.high threshold seems to be
> > silently ignored if kmem accounting is enabled:
> > 
> >   http://www.spinics.net/lists/linux-mm/msg93613.html
> > 
> > It turned out that both SLAB and SLUB try to allocate without __GFP_WAIT
> > first. As a result, if there are enough free pages, memcg reclaim will
> > not get invoked on kmem allocations, which will lead to uncontrollable
> > growth of memory usage no matter what memory.high is set to.
> 
> Right but isn't that what the caller explicitly asked for?

No. If the caller of kmalloc() asked for a __GFP_WAIT allocation, we
might ignore that and charge memcg w/o __GFP_WAIT.

> Why should we ignore that for kmem accounting? It seems like a fix at
> a wrong layer to me.

Let's forget about memory.high for a minute.

 1. SLAB. Suppose someone calls kmalloc_node and there is enough free
memory on the preferred node. W/o memcg limit set, the allocation
will happen from the preferred node, which is OK. If there is memcg
limit, we can currently fail to allocate from the preferred node if
we are near the limit. We issue memcg reclaim and go to fallback
alloc then, which will most probably allocate from a different node,
although there is no reason for that. This is a bug.

 2. SLUB. Someone calls kmalloc and there is enough free high order
pages. If there is no memcg limit, we will allocate a high order
slab page, which is in accordance with SLUB internal logic. With
memcg limit set, we are likely to fail to charge high order page
(because we currently try to charge high order pages w/o __GFP_WAIT)
and fallback on a low order page. The latter is unexpected and
unjustified.

That being said, this is the fix at the right layer.

> Either we should start failing GFP_NOWAIT charges when we are above
> high wmark or deploy an additional catchup mechanism as suggested by
> Tejun.

The mechanism proposed by Tejun won't help us to avoid allocation
failures if we are hitting memory.max w/o __GFP_WAIT or __GFP_FS.

To fix GFP_NOFS/GFP_NOWAIT failures we just need to start reclaim when
the gap between the limit and the usage is getting too small. It may be
done from a workqueue or from task_work, but currently I don't see any
reason to complicate things rather than start reclaim directly, just
like memory.high does.

I mean, currently you can protect against GFP_NOWAIT failures by setting
memory.high to be 1-2 MB lower than memory.max and this *will* work,
because GFP_NOWAIT/GFP_NOFS allocations can't go on infinitely - they
will alternate with normal GFP_KERNEL allocations sooner or later. It
does not mean we should encourage users to set memory.high to protect
against such failures, because, as pointed out by Tejun, logic behind
memory.high is currently opaque and can change, but we can introduce
memcg-internal watermarks that would work exactly as memory.high and
hence help us against GFP_NOWAIT/GFP_NOFS failures.

Thanks,
Vladimir

> I like the latter more because it allows us to better handle GFP_NOFS
> requests as well and there are many sources of these from kmem paths.


Re: [PATCH 0/2] Fix memcg/memory.high in case kmem accounting is enabled

2015-08-31 Thread Tejun Heo
Hello,

On Mon, Aug 31, 2015 at 03:24:15PM +0200, Michal Hocko wrote:
> Right but isn't that what the caller explicitly asked for? Why should we
> ignore that for kmem accounting? It seems like a fix at a wrong layer to
> me. Either we should start failing GFP_NOWAIT charges when we are above
> high wmark or deploy an additional catchup mechanism as suggested by
> Tejun. I like the latter more because it allows us to better handle GFP_NOFS
> requests as well and there are many sources of these from kmem paths.

Yeah, this is beginning to look like we're trying to solve the problem
at the wrong layer.  slab/slub or whatever else should be able to use
GFP_NOWAIT in whatever frequency they want for speculative
allocations.

Thanks.

-- 
tejun


Re: [PATCH 0/2] Fix memcg/memory.high in case kmem accounting is enabled

2015-08-31 Thread Michal Hocko
On Sun 30-08-15 22:02:16, Vladimir Davydov wrote:
> Hi,
> 
> Tejun reported that sometimes memcg/memory.high threshold seems to be
> silently ignored if kmem accounting is enabled:
> 
>   http://www.spinics.net/lists/linux-mm/msg93613.html
> 
> It turned out that both SLAB and SLUB try to allocate without __GFP_WAIT
> first. As a result, if there are enough free pages, memcg reclaim will
> not get invoked on kmem allocations, which will lead to uncontrollable
> growth of memory usage no matter what memory.high is set to.

Right but isn't that what the caller explicitly asked for? Why should we
ignore that for kmem accounting? It seems like a fix at a wrong layer to
me. Either we should start failing GFP_NOWAIT charges when we are above
high wmark or deploy an additional catchup mechanism as suggested by
Tejun. I like the latter more because it allows us to better handle GFP_NOFS
requests as well and there are many sources of these from kmem paths.
 
> This patch set attempts to fix this issue. For more details please see
> comments to individual patches.
> 
> Thanks,
> 
> Vladimir Davydov (2):
>   mm/slab: skip memcg reclaim only if in atomic context
>   mm/slub: do not bypass memcg reclaim for high-order page allocation
> 
>  mm/slab.c | 32 +++-
>  mm/slub.c | 24 +++-
>  2 files changed, 22 insertions(+), 34 deletions(-)
> 
> -- 
> 2.1.4

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 0/2] Fix memcg/memory.high in case kmem accounting is enabled

2015-08-31 Thread Vladimir Davydov
On Mon, Aug 31, 2015 at 11:47:56AM -0400, Tejun Heo wrote:
> On Mon, Aug 31, 2015 at 06:18:14PM +0300, Vladimir Davydov wrote:
> > We have to be cautious about placing memcg_charge in slab/slub. To
> > understand why, consider SLAB case, which first tries to allocate from
> > all nodes in the order of preference w/o __GFP_WAIT and only if it fails
> > falls back on an allocation from any node w/ __GFP_WAIT. This is its
> > internal algorithm. If we blindly put memcg_charge to alloc_slab method,
> > then, when we are near the memcg limit, we will go over all NUMA nodes
> > in vain, then finally fall back to __GFP_WAIT allocation, which will get
> > a slab from a random node. Not only do we do more work than necessary due
> > to walking over all NUMA nodes for nothing, but we also break SLAB
> > internal logic! And you just can't fix it in memcg, because memcg knows
> > nothing about the internal logic of SLAB, how it handles NUMA nodes.
> > 
> > SLUB has a different problem. It tries to avoid high-order allocations
> > if there is a risk of invoking costly memory compactor. It has nothing
> > to do with memcg, because memcg does not care if the charge is for a
> > high order page or not.
> 
> Maybe I'm missing something but aren't both issues caused by memcg
> failing to provide headroom for NOWAIT allocations when the
> consumption gets close to the max limit? 

That's correct.

> Regardless of the specific usage, !__GFP_WAIT means "give me memory if
> it can be spared w/o inducing direct time-consuming maintenance work"
> and the contract around it is that such requests will mostly succeed
> under nominal conditions.  Also, slab/slub might not stay as the only
> user of try_charge().

Indeed, there might be other users trying GFP_NOWAIT before falling back
to GFP_KERNEL, but they are not doing that constantly and hence cause no
problems. If SLAB/SLUB plays such tricks, the problem becomes massive:
under certain conditions *every* try_charge may be invoked w/o
__GFP_WAIT, resulting in memory.high breaching and hitting memory.max.

Generally speaking, handing over reclaim responsibility to task_work
won't help, because there might be cases when a process spends quite a
lot of time in the kernel invoking lots of GFP_KERNEL allocations before
returning to userspace. Without fixing slab/slub, such a process will
charge w/o __GFP_WAIT and therefore can exceed memory.high and reach
memory.max. If there are no other active processes in the cgroup, the
cgroup can stay with memory.high excess for a relatively long time
(suppose the process was throttled in kernel), possibly hurting the rest
of the system. What is worse, if the process happens to invoke a real
GFP_NOWAIT allocation when it's about to hit the limit, it will fail.

If we want to allow the slab/slub implementation to invoke try_charge
wherever it wants, we need to introduce an asynchronous thread doing
reclaim when a memcg is approaching its limit (or teach kswapd to do that).
That's a way to go, but what's the point in complicating things
prematurely while it seems we can fix the problem by using the technique
similar to the one behind memory.high?
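
A minimal sketch (illustrative names only) of the memory.high technique
referred to here: the charge itself does not fail; if it pushes usage
over the high boundary and the context may sleep, the caller reclaims
the excess synchronously, throttling the allocator.

```python
class MemcgHigh:
    def __init__(self, high):
        self.high = high   # stands in for memory.high, in pages
        self.usage = 0

    def reclaim(self, nr):
        # stand-in for try_to_free_mem_cgroup_pages()
        freed = min(nr, self.usage)
        self.usage -= freed
        return freed

    def charge(self, nr, can_wait):
        self.usage += nr   # never fails: high is a soft boundary
        excess = self.usage - self.high
        if excess > 0 and can_wait:
            self.reclaim(excess)   # the allocator pays back the excess
```

A NOWAIT charge may overshoot, but the next sleeping charge pulls usage
back to the boundary.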

Nevertheless, even if we introduced such a thread, it'd be just insane
to allow slab/slub to blindly insert try_charge. Let me repeat the examples
of SLAB/SLUB sub-optimal behavior caused by thoughtless usage of
try_charge I gave above:

 - memcg knows nothing about NUMA nodes, so what's the point in failing
   !__GFP_WAIT allocations used by SLAB while inspecting NUMA nodes?
 - memcg knows nothing about high order pages, so what's the point in
   failing !__GFP_WAIT allocations used by SLUB to try to allocate a
   high order page?

Thanks,
Vladimir

> I still think solving this from memcg side is the right direction.


Re: [PATCH 0/2] Fix memcg/memory.high in case kmem accounting is enabled

2015-08-31 Thread Tejun Heo
Hello,

On Mon, Aug 31, 2015 at 07:51:32PM +0300, Vladimir Davydov wrote:
...
> If we want to allow the slab/slub implementation to invoke try_charge
> wherever it wants, we need to introduce an asynchronous thread doing
> reclaim when a memcg is approaching its limit (or teach kswapd to do that).

In the long term, I think this is the way to go.

> That's a way to go, but what's the point in complicating things
> prematurely while it seems we can fix the problem by using the technique
> similar to the one behind memory.high?

Cuz we're now scattering workarounds to multiple places and I'm sure
we'll add more try_charge() users (e.g. we want to fold in tcp memcg
under the same knobs) and we'll have to worry about the same problem
all over again and will inevitably miss some cases leading to subtle
failures.

> Nevertheless, even if we introduced such a thread, it'd be just insane
> to allow slab/slub to blindly insert try_charge. Let me repeat the examples
> of SLAB/SLUB sub-optimal behavior caused by thoughtless usage of
> try_charge I gave above:
> 
>  - memcg knows nothing about NUMA nodes, so what's the point in failing
>    !__GFP_WAIT allocations used by SLAB while inspecting NUMA nodes?
>  - memcg knows nothing about high order pages, so what's the point in
>    failing !__GFP_WAIT allocations used by SLUB to try to allocate a
>    high order page?

Both are optimistic speculative actions and as long as memcg can
guarantee that those requests will succeed under normal circumstances,
as does the system-wide mm does, it isn't a problem.

In general, we want to make sure inside-cgroup behaviors as close to
system-wide behaviors as possible, scoped but equivalent in kind.
Doing things differently, while inevitable in certain cases, is likely
to get messy in the long term.

Thanks.

-- 
tejun


Re: [PATCH 0/2] Fix memcg/memory.high in case kmem accounting is enabled

2015-08-31 Thread Vladimir Davydov
On Mon, Aug 31, 2015 at 01:03:09PM -0400, Tejun Heo wrote:
> On Mon, Aug 31, 2015 at 07:51:32PM +0300, Vladimir Davydov wrote:
> ...
> > If we want to allow the slab/slub implementation to invoke try_charge
> > wherever it wants, we need to introduce an asynchronous thread doing
> > reclaim when a memcg is approaching its limit (or teach kswapd to do that).
> 
> In the long term, I think this is the way to go.

Quite probably, or we can use task_work, or direct reclaim instead. It's
not that obvious to me yet which one is the best.

> 
> > That's a way to go, but what's the point in complicating things
> > prematurely while it seems we can fix the problem by using the technique
> > similar to the one behind memory.high?
> 
> Cuz we're now scattering workarounds to multiple places and I'm sure
> we'll add more try_charge() users (e.g. we want to fold in tcp memcg
> under the same knobs) and we'll have to worry about the same problem
> all over again and will inevitably miss some cases leading to subtle
> failures.

I don't think we will need to insert try_charge_kmem anywhere else,
because all kmem users either allocate memory using kmalloc and friends
or using alloc_pages. kmalloc is accounted. For those who prefer
alloc_pages, there is the alloc_kmem_pages helper.

> 
> > Nevertheless, even if we introduced such a thread, it'd be just insane
> > to allow slab/slub to blindly insert try_charge. Let me repeat the examples
> > of SLAB/SLUB sub-optimal behavior caused by thoughtless usage of
> > try_charge I gave above:
> > 
> >  - memcg knows nothing about NUMA nodes, so what's the point in failing
> >    !__GFP_WAIT allocations used by SLAB while inspecting NUMA nodes?
> >  - memcg knows nothing about high order pages, so what's the point in
> >    failing !__GFP_WAIT allocations used by SLUB to try to allocate a
> >    high order page?
> 
> Both are optimistic speculative actions and as long as memcg can
> guarantee that those requests will succeed under normal circumstances,
> as the system-wide mm does, it isn't a problem.
> 
> In general, we want to make sure inside-cgroup behaviors as close to
> system-wide behaviors as possible, scoped but equivalent in kind.
> Doing things differently, while inevitable in certain cases, is likely
> to get messy in the long term.

I totally agree that we should strive to make a kmem user feel roughly
the same in memcg as if it were running on a host with an equal amount of
RAM. There are two ways to achieve that:

 1. Make the API functions, i.e. kmalloc and friends, behave inside
memcg roughly the same way as they do in the root cgroup.
 2. Make the internal memcg functions, i.e. try_charge and friends,
behave roughly the same way as alloc_pages.

I find way 1 more flexible, because we don't have to blindly follow
heuristics used on global memory reclaim and therefore have more
opportunities to achieve the same goal.

Thanks,
Vladimir


Re: [PATCH 0/2] Fix memcg/memory.high in case kmem accounting is enabled

2015-08-31 Thread Christoph Lameter
On Mon, 31 Aug 2015, Vladimir Davydov wrote:

> I totally agree that we should strive to make a kmem user feel roughly
> the same in memcg as if it were running on a host with equal amount of
> RAM. There are two ways to achieve that:
>
>  1. Make the API functions, i.e. kmalloc and friends, behave inside
> memcg roughly the same way as they do in the root cgroup.
>  2. Make the internal memcg functions, i.e. try_charge and friends,
> behave roughly the same way as alloc_pages.
>
> I find way 1 more flexible, because we don't have to blindly follow
> heuristics used on global memory reclaim and therefore have more
> opportunities to achieve the same goal.

The heuristics need to integrate well whether we are in a cgroup or not.
In general, make the use of cgroups as transparent as possible to the
rest of the code.



Re: [PATCH 0/2] Fix memcg/memory.high in case kmem accounting is enabled

2015-08-31 Thread Tejun Heo
Hello,

On Mon, Aug 31, 2015 at 03:24:15PM +0200, Michal Hocko wrote:
> Right but isn't that what the caller explicitly asked for? Why should we
> ignore that for kmem accounting? It seems like a fix at a wrong layer to
> me. Either we should start failing GFP_NOWAIT charges when we are above
> high wmark or deploy an additional catchup mechanism as suggested by
> Tejun. I like the latter more because it allows us to better handle GFP_NOFS
> requests as well and there are many sources of these from kmem paths.

Yeah, this is beginning to look like we're trying to solve the problem
at the wrong layer.  slab/slub or whatever else should be able to use
GFP_NOWAIT in whatever frequency they want for speculative
allocations.

Thanks.

-- 
tejun


Re: [PATCH 0/2] Fix memcg/memory.high in case kmem accounting is enabled

2015-08-31 Thread Vladimir Davydov
On Mon, Aug 31, 2015 at 09:43:35AM -0400, Tejun Heo wrote:
> On Mon, Aug 31, 2015 at 03:24:15PM +0200, Michal Hocko wrote:
> > Right but isn't that what the caller explicitly asked for? Why should we
> > ignore that for kmem accounting? It seems like a fix at a wrong layer to
> > me. Either we should start failing GFP_NOWAIT charges when we are above
> > high wmark or deploy an additional catchup mechanism as suggested by
> > Tejun. I like the latter more because it allows us to better handle GFP_NOFS
> > requests as well and there are many sources of these from kmem paths.
> 
> Yeah, this is beginning to look like we're trying to solve the problem
> at the wrong layer.  slab/slub or whatever else should be able to use
> GFP_NOWAIT in whatever frequency they want for speculative
> allocations.

slab/slub can issue alloc_pages() any time with any flags they want and
it won't be accounted to memcg, because kmem is accounted at the
slab/slub layer, not in the buddy allocator.

Thanks,
Vladimir


Re: [PATCH 0/2] Fix memcg/memory.high in case kmem accounting is enabled

2015-08-31 Thread Michal Hocko
On Sun 30-08-15 22:02:16, Vladimir Davydov wrote:
> Hi,
> 
> Tejun reported that sometimes memcg/memory.high threshold seems to be
> silently ignored if kmem accounting is enabled:
> 
>   http://www.spinics.net/lists/linux-mm/msg93613.html
> 
> It turned out that both SLAB and SLUB try to allocate without __GFP_WAIT
> first. As a result, if there are enough free pages, memcg reclaim will
> not get invoked on kmem allocations, which will lead to uncontrollable
> growth of memory usage no matter what memory.high is set to.

Right but isn't that what the caller explicitly asked for? Why should we
ignore that for kmem accounting? It seems like a fix at a wrong layer to
me. Either we should start failing GFP_NOWAIT charges when we are above
high wmark or deploy an additional catchup mechanism as suggested by
Tejun. I like the latter more because it allows us to better handle GFP_NOFS
requests as well and there are many sources of these from kmem paths.
 
> This patch set attempts to fix this issue. For more details please see
> comments to individual patches.
> 
> Thanks,
> 
> Vladimir Davydov (2):
>   mm/slab: skip memcg reclaim only if in atomic context
>   mm/slub: do not bypass memcg reclaim for high-order page allocation
> 
>  mm/slab.c | 32 +++-
>  mm/slub.c | 24 +++-
>  2 files changed, 22 insertions(+), 34 deletions(-)
> 
> -- 
> 2.1.4

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 0/2] Fix memcg/memory.high in case kmem accounting is enabled

2015-08-31 Thread Tejun Heo
On Mon, Aug 31, 2015 at 05:30:08PM +0300, Vladimir Davydov wrote:
> slab/slub can issue alloc_pages() any time with any flags they want and
> it won't be accounted to memcg, because kmem is accounted at the
> slab/slub layer, not in the buddy allocator.

Hmmm?  I meant the eventual calling into try_charge w/ GFP_NOWAIT.
Speculative usage of GFP_NOWAIT is bound to increase and we don't want
to put on extra restrictions from memcg side.  For memory.high,
punting to the return path is a pretty straightforward solution which
should make the problem go away almost entirely.
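
A toy model (illustrative, not the kernel interface) of this punting:
a NOWAIT charge that overshoots memory.high only records a debt; the
task pays it off with sleeping reclaim just before returning to
userspace, where blocking is always safe.

```python
class Memcg:
    def __init__(self, high):
        self.high = high
        self.usage = 0

class Task:
    def __init__(self, memcg):
        self.memcg = memcg
        self.reclaim_pending = False   # models task_work_add() being queued

    def charge_nowait(self, nr):
        self.memcg.usage += nr         # speculative charge never fails here
        if self.memcg.usage > self.memcg.high:
            self.reclaim_pending = True

    def return_to_userspace(self):
        if self.reclaim_pending:
            excess = self.memcg.usage - self.memcg.high
            if excess > 0:
                self.memcg.usage -= excess   # sleeping reclaim is safe now
            self.reclaim_pending = False
```

The overshoot is bounded by what a single kernel entry can allocate, so
memory.high keeps functioning without restricting NOWAIT users.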

Thanks.

-- 
tejun


Re: [PATCH 0/2] Fix memcg/memory.high in case kmem accounting is enabled

2015-08-31 Thread Vladimir Davydov
On Mon, Aug 31, 2015 at 03:24:15PM +0200, Michal Hocko wrote:
> On Sun 30-08-15 22:02:16, Vladimir Davydov wrote:

> > Tejun reported that sometimes memcg/memory.high threshold seems to be
> > silently ignored if kmem accounting is enabled:
> > 
> >   http://www.spinics.net/lists/linux-mm/msg93613.html
> > 
> > It turned out that both SLAB and SLUB try to allocate without __GFP_WAIT
> > first. As a result, if there are enough free pages, memcg reclaim will
> > not get invoked on kmem allocations, which will lead to uncontrollable
> > growth of memory usage no matter what memory.high is set to.
> 
> Right but isn't that what the caller explicitly asked for?

No. If the caller of kmalloc() asked for a __GFP_WAIT allocation, we
might ignore that and charge memcg w/o __GFP_WAIT.

> Why should we ignore that for kmem accounting? It seems like a fix at
> a wrong layer to me.

Let's forget about memory.high for a minute.

 1. SLAB. Suppose someone calls kmalloc_node and there is enough free
memory on the preferred node. W/o memcg limit set, the allocation
will happen from the preferred node, which is OK. If there is memcg
limit, we can currently fail to allocate from the preferred node if
we are near the limit. We issue memcg reclaim and go to fallback
alloc then, which will most probably allocate from a different node,
although there is no reason for that. This is a bug.

 2. SLUB. Someone calls kmalloc and there are enough free high order
pages. If there is no memcg limit, we will allocate a high order
slab page, which is in accordance with SLUB internal logic. With
memcg limit set, we are likely to fail to charge high order page
(because we currently try to charge high order pages w/o __GFP_WAIT)
and fallback on a low order page. The latter is unexpected and
unjustified.

That being said, this is the fix at the right layer.
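
The SLUB behaviour in point 2 can be modelled roughly like this: first
attempt the preferred high-order slab page without waiting, falling
back to the minimum order only if that cheap attempt fails. Failing the
memcg charge for the speculative attempt therefore forces a low-order
slab even when free high-order pages exist. Names below are
illustrative, not the actual mm/slub.c functions.

```python
def allocate_slab(charge, oo_order=3, min_order=0):
    """charge(order, can_wait) -> bool models memcg_charge for 2**order pages."""
    if charge(oo_order, can_wait=False):
        return oo_order   # preferred high-order slab: best object packing
    if charge(min_order, can_wait=True):
        return min_order  # fallback slab: worse packing, unexpected here
    return None
```

A memcg near its limit that refuses NOWAIT charges always takes the
fallback path, which is the unjustified behaviour described above.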

> Either we should start failing GFP_NOWAIT charges when we are above
> high wmark or deploy an additional catchup mechanism as suggested by
> Tejun.

The mechanism proposed by Tejun won't help us to avoid allocation
failures if we are hitting memory.max w/o __GFP_WAIT or __GFP_FS.

To fix GFP_NOFS/GFP_NOWAIT failures we just need to start reclaim when
the gap between limit and usage is getting too small. It may be done
from a workqueue or from task_work, but currently I don't see any reason
why complicate and not just start reclaim directly, just like
memory.high does.

I mean, currently you can protect against GFP_NOWAIT failures by setting
memory.high to be 1-2 MB lower than memory.max and this *will* work,
because GFP_NOWAIT/GFP_NOFS allocations can't go on infinitely - they
will alternate with normal GFP_KERNEL allocations sooner or later. It
does not mean we should encourage users to set memory.high to protect
against such failures, because, as pointed out by Tejun, logic behind
memory.high is currently opaque and can change, but we can introduce
memcg-internal watermarks that would work exactly as memory.high and
hence help us against GFP_NOWAIT/GFP_NOFS failures.
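
A sketch of that internal-watermark idea: reclaim is kicked as soon as
usage crosses a mark slightly below memory.max, so later
GFP_NOWAIT/GFP_NOFS charges still find room. Field and method names are
assumptions for illustration only.

```python
class MemcgWmark:
    def __init__(self, max_pages, gap):
        self.max = max_pages
        self.wmark = max_pages - gap   # internal high watermark
        self.usage = 0
        self.reclaim_kicked = False

    def try_charge(self, nr, can_reclaim):
        if self.usage + nr > self.max:
            if not can_reclaim:
                return False                  # the failure we want to avoid
            self.usage = max(self.wmark - nr, 0)  # direct reclaim to the wmark
        self.usage += nr
        if self.usage > self.wmark:
            self.reclaim_kicked = True        # wake background reclaim early
        return True
```

As long as the gap covers the NOWAIT/NOFS burst size, charges that
cannot reclaim keep succeeding because reclaim was started early.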

Thanks,
Vladimir

> I like the latter more because it allows us to better handle GFP_NOFS
> requests as well and there are many sources of these from kmem paths.


[PATCH 0/2] Fix memcg/memory.high in case kmem accounting is enabled

2015-08-30 Thread Vladimir Davydov
Hi,

Tejun reported that sometimes memcg/memory.high threshold seems to be
silently ignored if kmem accounting is enabled:

  http://www.spinics.net/lists/linux-mm/msg93613.html

It turned out that both SLAB and SLUB try to allocate without __GFP_WAIT
first. As a result, if there are enough free pages, memcg reclaim will
not get invoked on kmem allocations, which will lead to uncontrollable
growth of memory usage no matter what memory.high is set to.

This patch set attempts to fix this issue. For more details please see
comments to individual patches.

Thanks,

Vladimir Davydov (2):
  mm/slab: skip memcg reclaim only if in atomic context
  mm/slub: do not bypass memcg reclaim for high-order page allocation

 mm/slab.c | 32 +++-
 mm/slub.c | 24 +++-
 2 files changed, 22 insertions(+), 34 deletions(-)

-- 
2.1.4


