[PATCH 0/8] mm: memcontrol: account socket memory in unified hierarchy v2

2015-11-04 Thread Johannes Weiner
Hi,

this is version 2 of the patches to add socket memory accounting to
the unified hierarchy memory controller. Changes from v1 include:

- No accounting overhead unless a dedicated cgroup is created and the
  memory controller instructed to track that group's memory footprint.
  Distribution kernels enable CONFIG_MEMCG, and users (incl. systemd)
  might create cgroups only for process control or resources other
  than memory. As noted by David and Michal, these setups shouldn't
  pay any overhead for this.

- Continue to enter the socket pressure state when hitting the memory
  controller's hard limit. Vladimir noted that there is at least some
  value in telling other sockets in the cgroup to not increase their
  transmit windows when one of them is already dropping packets.

- Drop the controversial vmpressure rework. Instead of changing the
  level where pressure is noted, keep noting pressure in its origin
  and then make the pressure check hierarchical. As noted by Michal
  and Vladimir, we shouldn't risk changing user-visible behavior.

---

Socket buffer memory can make up a significant share of a workload's
memory footprint that can be directly linked to userspace activity,
and so it needs to be part of the memory controller to provide proper
resource isolation/containment.

Historically, socket buffers were accounted in a separate counter,
without any pressure equalization between anonymous memory, page
cache, and the socket buffers. When the socket buffer pool was
exhausted, buffer allocations would fail hard and cause network
performance to tank, regardless of whether there was still memory
available to the group or not. Likewise, struggling anonymous or cache
workingsets could not dip into an idle socket memory pool. Because of
this, the feature was not usable for many real life applications.

To not repeat this mistake, the new memory controller will account all
types of memory pages it is tracking on behalf of a cgroup in a single
pool. Upon pressure, the VM reclaims and shrinks and puts pressure on
whatever memory consumer in that pool is within its reach.

For socket memory, pressure feedback is provided through vmpressure
events. When the VM has trouble freeing memory, the network code is
instructed to stop growing the cgroup's transmit windows.

---

This series begins with a rework of the existing tcp memory controller
that simplifies and cleans up the code while allowing us to have only
one set of networking hooks for both memory controller versions. The
original behavior of the existing tcp controller should be preserved.

It then adds socket accounting to the v2 memory controller, including
the use of the per-cpu charge cache and async memory.high enforcement
from socket memory charges.

Lastly, vmpressure is hooked up to the socket code so that it stops
growing transmit windows when the VM has trouble reclaiming memory.

 include/linux/memcontrol.h   |  98 ---
 include/linux/page_counter.h |   6 +-
 include/net/sock.h   | 137 ++---
 include/net/tcp.h|   5 +-
 include/net/tcp_memcontrol.h |   7 --
 mm/backing-dev.c |   2 +-
 mm/hugetlb_cgroup.c  |   3 +-
 mm/memcontrol.c  | 262 +
 mm/page_counter.c|  14 +--
 mm/vmpressure.c  |  25 +++-
 mm/vmscan.c  |  31 ++---
 net/core/sock.c  |  78 +++-
 net/ipv4/sysctl_net_ipv4.c   |   1 -
 net/ipv4/tcp.c   |   3 +-
 net/ipv4/tcp_ipv4.c  |   9 +-
 net/ipv4/tcp_memcontrol.c| 147 ---
 net/ipv4/tcp_output.c|   6 +-
 net/ipv6/tcp_ipv6.c  |   3 -
 18 files changed, 328 insertions(+), 509 deletions(-)

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/8] mm: memcontrol: account socket memory in unified hierarchy

2015-11-02 Thread Vladimir Davydov
On Thu, Oct 29, 2015 at 10:52:28AM -0700, Johannes Weiner wrote:
...
> Now, you mentioned that you'd rather see the socket buffers accounted
> at the allocator level, but I looked at the different allocation paths
> and network protocols and I'm not convinced that this makes sense. We
> don't want to be in the hotpath of every single packet when a lot of
> them are small, short-lived management blips that don't involve user
> space to let the kernel dispose of them.
> 
> __sk_mem_schedule() on the other hand is already wired up to exactly
> those consumers we are interested in for memory isolation: those with
> bigger chunks of data attached to them and those that have exploding
> receive queues when userspace fails to read(). UDP and TCP.
> 
> I mean, there is a reason why the global memory limits apply to only
> those types of packets in the first place: everything else is noise.
> 
> I agree that it's appealing to account at the allocator level and set
> page->mem_cgroup etc. but in this case we'd pay extra to capture a lot
> of noise, and I don't want to pay that just for aesthetics. In this
> case it's better to track ownership on the socket level and only count
> packets that can accumulate a significant amount of memory consumed.

Sigh, you seem to be right. Moreover, I can't even think of a neat way
to account skb pages to memcg, because rcv skbs are generated in device
drivers, where we don't know which socket/memcg it will go to. We could
recharge individual pages when skb gets to the network or transport
layer, but it would result in unjustified overhead.

> 
> > > We tried using the per-memcg tcp limits, and that prevents the OOMs
> > > for sure, but it's horrendous for network performance. There is no
> > > "stop growing" phase, it just keeps going full throttle until it hits
> > > the wall hard.
> > > 
> > > Now, we could probably try to replicate the global knobs and add a
> > > per-memcg soft limit. But you know better than anyone else how hard it
> > > is to estimate the overall workingset size of a workload, and the
> > > margins on containerized loads are razor-thin. Performance is much
> > > more sensitive to input errors, and often times parameters must be
> > > adjusted continuously during the runtime of a workload. It'd be
> > > disasterous to rely on yet more static, error-prone user input here.
> > 
> > Yeah, but the dynamic approach proposed in your patch set doesn't
> > guarantee we won't hit OOM in memcg due to overgrown buffers. It just
> > reduces this possibility. Of course, memcg OOM is far not as disastrous
> > as the global one, but still it usually means the workload breakage.
> 
> Right now, the entire machine breaks. Confining it to a faulty memcg,
> as well as reducing the likelihood of that OOM in many cases seems
> like a good move in the right direction, no?

It seems. However, memcg OOM is also bad, we should strive to avoid it
if we can.

> 
> And how likely are memcg OOMs because of this anyway? There is of

Frankly, I've no idea. Your arguments below sound reassuring though.

> course a scenario imaginable where the packets pile up, followed by
> some *other* part of the workload, the one that doesn't read() and
> process packets, trying to expand--which then doesn't work and goes
> OOM. But that seems like a complete corner case. In the vast majority
> of cases, the application will be in full operation and just fail to
> read() fast enough--because the network bandwidth is enormous compared
> to the container's size, or because it shares the CPU with thousands
> of other workloads and there is scheduling latency.
> 
> This would be the perfect point to reign in the transmit window...
> 
> > The static approach is error-prone for sure, but it has existed for
> > years and worked satisfactory AFAIK.
> 
> ...but that point is not a fixed amount of memory consumed. It depends
> on the workload and the random interactions it's having with thousands
> of other containers on that same machine.
> 
> The point of containers is to maximize utilization of your hardware
> and systematically eliminate slack in the system. But it's exactly
> that slack on dedicated bare-metal machines that allowed us to take a
> wild guess at the settings and then tune them based on observing a
> handful of workloads. This approach is not going to work anymore when
> we pack the machine to capacity and still expect every single
> container out of thousands to perform well. We need that automation.

But we do use static approach when setting memory limits, no?
memory.{low,high,max} - they are all static.

I understand it's appealing to have just one knob - memory size - like
in case of virtual machines, but it doesn't seem to work with
containers. You added memory.low and memory.high knobs. VMs don't have
anything like that. How is one supposed to set them? Depends on the
workload, I guess. Also, there is the pids cgroup for limiting the
number of pids that can be used by a cgroup, because 

Re: [PATCH 0/8] mm: memcontrol: account socket memory in unified hierarchy

2015-10-29 Thread Johannes Weiner
On Thu, Oct 29, 2015 at 12:27:47PM +0300, Vladimir Davydov wrote:
> On Wed, Oct 28, 2015 at 11:58:10AM -0700, Johannes Weiner wrote:
> > Having the hard limit as a failsafe (or a minimum for other consumers)
> > is one thing, and certainly something I'm open to for cgroupv2, should
> > we have problems with load startup up after a socket memory landgrab.
> > 
> > That being said, if the VM is struggling to reclaim pages, or is even
> > swapping, it makes perfect sense to let the socket memory scheduler
> > know it shouldn't continue to increase its footprint until the VM
> > recovers. Regardless of any hard limitations/minimum guarantees.
> > 
> > This is what my patch does and it seems pretty straight-forward to
> > me. I don't really understand why this is so controversial.
> 
> I'm not arguing that the idea behind this patch set is necessarily bad.
> Quite the contrary, it does look interesting to me. I'm just saying that
> IMO it can't replace hard/soft limits. It probably could if it was
> possible to shrink buffers, but I don't think it's feasible, even
> theoretically. That's why I propose not to change the behavior of the
> existing per memcg tcp limit at all. And frankly I don't get why you are
> so keen on simplifying it. You say it's a "crapload of boilerplate
> code". Well, I don't see how it is - it just replicates global knobs and
> I don't see how it could be done in a better way. The code is hidden
> behind jump labels, so the overhead is zero if it isn't used. If you
> really dislike this code, we can isolate it under a separate config
> option. But all right, I don't rule out the possibility that the code
> could be simplified. If you do that w/o breaking it, that'll be OK to
> me, but I don't see why it should be related to this particular patch
> set.

Okay, I see your concern.

I'm not trying to change the behavior, just the implementation,
because it's too complex for the functionality it actually provides.

And the reason it's part of this patch set is because I'm using the
same code to hook into the memory accounting, so it makes sense to
refactor this stuff in the same go.

There is also a niceness factor of not adding more memcg callbacks to
the networking subsystem when there is an option to consolidate them.

Now, you mentioned that you'd rather see the socket buffers accounted
at the allocator level, but I looked at the different allocation paths
and network protocols and I'm not convinced that this makes sense. We
don't want to be in the hotpath of every single packet when a lot of
them are small, short-lived management blips that don't involve user
space to let the kernel dispose of them.

__sk_mem_schedule() on the other hand is already wired up to exactly
those consumers we are interested in for memory isolation: those with
bigger chunks of data attached to them and those that have exploding
receive queues when userspace fails to read(). UDP and TCP.

I mean, there is a reason why the global memory limits apply to only
those types of packets in the first place: everything else is noise.

I agree that it's appealing to account at the allocator level and set
page->mem_cgroup etc. but in this case we'd pay extra to capture a lot
of noise, and I don't want to pay that just for aesthetics. In this
case it's better to track ownership on the socket level and only count
packets that can accumulate a significant amount of memory consumed.

> > We tried using the per-memcg tcp limits, and that prevents the OOMs
> > for sure, but it's horrendous for network performance. There is no
> > "stop growing" phase, it just keeps going full throttle until it hits
> > the wall hard.
> > 
> > Now, we could probably try to replicate the global knobs and add a
> > per-memcg soft limit. But you know better than anyone else how hard it
> > is to estimate the overall workingset size of a workload, and the
> > margins on containerized loads are razor-thin. Performance is much
> > more sensitive to input errors, and often times parameters must be
> > adjusted continuously during the runtime of a workload. It'd be
> > disasterous to rely on yet more static, error-prone user input here.
> 
> Yeah, but the dynamic approach proposed in your patch set doesn't
> guarantee we won't hit OOM in memcg due to overgrown buffers. It just
> reduces this possibility. Of course, memcg OOM is far not as disastrous
> as the global one, but still it usually means the workload breakage.

Right now, the entire machine breaks. Confining it to a faulty memcg,
as well as reducing the likelihood of that OOM in many cases seems
like a good move in the right direction, no?

And how likely are memcg OOMs because of this anyway? There is of
course a scenario imaginable where the packets pile up, followed by
some *other* part of the workload, the one that doesn't read() and
process packets, trying to expand--which then doesn't work and goes
OOM. But that seems like a complete corner case. In the vast majority
of cases, the 

Re: [PATCH 0/8] mm: memcontrol: account socket memory in unified hierarchy

2015-10-29 Thread Vladimir Davydov
On Wed, Oct 28, 2015 at 11:58:10AM -0700, Johannes Weiner wrote:
> On Wed, Oct 28, 2015 at 11:20:03AM +0300, Vladimir Davydov wrote:
> > Then you'd better not touch existing tcp limits at all, because they
> > just work, and the logic behind them is very close to that of global tcp
> > limits. I don't think one can simplify it somehow.
> 
> Uhm, no, there is a crapload of boilerplate code and complication that
> seems entirely unnecessary. The only thing missing from my patch seems
> to be the part where it enters memory pressure state when the limit is
> hit. I'm adding this for completeness, but I doubt it even matters.
> 
> > Moreover, frankly I still have my reservations about this vmpressure
> > propagation to skb you're proposing. It might work, but I doubt it
> > will allow us to throw away explicit tcp limit, as I explained
> > previously. So, even with your approach I think we can still need
> > per memcg tcp limit *unless* you get rid of global tcp limit
> > somehow.
> 
> Having the hard limit as a failsafe (or a minimum for other consumers)
> is one thing, and certainly something I'm open to for cgroupv2, should
> we have problems with load startup up after a socket memory landgrab.
> 
> That being said, if the VM is struggling to reclaim pages, or is even
> swapping, it makes perfect sense to let the socket memory scheduler
> know it shouldn't continue to increase its footprint until the VM
> recovers. Regardless of any hard limitations/minimum guarantees.
> 
> This is what my patch does and it seems pretty straight-forward to
> me. I don't really understand why this is so controversial.

I'm not arguing that the idea behind this patch set is necessarily bad.
Quite the contrary, it does look interesting to me. I'm just saying that
IMO it can't replace hard/soft limits. It probably could if it was
possible to shrink buffers, but I don't think it's feasible, even
theoretically. That's why I propose not to change the behavior of the
existing per memcg tcp limit at all. And frankly I don't get why you are
so keen on simplifying it. You say it's a "crapload of boilerplate
code". Well, I don't see how it is - it just replicates global knobs and
I don't see how it could be done in a better way. The code is hidden
behind jump labels, so the overhead is zero if it isn't used. If you
really dislike this code, we can isolate it under a separate config
option. But all right, I don't rule out the possibility that the code
could be simplified. If you do that w/o breaking it, that'll be OK to
me, but I don't see why it should be related to this particular patch
set.

> 
> The *next* step would be to figure out whether we can actually
> *reclaim* memory in the network subsystem--shrink windows and steal
> buffers back--and that might even be an avenue to replace tcp window
> limits. But it's not necessary for *this* patch series to be useful.

Again, I don't think we can *reclaim* network memory, but you're right.

> 
> > > So this seemed like a good way to prove a new mechanism before rolling
> > > it out to every single Linux setup, rather than switch everybody over
> > > after the limited scope testing I can do as a developer on my own.
> > > 
> > > Keep in mind that my patches are not committing anything in terms of
> > > interface, so we retain all the freedom to fix and tune the way this
> > > is implemented, including the freedom to re-add tcp window limits in
> > > case the pressure balancing is not a comprehensive solution.
> > 
> > I really dislike this kind of proof. It looks like you're trying to
> > push something you think is right covertly, w/o having a proper
> > discussion with networking people and then say that it just works
> > and hence should be done globally, but what if it won't? Revert it?
> > We already have a lot of dubious stuff in memcg that should be
> > reverted, so let's please try to avoid this kind of mistakes in
> > future. Note, I say "w/o having a proper discussion with networking
> > people", because I don't think they will really care *unless* you
> > change the global logic, simply because most of them aren't very
> > interested in memcg AFAICS.
> 
> Come on, Dave is the first To and netdev is CC'd. They might not care
> about memcg, but "pushing things covertly" is a bit of a stretch.

Sorry if it sounded rude to you. I just look back at my experience
patching slab internals to make kmem accountable, and AFAICS Christoph
didn't really care about *what* I was doing, he only cared about the
global case - if there was no performance degradation when kmemcg was
disabled, he was usually fine with it, even if from the memcg pov it was
a crap.

Anyway, I can't force you to patch the global case first or
simultaneously with the memcg case, so let's just hope I'm a bit too
overcautious.

> 
> > That effectively means you loose a chance to listen to networking
> > experts, who could point you at design flaws and propose an improvement
> > right away. Let's please not miss such 

Re: [PATCH 0/8] mm: memcontrol: account socket memory in unified hierarchy

2015-10-28 Thread Vladimir Davydov
On Tue, Oct 27, 2015 at 09:01:08AM -0700, Johannes Weiner wrote:
...
> > > But regardless of tcp window control, we need to account socket memory
> > > in the main memory accounting pool where pressure is shared (to the
> > > best of our abilities) between all accounted memory consumers.
> > > 
> > 
> > No objections to this point. However, I really don't like the idea to
> > charge tcp window size to memory.current instead of charging individual
> > pages consumed by the workload for storing socket buffers, because it is
> > inconsistent with what we have now. Can't we charge individual skb pages
> > as we do in case of other kmem allocations?
> 
> Absolutely, both work for me. I chose that route because it's where
> the networking code already tracks and accounts memory consumed, so it
> seemed like a better site to hook into.
> 
> But I understand your concerns. We want to track this stuff as close
> to the memory allocators as possible.

Exactly.

> 
> > > But also, there are people right now for whom the socket buffers cause
> > > system OOM, but the existing memcg's hard tcp window limitq that
> > > exists absolutely wrecks network performance for them. It's not usable
> > > the way it is. It'd be much better to have the socket buffers exert
> > > pressure on the shared pool, and then propagate the overall pressure
> > > back to individual consumers with reclaim, shrinkers, vmpressure etc.
> > 
> > This might or might not work. I'm not an expert to judge. But if you do
> > this only for memcg leaving the global case as it is, networking people
> > won't budge IMO. So could you please start such a major rework from the
> > global case? Could you please try to deprecate the tcp window limits not
> > only in the legacy memcg hierarchy, but also system-wide in order to
> > attract attention of networking experts?
> 
> I'm definitely interested in addressing this globally as well.
> 
> The idea behind this was to use the memcg part as a testbed. cgroup2
> is going to be new and people are prepared for hiccups when migrating
> their applications to it; and they can roll back to cgroup1 and tcp
> window limits at any time should they run into problems in production.

Then you'd better not touch existing tcp limits at all, because they
just work, and the logic behind them is very close to that of global tcp
limits. I don't think one can simplify it somehow. Moreover, frankly I
still have my reservations about this vmpressure propagation to skb
you're proposing. It might work, but I doubt it will allow us to throw
away explicit tcp limit, as I explained previously. So, even with your
approach I think we can still need per memcg tcp limit *unless* you get
rid of global tcp limit somehow.

> 
> So this seemed like a good way to prove a new mechanism before rolling
> it out to every single Linux setup, rather than switch everybody over
> after the limited scope testing I can do as a developer on my own.
> 
> Keep in mind that my patches are not committing anything in terms of
> interface, so we retain all the freedom to fix and tune the way this
> is implemented, including the freedom to re-add tcp window limits in
> case the pressure balancing is not a comprehensive solution.
> 

I really dislike this kind of proof. It looks like you're trying to push
something you think is right covertly, w/o having a proper discussion
with networking people and then say that it just works and hence should
be done globally, but what if it won't? Revert it? We already have a lot
of dubious stuff in memcg that should be reverted, so let's please try
to avoid this kind of mistakes in future. Note, I say "w/o having a
proper discussion with networking people", because I don't think they
will really care *unless* you change the global logic, simply because
most of them aren't very interested in memcg AFAICS.

That effectively means you loose a chance to listen to networking
experts, who could point you at design flaws and propose an improvement
right away. Let's please not miss such an opportunity. You said that
you'd seen this problem happen w/o cgroups, so you have a use case that
might need fixing at the global level. IMO it shouldn't be difficult to
prepare an RFC patch for the global case first and see what people think
about it.

Thanks,
Vladimir
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/8] mm: memcontrol: account socket memory in unified hierarchy

2015-10-28 Thread Johannes Weiner
On Wed, Oct 28, 2015 at 11:20:03AM +0300, Vladimir Davydov wrote:
> Then you'd better not touch existing tcp limits at all, because they
> just work, and the logic behind them is very close to that of global tcp
> limits. I don't think one can simplify it somehow.

Uhm, no, there is a crapload of boilerplate code and complication that
seems entirely unnecessary. The only thing missing from my patch seems
to be the part where it enters memory pressure state when the limit is
hit. I'm adding this for completeness, but I doubt it even matters.

> Moreover, frankly I still have my reservations about this vmpressure
> propagation to skb you're proposing. It might work, but I doubt it
> will allow us to throw away explicit tcp limit, as I explained
> previously. So, even with your approach I think we can still need
> per memcg tcp limit *unless* you get rid of global tcp limit
> somehow.

Having the hard limit as a failsafe (or a minimum for other consumers)
is one thing, and certainly something I'm open to for cgroupv2, should
we have problems with load startup up after a socket memory landgrab.

That being said, if the VM is struggling to reclaim pages, or is even
swapping, it makes perfect sense to let the socket memory scheduler
know it shouldn't continue to increase its footprint until the VM
recovers. Regardless of any hard limitations/minimum guarantees.

This is what my patch does and it seems pretty straight-forward to
me. I don't really understand why this is so controversial.

The *next* step would be to figure out whether we can actually
*reclaim* memory in the network subsystem--shrink windows and steal
buffers back--and that might even be an avenue to replace tcp window
limits. But it's not necessary for *this* patch series to be useful.

> > So this seemed like a good way to prove a new mechanism before rolling
> > it out to every single Linux setup, rather than switch everybody over
> > after the limited scope testing I can do as a developer on my own.
> > 
> > Keep in mind that my patches are not committing anything in terms of
> > interface, so we retain all the freedom to fix and tune the way this
> > is implemented, including the freedom to re-add tcp window limits in
> > case the pressure balancing is not a comprehensive solution.
> 
> I really dislike this kind of proof. It looks like you're trying to
> push something you think is right covertly, w/o having a proper
> discussion with networking people and then say that it just works
> and hence should be done globally, but what if it won't? Revert it?
> We already have a lot of dubious stuff in memcg that should be
> reverted, so let's please try to avoid this kind of mistakes in
> future. Note, I say "w/o having a proper discussion with networking
> people", because I don't think they will really care *unless* you
> change the global logic, simply because most of them aren't very
> interested in memcg AFAICS.

Come on, Dave is the first To and netdev is CC'd. They might not care
about memcg, but "pushing things covertly" is a bit of a stretch.

> That effectively means you loose a chance to listen to networking
> experts, who could point you at design flaws and propose an improvement
> right away. Let's please not miss such an opportunity. You said that
> you'd seen this problem happen w/o cgroups, so you have a use case that
> might need fixing at the global level. IMO it shouldn't be difficult to
> prepare an RFC patch for the global case first and see what people think
> about it.

No, the problem we are running into is when network memory is not
tracked per cgroup. The lack of containment means that the socket
memory consumption of individual cgroups can trigger system OOM.

We tried using the per-memcg tcp limits, and that prevents the OOMs
for sure, but it's horrendous for network performance. There is no
"stop growing" phase, it just keeps going full throttle until it hits
the wall hard.

Now, we could probably try to replicate the global knobs and add a
per-memcg soft limit. But you know better than anyone else how hard it
is to estimate the overall workingset size of a workload, and the
margins on containerized loads are razor-thin. Performance is much
more sensitive to input errors, and often times parameters must be
adjusted continuously during the runtime of a workload. It'd be
disasterous to rely on yet more static, error-prone user input here.

What all this means to me is that fixing it on the cgroup level has
higher priority. But it also means that once we figured it out under
such a high-pressure environment, it's much easier to apply to the
global case and potentially replace the soft limit there.

This seems like a better approach to me than starting globally, only
to realize that the solution is not workable for cgroups and we need
yet something else.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  

Re: [PATCH 0/8] mm: memcontrol: account socket memory in unified hierarchy

2015-10-27 Thread Vladimir Davydov
On Mon, Oct 26, 2015 at 01:22:16PM -0400, Johannes Weiner wrote:
> On Thu, Oct 22, 2015 at 09:45:10PM +0300, Vladimir Davydov wrote:
> > Hi Johannes,
> > 
> > On Thu, Oct 22, 2015 at 12:21:28AM -0400, Johannes Weiner wrote:
> > ...
> > > Patch #5 adds accounting and tracking of socket memory to the unified
> > > hierarchy memory controller, as described above. It uses the existing
> > > per-cpu charge caches and triggers high limit reclaim asynchroneously.
> > > 
> > > Patch #8 uses the vmpressure extension to equalize pressure between
> > > the pages tracked natively by the VM and socket buffer pages. As the
> > > pool is shared, it makes sense that while natively tracked pages are
> > > under duress the network transmit windows are also not increased.
> > 
> > First of all, I've no experience in networking, so I'm likely to be
> > mistaken. Nevertheless I beg to disagree that this patch set is a step
> > in the right direction. Here goes why.
> > 
> > I admit that your idea to get rid of explicit tcp window control knobs
> > and size it dynamically basing on memory pressure instead does sound
> > tempting, but I don't think it'd always work. The problem is that in
> > contrast to, say, dcache, we can't shrink tcp buffers AFAIU, we can only
> > stop growing them. Now suppose a system hasn't experienced memory
> > pressure for a while. If we don't have explicit tcp window limit, tcp
> > buffers on such a system might have eaten almost all available memory
> > (because of network load/problems). If a user workload that needs a
> > significant amount of memory is started suddenly then, the network code
> > will receive a notification and surely stop growing buffers, but all
> > those buffers accumulated won't disappear instantly. As a result, the
> > workload might be unable to find enough free memory and have no choice
> > but invoke OOM killer. This looks unexpected from the user POV.
> 
> I'm not getting rid of those knobs, I'm just reusing the old socket
> accounting infrastructure in an attempt to make the memory accounting
> feature useful to more people in cgroups v2 (unified hierarchy).
> 

My understanding is that in the meantime you effectively break the
existing per memcg tcp window control logic.

> We can always come back to think about per-cgroup tcp window limits in
> the unified hierarchy, my patches don't get in the way of this. I'm
> not removing the knobs in cgroups v1 and I'm not preventing them in v2.
> 
> But regardless of tcp window control, we need to account socket memory
> in the main memory accounting pool where pressure is shared (to the
> best of our abilities) between all accounted memory consumers.
> 

No objections to this point. However, I really don't like the idea to
charge tcp window size to memory.current instead of charging individual
pages consumed by the workload for storing socket buffers, because it is
inconsistent with what we have now. Can't we charge individual skb pages
as we do in case of other kmem allocations?

> From an interface standpoint alone, I don't think it's reasonable to
> ask users per default to limit different consumers on a case by case
> basis. I certainly have no problem with finetuning for scenarios you
> describe above, but with memory.current, memory.high, memory.max we
> are providing a generic interface to account and contain memory
> consumption of workloads. This has to include all major memory
> consumers to make semantical sense.

We can propose a reasonable default as we do in the global case.

> 
> But also, there are people right now for whom the socket buffers cause
> system OOM, but the existing memcg's hard tcp window limitq that
> exists absolutely wrecks network performance for them. It's not usable
> the way it is. It'd be much better to have the socket buffers exert
> pressure on the shared pool, and then propagate the overall pressure
> back to individual consumers with reclaim, shrinkers, vmpressure etc.
> 

This might or might not work. I'm not an expert to judge. But if you do
this only for memcg leaving the global case as it is, networking people
won't budge IMO. So could you please start such a major rework from the
global case? Could you please try to deprecate the tcp window limits not
only in the legacy memcg hierarchy, but also system-wide in order to
attract attention of networking experts?

Thanks,
Vladimir
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/8] mm: memcontrol: account socket memory in unified hierarchy

2015-10-27 Thread Johannes Weiner
On Tue, Oct 27, 2015 at 11:43:21AM +0300, Vladimir Davydov wrote:
> On Mon, Oct 26, 2015 at 01:22:16PM -0400, Johannes Weiner wrote:
> > I'm not getting rid of those knobs, I'm just reusing the old socket
> > accounting infrastructure in an attempt to make the memory accounting
> > feature useful to more people in cgroups v2 (unified hierarchy).
> 
> My understanding is that in the meantime you effectively break the
> existing per memcg tcp window control logic.

That's not my intention, this stuff has to keep working. I'm assuming
you mean the changes to sk_enter_memory_pressure() when hitting the
charge limit; let me address this in the other subthread.

> > We can always come back to think about per-cgroup tcp window limits in
> > the unified hierarchy, my patches don't get in the way of this. I'm
> > not removing the knobs in cgroups v1 and I'm not preventing them in v2.
> > 
> > But regardless of tcp window control, we need to account socket memory
> > in the main memory accounting pool where pressure is shared (to the
> > best of our abilities) between all accounted memory consumers.
> > 
> 
> No objections to this point. However, I really don't like the idea to
> charge tcp window size to memory.current instead of charging individual
> pages consumed by the workload for storing socket buffers, because it is
> inconsistent with what we have now. Can't we charge individual skb pages
> as we do in case of other kmem allocations?

Absolutely, both work for me. I chose that route because it's where
the networking code already tracks and accounts memory consumed, so it
seemed like a better site to hook into.

But I understand your concerns. We want to track this stuff as close
to the memory allocators as possible.

> > But also, there are people right now for whom the socket buffers cause
> > system OOM, but the existing memcg's hard tcp window limitq that
> > exists absolutely wrecks network performance for them. It's not usable
> > the way it is. It'd be much better to have the socket buffers exert
> > pressure on the shared pool, and then propagate the overall pressure
> > back to individual consumers with reclaim, shrinkers, vmpressure etc.
> 
> This might or might not work. I'm not an expert to judge. But if you do
> this only for memcg leaving the global case as it is, networking people
> won't budge IMO. So could you please start such a major rework from the
> global case? Could you please try to deprecate the tcp window limits not
> only in the legacy memcg hierarchy, but also system-wide in order to
> attract attention of networking experts?

I'm definitely interested in addressing this globally as well.

The idea behind this was to use the memcg part as a testbed. cgroup2
is going to be new and people are prepared for hiccups when migrating
their applications to it; and they can roll back to cgroup1 and tcp
window limits at any time should they run into problems in production.

So this seemed like a good way to prove a new mechanism before rolling
it out to every single Linux setup, rather than switch everybody over
after the limited scope testing I can do as a developer on my own.

Keep in mind that my patches are not committing anything in terms of
interface, so we retain all the freedom to fix and tune the way this
is implemented, including the freedom to re-add tcp window limits in
case the pressure balancing is not a comprehensive solution.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/8] mm: memcontrol: account socket memory in unified hierarchy

2015-10-26 Thread Johannes Weiner
On Thu, Oct 22, 2015 at 09:45:10PM +0300, Vladimir Davydov wrote:
> Hi Johannes,
> 
> On Thu, Oct 22, 2015 at 12:21:28AM -0400, Johannes Weiner wrote:
> ...
> > Patch #5 adds accounting and tracking of socket memory to the unified
> > hierarchy memory controller, as described above. It uses the existing
> > per-cpu charge caches and triggers high limit reclaim asynchroneously.
> > 
> > Patch #8 uses the vmpressure extension to equalize pressure between
> > the pages tracked natively by the VM and socket buffer pages. As the
> > pool is shared, it makes sense that while natively tracked pages are
> > under duress the network transmit windows are also not increased.
> 
> First of all, I've no experience in networking, so I'm likely to be
> mistaken. Nevertheless I beg to disagree that this patch set is a step
> in the right direction. Here goes why.
> 
> I admit that your idea to get rid of explicit tcp window control knobs
> and size it dynamically basing on memory pressure instead does sound
> tempting, but I don't think it'd always work. The problem is that in
> contrast to, say, dcache, we can't shrink tcp buffers AFAIU, we can only
> stop growing them. Now suppose a system hasn't experienced memory
> pressure for a while. If we don't have explicit tcp window limit, tcp
> buffers on such a system might have eaten almost all available memory
> (because of network load/problems). If a user workload that needs a
> significant amount of memory is started suddenly then, the network code
> will receive a notification and surely stop growing buffers, but all
> those buffers accumulated won't disappear instantly. As a result, the
> workload might be unable to find enough free memory and have no choice
> but invoke OOM killer. This looks unexpected from the user POV.

I'm not getting rid of those knobs, I'm just reusing the old socket
accounting infrastructure in an attempt to make the memory accounting
feature useful to more people in cgroups v2 (unified hierarchy).

We can always come back to think about per-cgroup tcp window limits in
the unified hierarchy, my patches don't get in the way of this. I'm
not removing the knobs in cgroups v1 and I'm not preventing them in v2.

But regardless of tcp window control, we need to account socket memory
in the main memory accounting pool where pressure is shared (to the
best of our abilities) between all accounted memory consumers.

>From an interface standpoint alone, I don't think it's reasonable to
ask users per default to limit different consumers on a case by case
basis. I certainly have no problem with finetuning for scenarios you
describe above, but with memory.current, memory.high, memory.max we
are providing a generic interface to account and contain memory
consumption of workloads. This has to include all major memory
consumers to make semantical sense.

But also, there are people right now for whom the socket buffers cause
system OOM, but the existing memcg's hard tcp window limitq that
exists absolutely wrecks network performance for them. It's not usable
the way it is. It'd be much better to have the socket buffers exert
pressure on the shared pool, and then propagate the overall pressure
back to individual consumers with reclaim, shrinkers, vmpressure etc.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/8] mm: memcontrol: account socket memory in unified hierarchy

2015-10-22 Thread Vladimir Davydov
Hi Johannes,

On Thu, Oct 22, 2015 at 12:21:28AM -0400, Johannes Weiner wrote:
...
> Patch #5 adds accounting and tracking of socket memory to the unified
> hierarchy memory controller, as described above. It uses the existing
> per-cpu charge caches and triggers high limit reclaim asynchroneously.
> 
> Patch #8 uses the vmpressure extension to equalize pressure between
> the pages tracked natively by the VM and socket buffer pages. As the
> pool is shared, it makes sense that while natively tracked pages are
> under duress the network transmit windows are also not increased.

First of all, I've no experience in networking, so I'm likely to be
mistaken. Nevertheless I beg to disagree that this patch set is a step
in the right direction. Here goes why.

I admit that your idea to get rid of explicit tcp window control knobs
and size it dynamically basing on memory pressure instead does sound
tempting, but I don't think it'd always work. The problem is that in
contrast to, say, dcache, we can't shrink tcp buffers AFAIU, we can only
stop growing them. Now suppose a system hasn't experienced memory
pressure for a while. If we don't have explicit tcp window limit, tcp
buffers on such a system might have eaten almost all available memory
(because of network load/problems). If a user workload that needs a
significant amount of memory is started suddenly then, the network code
will receive a notification and surely stop growing buffers, but all
those buffers accumulated won't disappear instantly. As a result, the
workload might be unable to find enough free memory and have no choice
but invoke OOM killer. This looks unexpected from the user POV.

That said, I think we do need per memcg tcp window control similar to
what we have system-wide. In other words, Glauber's work makes sense to
me. You might want to point me at my RFC patch where I proposed to
revert it (https://lkml.org/lkml/2014/9/12/401). Well, I've changed my
mind since then. Now I think I was mistaken, luckily I was stopped.
However, I may be mistaken again :-)

Thanks,
Vladimir
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 0/8] mm: memcontrol: account socket memory in unified hierarchy

2015-10-21 Thread Johannes Weiner
Hi,

this series adds socket buffer memory tracking and accounting to the
unified hierarchy memory cgroup controller.

[ Networking people, at this time please check the diffstat below to
  avoid going into convulsions. ]

Socket buffer memory can make up a significant share of a workload's
memory footprint, and so it needs to be accounted and tracked out of
the box, along with other types of memory that can be directly linked
to userspace activity, in order to provide useful resource isolation.

Historically, socket buffers were accounted in a separate counter,
without any pressure equalization between anonymous memory, page
cache, and the socket buffers. When the socket buffer pool was
exhausted, buffer allocations would fail hard and cause network
performance to tank, regardless of whether there was still memory
available to the group or not. Likewise, struggling anonymous or cache
workingsets could not dip into an idle socket memory pool. Because of
this, the feature was not usable for many real life applications.

To not repeat this mistake, the new memory controller will account all
types of memory pages it is tracking on behalf of a cgroup in a single
pool. And upon pressure, the VM reclaims and shrinks whatever memory
in that pool is within its reach.

These patches add accounting for memory consumed by sockets associated
with a cgroup to the existing pool of anonymous pages and page cache.

Patch #3 reworks the existing memcg socket infrastructure. It has many
provisions for future plans that won't materialize, and much of this
simply evaporates. The networking people should be happy about this.

Patch #5 adds accounting and tracking of socket memory to the unified
hierarchy memory controller, as described above. It uses the existing
per-cpu charge caches and triggers high limit reclaim asynchroneously.

Patch #8 uses the vmpressure extension to equalize pressure between
the pages tracked natively by the VM and socket buffer pages. As the
pool is shared, it makes sense that while natively tracked pages are
under duress the network transmit windows are also not increased.

As per above, this is an essential part of the new memory controller's
core functionality. With the unified hierarchy nearing release, please
consider this for 4.4.

 include/linux/memcontrol.h   |  90 +---
 include/linux/page_counter.h |   6 +-
 include/net/sock.h   | 139 ++--
 include/net/tcp.h|   5 +-
 include/net/tcp_memcontrol.h |   7 --
 mm/backing-dev.c |   2 +-
 mm/hugetlb_cgroup.c  |   3 +-
 mm/memcontrol.c  | 235 ++---
 mm/page_counter.c|  14 +--
 mm/vmpressure.c  |  29 -
 mm/vmscan.c  |  41 +++
 net/core/sock.c  |  78 --
 net/ipv4/sysctl_net_ipv4.c   |   1 -
 net/ipv4/tcp.c   |   3 +-
 net/ipv4/tcp_ipv4.c  |   9 +-
 net/ipv4/tcp_memcontrol.c| 147 --
 net/ipv4/tcp_output.c|   6 +-
 net/ipv6/tcp_ipv6.c  |   3 -
 18 files changed, 319 insertions(+), 499 deletions(-)

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html