[PATCH 0/8] mm: memcontrol: account socket memory in unified hierarchy v2
Hi, this is version 2 of the patches to add socket memory accounting to
the unified hierarchy memory controller. Changes from v1 include:

- No accounting overhead unless a dedicated cgroup is created and the
  memory controller instructed to track that group's memory footprint.
  Distribution kernels enable CONFIG_MEMCG, and users (incl. systemd)
  might create cgroups only for process control or resources other than
  memory. As noted by David and Michal, these setups shouldn't pay any
  overhead for this.

- Continue to enter the socket pressure state when hitting the memory
  controller's hard limit. Vladimir noted that there is at least some
  value in telling other sockets in the cgroup to not increase their
  transmit windows when one of them is already dropping packets.

- Drop the controversial vmpressure rework. Instead of changing the
  level where pressure is noted, keep noting pressure in its origin and
  then make the pressure check hierarchical. As noted by Michal and
  Vladimir, we shouldn't risk changing user-visible behavior.

---

Socket buffer memory can make up a significant share of a workload's
memory footprint that can be directly linked to userspace activity,
and so it needs to be part of the memory controller to provide proper
resource isolation/containment.

Historically, socket buffers were accounted in a separate counter,
without any pressure equalization between anonymous memory, page
cache, and the socket buffers. When the socket buffer pool was
exhausted, buffer allocations would fail hard and cause network
performance to tank, regardless of whether there was still memory
available to the group or not. Likewise, struggling anonymous or cache
workingsets could not dip into an idle socket memory pool. Because of
this, the feature was not usable for many real life applications.

To not repeat this mistake, the new memory controller will account all
types of memory pages it is tracking on behalf of a cgroup in a single
pool. Upon pressure, the VM reclaims and shrinks and puts pressure on
whatever memory consumer in that pool is within its reach. For socket
memory, pressure feedback is provided through vmpressure events. When
the VM has trouble freeing memory, the network code is instructed to
stop growing the cgroup's transmit windows.

---

This series begins with a rework of the existing tcp memory controller
that simplifies and cleans up the code while allowing us to have only
one set of networking hooks for both memory controller versions. The
original behavior of the existing tcp controller should be preserved.

It then adds socket accounting to the v2 memory controller, including
the use of the per-cpu charge cache and async memory.high enforcement
from socket memory charges.

Lastly, vmpressure is hooked up to the socket code so that it stops
growing transmit windows when the VM has trouble reclaiming memory.

 include/linux/memcontrol.h   |  98 ---
 include/linux/page_counter.h |   6 +-
 include/net/sock.h           | 137 ++---
 include/net/tcp.h            |   5 +-
 include/net/tcp_memcontrol.h |   7 --
 mm/backing-dev.c             |   2 +-
 mm/hugetlb_cgroup.c          |   3 +-
 mm/memcontrol.c              | 262 +
 mm/page_counter.c            |  14 +--
 mm/vmpressure.c              |  25 +++-
 mm/vmscan.c                  |  31 ++---
 net/core/sock.c              |  78 +++-
 net/ipv4/sysctl_net_ipv4.c   |   1 -
 net/ipv4/tcp.c               |   3 +-
 net/ipv4/tcp_ipv4.c          |   9 +-
 net/ipv4/tcp_memcontrol.c    | 147 ---
 net/ipv4/tcp_output.c        |   6 +-
 net/ipv6/tcp_ipv6.c          |   3 -
 18 files changed, 328 insertions(+), 509 deletions(-)

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/8] mm: memcontrol: account socket memory in unified hierarchy
On Thu, Oct 29, 2015 at 10:52:28AM -0700, Johannes Weiner wrote:
...
> Now, you mentioned that you'd rather see the socket buffers accounted
> at the allocator level, but I looked at the different allocation paths
> and network protocols and I'm not convinced that this makes sense. We
> don't want to be in the hotpath of every single packet when a lot of
> them are small, short-lived management blips that don't involve user
> space to let the kernel dispose of them.
>
> __sk_mem_schedule() on the other hand is already wired up to exactly
> those consumers we are interested in for memory isolation: those with
> bigger chunks of data attached to them and those that have exploding
> receive queues when userspace fails to read(). UDP and TCP.
>
> I mean, there is a reason why the global memory limits apply to only
> those types of packets in the first place: everything else is noise.
>
> I agree that it's appealing to account at the allocator level and set
> page->mem_cgroup etc., but in this case we'd pay extra to capture a lot
> of noise, and I don't want to pay that just for aesthetics. In this
> case it's better to track ownership on the socket level and only count
> packets that can accumulate a significant amount of memory consumed.

Sigh, you seem to be right. Moreover, I can't even think of a neat way
to account skb pages to memcg, because rcv skbs are generated in device
drivers, where we don't know which socket/memcg they will go to. We
could recharge individual pages when the skb gets to the network or
transport layer, but it would result in unjustified overhead.

> > > We tried using the per-memcg tcp limits, and that prevents the OOMs
> > > for sure, but it's horrendous for network performance. There is no
> > > "stop growing" phase, it just keeps going full throttle until it
> > > hits the wall hard.
> > >
> > > Now, we could probably try to replicate the global knobs and add a
> > > per-memcg soft limit. But you know better than anyone else how hard
> > > it is to estimate the overall workingset size of a workload, and the
> > > margins on containerized loads are razor-thin. Performance is much
> > > more sensitive to input errors, and oftentimes parameters must be
> > > adjusted continuously during the runtime of a workload. It'd be
> > > disastrous to rely on yet more static, error-prone user input here.
> >
> > Yeah, but the dynamic approach proposed in your patch set doesn't
> > guarantee we won't hit OOM in memcg due to overgrown buffers. It just
> > reduces this possibility. Of course, a memcg OOM is not nearly as
> > disastrous as a global one, but it still usually means workload
> > breakage.
>
> Right now, the entire machine breaks. Confining it to a faulty memcg,
> as well as reducing the likelihood of that OOM in many cases, seems
> like a good move in the right direction, no?

It does seem so. However, a memcg OOM is also bad; we should strive to
avoid it if we can.

> And how likely are memcg OOMs because of this anyway? There is of

Frankly, I've no idea. Your arguments below sound reassuring though.

> course a scenario imaginable where the packets pile up, followed by
> some *other* part of the workload, the one that doesn't read() and
> process packets, trying to expand--which then doesn't work and goes
> OOM. But that seems like a complete corner case. In the vast majority
> of cases, the application will be in full operation and just fail to
> read() fast enough--because the network bandwidth is enormous compared
> to the container's size, or because it shares the CPU with thousands
> of other workloads and there is scheduling latency.
>
> This would be the perfect point to rein in the transmit window...
>
> > The static approach is error-prone for sure, but it has existed for
> > years and worked satisfactorily AFAIK.
>
> ...but that point is not a fixed amount of memory consumed. It depends
> on the workload and the random interactions it's having with thousands
> of other containers on that same machine.
>
> The point of containers is to maximize utilization of your hardware
> and systematically eliminate slack in the system. But it's exactly
> that slack on dedicated bare-metal machines that allowed us to take a
> wild guess at the settings and then tune them based on observing a
> handful of workloads. This approach is not going to work anymore when
> we pack the machine to capacity and still expect every single
> container out of thousands to perform well. We need that automation.

But we do use a static approach when setting memory limits, no?
memory.{low,high,max} are all static. I understand it's appealing to
have just one knob - memory size - like in the case of virtual
machines, but it doesn't seem to work with containers. You added the
memory.low and memory.high knobs; VMs don't have anything like that.
How is one supposed to set them? It depends on the workload, I guess.
Also, there is the pids cgroup for limiting the number of pids that can
be used by a cgroup, because
Re: [PATCH 0/8] mm: memcontrol: account socket memory in unified hierarchy
On Thu, Oct 29, 2015 at 12:27:47PM +0300, Vladimir Davydov wrote:
> On Wed, Oct 28, 2015 at 11:58:10AM -0700, Johannes Weiner wrote:
> > Having the hard limit as a failsafe (or a minimum for other
> > consumers) is one thing, and certainly something I'm open to for
> > cgroupv2, should we have problems with load startup after a socket
> > memory landgrab.
> >
> > That being said, if the VM is struggling to reclaim pages, or is even
> > swapping, it makes perfect sense to let the socket memory scheduler
> > know it shouldn't continue to increase its footprint until the VM
> > recovers. Regardless of any hard limitations/minimum guarantees.
> >
> > This is what my patch does and it seems pretty straightforward to
> > me. I don't really understand why this is so controversial.
>
> I'm not arguing that the idea behind this patch set is necessarily bad.
> Quite the contrary, it does look interesting to me. I'm just saying
> that IMO it can't replace hard/soft limits. It probably could if it was
> possible to shrink buffers, but I don't think that's feasible, even
> theoretically. That's why I propose not to change the behavior of the
> existing per-memcg tcp limit at all. And frankly I don't get why you
> are so keen on simplifying it. You say it's a "crapload of boilerplate
> code". Well, I don't see how it is - it just replicates global knobs
> and I don't see how it could be done in a better way. The code is
> hidden behind jump labels, so the overhead is zero if it isn't used.
> If you really dislike this code, we can isolate it under a separate
> config option. But all right, I don't rule out the possibility that
> the code could be simplified. If you do that w/o breaking it, that'll
> be OK with me, but I don't see why it should be related to this
> particular patch set.

Okay, I see your concern. I'm not trying to change the behavior, just
the implementation, because it's too complex for the functionality it
actually provides. And the reason it's part of this patch set is that
I'm using the same code to hook into the memory accounting, so it makes
sense to refactor this stuff in the same go. There is also a niceness
factor of not adding more memcg callbacks to the networking subsystem
when there is an option to consolidate them.

Now, you mentioned that you'd rather see the socket buffers accounted
at the allocator level, but I looked at the different allocation paths
and network protocols and I'm not convinced that this makes sense. We
don't want to be in the hotpath of every single packet when a lot of
them are small, short-lived management blips that don't involve user
space to let the kernel dispose of them.

__sk_mem_schedule() on the other hand is already wired up to exactly
those consumers we are interested in for memory isolation: those with
bigger chunks of data attached to them and those that have exploding
receive queues when userspace fails to read(). UDP and TCP.

I mean, there is a reason why the global memory limits apply to only
those types of packets in the first place: everything else is noise.

I agree that it's appealing to account at the allocator level and set
page->mem_cgroup etc., but in this case we'd pay extra to capture a lot
of noise, and I don't want to pay that just for aesthetics. In this
case it's better to track ownership on the socket level and only count
packets that can accumulate a significant amount of memory consumed.

> > We tried using the per-memcg tcp limits, and that prevents the OOMs
> > for sure, but it's horrendous for network performance. There is no
> > "stop growing" phase, it just keeps going full throttle until it
> > hits the wall hard.
> >
> > Now, we could probably try to replicate the global knobs and add a
> > per-memcg soft limit. But you know better than anyone else how hard
> > it is to estimate the overall workingset size of a workload, and the
> > margins on containerized loads are razor-thin. Performance is much
> > more sensitive to input errors, and oftentimes parameters must be
> > adjusted continuously during the runtime of a workload. It'd be
> > disastrous to rely on yet more static, error-prone user input here.
>
> Yeah, but the dynamic approach proposed in your patch set doesn't
> guarantee we won't hit OOM in memcg due to overgrown buffers. It just
> reduces this possibility. Of course, a memcg OOM is not nearly as
> disastrous as a global one, but it still usually means workload
> breakage.

Right now, the entire machine breaks. Confining it to a faulty memcg,
as well as reducing the likelihood of that OOM in many cases, seems
like a good move in the right direction, no?

And how likely are memcg OOMs because of this anyway? There is of
course a scenario imaginable where the packets pile up, followed by
some *other* part of the workload, the one that doesn't read() and
process packets, trying to expand--which then doesn't work and goes
OOM. But that seems like a complete corner case. In the vast majority
of cases, the application will be in full operation and just fail to
read() fast enough--because the network bandwidth is enormous compared
to the container's size, or because it shares the CPU with thousands
of other workloads and there is scheduling latency.
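[Editor's illustration] The argument above for hooking at
__sk_mem_schedule() rather than per packet can be made concrete with a
toy userspace model. The names (mem_schedule, sk_charge, forward_alloc)
are simplified stand-ins for the kernel's sk_forward_alloc /
sk_memory_allocated machinery, not its actual API: memory is scheduled
from the shared pool in whole-page chunks, and individual skb charges
are then served from the socket's forward-alloc remainder, so the
accounting hook fires only on page-quantum boundaries instead of on
every packet.

```c
#include <assert.h>
#include <stdbool.h>

#define PAGE 4096UL

static unsigned long pool_pages;               /* shared protocol pool */
static const unsigned long pool_limit = 1024;  /* pages */

struct sock_model { unsigned long forward_alloc; /* bytes */ };

/* Slow path: charge whole pages to the shared pool (where a memcg
 * accounting hook would sit), and credit the remainder to the socket. */
static bool mem_schedule(struct sock_model *sk, unsigned long bytes)
{
	unsigned long pages = (bytes + PAGE - 1) / PAGE;

	if (pool_pages + pages > pool_limit)
		return false;
	pool_pages += pages;
	sk->forward_alloc += pages * PAGE;
	return true;
}

/* Per-skb fast path: only drops into mem_schedule() when the
 * forward-alloc remainder is exhausted. */
static bool sk_charge(struct sock_model *sk, unsigned long bytes)
{
	if (sk->forward_alloc < bytes && !mem_schedule(sk, bytes))
		return false;
	sk->forward_alloc -= bytes;
	return true;
}
```

Small management packets that fit in the remainder never touch the
shared pool at all, which is the "noise" the thread is talking about.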
Re: [PATCH 0/8] mm: memcontrol: account socket memory in unified hierarchy
On Wed, Oct 28, 2015 at 11:58:10AM -0700, Johannes Weiner wrote:
> On Wed, Oct 28, 2015 at 11:20:03AM +0300, Vladimir Davydov wrote:
> > Then you'd better not touch the existing tcp limits at all, because
> > they just work, and the logic behind them is very close to that of
> > the global tcp limits. I don't think one can simplify it somehow.
>
> Uhm, no, there is a crapload of boilerplate code and complication that
> seems entirely unnecessary. The only thing missing from my patch seems
> to be the part where it enters the memory pressure state when the
> limit is hit. I'm adding this for completeness, but I doubt it even
> matters.
>
> > Moreover, frankly I still have my reservations about this vmpressure
> > propagation to skb you're proposing. It might work, but I doubt it
> > will allow us to throw away the explicit tcp limit, as I explained
> > previously. So, even with your approach I think we may still need a
> > per-memcg tcp limit *unless* you get rid of the global tcp limit
> > somehow.
>
> Having the hard limit as a failsafe (or a minimum for other consumers)
> is one thing, and certainly something I'm open to for cgroupv2, should
> we have problems with load startup after a socket memory landgrab.
>
> That being said, if the VM is struggling to reclaim pages, or is even
> swapping, it makes perfect sense to let the socket memory scheduler
> know it shouldn't continue to increase its footprint until the VM
> recovers. Regardless of any hard limitations/minimum guarantees.
>
> This is what my patch does and it seems pretty straightforward to
> me. I don't really understand why this is so controversial.

I'm not arguing that the idea behind this patch set is necessarily bad.
Quite the contrary, it does look interesting to me. I'm just saying
that IMO it can't replace hard/soft limits. It probably could if it was
possible to shrink buffers, but I don't think that's feasible, even
theoretically. That's why I propose not to change the behavior of the
existing per-memcg tcp limit at all. And frankly I don't get why you
are so keen on simplifying it. You say it's a "crapload of boilerplate
code". Well, I don't see how it is - it just replicates global knobs
and I don't see how it could be done in a better way. The code is
hidden behind jump labels, so the overhead is zero if it isn't used.
If you really dislike this code, we can isolate it under a separate
config option. But all right, I don't rule out the possibility that
the code could be simplified. If you do that w/o breaking it, that'll
be OK with me, but I don't see why it should be related to this
particular patch set.

> The *next* step would be to figure out whether we can actually
> *reclaim* memory in the network subsystem--shrink windows and steal
> buffers back--and that might even be an avenue to replace tcp window
> limits. But it's not necessary for *this* patch series to be useful.

Again, I don't think we can *reclaim* network memory, but you're right.

> > > So this seemed like a good way to prove a new mechanism before
> > > rolling it out to every single Linux setup, rather than switch
> > > everybody over after the limited scope testing I can do as a
> > > developer on my own.
> > >
> > > Keep in mind that my patches are not committing anything in terms
> > > of interface, so we retain all the freedom to fix and tune the way
> > > this is implemented, including the freedom to re-add tcp window
> > > limits in case the pressure balancing is not a comprehensive
> > > solution.
> >
> > I really dislike this kind of proof. It looks like you're trying to
> > push something you think is right covertly, w/o having a proper
> > discussion with networking people, and then say that it just works
> > and hence should be done globally - but what if it won't? Revert it?
> > We already have a lot of dubious stuff in memcg that should be
> > reverted, so let's please try to avoid this kind of mistake in the
> > future. Note, I say "w/o having a proper discussion with networking
> > people" because I don't think they will really care *unless* you
> > change the global logic, simply because most of them aren't very
> > interested in memcg AFAICS.
>
> Come on, Dave is the first To and netdev is CC'd. They might not care
> about memcg, but "pushing things covertly" is a bit of a stretch.

Sorry if it sounded rude to you. I just look back at my experience
patching slab internals to make kmem accountable: AFAICS Christoph
didn't really care about *what* I was doing, he only cared about the
global case - if there was no performance degradation when kmemcg was
disabled, he was usually fine with it, even if from the memcg POV it
was crap. Anyway, I can't force you to patch the global case first or
simultaneously with the memcg case, so let's just hope I'm being a bit
too overcautious.

> > That effectively means you lose a chance to listen to networking
> > experts, who could point you at design flaws and propose an
> > improvement right away. Let's please not miss such an opportunity.
Re: [PATCH 0/8] mm: memcontrol: account socket memory in unified hierarchy
On Tue, Oct 27, 2015 at 09:01:08AM -0700, Johannes Weiner wrote:
...
> > > But regardless of tcp window control, we need to account socket
> > > memory in the main memory accounting pool where pressure is shared
> > > (to the best of our abilities) between all accounted memory
> > > consumers.
> >
> > No objections to this point. However, I really don't like the idea
> > to charge the tcp window size to memory.current instead of charging
> > the individual pages consumed by the workload for storing socket
> > buffers, because it is inconsistent with what we have now. Can't we
> > charge individual skb pages as we do in the case of other kmem
> > allocations?
>
> Absolutely, both work for me. I chose that route because it's where
> the networking code already tracks and accounts memory consumed, so it
> seemed like a better site to hook into.
>
> But I understand your concerns. We want to track this stuff as close
> to the memory allocators as possible.

Exactly.

> > > But also, there are people right now for whom the socket buffers
> > > cause system OOM, but the existing memcg's hard tcp window limit
> > > that exists absolutely wrecks network performance for them. It's
> > > not usable the way it is. It'd be much better to have the socket
> > > buffers exert pressure on the shared pool, and then propagate the
> > > overall pressure back to individual consumers with reclaim,
> > > shrinkers, vmpressure etc.
> >
> > This might or might not work. I'm not an expert to judge. But if you
> > do this only for memcg, leaving the global case as it is, networking
> > people won't budge IMO. So could you please start such a major
> > rework from the global case? Could you please try to deprecate the
> > tcp window limits not only in the legacy memcg hierarchy, but also
> > system-wide, in order to attract the attention of networking
> > experts?
>
> I'm definitely interested in addressing this globally as well.
>
> The idea behind this was to use the memcg part as a testbed. cgroup2
> is going to be new and people are prepared for hiccups when migrating
> their applications to it; and they can roll back to cgroup1 and tcp
> window limits at any time should they run into problems in production.

Then you'd better not touch the existing tcp limits at all, because
they just work, and the logic behind them is very close to that of the
global tcp limits. I don't think one can simplify it somehow.

Moreover, frankly I still have my reservations about this vmpressure
propagation to skb you're proposing. It might work, but I doubt it will
allow us to throw away the explicit tcp limit, as I explained
previously. So, even with your approach I think we may still need a
per-memcg tcp limit *unless* you get rid of the global tcp limit
somehow.

> So this seemed like a good way to prove a new mechanism before rolling
> it out to every single Linux setup, rather than switch everybody over
> after the limited scope testing I can do as a developer on my own.
>
> Keep in mind that my patches are not committing anything in terms of
> interface, so we retain all the freedom to fix and tune the way this
> is implemented, including the freedom to re-add tcp window limits in
> case the pressure balancing is not a comprehensive solution.

I really dislike this kind of proof. It looks like you're trying to
push something you think is right covertly, w/o having a proper
discussion with networking people, and then say that it just works and
hence should be done globally - but what if it won't? Revert it? We
already have a lot of dubious stuff in memcg that should be reverted,
so let's please try to avoid this kind of mistake in the future. Note,
I say "w/o having a proper discussion with networking people" because
I don't think they will really care *unless* you change the global
logic, simply because most of them aren't very interested in memcg
AFAICS.

That effectively means you lose a chance to listen to networking
experts, who could point you at design flaws and propose an improvement
right away. Let's please not miss such an opportunity. You said that
you'd seen this problem happen w/o cgroups, so you have a use case that
might need fixing at the global level. IMO it shouldn't be difficult to
prepare an RFC patch for the global case first and see what people
think about it.

Thanks,
Vladimir
Re: [PATCH 0/8] mm: memcontrol: account socket memory in unified hierarchy
On Wed, Oct 28, 2015 at 11:20:03AM +0300, Vladimir Davydov wrote:
> Then you'd better not touch the existing tcp limits at all, because
> they just work, and the logic behind them is very close to that of the
> global tcp limits. I don't think one can simplify it somehow.

Uhm, no, there is a crapload of boilerplate code and complication that
seems entirely unnecessary. The only thing missing from my patch seems
to be the part where it enters the memory pressure state when the limit
is hit. I'm adding this for completeness, but I doubt it even matters.

> Moreover, frankly I still have my reservations about this vmpressure
> propagation to skb you're proposing. It might work, but I doubt it
> will allow us to throw away the explicit tcp limit, as I explained
> previously. So, even with your approach I think we may still need a
> per-memcg tcp limit *unless* you get rid of the global tcp limit
> somehow.

Having the hard limit as a failsafe (or a minimum for other consumers)
is one thing, and certainly something I'm open to for cgroupv2, should
we have problems with load startup after a socket memory landgrab.

That being said, if the VM is struggling to reclaim pages, or is even
swapping, it makes perfect sense to let the socket memory scheduler
know it shouldn't continue to increase its footprint until the VM
recovers. Regardless of any hard limitations/minimum guarantees.

This is what my patch does and it seems pretty straightforward to me.
I don't really understand why this is so controversial.

The *next* step would be to figure out whether we can actually
*reclaim* memory in the network subsystem--shrink windows and steal
buffers back--and that might even be an avenue to replace tcp window
limits. But it's not necessary for *this* patch series to be useful.

> > So this seemed like a good way to prove a new mechanism before
> > rolling it out to every single Linux setup, rather than switch
> > everybody over after the limited scope testing I can do as a
> > developer on my own.
> >
> > Keep in mind that my patches are not committing anything in terms of
> > interface, so we retain all the freedom to fix and tune the way this
> > is implemented, including the freedom to re-add tcp window limits in
> > case the pressure balancing is not a comprehensive solution.
>
> I really dislike this kind of proof. It looks like you're trying to
> push something you think is right covertly, w/o having a proper
> discussion with networking people, and then say that it just works
> and hence should be done globally - but what if it won't? Revert it?
> We already have a lot of dubious stuff in memcg that should be
> reverted, so let's please try to avoid this kind of mistake in the
> future. Note, I say "w/o having a proper discussion with networking
> people" because I don't think they will really care *unless* you
> change the global logic, simply because most of them aren't very
> interested in memcg AFAICS.

Come on, Dave is the first To and netdev is CC'd. They might not care
about memcg, but "pushing things covertly" is a bit of a stretch.

> That effectively means you lose a chance to listen to networking
> experts, who could point you at design flaws and propose an
> improvement right away. Let's please not miss such an opportunity.
> You said that you'd seen this problem happen w/o cgroups, so you have
> a use case that might need fixing at the global level. IMO it
> shouldn't be difficult to prepare an RFC patch for the global case
> first and see what people think about it.

No, the problem we are running into is when network memory is not
tracked per cgroup. The lack of containment means that the socket
memory consumption of individual cgroups can trigger system OOM.

We tried using the per-memcg tcp limits, and that prevents the OOMs
for sure, but it's horrendous for network performance. There is no
"stop growing" phase, it just keeps going full throttle until it hits
the wall hard.

Now, we could probably try to replicate the global knobs and add a
per-memcg soft limit. But you know better than anyone else how hard it
is to estimate the overall workingset size of a workload, and the
margins on containerized loads are razor-thin. Performance is much more
sensitive to input errors, and oftentimes parameters must be adjusted
continuously during the runtime of a workload. It'd be disastrous to
rely on yet more static, error-prone user input here.

What all this means to me is that fixing it on the cgroup level has
higher priority. But it also means that once we have figured it out
under such a high-pressure environment, it's much easier to apply to
the global case and potentially replace the soft limit there. This
seems like a better approach to me than starting globally, only to
realize that the solution is not workable for cgroups and we need yet
something else.
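[Editor's illustration] The "stop growing" phase the hard limit lacks
can be sketched as a toy model. All names here are invented for the
sketch (the series gets the pressure signal from vmpressure; the kernel
analogue is a tcp_should_expand_sndbuf()-style check): window growth is
simply held flat while the group reports memory pressure, instead of
running full throttle into the limit and dropping packets.

```c
#include <assert.h>
#include <stdbool.h>

struct group { bool under_pressure; };  /* set from a vmpressure-like signal */

/* Double the window unless the group is under pressure; clamp growth
 * at max_window either way. */
static unsigned long maybe_grow_window(const struct group *g,
				       unsigned long window,
				       unsigned long max_window)
{
	if (g->under_pressure)		/* VM struggling: hold the line */
		return window;
	if (window * 2 > max_window)
		return max_window;
	return window * 2;
}
```

The contrast with a pure hard limit is that growth tapers off as soon
as reclaim starts struggling, rather than only when allocations fail.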
Re: [PATCH 0/8] mm: memcontrol: account socket memory in unified hierarchy
On Mon, Oct 26, 2015 at 01:22:16PM -0400, Johannes Weiner wrote: > On Thu, Oct 22, 2015 at 09:45:10PM +0300, Vladimir Davydov wrote: > > Hi Johannes, > > > > On Thu, Oct 22, 2015 at 12:21:28AM -0400, Johannes Weiner wrote: > > ... > > > Patch #5 adds accounting and tracking of socket memory to the unified > > > hierarchy memory controller, as described above. It uses the existing > > > per-cpu charge caches and triggers high limit reclaim asynchroneously. > > > > > > Patch #8 uses the vmpressure extension to equalize pressure between > > > the pages tracked natively by the VM and socket buffer pages. As the > > > pool is shared, it makes sense that while natively tracked pages are > > > under duress the network transmit windows are also not increased. > > > > First of all, I've no experience in networking, so I'm likely to be > > mistaken. Nevertheless I beg to disagree that this patch set is a step > > in the right direction. Here goes why. > > > > I admit that your idea to get rid of explicit tcp window control knobs > > and size it dynamically basing on memory pressure instead does sound > > tempting, but I don't think it'd always work. The problem is that in > > contrast to, say, dcache, we can't shrink tcp buffers AFAIU, we can only > > stop growing them. Now suppose a system hasn't experienced memory > > pressure for a while. If we don't have explicit tcp window limit, tcp > > buffers on such a system might have eaten almost all available memory > > (because of network load/problems). If a user workload that needs a > > significant amount of memory is started suddenly then, the network code > > will receive a notification and surely stop growing buffers, but all > > those buffers accumulated won't disappear instantly. As a result, the > > workload might be unable to find enough free memory and have no choice > > but invoke OOM killer. This looks unexpected from the user POV. 
> > I'm not getting rid of those knobs, I'm just reusing the old socket > accounting infrastructure in an attempt to make the memory accounting > feature useful to more people in cgroups v2 (unified hierarchy). > My understanding is that in the meantime you effectively break the existing per memcg tcp window control logic. > We can always come back to think about per-cgroup tcp window limits in > the unified hierarchy, my patches don't get in the way of this. I'm > not removing the knobs in cgroups v1 and I'm not preventing them in v2. > > But regardless of tcp window control, we need to account socket memory > in the main memory accounting pool where pressure is shared (to the > best of our abilities) between all accounted memory consumers. > No objections to this point. However, I really don't like the idea to charge tcp window size to memory.current instead of charging individual pages consumed by the workload for storing socket buffers, because it is inconsistent with what we have now. Can't we charge individual skb pages as we do in case of other kmem allocations? > From an interface standpoint alone, I don't think it's reasonable to > ask users per default to limit different consumers on a case by case > basis. I certainly have no problem with finetuning for scenarios you > describe above, but with memory.current, memory.high, memory.max we > are providing a generic interface to account and contain memory > consumption of workloads. This has to include all major memory > consumers to make semantical sense. We can propose a reasonable default as we do in the global case. > > But also, there are people right now for whom the socket buffers cause > system OOM, but the existing memcg's hard tcp window limitq that > exists absolutely wrecks network performance for them. It's not usable > the way it is. 
> It'd be much better to have the socket buffers exert pressure on the shared pool, and then propagate the overall pressure back to individual consumers with reclaim, shrinkers, vmpressure, etc.

This might or might not work; I'm not an expert to judge. But if you do this only for memcg and leave the global case as it is, networking people won't budge IMO. So could you please start such a major rework from the global case? Could you please try to deprecate the tcp window limits not only in the legacy memcg hierarchy but also system-wide, in order to attract the attention of networking experts?

Thanks,
Vladimir
--
To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
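[ Editor's note: the system-wide mechanism Vladimir is referring to is TCP's global memory accounting, governed by the net.ipv4.tcp_mem triple (min, pressure, max, counted in pages). The following is a deliberately simplified, non-authoritative Python sketch of that policy, to make the discussion concrete; it is not kernel code and omits the real implementation's subtleties (per-protocol counters, hysteresis, socket-level limits). ]

```python
# Toy model of TCP's global memory accounting thresholds (tcp_mem):
#   below min       -> no pressure
#   above pressure  -> enter the memory pressure state (windows stop growing)
#   above max       -> buffer allocations fail hard
# Names and structure are illustrative only.

class TcpMemAccounting:
    def __init__(self, tcp_mem):
        self.min, self.pressure, self.max = tcp_mem
        self.allocated = 0          # pages currently charged to TCP
        self.under_pressure = False

    def try_charge(self, pages):
        """Charge pages for socket buffers; False means hard failure."""
        if self.allocated + pages > self.max:
            return False            # over the hard limit: allocation fails
        self.allocated += pages
        if self.allocated > self.pressure:
            self.under_pressure = True   # sockets should stop growing windows
        elif self.allocated < self.min:
            self.under_pressure = False
        return True

    def uncharge(self, pages):
        self.allocated -= pages
        if self.allocated < self.min:
            self.under_pressure = False  # pressure lifts once usage drops
```

Note how this model exhibits exactly the property debated above: crossing the pressure threshold only stops further growth; nothing here shrinks what is already allocated.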
Re: [PATCH 0/8] mm: memcontrol: account socket memory in unified hierarchy
On Tue, Oct 27, 2015 at 11:43:21AM +0300, Vladimir Davydov wrote:
> On Mon, Oct 26, 2015 at 01:22:16PM -0400, Johannes Weiner wrote:
> > I'm not getting rid of those knobs, I'm just reusing the old socket accounting infrastructure in an attempt to make the memory accounting feature useful to more people in cgroups v2 (unified hierarchy).
>
> My understanding is that in the meantime you effectively break the existing per-memcg tcp window control logic.

That's not my intention; this stuff has to keep working. I'm assuming you mean the changes to sk_enter_memory_pressure() when hitting the charge limit; let me address this in the other subthread.

> > We can always come back to think about per-cgroup tcp window limits in the unified hierarchy; my patches don't get in the way of this. I'm not removing the knobs in cgroups v1 and I'm not preventing them in v2.
> >
> > But regardless of tcp window control, we need to account socket memory in the main memory accounting pool where pressure is shared (to the best of our abilities) between all accounted memory consumers.
>
> No objections to this point. However, I really don't like the idea of charging the tcp window size to memory.current instead of charging the individual pages consumed by the workload for storing socket buffers, because it is inconsistent with what we have now. Can't we charge individual skb pages as we do in the case of other kmem allocations?

Absolutely, both work for me. I chose that route because it's where the networking code already tracks and accounts the memory consumed, so it seemed like a better site to hook into. But I understand your concerns. We want to track this stuff as close to the memory allocators as possible.

> > But also, there are people right now for whom the socket buffers cause system OOM, but the memcg's existing hard tcp window limit absolutely wrecks network performance for them. It's not usable the way it is.
> > It'd be much better to have the socket buffers exert pressure on the shared pool, and then propagate the overall pressure back to individual consumers with reclaim, shrinkers, vmpressure, etc.
>
> This might or might not work; I'm not an expert to judge. But if you do this only for memcg and leave the global case as it is, networking people won't budge IMO. So could you please start such a major rework from the global case? Could you please try to deprecate the tcp window limits not only in the legacy memcg hierarchy but also system-wide, in order to attract the attention of networking experts?

I'm definitely interested in addressing this globally as well. The idea behind this was to use the memcg part as a testbed. cgroup2 is going to be new, and people are prepared for hiccups when migrating their applications to it; they can roll back to cgroup1 and tcp window limits at any time should they run into problems in production. So this seemed like a good way to prove a new mechanism before rolling it out to every single Linux setup, rather than switching everybody over after the limited-scope testing I can do as a developer on my own.

Keep in mind that my patches are not committing anything in terms of interface, so we retain all the freedom to fix and tune the way this is implemented, including the freedom to re-add tcp window limits in case the pressure balancing is not a comprehensive solution.
Re: [PATCH 0/8] mm: memcontrol: account socket memory in unified hierarchy
On Thu, Oct 22, 2015 at 09:45:10PM +0300, Vladimir Davydov wrote:
> Hi Johannes,
>
> On Thu, Oct 22, 2015 at 12:21:28AM -0400, Johannes Weiner wrote:
> ...
> > Patch #5 adds accounting and tracking of socket memory to the unified hierarchy memory controller, as described above. It uses the existing per-cpu charge caches and triggers high limit reclaim asynchronously.
> >
> > Patch #8 uses the vmpressure extension to equalize pressure between the pages tracked natively by the VM and socket buffer pages. As the pool is shared, it makes sense that while natively tracked pages are under duress the network transmit windows are also not increased.
>
> First of all, I have no experience in networking, so I'm likely to be mistaken. Nevertheless I beg to disagree that this patch set is a step in the right direction. Here is why.
>
> I admit that your idea to get rid of explicit tcp window control knobs and size the windows dynamically based on memory pressure instead does sound tempting, but I don't think it would always work. The problem is that, in contrast to, say, the dcache, we can't shrink tcp buffers AFAIU, we can only stop growing them. Now suppose a system hasn't experienced memory pressure for a while. If we don't have an explicit tcp window limit, tcp buffers on such a system might have eaten almost all available memory (because of network load/problems). If a user workload that needs a significant amount of memory is then started suddenly, the network code will receive a notification and surely stop growing buffers, but all the buffers accumulated so far won't disappear instantly. As a result, the workload might be unable to find enough free memory and have no choice but to invoke the OOM killer. This looks unexpected from the user's POV.

I'm not getting rid of those knobs, I'm just reusing the old socket accounting infrastructure in an attempt to make the memory accounting feature useful to more people in cgroups v2 (unified hierarchy).
We can always come back to think about per-cgroup tcp window limits in the unified hierarchy; my patches don't get in the way of this. I'm not removing the knobs in cgroups v1 and I'm not preventing them in v2.

But regardless of tcp window control, we need to account socket memory in the main memory accounting pool where pressure is shared (to the best of our abilities) between all accounted memory consumers.

From an interface standpoint alone, I don't think it's reasonable to ask users per default to limit different consumers on a case-by-case basis. I certainly have no problem with finetuning for the scenarios you describe above, but with memory.current, memory.high, memory.max we are providing a generic interface to account and contain the memory consumption of workloads. This has to include all major memory consumers to make semantic sense.

But also, there are people right now for whom the socket buffers cause system OOM, but the memcg's existing hard tcp window limit absolutely wrecks network performance for them. It's not usable the way it is. It'd be much better to have the socket buffers exert pressure on the shared pool, and then propagate the overall pressure back to individual consumers with reclaim, shrinkers, vmpressure, etc.
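[ Editor's note: to make the generic interface Johannes describes concrete, here is a toy Python model of how the three cgroup v2 knobs relate: memory.current is current usage, memory.high is the threshold above which reclaim kicks in, and memory.max is the hard limit. This is a sketch of the interface semantics only, not kernel code; the real controller throttles and reclaims rather than returning status strings. ]

```python
# Toy model of the cgroup v2 memory knobs: memory.current (usage),
# memory.high (reclaim threshold), memory.max (hard limit).
# Return values are illustrative labels, not real kernel behavior.

class MemCgroup:
    def __init__(self, high, maximum):
        self.current = 0      # memory.current: pages charged so far
        self.high = high      # memory.high: exceeding it triggers reclaim
        self.max = maximum    # memory.max: charges beyond this fail

    def charge(self, pages):
        """Charge pages to the single shared pool."""
        if self.current + pages > self.max:
            return "fail"     # hard limit: the allocation cannot proceed
        self.current += pages
        if self.current > self.high:
            return "reclaim"  # over the high watermark: reclaim is kicked off
        return "ok"
```

The point of the series is that socket buffer pages would go through the same charge path as anonymous memory and page cache, so all consumers share one pool and one set of limits.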
Re: [PATCH 0/8] mm: memcontrol: account socket memory in unified hierarchy
Hi Johannes,

On Thu, Oct 22, 2015 at 12:21:28AM -0400, Johannes Weiner wrote:
...
> Patch #5 adds accounting and tracking of socket memory to the unified hierarchy memory controller, as described above. It uses the existing per-cpu charge caches and triggers high limit reclaim asynchronously.
>
> Patch #8 uses the vmpressure extension to equalize pressure between the pages tracked natively by the VM and socket buffer pages. As the pool is shared, it makes sense that while natively tracked pages are under duress the network transmit windows are also not increased.

First of all, I have no experience in networking, so I'm likely to be mistaken. Nevertheless I beg to disagree that this patch set is a step in the right direction. Here is why.

I admit that your idea to get rid of explicit tcp window control knobs and size the windows dynamically based on memory pressure instead does sound tempting, but I don't think it would always work. The problem is that, in contrast to, say, the dcache, we can't shrink tcp buffers AFAIU, we can only stop growing them. Now suppose a system hasn't experienced memory pressure for a while. If we don't have an explicit tcp window limit, tcp buffers on such a system might have eaten almost all available memory (because of network load/problems). If a user workload that needs a significant amount of memory is then started suddenly, the network code will receive a notification and surely stop growing buffers, but all the buffers accumulated so far won't disappear instantly. As a result, the workload might be unable to find enough free memory and have no choice but to invoke the OOM killer. This looks unexpected from the user's POV.

That said, I think we do need per-memcg tcp window control similar to what we have system-wide. In other words, Glauber's work makes sense to me. You might want to point me at my RFC patch where I proposed to revert it (https://lkml.org/lkml/2014/9/12/401). Well, I've changed my mind since then.
Now I think I was mistaken; luckily, I was stopped. However, I may be mistaken again :-)

Thanks,
Vladimir
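[ Editor's note: Vladimir's OOM scenario above can be sketched numerically. The crucial step is that the pressure notification stops further buffer growth but cannot reclaim what has already accumulated. The following simulation uses made-up numbers purely for illustration. ]

```python
# Toy simulation of the scenario described above: without an explicit tcp
# window limit, buffers grow freely during a quiet period; when a workload
# later starts, the pressure notification stops further growth, but the
# accumulated buffers do not disappear instantly. All numbers are invented.

TOTAL_MEMORY = 1000  # pages, hypothetical

def simulate(quiet_period_growth, workload_demand):
    # Quiet period: no memory pressure, so tcp buffers grow unchecked.
    tcp_buffers = sum(quiet_period_growth)
    # Workload starts: the network code is notified and stops growing
    # buffers, but nothing here shrinks what was already allocated.
    free = TOTAL_MEMORY - tcp_buffers
    return "ok" if workload_demand <= free else "oom"
```

Under these assumed numbers, a long quiet period leaves the workload short of memory even though buffer growth has stopped, which is exactly the user-visible surprise Vladimir describes.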
[PATCH 0/8] mm: memcontrol: account socket memory in unified hierarchy
Hi, this series adds socket buffer memory tracking and accounting to the unified hierarchy memory cgroup controller.

[ Networking people, at this time please check the diffstat below to avoid going into convulsions. ]

Socket buffer memory can make up a significant share of a workload's memory footprint, and so it needs to be accounted and tracked out of the box, along with other types of memory that can be directly linked to userspace activity, in order to provide useful resource isolation.

Historically, socket buffers were accounted in a separate counter, without any pressure equalization between anonymous memory, page cache, and the socket buffers. When the socket buffer pool was exhausted, buffer allocations would fail hard and cause network performance to tank, regardless of whether there was still memory available to the group or not. Likewise, struggling anonymous or cache workingsets could not dip into an idle socket memory pool. Because of this, the feature was not usable for many real-life applications.

To not repeat this mistake, the new memory controller accounts all types of memory pages it is tracking on behalf of a cgroup in a single pool, and upon pressure the VM reclaims and shrinks whatever memory in that pool is within its reach. These patches add the memory consumed by sockets associated with a cgroup to that existing pool of anonymous pages and page cache.

Patch #3 reworks the existing memcg socket infrastructure. It has many provisions for future plans that won't materialize, and much of this simply evaporates. The networking people should be happy about this.

Patch #5 adds accounting and tracking of socket memory to the unified hierarchy memory controller, as described above. It uses the existing per-cpu charge caches and triggers high limit reclaim asynchronously.

Patch #8 uses the vmpressure extension to equalize pressure between the pages tracked natively by the VM and socket buffer pages.
As the pool is shared, it makes sense that while natively tracked pages are under duress the network transmit windows are also not increased. As per above, this is an essential part of the new memory controller's core functionality.

With the unified hierarchy nearing release, please consider this for 4.4.

 include/linux/memcontrol.h   |  90 +---
 include/linux/page_counter.h |   6 +-
 include/net/sock.h           | 139 ++--
 include/net/tcp.h            |   5 +-
 include/net/tcp_memcontrol.h |   7 --
 mm/backing-dev.c             |   2 +-
 mm/hugetlb_cgroup.c          |   3 +-
 mm/memcontrol.c              | 235 ++---
 mm/page_counter.c            |  14 +--
 mm/vmpressure.c              |  29 -
 mm/vmscan.c                  |  41 +++
 net/core/sock.c              |  78 --
 net/ipv4/sysctl_net_ipv4.c   |   1 -
 net/ipv4/tcp.c               |   3 +-
 net/ipv4/tcp_ipv4.c          |   9 +-
 net/ipv4/tcp_memcontrol.c    | 147 --
 net/ipv4/tcp_output.c        |   6 +-
 net/ipv6/tcp_ipv6.c          |   3 -
 18 files changed, 319 insertions(+), 499 deletions(-)