Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node
On Fri 02-10-20 09:53:05, Rik van Riel wrote:
> On Fri, 2020-10-02 at 09:03 +0200, Michal Hocko wrote:
> > On Thu 01-10-20 18:18:10, Sebastiaan Meijer wrote:
[...]
> On systems with lots of very fast IO devices, we have
> also seen kswapd take 100% CPU time without any zswap
> in use.

Do you have more details? Does the saturated kswapd lead to premature
direct reclaim? What is the saturated number of reclaimed pages per
unit of time? Have you tried to play with this to see whether an
additional worker would help?
--
Michal Hocko
SUSE Labs
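Michal's questions about the saturated reclaim rate can be answered
empirically from /proc/vmstat deltas. A minimal sketch of that
measurement follows; the two snapshot strings are fabricated sample
data, and on a live system you would read /proc/vmstat twice instead
(the counter names pgsteal_kswapd and pgscan_direct exist on recent
kernels):

```python
# Estimate kswapd reclaim throughput and direct-reclaim activity from
# two /proc/vmstat snapshots taken INTERVAL seconds apart. The two
# snapshot strings below are made-up sample data; on a real system,
# read /proc/vmstat twice instead.

INTERVAL = 10  # seconds between snapshots

snap_before = """pgsteal_kswapd 1000000
pgsteal_direct 5000
pgscan_kswapd 1500000
pgscan_direct 8000"""

snap_after = """pgsteal_kswapd 1512000
pgsteal_direct 9000
pgscan_kswapd 2300000
pgscan_direct 21000"""

def parse(snapshot):
    # /proc/vmstat is one "name value" pair per line
    return {k: int(v) for k, v in
            (line.split() for line in snapshot.splitlines())}

before, after = parse(snap_before), parse(snap_after)
rates = {k: (after[k] - before[k]) / INTERVAL for k in before}

# pgsteal_kswapd delta is pages reclaimed by kswapd per second; a
# growing pgscan_direct means allocators hit the min watermark and
# fell into direct reclaim despite kswapd running.
print(f"kswapd reclaim rate: {rates['pgsteal_kswapd']:.0f} pages/s")
print(f"direct scan rate:    {rates['pgscan_direct']:.0f} pages/s")
```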
Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node
On Fri, Oct 02, 2020 at 09:53:05AM -0400, Rik van Riel wrote:
> On Fri, 2020-10-02 at 09:03 +0200, Michal Hocko wrote:
[...]
> On systems with lots of very fast IO devices, we have
> also seen kswapd take 100% CPU time without any zswap
> in use.
>
> This seems like a generic issue, though zswap does
> manage to bring it out on lower end systems.

Then, given Mel's observation about contention on the LRU lock, what's
the solution? Partition the LRU list? Batch removals from the LRU list
by kswapd and hand off to per-node/per-CPU worker threads?
Rik, if you have access to one of those systems, I'd be interested to know whether using file THPs would help with your workload. Tracking only one THP instead of, say, 16 regular size pages is going to reduce the amount of time taken to pull things off the LRU list.
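Matthew's point can be quantified: each file THP occupies one LRU
entry in place of 2^order base pages, so the number of LRU list
operations drops proportionally. A small arithmetic illustration (not
kernel code; order 4 matches the "16 regular size pages" example
above, order 9 is the x86-64 PMD-sized THP):

```python
# How many LRU entries does kswapd have to walk when the same amount
# of page cache is tracked as THPs instead of base pages? Pure
# arithmetic illustration of the point above, not kernel code.

BASE_PAGE = 4096  # bytes

def lru_entries(total_bytes, order):
    # one LRU entry covers 2**order base pages
    return total_bytes // (BASE_PAGE << order)

cache = 64 << 30  # 64 GiB of page cache, an assumed figure
for order in (0, 4, 9):
    print(f"order-{order}: {lru_entries(cache, order):,} LRU entries")
```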
Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node
On Fri, 2020-10-02 at 09:03 +0200, Michal Hocko wrote:
> On Thu 01-10-20 18:18:10, Sebastiaan Meijer wrote:
[...]
> Isn't this a problem of the zswap implementation rather than general
> kswapd reclaim? Why doesn't zswap do the same as normal swap out in a
> context outside of the reclaim?

On systems with lots of very fast IO devices, we have also seen kswapd
take 100% CPU time without any zswap in use.

This seems like a generic issue, though zswap does manage to bring it
out on lower end systems.

--
All Rights Reversed.
Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node
On Fri, Oct 02, 2020 at 09:03:33AM +0200, Michal Hocko wrote:
> > > My recollection of the particular patch is dim but I do remember
> > > it tried to add more kswapd threads which would just paper over
> > > the problem you are seeing rather than solve it.
> >
> > Yeah, that's exactly what it does, just adding more kswapd threads.
>
> Which is far from trivial because it has its side effects on the
> overall system balance.

While I have not read the original patches, multiple kswapd threads
will smash into the LRU lock repeatedly. It's already the case that
plain storms of page cache allocations hammer that lock on pagevec
releases, and it gets worse as memory sizes increase. Increasing LRU
lock contention when memory is low is going to have diminishing
returns.

--
Mel Gorman
SUSE Labs
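The pagevec batching Mel alludes to amortizes the LRU lock: release
paths take the lock once per batch instead of once per page. A toy
counting model of the acquisition counts (PAGEVEC_SIZE of 15 matches
the long-standing kernel constant; the page count is made up):

```python
# Toy model: how many times the LRU lock is taken when releasing N
# pages one at a time versus in pagevec-sized batches. A counting
# exercise to illustrate the contention argument, not kernel code.
import math

PAGEVEC_SIZE = 15  # the kernel's long-standing batch size

def lock_acquisitions(pages, batch):
    return math.ceil(pages / batch)

n = 1_000_000  # assumed number of pages released
per_page = lock_acquisitions(n, 1)
batched = lock_acquisitions(n, PAGEVEC_SIZE)
print(f"per page: {per_page:,} acquisitions, "
      f"batched: {batched:,} acquisitions")
```

Batching reduces acquisitions roughly 15x here, which is why per-page
removal by multiple kswapd threads would be a step backwards on this
lock.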
Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node
On Thu 01-10-20 18:18:10, Sebastiaan Meijer wrote:
> (Apologies for messing up the mailing list thread, Gmail had fooled
> me into believing that it properly picked up the thread)
>
> On Thu, 1 Oct 2020 at 14:30, Michal Hocko wrote:
[...]
> > Isn't this a problem of the zswap implementation rather than
> > general kswapd reclaim? Why doesn't zswap do the same as normal
> > swap out in a context outside of the reclaim?
>
> I wouldn't be able to tell you, the documentation on zswap is fairly
> limited from what I've found.

I would recommend you talk to the zswap maintainers, describing your
problem and suggesting to offload the heavy lifting into a separate
context like the standard swap IO does. You are not the only one to
hit this problem:
http://lkml.kernel.org/r/CALvZod43VXKZ3StaGXK_EZG_fKcW3v3=ceyowfwp4hnjpoo...@mail.gmail.com
Ccing Shakeel on such an email might help you to give more use cases.

> > My recollection of the particular patch is dim but I do remember
> > it tried to add more kswapd threads which would just paper over
> > the problem you are seeing rather than solve it.
>
> Yeah, that's exactly what it does, just adding more kswapd threads.

Which is far from trivial because it has side effects on the overall
system balance. See my reply to the original request and the follow-up
discussion. I am not saying this is impossible to achieve and tune
properly, but it is certainly non-trivial and it would require a
really strong justification.

--
Michal Hocko
SUSE Labs
Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node
(Apologies for messing up the mailing list thread, Gmail had fooled me
into believing that it properly picked up the thread)

On Thu, 1 Oct 2020 at 14:30, Michal Hocko wrote:
> On Wed 30-09-20 21:27:12, Sebastiaan Meijer wrote:
[...]
> Isn't this a problem of the zswap implementation rather than general
> kswapd reclaim? Why doesn't zswap do the same as normal swap out in a
> context outside of the reclaim?

I wouldn't be able to tell you, the documentation on zswap is fairly
limited from what I've found.

> My recollection of the particular patch is dim but I do remember it
> tried to add more kswapd threads which would just paper over the
> problem you are seeing rather than solve it.

Yeah, that's exactly what it does, just adding more kswapd threads.
I've tried updating the patch to the latest mainline kernel to test
its viability for our use case, but the kswapd code has changed too
much over the past 2 years; updating it is beyond my ability right now
it seems.

For the time being I've switched over to zram, which better suits our
use case either way and is threaded, but lacks zswap's memory
deduplication. Even with zram I'm still seeing kswapd frequently max
out a core though, so there's definitely still a case for further
optimization of kswapd.

In our case it's not a single big application taking up our memory;
rather, we are running 2000 high-memory applications. They store a lot
of data in swap but rarely ever access said data, so the actual swap
I/O isn't even that high.

--
Sebastiaan Meijer
Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node
On Wed 30-09-20 21:27:12, Sebastiaan Meijer wrote:
> > yes it shows the bottleneck but it is quite artificial. Read data
> > is usually processed and/or written back and that changes the
> > picture a lot.
>
> Apologies for reviving an ancient thread (and apologies in advance
> for my lack of knowledge on how mailing lists work), but I'd like to
> offer up another reason why merging this might be a good idea.
>
> From what I understand, zswap runs its compression on the same kswapd
> thread, limiting it to a single thread for compression. Given enough
> processing power, zswap can get great throughput using heavier
> compression algorithms like zstd, but this is currently greatly
> limited by the lack of threading.

Isn't this a problem of the zswap implementation rather than general
kswapd reclaim? Why doesn't zswap do the same as normal swap out in a
context outside of the reclaim?

My recollection of the particular patch is dim but I do remember it
tried to add more kswapd threads, which would just paper over the
problem you are seeing rather than solve it.

--
Michal Hocko
SUSE Labs
Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node
> yes it shows the bottleneck but it is quite artificial. Read data is
> usually processed and/or written back and that changes the picture a
> lot.

Apologies for reviving an ancient thread (and apologies in advance for
my lack of knowledge on how mailing lists work), but I'd like to offer
up another reason why merging this might be a good idea.

From what I understand, zswap runs its compression on the same kswapd
thread, limiting it to a single thread for compression. Given enough
processing power, zswap can get great throughput using heavier
compression algorithms like zstd, but this is currently greatly
limited by the lack of threading.

People on other sites have claimed that applying this patchset greatly
improved zswap performance on their systems, even for lighter
compression algorithms. For me personally, I currently have a
swap-heavy zswap-enabled server with a single-threaded kswapd0
consuming 100% CPU constantly, and performance is suffering because of
it. The server has 32 cores sitting mostly idle that I'd love to put
to zswap work. This setup could be considered a corner case, but it's
definitely a production workload that would greatly benefit from this
change.

--
Sebastiaan Meijer
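The single-compressor-thread bottleneck described above is the classic
case for offloading work to a pool of workers. A userspace analogy of
the idea, with zlib standing in for zswap's compressor (this is only
an analogy under that assumption, not zswap code; zswap itself
compresses in kernel context on the kswapd thread, as the thread
discusses):

```python
# Userspace analogy for the zswap bottleneck: compressing "pages" on
# one thread versus a pool of workers. CPython's zlib releases the
# GIL, so a thread pool genuinely parallelizes this work. Analogy
# only -- not how zswap is implemented.
import zlib
from concurrent.futures import ThreadPoolExecutor

PAGE = 4096
# fabricated "cold" anonymous pages with repetitive content
pages = [bytes([i % 251]) * PAGE for i in range(256)]

def compress(page):
    return zlib.compress(page, 6)

# single-threaded, as the thread says zswap effectively is today
serial = [compress(p) for p in pages]

# the same work spread over 8 workers
with ThreadPoolExecutor(max_workers=8) as pool:
    parallel = list(pool.map(compress, pages))

# identical results; only the wall-clock time changes
print(f"compressed {len(pages)} pages, results match: "
      f"{serial == parallel}")
```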
Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node
On Mon 16-04-18 20:02:22, Buddy Lumpkin wrote:
> > On Apr 12, 2018, at 6:16 AM, Michal Hocko wrote:
[...]
> > But once you hit a wall with hard-to-reclaim pages then I would
> > expect multiple threads will simply contend more (e.g. on fs locks
> > in shrinkers etc…).
>
> If that is the case, this is already happening since direct reclaims
> do just about everything that kswapd does. I have tested with a mix
> of filesystem reads, writes and anonymous memory with and without a
> swap device. The only locking problems I have run into so far are
> related to routines in mm/workingset.c.

You haven't tried hard enough. Try to generate a bigger fs metadata
pressure. In other words, something less of a toy than a pure reader
without any real processing.

[...]
> > Or more specifically. How is the admin supposed to know how many
> > background threads are still improving the situation?
>
> Reduce the setting and check to see if pgscan_direct is still
> incrementing.

This just doesn't work. You are oversimplifying a lot! There are many
more aspects to this. How many background threads are still worth it
without stealing cycles from others? Is half of the CPUs per NUMA node
worth devoting to background reclaim, or is it better to let those
excessive memory consumers be throttled by the direct reclaim? You are
still ignoring/underestimating the fact that kswapd steals cycles even
from other workloads that are not memory bound, while direct reclaim
throttles (mostly) memory consumers.

[...]
> > I still haven't looked at your test results in detail because they
> > seem quite artificial. Clean pagecache reclaim is not all that
> > interesting IMHO
>
> Clean page cache is extremely interesting for demonstrating this
> bottleneck.

Yes, it shows the bottleneck but it is quite artificial. Read data is
usually processed and/or written back and that changes the picture a
lot. Anyway, I do agree that the reclaim can be made faster. I am just
not (yet) convinced that multiplying the number of workers is the way
to achieve that.

[...]
> > > > I would be also very interested to see how to scale the number
> > > > of threads based on how CPUs are utilized by other workloads.
> > >
> > > I think we have reached the point where it makes sense for page
> > > replacement to have more than one mode. Enterprise class servers
> > > with lots of memory and a large number of CPU cores would benefit
> > > heavily if more threads could be devoted toward proactive page
> > > replacement. The polar opposite case is my Raspberry PI which I
> > > want to run as efficiently as possible. This problem is only
> > > going to get worse. I think it makes sense to be able to choose
> > > between efficiency and performance (throughput and latency
> > > reduction).
> >
> > The thing is that as long as this would require the admin to guess
> > then this is not all that useful. People will simply not know what
> > to set and we are going to end up with stupid admin guides claiming
> > that you should use 1/N of per node cpus for kswapd and that will
> > not work.
>
> I think this sysctl is very intuitive to use. Only use it if direct
> reclaims are occurring. This can be seen with sar -B. Justify any
> increase with testing. That is a whole lot easier to wrap your head
> around than a lot of the other sysctls that are available today. Find
> me an admin that actually understands what the swappiness tunable
> does.

Well, you have pointed to a nice example actually. Yes, swappiness is
confusing and you can find _many_ different howtos for tuning. Do they
work? No, not for a long time on most workloads, because we are so
pagecache biased these days that we simply ignore the value most of
the time. I am pretty sure your "just watch sar -B and tune
accordingly" will become obsolete in a short time and people will get
confused again, because they are explicitly tuning for their workload
but it doesn't help anymore, because the internal implementation of
the reclaim has changed again (this happens all the time).

No, I simply do not want to repeat past errors and expose too much of
the implementation details to admins who will most likely have no clue
how to use the tuning and who will rely on random advice on the
internet or, even worse, admin guides of questionable quality full of
cargo cult advice (remember the advice to disable THP for basically
any performance problem you see).

> > Not to mention that the reclaim logic is full of heuristics which
> > change over time, and a subtle implementation detail that would
> > work for a particular scaling might break without anybody noticing.
> > Really, if we are not able to come up with some auto tuning then I
> > think that this is not really worth it.
>
> This is all speculation about how a patch behaves that you have not
> even tested. Similar arguments can be made about most of the sysctls
> that are available.

I really do want a solid background for a change like this. You are
throwing a [...]
Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node
> On Apr 12, 2018, at 6:16 AM, Michal Hocko wrote:
>
> On Tue 03-04-18 12:41:56, Buddy Lumpkin wrote:
>>
>>> On Apr 3, 2018, at 6:31 AM, Michal Hocko wrote:
>>>
>>> On Mon 02-04-18 09:24:22, Buddy Lumpkin wrote:
>>>> Page replacement is handled in the Linux kernel in one of two
>>>> ways:
>>>>
>>>> 1) Asynchronously via kswapd
>>>> 2) Synchronously, via direct reclaim
>>>>
>>>> At page allocation time the allocating task is immediately given
>>>> a page from the zone free list, allowing it to go right back to
>>>> work doing whatever it was doing, probably directly or indirectly
>>>> executing business logic. Just prior to satisfying the
>>>> allocation, the free page count is checked to see if it has
>>>> reached the zone low watermark and, if so, kswapd is awakened.
>>>> Kswapd will start scanning pages looking for inactive pages to
>>>> evict to make room for new page allocations. The work of kswapd
>>>> allows tasks to continue allocating memory from their respective
>>>> zone free list without incurring any delay.
>>>>
>>>> When the demand for free pages exceeds the rate that kswapd tasks
>>>> can supply them, page allocation works differently. Once the
>>>> allocating task finds that the number of free pages is at or
>>>> below the zone min watermark, the task will no longer pull pages
>>>> from the free list. Instead, the task will run the same CPU-bound
>>>> routines as kswapd to satisfy its own allocation by scanning and
>>>> evicting pages. This is called a direct reclaim.
>>>>
>>>> The time spent performing a direct reclaim can be substantial,
>>>> often taking tens to hundreds of milliseconds for small order-0
>>>> allocations to half a second or more for order-9 huge-page
>>>> allocations. In fact, kswapd is not actually required on a Linux
>>>> system. It exists for the sole purpose of optimizing performance
>>>> by preventing direct reclaims.
>>>>
>>>> When memory shortfall is sufficient to trigger direct reclaims,
>>>> they can occur in any task that is running on the system. A
>>>> single aggressive memory allocating task can set the stage for
>>>> collateral damage to occur in small tasks that rarely allocate
>>>> additional memory. Consider the impact of injecting an additional
>>>> 100ms of latency when nscd allocates memory to facilitate caching
>>>> of a DNS query.
>>>>
>>>> The presence of direct reclaims 10 years ago was a fairly
>>>> reliable indicator that too much was being asked of a Linux
>>>> system. Kswapd was likely wasting time scanning pages that were
>>>> ineligible for eviction. Adding RAM or reducing the working set
>>>> size would usually make the problem go away. Since then hardware
>>>> has evolved to bring a new struggle for kswapd. Storage speeds
>>>> have increased by orders of magnitude while CPU clock speeds
>>>> stayed the same or even slowed down in exchange for more cores
>>>> per package. This presents a throughput problem for a single
>>>> threaded kswapd that will get worse with each generation of new
>>>> hardware.
>>>
>>> AFAIR we used to scale the number of kswapd workers many years
>>> ago. It just turned out to be not all that great. We have had a
>>> kswapd reclaim window for quite some time and that can allow
>>> tuning how proactive kswapd should be.
>>
>> Are you referring to vm.watermark_scale_factor?
>
> Yes, along with min_free_kbytes.
>
>> This helps quite a bit. Previously I had to increase
>> min_free_kbytes in order to get a larger gap between the low and
>> min watermarks. I was very excited when I saw that this had been
>> added upstream.
>>
>>> Also please note that the direct reclaim is a way to throttle
>>> overly aggressive memory consumers.
>>
>> I totally agree; in fact I think this should be the primary role of
>> direct reclaims because they have a substantial impact on
>> performance. Direct reclaims are the emergency brakes for page
>> allocation, and the case I am making here is that they used to only
>> occur when kswapd had to skip over a lot of pages.
>
> Or when it is busy reclaiming, which can be the case quite easily if
> you do not have the inactive file LRU full of clean page cache. And
> that is another problem. If you have a trivial reclaim situation
> then a single kswapd thread can reclaim quickly enough.

A single kswapd thread does not help quickly enough. That is the
entire point of this patch.

> But once you hit a wall with hard-to-reclaim pages then I would
> expect multiple threads will simply contend more (e.g. on fs locks
> in shrinkers etc…).

If that is the case, this is already happening, since direct reclaims
do just about everything that kswapd does. I have tested with a mix of
filesystem reads, writes and anonymous memory, with and without a swap
device. The only locking problems I have run into so far are related
to routines in mm/workingset.c.

It is a lot harder to burden the page scan logic than it used to be.
Somewhere around 2007 a change was made where page types that had to
be skipped over were simply [...]
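The allocation-time watermark checks described above, together with
the min_free_kbytes / vm.watermark_scale_factor gap the thread
discusses, can be sketched as a simplified model. This is not the
kernel implementation, only an approximation of its per-zone logic
(the gap formula follows the kernel's managed_pages * factor / 10000
scaling, and all numbers are made up for illustration):

```python
# Simplified model of the per-zone watermark logic described above:
# kswapd is woken when free pages drop to the low watermark, and the
# allocating task falls into direct reclaim at the min watermark.
# Approximation only; numbers are fabricated.

def watermarks(managed_pages, min_pages, scale_factor=10):
    # the kernel derives the inter-watermark distance from the larger
    # of min/4 and managed_pages * watermark_scale_factor / 10000
    gap = max(min_pages // 4, managed_pages * scale_factor // 10000)
    return {"min": min_pages,
            "low": min_pages + gap,
            "high": min_pages + 2 * gap}

def allocation_path(free_pages, wm):
    if free_pages <= wm["min"]:
        return "direct reclaim"           # allocator reclaims itself
    if free_pages <= wm["low"]:
        return "fast path + wake kswapd"  # background reclaim starts
    return "fast path"

wm = watermarks(managed_pages=4_000_000, min_pages=16_000)
print(wm)
print(allocation_path(free_pages=15_000, wm=wm))
```

Raising watermark_scale_factor widens the low-to-min gap, giving
kswapd more runway before allocators hit direct reclaim, which is why
Buddy found it preferable to inflating min_free_kbytes.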
Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node
> On Apr 12, 2018, at 6:16 AM, Michal Hocko wrote: > > On Tue 03-04-18 12:41:56, Buddy Lumpkin wrote: >> >>> On Apr 3, 2018, at 6:31 AM, Michal Hocko wrote: >>> >>> On Mon 02-04-18 09:24:22, Buddy Lumpkin wrote: Page replacement is handled in the Linux Kernel in one of two ways: 1) Asynchronously via kswapd 2) Synchronously, via direct reclaim At page allocation time the allocating task is immediately given a page from the zone free list allowing it to go right back to work doing whatever it was doing; Probably directly or indirectly executing business logic. Just prior to satisfying the allocation, free pages is checked to see if it has reached the zone low watermark and if so, kswapd is awakened. Kswapd will start scanning pages looking for inactive pages to evict to make room for new page allocations. The work of kswapd allows tasks to continue allocating memory from their respective zone free list without incurring any delay. When the demand for free pages exceeds the rate that kswapd tasks can supply them, page allocation works differently. Once the allocating task finds that the number of free pages is at or below the zone min watermark, the task will no longer pull pages from the free list. Instead, the task will run the same CPU-bound routines as kswapd to satisfy its own allocation by scanning and evicting pages. This is called a direct reclaim. The time spent performing a direct reclaim can be substantial, often taking tens to hundreds of milliseconds for small order0 allocations to half a second or more for order9 huge-page allocations. In fact, kswapd is not actually required on a linux system. It exists for the sole purpose of optimizing performance by preventing direct reclaims. When memory shortfall is sufficient to trigger direct reclaims, they can occur in any task that is running on the system. A single aggressive memory allocating task can set the stage for collateral damage to occur in small tasks that rarely allocate additional memory. 
Consider the impact of injecting an additional 100ms of latency when nscd allocates memory to facilitate caching of a DNS query. The presence of direct reclaims 10 years ago was a fairly reliable indicator that too much was being asked of a Linux system. Kswapd was likely wasting time scanning pages that were ineligible for eviction. Adding RAM or reducing the working set size would usually make the problem go away. Since then hardware has evolved to bring a new struggle for kswapd. Storage speeds have increased by orders of magnitude while CPU clock speeds stayed the same or even slowed down in exchange for more cores per package. This presents a throughput problem for a single threaded kswapd that will get worse with each generation of new hardware. >>> >>> AFAIR we used to scale the number of kswapd workers many years ago. It >>> just turned out to be not all that great. We have a kswapd reclaim >>> window for quite some time and that can allow to tune how much proactive >>> kswapd should be. >> >> Are you referring to vm.watermark_scale_factor? > > Yes along with min_free_kbytes > >> This helps quite a bit. Previously >> I had to increase min_free_kbytes in order to get a larger gap between the >> low >> and min watemarks. I was very excited when saw that this had been added >> upstream. >> >>> >>> Also please note that the direct reclaim is a way to throttle overly >>> aggressive memory consumers. >> >> I totally agree, in fact I think this should be the primary role of direct >> reclaims >> because they have a substantial impact on performance. Direct reclaims are >> the emergency brakes for page allocation, and the case I am making here is >> that they used to only occur when kswapd had to skip over a lot of pages. > > Or when it is busy reclaiming which can be the case quite easily if you > do not have the inactive file LRU full of clean page cache. And that is > another problem. 
> If you have a trivial reclaim situation then a single > kswapd thread can reclaim quickly enough.

A single kswapd thread does not help quickly enough. That is the entire point of this patch.

> But once you hit a wall with > hard-to-reclaim pages then I would expect multiple threads will simply > contend more (e.g. on fs locks in shrinkers etc.).

If that is the case, this is already happening, since direct reclaims do just about everything that kswapd does. I have tested with a mix of filesystem reads, writes, and anonymous memory, with and without a swap device. The only locking problems I have run into so far are related to routines in mm/workingset.c. It is a lot harder to burden the page scan logic than it used to be. Somewhere around 2007 a change was made where page types that had to be skipped over were simply
Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node
On Tue 10-04-18 20:10:24, Buddy Lumpkin wrote: [...]

> > Also please note that the direct reclaim is a way to throttle overly > > aggressive memory consumers. The more we do in the background context > > the easier it will be for them to allocate faster. So I am not really > > sure that more background threads will solve the underlying problem.

> A single kswapd thread used to keep up with all of the demand you could > create on a Linux system quite easily, provided it didn’t have to scan a lot > of pages that were ineligible for eviction.

Well, what do you mean by ineligible for eviction? Could you be more specific? Are we talking about pages on the LRU list, or metadata and shrinker-based reclaim?

> 10 years ago, Fibre Channel was > the popular high performance interconnect and if you were lucky enough > to have the latest hardware rated at 10GFC, you could get 1.2GB/s per host > bus adapter. Also, most high end storage solutions were still using spinning > rust, so it took an insane number of spindles behind each host bus adapter > to saturate the channel if the access patterns were random. There really > wasn’t a reason to try to thread kswapd, and I am pretty sure there haven’t > been any attempts to do this in the last 10 years.

I do not really see your point. Yeah, you can get faster storage today. So what? Pagecache has always been bound by the RAM speed.

> > It is just a matter of memory hogs tuning to end in the very same > > situation AFAICS. Moreover the more they are going to allocate, the > > less CPU time _other_ (non-allocating) tasks will get.

> Please describe the scenario a bit more clearly. Once you start constructing > the workload that can create this scenario, I think you will find that you end > up with a mix that is rarely seen in practice.

What I meant is that the more you reclaim in the background, the more you allow memory hogs to allocate because they will not get throttled.
All that on behalf of another workload which is not memory bound and cannot use the CPU cycles the additional kswapd threads would consume. Think of any computation-intensive workload spread over most CPUs alongside memory-hungry data processing.

-- Michal Hocko SUSE Labs
Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node
On Tue 03-04-18 12:41:56, Buddy Lumpkin wrote: > > > On Apr 3, 2018, at 6:31 AM, Michal Hocko wrote: > > > > On Mon 02-04-18 09:24:22, Buddy Lumpkin wrote:

> >> Page replacement is handled in the Linux kernel in one of two ways: > >> > >> 1) Asynchronously, via kswapd > >> 2) Synchronously, via direct reclaim

> >> At page allocation time the allocating task is immediately given a page > >> from the zone free list, allowing it to go right back to work doing > >> whatever it was doing; probably directly or indirectly executing business > >> logic.

> >> Just prior to satisfying the allocation, the free-page count is checked to see if > >> it has reached the zone low watermark and, if so, kswapd is awakened. > >> Kswapd will start scanning pages looking for inactive pages to evict to > >> make room for new page allocations. The work of kswapd allows tasks to > >> continue allocating memory from their respective zone free lists without > >> incurring any delay.

> >> When the demand for free pages exceeds the rate at which kswapd tasks can > >> supply them, page allocation works differently. Once the allocating task > >> finds that the number of free pages is at or below the zone min watermark, > >> the task will no longer pull pages from the free list. Instead, the task > >> will run the same CPU-bound routines as kswapd to satisfy its own > >> allocation by scanning and evicting pages. This is called a direct reclaim.

> >> The time spent performing a direct reclaim can be substantial, often > >> taking tens to hundreds of milliseconds for small order-0 allocations to > >> half a second or more for order-9 huge-page allocations. In fact, kswapd is > >> not actually required on a Linux system. It exists for the sole purpose of > >> optimizing performance by preventing direct reclaims.

> >> When memory shortfall is sufficient to trigger direct reclaims, they can > >> occur in any task that is running on the system.
A single aggressive > >> memory-allocating task can set the stage for collateral damage to occur in > >> small tasks that rarely allocate additional memory. Consider the impact of > >> injecting an additional 100ms of latency when nscd allocates memory to > >> facilitate caching of a DNS query.

> >> The presence of direct reclaims 10 years ago was a fairly reliable > >> indicator that too much was being asked of a Linux system. Kswapd was > >> likely wasting time scanning pages that were ineligible for eviction. > >> Adding RAM or reducing the working set size would usually make the problem > >> go away. Since then hardware has evolved to bring a new struggle for > >> kswapd. Storage speeds have increased by orders of magnitude while CPU > >> clock speeds stayed the same or even slowed down in exchange for more > >> cores per package. This presents a throughput problem for a > >> single-threaded kswapd that will get worse with each generation of new hardware.

> > AFAIR we used to scale the number of kswapd workers many years ago. It > > just turned out to be not all that great. We have had a kswapd reclaim > > window for quite some time and that can allow tuning how proactive > > kswapd should be.

> Are you referring to vm.watermark_scale_factor?

Yes, along with min_free_kbytes.

> This helps quite a bit. Previously > I had to increase min_free_kbytes in order to get a larger gap between the low > and min watermarks. I was very excited when I saw that this had been added > upstream.

> > Also please note that the direct reclaim is a way to throttle overly > > aggressive memory consumers.

> I totally agree; in fact I think this should be the primary role of direct reclaims > because they have a substantial impact on performance. Direct reclaims are > the emergency brakes for page allocation, and the case I am making here is > that they used to only occur when kswapd had to skip over a lot of pages.
Or when it is busy reclaiming, which can be the case quite easily if you do not have the inactive file LRU full of clean page cache. And that is another problem. If you have a trivial reclaim situation then a single kswapd thread can reclaim quickly enough. But once you hit a wall with hard-to-reclaim pages then I would expect multiple threads will simply contend more (e.g. on fs locks in shrinkers etc.). Or how do you want to prevent that? Or, more specifically, how is the admin supposed to know how many background threads are still improving the situation?

> This changed over time as the rate at which a system can allocate pages increased. > Direct reclaims slowly became a normal part of page replacement.

> > The more we do in the background context > > the easier it will be for them to allocate faster. So I am not really > > sure that more background threads will solve the underlying problem. It > > is just a matter of memory hogs tuning to end in the very same > > situation AFAICS. Moreover the more they are going to allocate the more
Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node
> On Apr 3, 2018, at 2:12 PM, Matthew Wilcox wrote: > > On Tue, Apr 03, 2018 at 01:49:25PM -0700, Buddy Lumpkin wrote:

>>> Yes, very much this. If you have a single-threaded workload which is >>> using the entirety of memory and would like to use even more, then it >>> makes sense to use as many CPUs as necessary getting memory out of its >>> way. If you have N CPUs and N-1 threads happily occupying themselves in >>> their own reasonably-sized working sets with one monster process trying >>> to use as much RAM as possible, then I'd be pretty unimpressed to see >>> the N-1 well-behaved threads preempted by kswapd.

>> The default value provides one kswapd thread per NUMA node, the same as >> it was without the patch. Also, I would point out that just because you devote >> more threads to kswapd doesn’t mean they are busy. If multiple kswapd threads >> are busy, they are almost certainly doing work that would have resulted in >> direct reclaims, which are often substantially more expensive than a couple >> extra context switches due to preemption.

> [...]

>> In my previous response to Michal Hocko, I described >> how I think we could scale watermarks in response to direct reclaims, and >> launch more kswapd threads when kswapd peaks at 100% CPU usage.

> I think you're missing my point about the workload ... kswapd isn't > "nice", so it will compete with the N-1 threads which are chugging along > at 100% CPU inside their working sets.

If the memory hog is generating enough demand for multiple kswapd tasks to be busy, then it is generating enough demand to trigger direct reclaims. Since direct reclaims are 100% CPU bound, the preemptions you are concerned about are happening anyway.

> In this scenario, we _don't_ > want to kick off kswapd at all; we want the monster thread to clean up > its own mess.

This makes direct reclaims sound like a positive thing overall, and that is simply not the case.
If cleaning is the metaphor to describe direct reclaims, then it’s happening in the kitchen using a garden hose. When conditions for direct reclaims are present, they can occur in any task that is allocating on the system. They inject latency in random places and they decrease filesystem throughput.

When software engineers try to build their own cache, I usually try to talk them out of it. This rarely works, as they usually have reasons they believe make the project compelling, so I just ask that they compare their results using direct IO and a private cache to simply allowing the page cache to do its thing. I can’t make this pitch anymore because direct reclaims have too much of an impact on filesystem throughput. The only positive thing that direct reclaims provide is a means to prevent the system from crashing or deadlocking when it falls too low on memory.

> If we have idle CPUs, then yes, absolutely, let's have > them clean up for the monster, but otherwise, I want my N-1 threads > doing their own thing.

> Maybe we should renice kswapd anyway ... thoughts? We don't seem to have > had a nice'd kswapd since 2.6.12, but maybe we played with that earlier > and discovered it was a bad idea?
Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node
> On Apr 3, 2018, at 12:07 PM, Matthew Wilcox wrote: > > On Tue, Apr 03, 2018 at 03:31:15PM +0200, Michal Hocko wrote: >> On Mon 02-04-18 09:24:22, Buddy Lumpkin wrote:

>>> The presence of direct reclaims 10 years ago was a fairly reliable >>> indicator that too much was being asked of a Linux system. Kswapd was >>> likely wasting time scanning pages that were ineligible for eviction. >>> Adding RAM or reducing the working set size would usually make the problem >>> go away. Since then hardware has evolved to bring a new struggle for >>> kswapd. Storage speeds have increased by orders of magnitude while CPU >>> clock speeds stayed the same or even slowed down in exchange for more >>> cores per package. This presents a throughput problem for a single >>> threaded kswapd that will get worse with each generation of new hardware.

>> AFAIR we used to scale the number of kswapd workers many years ago. It >> just turned out to be not all that great. We have had a kswapd reclaim >> window for quite some time and that can allow tuning how proactive >> kswapd should be.

>> Also please note that the direct reclaim is a way to throttle overly >> aggressive memory consumers. The more we do in the background context >> the easier it will be for them to allocate faster. So I am not really >> sure that more background threads will solve the underlying problem. It >> is just a matter of memory hogs tuning to end in the very same >> situation AFAICS. Moreover the more they are going to allocate, the >> less CPU time _other_ (non-allocating) tasks will get.

>>> Test Details

>> I will have to study this more to comment.

>> [...]

>>> By increasing the number of kswapd threads, throughput increased by ~50% >>> while kernel mode CPU utilization decreased or stayed the same, likely due >>> to a decrease in the number of parallel tasks at any given time doing page >>> replacement.
>> Well, isn't that just an effect of more work being done on behalf of >> other workload that might run along with your tests (and which doesn't >> really need to allocate a lot of memory)? In other words, how >> does the patch behave with non-artificial mixed workloads?

>> Please note that I am not saying that we absolutely have to stick with the >> current single-thread-per-node implementation, but I would really like to >> see more background on why we should be allowing heavy memory hogs to >> allocate faster or how to prevent that. I would also be very interested >> to see how to scale the number of threads based on how CPUs are utilized >> by other workloads.

> Yes, very much this. If you have a single-threaded workload which is > using the entirety of memory and would like to use even more, then it > makes sense to use as many CPUs as necessary getting memory out of its > way. If you have N CPUs and N-1 threads happily occupying themselves in > their own reasonably-sized working sets with one monster process trying > to use as much RAM as possible, then I'd be pretty unimpressed to see > the N-1 well-behaved threads preempted by kswapd.

A single thread cannot create the demand to keep any number of kswapd tasks busy, so this memory hog is going to need multiple threads if it is going to do any measurable damage to the amount of work performed by the compute-bound tasks, and once we increase the number of tasks used for the memory hog, preemption is already happening.

So let’s say we are willing to accept that it is going to take multiple threads to create enough demand to keep multiple kswapd tasks busy; we just do not want any additional preemptions strictly due to additional kswapd tasks. You have to consider: if we manage to create enough demand to keep multiple kswapd tasks busy, then we are creating enough demand to trigger direct reclaims. A _lot_ of direct reclaims, and direct reclaims consume a _lot_ of CPU.
So if we are running multiple kswapd threads, they might be preempting your N-1 threads, but if they were not running, the memory hog tasks would be preempting your N-1 threads.

> My biggest problem with the patch-as-presented is that it's yet one more > thing for admins to get wrong. We should spawn more threads automatically > if system conditions are right to do that.

One thing an admin could get wrong with the patch as presented is to start with a setting of 16, decide that it didn’t help, and reduce it back to one. It allows for 16 threads because I actually saw a benefit with large numbers of kswapd threads when a substantial amount of the memory pressure was created using anonymous memory mappings that do not involve the page cache. This really is a special case, and the maximum number of threads allowed should probably be reduced to a more sensible value like 8, or even 6 if there is concern about admins doing the wrong thing.
Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node
> On Apr 3, 2018, at 6:31 AM, Michal Hocko wrote: > > On Mon 02-04-18 09:24:22, Buddy Lumpkin wrote:

>> Page replacement is handled in the Linux kernel in one of two ways: >> >> 1) Asynchronously, via kswapd >> 2) Synchronously, via direct reclaim

>> At page allocation time the allocating task is immediately given a page >> from the zone free list, allowing it to go right back to work doing >> whatever it was doing; probably directly or indirectly executing business >> logic.

>> Just prior to satisfying the allocation, the free-page count is checked to see if >> it has reached the zone low watermark and, if so, kswapd is awakened. >> Kswapd will start scanning pages looking for inactive pages to evict to >> make room for new page allocations. The work of kswapd allows tasks to >> continue allocating memory from their respective zone free lists without >> incurring any delay.

>> When the demand for free pages exceeds the rate at which kswapd tasks can >> supply them, page allocation works differently. Once the allocating task >> finds that the number of free pages is at or below the zone min watermark, >> the task will no longer pull pages from the free list. Instead, the task >> will run the same CPU-bound routines as kswapd to satisfy its own >> allocation by scanning and evicting pages. This is called a direct reclaim.

>> The time spent performing a direct reclaim can be substantial, often >> taking tens to hundreds of milliseconds for small order-0 allocations to >> half a second or more for order-9 huge-page allocations. In fact, kswapd is >> not actually required on a Linux system. It exists for the sole purpose of >> optimizing performance by preventing direct reclaims.

>> When memory shortfall is sufficient to trigger direct reclaims, they can >> occur in any task that is running on the system. A single aggressive >> memory-allocating task can set the stage for collateral damage to occur in >> small tasks that rarely allocate additional memory.
Consider the impact of >> injecting an additional 100ms of latency when nscd allocates memory to >> facilitate caching of a DNS query. >> >> The presence of direct reclaims 10 years ago was a fairly reliable >> indicator that too much was being asked of a Linux system. Kswapd was >> likely wasting time scanning pages that were ineligible for eviction. >> Adding RAM or reducing the working set size would usually make the problem >> go away. Since then hardware has evolved to bring a new struggle for >> kswapd. Storage speeds have increased by orders of magnitude while CPU >> clock speeds stayed the same or even slowed down in exchange for more >> cores per package. This presents a throughput problem for a single >> threaded kswapd that will get worse with each generation of new hardware. > > AFAIR we used to scale the number of kswapd workers many years ago. It > just turned out to be not all that great. We have a kswapd reclaim > window for quite some time and that can allow to tune how much proactive > kswapd should be. I am not aware of a previous version of Linux that offered more than one kswapd thread per NUMA node. > > Also please note that the direct reclaim is a way to throttle overly > aggressive memory consumers. The more we do in the background context > the easier for them it will be to allocate faster. So I am not really > sure that more background threads will solve the underlying problem. A single kswapd thread used to keep up with all of the demand you could create on a Linux system quite easily provided it didn’t have to scan a lot of pages that were ineligible for eviction. 10 years ago, Fibre Channel was the popular high performance interconnect and if you were lucky enough to have the latest hardware rated at 10GFC, you could get 1.2GB/s per host bus adapter. 
Also, most high-end storage solutions were still using spinning rust so it took an insane number of spindles behind each host bus adapter to saturate the channel if the access patterns were random. There really wasn’t a reason to try to thread kswapd, and I am pretty sure there haven’t been any attempts to do this in the last 10 years. > It is just a matter of memory hogs tuning to end in the very same > situation AFAICS. Moreover the more they are going to allocate, the less > CPU time _other_ (non-allocating) tasks will get. Please describe the scenario a bit more clearly. Once you start constructing the workload that can create this scenario, I think you will find that you end up with a mix that is rarely seen in practice. > >> Test Details > > I will have to study this more to comment. > > [...] >> By increasing the number of kswapd threads, throughput increased by ~50% >> while kernel mode CPU utilization decreased or stayed the same, likely due >> to a decrease in the number of parallel tasks at any given time doing page >> replacement. > > Well, isn't that just an effect of more work being done on behalf of > other workload that might run along with your tests (and which doesn't > really need to allocate a lot of memory)? In other words, how does the > patch behave with non-artificial mixed workloads?
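The allocation-time behavior described in the quoted patch text (fast path while free pages are plentiful, kswapd woken at the low watermark, direct reclaim once free pages hit the min watermark) can be modeled with a short sketch. This is an illustrative simplification, not kernel code; the watermark values are made up:

```python
# Simplified model of the per-zone watermark checks described above.
# Watermark values are illustrative, not real kernel defaults.

def alloc_path(free_pages, min_wmark=1000, low_wmark=4000):
    """Classify what the allocating task experiences at allocation time."""
    if free_pages <= min_wmark:
        return "direct_reclaim"   # task scans and evicts pages itself (slow)
    if free_pages <= low_wmark:
        return "wake_kswapd"      # async reclaim runs in the background
    return "fast_path"            # page comes straight off the free list

for free in (10000, 3000, 500):
    print(free, alloc_path(free))
```

The window between the two watermarks is where background reclaim can hide reclaim latency from allocating tasks; the patch under discussion is about keeping reclaim throughput high enough that allocations stay in that window.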
Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node
> On Apr 3, 2018, at 2:12 PM, Matthew Wilcox wrote: > > On Tue, Apr 03, 2018 at 01:49:25PM -0700, Buddy Lumpkin wrote: >>> Yes, very much this. If you have a single-threaded workload which is >>> using the entirety of memory and would like to use even more, then it >>> makes sense to use as many CPUs as necessary getting memory out of its >>> way. If you have N CPUs and N-1 threads happily occupying themselves in >>> their own reasonably-sized working sets with one monster process trying >>> to use as much RAM as possible, then I'd be pretty unimpressed to see >>> the N-1 well-behaved threads preempted by kswapd. >> >> The default value provides one kswapd thread per NUMA node, the same as >> it was without the patch. Also, I would point out that just because you >> devote >> more threads to kswapd, doesn’t mean they are busy. If multiple kswapd >> threads >> are busy, they are almost certainly doing work that would have resulted in >> direct reclaims, which are often substantially more expensive than a couple >> extra context switches due to preemption. > > [...] > >> In my previous response to Michal Hocko, I described >> how I think we could scale watermarks in response to direct reclaims, and >> launch more kswapd threads when kswapd peaks at 100% CPU usage. > > I think you're missing my point about the workload ... kswapd isn't > "nice", so it will compete with the N-1 threads which are chugging along > at 100% CPU inside their working sets. In this scenario, we _don't_ > want to kick off kswapd at all; we want the monster thread to clean up > its own mess. If we have idle CPUs, then yes, absolutely, lets have > them clean up for the monster, but otherwise, I want my N-1 threads > doing their own thing. > > Maybe we should renice kswapd anyway ... thoughts? We don't seem to have > had a nice'd kswapd since 2.6.12, but maybe we played with that earlier > and discovered it was a bad idea? 
Trying to distinguish between the monster and a high-value task that you want to run as quickly as possible would be challenging. I like your idea of using renice. It probably makes sense to continue to run the first thread on each node at a standard nice value, and run each additional task with a positive nice value.
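The policy suggested here (first kswapd thread per node keeps the default nice value, extra helper threads are deprioritized) is easy to express as a sketch. The positive nice value of 5 for the extra threads is an assumption; nothing in the thread settles on a number:

```python
def kswapd_nice(thread_index, extra_nice=5):
    """Nice value for the Nth kswapd thread on a node (0-based index).

    Thread 0 keeps the historical behavior (nice 0) so baseline reclaim
    is unaffected; additional helper threads yield to busy userspace
    threads instead of preempting them.
    """
    return 0 if thread_index == 0 else extra_nice

print([kswapd_nice(i) for i in range(4)])  # [0, 5, 5, 5]
```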
Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node
> On Apr 3, 2018, at 2:12 PM, Matthew Wilcox wrote: > > On Tue, Apr 03, 2018 at 01:49:25PM -0700, Buddy Lumpkin wrote: >>> Yes, very much this. If you have a single-threaded workload which is >>> using the entirety of memory and would like to use even more, then it >>> makes sense to use as many CPUs as necessary getting memory out of its >>> way. If you have N CPUs and N-1 threads happily occupying themselves in >>> their own reasonably-sized working sets with one monster process trying >>> to use as much RAM as possible, then I'd be pretty unimpressed to see >>> the N-1 well-behaved threads preempted by kswapd. >> >> The default value provides one kswapd thread per NUMA node, the same as >> it was without the patch. Also, I would point out that just because you >> devote >> more threads to kswapd, doesn’t mean they are busy. If multiple kswapd >> threads >> are busy, they are almost certainly doing work that would have resulted in >> direct reclaims, which are often substantially more expensive than a couple >> extra context switches due to preemption. > > [...] > >> In my previous response to Michal Hocko, I described >> how I think we could scale watermarks in response to direct reclaims, and >> launch more kswapd threads when kswapd peaks at 100% CPU usage. > > I think you're missing my point about the workload ... kswapd isn't > "nice", so it will compete with the N-1 threads which are chugging along > at 100% CPU inside their working sets. In this scenario, we _don't_ > want to kick off kswapd at all; we want the monster thread to clean up > its own mess. If we have idle CPUs, then yes, absolutely, lets have > them clean up for the monster, but otherwise, I want my N-1 threads > doing their own thing. For the scenario you describe above, I have my own opinions, but I would rather not speculate on what happens. Tomorrow I will try to simulate this situation and I'll report back on the results. 
I think this actually makes a case for accepting the patch as-is for now. Please hear me out on this: You mentioned being concerned that an admin will do the wrong thing with this tunable. I worked in the System Administrator/System Engineering job families for many years and even though I transitioned to spending most of my time on performance and kernel work, I still maintain an active role in System Engineering related projects, hiring and mentoring. The kswapd_threads tunable defaults to a value of one, which is the current default behavior. I think there are plenty of sysctls that are more confusing than this one.

If you want to make a comparison, I would say that Transparent Hugepages is one of the best examples of a feature that has confused System Administrators. I am sure it works a lot better today, but it has a history of really sharp edges, and it has been shipping enabled by default for a long time in the OS distributions I am familiar with. I am hopeful that it works better in later kernels as I think we need more features like it. Specifically, features that bring high performance to naive third-party apps that do not make use of advanced features like hugetlbfs, spoke, direct IO, or clumsy interfaces like posix_fadvise. But until they are absolutely polished, I wish these kinds of features would not be turned on by default. This includes kswapd_threads.

More reasons why implementing this tunable makes sense for now:
- A feature like this is a lot easier to reason about after it has been used in the field for a while. This includes trying to auto-tune it.
- We need an answer for this problem today. Today there are single NVMe drives capable of 10GB/s and larger systems than the system I used for testing.
- In the scenario you describe above, an admin would have no reason to touch this sysctl.
- I think I mentioned this before. I honestly thought a lot of tuning would be necessary after implementing this but so far that hasn’t been the case. 
It works pretty well. > > Maybe we should renice kswapd anyway ... thoughts? We don't seem to have > had a nice'd kswapd since 2.6.12, but maybe we played with that earlier > and discovered it was a bad idea? >
Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node
On Tue, Apr 03, 2018 at 01:49:25PM -0700, Buddy Lumpkin wrote: > > Yes, very much this. If you have a single-threaded workload which is > > using the entirety of memory and would like to use even more, then it > > makes sense to use as many CPUs as necessary getting memory out of its > > way. If you have N CPUs and N-1 threads happily occupying themselves in > > their own reasonably-sized working sets with one monster process trying > > to use as much RAM as possible, then I'd be pretty unimpressed to see > > the N-1 well-behaved threads preempted by kswapd. > > The default value provides one kswapd thread per NUMA node, the same > it was without the patch. Also, I would point out that just because you devote > more threads to kswapd, doesn’t mean they are busy. If multiple kswapd threads > are busy, they are almost certainly doing work that would have resulted in > direct reclaims, which are often substantially more expensive than a couple > extra context switches due to preemption. [...] > In my previous response to Michal Hocko, I described > how I think we could scale watermarks in response to direct reclaims, and > launch more kswapd threads when kswapd peaks at 100% CPU usage. I think you're missing my point about the workload ... kswapd isn't "nice", so it will compete with the N-1 threads which are chugging along at 100% CPU inside their working sets. In this scenario, we _don't_ want to kick off kswapd at all; we want the monster thread to clean up its own mess. If we have idle CPUs, then yes, absolutely, lets have them clean up for the monster, but otherwise, I want my N-1 threads doing their own thing. Maybe we should renice kswapd anyway ... thoughts? We don't seem to have had a nice'd kswapd since 2.6.12, but maybe we played with that earlier and discovered it was a bad idea?
Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node
> On Apr 3, 2018, at 12:07 PM, Matthew Wilcox wrote: > > On Tue, Apr 03, 2018 at 03:31:15PM +0200, Michal Hocko wrote: >> On Mon 02-04-18 09:24:22, Buddy Lumpkin wrote: >>> The presence of direct reclaims 10 years ago was a fairly reliable >>> indicator that too much was being asked of a Linux system. Kswapd was >>> likely wasting time scanning pages that were ineligible for eviction. >>> Adding RAM or reducing the working set size would usually make the problem >>> go away. Since then hardware has evolved to bring a new struggle for >>> kswapd. Storage speeds have increased by orders of magnitude while CPU >>> clock speeds stayed the same or even slowed down in exchange for more >>> cores per package. This presents a throughput problem for a single >>> threaded kswapd that will get worse with each generation of new hardware. >> >> AFAIR we used to scale the number of kswapd workers many years ago. It >> just turned out to be not all that great. We have a kswapd reclaim >> window for quite some time and that can allow to tune how much proactive >> kswapd should be. >> >> Also please note that the direct reclaim is a way to throttle overly >> aggressive memory consumers. The more we do in the background context >> the easier for them it will be to allocate faster. So I am not really >> sure that more background threads will solve the underlying problem. It >> is just a matter of memory hogs tuning to end in the very same >> situation AFAICS. Moreover the more they are going to allocate, the less >> CPU time _other_ (non-allocating) tasks will get. >> >>> Test Details >> >> I will have to study this more to comment. >> >> [...] >>> By increasing the number of kswapd threads, throughput increased by ~50% >>> while kernel mode CPU utilization decreased or stayed the same, likely due >>> to a decrease in the number of parallel tasks at any given time doing page >>> replacement. 
>> >> Well, isn't that just an effect of more work being done on behalf of >> other workload that might run along with your tests (and which doesn't >> really need to allocate a lot of memory)? In other words, how >> does the patch behave with non-artificial mixed workloads? >> >> Please note that I am not saying that we absolutely have to stick with the >> current single-thread-per-node implementation but I would really like to >> see more background on why we should be allowing heavy memory hogs to >> allocate faster or how to prevent that. I would also be very interested >> to see how to scale the number of threads based on how CPUs are utilized >> by other workloads. > > Yes, very much this. If you have a single-threaded workload which is > using the entirety of memory and would like to use even more, then it > makes sense to use as many CPUs as necessary getting memory out of its > way. If you have N CPUs and N-1 threads happily occupying themselves in > their own reasonably-sized working sets with one monster process trying > to use as much RAM as possible, then I'd be pretty unimpressed to see > the N-1 well-behaved threads preempted by kswapd. The default value provides one kswapd thread per NUMA node, the same as it was without the patch. Also, I would point out that just because you devote more threads to kswapd, doesn’t mean they are busy. If multiple kswapd threads are busy, they are almost certainly doing work that would have resulted in direct reclaims, which are often substantially more expensive than a couple extra context switches due to preemption. Also, the code still uses wake_up_interruptible to wake kswapd threads, so after starting the first kswapd thread, free pages minus the size of the allocation would still need to be below the low watermark for a page allocation at that time to cause another kswapd thread to wake up. When I first decided to try this out, I figured a lot of tuning would be needed to see good behavior. 
But what I found in practice was that it actually works quite well. When you look closely, you see that there is very little difference between a direct reclaim and kswapd. In fact, direct reclaims work a little harder than kswapd, and they should continue to do so because that prevents the number of parallel scanning tasks from increasing unnecessarily. Please try it out, you might be surprised at how well it works. > > My biggest problem with the patch-as-presented is that it's yet one more > thing for admins to get wrong. We should spawn more threads automatically > if system conditions are right to do that. I totally agree with this. In my previous response to Michal Hocko, I described how I think we could scale watermarks in response to direct reclaims, and launch more kswapd threads when kswapd peaks at 100% CPU usage.
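The auto-tuning direction described here (scale up when direct reclaims occur while kswapd is saturated, scale back down when kswapd has headroom) could be sketched as a simple feedback rule. All names and thresholds below are hypothetical; this is not the patch's code:

```python
def tune_kswapd(n_threads, direct_reclaims, kswapd_cpu_pct,
                max_threads=8, busy_pct=95):
    """One step of a hypothetical kswapd-sizing feedback loop.

    Spawn another kswapd thread only when the existing ones are saturated
    AND allocating tasks are still falling into direct reclaim; retire a
    thread (never below one) once direct reclaims stop and kswapd idles.
    """
    if direct_reclaims > 0 and kswapd_cpu_pct >= busy_pct:
        return min(n_threads + 1, max_threads)
    if direct_reclaims == 0 and kswapd_cpu_pct < 50 and n_threads > 1:
        return n_threads - 1
    return n_threads

print(tune_kswapd(1, direct_reclaims=120, kswapd_cpu_pct=100))  # 2
print(tune_kswapd(2, direct_reclaims=0, kswapd_cpu_pct=30))     # 1
```

Gating the spawn on both signals matches the argument above: extra threads appear only when they would displace work that was going to happen anyway as (more expensive) direct reclaims.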
Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node
> On Apr 3, 2018, at 12:07 PM, Matthew Wilcox wrote: > > On Tue, Apr 03, 2018 at 03:31:15PM +0200, Michal Hocko wrote: >> On Mon 02-04-18 09:24:22, Buddy Lumpkin wrote: >>> The presence of direct reclaims 10 years ago was a fairly reliable >>> indicator that too much was being asked of a Linux system. Kswapd was >>> likely wasting time scanning pages that were ineligible for eviction. >>> Adding RAM or reducing the working set size would usually make the problem >>> go away. Since then hardware has evolved to bring a new struggle for >>> kswapd. Storage speeds have increased by orders of magnitude while CPU >>> clock speeds stayed the same or even slowed down in exchange for more >>> cores per package. This presents a throughput problem for a single >>> threaded kswapd that will get worse with each generation of new hardware. >> >> AFAIR we used to scale the number of kswapd workers many years ago. It >> just turned out to be not all that great. We have a kswapd reclaim >> window for quite some time and that can allow to tune how much proactive >> kswapd should be. >> >> Also please note that the direct reclaim is a way to throttle overly >> aggressive memory consumers. The more we do in the background context >> the easier for them it will be to allocate faster. So I am not really >> sure that more background threads will solve the underlying problem. It >> is just a matter of memory hogs tunning to end in the very same >> situtation AFAICS. Moreover the more they are going to allocate the more >> less CPU time will _other_ (non-allocating) task get. >> >>> Test Details >> >> I will have to study this more to comment. >> >> [...] >>> By increasing the number of kswapd threads, throughput increased by ~50% >>> while kernel mode CPU utilization decreased or stayed the same, likely due >>> to a decrease in the number of parallel tasks at any given time doing page >>> replacement. 
>> >> Well, isn't that just an effect of more work being done on behalf of >> other workload that might run along with your tests (and which doesn't >> really need to allocate a lot of memory)? In other words how >> does the patch behaves with a non-artificial mixed workloads? >> >> Please note that I am not saying that we absolutely have to stick with the >> current single-thread-per-node implementation but I would really like to >> see more background on why we should be allowing heavy memory hogs to >> allocate faster or how to prevent that. I would be also very interested >> to see how to scale the number of threads based on how CPUs are utilized >> by other workloads. > > Yes, very much this. If you have a single-threaded workload which is > using the entirety of memory and would like to use even more, then it > makes sense to use as many CPUs as necessary getting memory out of its > way. If you have N CPUs and N-1 threads happily occupying themselves in > their own reasonably-sized working sets with one monster process trying > to use as much RAM as possible, then I'd be pretty unimpressed to see > the N-1 well-behaved threads preempted by kswapd. The default value provides one kswapd thread per NUMA node, the same it was without the patch. Also, I would point out that just because you devote more threads to kswapd, doesn’t mean they are busy. If multiple kswapd threads are busy, they are almost certainly doing work that would have resulted in direct reclaims, which are often substantially more expensive than a couple extra context switches due to preemption. Also, the code still uses wake_up_interruptible to wake kswapd threads, so after starting the first kswapd thread, free pages minus the size of the allocation would still need to be below the low watermark for a page allocation at that time to cause another kswapd thread to wake up. When I first decided to try this out, I figured a lot of tuning would be needed to see good behavior. 
But what I found in practice was that it actually works quite well. When you look closely, you see that there is very little difference between a direct reclaim and kswapd. In fact, direct reclaims work a little harder than kswapd, and they should continue to do so because that prevents the number of parallel scanning tasks from increasing unnecessarily. Please try it out; you might be surprised at how well it works. > > My biggest problem with the patch-as-presented is that it's yet one more > thing for admins to get wrong. We should spawn more threads automatically > if system conditions are right to do that. I totally agree with this. In my previous response to Michal Hocko, I described how I think we could scale watermarks in response to direct reclaims, and launch more kswapd threads when kswapd peaks at 100% CPU usage.
Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node
Very sorry, I forgot to send my last response as plain text. > On Apr 3, 2018, at 6:31 AM, Michal Hocko wrote: > > On Mon 02-04-18 09:24:22, Buddy Lumpkin wrote: >> Page replacement is handled in the Linux kernel in one of two ways: >> >> 1) Asynchronously via kswapd >> 2) Synchronously, via direct reclaim >> >> At page allocation time the allocating task is immediately given a page >> from the zone free list, allowing it to go right back to work doing >> whatever it was doing; probably directly or indirectly executing business >> logic. >> >> Just prior to satisfying the allocation, the free page count is checked to see if >> it has reached the zone low watermark, and if so, kswapd is awakened. >> Kswapd will start scanning pages looking for inactive pages to evict to >> make room for new page allocations. The work of kswapd allows tasks to >> continue allocating memory from their respective zone free list without >> incurring any delay. >> >> When the demand for free pages exceeds the rate that kswapd tasks can >> supply them, page allocation works differently. Once the allocating task >> finds that the number of free pages is at or below the zone min watermark, >> the task will no longer pull pages from the free list. Instead, the task >> will run the same CPU-bound routines as kswapd to satisfy its own >> allocation by scanning and evicting pages. This is called a direct reclaim. >> >> The time spent performing a direct reclaim can be substantial, often >> taking tens to hundreds of milliseconds for small order-0 allocations to >> half a second or more for order-9 huge-page allocations. In fact, kswapd is >> not actually required on a Linux system. It exists for the sole purpose of >> optimizing performance by preventing direct reclaims. >> >> When memory shortfall is sufficient to trigger direct reclaims, they can >> occur in any task that is running on the system. 
A single aggressive >> memory allocating task can set the stage for collateral damage to occur in >> small tasks that rarely allocate additional memory. Consider the impact of >> injecting an additional 100ms of latency when nscd allocates memory to >> facilitate caching of a DNS query. >> >> The presence of direct reclaims 10 years ago was a fairly reliable >> indicator that too much was being asked of a Linux system. Kswapd was >> likely wasting time scanning pages that were ineligible for eviction. >> Adding RAM or reducing the working set size would usually make the problem >> go away. Since then hardware has evolved to bring a new struggle for >> kswapd. Storage speeds have increased by orders of magnitude while CPU >> clock speeds stayed the same or even slowed down in exchange for more >> cores per package. This presents a throughput problem for a single >> threaded kswapd that will get worse with each generation of new hardware. > > AFAIR we used to scale the number of kswapd workers many years ago. It > just turned out to be not all that great. We have had a kswapd reclaim > window for quite some time now, and it allows tuning how proactive > kswapd should be. Are you referring to vm.watermark_scale_factor? This helps quite a bit. Previously I had to increase min_free_kbytes in order to get a larger gap between the low and min watermarks. I was very excited when I saw that this had been added upstream. > > Also please note that the direct reclaim is a way to throttle overly > aggressive memory consumers. I totally agree; in fact, I think this should be the primary role of direct reclaims because they have a substantial impact on performance. Direct reclaims are the emergency brakes for page allocation, and the case I am making here is that they used to occur only when kswapd had to skip over a lot of pages. This changed over time as the rate at which a system can allocate pages increased. Direct reclaims slowly became a normal part of page replacement. 
> The more we do in the background context, > the easier it will be for them to allocate faster. So I am not really > sure that more background threads will solve the underlying problem. It > is just a matter of memory hogs tuning to end up in the very same > situation AFAICS. Moreover, the more they are going to allocate, the > less CPU time _other_ (non-allocating) tasks will get. The important thing to realize here is that kswapd and direct reclaims run the same code paths. There is very little that they do differently. If you compare my test results with one kswapd vs four, you can see that direct reclaims increase the kernel mode CPU consumption considerably. By dedicating more threads to proactive page replacement, you eliminate direct reclaims, which reduces the total number of parallel threads that are spinning on the CPU. > >> Test Details > > I will have to study this more to comment. > > [...] >> By increasing the number of kswapd threads, throughput increased by ~50% >> while kernel mode CPU utilization decreased or stayed the same, likely due >> to a decrease in the number of parallel tasks at any given time doing page >> replacement.
Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node
On Tue, Apr 03, 2018 at 03:31:15PM +0200, Michal Hocko wrote: > On Mon 02-04-18 09:24:22, Buddy Lumpkin wrote: > > The presence of direct reclaims 10 years ago was a fairly reliable > > indicator that too much was being asked of a Linux system. Kswapd was > > likely wasting time scanning pages that were ineligible for eviction. > > Adding RAM or reducing the working set size would usually make the problem > > go away. Since then hardware has evolved to bring a new struggle for > > kswapd. Storage speeds have increased by orders of magnitude while CPU > > clock speeds stayed the same or even slowed down in exchange for more > > cores per package. This presents a throughput problem for a single > > threaded kswapd that will get worse with each generation of new hardware. > > AFAIR we used to scale the number of kswapd workers many years ago. It > just turned out to be not all that great. We have had a kswapd reclaim > window for quite some time now, and it allows tuning how proactive > kswapd should be. > > Also please note that the direct reclaim is a way to throttle overly > aggressive memory consumers. The more we do in the background context, > the easier it will be for them to allocate faster. So I am not really > sure that more background threads will solve the underlying problem. It > is just a matter of memory hogs tuning to end up in the very same > situation AFAICS. Moreover, the more they are going to allocate, the > less CPU time _other_ (non-allocating) tasks will get. > > > Test Details > > I will have to study this more to comment. > > [...] > > By increasing the number of kswapd threads, throughput increased by ~50% > > while kernel mode CPU utilization decreased or stayed the same, likely due > > to a decrease in the number of parallel tasks at any given time doing page > > replacement. 
> > Well, isn't that just an effect of more work being done on behalf of > other workloads that might run along with your tests (and which don't > really need to allocate a lot of memory)? In other words, how > does the patch behave with non-artificial mixed workloads? > > Please note that I am not saying that we absolutely have to stick with the > current single-thread-per-node implementation, but I would really like to > see more background on why we should be allowing heavy memory hogs to > allocate faster or how to prevent that. I would also be very interested > to see how to scale the number of threads based on how CPUs are utilized > by other workloads. Yes, very much this. If you have a single-threaded workload which is using the entirety of memory and would like to use even more, then it makes sense to use as many CPUs as necessary getting memory out of its way. If you have N CPUs and N-1 threads happily occupying themselves in their own reasonably-sized working sets with one monster process trying to use as much RAM as possible, then I'd be pretty unimpressed to see the N-1 well-behaved threads preempted by kswapd. My biggest problem with the patch-as-presented is that it's yet one more thing for admins to get wrong. We should spawn more threads automatically if system conditions are right to do that.
Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node
On Mon 02-04-18 09:24:22, Buddy Lumpkin wrote: > Page replacement is handled in the Linux kernel in one of two ways: > > 1) Asynchronously via kswapd > 2) Synchronously, via direct reclaim > > At page allocation time the allocating task is immediately given a page > from the zone free list, allowing it to go right back to work doing > whatever it was doing; probably directly or indirectly executing business > logic. > > Just prior to satisfying the allocation, the free page count is checked to see if > it has reached the zone low watermark, and if so, kswapd is awakened. > Kswapd will start scanning pages looking for inactive pages to evict to > make room for new page allocations. The work of kswapd allows tasks to > continue allocating memory from their respective zone free list without > incurring any delay. > > When the demand for free pages exceeds the rate that kswapd tasks can > supply them, page allocation works differently. Once the allocating task > finds that the number of free pages is at or below the zone min watermark, > the task will no longer pull pages from the free list. Instead, the task > will run the same CPU-bound routines as kswapd to satisfy its own > allocation by scanning and evicting pages. This is called a direct reclaim. > > The time spent performing a direct reclaim can be substantial, often > taking tens to hundreds of milliseconds for small order-0 allocations to > half a second or more for order-9 huge-page allocations. In fact, kswapd is > not actually required on a Linux system. It exists for the sole purpose of > optimizing performance by preventing direct reclaims. > > When memory shortfall is sufficient to trigger direct reclaims, they can > occur in any task that is running on the system. A single aggressive > memory allocating task can set the stage for collateral damage to occur in > small tasks that rarely allocate additional memory. 
Consider the impact of > injecting an additional 100ms of latency when nscd allocates memory to > facilitate caching of a DNS query. > > The presence of direct reclaims 10 years ago was a fairly reliable > indicator that too much was being asked of a Linux system. Kswapd was > likely wasting time scanning pages that were ineligible for eviction. > Adding RAM or reducing the working set size would usually make the problem > go away. Since then hardware has evolved to bring a new struggle for > kswapd. Storage speeds have increased by orders of magnitude while CPU > clock speeds stayed the same or even slowed down in exchange for more > cores per package. This presents a throughput problem for a single > threaded kswapd that will get worse with each generation of new hardware. AFAIR we used to scale the number of kswapd workers many years ago. It just turned out to be not all that great. We have had a kswapd reclaim window for quite some time now, and it allows tuning how proactive kswapd should be. Also please note that the direct reclaim is a way to throttle overly aggressive memory consumers. The more we do in the background context, the easier it will be for them to allocate faster. So I am not really sure that more background threads will solve the underlying problem. It is just a matter of memory hogs tuning to end up in the very same situation AFAICS. Moreover, the more they are going to allocate, the less CPU time _other_ (non-allocating) tasks will get. > Test Details I will have to study this more to comment. [...] > By increasing the number of kswapd threads, throughput increased by ~50% > while kernel mode CPU utilization decreased or stayed the same, likely due > to a decrease in the number of parallel tasks at any given time doing page > replacement. Well, isn't that just an effect of more work being done on behalf of other workloads that might run along with your tests (and which don't really need to allocate a lot of memory)? 
In other words, how does the patch behave with non-artificial mixed workloads? Please note that I am not saying that we absolutely have to stick with the current single-thread-per-node implementation, but I would really like to see more background on why we should be allowing heavy memory hogs to allocate faster or how to prevent that. I would also be very interested to see how to scale the number of threads based on how CPUs are utilized by other workloads. -- Michal Hocko SUSE Labs