Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node

2020-10-02 Thread Michal Hocko
On Fri 02-10-20 09:53:05, Rik van Riel wrote:
> On Fri, 2020-10-02 at 09:03 +0200, Michal Hocko wrote:
> > On Thu 01-10-20 18:18:10, Sebastiaan Meijer wrote:
> > > (Apologies for messing up the mailing list thread, Gmail had fooled
> > > me into
> > > believing that it properly picked up the thread)
> > > 
> > > On Thu, 1 Oct 2020 at 14:30, Michal Hocko  wrote:
> > > > On Wed 30-09-20 21:27:12, Sebastiaan Meijer wrote:
> > > > > > yes it shows the bottleneck but it is quite artificial. Read
> > > > > > data is
> > > > > > usually processed and/or written back and that changes the
> > > > > > picture a
> > > > > > lot.
> > > > > Apologies for reviving an ancient thread (and apologies in
> > > > > advance for my lack
> > > > > of knowledge on how mailing lists work), but I'd like to offer
> > > > > up another
> > > > > reason why merging this might be a good idea.
> > > > > 
> > > > > From what I understand, zswap runs its compression on the same
> > > > > kswapd thread,
> > > > > limiting it to a single thread for compression. Given enough
> > > > > processing power,
> > > > > zswap can get great throughput using heavier compression
> > > > > algorithms like zstd,
> > > > > but this is currently greatly limited by the lack of threading.
> > > > 
> > > > Isn't this a problem of the zswap implementation rather than
> > > > general
> > > > kswapd reclaim? Why zswap doesn't do the same as normal swap out
> > > > in a
> > > > context outside of the reclaim?
> 
> On systems with lots of very fast IO devices, we have
> also seen kswapd take 100% CPU time without any zswap
> in use.

Do you have more details? Does the saturated kswapd lead to pre-mature
direct reclaim? What is the saturated number of reclaimed pages per unit
of time? Have you tried to play with this to see whether an additional
worker would help?

-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node

2020-10-02 Thread Matthew Wilcox
On Fri, Oct 02, 2020 at 09:53:05AM -0400, Rik van Riel wrote:
> On Fri, 2020-10-02 at 09:03 +0200, Michal Hocko wrote:
> > On Thu 01-10-20 18:18:10, Sebastiaan Meijer wrote:
> > > (Apologies for messing up the mailing list thread, Gmail had fooled
> > > me into
> > > believing that it properly picked up the thread)
> > > 
> > > On Thu, 1 Oct 2020 at 14:30, Michal Hocko  wrote:
> > > > On Wed 30-09-20 21:27:12, Sebastiaan Meijer wrote:
> > > > > > yes it shows the bottleneck but it is quite artificial. Read
> > > > > > data is
> > > > > > usually processed and/or written back and that changes the
> > > > > > picture a
> > > > > > lot.
> > > > > Apologies for reviving an ancient thread (and apologies in
> > > > > advance for my lack
> > > > > of knowledge on how mailing lists work), but I'd like to offer
> > > > > up another
> > > > > reason why merging this might be a good idea.
> > > > > 
> > > > > From what I understand, zswap runs its compression on the same
> > > > > kswapd thread,
> > > > > limiting it to a single thread for compression. Given enough
> > > > > processing power,
> > > > > zswap can get great throughput using heavier compression
> > > > > algorithms like zstd,
> > > > > but this is currently greatly limited by the lack of threading.
> > > > 
> > > > Isn't this a problem of the zswap implementation rather than
> > > > general
> > > > kswapd reclaim? Why zswap doesn't do the same as normal swap out
> > > > in a
> > > > context outside of the reclaim?
> 
> On systems with lots of very fast IO devices, we have
> also seen kswapd take 100% CPU time without any zswap
> in use.
> 
> This seems like a generic issue, though zswap does
> manage to bring it out on lower end systems.

Then, given Mel's observation about contention on the LRU lock, what's
the solution?  Partition the LRU list?  Batch removals from the LRU list
by kswapd and hand off to per-?node?cpu? worker threads?

Rik, if you have access to one of those systems, I'd be interested to know
whether using file THPs would help with your workload.  Tracking only
one THP instead of, say, 16 regular size pages is going to reduce the
amount of time taken to pull things off the LRU list.


Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node

2020-10-02 Thread Rik van Riel
On Fri, 2020-10-02 at 09:03 +0200, Michal Hocko wrote:
> On Thu 01-10-20 18:18:10, Sebastiaan Meijer wrote:
> > (Apologies for messing up the mailing list thread, Gmail had fooled
> > me into
> > believing that it properly picked up the thread)
> > 
> > On Thu, 1 Oct 2020 at 14:30, Michal Hocko  wrote:
> > > On Wed 30-09-20 21:27:12, Sebastiaan Meijer wrote:
> > > > > yes it shows the bottleneck but it is quite artificial. Read
> > > > > data is
> > > > > usually processed and/or written back and that changes the
> > > > > picture a
> > > > > lot.
> > > > Apologies for reviving an ancient thread (and apologies in
> > > > advance for my lack
> > > > of knowledge on how mailing lists work), but I'd like to offer
> > > > up another
> > > > reason why merging this might be a good idea.
> > > > 
> > > > From what I understand, zswap runs its compression on the same
> > > > kswapd thread,
> > > > limiting it to a single thread for compression. Given enough
> > > > processing power,
> > > > zswap can get great throughput using heavier compression
> > > > algorithms like zstd,
> > > > but this is currently greatly limited by the lack of threading.
> > > 
> > > Isn't this a problem of the zswap implementation rather than
> > > general
> > > kswapd reclaim? Why zswap doesn't do the same as normal swap out
> > > in a
> > > context outside of the reclaim?

On systems with lots of very fast IO devices, we have
also seen kswapd take 100% CPU time without any zswap
in use.

This seems like a generic issue, though zswap does
manage to bring it out on lower end systems.

-- 
All Rights Reversed.


signature.asc
Description: This is a digitally signed message part


Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node

2020-10-02 Thread Mel Gorman
On Fri, Oct 02, 2020 at 09:03:33AM +0200, Michal Hocko wrote:
> > > My recollection of the particular patch is dimm but I do remember it
> > > tried to add more kswapd threads which would just paper over the problem
> > > you are seein rather than solve it.
> > 
> > Yeah, that's exactly what it does, just adding more kswap threads.
> 
> Which is far from trivial because it has its side effects on the over
> system balanc.

While I have not read the original patches, multiple kswapd threads will
smash into the LRU lock repeatedly. It's already the case that just plain
storms of page cache allocations hammer that lock on pagevec releases and
gets worse as memory sizes increase. Increasing LRU lock contention when
memory is low is going to have diminishing returns.

-- 
Mel Gorman
SUSE Labs


Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node

2020-10-02 Thread Michal Hocko
On Thu 01-10-20 18:18:10, Sebastiaan Meijer wrote:
> (Apologies for messing up the mailing list thread, Gmail had fooled me into
> believing that it properly picked up the thread)
> 
> On Thu, 1 Oct 2020 at 14:30, Michal Hocko  wrote:
> >
> > On Wed 30-09-20 21:27:12, Sebastiaan Meijer wrote:
> > > > yes it shows the bottleneck but it is quite artificial. Read data is
> > > > usually processed and/or written back and that changes the picture a
> > > > lot.
> > > Apologies for reviving an ancient thread (and apologies in advance for my 
> > > lack
> > > of knowledge on how mailing lists work), but I'd like to offer up another
> > > reason why merging this might be a good idea.
> > >
> > > From what I understand, zswap runs its compression on the same kswapd 
> > > thread,
> > > limiting it to a single thread for compression. Given enough processing 
> > > power,
> > > zswap can get great throughput using heavier compression algorithms like 
> > > zstd,
> > > but this is currently greatly limited by the lack of threading.
> >
> > Isn't this a problem of the zswap implementation rather than general
> > kswapd reclaim? Why zswap doesn't do the same as normal swap out in a
> > context outside of the reclaim?
> 
> I wouldn't be able to tell you, the documentation on zswap is fairly limited
> from what I've found.

I would recommend you to talk to zswap maintainers. Describing your
problem and suggesting to offload the heavy lifting into a separate
context like the standard swap IO does. You are not the only one to hit
this problem
http://lkml.kernel.org/r/CALvZod43VXKZ3StaGXK_EZG_fKcW3v3=ceyowfwp4hnjpoo...@mail.gmail.com.
Ccing Shakeel on such an email might help you to give more usecases.

> > My recollection of the particular patch is dimm but I do remember it
> > tried to add more kswapd threads which would just paper over the problem
> > you are seein rather than solve it.
> 
> Yeah, that's exactly what it does, just adding more kswap threads.

Which is far from trivial because it has its side effects on the over
system balanc. See my reply to the original request and the follow up
discussion. I am not saying this is impossible to achieve and tune
properly but it is certainly non trivial and it would require a really
strong justification.
-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node

2020-10-01 Thread Sebastiaan Meijer
(Apologies for messing up the mailing list thread, Gmail had fooled me into
believing that it properly picked up the thread)

On Thu, 1 Oct 2020 at 14:30, Michal Hocko  wrote:
>
> On Wed 30-09-20 21:27:12, Sebastiaan Meijer wrote:
> > > yes it shows the bottleneck but it is quite artificial. Read data is
> > > usually processed and/or written back and that changes the picture a
> > > lot.
> > Apologies for reviving an ancient thread (and apologies in advance for my 
> > lack
> > of knowledge on how mailing lists work), but I'd like to offer up another
> > reason why merging this might be a good idea.
> >
> > From what I understand, zswap runs its compression on the same kswapd 
> > thread,
> > limiting it to a single thread for compression. Given enough processing 
> > power,
> > zswap can get great throughput using heavier compression algorithms like 
> > zstd,
> > but this is currently greatly limited by the lack of threading.
>
> Isn't this a problem of the zswap implementation rather than general
> kswapd reclaim? Why zswap doesn't do the same as normal swap out in a
> context outside of the reclaim?

I wouldn't be able to tell you, the documentation on zswap is fairly limited
from what I've found.

> My recollection of the particular patch is dimm but I do remember it
> tried to add more kswapd threads which would just paper over the problem
> you are seein rather than solve it.

Yeah, that's exactly what it does, just adding more kswap threads.
I've tried updating the patch to the latest mainline kernel to test its
viability for our use case, but the kswap code changed too much over the
past 2 years, updating it is beyond my ability right now it seems.

For the time being I've switched over to zram, which better suits our use
case either way, and is threaded, but lacks zswap's memory deduplication.

Even with zram I'm still seeing kswap frequently max out a core though,
so there's definitely still a case for further optimization of kswap.
In our case it's not a single big application taking up our memory, rather we
are running 2000 high-memory applications. They store a lot of data in swap,
but rarely ever access said data, so the actual swap i/o isn't even that high.

--
Sebastiaan Meijer


Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node

2020-10-01 Thread Michal Hocko
On Wed 30-09-20 21:27:12, Sebastiaan Meijer wrote:
> > yes it shows the bottleneck but it is quite artificial. Read data is
> > usually processed and/or written back and that changes the picture a
> > lot.
> Apologies for reviving an ancient thread (and apologies in advance for my lack
> of knowledge on how mailing lists work), but I'd like to offer up another
> reason why merging this might be a good idea.
> 
> From what I understand, zswap runs its compression on the same kswapd thread,
> limiting it to a single thread for compression. Given enough processing power,
> zswap can get great throughput using heavier compression algorithms like zstd,
> but this is currently greatly limited by the lack of threading.

Isn't this a problem of the zswap implementation rather than general
kswapd reclaim? Why zswap doesn't do the same as normal swap out in a
context outside of the reclaim?

My recollection of the particular patch is dimm but I do remember it
tried to add more kswapd threads which would just paper over the problem
you are seein rather than solve it.

-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node

2020-09-30 Thread Sebastiaan Meijer
> yes it shows the bottleneck but it is quite artificial. Read data is
> usually processed and/or written back and that changes the picture a
> lot.
Apologies for reviving an ancient thread (and apologies in advance for my lack
of knowledge on how mailing lists work), but I'd like to offer up another
reason why merging this might be a good idea.

>From what I understand, zswap runs its compression on the same kswapd thread,
limiting it to a single thread for compression. Given enough processing power,
zswap can get great throughput using heavier compression algorithms like zstd,
but this is currently greatly limited by the lack of threading.
People on other sites have claimed applying this patchset greatly improved
zswap performance on their systems even for lighter compression algorithms.

For me personally I currently have a swap-heavy zswap-enabled server with
a single-threaded kswapd0 consuming 100% CPU constantly, and performance
is suffering because of it.
The server has 32 cores sitting mostly idle that I'd love to put to zswap work.

This setup could be considered a corner case, but it's definitely a
production workload that would greatly benefit from this change.
--
Sebastiaan Meijer


Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node

2018-04-17 Thread Michal Hocko
On Mon 16-04-18 20:02:22, Buddy Lumpkin wrote:
> 
> > On Apr 12, 2018, at 6:16 AM, Michal Hocko  wrote:
[...]
> > But once you hit a wall with
> > hard-to-reclaim pages then I would expect multiple threads will simply
> > contend more (e.g. on fs locks in shrinkers etc…).
> 
> If that is the case, this is already happening since direct reclaims do just 
> about
> everything that kswapd does. I have tested with a mix of filesystem reads, 
> writes
> and anonymous memory with and without a swap device. The only locking
> problems I have run into so far are related to routines in mm/workingset.c.

You haven't tried hard enough. Try to generate a bigger fs metadata
pressure. In other words something less of a toy than a pure reader
without any real processing.

[...]

> > Or more specifically. How is the admin supposed to know how many
> > background threads are still improving the situation?
> 
> Reduce the setting and check to see if pgscan_direct is still incrementing.

This just doesn't work. You are oversimplifying a lot! There are much
more aspects to this. How many background threads are still worth it
without stealing cycles from others? Is half of CPUs per NUMA node worth
devoting to background reclaim or is it better to let those excessive
memory consumers to be throttled by the direct reclaim?

You are still ignoring/underestimating the fact that kswapd steals
cycles even from other workload that is not memory bound while direct
reclaim throttles (mostly) memory consumers.

[...]
> > I still haven't looked at your test results in detail because they seem
> > quite artificial. Clean pagecache reclaim is not all that interesting
> > IMHO
> 
> Clean page cache is extremely interesting for demonstrating this bottleneck.

yes it shows the bottleneck but it is quite artificial. Read data is
usually processed and/or written back and that changes the picture a
lot.

Anyway, I do agree that the reclaim can be made faster. I am just not
(yet) convinced that multiplying the number of workers is the way to achieve
that.

[...]
> >>> I would be also very interested
> >>> to see how to scale the number of threads based on how CPUs are utilized
> >>> by other workloads.
> >> 
> >> I think we have reached the point where it makes sense for page 
> >> replacement to have more
> >> than one mode. Enterprise class servers with lots of memory and a large 
> >> number of CPU
> >> cores would benefit heavily if more threads could be devoted toward 
> >> proactive page
> >> replacement. The polar opposite case is my Raspberry PI which I want to 
> >> run as efficiently
> >> as possible. This problem is only going to get worse. I think it makes 
> >> sense to be able to 
> >> choose between efficiency and performance (throughput and latency 
> >> reduction).
> > 
> > The thing is that as long as this would require admin to guess then this
> > is not all that useful. People will simply not know what to set and we
> > are going to end up with stupid admin guides claiming that you should
> > use 1/N of per node cpus for kswapd and that will not work.
> 
> I think this sysctl is very intuitive to use. Only use it if direct reclaims 
> are
> occurring. This can be seen with sar -B. Justify any increase with testing.
> That is a whole lot easier to wrap your head around than a lot of the other
> sysctls that are available today. Find me an admin that actually understands
> what the swappiness tunable does. 

Well, you have pointed to a nice example actually. Yes swappiness is
confusing and you can find _many_ different howtos for tuning. Do they
work? No, for a long time on most workloads because we are simply
pagecache biased so much these days that we simply ignore the value most of the
time. I am pretty sure your "just watch sar -B and tune accordingly" will
become obsolete in a short time and people will get confused again.
Because they are explicitly tuning for their workload but it doesn't
help anymore because the internal implementation of the reclaim has
changed again (this happens all the time).

No, I simply do not want to repeat past errors and expose too much of
implementation details for admins who will most likely have no clue how
to use the tuning and rely on random advices on internet or even worse
admin guides of questionable quality full of cargo cult advises
(remember advises to disable THP for basically any performance problem
you see).

> > Not to
> > mention that the reclaim logic is full of heuristics which change over
> > time and a subtle implementation detail that would work for a particular
> > scaling might break without anybody noticing. Really, if we are not able
> > to come up with some auto tuning then I think that this is not really
> > worth it.
> 
> This is all speculation about how a patch behaves that you have not even
> tested. Similar arguments can be made about most of the sysctls that are
> available. 

I really do want a solid background for the change like this. You 

Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node

2018-04-17 Thread Michal Hocko
On Mon 16-04-18 20:02:22, Buddy Lumpkin wrote:
> 
> > On Apr 12, 2018, at 6:16 AM, Michal Hocko  wrote:
[...]
> > But once you hit a wall with
> > hard-to-reclaim pages then I would expect multiple threads will simply
> > contend more (e.g. on fs locks in shrinkers etc…).
> 
> If that is the case, this is already happening since direct reclaims do just 
> about
> everything that kswapd does. I have tested with a mix of filesystem reads, 
> writes
> and anonymous memory with and without a swap device. The only locking
> problems I have run into so far are related to routines in mm/workingset.c.

You haven't tried hard enough. Try to generate a bigger fs metadata
pressure. In other words something less of a toy than a pure reader
without any real processing.

[...]

> > Or more specifically. How is the admin supposed to know how many
> > background threads are still improving the situation?
> 
> Reduce the setting and check to see if pgscan_direct is still incrementing.

This just doesn't work. You are oversimplifying a lot! There are much
more aspects to this. How many background threads are still worth it
without stealing cycles from others? Is half of CPUs per NUMA node worth
devoting to background reclaim or is it better to let those excessive
memory consumers to be throttled by the direct reclaim?

You are still ignoring/underestimating the fact that kswapd steals
cycles even from other workload that is not memory bound while direct
reclaim throttles (mostly) memory consumers.

[...]
> > I still haven't looked at your test results in detail because they seem
> > quite artificial. Clean pagecache reclaim is not all that interesting
> > IMHO
> 
> Clean page cache is extremely interesting for demonstrating this bottleneck.

yes it shows the bottleneck but it is quite artificial. Read data is
usually processed and/or written back and that changes the picture a
lot.

Anyway, I do agree that the reclaim can be made faster. I am just not
(yet) convinced that multiplying the number of workers is the way to achieve
that.

[...]
> >>> I would be also very interested
> >>> to see how to scale the number of threads based on how CPUs are utilized
> >>> by other workloads.
> >> 
> >> I think we have reached the point where it makes sense for page 
> >> replacement to have more
> >> than one mode. Enterprise class servers with lots of memory and a large 
> >> number of CPU
> >> cores would benefit heavily if more threads could be devoted toward 
> >> proactive page
> >> replacement. The polar opposite case is my Raspberry PI which I want to 
> >> run as efficiently
> >> as possible. This problem is only going to get worse. I think it makes 
> >> sense to be able to 
> >> choose between efficiency and performance (throughput and latency 
> >> reduction).
> > 
> > The thing is that as long as this would require admin to guess then this
> > is not all that useful. People will simply not know what to set and we
> > are going to end up with stupid admin guides claiming that you should
> > use 1/N of per node cpus for kswapd and that will not work.
> 
> I think this sysctl is very intuitive to use. Only use it if direct reclaims 
> are
> occurring. This can be seen with sar -B. Justify any increase with testing.
> That is a whole lot easier to wrap your head around than a lot of the other
> sysctls that are available today. Find me an admin that actually understands
> what the swappiness tunable does. 

Well, you have pointed to a nice example actually. Yes swappiness is
confusing and you can find _many_ different howtos for tuning. Do they
work? No, for a long time on most workloads because we are simply
pagecache biased so much these days that we simply ignore the value most of the
time. I am pretty sure your "just watch sar -B and tune accordingly" will
become obsolete in a short time and people will get confused again.
Because they are explicitly tuning for their workload but it doesn't
help anymore because the internal implementation of the reclaim has
changed again (this happens all the time).

No, I simply do not want to repeat past errors and expose too much of
implementation details for admins who will most likely have no clue how
to use the tuning and rely on random advices on internet or even worse
admin guides of questionable quality full of cargo cult advises
(remember advises to disable THP for basically any performance problem
you see).

> > Not to
> > mention that the reclaim logic is full of heuristics which change over
> > time and a subtle implementation detail that would work for a particular
> > scaling might break without anybody noticing. Really, if we are not able
> > to come up with some auto tuning then I think that this is not really
> > worth it.
> 
> This is all speculation about how a patch behaves that you have not even
> tested. Similar arguments can be made about most of the sysctls that are
> available. 

I really do want a solid background for the change like this. You are
throwing a 

Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node

2018-04-16 Thread Buddy Lumpkin

> On Apr 12, 2018, at 6:16 AM, Michal Hocko  wrote:
> 
> On Tue 03-04-18 12:41:56, Buddy Lumpkin wrote:
>> 
>>> On Apr 3, 2018, at 6:31 AM, Michal Hocko  wrote:
>>> 
>>> On Mon 02-04-18 09:24:22, Buddy Lumpkin wrote:
 Page replacement is handled in the Linux Kernel in one of two ways:
 
 1) Asynchronously via kswapd
 2) Synchronously, via direct reclaim
 
 At page allocation time the allocating task is immediately given a page
 from the zone free list allowing it to go right back to work doing
 whatever it was doing; Probably directly or indirectly executing business
 logic.
 
 Just prior to satisfying the allocation, free pages is checked to see if
 it has reached the zone low watermark and if so, kswapd is awakened.
 Kswapd will start scanning pages looking for inactive pages to evict to
 make room for new page allocations. The work of kswapd allows tasks to
 continue allocating memory from their respective zone free list without
 incurring any delay.
 
 When the demand for free pages exceeds the rate that kswapd tasks can
 supply them, page allocation works differently. Once the allocating task
 finds that the number of free pages is at or below the zone min watermark,
 the task will no longer pull pages from the free list. Instead, the task
 will run the same CPU-bound routines as kswapd to satisfy its own
 allocation by scanning and evicting pages. This is called a direct reclaim.
 
 The time spent performing a direct reclaim can be substantial, often
 taking tens to hundreds of milliseconds for small order0 allocations to
 half a second or more for order9 huge-page allocations. In fact, kswapd is
 not actually required on a linux system. It exists for the sole purpose of
 optimizing performance by preventing direct reclaims.
 
 When memory shortfall is sufficient to trigger direct reclaims, they can
 occur in any task that is running on the system. A single aggressive
 memory allocating task can set the stage for collateral damage to occur in
 small tasks that rarely allocate additional memory. Consider the impact of
 injecting an additional 100ms of latency when nscd allocates memory to
 facilitate caching of a DNS query.
 
 The presence of direct reclaims 10 years ago was a fairly reliable
 indicator that too much was being asked of a Linux system. Kswapd was
 likely wasting time scanning pages that were ineligible for eviction.
 Adding RAM or reducing the working set size would usually make the problem
 go away. Since then hardware has evolved to bring a new struggle for
 kswapd. Storage speeds have increased by orders of magnitude while CPU
 clock speeds stayed the same or even slowed down in exchange for more
 cores per package. This presents a throughput problem for a single
 threaded kswapd that will get worse with each generation of new hardware.
>>> 
>>> AFAIR we used to scale the number of kswapd workers many years ago. It
>>> just turned out to be not all that great. We have a kswapd reclaim
>>> window for quite some time and that can allow to tune how much proactive
>>> kswapd should be.
>> 
>> Are you referring to vm.watermark_scale_factor?
> 
> Yes along with min_free_kbytes
> 
>> This helps quite a bit. Previously
>> I had to increase min_free_kbytes in order to get a larger gap between the 
>> low
>> and min watemarks. I was very excited when saw that this had been added
>> upstream. 
>> 
>>> 
>>> Also please note that the direct reclaim is a way to throttle overly
>>> aggressive memory consumers.
>> 
>> I totally agree, in fact I think this should be the primary role of direct 
>> reclaims
>> because they have a substantial impact on performance. Direct reclaims are
>> the emergency brakes for page allocation, and the case I am making here is 
>> that they used to only occur when kswapd had to skip over a lot of pages. 
> 
> Or when it is busy reclaiming which can be the case quite easily if you
> do not have the inactive file LRU full of clean page cache. And that is
> another problem. If you have a trivial reclaim situation then a single
> kswapd thread can reclaim quickly enough.

A single kswapd thread does not help quickly enough. That is the entire point
of this patch.

> But once you hit a wall with
> hard-to-reclaim pages then I would expect multiple threads will simply
> contend more (e.g. on fs locks in shrinkers etc…).

If that is the case, this is already happening since direct reclaims do just 
about
everything that kswapd does. I have tested with a mix of filesystem reads, 
writes
and anonymous memory with and without a swap device. The only locking
problems I have run into so far are related to routines in mm/workingset.c.

It is a lot harder to burden the page scan logic than it used to be. Somewhere
around 2007 a change was made where page types 

Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node

2018-04-16 Thread Buddy Lumpkin

> On Apr 12, 2018, at 6:16 AM, Michal Hocko  wrote:
> 
> On Tue 03-04-18 12:41:56, Buddy Lumpkin wrote:
>> 
>>> On Apr 3, 2018, at 6:31 AM, Michal Hocko  wrote:
>>> 
>>> On Mon 02-04-18 09:24:22, Buddy Lumpkin wrote:
 Page replacement is handled in the Linux Kernel in one of two ways:
 
 1) Asynchronously via kswapd
 2) Synchronously, via direct reclaim
 
 At page allocation time the allocating task is immediately given a page
 from the zone free list allowing it to go right back to work doing
 whatever it was doing; Probably directly or indirectly executing business
 logic.
 
 Just prior to satisfying the allocation, free pages is checked to see if
 it has reached the zone low watermark and if so, kswapd is awakened.
 Kswapd will start scanning pages looking for inactive pages to evict to
 make room for new page allocations. The work of kswapd allows tasks to
 continue allocating memory from their respective zone free list without
 incurring any delay.
 
 When the demand for free pages exceeds the rate that kswapd tasks can
 supply them, page allocation works differently. Once the allocating task
 finds that the number of free pages is at or below the zone min watermark,
 the task will no longer pull pages from the free list. Instead, the task
 will run the same CPU-bound routines as kswapd to satisfy its own
 allocation by scanning and evicting pages. This is called a direct reclaim.
 
 The time spent performing a direct reclaim can be substantial, often
 taking tens to hundreds of milliseconds for small order0 allocations to
 half a second or more for order9 huge-page allocations. In fact, kswapd is
 not actually required on a linux system. It exists for the sole purpose of
 optimizing performance by preventing direct reclaims.
 
 When memory shortfall is sufficient to trigger direct reclaims, they can
 occur in any task that is running on the system. A single aggressive
 memory allocating task can set the stage for collateral damage to occur in
 small tasks that rarely allocate additional memory. Consider the impact of
 injecting an additional 100ms of latency when nscd allocates memory to
 facilitate caching of a DNS query.
 
 The presence of direct reclaims 10 years ago was a fairly reliable
 indicator that too much was being asked of a Linux system. Kswapd was
 likely wasting time scanning pages that were ineligible for eviction.
 Adding RAM or reducing the working set size would usually make the problem
 go away. Since then hardware has evolved to bring a new struggle for
 kswapd. Storage speeds have increased by orders of magnitude while CPU
 clock speeds stayed the same or even slowed down in exchange for more
 cores per package. This presents a throughput problem for a single
 threaded kswapd that will get worse with each generation of new hardware.
>>> 
>>> AFAIR we used to scale the number of kswapd workers many years ago. It
>>> just turned out to be not all that great. We have a kswapd reclaim
>>> window for quite some time and that can allow to tune how much proactive
>>> kswapd should be.
>> 
>> Are you referring to vm.watermark_scale_factor?
> 
> Yes along with min_free_kbytes
> 
>> This helps quite a bit. Previously
>> I had to increase min_free_kbytes in order to get a larger gap between the 
>> low
>> and min watemarks. I was very excited when saw that this had been added
>> upstream. 
>> 
>>> 
>>> Also please note that the direct reclaim is a way to throttle overly
>>> aggressive memory consumers.
>> 
>> I totally agree, in fact I think this should be the primary role of direct 
>> reclaims
>> because they have a substantial impact on performance. Direct reclaims are
>> the emergency brakes for page allocation, and the case I am making here is 
>> that they used to only occur when kswapd had to skip over a lot of pages. 
> 
> Or when it is busy reclaiming which can be the case quite easily if you
> do not have the inactive file LRU full of clean page cache. And that is
> another problem. If you have a trivial reclaim situation then a single
> kswapd thread can reclaim quickly enough.

A single kswapd thread does not help quickly enough. That is the entire point
of this patch.

> But once you hit a wall with
> hard-to-reclaim pages then I would expect multiple threads will simply
> contend more (e.g. on fs locks in shrinkers etc…).

If that is the case, this is already happening since direct reclaims do just 
about
everything that kswapd does. I have tested with a mix of filesystem reads, 
writes
and anonymous memory with and without a swap device. The only locking
problems I have run into so far are related to routines in mm/workingset.c.

It is a lot harder to burden the page scan logic than it used to be. Somewhere
around 2007 a change was made where page types that had to be skipped
over were simply 

Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node

2018-04-12 Thread Michal Hocko
On Tue 10-04-18 20:10:24, Buddy Lumpkin wrote:
[...]
> > Also please note that the direct reclaim is a way to throttle overly
> > aggressive memory consumers. The more we do in the background context
> > the easier for them it will be to allocate faster. So I am not really
> > sure that more background threads will solve the underlying problem.
> 
> A single kswapd thread used to keep up with all of the demand you could
> create on a Linux system quite easily provided it didn’t have to scan a lot
> of pages that were ineligible for eviction.

Well, what do you mean by ineligible for eviction? Could you be more
specific? Are we talking about pages on LRU list or metadata and
shrinker based reclaim.

> 10 years ago, Fibre Channel was
> the popular high performance interconnect and if you were lucky enough
> to have the latest hardware rated at 10GFC, you could get 1.2GB/s per host
> bus adapter. Also, most high end storage solutions were still using spinning
> rust so it took an insane number of spindles behind each host bus adapter
> to saturate the channel if the access patterns were random. There really
> wasn’t a reason to try to thread kswapd, and I am pretty sure there hasn’t
> been any attempts to do this in the last 10 years.

I do not really see your point. Yeah you can get a faster storage today.
So what? Pagecache has always been bound by the RAM speed.

> > It is just a matter of memory hogs tunning to end in the very same
> > situtation AFAICS. Moreover the more they are going to allocate the more
> > less CPU time will _other_ (non-allocating) task get.
> 
> Please describe the scenario a bit more clearly. Once you start constructing
> the workload that can create this scenario, I think you will find that you end
> up with a mix that is rarely seen in practice.

What I meant is that the more you reclaim in the background to more you
allow memory hogs to allocate because they will not get throttled. All
that on behalf of other workload which is not memory bound and cannot
use CPU cycles additional kswapd would consume. Think of any computation
intensive workload spreading over most CPUs and a memory hungry data
processing.
-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node

2018-04-12 Thread Michal Hocko
On Tue 10-04-18 20:10:24, Buddy Lumpkin wrote:
[...]
> > Also please note that the direct reclaim is a way to throttle overly
> > aggressive memory consumers. The more we do in the background context
> > the easier for them it will be to allocate faster. So I am not really
> > sure that more background threads will solve the underlying problem.
> 
> A single kswapd thread used to keep up with all of the demand you could
> create on a Linux system quite easily provided it didn’t have to scan a lot
> of pages that were ineligible for eviction.

Well, what do you mean by ineligible for eviction? Could you be more
specific? Are we talking about pages on LRU list or metadata and
shrinker based reclaim.

> 10 years ago, Fibre Channel was
> the popular high performance interconnect and if you were lucky enough
> to have the latest hardware rated at 10GFC, you could get 1.2GB/s per host
> bus adapter. Also, most high end storage solutions were still using spinning
> rust so it took an insane number of spindles behind each host bus adapter
> to saturate the channel if the access patterns were random. There really
> wasn’t a reason to try to thread kswapd, and I am pretty sure there hasn’t
> been any attempts to do this in the last 10 years.

I do not really see your point. Yeah you can get a faster storage today.
So what? Pagecache has always been bound by the RAM speed.

> > It is just a matter of memory hogs tunning to end in the very same
> > situtation AFAICS. Moreover the more they are going to allocate the more
> > less CPU time will _other_ (non-allocating) task get.
> 
> Please describe the scenario a bit more clearly. Once you start constructing
> the workload that can create this scenario, I think you will find that you end
> up with a mix that is rarely seen in practice.

What I meant is that the more you reclaim in the background to more you
allow memory hogs to allocate because they will not get throttled. All
that on behalf of other workload which is not memory bound and cannot
use CPU cycles additional kswapd would consume. Think of any computation
intensive workload spreading over most CPUs and a memory hungry data
processing.
-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node

2018-04-12 Thread Michal Hocko
On Tue 03-04-18 12:41:56, Buddy Lumpkin wrote:
> 
> > On Apr 3, 2018, at 6:31 AM, Michal Hocko  wrote:
> > 
> > On Mon 02-04-18 09:24:22, Buddy Lumpkin wrote:
> >> Page replacement is handled in the Linux Kernel in one of two ways:
> >> 
> >> 1) Asynchronously via kswapd
> >> 2) Synchronously, via direct reclaim
> >> 
> >> At page allocation time the allocating task is immediately given a page
> >> from the zone free list allowing it to go right back to work doing
> >> whatever it was doing; Probably directly or indirectly executing business
> >> logic.
> >> 
> >> Just prior to satisfying the allocation, free pages is checked to see if
> >> it has reached the zone low watermark and if so, kswapd is awakened.
> >> Kswapd will start scanning pages looking for inactive pages to evict to
> >> make room for new page allocations. The work of kswapd allows tasks to
> >> continue allocating memory from their respective zone free list without
> >> incurring any delay.
> >> 
> >> When the demand for free pages exceeds the rate that kswapd tasks can
> >> supply them, page allocation works differently. Once the allocating task
> >> finds that the number of free pages is at or below the zone min watermark,
> >> the task will no longer pull pages from the free list. Instead, the task
> >> will run the same CPU-bound routines as kswapd to satisfy its own
> >> allocation by scanning and evicting pages. This is called a direct reclaim.
> >> 
> >> The time spent performing a direct reclaim can be substantial, often
> >> taking tens to hundreds of milliseconds for small order0 allocations to
> >> half a second or more for order9 huge-page allocations. In fact, kswapd is
> >> not actually required on a linux system. It exists for the sole purpose of
> >> optimizing performance by preventing direct reclaims.
> >> 
> >> When memory shortfall is sufficient to trigger direct reclaims, they can
> >> occur in any task that is running on the system. A single aggressive
> >> memory allocating task can set the stage for collateral damage to occur in
> >> small tasks that rarely allocate additional memory. Consider the impact of
> >> injecting an additional 100ms of latency when nscd allocates memory to
> >> facilitate caching of a DNS query.
> >> 
> >> The presence of direct reclaims 10 years ago was a fairly reliable
> >> indicator that too much was being asked of a Linux system. Kswapd was
> >> likely wasting time scanning pages that were ineligible for eviction.
> >> Adding RAM or reducing the working set size would usually make the problem
> >> go away. Since then hardware has evolved to bring a new struggle for
> >> kswapd. Storage speeds have increased by orders of magnitude while CPU
> >> clock speeds stayed the same or even slowed down in exchange for more
> >> cores per package. This presents a throughput problem for a single
> >> threaded kswapd that will get worse with each generation of new hardware.
> > 
> > AFAIR we used to scale the number of kswapd workers many years ago. It
> > just turned out to be not all that great. We have a kswapd reclaim
> > window for quite some time and that can allow to tune how much proactive
> > kswapd should be.
> 
> Are you referring to vm.watermark_scale_factor?

Yes along with min_free_kbytes

> This helps quite a bit. Previously
> I had to increase min_free_kbytes in order to get a larger gap between the low
> and min watemarks. I was very excited when saw that this had been added
> upstream. 
> 
> > 
> > Also please note that the direct reclaim is a way to throttle overly
> > aggressive memory consumers.
> 
> I totally agree, in fact I think this should be the primary role of direct 
> reclaims
> because they have a substantial impact on performance. Direct reclaims are
> the emergency brakes for page allocation, and the case I am making here is 
> that they used to only occur when kswapd had to skip over a lot of pages. 

Or when it is busy reclaiming which can be the case quite easily if you
do not have the inactive file LRU full of clean page cache. And that is
another problem. If you have a trivial reclaim situation then a single
kswapd thread can reclaim quickly enough. But once you hit a wall with
hard-to-reclaim pages then I would expect multiple threads will simply
contend more (e.g. on fs locks in shrinkers etc...). Or how do you want
to prevent that?

Or more specifically. How is the admin supposed to know how many
background threads are still improving the situation?

> This changed over time as the rate a system can allocate pages increased. 
> Direct reclaims slowly became a normal part of page replacement. 
> 
> > The more we do in the background context
> > the easier for them it will be to allocate faster. So I am not really
> > sure that more background threads will solve the underlying problem. It
> > is just a matter of memory hogs tunning to end in the very same
> > situtation AFAICS. Moreover the more they are going to allocate the more
> > 

Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node

2018-04-12 Thread Michal Hocko
On Tue 03-04-18 12:41:56, Buddy Lumpkin wrote:
> 
> > On Apr 3, 2018, at 6:31 AM, Michal Hocko  wrote:
> > 
> > On Mon 02-04-18 09:24:22, Buddy Lumpkin wrote:
> >> Page replacement is handled in the Linux Kernel in one of two ways:
> >> 
> >> 1) Asynchronously via kswapd
> >> 2) Synchronously, via direct reclaim
> >> 
> >> At page allocation time the allocating task is immediately given a page
> >> from the zone free list allowing it to go right back to work doing
> >> whatever it was doing; Probably directly or indirectly executing business
> >> logic.
> >> 
> >> Just prior to satisfying the allocation, free pages is checked to see if
> >> it has reached the zone low watermark and if so, kswapd is awakened.
> >> Kswapd will start scanning pages looking for inactive pages to evict to
> >> make room for new page allocations. The work of kswapd allows tasks to
> >> continue allocating memory from their respective zone free list without
> >> incurring any delay.
> >> 
> >> When the demand for free pages exceeds the rate that kswapd tasks can
> >> supply them, page allocation works differently. Once the allocating task
> >> finds that the number of free pages is at or below the zone min watermark,
> >> the task will no longer pull pages from the free list. Instead, the task
> >> will run the same CPU-bound routines as kswapd to satisfy its own
> >> allocation by scanning and evicting pages. This is called a direct reclaim.
> >> 
> >> The time spent performing a direct reclaim can be substantial, often
> >> taking tens to hundreds of milliseconds for small order0 allocations to
> >> half a second or more for order9 huge-page allocations. In fact, kswapd is
> >> not actually required on a linux system. It exists for the sole purpose of
> >> optimizing performance by preventing direct reclaims.
> >> 
> >> When memory shortfall is sufficient to trigger direct reclaims, they can
> >> occur in any task that is running on the system. A single aggressive
> >> memory allocating task can set the stage for collateral damage to occur in
> >> small tasks that rarely allocate additional memory. Consider the impact of
> >> injecting an additional 100ms of latency when nscd allocates memory to
> >> facilitate caching of a DNS query.
> >> 
> >> The presence of direct reclaims 10 years ago was a fairly reliable
> >> indicator that too much was being asked of a Linux system. Kswapd was
> >> likely wasting time scanning pages that were ineligible for eviction.
> >> Adding RAM or reducing the working set size would usually make the problem
> >> go away. Since then hardware has evolved to bring a new struggle for
> >> kswapd. Storage speeds have increased by orders of magnitude while CPU
> >> clock speeds stayed the same or even slowed down in exchange for more
> >> cores per package. This presents a throughput problem for a single
> >> threaded kswapd that will get worse with each generation of new hardware.
> > 
> > AFAIR we used to scale the number of kswapd workers many years ago. It
> > just turned out to be not all that great. We have a kswapd reclaim
> > window for quite some time and that can allow to tune how much proactive
> > kswapd should be.
> 
> Are you referring to vm.watermark_scale_factor?

Yes along with min_free_kbytes

> This helps quite a bit. Previously
> I had to increase min_free_kbytes in order to get a larger gap between the low
> and min watemarks. I was very excited when saw that this had been added
> upstream. 
> 
> > 
> > Also please note that the direct reclaim is a way to throttle overly
> > aggressive memory consumers.
> 
> I totally agree, in fact I think this should be the primary role of direct 
> reclaims
> because they have a substantial impact on performance. Direct reclaims are
> the emergency brakes for page allocation, and the case I am making here is 
> that they used to only occur when kswapd had to skip over a lot of pages. 

Or when it is busy reclaiming which can be the case quite easily if you
do not have the inactive file LRU full of clean page cache. And that is
another problem. If you have a trivial reclaim situation then a single
kswapd thread can reclaim quickly enough. But once you hit a wall with
hard-to-reclaim pages then I would expect multiple threads will simply
contend more (e.g. on fs locks in shrinkers etc...). Or how do you want
to prevent that?

Or more specifically. How is the admin supposed to know how many
background threads are still improving the situation?

> This changed over time as the rate a system can allocate pages increased. 
> Direct reclaims slowly became a normal part of page replacement. 
> 
> > The more we do in the background context
> > the easier for them it will be to allocate faster. So I am not really
> > sure that more background threads will solve the underlying problem. It
> > is just a matter of memory hogs tunning to end in the very same
> > situtation AFAICS. Moreover the more they are going to allocate the more
> > less CPU time will 

Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node

2018-04-11 Thread Buddy Lumpkin

> On Apr 3, 2018, at 2:12 PM, Matthew Wilcox  wrote:
> 
> On Tue, Apr 03, 2018 at 01:49:25PM -0700, Buddy Lumpkin wrote:
>>> Yes, very much this.  If you have a single-threaded workload which is
>>> using the entirety of memory and would like to use even more, then it
>>> makes sense to use as many CPUs as necessary getting memory out of its
>>> way.  If you have N CPUs and N-1 threads happily occupying themselves in
>>> their own reasonably-sized working sets with one monster process trying
>>> to use as much RAM as possible, then I'd be pretty unimpressed to see
>>> the N-1 well-behaved threads preempted by kswapd.
>> 
>> The default value provides one kswapd thread per NUMA node, the same
>> it was without the patch. Also, I would point out that just because you 
>> devote
>> more threads to kswapd, doesn’t mean they are busy. If multiple kswapd 
>> threads
>> are busy, they are almost certainly doing work that would have resulted in
>> direct reclaims, which are often substantially more expensive than a couple
>> extra context switches due to preemption.
> 
> [...]
> 
>> In my previous response to Michal Hocko, I described
>> how I think we could scale watermarks in response to direct reclaims, and
>> launch more kswapd threads when kswapd peaks at 100% CPU usage.
> 
> I think you're missing my point about the workload ... kswapd isn't
> "nice", so it will compete with the N-1 threads which are chugging along
> at 100% CPU inside their working sets.  

If the memory hog is generating enough demand for multiple kswapd
tasks to be busy, then it is generating enough demand to trigger direct
reclaims. Since direct reclaims are 100% CPU bound, the preemptions
you are concerned about are happening anyway.

> In this scenario, we _don't_
> want to kick off kswapd at all; we want the monster thread to clean up
> its own mess.

This makes direct reclaims sound like a positive thing overall and that
is simply not the case. If cleaning is the metaphor to describe direct
reclaims, then it’s happening in the kitchen using a garden hose.
When conditions for direct reclaims are present they can occur in any
task that is allocating on the system. They inject latency in random places
and they decrease filesystem throughput.

When software engineers try to build their own cache, I usually try to talk
them out of it. This rarely works, as they usually have reasons they believe
make the project compelling, so I just ask that they compare their results
using direct IO and a private cache to simply allowing the page cache to
do it’s thing. I can’t make this pitch any more because direct reclaims have
too much of an impact on filesystem throughput.

The only positive thing that direct reclaims provide is a means to prevent
the system from crashing or deadlocking when it falls too low on memory.

> If we have idle CPUs, then yes, absolutely, lets have
> them clean up for the monster, but otherwise, I want my N-1 threads
> doing their own thing.
> 
> Maybe we should renice kswapd anyway ... thoughts?  We don't seem to have
> had a nice'd kswapd since 2.6.12, but maybe we played with that earlier
> and discovered it was a bad idea?
> 



Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node

2018-04-11 Thread Buddy Lumpkin

> On Apr 3, 2018, at 2:12 PM, Matthew Wilcox  wrote:
> 
> On Tue, Apr 03, 2018 at 01:49:25PM -0700, Buddy Lumpkin wrote:
>>> Yes, very much this.  If you have a single-threaded workload which is
>>> using the entirety of memory and would like to use even more, then it
>>> makes sense to use as many CPUs as necessary getting memory out of its
>>> way.  If you have N CPUs and N-1 threads happily occupying themselves in
>>> their own reasonably-sized working sets with one monster process trying
>>> to use as much RAM as possible, then I'd be pretty unimpressed to see
>>> the N-1 well-behaved threads preempted by kswapd.
>> 
>> The default value provides one kswapd thread per NUMA node, the same
>> it was without the patch. Also, I would point out that just because you 
>> devote
>> more threads to kswapd, doesn’t mean they are busy. If multiple kswapd 
>> threads
>> are busy, they are almost certainly doing work that would have resulted in
>> direct reclaims, which are often substantially more expensive than a couple
>> extra context switches due to preemption.
> 
> [...]
> 
>> In my previous response to Michal Hocko, I described
>> how I think we could scale watermarks in response to direct reclaims, and
>> launch more kswapd threads when kswapd peaks at 100% CPU usage.
> 
> I think you're missing my point about the workload ... kswapd isn't
> "nice", so it will compete with the N-1 threads which are chugging along
> at 100% CPU inside their working sets.  

If the memory hog is generating enough demand for multiple kswapd
tasks to be busy, then it is generating enough demand to trigger direct
reclaims. Since direct reclaims are 100% CPU bound, the preemptions
you are concerned about are happening anyway.

> In this scenario, we _don't_
> want to kick off kswapd at all; we want the monster thread to clean up
> its own mess.

This makes direct reclaims sound like a positive thing overall and that
is simply not the case. If cleaning is the metaphor to describe direct
reclaims, then it’s happening in the kitchen using a garden hose.
When conditions for direct reclaims are present they can occur in any
task that is allocating on the system. They inject latency in random places
and they decrease filesystem throughput.

When software engineers try to build their own cache, I usually try to talk
them out of it. This rarely works, as they usually have reasons they believe
make the project compelling, so I just ask that they compare their results
using direct IO and a private cache to simply allowing the page cache to
do it’s thing. I can’t make this pitch any more because direct reclaims have
too much of an impact on filesystem throughput.

The only positive thing that direct reclaims provide is a means to prevent
the system from crashing or deadlocking when it falls too low on memory.

> If we have idle CPUs, then yes, absolutely, lets have
> them clean up for the monster, but otherwise, I want my N-1 threads
> doing their own thing.
> 
> Maybe we should renice kswapd anyway ... thoughts?  We don't seem to have
> had a nice'd kswapd since 2.6.12, but maybe we played with that earlier
> and discovered it was a bad idea?
> 



Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node

2018-04-10 Thread Buddy Lumpkin

> On Apr 3, 2018, at 12:07 PM, Matthew Wilcox  wrote:
> 
> On Tue, Apr 03, 2018 at 03:31:15PM +0200, Michal Hocko wrote:
>> On Mon 02-04-18 09:24:22, Buddy Lumpkin wrote:
>>> The presence of direct reclaims 10 years ago was a fairly reliable
>>> indicator that too much was being asked of a Linux system. Kswapd was
>>> likely wasting time scanning pages that were ineligible for eviction.
>>> Adding RAM or reducing the working set size would usually make the problem
>>> go away. Since then hardware has evolved to bring a new struggle for
>>> kswapd. Storage speeds have increased by orders of magnitude while CPU
>>> clock speeds stayed the same or even slowed down in exchange for more
>>> cores per package. This presents a throughput problem for a single
>>> threaded kswapd that will get worse with each generation of new hardware.
>> 
>> AFAIR we used to scale the number of kswapd workers many years ago. It
>> just turned out to be not all that great. We have a kswapd reclaim
>> window for quite some time and that can allow to tune how much proactive
>> kswapd should be.
>> 
>> Also please note that the direct reclaim is a way to throttle overly
>> aggressive memory consumers. The more we do in the background context
>> the easier for them it will be to allocate faster. So I am not really
>> sure that more background threads will solve the underlying problem. It
>> is just a matter of memory hogs tunning to end in the very same
>> situtation AFAICS. Moreover the more they are going to allocate the more
>> less CPU time will _other_ (non-allocating) task get.
>> 
>>> Test Details
>> 
>> I will have to study this more to comment.
>> 
>> [...]
>>> By increasing the number of kswapd threads, throughput increased by ~50%
>>> while kernel mode CPU utilization decreased or stayed the same, likely due
>>> to a decrease in the number of parallel tasks at any given time doing page
>>> replacement.
>> 
>> Well, isn't that just an effect of more work being done on behalf of
>> other workload that might run along with your tests (and which doesn't
>> really need to allocate a lot of memory)? In other words how
>> does the patch behaves with a non-artificial mixed workloads?
>> 
>> Please note that I am not saying that we absolutely have to stick with the
>> current single-thread-per-node implementation but I would really like to
>> see more background on why we should be allowing heavy memory hogs to
>> allocate faster or how to prevent that. I would be also very interested
>> to see how to scale the number of threads based on how CPUs are utilized
>> by other workloads.
> 
> Yes, very much this.  If you have a single-threaded workload which is
> using the entirety of memory and would like to use even more, then it
> makes sense to use as many CPUs as necessary getting memory out of its
> way.  If you have N CPUs and N-1 threads happily occupying themselves in
> their own reasonably-sized working sets with one monster process trying
> to use as much RAM as possible, then I'd be pretty unimpressed to see
> the N-1 well-behaved threads preempted by kswapd.

A single thread cannot create the demand to keep any number of kswapd tasks
busy, so this memory hog is going to need to have multiple threads if it is 
going
to do any measurable damage to the amount of work performed by the compute
bound tasks, and once we increase the number of tasks used for the memory
hog, preemption is already happening.

So let’s say we are willing to accept that it is going to take multiple threads 
to
create enough demand to keep multiple kswapd tasks busy, we just do not want
any additional preemptions strictly due to additional kswapd tasks. You have to
consider, If we managed to create enough demand to keep multiple kswapd tasks
busy, then we are creating enough demand to trigger direct reclaims. A _lot_ of
direct reclaims, and direct reclaims consume A _lot_ of cpu. So if we are 
running
multiple kswapd threads, they might be preempting your N-1 threads, but if they
were not running, the memory hog tasks would be preempting your N-1 threads.

> 
> My biggest problem with the patch-as-presented is that it's yet one more
> thing for admins to get wrong.  We should spawn more threads automatically
> if system conditions are right to do that.

One thing about this patch-as-presented that an admin could get wrong is by
starting with a setting of 16, deciding that it didn’t help and reducing it 
back to
one. It allows for 16 threads because I actually saw a benefit with large 
numbers
of kswapd threads when a substantial amount of the memory pressure was 
created using anonymous memory mappings that do not involve the page cache.
This really is a special case, and the maximum number of threads allowed should
probably be reduced to a more sensible value like 8 or even 6 if there is 
concern
about admins doing the wrong thing.







Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node

2018-04-10 Thread Buddy Lumpkin

> On Apr 3, 2018, at 12:07 PM, Matthew Wilcox  wrote:
> 
> On Tue, Apr 03, 2018 at 03:31:15PM +0200, Michal Hocko wrote:
>> On Mon 02-04-18 09:24:22, Buddy Lumpkin wrote:
>>> The presence of direct reclaims 10 years ago was a fairly reliable
>>> indicator that too much was being asked of a Linux system. Kswapd was
>>> likely wasting time scanning pages that were ineligible for eviction.
>>> Adding RAM or reducing the working set size would usually make the problem
>>> go away. Since then hardware has evolved to bring a new struggle for
>>> kswapd. Storage speeds have increased by orders of magnitude while CPU
>>> clock speeds stayed the same or even slowed down in exchange for more
>>> cores per package. This presents a throughput problem for a single
>>> threaded kswapd that will get worse with each generation of new hardware.
>> 
>> AFAIR we used to scale the number of kswapd workers many years ago. It
>> just turned out to be not all that great. We have a kswapd reclaim
>> window for quite some time and that can allow to tune how much proactive
>> kswapd should be.
>> 
>> Also please note that the direct reclaim is a way to throttle overly
>> aggressive memory consumers. The more we do in the background context
>> the easier for them it will be to allocate faster. So I am not really
>> sure that more background threads will solve the underlying problem. It
>> is just a matter of memory hogs tunning to end in the very same
>> situtation AFAICS. Moreover the more they are going to allocate the more
>> less CPU time will _other_ (non-allocating) task get.
>> 
>>> Test Details
>> 
>> I will have to study this more to comment.
>> 
>> [...]
>>> By increasing the number of kswapd threads, throughput increased by ~50%
>>> while kernel mode CPU utilization decreased or stayed the same, likely due
>>> to a decrease in the number of parallel tasks at any given time doing page
>>> replacement.
>> 
>> Well, isn't that just an effect of more work being done on behalf of
>> other workload that might run along with your tests (and which doesn't
>> really need to allocate a lot of memory)? In other words how
>> does the patch behaves with a non-artificial mixed workloads?
>> 
>> Please note that I am not saying that we absolutely have to stick with the
>> current single-thread-per-node implementation but I would really like to
>> see more background on why we should be allowing heavy memory hogs to
>> allocate faster or how to prevent that. I would be also very interested
>> to see how to scale the number of threads based on how CPUs are utilized
>> by other workloads.
> 
> Yes, very much this.  If you have a single-threaded workload which is
> using the entirety of memory and would like to use even more, then it
> makes sense to use as many CPUs as necessary getting memory out of its
> way.  If you have N CPUs and N-1 threads happily occupying themselves in
> their own reasonably-sized working sets with one monster process trying
> to use as much RAM as possible, then I'd be pretty unimpressed to see
> the N-1 well-behaved threads preempted by kswapd.

A single thread cannot create the demand to keep any number of kswapd tasks
busy, so this memory hog is going to need to have multiple threads if it is 
going
to do any measurable damage to the amount of work performed by the compute
bound tasks, and once we increase the number of tasks used for the memory
hog, preemption is already happening.

So let’s say we are willing to accept that it is going to take multiple threads 
to
create enough demand to keep multiple kswapd tasks busy, we just do not want
any additional preemptions strictly due to additional kswapd tasks. You have to
consider, If we managed to create enough demand to keep multiple kswapd tasks
busy, then we are creating enough demand to trigger direct reclaims. A _lot_ of
direct reclaims, and direct reclaims consume A _lot_ of cpu. So if we are 
running
multiple kswapd threads, they might be preempting your N-1 threads, but if they
were not running, the memory hog tasks would be preempting your N-1 threads.

> 
> My biggest problem with the patch-as-presented is that it's yet one more
> thing for admins to get wrong.  We should spawn more threads automatically
> if system conditions are right to do that.

One thing about this patch-as-presented that an admin could get wrong is by
starting with a setting of 16, deciding that it didn’t help and reducing it 
back to
one. It allows for 16 threads because I actually saw a benefit with large 
numbers
of kswapd threads when a substantial amount of the memory pressure was 
created using anonymous memory mappings that do not involve the page cache.
This really is a special case, and the maximum number of threads allowed should
probably be reduced to a more sensible value like 8 or even 6 if there is 
concern
about admins doing the wrong thing.







Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node

2018-04-10 Thread Buddy Lumpkin

> On Apr 3, 2018, at 6:31 AM, Michal Hocko  wrote:
> 
> On Mon 02-04-18 09:24:22, Buddy Lumpkin wrote:
>> Page replacement is handled in the Linux Kernel in one of two ways:
>> 
>> 1) Asynchronously via kswapd
>> 2) Synchronously, via direct reclaim
>> 
>> At page allocation time the allocating task is immediately given a page
>> from the zone free list allowing it to go right back to work doing
>> whatever it was doing; Probably directly or indirectly executing business
>> logic.
>> 
>> Just prior to satisfying the allocation, free pages is checked to see if
>> it has reached the zone low watermark and if so, kswapd is awakened.
>> Kswapd will start scanning pages looking for inactive pages to evict to
>> make room for new page allocations. The work of kswapd allows tasks to
>> continue allocating memory from their respective zone free list without
>> incurring any delay.
>> 
>> When the demand for free pages exceeds the rate that kswapd tasks can
>> supply them, page allocation works differently. Once the allocating task
>> finds that the number of free pages is at or below the zone min watermark,
>> the task will no longer pull pages from the free list. Instead, the task
>> will run the same CPU-bound routines as kswapd to satisfy its own
>> allocation by scanning and evicting pages. This is called a direct reclaim.
>> 
>> The time spent performing a direct reclaim can be substantial, often
>> taking tens to hundreds of milliseconds for small order0 allocations to
>> half a second or more for order9 huge-page allocations. In fact, kswapd is
>> not actually required on a linux system. It exists for the sole purpose of
>> optimizing performance by preventing direct reclaims.
>> 
>> When memory shortfall is sufficient to trigger direct reclaims, they can
>> occur in any task that is running on the system. A single aggressive
>> memory allocating task can set the stage for collateral damage to occur in
>> small tasks that rarely allocate additional memory. Consider the impact of
>> injecting an additional 100ms of latency when nscd allocates memory to
>> facilitate caching of a DNS query.
>> 
>> The presence of direct reclaims 10 years ago was a fairly reliable
>> indicator that too much was being asked of a Linux system. Kswapd was
>> likely wasting time scanning pages that were ineligible for eviction.
>> Adding RAM or reducing the working set size would usually make the problem
>> go away. Since then hardware has evolved to bring a new struggle for
>> kswapd. Storage speeds have increased by orders of magnitude while CPU
>> clock speeds stayed the same or even slowed down in exchange for more
>> cores per package. This presents a throughput problem for a single
>> threaded kswapd that will get worse with each generation of new hardware.
> 
> AFAIR we used to scale the number of kswapd workers many years ago. It
> just turned out to be not all that great. We have a kswapd reclaim
> window for quite some time and that can allow to tune how much proactive
> kswapd should be.

I am not aware of a previous version of Linux that offered more than one kswapd
thread per NUMA node.

> 
> Also please note that the direct reclaim is a way to throttle overly
> aggressive memory consumers. The more we do in the background context
> the easier for them it will be to allocate faster. So I am not really
> sure that more background threads will solve the underlying problem.

A single kswapd thread used to keep up with all of the demand you could
create on a Linux system quite easily provided it didn’t have to scan a lot
of pages that were ineligible for eviction. 10 years ago, Fibre Channel was
the popular high performance interconnect and if you were lucky enough
to have the latest hardware rated at 10GFC, you could get 1.2GB/s per host
bus adapter. Also, most high end storage solutions were still using spinning
rust so it took an insane number of spindles behind each host bus adapter
to saturate the channel if the access patterns were random. There really
wasn’t a reason to try to thread kswapd, and I am pretty sure there hasn’t
been any attempts to do this in the last 10 years.

> It is just a matter of memory hogs tunning to end in the very same
> situtation AFAICS. Moreover the more they are going to allocate the more
> less CPU time will _other_ (non-allocating) task get.

Please describe the scenario a bit more clearly. Once you start constructing
the workload that can create this scenario, I think you will find that you end
up with a mix that is rarely seen in practice.

> 
>> Test Details
> 
> I will have to study this more to comment.
> 
> [...]
>> By increasing the number of kswapd threads, throughput increased by ~50%
>> while kernel mode CPU utilization decreased or stayed the same, likely due
>> to a decrease in the number of parallel tasks at any given time doing page
>> replacement.
> 
> Well, isn't that just an effect of more work being done on behalf of
> other workload that might 

Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node

2018-04-10 Thread Buddy Lumpkin

> On Apr 3, 2018, at 6:31 AM, Michal Hocko  wrote:
> 
> On Mon 02-04-18 09:24:22, Buddy Lumpkin wrote:
>> Page replacement is handled in the Linux Kernel in one of two ways:
>> 
>> 1) Asynchronously via kswapd
>> 2) Synchronously, via direct reclaim
>> 
>> At page allocation time the allocating task is immediately given a page
>> from the zone free list allowing it to go right back to work doing
>> whatever it was doing; Probably directly or indirectly executing business
>> logic.
>> 
>> Just prior to satisfying the allocation, free pages is checked to see if
>> it has reached the zone low watermark and if so, kswapd is awakened.
>> Kswapd will start scanning pages looking for inactive pages to evict to
>> make room for new page allocations. The work of kswapd allows tasks to
>> continue allocating memory from their respective zone free list without
>> incurring any delay.
>> 
>> When the demand for free pages exceeds the rate that kswapd tasks can
>> supply them, page allocation works differently. Once the allocating task
>> finds that the number of free pages is at or below the zone min watermark,
>> the task will no longer pull pages from the free list. Instead, the task
>> will run the same CPU-bound routines as kswapd to satisfy its own
>> allocation by scanning and evicting pages. This is called a direct reclaim.
>> 
>> The time spent performing a direct reclaim can be substantial, often
>> taking tens to hundreds of milliseconds for small order0 allocations to
>> half a second or more for order9 huge-page allocations. In fact, kswapd is
>> not actually required on a linux system. It exists for the sole purpose of
>> optimizing performance by preventing direct reclaims.
>> 
>> When memory shortfall is sufficient to trigger direct reclaims, they can
>> occur in any task that is running on the system. A single aggressive
>> memory allocating task can set the stage for collateral damage to occur in
>> small tasks that rarely allocate additional memory. Consider the impact of
>> injecting an additional 100ms of latency when nscd allocates memory to
>> facilitate caching of a DNS query.
>> 
>> The presence of direct reclaims 10 years ago was a fairly reliable
>> indicator that too much was being asked of a Linux system. Kswapd was
>> likely wasting time scanning pages that were ineligible for eviction.
>> Adding RAM or reducing the working set size would usually make the problem
>> go away. Since then hardware has evolved to bring a new struggle for
>> kswapd. Storage speeds have increased by orders of magnitude while CPU
>> clock speeds stayed the same or even slowed down in exchange for more
>> cores per package. This presents a throughput problem for a single
>> threaded kswapd that will get worse with each generation of new hardware.
> 
> AFAIR we used to scale the number of kswapd workers many years ago. It
> just turned out to be not all that great. We have a kswapd reclaim
> window for quite some time and that can allow to tune how much proactive
> kswapd should be.

I am not aware of a previous version of Linux that offered more than one kswapd
thread per NUMA node.

> 
> Also please note that the direct reclaim is a way to throttle overly
> aggressive memory consumers. The more we do in the background context
> the easier for them it will be to allocate faster. So I am not really
> sure that more background threads will solve the underlying problem.

A single kswapd thread used to keep up with all of the demand you could
create on a Linux system quite easily provided it didn’t have to scan a lot
of pages that were ineligible for eviction. 10 years ago, Fibre Channel was
the popular high performance interconnect and if you were lucky enough
to have the latest hardware rated at 10GFC, you could get 1.2GB/s per host
bus adapter. Also, most high end storage solutions were still using spinning
rust so it took an insane number of spindles behind each host bus adapter
to saturate the channel if the access patterns were random. There really
wasn’t a reason to try to thread kswapd, and I am pretty sure there hasn’t
been any attempts to do this in the last 10 years.

> It is just a matter of memory hogs tunning to end in the very same
> situtation AFAICS. Moreover the more they are going to allocate the more
> less CPU time will _other_ (non-allocating) task get.

Please describe the scenario a bit more clearly. Once you start constructing
the workload that can create this scenario, I think you will find that you end
up with a mix that is rarely seen in practice.

> 
>> Test Details
> 
> I will have to study this more to comment.
> 
> [...]
>> By increasing the number of kswapd threads, throughput increased by ~50%
>> while kernel mode CPU utilization decreased or stayed the same, likely due
>> to a decrease in the number of parallel tasks at any given time doing page
>> replacement.
> 
> Well, isn't that just an effect of more work being done on behalf of
> other workload that might run along with your 

Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node

2018-04-04 Thread Buddy Lumpkin

> On Apr 3, 2018, at 2:12 PM, Matthew Wilcox  wrote:
> 
> On Tue, Apr 03, 2018 at 01:49:25PM -0700, Buddy Lumpkin wrote:
>>> Yes, very much this.  If you have a single-threaded workload which is
>>> using the entirety of memory and would like to use even more, then it
>>> makes sense to use as many CPUs as necessary getting memory out of its
>>> way.  If you have N CPUs and N-1 threads happily occupying themselves in
>>> their own reasonably-sized working sets with one monster process trying
>>> to use as much RAM as possible, then I'd be pretty unimpressed to see
>>> the N-1 well-behaved threads preempted by kswapd.
>> 
>> The default value provides one kswapd thread per NUMA node, the same
>> it was without the patch. Also, I would point out that just because you 
>> devote
>> more threads to kswapd, doesn’t mean they are busy. If multiple kswapd 
>> threads
>> are busy, they are almost certainly doing work that would have resulted in
>> direct reclaims, which are often substantially more expensive than a couple
>> extra context switches due to preemption.
> 
> [...]
> 
>> In my previous response to Michal Hocko, I described
>> how I think we could scale watermarks in response to direct reclaims, and
>> launch more kswapd threads when kswapd peaks at 100% CPU usage.
> 
> I think you're missing my point about the workload ... kswapd isn't
> "nice", so it will compete with the N-1 threads which are chugging along
> at 100% CPU inside their working sets.  In this scenario, we _don't_
> want to kick off kswapd at all; we want the monster thread to clean up
> its own mess.  If we have idle CPUs, then yes, absolutely, lets have
> them clean up for the monster, but otherwise, I want my N-1 threads
> doing their own thing.
> 
> Maybe we should renice kswapd anyway ... thoughts?  We don't seem to have
> had a nice'd kswapd since 2.6.12, but maybe we played with that earlier
> and discovered it was a bad idea?
> 


Trying to distinguish between the monster and a high value task that you want
to run as quickly as possible would be challenging. I like your idea of using
renice. It probably makes sense to continue to run the first thread on each node
at a standard nice value, and run each additional task with a positive nice 
value.








Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node

2018-04-04 Thread Buddy Lumpkin

> On Apr 3, 2018, at 2:12 PM, Matthew Wilcox  wrote:
> 
> On Tue, Apr 03, 2018 at 01:49:25PM -0700, Buddy Lumpkin wrote:
>>> Yes, very much this.  If you have a single-threaded workload which is
>>> using the entirety of memory and would like to use even more, then it
>>> makes sense to use as many CPUs as necessary getting memory out of its
>>> way.  If you have N CPUs and N-1 threads happily occupying themselves in
>>> their own reasonably-sized working sets with one monster process trying
>>> to use as much RAM as possible, then I'd be pretty unimpressed to see
>>> the N-1 well-behaved threads preempted by kswapd.
>> 
>> The default value provides one kswapd thread per NUMA node, the same
>> it was without the patch. Also, I would point out that just because you 
>> devote
>> more threads to kswapd, doesn’t mean they are busy. If multiple kswapd 
>> threads
>> are busy, they are almost certainly doing work that would have resulted in
>> direct reclaims, which are often substantially more expensive than a couple
>> extra context switches due to preemption.
> 
> [...]
> 
>> In my previous response to Michal Hocko, I described
>> how I think we could scale watermarks in response to direct reclaims, and
>> launch more kswapd threads when kswapd peaks at 100% CPU usage.
> 
> I think you're missing my point about the workload ... kswapd isn't
> "nice", so it will compete with the N-1 threads which are chugging along
> at 100% CPU inside their working sets.  In this scenario, we _don't_
> want to kick off kswapd at all; we want the monster thread to clean up
> its own mess.  If we have idle CPUs, then yes, absolutely, lets have
> them clean up for the monster, but otherwise, I want my N-1 threads
> doing their own thing.
> 
> Maybe we should renice kswapd anyway ... thoughts?  We don't seem to have
> had a nice'd kswapd since 2.6.12, but maybe we played with that earlier
> and discovered it was a bad idea?
> 


Trying to distinguish between the monster and a high value task that you want
to run as quickly as possible would be challenging. I like your idea of using
renice. It probably makes sense to continue to run the first thread on each node
at a standard nice value, and run each additional task with a positive nice 
value.








Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node

2018-04-04 Thread Buddy Lumpkin

> On Apr 3, 2018, at 2:12 PM, Matthew Wilcox  wrote:
> 
> On Tue, Apr 03, 2018 at 01:49:25PM -0700, Buddy Lumpkin wrote:
>>> Yes, very much this.  If you have a single-threaded workload which is
>>> using the entirety of memory and would like to use even more, then it
>>> makes sense to use as many CPUs as necessary getting memory out of its
>>> way.  If you have N CPUs and N-1 threads happily occupying themselves in
>>> their own reasonably-sized working sets with one monster process trying
>>> to use as much RAM as possible, then I'd be pretty unimpressed to see
>>> the N-1 well-behaved threads preempted by kswapd.
>> 
>> The default value provides one kswapd thread per NUMA node, the same
>> it was without the patch. Also, I would point out that just because you 
>> devote
>> more threads to kswapd, doesn’t mean they are busy. If multiple kswapd 
>> threads
>> are busy, they are almost certainly doing work that would have resulted in
>> direct reclaims, which are often substantially more expensive than a couple
>> extra context switches due to preemption.
> 
> [...]
> 
>> In my previous response to Michal Hocko, I described
>> how I think we could scale watermarks in response to direct reclaims, and
>> launch more kswapd threads when kswapd peaks at 100% CPU usage.
> 
> I think you're missing my point about the workload ... kswapd isn't
> "nice", so it will compete with the N-1 threads which are chugging along
> at 100% CPU inside their working sets.  In this scenario, we _don't_
> want to kick off kswapd at all; we want the monster thread to clean up
> its own mess.  If we have idle CPUs, then yes, absolutely, lets have
> them clean up for the monster, but otherwise, I want my N-1 threads
> doing their own thing.

For the scenario you describe above. I have my own opinions, but I would rather 
not
speculate on what happens. Tomorrow I will try to simulate this situation and 
i’ll
report back on the results. I think this actually makes a case for accepting 
the patch 
as-is for now.  Please hear me out on this:

You mentioned being concerned that an admin will do the wrong thing with this
tunable. I worked in the System Administrator/System Engineering job families 
for
many years and even though I transitioned to spending most of my time on
performance and kernel work, I still maintain an active role in System 
Engineering
related projects, hiring and mentoring.

The kswapd_threads tunable defaults to a value of one, which is the current 
default
behavior. I think there are plenty of sysctls that are more confusing than this 
one. 
If you want to make a comparison, I would say that Transparent Hugepages is one
of the best examples of a feature that has confused System Administrators. I am 
sure
it works a lot better today, but it has a history of really sharp edges, and it 
has been
shipping enabled by default for a long time in the OS distributions I am 
familiar with.
I am hopeful that it works better in later kernels as I think we need more 
features
like it. Specifically, features that bring high performance to naive third 
party apps
that do not make use of advanced features like hugetlbfs, spoke, direct IO, or 
clumsy
interfaces like posix_fadvise. But until they are absolutely polished, I wish 
these kinds
of features would not be turned on by default. This includes kswapd_threads.

More reasons why implementing this tunable makes sense for now:
- A feature like this is a lot easier to reason about after it has been used in 
the field
   for a while. This includes trying to auto-tune it
- We need an answer for this problem today. Today there are single NVMe drives
   capable of 10GB/s and larger systems than the system I used for testing
- In the scenario you describe above, an admin would have no reason to touch
  this sysctl
- I think I mentioned this before. I honestly thought a lot of tuning would be 
necessary
  after implementing this but so far that hasn’t been the case. It works pretty 
well.


> 
> Maybe we should renice kswapd anyway ... thoughts?  We don't seem to have
> had a nice'd kswapd since 2.6.12, but maybe we played with that earlier
> and discovered it was a bad idea?
> 



Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node

2018-04-04 Thread Buddy Lumpkin

> On Apr 3, 2018, at 2:12 PM, Matthew Wilcox  wrote:
> 
> On Tue, Apr 03, 2018 at 01:49:25PM -0700, Buddy Lumpkin wrote:
>>> Yes, very much this.  If you have a single-threaded workload which is
>>> using the entirety of memory and would like to use even more, then it
>>> makes sense to use as many CPUs as necessary getting memory out of its
>>> way.  If you have N CPUs and N-1 threads happily occupying themselves in
>>> their own reasonably-sized working sets with one monster process trying
>>> to use as much RAM as possible, then I'd be pretty unimpressed to see
>>> the N-1 well-behaved threads preempted by kswapd.
>> 
>> The default value provides one kswapd thread per NUMA node, the same
>> it was without the patch. Also, I would point out that just because you 
>> devote
>> more threads to kswapd, doesn’t mean they are busy. If multiple kswapd 
>> threads
>> are busy, they are almost certainly doing work that would have resulted in
>> direct reclaims, which are often substantially more expensive than a couple
>> extra context switches due to preemption.
> 
> [...]
> 
>> In my previous response to Michal Hocko, I described
>> how I think we could scale watermarks in response to direct reclaims, and
>> launch more kswapd threads when kswapd peaks at 100% CPU usage.
> 
> I think you're missing my point about the workload ... kswapd isn't
> "nice", so it will compete with the N-1 threads which are chugging along
> at 100% CPU inside their working sets.  In this scenario, we _don't_
> want to kick off kswapd at all; we want the monster thread to clean up
> its own mess.  If we have idle CPUs, then yes, absolutely, lets have
> them clean up for the monster, but otherwise, I want my N-1 threads
> doing their own thing.

For the scenario you describe above. I have my own opinions, but I would rather 
not
speculate on what happens. Tomorrow I will try to simulate this situation and 
i’ll
report back on the results. I think this actually makes a case for accepting 
the patch 
as-is for now.  Please hear me out on this:

You mentioned being concerned that an admin will do the wrong thing with this
tunable. I worked in the System Administrator/System Engineering job families 
for
many years and even though I transitioned to spending most of my time on
performance and kernel work, I still maintain an active role in System 
Engineering
related projects, hiring and mentoring.

The kswapd_threads tunable defaults to a value of one, which is the current 
default
behavior. I think there are plenty of sysctls that are more confusing than this 
one. 
If you want to make a comparison, I would say that Transparent Hugepages is one
of the best examples of a feature that has confused System Administrators. I am 
sure
it works a lot better today, but it has a history of really sharp edges, and it 
has been
shipping enabled by default for a long time in the OS distributions I am 
familiar with.
I am hopeful that it works better in later kernels as I think we need more 
features
like it. Specifically, features that bring high performance to naive third 
party apps
that do not make use of advanced features like hugetlbfs, spoke, direct IO, or 
clumsy
interfaces like posix_fadvise. But until they are absolutely polished, I wish 
these kinds
of features would not be turned on by default. This includes kswapd_threads.

More reasons why implementing this tunable makes sense for now:
- A feature like this is a lot easier to reason about after it has been used in 
the field
   for a while. This includes trying to auto-tune it
- We need an answer for this problem today. Today there are single NVMe drives
   capable of 10GB/s and larger systems than the system I used for testing
- In the scenario you describe above, an admin would have no reason to touch
  this sysctl
- I think I mentioned this before. I honestly thought a lot of tuning would be 
necessary
  after implementing this but so far that hasn’t been the case. It works pretty 
well.


> 
> Maybe we should renice kswapd anyway ... thoughts?  We don't seem to have
> had a nice'd kswapd since 2.6.12, but maybe we played with that earlier
> and discovered it was a bad idea?
> 



Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node

2018-04-03 Thread Matthew Wilcox
On Tue, Apr 03, 2018 at 01:49:25PM -0700, Buddy Lumpkin wrote:
> > Yes, very much this.  If you have a single-threaded workload which is
> > using the entirety of memory and would like to use even more, then it
> > makes sense to use as many CPUs as necessary getting memory out of its
> > way.  If you have N CPUs and N-1 threads happily occupying themselves in
> > their own reasonably-sized working sets with one monster process trying
> > to use as much RAM as possible, then I'd be pretty unimpressed to see
> > the N-1 well-behaved threads preempted by kswapd.
> 
> The default value provides one kswapd thread per NUMA node, the same
> it was without the patch. Also, I would point out that just because you devote
> more threads to kswapd, doesn’t mean they are busy. If multiple kswapd threads
> are busy, they are almost certainly doing work that would have resulted in
> direct reclaims, which are often substantially more expensive than a couple
> extra context switches due to preemption.

[...]

> In my previous response to Michal Hocko, I described
> how I think we could scale watermarks in response to direct reclaims, and
> launch more kswapd threads when kswapd peaks at 100% CPU usage.

I think you're missing my point about the workload ... kswapd isn't
"nice", so it will compete with the N-1 threads which are chugging along
at 100% CPU inside their working sets.  In this scenario, we _don't_
want to kick off kswapd at all; we want the monster thread to clean up
its own mess.  If we have idle CPUs, then yes, absolutely, lets have
them clean up for the monster, but otherwise, I want my N-1 threads
doing their own thing.

Maybe we should renice kswapd anyway ... thoughts?  We don't seem to have
had a nice'd kswapd since 2.6.12, but maybe we played with that earlier
and discovered it was a bad idea?



Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node

2018-04-03 Thread Matthew Wilcox
On Tue, Apr 03, 2018 at 01:49:25PM -0700, Buddy Lumpkin wrote:
> > Yes, very much this.  If you have a single-threaded workload which is
> > using the entirety of memory and would like to use even more, then it
> > makes sense to use as many CPUs as necessary getting memory out of its
> > way.  If you have N CPUs and N-1 threads happily occupying themselves in
> > their own reasonably-sized working sets with one monster process trying
> > to use as much RAM as possible, then I'd be pretty unimpressed to see
> > the N-1 well-behaved threads preempted by kswapd.
> 
> The default value provides one kswapd thread per NUMA node, the same
> it was without the patch. Also, I would point out that just because you devote
> more threads to kswapd, doesn’t mean they are busy. If multiple kswapd threads
> are busy, they are almost certainly doing work that would have resulted in
> direct reclaims, which are often substantially more expensive than a couple
> extra context switches due to preemption.

[...]

> In my previous response to Michal Hocko, I described
> how I think we could scale watermarks in response to direct reclaims, and
> launch more kswapd threads when kswapd peaks at 100% CPU usage.

I think you're missing my point about the workload ... kswapd isn't
"nice", so it will compete with the N-1 threads which are chugging along
at 100% CPU inside their working sets.  In this scenario, we _don't_
want to kick off kswapd at all; we want the monster thread to clean up
its own mess.  If we have idle CPUs, then yes, absolutely, lets have
them clean up for the monster, but otherwise, I want my N-1 threads
doing their own thing.

Maybe we should renice kswapd anyway ... thoughts?  We don't seem to have
had a nice'd kswapd since 2.6.12, but maybe we played with that earlier
and discovered it was a bad idea?



Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node

2018-04-03 Thread Buddy Lumpkin

> On Apr 3, 2018, at 12:07 PM, Matthew Wilcox  wrote:
> 
> On Tue, Apr 03, 2018 at 03:31:15PM +0200, Michal Hocko wrote:
>> On Mon 02-04-18 09:24:22, Buddy Lumpkin wrote:
>>> The presence of direct reclaims 10 years ago was a fairly reliable
>>> indicator that too much was being asked of a Linux system. Kswapd was
>>> likely wasting time scanning pages that were ineligible for eviction.
>>> Adding RAM or reducing the working set size would usually make the problem
>>> go away. Since then hardware has evolved to bring a new struggle for
>>> kswapd. Storage speeds have increased by orders of magnitude while CPU
>>> clock speeds stayed the same or even slowed down in exchange for more
>>> cores per package. This presents a throughput problem for a single
>>> threaded kswapd that will get worse with each generation of new hardware.
>> 
>> AFAIR we used to scale the number of kswapd workers many years ago. It
>> just turned out to be not all that great. We have a kswapd reclaim
>> window for quite some time and that can allow to tune how much proactive
>> kswapd should be.
>> 
>> Also please note that the direct reclaim is a way to throttle overly
>> aggressive memory consumers. The more we do in the background context
>> the easier for them it will be to allocate faster. So I am not really
>> sure that more background threads will solve the underlying problem. It
>> is just a matter of memory hogs tunning to end in the very same
>> situtation AFAICS. Moreover the more they are going to allocate the more
>> less CPU time will _other_ (non-allocating) task get.
>> 
>>> Test Details
>> 
>> I will have to study this more to comment.
>> 
>> [...]
>>> By increasing the number of kswapd threads, throughput increased by ~50%
>>> while kernel mode CPU utilization decreased or stayed the same, likely due
>>> to a decrease in the number of parallel tasks at any given time doing page
>>> replacement.
>> 
>> Well, isn't that just an effect of more work being done on behalf of
>> other workload that might run along with your tests (and which doesn't
>> really need to allocate a lot of memory)? In other words how
>> does the patch behaves with a non-artificial mixed workloads?
>> 
>> Please note that I am not saying that we absolutely have to stick with the
>> current single-thread-per-node implementation but I would really like to
>> see more background on why we should be allowing heavy memory hogs to
>> allocate faster or how to prevent that. I would be also very interested
>> to see how to scale the number of threads based on how CPUs are utilized
>> by other workloads.
> 
> Yes, very much this.  If you have a single-threaded workload which is
> using the entirety of memory and would like to use even more, then it
> makes sense to use as many CPUs as necessary getting memory out of its
> way.  If you have N CPUs and N-1 threads happily occupying themselves in
> their own reasonably-sized working sets with one monster process trying
> to use as much RAM as possible, then I'd be pretty unimpressed to see
> the N-1 well-behaved threads preempted by kswapd.

The default value provides one kswapd thread per NUMA node, the same
it was without the patch. Also, I would point out that just because you devote
more threads to kswapd, doesn’t mean they are busy. If multiple kswapd threads
are busy, they are almost certainly doing work that would have resulted in
direct reclaims, which are often substantially more expensive than a couple
extra context switches due to preemption.

Also, the code still uses wake_up_interruptible to wake kswapd threads, so
after starting the first kswapd thread, free pages minus the size of the 
allocation
would still need to be below the low watermark for a page allocation at that 
time
to cause another kswapd thread to wake up.

When I first decided to try this out, I figured a lot of tuning would be needed 
to
see good behavior. But what I found in practice was that it actually works quite
well. When you look closely, you see that there is very little difference 
between
a direct reclaim and kswapd. In fact, direct reclaims work a little harder than
kswapd, and they should continue to do so because that prevents the number
of parallel scanning tasks from increasing unnecessarily.

Please try it out, you might be surprised at how well it works. 

> 
> My biggest problem with the patch-as-presented is that it's yet one more
> thing for admins to get wrong.  We should spawn more threads automatically
> if system conditions are right to do that.

I totally agree with this. In my previous response to Michal Hocko, I described
how I think we could scale watermarks in response to direct reclaims, and
launch more kswapd threads when kswapd peaks at 100% CPU usage.






Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node

2018-04-03 Thread Buddy Lumpkin

> On Apr 3, 2018, at 12:07 PM, Matthew Wilcox  wrote:
> 
> On Tue, Apr 03, 2018 at 03:31:15PM +0200, Michal Hocko wrote:
>> On Mon 02-04-18 09:24:22, Buddy Lumpkin wrote:
>>> The presence of direct reclaims 10 years ago was a fairly reliable
>>> indicator that too much was being asked of a Linux system. Kswapd was
>>> likely wasting time scanning pages that were ineligible for eviction.
>>> Adding RAM or reducing the working set size would usually make the problem
>>> go away. Since then hardware has evolved to bring a new struggle for
>>> kswapd. Storage speeds have increased by orders of magnitude while CPU
>>> clock speeds stayed the same or even slowed down in exchange for more
>>> cores per package. This presents a throughput problem for a single
>>> threaded kswapd that will get worse with each generation of new hardware.
>> 
>> AFAIR we used to scale the number of kswapd workers many years ago. It
>> just turned out to be not all that great. We have a kswapd reclaim
>> window for quite some time and that can allow to tune how much proactive
>> kswapd should be.
>> 
>> Also please note that the direct reclaim is a way to throttle overly
>> aggressive memory consumers. The more we do in the background context
>> the easier for them it will be to allocate faster. So I am not really
>> sure that more background threads will solve the underlying problem. It
>> is just a matter of memory hogs tunning to end in the very same
>> situtation AFAICS. Moreover the more they are going to allocate the more
>> less CPU time will _other_ (non-allocating) task get.
>> 
>>> Test Details
>> 
>> I will have to study this more to comment.
>> 
>> [...]
>>> By increasing the number of kswapd threads, throughput increased by ~50%
>>> while kernel mode CPU utilization decreased or stayed the same, likely due
>>> to a decrease in the number of parallel tasks at any given time doing page
>>> replacement.
>> 
>> Well, isn't that just an effect of more work being done on behalf of
>> other workload that might run along with your tests (and which doesn't
>> really need to allocate a lot of memory)? In other words how
>> does the patch behaves with a non-artificial mixed workloads?
>> 
>> Please note that I am not saying that we absolutely have to stick with the
>> current single-thread-per-node implementation but I would really like to
>> see more background on why we should be allowing heavy memory hogs to
>> allocate faster or how to prevent that. I would be also very interested
>> to see how to scale the number of threads based on how CPUs are utilized
>> by other workloads.
> 
> Yes, very much this.  If you have a single-threaded workload which is
> using the entirety of memory and would like to use even more, then it
> makes sense to use as many CPUs as necessary getting memory out of its
> way.  If you have N CPUs and N-1 threads happily occupying themselves in
> their own reasonably-sized working sets with one monster process trying
> to use as much RAM as possible, then I'd be pretty unimpressed to see
> the N-1 well-behaved threads preempted by kswapd.

The default value provides one kswapd thread per NUMA node, the same
it was without the patch. Also, I would point out that just because you devote
more threads to kswapd, doesn’t mean they are busy. If multiple kswapd threads
are busy, they are almost certainly doing work that would have resulted in
direct reclaims, which are often substantially more expensive than a couple
extra context switches due to preemption.

Also, the code still uses wake_up_interruptible to wake kswapd threads, so
after starting the first kswapd thread, free pages minus the size of the 
allocation
would still need to be below the low watermark for a page allocation at that 
time
to cause another kswapd thread to wake up.

When I first decided to try this out, I figured a lot of tuning would be needed 
to
see good behavior. But what I found in practice was that it actually works quite
well. When you look closely, you see that there is very little difference 
between
a direct reclaim and kswapd. In fact, direct reclaims work a little harder than
kswapd, and they should continue to do so because that prevents the number
of parallel scanning tasks from increasing unnecessarily.

Please try it out, you might be surprised at how well it works. 

> 
> My biggest problem with the patch-as-presented is that it's yet one more
> thing for admins to get wrong.  We should spawn more threads automatically
> if system conditions are right to do that.

I totally agree with this. In my previous response to Michal Hocko, I described
how I think we could scale watermarks in response to direct reclaims, and
launch more kswapd threads when kswapd peaks at 100% CPU usage.






Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node

2018-04-03 Thread Buddy Lumpkin
Very sorry, I forgot to send my last response as plain text.

> On Apr 3, 2018, at 6:31 AM, Michal Hocko  wrote:
> 
> On Mon 02-04-18 09:24:22, Buddy Lumpkin wrote:
>> Page replacement is handled in the Linux Kernel in one of two ways:
>> 
>> 1) Asynchronously via kswapd
>> 2) Synchronously, via direct reclaim
>> 
>> At page allocation time the allocating task is immediately given a page
>> from the zone free list allowing it to go right back to work doing
>> whatever it was doing; Probably directly or indirectly executing business
>> logic.
>> 
>> Just prior to satisfying the allocation, free pages is checked to see if
>> it has reached the zone low watermark and if so, kswapd is awakened.
>> Kswapd will start scanning pages looking for inactive pages to evict to
>> make room for new page allocations. The work of kswapd allows tasks to
>> continue allocating memory from their respective zone free list without
>> incurring any delay.
>> 
>> When the demand for free pages exceeds the rate that kswapd tasks can
>> supply them, page allocation works differently. Once the allocating task
>> finds that the number of free pages is at or below the zone min watermark,
>> the task will no longer pull pages from the free list. Instead, the task
>> will run the same CPU-bound routines as kswapd to satisfy its own
>> allocation by scanning and evicting pages. This is called a direct reclaim.
>> 
>> The time spent performing a direct reclaim can be substantial, often
>> taking tens to hundreds of milliseconds for small order0 allocations to
>> half a second or more for order9 huge-page allocations. In fact, kswapd is
>> not actually required on a linux system. It exists for the sole purpose of
>> optimizing performance by preventing direct reclaims.
>> 
>> When memory shortfall is sufficient to trigger direct reclaims, they can
>> occur in any task that is running on the system. A single aggressive
>> memory allocating task can set the stage for collateral damage to occur in
>> small tasks that rarely allocate additional memory. Consider the impact of
>> injecting an additional 100ms of latency when nscd allocates memory to
>> facilitate caching of a DNS query.
>> 
>> The presence of direct reclaims 10 years ago was a fairly reliable
>> indicator that too much was being asked of a Linux system. Kswapd was
>> likely wasting time scanning pages that were ineligible for eviction.
>> Adding RAM or reducing the working set size would usually make the problem
>> go away. Since then hardware has evolved to bring a new struggle for
>> kswapd. Storage speeds have increased by orders of magnitude while CPU
>> clock speeds stayed the same or even slowed down in exchange for more
>> cores per package. This presents a throughput problem for a single
>> threaded kswapd that will get worse with each generation of new hardware.
> 
> AFAIR we used to scale the number of kswapd workers many years ago. It
> just turned out to be not all that great. We have a kswapd reclaim
> window for quite some time and that can allow to tune how much proactive
> kswapd should be.

Are you referring to vm.watermark_scale_factor? This helps quite a bit. 
Previously
I had to increase min_free_kbytes in order to get a larger gap between the low
and min watemarks. I was very excited when saw that this had been added
upstream. 

> 
> Also please note that the direct reclaim is a way to throttle overly
> aggressive memory consumers.

I totally agree, in fact I think this should be the primary role of direct 
reclaims
because they have a substantial impact on performance. Direct reclaims are
the emergency brakes for page allocation, and the case I am making here is 
that they used to only occur when kswapd had to skip over a lot of pages. 

This changed over time as the rate a system can allocate pages increased. 
Direct reclaims slowly became a normal part of page replacement. 


> The more we do in the background context
> the easier for them it will be to allocate faster. So I am not really
> sure that more background threads will solve the underlying problem. It
> is just a matter of memory hogs tunning to end in the very same
> situtation AFAICS. Moreover the more they are going to allocate the more
> less CPU time will _other_ (non-allocating) task get.

The important thing to realize here is that kswapd and direct reclaims run the
same code paths. There is very little that they do differently. If you compare
my test results with one kswapd vs four, your an see that direct reclaims
increase the kernel mode CPU consumption considerably. By dedicating
more threads to proactive page replacement, you eliminate direct reclaims
which reduces the total number of parallel threads that are spinning on the
CPU.

> 
>> Test Details
> 
> I will have to study this more to comment.
> 
> [...]
>> By increasing the number of kswapd threads, throughput increased by ~50%
>> while kernel mode CPU utilization decreased or stayed the same, likely due
>> 

Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node

2018-04-03 Thread Buddy Lumpkin
Very sorry, I forgot to send my last response as plain text.

> On Apr 3, 2018, at 6:31 AM, Michal Hocko  wrote:
> 
> On Mon 02-04-18 09:24:22, Buddy Lumpkin wrote:
>> Page replacement is handled in the Linux Kernel in one of two ways:
>> 
>> 1) Asynchronously via kswapd
>> 2) Synchronously, via direct reclaim
>> 
>> At page allocation time the allocating task is immediately given a page
>> from the zone free list allowing it to go right back to work doing
>> whatever it was doing; Probably directly or indirectly executing business
>> logic.
>> 
>> Just prior to satisfying the allocation, free pages is checked to see if
>> it has reached the zone low watermark and if so, kswapd is awakened.
>> Kswapd will start scanning pages looking for inactive pages to evict to
>> make room for new page allocations. The work of kswapd allows tasks to
>> continue allocating memory from their respective zone free list without
>> incurring any delay.
>> 
>> When the demand for free pages exceeds the rate that kswapd tasks can
>> supply them, page allocation works differently. Once the allocating task
>> finds that the number of free pages is at or below the zone min watermark,
>> the task will no longer pull pages from the free list. Instead, the task
>> will run the same CPU-bound routines as kswapd to satisfy its own
>> allocation by scanning and evicting pages. This is called a direct reclaim.
>> 
>> The time spent performing a direct reclaim can be substantial, often
>> taking tens to hundreds of milliseconds for small order0 allocations to
>> half a second or more for order9 huge-page allocations. In fact, kswapd is
>> not actually required on a linux system. It exists for the sole purpose of
>> optimizing performance by preventing direct reclaims.
>> 
>> When memory shortfall is sufficient to trigger direct reclaims, they can
>> occur in any task that is running on the system. A single aggressive
>> memory allocating task can set the stage for collateral damage to occur in
>> small tasks that rarely allocate additional memory. Consider the impact of
>> injecting an additional 100ms of latency when nscd allocates memory to
>> facilitate caching of a DNS query.
>> 
>> The presence of direct reclaims 10 years ago was a fairly reliable
>> indicator that too much was being asked of a Linux system. Kswapd was
>> likely wasting time scanning pages that were ineligible for eviction.
>> Adding RAM or reducing the working set size would usually make the problem
>> go away. Since then hardware has evolved to bring a new struggle for
>> kswapd. Storage speeds have increased by orders of magnitude while CPU
>> clock speeds stayed the same or even slowed down in exchange for more
>> cores per package. This presents a throughput problem for a single
>> threaded kswapd that will get worse with each generation of new hardware.
> 
> AFAIR we used to scale the number of kswapd workers many years ago. It
> just turned out to be not all that great. We have a kswapd reclaim
> window for quite some time and that can allow to tune how much proactive
> kswapd should be.

Are you referring to vm.watermark_scale_factor? This helps quite a bit. 
Previously
I had to increase min_free_kbytes in order to get a larger gap between the low
and min watemarks. I was very excited when saw that this had been added
upstream. 

> 
> Also please note that the direct reclaim is a way to throttle overly
> aggressive memory consumers.

I totally agree, in fact I think this should be the primary role of direct 
reclaims
because they have a substantial impact on performance. Direct reclaims are
the emergency brakes for page allocation, and the case I am making here is 
that they used to only occur when kswapd had to skip over a lot of pages. 

This changed over time as the rate a system can allocate pages increased. 
Direct reclaims slowly became a normal part of page replacement. 


> The more we do in the background context
> the easier for them it will be to allocate faster. So I am not really
> sure that more background threads will solve the underlying problem. It
> is just a matter of memory hogs tunning to end in the very same
> situtation AFAICS. Moreover the more they are going to allocate the more
> less CPU time will _other_ (non-allocating) task get.

The important thing to realize here is that kswapd and direct reclaims run the
same code paths. There is very little that they do differently. If you compare
my test results with one kswapd vs four, your an see that direct reclaims
increase the kernel mode CPU consumption considerably. By dedicating
more threads to proactive page replacement, you eliminate direct reclaims
which reduces the total number of parallel threads that are spinning on the
CPU.

> 
>> Test Details
> 
> I will have to study this more to comment.
> 
> [...]
>> By increasing the number of kswapd threads, throughput increased by ~50%
>> while kernel mode CPU utilization decreased or stayed the same, likely due
>> to a decrease in 

Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node

2018-04-03 Thread Matthew Wilcox
On Tue, Apr 03, 2018 at 03:31:15PM +0200, Michal Hocko wrote:
> On Mon 02-04-18 09:24:22, Buddy Lumpkin wrote:
> > The presence of direct reclaims 10 years ago was a fairly reliable
> > indicator that too much was being asked of a Linux system. Kswapd was
> > likely wasting time scanning pages that were ineligible for eviction.
> > Adding RAM or reducing the working set size would usually make the problem
> > go away. Since then hardware has evolved to bring a new struggle for
> > kswapd. Storage speeds have increased by orders of magnitude while CPU
> > clock speeds stayed the same or even slowed down in exchange for more
> > cores per package. This presents a throughput problem for a single
> > threaded kswapd that will get worse with each generation of new hardware.
> 
> AFAIR we used to scale the number of kswapd workers many years ago. It
> just turned out to be not all that great. We have a kswapd reclaim
> window for quite some time and that can allow to tune how much proactive
> kswapd should be.
> 
> Also please note that the direct reclaim is a way to throttle overly
> aggressive memory consumers. The more we do in the background context
> the easier for them it will be to allocate faster. So I am not really
> sure that more background threads will solve the underlying problem. It
> is just a matter of memory hogs tunning to end in the very same
> situtation AFAICS. Moreover the more they are going to allocate the more
> less CPU time will _other_ (non-allocating) task get.
> 
> > Test Details
> 
> I will have to study this more to comment.
> 
> [...]
> > By increasing the number of kswapd threads, throughput increased by ~50%
> > while kernel mode CPU utilization decreased or stayed the same, likely due
> > to a decrease in the number of parallel tasks at any given time doing page
> > replacement.
> 
> Well, isn't that just an effect of more work being done on behalf of
> other workload that might run along with your tests (and which doesn't
> really need to allocate a lot of memory)? In other words how
> does the patch behaves with a non-artificial mixed workloads?
> 
> Please note that I am not saying that we absolutely have to stick with the
> current single-thread-per-node implementation but I would really like to
> see more background on why we should be allowing heavy memory hogs to
> allocate faster or how to prevent that. I would be also very interested
> to see how to scale the number of threads based on how CPUs are utilized
> by other workloads.

Yes, very much this.  If you have a single-threaded workload which is
using the entirety of memory and would like to use even more, then it
makes sense to use as many CPUs as necessary getting memory out of its
way.  If you have N CPUs and N-1 threads happily occupying themselves in
their own reasonably-sized working sets with one monster process trying
to use as much RAM as possible, then I'd be pretty unimpressed to see
the N-1 well-behaved threads preempted by kswapd.

My biggest problem with the patch-as-presented is that it's yet one more
thing for admins to get wrong.  We should spawn more threads automatically
if system conditions are right to do that.


Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node

2018-04-03 Thread Matthew Wilcox
On Tue, Apr 03, 2018 at 03:31:15PM +0200, Michal Hocko wrote:
> On Mon 02-04-18 09:24:22, Buddy Lumpkin wrote:
> > The presence of direct reclaims 10 years ago was a fairly reliable
> > indicator that too much was being asked of a Linux system. Kswapd was
> > likely wasting time scanning pages that were ineligible for eviction.
> > Adding RAM or reducing the working set size would usually make the problem
> > go away. Since then hardware has evolved to bring a new struggle for
> > kswapd. Storage speeds have increased by orders of magnitude while CPU
> > clock speeds stayed the same or even slowed down in exchange for more
> > cores per package. This presents a throughput problem for a single
> > threaded kswapd that will get worse with each generation of new hardware.
> 
> AFAIR we used to scale the number of kswapd workers many years ago. It
> just turned out to be not all that great. We have a kswapd reclaim
> window for quite some time and that can allow to tune how much proactive
> kswapd should be.
> 
> Also please note that the direct reclaim is a way to throttle overly
> aggressive memory consumers. The more we do in the background context
> the easier for them it will be to allocate faster. So I am not really
> sure that more background threads will solve the underlying problem. It
> is just a matter of memory hogs tunning to end in the very same
> situtation AFAICS. Moreover the more they are going to allocate the more
> less CPU time will _other_ (non-allocating) task get.
> 
> > Test Details
> 
> I will have to study this more to comment.
> 
> [...]
> > By increasing the number of kswapd threads, throughput increased by ~50%
> > while kernel mode CPU utilization decreased or stayed the same, likely due
> > to a decrease in the number of parallel tasks at any given time doing page
> > replacement.
> 
> Well, isn't that just an effect of more work being done on behalf of
> other workload that might run along with your tests (and which doesn't
> really need to allocate a lot of memory)? In other words how
> does the patch behaves with a non-artificial mixed workloads?
> 
> Please note that I am not saying that we absolutely have to stick with the
> current single-thread-per-node implementation but I would really like to
> see more background on why we should be allowing heavy memory hogs to
> allocate faster or how to prevent that. I would be also very interested
> to see how to scale the number of threads based on how CPUs are utilized
> by other workloads.

Yes, very much this.  If you have a single-threaded workload which is
using the entirety of memory and would like to use even more, then it
makes sense to use as many CPUs as necessary getting memory out of its
way.  If you have N CPUs and N-1 threads happily occupying themselves in
their own reasonably-sized working sets with one monster process trying
to use as much RAM as possible, then I'd be pretty unimpressed to see
the N-1 well-behaved threads preempted by kswapd.

My biggest problem with the patch-as-presented is that it's yet one more
thing for admins to get wrong.  We should spawn more threads automatically
if system conditions are right to do that.


Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node

2018-04-03 Thread Michal Hocko
On Mon 02-04-18 09:24:22, Buddy Lumpkin wrote:
> Page replacement is handled in the Linux Kernel in one of two ways:
> 
> 1) Asynchronously via kswapd
> 2) Synchronously, via direct reclaim
> 
> At page allocation time the allocating task is immediately given a page
> from the zone free list allowing it to go right back to work doing
> whatever it was doing; Probably directly or indirectly executing business
> logic.
> 
> Just prior to satisfying the allocation, free pages is checked to see if
> it has reached the zone low watermark and if so, kswapd is awakened.
> Kswapd will start scanning pages looking for inactive pages to evict to
> make room for new page allocations. The work of kswapd allows tasks to
> continue allocating memory from their respective zone free list without
> incurring any delay.
> 
> When the demand for free pages exceeds the rate that kswapd tasks can
> supply them, page allocation works differently. Once the allocating task
> finds that the number of free pages is at or below the zone min watermark,
> the task will no longer pull pages from the free list. Instead, the task
> will run the same CPU-bound routines as kswapd to satisfy its own
> allocation by scanning and evicting pages. This is called a direct reclaim.
> 
> The time spent performing a direct reclaim can be substantial, often
> taking tens to hundreds of milliseconds for small order0 allocations to
> half a second or more for order9 huge-page allocations. In fact, kswapd is
> not actually required on a linux system. It exists for the sole purpose of
> optimizing performance by preventing direct reclaims.
> 
> When memory shortfall is sufficient to trigger direct reclaims, they can
> occur in any task that is running on the system. A single aggressive
> memory allocating task can set the stage for collateral damage to occur in
> small tasks that rarely allocate additional memory. Consider the impact of
> injecting an additional 100ms of latency when nscd allocates memory to
> facilitate caching of a DNS query.
> 
> The presence of direct reclaims 10 years ago was a fairly reliable
> indicator that too much was being asked of a Linux system. Kswapd was
> likely wasting time scanning pages that were ineligible for eviction.
> Adding RAM or reducing the working set size would usually make the problem
> go away. Since then hardware has evolved to bring a new struggle for
> kswapd. Storage speeds have increased by orders of magnitude while CPU
> clock speeds stayed the same or even slowed down in exchange for more
> cores per package. This presents a throughput problem for a single
> threaded kswapd that will get worse with each generation of new hardware.

AFAIR we used to scale the number of kswapd workers many years ago. It
just turned out to be not all that great. We have a kswapd reclaim
window for quite some time and that can allow to tune how much proactive
kswapd should be.

Also please note that the direct reclaim is a way to throttle overly
aggressive memory consumers. The more we do in the background context
the easier for them it will be to allocate faster. So I am not really
sure that more background threads will solve the underlying problem. It
is just a matter of memory hogs tunning to end in the very same
situtation AFAICS. Moreover the more they are going to allocate the more
less CPU time will _other_ (non-allocating) task get.

> Test Details

I will have to study this more to comment.

[...]
> By increasing the number of kswapd threads, throughput increased by ~50%
> while kernel mode CPU utilization decreased or stayed the same, likely due
> to a decrease in the number of parallel tasks at any given time doing page
> replacement.

Well, isn't that just an effect of more work being done on behalf of
other workload that might run along with your tests (and which doesn't
really need to allocate a lot of memory)? In other words how
does the patch behaves with a non-artificial mixed workloads?

Please note that I am not saying that we absolutely have to stick with the
current single-thread-per-node implementation but I would really like to
see more background on why we should be allowing heavy memory hogs to
allocate faster or how to prevent that. I would be also very interested
to see how to scale the number of threads based on how CPUs are utilized
by other workloads.
-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node

2018-04-03 Thread Michal Hocko
On Mon 02-04-18 09:24:22, Buddy Lumpkin wrote:
> Page replacement is handled in the Linux Kernel in one of two ways:
> 
> 1) Asynchronously via kswapd
> 2) Synchronously, via direct reclaim
> 
> At page allocation time the allocating task is immediately given a page
> from the zone free list allowing it to go right back to work doing
> whatever it was doing; Probably directly or indirectly executing business
> logic.
> 
> Just prior to satisfying the allocation, free pages is checked to see if
> it has reached the zone low watermark and if so, kswapd is awakened.
> Kswapd will start scanning pages looking for inactive pages to evict to
> make room for new page allocations. The work of kswapd allows tasks to
> continue allocating memory from their respective zone free list without
> incurring any delay.
> 
> When the demand for free pages exceeds the rate that kswapd tasks can
> supply them, page allocation works differently. Once the allocating task
> finds that the number of free pages is at or below the zone min watermark,
> the task will no longer pull pages from the free list. Instead, the task
> will run the same CPU-bound routines as kswapd to satisfy its own
> allocation by scanning and evicting pages. This is called a direct reclaim.
> 
> The time spent performing a direct reclaim can be substantial, often
> taking tens to hundreds of milliseconds for small order0 allocations to
> half a second or more for order9 huge-page allocations. In fact, kswapd is
> not actually required on a linux system. It exists for the sole purpose of
> optimizing performance by preventing direct reclaims.
> 
> When memory shortfall is sufficient to trigger direct reclaims, they can
> occur in any task that is running on the system. A single aggressive
> memory allocating task can set the stage for collateral damage to occur in
> small tasks that rarely allocate additional memory. Consider the impact of
> injecting an additional 100ms of latency when nscd allocates memory to
> facilitate caching of a DNS query.
> 
> The presence of direct reclaims 10 years ago was a fairly reliable
> indicator that too much was being asked of a Linux system. Kswapd was
> likely wasting time scanning pages that were ineligible for eviction.
> Adding RAM or reducing the working set size would usually make the problem
> go away. Since then hardware has evolved to bring a new struggle for
> kswapd. Storage speeds have increased by orders of magnitude while CPU
> clock speeds stayed the same or even slowed down in exchange for more
> cores per package. This presents a throughput problem for a single
> threaded kswapd that will get worse with each generation of new hardware.

AFAIR we used to scale the number of kswapd workers many years ago. It
just turned out to be not all that great. We have a kswapd reclaim
window for quite some time and that can allow to tune how much proactive
kswapd should be.

Also please note that the direct reclaim is a way to throttle overly
aggressive memory consumers. The more we do in the background context
the easier for them it will be to allocate faster. So I am not really
sure that more background threads will solve the underlying problem. It
is just a matter of memory hogs tunning to end in the very same
situtation AFAICS. Moreover the more they are going to allocate the more
less CPU time will _other_ (non-allocating) task get.

> Test Details

I will have to study this more to comment.

[...]
> By increasing the number of kswapd threads, throughput increased by ~50%
> while kernel mode CPU utilization decreased or stayed the same, likely due
> to a decrease in the number of parallel tasks at any given time doing page
> replacement.

Well, isn't that just an effect of more work being done on behalf of
other workload that might run along with your tests (and which doesn't
really need to allocate a lot of memory)? In other words how
does the patch behaves with a non-artificial mixed workloads?

Please note that I am not saying that we absolutely have to stick with the
current single-thread-per-node implementation but I would really like to
see more background on why we should be allowing heavy memory hogs to
allocate faster or how to prevent that. I would be also very interested
to see how to scale the number of threads based on how CPUs are utilized
by other workloads.
-- 
Michal Hocko
SUSE Labs