Re: [RFC PATCH 1/1 v2] vmscan: Support multiple kswapd threads per node

2018-04-10 Thread Buddy Lumpkin

> On Apr 6, 2018, at 12:38 AM, Michal Hocko  wrote:
> 
> On Thu 05-04-18 23:25:14, Buddy Lumpkin wrote:
>> 
>>> On Apr 4, 2018, at 11:10 PM, Michal Hocko  wrote:
>>> 
>>> On Wed 04-04-18 21:49:54, Buddy Lumpkin wrote:
 v2:
 - Make update_kswapd_threads_node less racy
 - Handle locking for case where CONFIG_MEMORY_HOTPLUG=n
>>> 
>>> Please do not repost with such small changes. It is much more
>>> important to sort out the big picture first and only then deal with
>>> minor implementation details. The more versions you post the more
>>> fragmented and messy the discussion will become.
>>> 
>>> You will have to be patient because this is a rather big change and it
>>> will take _quite_ some time to get sorted.
>>> 
>>> Thanks!
>>> -- 
>>> Michal Hocko
>>> SUSE Labs
>>> 
>> 
>> 
>> Sorry about that, I actually had three people review my code internally,
>> then I managed to send out an old version. 100% guilty of submitting
>> code when I needed sleep. As for the change, that was in response
>> to a request from Andrew to make the update function less racy.
>> 
>> Should I resend a correct v2 now that the thread exists?
> 
> Let's just discuss open questions for now. Specifics of the code are the
> least interesting at this stage.
> 
> If you want some help with the code review, you can put it somewhere in
> the git tree and send a reference for those who are interested.
> -- 
> Michal Hocko
> SUSE Labs

Ok, I will go back through the thread and make sure all questions and
concerns have been addressed.




Re: [RFC PATCH 1/1 v2] vmscan: Support multiple kswapd threads per node

2018-04-06 Thread Michal Hocko
On Thu 05-04-18 23:25:14, Buddy Lumpkin wrote:
> 
> > On Apr 4, 2018, at 11:10 PM, Michal Hocko  wrote:
> > 
> > On Wed 04-04-18 21:49:54, Buddy Lumpkin wrote:
> >> v2:
> >> - Make update_kswapd_threads_node less racy
> >> - Handle locking for case where CONFIG_MEMORY_HOTPLUG=n
> > 
> > Please do not repost with such small changes. It is much more
> > important to sort out the big picture first and only then deal with
> > minor implementation details. The more versions you post the more
> > fragmented and messy the discussion will become.
> > 
> > You will have to be patient because this is a rather big change and it
> > will take _quite_ some time to get sorted.
> > 
> > Thanks!
> > -- 
> > Michal Hocko
> > SUSE Labs
> > 
> 
> 
> Sorry about that, I actually had three people review my code internally,
> then I managed to send out an old version. 100% guilty of submitting
> code when I needed sleep. As for the change, that was in response
> to a request from Andrew to make the update function less racy.
> 
> Should I resend a correct v2 now that the thread exists?

Let's just discuss open questions for now. Specifics of the code are the
least interesting at this stage.

If you want some help with the code review, you can put it somewhere in
the git tree and send a reference for those who are interested.
-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH 1/1 v2] vmscan: Support multiple kswapd threads per node

2018-04-06 Thread Buddy Lumpkin

> On Apr 4, 2018, at 11:10 PM, Michal Hocko  wrote:
> 
> On Wed 04-04-18 21:49:54, Buddy Lumpkin wrote:
>> v2:
>> - Make update_kswapd_threads_node less racy
>> - Handle locking for case where CONFIG_MEMORY_HOTPLUG=n
> 
> Please do not repost with such small changes. It is much more
> important to sort out the big picture first and only then deal with
> minor implementation details. The more versions you post the more
> fragmented and messy the discussion will become.
> 
> You will have to be patient because this is a rather big change and it
> will take _quite_ some time to get sorted.
> 
> Thanks!
> -- 
> Michal Hocko
> SUSE Labs
> 


Sorry about that, I actually had three people review my code internally,
then I managed to send out an old version. 100% guilty of submitting
code when I needed sleep. As for the change, that was in response
to a request from Andrew to make the update function less racy.

Should I resend a correct v2 now that the thread exists?

—Buddy

Re: [RFC PATCH 1/1 v2] vmscan: Support multiple kswapd threads per node

2018-04-05 Thread Michal Hocko
On Wed 04-04-18 21:49:54, Buddy Lumpkin wrote:
> v2:
> - Make update_kswapd_threads_node less racy
> - Handle locking for case where CONFIG_MEMORY_HOTPLUG=n

Please do not repost with such small changes. It is much more
important to sort out the big picture first and only then deal with
minor implementation details. The more versions you post the more
fragmented and messy the discussion will become.

You will have to be patient because this is a rather big change and it
will take _quite_ some time to get sorted.

Thanks!
-- 
Michal Hocko
SUSE Labs


[RFC PATCH 1/1 v2] vmscan: Support multiple kswapd threads per node

2018-04-04 Thread Buddy Lumpkin
Page replacement is handled in the Linux kernel in one of two ways:

1) Asynchronously, via kswapd
2) Synchronously, via direct reclaim

At page allocation time, the allocating task is immediately given a page
from the zone free list, allowing it to go right back to whatever it was
doing, most likely directly or indirectly executing business logic.

Just prior to satisfying the allocation, the number of free pages is
checked against the zone low watermark and, if it has been reached, kswapd
is awakened. Kswapd starts scanning for inactive pages to evict in order to
make room for new page allocations. This background work allows tasks to
keep allocating memory from their respective zone free lists without
incurring any delay.

When the demand for free pages exceeds the rate at which kswapd can supply
them, page allocation works differently. Once the allocating task finds
that the number of free pages is at or below the zone min watermark, it no
longer pulls pages from the free list. Instead, the task runs the same
CPU-bound routines as kswapd to satisfy its own allocation by scanning and
evicting pages. This is called a direct reclaim.
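
To make the two paths above concrete, here is a minimal userspace model in
C. It is only a sketch of the behavior described above: the zone_model
structure, the watermark values and the helper names are assumptions made
for illustration, not the kernel's actual data structures or interfaces.

/* Compile with: cc -std=c99 -o reclaim_model reclaim_model.c */
#include <stdbool.h>
#include <stdio.h>

struct zone_model {
	long nr_free;    /* pages on the free list */
	long low_wmark;  /* at or below this: wake kswapd */
	long min_wmark;  /* at or below this: direct reclaim */
};

static bool kswapd_awake;

static void wake_kswapd_model(void)
{
	if (!kswapd_awake) {
		kswapd_awake = true;
		printf("kswapd woken: asynchronous reclaim begins\n");
	}
}

/* Stand-in for scanning the inactive list and evicting one page. */
static void evict_one_page(struct zone_model *z)
{
	z->nr_free++;
}

static long alloc_page_model(struct zone_model *z)
{
	/*
	 * Slow path: at or below the min watermark, the allocating task
	 * does the CPU-bound eviction work itself (direct reclaim).
	 */
	while (z->nr_free <= z->min_wmark)
		evict_one_page(z);

	/* Fast path: hand out a page immediately... */
	long page = --z->nr_free;

	/* ...then check the low watermark and wake kswapd if needed. */
	if (z->nr_free <= z->low_wmark)
		wake_kswapd_model();

	return page;
}

int main(void)
{
	struct zone_model z = { .nr_free = 120, .low_wmark = 100, .min_wmark = 50 };

	for (int i = 0; i < 80; i++)
		alloc_page_model(&z);

	printf("%ld pages free after 80 allocations\n", z.nr_free);
	return 0;
}

In this toy model, the direct-reclaim loop runs inline in the allocating
task, which is exactly where the latency discussed below comes from. Kswapd
exists so that, in the common case, the low-watermark wakeup keeps the free
list replenished before the min watermark is ever reached.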

The time spent performing a direct reclaim can be substantial, ranging
from tens to hundreds of milliseconds for small order-0 allocations to half
a second or more for order-9 huge-page allocations. In fact, kswapd is not
actually required on a Linux system; it exists for the sole purpose of
optimizing performance by preventing direct reclaims.

When the memory shortfall is sufficient to trigger direct reclaims, they
can occur in any task that is running on the system. A single aggressive
memory-allocating task can set the stage for collateral damage in small
tasks that rarely allocate additional memory. Consider the impact of
injecting an additional 100ms of latency when nscd allocates memory to
facilitate caching of a DNS query.

The presence of direct reclaims 10 years ago was a fairly reliable
indicator that too much was being asked of a Linux system. Kswapd was
likely wasting time scanning pages that were ineligible for eviction.
Adding RAM or reducing the working set size would usually make the problem
go away. Since then, hardware has evolved to bring a new struggle for
kswapd. Storage speeds have increased by orders of magnitude while CPU
clock speeds have stayed the same or even slowed down in exchange for more
cores per package. This presents a throughput problem for a single-threaded
kswapd that will only get worse with each new generation of hardware.
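
This patch addresses that by supporting more than one kswapd thread per
node. The sketch below is not the patch itself; it is a hypothetical
userspace pthread model of the basic idea, with the node count, per-node
thread count and worker body chosen purely for illustration.

/* Compile with: cc -std=c99 -pthread -o kswapd_model kswapd_model.c */
#include <pthread.h>
#include <stdio.h>

#define NR_NODES 2                 /* illustrative two-node system */
#define KSWAPD_THREADS_PER_NODE 4  /* hypothetical per-node thread count */

struct node_model {
	int node_id;
};

/* Each worker would run the node's reclaim loop; here it just reports in. */
static void *kswapd_worker(void *arg)
{
	struct node_model *node = arg;

	printf("reclaim worker started for node %d\n", node->node_id);
	return NULL;
}

int main(void)
{
	pthread_t tids[NR_NODES][KSWAPD_THREADS_PER_NODE];
	struct node_model nodes[NR_NODES];

	/* One group of reclaim threads per node instead of a single thread. */
	for (int n = 0; n < NR_NODES; n++) {
		nodes[n].node_id = n;
		for (int t = 0; t < KSWAPD_THREADS_PER_NODE; t++)
			pthread_create(&tids[n][t], NULL, kswapd_worker, &nodes[n]);
	}

	for (int n = 0; n < NR_NODES; n++)
		for (int t = 0; t < KSWAPD_THREADS_PER_NODE; t++)
			pthread_join(tids[n][t], NULL);

	return 0;
}

With several reclaim workers per node, background reclaim throughput can
scale with the demand generated by fast storage instead of being capped by
a single CPU.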

Test Details

NOTE: The tests below were run with shadow entries disabled. See the
associated patch and cover letter for details.

The tests below were designed with the assumption that a kswapd bottleneck
is best demonstrated using filesystem reads. This way, the inactive list
will be full of clean pages, simplifying the analysis and allowing kswapd
to achieve the highest possible steal rate. Maximum steal rates for kswapd
are likely to be the same or lower for any other mix of page types on the
system.

Tests were run on a 2U Oracle X7-2L with 52 Intel Xeon Skylake 2GHz cores,
756GB of RAM and 8 x 3.6 TB NVMe Solid State Disk drives. Each drive has
an XFS file system mounted separately as /d0 through /d7. SSD drives
require multiple concurrent streams to show their potential, so I created
11 250GB zero-filled files on each drive so that I could test with
parallel reads.

The test script runs in multiple stages. At each stage, the number of dd
tasks run concurrently is increased by 2. I did not include all of the
test output for brevity.

During each stage, dd tasks are launched to read from each drive in a
round-robin fashion until the specified number of tasks for the stage has
been reached. Then iostat, vmstat and top are started in the background
with 10-second intervals. After five minutes, all of the dd tasks are
killed, and the iostat, vmstat and top output is parsed in order to report
the following (a rough sketch of the per-stage driver appears after the
metric list below):

CPU consumption
- sy - aggregate kernel-mode CPU consumption from the vmstat output. The
  value doesn't tend to fluctuate much, so I just grab the highest value.
  Each sample is averaged over 10 seconds.
- dd_cpu - CPU consumption of all the dd tasks, averaged across the top
  samples since there is a lot of variation.

Throughput
- in Kbytes per second
- Command is iostat -x -d 10 -g total
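
As a rough illustration of a single stage of the procedure described above,
the sketch below launches a fixed number of dd readers round-robin across
the eight mount points and tears them down after five minutes. The original
driver is a shell script; this C version is a hypothetical stand-in written
for this note, and the file naming under /d0 through /d7, the stage size
and the omission of the iostat/vmstat/top collection (and of error
handling) are all assumptions.

/* Compile with: cc -std=c99 -o stage_driver stage_driver.c */
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define NR_DRIVES 8
#define MAX_TASKS 64

/* Launch one buffered dd reader against a file on /d<drive>. */
static pid_t launch_dd(int drive, int file)
{
	pid_t pid = fork();

	if (pid == 0) {
		char path[64];

		snprintf(path, sizeof(path), "if=/d%d/file%d", drive, file);
		execlp("dd", "dd", path, "of=/dev/null", "bs=4M", (char *)NULL);
		_exit(127);	/* exec failed */
	}
	return pid;
}

int main(void)
{
	int ntasks = 6;		/* first stage; the script grows this by 2 */
	pid_t pids[MAX_TASKS];

	/* Round-robin the readers across the eight drives. */
	for (int i = 0; i < ntasks; i++)
		pids[i] = launch_dd(i % NR_DRIVES, i / NR_DRIVES);

	/* The real script starts iostat, vmstat and top here. */
	sleep(300);

	for (int i = 0; i < ntasks; i++) {
		kill(pids[i], SIGTERM);
		waitpid(pids[i], NULL, 0);
	}
	return 0;
}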

This first test performs reads using O_DIRECT in order to show the maximum
throughput that can be obtained from these drives. It also demonstrates how
rapidly throughput scales as the number of dd tasks is increased.

The dd command for this test looks like this:

Command Used: dd iflag=direct if=/d${i}/$n of=/dev/null bs=4M

Test #1: Direct IO
dd sy dd_cpu throughput
6  0  2.33   14726026.40
10 1  2.95   19954974.80
16 1  2.63   24419689.30
22 1  2.63   25430303.20
28 1  2.91   26026513.20
34 1  2.53   26178618.00
40 1  2.18   26239229.20
46 1  1.91   26250550.40
52 1  1.69   26251845.60
58 1  1.54   26253205.60
64 1  1.43 
