Re: [PATCH] swap: choose swap device according to numa node

2017-08-15 Thread Andrew Morton
On Tue, 15 Aug 2017 13:49:45 +0800 Aaron Lu  wrote:

> On Mon, Aug 14, 2017 at 04:33:37PM -0700, Andrew Morton wrote:
> > On Mon, 14 Aug 2017 13:31:30 +0800 Aaron Lu  wrote:
> > 
> > > --- /dev/null
> > > +++ b/Documentation/vm/swap_numa.txt
> > > @@ -0,0 +1,18 @@
> > > +If the system has more than one swap device and swap device has the node
> > > +information, we can make use of this information to decide which swap
> > > +device to use in get_swap_pages() to get better performance.
> > > +
> > > +The current code uses a priority based list, swap_avail_list, to decide
> > > +which swap device to use and if multiple swap devices share the same
> > > +priority, they are used round robin. This change here replaces the single
> > > +global swap_avail_list with a per-numa-node list, i.e. for each numa node,
> > > +it sees its own priority based list of available swap devices. Swap
> > > +device's priority can be promoted on its matching node's swap_avail_list.
> > > +
> > > +The current swap device's priority is set as: user can set a >=0 value,
> > > +or the system will pick one starting from -1 then downwards. The priority
> > > +value in the swap_avail_list is the negated value of the swap device's priority,
> > > +due to plist being sorted from low to high. The new policy doesn't change
> > > +the semantics for priority >=0 cases, the previous starting from -1 then
> > > +downwards now becomes starting from -2 then downwards and -1 is reserved
> > > +as the promoted value.
> > 
> > Could we please add a little "user guide" here?  Tell people how to set
> > up their system to exploit this?  Sample /etc/fstab entries, perhaps?
> 
> That's a good idea.
> 
> How about this:
> 
> ...
>

Looks good.  Please send it along as a patch some time?

> 
> I'm not sure what to do...any hint?
> Adding a pr_err() perhaps?

pr_emerg(), probably.  Would it make sense to disable all swapon()s
after this?
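
Something like the below, perhaps (untested sketch; sys_swapon() would
additionally need to bail out when swap_avail_heads is NULL):

static int __init swapfile_init(void)
{
	int nid;

	swap_avail_heads = kmalloc_array(nr_node_ids,
					 sizeof(struct plist_head), GFP_KERNEL);
	if (!swap_avail_heads) {
		/* "can't happen" at __init time, but don't oops later if it does */
		pr_emerg("swapfile: not enough memory for swap heads, swap is disabled\n");
		return -ENOMEM;
	}

	for_each_node(nid)
		plist_head_init(&swap_avail_heads[nid]);

	return 0;
}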


Re: [PATCH] swap: choose swap device according to numa node

2017-08-14 Thread Aaron Lu
On Mon, Aug 14, 2017 at 04:33:37PM -0700, Andrew Morton wrote:
> On Mon, 14 Aug 2017 13:31:30 +0800 Aaron Lu  wrote:
> 
> > --- /dev/null
> > +++ b/Documentation/vm/swap_numa.txt
> > @@ -0,0 +1,18 @@
> > +If the system has more than one swap device and swap device has the node
> > +information, we can make use of this information to decide which swap
> > +device to use in get_swap_pages() to get better performance.
> > +
> > +The current code uses a priority based list, swap_avail_list, to decide
> > +which swap device to use and if multiple swap devices share the same
> > +priority, they are used round robin. This change here replaces the single
> > +global swap_avail_list with a per-numa-node list, i.e. for each numa node,
> > +it sees its own priority based list of available swap devices. Swap
> > +device's priority can be promoted on its matching node's swap_avail_list.
> > +
> > +The current swap device's priority is set as: user can set a >=0 value,
> > +or the system will pick one starting from -1 then downwards. The priority
> > +value in the swap_avail_list is the negated value of the swap device's priority,
> > +due to plist being sorted from low to high. The new policy doesn't change
> > +the semantics for priority >=0 cases, the previous starting from -1 then
> > +downwards now becomes starting from -2 then downwards and -1 is reserved
> > +as the promoted value.
> 
> Could we please add a little "user guide" here?  Tell people how to set
> up their system to exploit this?  Sample /etc/fstab entries, perhaps?

That's a good idea.

How about this:

Automatically bind swap device to numa node
-------------------------------------------

If the system has more than one swap device and swap device has the node
information, we can make use of this information to decide which swap
device to use in get_swap_pages() to get better performance.


How to use this feature
-----------------------

A swap device has a priority, which decides the order in which it is used. To
make use of the automatic binding, there is no need to manipulate priority
settings for swap devices. E.g. on a 2-node machine, assume 2 swap devices,
swapA and swapB, with swapA attached to node 0 and swapB attached to node 1,
are going to be swapped on. Simply swap them on by doing:
# swapon /dev/swapA
# swapon /dev/swapB

Then node 0 will use the two swap devices in the order swapA then swapB, and
node 1 will use them in the order swapB then swapA. Note that the order in
which they are swapped on doesn't matter.
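
For setups that swap on at boot, the equivalent /etc/fstab entries are simply
(sample entries; swapA/swapB are the hypothetical device names from above):

/dev/swapA	none	swap	defaults	0	0
/dev/swapB	none	swap	defaults	0	0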

Here is a more complex example on a 4-node machine. Assume 6 swap devices are
going to be swapped on: swapA and swapB are attached to node 0, swapC is
attached to node 1, swapD and swapE are attached to node 2 and swapF is
attached to node 3. The way to swap them on is the same as above:
# swapon /dev/swapA
# swapon /dev/swapB
# swapon /dev/swapC
# swapon /dev/swapD
# swapon /dev/swapE
# swapon /dev/swapF

Then node 0 will use them in the order of:
swapA/swapB -> swapC -> swapD -> swapE -> swapF
swapA and swapB will be used in a round robin mode before any other swap device.

node 1 will use them in the order of:
swapC -> swapA -> swapB -> swapD -> swapE -> swapF

node 2 will use them in the order of:
swapD/swapE -> swapA -> swapB -> swapC -> swapF
Similarly, swapD and swapE will be used in a round robin mode before any
other swap device.

node 3 will use them in the order of:
swapF -> swapA -> swapB -> swapC -> swapD -> swapE
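
The resulting setup can be verified in /proc/swaps: with this scheme,
auto-assigned device priorities start from -2. Sample output (sizes omitted):

# cat /proc/swaps
Filename	Type		Size	Used	Priority
/dev/swapA	partition	...	0	-2
/dev/swapB	partition	...	0	-3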


Implementation details
----------------------

The current code uses a priority based list, swap_avail_list, to decide
which swap device to use and if multiple swap devices share the same
priority, they are used round robin. This change here replaces the single
global swap_avail_list with a per-numa-node list, i.e. for each numa node,
it sees its own priority based list of available swap devices. Swap
device's priority can be promoted on its matching node's swap_avail_list.

A swap device's priority is set as follows: the user can set a value >= 0, or
the system will pick one starting from -1 then downwards. The priority value
in the swap_avail_list is the negated value of the swap device's priority, due
to plist being sorted from low to high. The new policy doesn't change the
semantics for the priority >= 0 cases; the previous starting from -1 then
downwards now becomes starting from -2 then downwards, and -1 is reserved as
the promoted value. So if multiple swap devices are attached to the same node,
they will all be promoted to priority -1 on that node's plist and will be used
round robin before any other swap devices.
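
A minimal sketch of that promotion logic, assuming a per-device avail_lists[]
array of plist nodes and a swap_node() helper returning the device's node
(illustrative, not the exact patch code):

static void add_to_avail_lists(struct swap_info_struct *p)
{
	int nid;

	for_each_node(nid) {
		/* plist is sorted low to high, so store the negated priority */
		int prio = -p->prio;

		/*
		 * An auto-assigned (negative) priority is promoted to -1,
		 * i.e. plist value 1, on the device's own node.
		 */
		if (p->prio < 0 && swap_node(p) == nid)
			prio = 1;

		plist_node_init(&p->avail_lists[nid], prio);
		plist_add(&p->avail_lists[nid], &swap_avail_heads[nid]);
	}
}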

> 
> >
> > ...
> >
> > +static int __init swapfile_init(void)
> > +{
> > +   int nid;
> > +
> > > +   swap_avail_heads = kmalloc(nr_node_ids * sizeof(struct plist_head), GFP_KERNEL);
> > +   if (!swap_avail_heads)
> > +   return -ENOMEM;
> 
> Well, a kmalloc failure at __init time is generally considered "can't
> happen", but if it _does_ happen, the system will later oops, I think. 

Agree.

>

Re: [PATCH] swap: choose swap device according to numa node

2017-08-14 Thread Andrew Morton
On Mon, 14 Aug 2017 13:31:30 +0800 Aaron Lu  wrote:

> If the system has more than one swap device and swap device has the node
> information, we can make use of this information to decide which swap
> device to use in get_swap_pages() to get better performance.
> 
> The current code uses a priority based list, swap_avail_list, to decide
> which swap device to use and if multiple swap devices share the same
> priority, they are used round robin. This patch changes the previous
> single global swap_avail_list into a per-numa-node list, i.e. for each
> numa node, it sees its own priority based list of available swap devices.
> Swap device's priority can be promoted on its matching node's swap_avail_list.
> 
> The current swap device's priority is set as: user can set a >=0 value,
> or the system will pick one starting from -1 then downwards. The priority
> value in the swap_avail_list is the negated value of the swap device's priority,
> due to plist being sorted from low to high. The new policy doesn't change
> the semantics for priority >=0 cases, the previous starting from -1 then
> downwards now becomes starting from -2 then downwards and -1 is reserved
> as the promoted value.
> 
> ...
>
> On a 2-node Skylake EP machine with 64GiB memory, two 170GB SSD drives
> are used as swap devices, each attached to a different node. The
> result is:
> 
> runtime=30m/processes=32/total test size=128G/each process mmap region=4G
> kernel         throughput
> vanilla        13306
> auto-binding   15169 (+14%)
> 
> runtime=30m/processes=64/total test size=128G/each process mmap region=2G
> kernel         throughput
> vanilla        11885
> auto-binding   14879 (+25%)
> 

Sounds nice.

> ...
>
> --- /dev/null
> +++ b/Documentation/vm/swap_numa.txt
> @@ -0,0 +1,18 @@
> +If the system has more than one swap device and swap device has the node
> +information, we can make use of this information to decide which swap
> +device to use in get_swap_pages() to get better performance.
> +
> +The current code uses a priority based list, swap_avail_list, to decide
> +which swap device to use and if multiple swap devices share the same
> +priority, they are used round robin. This change here replaces the single
> +global swap_avail_list with a per-numa-node list, i.e. for each numa node,
> +it sees its own priority based list of available swap devices. Swap
> +device's priority can be promoted on its matching node's swap_avail_list.
> +
> +The current swap device's priority is set as: user can set a >=0 value,
> +or the system will pick one starting from -1 then downwards. The priority
> +value in the swap_avail_list is the negated value of the swap device's priority,
> +due to plist being sorted from low to high. The new policy doesn't change
> +the semantics for priority >=0 cases, the previous starting from -1 then
> +downwards now becomes starting from -2 then downwards and -1 is reserved
> +as the promoted value.

Could we please add a little "user guide" here?  Tell people how to set
up their system to exploit this?  Sample /etc/fstab entries, perhaps?

>
> ...
>
> +static int __init swapfile_init(void)
> +{
> + int nid;
> +
> +	swap_avail_heads = kmalloc(nr_node_ids * sizeof(struct plist_head), GFP_KERNEL);
> + if (!swap_avail_heads)
> + return -ENOMEM;

Well, a kmalloc failure at __init time is generally considered "can't
happen", but if it _does_ happen, the system will later oops, I think. 
Can we do something nicer here?


> + for_each_node(nid)
> + plist_head_init(&swap_avail_heads[nid]);
> +
> + return 0;
> +}
> +subsys_initcall(swapfile_init);



[PATCH] swap: choose swap device according to numa node

2017-08-13 Thread Aaron Lu
If the system has more than one swap device and swap device has the node
information, we can make use of this information to decide which swap
device to use in get_swap_pages() to get better performance.

The current code uses a priority based list, swap_avail_list, to decide
which swap device to use and if multiple swap devices share the same
priority, they are used round robin. This patch changes the previous
single global swap_avail_list into a per-numa-node list, i.e. for each
numa node, it sees its own priority based list of available swap devices.
Swap device's priority can be promoted on its matching node's swap_avail_list.

The current swap device's priority is set as: user can set a >=0 value,
or the system will pick one starting from -1 then downwards. The priority
value in the swap_avail_list is the negated value of the swap device's priority,
due to plist being sorted from low to high. The new policy doesn't change
the semantics for priority >=0 cases, the previous starting from -1 then
downwards now becomes starting from -2 then downwards and -1 is reserved
as the promoted value.

Take a 4-node EX machine as an example: suppose 4 swap devices are
available, each sitting on a different node:
swapA on node 0
swapB on node 1
swapC on node 2
swapD on node 3

Assume they are all swapped on in the sequence ABCD.

Current behaviour:
their priorities will be:
swapA: -1
swapB: -2
swapC: -3
swapD: -4
And their position in the global swap_avail_list will be:
swapA  -> swapB  -> swapC  -> swapD
prio:1    prio:2    prio:3    prio:4

New behaviour:
their priorities will be (note that -1 is skipped):
swapA: -2
swapB: -3
swapC: -4
swapD: -5
And their positions in the 4 swap_avail_lists[nid] will be:
swap_avail_lists[0]: /* node 0's available swap device list */
swapA  -> swapB  -> swapC  -> swapD
prio:1    prio:3    prio:4    prio:5
swap_avail_lists[1]: /* node 1's available swap device list */
swapB  -> swapA  -> swapC  -> swapD
prio:1    prio:2    prio:4    prio:5
swap_avail_lists[2]: /* node 2's available swap device list */
swapC  -> swapA  -> swapB  -> swapD
prio:1    prio:2    prio:3    prio:5
swap_avail_lists[3]: /* node 3's available swap device list */
swapD  -> swapA  -> swapB  -> swapC
prio:1    prio:2    prio:3    prio:4
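
In terms of code, the allocation side then just walks the calling node's own
list (illustrative sketch; avail_lists[] is the per-device array of plist
nodes, the real change is in the diff below):

	int node = numa_node_id();
	struct swap_info_struct *si, *next;

	spin_lock(&swap_avail_lock);
	plist_for_each_entry_safe(si, next, &swap_avail_heads[node],
				  avail_lists[node]) {
		/*
		 * Try allocating from si; on failure, fall through to the
		 * next device on this node's list.
		 */
	}
	spin_unlock(&swap_avail_lock);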

To see the effect of the patch, a test is used that starts N processes, each
of which mmaps a region of anonymous memory and then continually writes to it
at random positions to trigger both swap-ins and swap-outs.

On a 2-node Skylake EP machine with 64GiB memory, two 170GB SSD drives
are used as swap devices, each attached to a different node. The
result is:

runtime=30m/processes=32/total test size=128G/each process mmap region=4G
kernel         throughput
vanilla        13306
auto-binding   15169 (+14%)

runtime=30m/processes=64/total test size=128G/each process mmap region=2G
kernel         throughput
vanilla        11885
auto-binding   14879 (+25%)

Signed-off-by: Aaron Lu 
---
A previous version was sent without catching much attention:
https://lkml.org/lkml/2016/7/4/633
I'm sending it again with some minor modifications, after tests showed
performance gains for SSD.

 Documentation/vm/swap_numa.txt |  18 +++
 include/linux/swap.h           |   2 +-
 mm/swapfile.c                  | 113 +++--
 3 files changed, 106 insertions(+), 27 deletions(-)
 create mode 100644 Documentation/vm/swap_numa.txt

diff --git a/Documentation/vm/swap_numa.txt b/Documentation/vm/swap_numa.txt
new file mode 100644
index 000000000000..e63fe485567c
--- /dev/null
+++ b/Documentation/vm/swap_numa.txt
@@ -0,0 +1,18 @@
+If the system has more than one swap device and swap device has the node
+information, we can make use of this information to decide which swap
+device to use in get_swap_pages() to get better performance.
+
+The current code uses a priority based list, swap_avail_list, to decide
+which swap device to use and if multiple swap devices share the same
+priority, they are used round robin. This change here replaces the single
+global swap_avail_list with a per-numa-node list, i.e. for each numa node,
+it sees its own priority based list of available swap devices. Swap
+device's priority can be promoted on its matching node's swap_avail_list.
+
+The current swap device's priority is set as: user can set a >=0 value,
+or the system will pick one starting from -1 then downwards. The priority
+value in the swap_avail_list is the negated value of the swap device's priority,
+due to plist being sorted from low to high. The new policy doesn't change
+the semantics for priority >=0 cases, the previous starting from -1 then
+downwards now becomes starting from -2 then downwards and -1 is reserved
+as the promoted value.
diff --git a/include/linux/swap.h b/include/linux/swap.h
index d83d28e53e62..28262fe683ad 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -211,7 +211,7 @@ struct swap_info_struct {
	unsigned long	flags;		/* SWP_USED etc: see above */

Re: [RFC RESEND PATCH] swap: choose swap device according to numa node

2016-07-04 Thread Aaron Lu
On Tue, Jul 05, 2016 at 01:57:35PM +0800, Yu Chen wrote:
> On Tue, Jul 5, 2016 at 11:19 AM, Aaron Lu  wrote:
> > Resend:
> > This is a resend; the original patch didn't catch much attention.
> > It may not be a big deal for swap devices that used to be hosted on
> > HDD, but with devices like 3D XPoint being used as swap devices, it
> > could make a real difference if we consider NUMA information when doing IO.
> > Comments are appreciated, thanks for your time.
> >
> -%<-
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 71b1c29948db..dd7e44a315b0 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -3659,9 +3659,11 @@ void kswapd_stop(int nid)
> >
> >  static int __init kswapd_init(void)
> >  {
> > -   int nid;
> > +   int nid, err;
> >
> > -   swap_setup();
> > +   err = swap_setup();
> > +   if (err)
> > +   return err;
> > for_each_node_state(nid, N_MEMORY)
> > kswapd_run(nid);
> > hotcpu_notifier(cpu_callback, 0);
> In the original implementation, even if swap_setup failed, kswapd would

In the current implementation swap_setup never fails :-)

> still be created. Since kswapd is not only used for swap out but also
> for other page reclaim, might this change modify its semantics? Sorry
> if I understand incorrectly.

Indeed it's a behaviour change. The only reason swap_setup can return an
error code now is when it fails to allocate nr_node_ids * sizeof(struct
plist_head) memory, and if that happens I don't think it makes much
sense to continue booting the system.
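
i.e. swap_setup() would become something like (sketch):

int __init swap_setup(void)
{
	swap_avail_heads = kmalloc(nr_node_ids * sizeof(struct plist_head),
				   GFP_KERNEL);
	if (!swap_avail_heads)
		return -ENOMEM;	/* abort boot-time init */

	/* ... the rest of the existing swap_setup() body ... */
	return 0;
}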

Thanks,
Aaron


Re: [RFC RESEND PATCH] swap: choose swap device according to numa node

2016-07-04 Thread Yu Chen
On Tue, Jul 5, 2016 at 11:19 AM, Aaron Lu  wrote:
> Resend:
> This is a resend; the original patch didn't catch much attention.
> It may not be a big deal for swap devices that used to be hosted on
> HDD, but with devices like 3D XPoint being used as swap devices, it
> could make a real difference if we consider NUMA information when doing IO.
> Comments are appreciated, thanks for your time.
>
-%<-
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 71b1c29948db..dd7e44a315b0 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -3659,9 +3659,11 @@ void kswapd_stop(int nid)
>
>  static int __init kswapd_init(void)
>  {
> -   int nid;
> +   int nid, err;
>
> -   swap_setup();
> +   err = swap_setup();
> +   if (err)
> +   return err;
> for_each_node_state(nid, N_MEMORY)
> kswapd_run(nid);
> hotcpu_notifier(cpu_callback, 0);
In the original implementation, even if swap_setup failed, kswapd would
still be created. Since kswapd is not only used for swap out but also
for other page reclaim, might this change modify its semantics? Sorry
if I understand incorrectly.


[RFC RESEND PATCH] swap: choose swap device according to numa node

2016-07-04 Thread Aaron Lu
Resend:
This is a resend; the original patch didn't catch much attention.
It may not be a big deal for swap devices that used to be hosted on
HDD, but with devices like 3D XPoint being used as swap devices, it
could make a real difference if we consider NUMA information when doing IO.
Comments are appreciated, thanks for your time.

Original changelog:
If the system has more than one swap device and swap device has the node
information, we can make use of this information to decide which swap
device to use in get_swap_page.

The current code uses a priority based list, swap_avail_list, to decide
which swap device to use each time and if multiple swap devices share
the same priority, they are used round robin. This patch changes the
previous single global swap_avail_list into a per-numa-node list, i.e.
for each numa node, it sees its own priority based list of available
swap devices. This requires checking a swap device's node value at
swapon time and then promoting its priority (more on this below) in the
swap_avail_list according to which node's list it is being added to.
Once this is done, there should be little, if any, cost at
get_swap_page() time.
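
The node check itself can be as simple as looking at the backing device's
gendisk (sketch, assuming a swap_node() helper):

static int swap_node(struct swap_info_struct *p)
{
	struct block_device *bdev;

	if (p->bdev)
		bdev = p->bdev;
	else
		bdev = p->swap_file->f_inode->i_sb->s_bdev;

	return bdev ? bdev->bd_disk->node_id : NUMA_NO_NODE;
}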

The current swap device's priority is set as: user can set a >=0 value,
or the system will pick one by starting from -1 then downwards.
And the priority value in the swap_avail_list is the negated value of
the swap device's priority, due to plist being sorted from low to high. The
new policy doesn't change the semantics for priority >=0 cases, the
previous starting from -1 then downwards now becomes starting from -2
then downwards. -1 is reserved as the promoted value.

Take a 4-node EX machine as an example: suppose 4 swap devices are
available, each sitting on a different node:
swapA on node 0
swapB on node 1
swapC on node 2
swapD on node 3

Assume they are all swapped on in the sequence ABCD.

Current behaviour:
their priorities will be:
swapA: -1
swapB: -2
swapC: -3
swapD: -4
And their position in the global swap_avail_list will be:
swapA  -> swapB  -> swapC  -> swapD
prio:1    prio:2    prio:3    prio:4

New behaviour:
their priorities will be (note that -1 is skipped):
swapA: -2
swapB: -3
swapC: -4
swapD: -5
And their positions in the 4 swap_avail_lists[node] will be:
swap_avail_lists[0]: /* node 0's available swap device list */
swapA  -> swapB  -> swapC  -> swapD
prio:1    prio:3    prio:4    prio:5
swap_avail_lists[1]: /* node 1's available swap device list */
swapB  -> swapA  -> swapC  -> swapD
prio:1    prio:2    prio:4    prio:5
swap_avail_lists[2]: /* node 2's available swap device list */
swapC  -> swapA  -> swapB  -> swapD
prio:1    prio:2    prio:3    prio:5
swap_avail_lists[3]: /* node 3's available swap device list */
swapD  -> swapA  -> swapB  -> swapC
prio:1    prio:2    prio:3    prio:4

The test case used is:
https://git.kernel.org/cgit/linux/kernel/git/wfg/vm-scalability.git/tree/case-swap-w-seq
https://git.kernel.org/cgit/linux/kernel/git/wfg/vm-scalability.git/tree/usemem.c
What the test does is: start N processes, each of which maps a region of
anonymous memory and then writes to it sequentially to trigger swap outs.
On a Haswell EP 2-node machine with 128GiB memory, two persistent memory
devices, each 48GiB in size and sitting on a different node, are used as
swap devices. They are swapped on without a priority value specified, and
the test result is:
1 task, write size around 95GiB:
  throughput of v4.5:      1475358.0
  throughput of the patch: 1751160.0
  (18% increase in throughput)

16 tasks, write size of each around 6.6GiB:
  throughput of v4.5:      2148972.4
  throughput of the patch: 5713310.0
  (165% increase in throughput)

The huge increase is partly due to lock contention on the single
swapper_space radix tree lock, since v4.5 will always use the higher
priority swap device till it's full before using another one. Setting
them with the same priority could avoid this, so here are the results
considering this case:
1 task, write size around 95GiB:
  throughput of v4.5:                               1475358.0
  throughput of v4.5 (swap devices equal priority): 1707893.4
  throughput of the patch:                          1751160.0
  (almost the same for the latter two)

16 tasks, write size of each around 6.6GiB:
  throughput of v4.5:                               2148972.4
  throughput of v4.5 (swap devices equal priority): 3804688.25
  throughput of the patch:                          5713310.0
  (increase reduced to 50%)
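
For reference, the equal-priority baseline can be set up with swapon's -p
option (device names illustrative):

# swapon -p 10 /dev/pmem0
# swapon -p 10 /dev/pmem1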

Comments are appreciated.

Signed-off-by: Aaron Lu 
---
 include/linux/swap.h |  4 +--
 mm/swap.c            | 12 +++-
 mm/swapfile.c        | 81 ++--
 mm/vmscan.c          |  6 ++--
 4 files changed, 71 insertions(+), 32 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index d18b65c53dbb..eafda3ac42eb 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -207,7 +207,7 @@ struct swap_info_struct {
unsigned long   flags;  /* SWP_USED etc: see above */
	signed short	prio;		/* swap priority of this swap map */