Re: [RFC PATCH 01/31] mm: migrate: Add exchange_pages to exchange two lists of pages.

2019-03-13 Thread Zi Yan

On 19 Feb 2019, at 20:38, Anshuman Khandual wrote:

> On 02/19/2019 06:26 PM, Matthew Wilcox wrote:
>> On Tue, Feb 19, 2019 at 01:12:07PM +0530, Anshuman Khandual wrote:
>>> But the location of this temp page matters as well, because you would like
>>> to saturate the inter-node interface. It needs to be on one of the nodes
>>> where the source or destination page belongs. Any other node would generate
>>> two inter-node copy processes, which is not what you intend here, I guess.
>>
>> That makes no sense.  It should be allocated on the local node of the CPU
>> performing the copy.  If the CPU is in node A, the destination is in node B
>> and the source is in node C, then you're doing 4k worth of reads from node C,
>> 4k worth of reads from node B, 4k worth of writes to node C followed by
>> 4k worth of writes to node B.  Eventually the 4k of dirty cachelines on
>> node A will be written back from cache to the local memory (... or not,
>> if that page gets reused for some other purpose first).
>>
>> If you allocate the page on node B or node C, that's an extra 4k of writes
>> to be sent across the inter-node link.
>
> That's right, there will be an extra remote write. My assumption was that the
> CPU performing the copy belongs to either node B or node C.



I have some interesting throughput results for exchanging per u64 and
exchanging per 4KB page. What I discovered is that using a 4KB page as the
temporary storage for exchanging 2MB THPs does not improve the throughput.
On the contrary, when we are exchanging more than 2^4 = 16 THPs, exchanging
per 4KB page has lower throughput than exchanging per u64. Please see the
results below.


The experiments are done on a two-socket machine with two Intel Xeon
E5-2640 v3 CPUs.

All exchanges are done via the QPI link between the two sockets.


Results
===

Throughput (GB/s) of exchanging 2^N 2MB pages between two NUMA nodes
(N = 2mb_page_order below):

2mb_page_order |  0   |  1   |  2   |  3   |  4   |  5   |  6   |  7   |  8   |  9
u64            | 5.31 | 5.58 | 5.89 | 5.69 | 8.97 | 9.51 | 9.21 | 9.50 | 9.57 | 9.62
per_page       | 5.85 | 6.48 | 6.20 | 5.26 | 7.22 | 7.25 | 7.28 | 7.30 | 7.32 | 7.31


Normalized throughput (u64 relative to per_page):

2mb_page_order |  0   |  1   |  2   |  3   |  4   |  5   |  6   |  7   |  8   |  9
u64            | 0.90 | 0.86 | 0.94 | 1.08 | 1.24 | 1.31 | 1.26 | 1.30 | 1.30 | 1.31




Exchange page code
===

For exchanging per u64, I use the following function:

static void exchange_page(char *to, char *from)
{
        u64 tmp;
        int i;

        /* Swap the contents of the two pages 8 bytes at a time. */
        for (i = 0; i < PAGE_SIZE; i += sizeof(tmp)) {
                tmp = *((u64 *)(from + i));
                *((u64 *)(from + i)) = *((u64 *)(to + i));
                *((u64 *)(to + i)) = tmp;
        }
}


For exchanging per 4KB page, I use the following function:

static void exchange_page2(char *to, char *from)
{
        int cpu = smp_processor_id();

        VM_BUG_ON(!in_atomic());

        /* Fallback for CPUs without a pre-allocated temporary page. */
        if (!page_tmp[cpu]) {
                int nid = cpu_to_node(cpu);
                struct page *page_tmp_page = alloc_pages_node(nid, GFP_KERNEL, 0);

                if (!page_tmp_page) {
                        /* No temporary page; fall back to the per-u64 exchange. */
                        exchange_page(to, from);
                        return;
                }
                page_tmp[cpu] = kmap(page_tmp_page);
        }

        copy_page(page_tmp[cpu], to);
        copy_page(to, from);
        copy_page(from, page_tmp[cpu]);
}

where page_tmp is pre-allocated and local to each CPU; the alloc_pages_node()
call above is only a fallback for hot-added CPUs and is not exercised in these
tests.
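
For reference, the pre-allocation could look like the following minimal
sketch (my illustration only: the array name page_tmp matches the code above,
but the initcall is hypothetical and not part of the posted kernel):

static char *page_tmp[NR_CPUS];

static int __init exchange_page_tmp_init(void)
{
        int cpu;

        /* One order-0 temporary page per online CPU, on its local node. */
        for_each_online_cpu(cpu) {
                struct page *page = alloc_pages_node(cpu_to_node(cpu),
                                                     GFP_KERNEL, 0);

                if (page)
                        page_tmp[cpu] = kmap(page);
        }
        return 0;
}
late_initcall(exchange_page_tmp_init);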


The kernel is available at: https://gitlab.com/ziy/linux-contig-mem-rfc
To do a comparison, you can clone this repo:
https://gitlab.com/ziy/thp-migration-bench, then make, ./run_test.sh, and
./get_results.sh using the kernel from above.


Let me know if I missed anything or did something wrong. Thanks.


--
Best Regards,
Yan Zi


Re: [RFC PATCH 01/31] mm: migrate: Add exchange_pages to exchange two lists of pages.

2019-02-21 Thread Zi Yan
On 21 Feb 2019, at 13:10, Jerome Glisse wrote:

> On Fri, Feb 15, 2019 at 02:08:26PM -0800, Zi Yan wrote:
>> From: Zi Yan 
>>
>> Instead of using two migrate_pages(), a single exchange_pages() would
>> be sufficient and without allocating new pages.
>
> So I believe it would be better to arrange the code differently: instead
> of having one function that special-cases each combination, define a
> function for each one, i.e.:
> exchange_anon_to_share()
> exchange_anon_to_anon()
> exchange_share_to_share()
>
> Then you could define functions to test if a page is in the correct state:
> can_exchange_anon_page() // return true if page can be exchanged
> can_exchange_share_page()
>
> In fact both of these functions can be factored out as common helpers
> shared with the existing migration code in migrate.c. This way we would
> have only one place where we need to handle all the special casing, tests
> and exceptions.
>
> Other than that I could not spot anything obviously wrong, but I did not
> spend enough time to check everything. Re-architecting the code like I
> propose above would make this a lot easier to review, I believe.
>

Thank you for reviewing the patch. Your suggestions are very helpful.
I will restructure the code to help people review it.
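
For illustration, a skeleton of that split might look like the following
(my sketch based on the names Jerome suggested; the bodies are placeholders,
not the actual patch):

static bool can_exchange_anon_page(struct page *page)
{
        /* e.g. anonymous, not KSM, with the expected reference count */
        return PageAnon(page) && !PageKsm(page);
}

static int exchange_anon_to_anon(struct page *from_page, struct page *to_page)
{
        if (!can_exchange_anon_page(from_page) ||
            !can_exchange_anon_page(to_page))
                return -EBUSY;
        /* ... unmap both pages, exchange data and flags, restore mappings ... */
        return 0;
}

exchange_anon_to_share() and exchange_share_to_share() would follow the same
pattern, with can_exchange_share_page() checking the mapping state instead.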


>> +	from_page_count = page_count(from_page);
>> +	from_map_count = page_mapcount(from_page);
>> +	to_page_count = page_count(to_page);
>> +	to_map_count = page_mapcount(to_page);
>> +	from_flags = from_page->flags;
>> +	to_flags = to_page->flags;
>> +	from_mapping = from_page->mapping;
>> +	to_mapping = to_page->mapping;
>> +	from_index = from_page->index;
>> +	to_index = to_page->index;
>
> Those are not used anywhere ...

Will remove them. Thanks.

--
Best Regards,
Yan Zi


Re: [RFC PATCH 01/31] mm: migrate: Add exchange_pages to exchange two lists of pages.

2019-02-21 Thread Jerome Glisse
On Fri, Feb 15, 2019 at 02:08:26PM -0800, Zi Yan wrote:
> From: Zi Yan 
> 
> Instead of using two migrate_pages(), a single exchange_pages() would
> be sufficient and without allocating new pages.

So I believe it would be better to arrange the code differently: instead
of having one function that special-cases each combination, define a
function for each one, i.e.:
exchange_anon_to_share()
exchange_anon_to_anon()
exchange_share_to_share()

Then you could define functions to test if a page is in the correct state:
can_exchange_anon_page() // return true if page can be exchanged
can_exchange_share_page()

In fact both of these functions can be factored out as common helpers
shared with the existing migration code in migrate.c. This way we would
have only one place where we need to handle all the special casing, tests
and exceptions.

Other than that I could not spot anything obviously wrong, but I did not
spend enough time to check everything. Re-architecting the code like I
propose above would make this a lot easier to review, I believe.

Cheers,
Jérôme

> 
> Signed-off-by: Zi Yan 
> ---
>  include/linux/ksm.h |   5 +
>  mm/Makefile |   1 +
>  mm/exchange.c   | 846 
>  mm/internal.h   |   6 +
>  mm/ksm.c|  35 ++
>  mm/migrate.c|   4 +-
>  6 files changed, 895 insertions(+), 2 deletions(-)
>  create mode 100644 mm/exchange.c

[...]

> + from_page_count = page_count(from_page);
> + from_map_count = page_mapcount(from_page);
> + to_page_count = page_count(to_page);
> + to_map_count = page_mapcount(to_page);
> + from_flags = from_page->flags;
> + to_flags = to_page->flags;
> + from_mapping = from_page->mapping;
> + to_mapping = to_page->mapping;
> + from_index = from_page->index;
> + to_index = to_page->index;

Those are not used anywhere ...


Re: [RFC PATCH 01/31] mm: migrate: Add exchange_pages to exchange two lists of pages.

2019-02-19 Thread Anshuman Khandual



On 02/19/2019 06:26 PM, Matthew Wilcox wrote:
> On Tue, Feb 19, 2019 at 01:12:07PM +0530, Anshuman Khandual wrote:
>> But the location of this temp page matters as well, because you would like to
>> saturate the inter-node interface. It needs to be on one of the nodes where
>> the source or destination page belongs. Any other node would generate two
>> inter-node copy processes, which is not what you intend here, I guess.
> That makes no sense.  It should be allocated on the local node of the CPU
> performing the copy.  If the CPU is in node A, the destination is in node B
> and the source is in node C, then you're doing 4k worth of reads from node C,
> 4k worth of reads from node B, 4k worth of writes to node C followed by
> 4k worth of writes to node B.  Eventually the 4k of dirty cachelines on
> node A will be written back from cache to the local memory (... or not,
> if that page gets reused for some other purpose first).
> 
> If you allocate the page on node B or node C, that's an extra 4k of writes
> to be sent across the inter-node link.

That's right, there will be an extra remote write. My assumption was that the
CPU performing the copy belongs to either node B or node C.


Re: [RFC PATCH 01/31] mm: migrate: Add exchange_pages to exchange two lists of pages.

2019-02-19 Thread Matthew Wilcox
On Tue, Feb 19, 2019 at 01:12:07PM +0530, Anshuman Khandual wrote:
> But the location of this temp page matters as well, because you would like to
> saturate the inter-node interface. It needs to be on one of the nodes where
> the source or destination page belongs. Any other node would generate two
> inter-node copy processes, which is not what you intend here, I guess.

That makes no sense.  It should be allocated on the local node of the CPU
performing the copy.  If the CPU is in node A, the destination is in node B
and the source is in node C, then you're doing 4k worth of reads from node C,
4k worth of reads from node B, 4k worth of writes to node C followed by
4k worth of writes to node B.  Eventually the 4k of dirty cachelines on
node A will be written back from cache to the local memory (... or not,
if that page gets reused for some other purpose first).

If you allocate the page on node B or node C, that's an extra 4k of writes
to be sent across the inter-node link.
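
In other words, with the temporary page local to the copying CPU on node A,
each 4k exchange moves 16k across the inter-node links (4k of reads from each
of B and C, plus 4k of writes to each of B and C). Placing the temporary page
on B or C adds the 4k copy into the temporary page to that total, for 20k.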


Re: [RFC PATCH 01/31] mm: migrate: Add exchange_pages to exchange two lists of pages.

2019-02-18 Thread Anshuman Khandual



On 02/18/2019 11:29 PM, Zi Yan wrote:
> On 18 Feb 2019, at 9:52, Matthew Wilcox wrote:
> 
>> On Mon, Feb 18, 2019 at 09:51:33AM -0800, Zi Yan wrote:
>>> On 18 Feb 2019, at 9:42, Vlastimil Babka wrote:
>>>> On 2/18/19 6:31 PM, Zi Yan wrote:
>>>>> The purpose of proposing exchange_pages() is to avoid allocating any new
>>>>> page, so that we would not trigger any potential page reclaim or memory
>>>>> compaction. Allocating a temporary page defeats the purpose.
>>>>
>>>> Compaction can only happen for order > 0 temporary pages. Even if you used
>>>> single order = 0 page to gradually exchange e.g. a THP, it should be
>>>> better than u64. Allocating order = 0 should be a non-issue. If it's an
>>>> issue, then the system is in a bad state and physically contiguous layout
>>>> is a secondary concern.
>>>
>>> You are right if we only need to allocate one order-0 page. But this also
>>> means we can only exchange two pages at a time. We need to add a lock to
>>> make sure the temporary page is used exclusively, or we need to keep
>>> allocating temporary pages when multiple exchange_pages() are happening at
>>> the same time.
>>
>> You allocate one temporary page per thread that's doing an exchange_page().
>
> Yeah, you are right. I think at most I need NR_CPUS order-0 pages. I will
> try it. Thanks.

But the location of this temp page matters as well, because you would like to
saturate the inter-node interface. It needs to be on one of the nodes where
the source or destination page belongs. Any other node would generate two
inter-node copy processes, which is not what you intend here, I guess.


Re: [RFC PATCH 01/31] mm: migrate: Add exchange_pages to exchange two lists of pages.

2019-02-18 Thread Zi Yan
On 18 Feb 2019, at 9:52, Matthew Wilcox wrote:

> On Mon, Feb 18, 2019 at 09:51:33AM -0800, Zi Yan wrote:
>> On 18 Feb 2019, at 9:42, Vlastimil Babka wrote:
>>> On 2/18/19 6:31 PM, Zi Yan wrote:
>>>> The purpose of proposing exchange_pages() is to avoid allocating any new
>>>> page, so that we would not trigger any potential page reclaim or memory
>>>> compaction. Allocating a temporary page defeats the purpose.
>>>
>>> Compaction can only happen for order > 0 temporary pages. Even if you
>>> used
>>> single order = 0 page to gradually exchange e.g. a THP, it should be
>>> better than
>>> u64. Allocating order = 0 should be a non-issue. If it's an issue, then
>>> the
>>> system is in a bad state and physically contiguous layout is a secondary
>>> concern.
>>
>> You are right if we only need to allocate one order-0 page. But this also
>> means
>> we can only exchange two pages at a time. We need to add a lock to make sure
>> the temporary page is used exclusively or we need to keep allocating
>> temporary pages
>> when multiple exchange_pages() are happening at the same time.
>
> You allocate one temporary page per thread that's doing an exchange_page().

Yeah, you are right. I think at most I need NR_CPUS order-0 pages. I will try
it. Thanks.

--
Best Regards,
Yan Zi




Re: [RFC PATCH 01/31] mm: migrate: Add exchange_pages to exchange two lists of pages.

2019-02-18 Thread Matthew Wilcox
On Mon, Feb 18, 2019 at 09:51:33AM -0800, Zi Yan wrote:
> On 18 Feb 2019, at 9:42, Vlastimil Babka wrote:
> > On 2/18/19 6:31 PM, Zi Yan wrote:
> > > The purpose of proposing exchange_pages() is to avoid allocating any
> > > new
> > > page,
> > > so that we would not trigger any potential page reclaim or memory
> > > compaction.
> > > Allocating a temporary page defeats the purpose.
> > 
> > Compaction can only happen for order > 0 temporary pages. Even if you
> > used
> > single order = 0 page to gradually exchange e.g. a THP, it should be
> > better than
> > u64. Allocating order = 0 should be a non-issue. If it's an issue, then
> > the
> > system is in a bad state and physically contiguous layout is a secondary
> > concern.
> 
> You are right if we only need to allocate one order-0 page. But this also
> means
> we can only exchange two pages at a time. We need to add a lock to make sure
> the temporary page is used exclusively or we need to keep allocating
> temporary pages
> when multiple exchange_pages() are happening at the same time.

You allocate one temporary page per thread that's doing an exchange_page().


Re: [RFC PATCH 01/31] mm: migrate: Add exchange_pages to exchange two lists of pages.

2019-02-18 Thread Zi Yan

On 18 Feb 2019, at 9:42, Vlastimil Babka wrote:


> On 2/18/19 6:31 PM, Zi Yan wrote:
>> The purpose of proposing exchange_pages() is to avoid allocating any new
>> page, so that we would not trigger any potential page reclaim or memory
>> compaction. Allocating a temporary page defeats the purpose.
>
> Compaction can only happen for order > 0 temporary pages. Even if you used
> single order = 0 page to gradually exchange e.g. a THP, it should be better
> than u64. Allocating order = 0 should be a non-issue. If it's an issue, then
> the system is in a bad state and physically contiguous layout is a secondary
> concern.

You are right if we only need to allocate one order-0 page. But this also means
we can only exchange two pages at a time. We need to add a lock to make sure
the temporary page is used exclusively, or we need to keep allocating temporary
pages when multiple exchange_pages() are happening at the same time.

--
Best Regards,
Yan Zi


Re: [RFC PATCH 01/31] mm: migrate: Add exchange_pages to exchange two lists of pages.

2019-02-18 Thread Vlastimil Babka
On 2/18/19 6:31 PM, Zi Yan wrote:
> The purpose of proposing exchange_pages() is to avoid allocating any new 
> page,
> so that we would not trigger any potential page reclaim or memory 
> compaction.
> Allocating a temporary page defeats the purpose.

Compaction can only happen for order > 0 temporary pages. Even if you used
single order = 0 page to gradually exchange e.g. a THP, it should be better than
u64. Allocating order = 0 should be a non-issue. If it's an issue, then the
system is in a bad state and physically contiguous layout is a secondary 
concern.


Re: [RFC PATCH 01/31] mm: migrate: Add exchange_pages to exchange two lists of pages.

2019-02-18 Thread Zi Yan

On 17 Feb 2019, at 3:29, Matthew Wilcox wrote:

> On Fri, Feb 15, 2019 at 02:08:26PM -0800, Zi Yan wrote:
>> +struct page_flags {
>> +	unsigned int page_error :1;
>> +	unsigned int page_referenced:1;
>> +	unsigned int page_uptodate:1;
>> +	unsigned int page_active:1;
>> +	unsigned int page_unevictable:1;
>> +	unsigned int page_checked:1;
>> +	unsigned int page_mappedtodisk:1;
>> +	unsigned int page_dirty:1;
>> +	unsigned int page_is_young:1;
>> +	unsigned int page_is_idle:1;
>> +	unsigned int page_swapcache:1;
>> +	unsigned int page_writeback:1;
>> +	unsigned int page_private:1;
>> +	unsigned int __pad:3;
>> +};
>
> I'm not sure how to feel about this.  It's a bit fragile versus somebody
> adding new page flags.  I don't know whether it's needed or whether you can
> just copy page->flags directly because you're holding PageLock.


I agree with you that the current way of copying page flags individually could
miss new page flags. I will try to come up with something better. Copying
page->flags as a whole might not simply work, since the upper part of
page->flags holds the page's node information, which should not be changed.
I think I need to add a helper function that just copies/exchanges all page
flags, like calling migrate_page_states() twice.
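
For instance, a minimal sketch of such a helper, assuming both pages are
locked and using NR_PAGEFLAGS to mask off the placement bits (my illustration,
not the patch; a real version would also have to preserve bits like PG_locked):

static void exchange_raw_page_flags(struct page *to_page,
                                    struct page *from_page)
{
        /*
         * The low NR_PAGEFLAGS bits are the per-page flags; the upper bits
         * encode section/node/zone placement and must stay with each page.
         */
        unsigned long mask = (1UL << NR_PAGEFLAGS) - 1;
        unsigned long from_low = from_page->flags & mask;
        unsigned long to_low = to_page->flags & mask;

        from_page->flags = (from_page->flags & ~mask) | to_low;
        to_page->flags = (to_page->flags & ~mask) | from_low;
}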


>> +static void exchange_page(char *to, char *from)
>> +{
>> +	u64 tmp;
>> +	int i;
>> +
>> +	for (i = 0; i < PAGE_SIZE; i += sizeof(tmp)) {
>> +		tmp = *((u64 *)(from + i));
>> +		*((u64 *)(from + i)) = *((u64 *)(to + i));
>> +		*((u64 *)(to + i)) = tmp;
>> +	}
>> +}
>
> I have a suspicion you'd be better off allocating a temporary page and
> using copy_page().  Some architectures have put a lot of effort into
> making copy_page() run faster.


When I am doing exchange_pages() between two NUMA nodes on an x86_64 machine,
I actually can saturate the QPI bandwidth with this operation. I think cache
prefetching was doing its job.

The purpose of proposing exchange_pages() is to avoid allocating any new page,
so that we would not trigger any potential page reclaim or memory compaction.
Allocating a temporary page defeats the purpose.




>> +	xa_lock_irq(&to_mapping->i_pages);
>> +
>> +	to_pslot = radix_tree_lookup_slot(&to_mapping->i_pages,
>> +			page_index(to_page));
>
> This needs to be converted to the XArray.  radix_tree_lookup_slot() is
> going away soon.  You probably need:
>
> XA_STATE(to_xas, &to_mapping->i_pages, page_index(to_page));


Thank you for pointing this out. I will do the change.
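
For example, the lookup-and-replace could become something like this sketch
(my illustration of the xas_* calls, reusing to_mapping, to_page, and
from_page from the patch):

        XA_STATE(to_xas, &to_mapping->i_pages, page_index(to_page));

        xas_lock_irq(&to_xas);
        /* Replace to_page's slot with from_page under the XArray lock. */
        xas_store(&to_xas, from_page);
        xas_unlock_irq(&to_xas);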



> This is a lot of code and I'm still trying to get my head around it all.
> Thanks for putting in this work; it's good to see this approach being
> explored.


Thank you for taking a look at the code.

--
Best Regards,
Yan Zi


Re: [RFC PATCH 01/31] mm: migrate: Add exchange_pages to exchange two lists of pages.

2019-02-17 Thread Matthew Wilcox
On Fri, Feb 15, 2019 at 02:08:26PM -0800, Zi Yan wrote:
> +struct page_flags {
> + unsigned int page_error :1;
> + unsigned int page_referenced:1;
> + unsigned int page_uptodate:1;
> + unsigned int page_active:1;
> + unsigned int page_unevictable:1;
> + unsigned int page_checked:1;
> + unsigned int page_mappedtodisk:1;
> + unsigned int page_dirty:1;
> + unsigned int page_is_young:1;
> + unsigned int page_is_idle:1;
> + unsigned int page_swapcache:1;
> + unsigned int page_writeback:1;
> + unsigned int page_private:1;
> + unsigned int __pad:3;
> +};

I'm not sure how to feel about this.  It's a bit fragile versus somebody adding
new page flags.  I don't know whether it's needed or whether you can just
copy page->flags directly because you're holding PageLock.

> +static void exchange_page(char *to, char *from)
> +{
> + u64 tmp;
> + int i;
> +
> + for (i = 0; i < PAGE_SIZE; i += sizeof(tmp)) {
> + tmp = *((u64 *)(from + i));
> + *((u64 *)(from + i)) = *((u64 *)(to + i));
> + *((u64 *)(to + i)) = tmp;
> + }
> +}

I have a suspicion you'd be better off allocating a temporary page and
using copy_page().  Some architectures have put a lot of effort into
making copy_page() run faster.

> +	xa_lock_irq(&to_mapping->i_pages);
> +
> +	to_pslot = radix_tree_lookup_slot(&to_mapping->i_pages,
> +			page_index(to_page));

This needs to be converted to the XArray.  radix_tree_lookup_slot() is
going away soon.  You probably need:

XA_STATE(to_xas, &to_mapping->i_pages, page_index(to_page));

This is a lot of code and I'm still trying to get my head around it all.
Thanks for putting in this work; it's good to see this approach being
explored.


[RFC PATCH 01/31] mm: migrate: Add exchange_pages to exchange two lists of pages.

2019-02-15 Thread Zi Yan
From: Zi Yan 

Instead of using two migrate_pages(), a single exchange_pages() would
be sufficient and without allocating new pages.

Signed-off-by: Zi Yan 
---
 include/linux/ksm.h |   5 +
 mm/Makefile |   1 +
 mm/exchange.c   | 846 
 mm/internal.h   |   6 +
 mm/ksm.c|  35 ++
 mm/migrate.c|   4 +-
 6 files changed, 895 insertions(+), 2 deletions(-)
 create mode 100644 mm/exchange.c

diff --git a/include/linux/ksm.h b/include/linux/ksm.h
index 161e8164abcf..87c5b943a73c 100644
--- a/include/linux/ksm.h
+++ b/include/linux/ksm.h
@@ -53,6 +53,7 @@ struct page *ksm_might_need_to_copy(struct page *page,
 
 void rmap_walk_ksm(struct page *page, struct rmap_walk_control *rwc);
 void ksm_migrate_page(struct page *newpage, struct page *oldpage);
+void ksm_exchange_page(struct page *to_page, struct page *from_page);
 
 #else  /* !CONFIG_KSM */
 
@@ -86,6 +87,10 @@ static inline void rmap_walk_ksm(struct page *page,
 static inline void ksm_migrate_page(struct page *newpage, struct page *oldpage)
 {
 }
+static inline void ksm_exchange_page(struct page *to_page,
+   struct page *from_page)
+{
+}
 #endif /* CONFIG_MMU */
 #endif /* !CONFIG_KSM */
 
diff --git a/mm/Makefile b/mm/Makefile
index d210cc9d6f80..1574ea5743e4 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -43,6 +43,7 @@ obj-y := filemap.o mempool.o oom_kill.o fadvise.o \
 
 obj-y += init-mm.o
 obj-y += memblock.o
+obj-y += exchange.o
 
 ifdef CONFIG_MMU
obj-$(CONFIG_ADVISE_SYSCALLS)   += madvise.o
diff --git a/mm/exchange.c b/mm/exchange.c
new file mode 100644
index ..a607348cc6f4
--- /dev/null
+++ b/mm/exchange.c
@@ -0,0 +1,846 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2016 NVIDIA, Zi Yan 
+ *
+ * Exchange two in-use pages. Page flags and page->mapping are exchanged
+ * as well. Only anonymous pages are supported.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include  /* buffer_migrate_page  */
+#include 
+
+
+#include "internal.h"
+
+struct exchange_page_info {
+   struct page *from_page;
+   struct page *to_page;
+
+   struct anon_vma *from_anon_vma;
+   struct anon_vma *to_anon_vma;
+
+   struct list_head list;
+};
+
+struct page_flags {
+   unsigned int page_error :1;
+   unsigned int page_referenced:1;
+   unsigned int page_uptodate:1;
+   unsigned int page_active:1;
+   unsigned int page_unevictable:1;
+   unsigned int page_checked:1;
+   unsigned int page_mappedtodisk:1;
+   unsigned int page_dirty:1;
+   unsigned int page_is_young:1;
+   unsigned int page_is_idle:1;
+   unsigned int page_swapcache:1;
+   unsigned int page_writeback:1;
+   unsigned int page_private:1;
+   unsigned int __pad:3;
+};
+
+
+static void exchange_page(char *to, char *from)
+{
+   u64 tmp;
+   int i;
+
+   for (i = 0; i < PAGE_SIZE; i += sizeof(tmp)) {
+   tmp = *((u64 *)(from + i));
+   *((u64 *)(from + i)) = *((u64 *)(to + i));
+   *((u64 *)(to + i)) = tmp;
+   }
+}
+
+static inline void exchange_highpage(struct page *to, struct page *from)
+{
+   char *vfrom, *vto;
+
+   vfrom = kmap_atomic(from);
+   vto = kmap_atomic(to);
+   exchange_page(vto, vfrom);
+   kunmap_atomic(vto);
+   kunmap_atomic(vfrom);
+}
+
+static void __exchange_gigantic_page(struct page *dst, struct page *src,
+   int nr_pages)
+{
+   int i;
+   struct page *dst_base = dst;
+   struct page *src_base = src;
+
+   for (i = 0; i < nr_pages; ) {
+   cond_resched();
+   exchange_highpage(dst, src);
+
+   i++;
+   dst = mem_map_next(dst, dst_base, i);
+   src = mem_map_next(src, src_base, i);
+   }
+}
+
+static void exchange_huge_page(struct page *dst, struct page *src)
+{
+   int i;
+   int nr_pages;
+
+   if (PageHuge(src)) {
+   /* hugetlbfs page */
+   struct hstate *h = page_hstate(src);
+
+   nr_pages = pages_per_huge_page(h);
+
+   if (unlikely(nr_pages > MAX_ORDER_NR_PAGES)) {
+   __exchange_gigantic_page(dst, src, nr_pages);
+   return;
+   }
+   } else {
+   /* thp page */
+   VM_BUG_ON(!PageTransHuge(src));
+   nr_pages = hpage_nr_pages(src);
+   }
+
+   for (i = 0; i < nr_pages; i++) {
+   cond_resched();
+   exchange_highpage(dst + i, src + i);
+   }
+}
+
+/*
+ * Exchange the page flags and other per-page state of the two pages
+ * (the data copy is done separately).
+ */
+static void exchange_page_flags(struct page *to_page, struct page *from_page)
+{
+   int from_cpupid, to_cpupid;
+   struct page_flags from_page_flags, to_page_flags;
+  
