Re: [RFC PATCH 01/31] mm: migrate: Add exchange_pages to exchange two lists of pages.
On 19 Feb 2019, at 20:38, Anshuman Khandual wrote:

> On 02/19/2019 06:26 PM, Matthew Wilcox wrote:
>> On Tue, Feb 19, 2019 at 01:12:07PM +0530, Anshuman Khandual wrote:
>>> But the location of this temp page matters as well because you would
>>> like to saturate the inter-node interface. It needs to be on either of
>>> the nodes where the source or destination page belongs. Any other node
>>> would generate two inter-node copy processes, which is not what you
>>> intend here I guess.
>>
>> That makes no sense. It should be allocated on the local node of the CPU
>> performing the copy. If the CPU is in node A, the destination is in node
>> B and the source is in node C, then you're doing 4k worth of reads from
>> node C, 4k worth of reads from node B, 4k worth of writes to node C
>> followed by 4k worth of writes to node B. Eventually the 4k of dirty
>> cachelines on node A will be written back from cache to the local memory
>> (... or not, if that page gets reused for some other purpose first).
>>
>> If you allocate the page on node B or node C, that's an extra 4k of
>> writes to be sent across the inter-node link.
>
> That's right, there will be an extra remote write. My assumption was that
> the CPU performing the copy belongs to either node B or node C.

I have some interesting throughput results for exchanging per u64 and exchanging per 4KB page. What I discovered is that using a 4KB page as the temporary storage for exchanging 2MB THPs does not improve the throughput. On the contrary, when we are exchanging more than 2^4 = 16 THPs, exchanging per 4KB page has lower throughput than exchanging per u64. Please see the results below.

The experiments were done on a two-socket machine with two Intel Xeon E5-2640 v3 CPUs. All exchanges go across the QPI link between the two sockets.
Results
===

Throughput (GB/s) of exchanging 2^order 2MB pages between two NUMA nodes:

2mb_page_order |    0 |    1 |    2 |    3 |    4 |    5 |    6 |    7 |    8 |    9
u64            | 5.31 | 5.58 | 5.89 | 5.69 | 8.97 | 9.51 | 9.21 | 9.50 | 9.57 | 9.62
per_page       | 5.85 | 6.48 | 6.20 | 5.26 | 7.22 | 7.25 | 7.28 | 7.30 | 7.32 | 7.31

Normalized throughput (u64 relative to per_page):

2mb_page_order |    0 |    1 |    2 |    3 |    4 |    5 |    6 |    7 |    8 |    9
u64            | 0.90 | 0.86 | 0.94 | 1.08 | 1.24 | 1.31 | 1.26 | 1.30 | 1.30 | 1.31

Exchange page code
===

For exchanging per u64, I use the following function:

static void exchange_page(char *to, char *from)
{
	u64 tmp;
	int i;

	for (i = 0; i < PAGE_SIZE; i += sizeof(tmp)) {
		tmp = *((u64 *)(from + i));
		*((u64 *)(from + i)) = *((u64 *)(to + i));
		*((u64 *)(to + i)) = tmp;
	}
}

For exchanging per 4KB page, I use the following function:

static void exchange_page2(char *to, char *from)
{
	int cpu = smp_processor_id();

	VM_BUG_ON(!in_atomic());

	if (!page_tmp[cpu]) {
		int nid = cpu_to_node(cpu);
		struct page *page_tmp_page = alloc_pages_node(nid, GFP_KERNEL, 0);

		if (!page_tmp_page) {
			exchange_page(to, from);
			return;
		}
		page_tmp[cpu] = kmap(page_tmp_page);
	}

	copy_page(page_tmp[cpu], to);
	copy_page(to, from);
	copy_page(from, page_tmp[cpu]);
}

where page_tmp is pre-allocated, local to each CPU. The alloc_pages_node() fallback above is for hot-added CPUs, which are not exercised in these tests.

The kernel is available at: https://gitlab.com/ziy/linux-contig-mem-rfc

To do a comparison, you can clone this repo: https://gitlab.com/ziy/thp-migration-bench, then make, ./run_test.sh, and ./get_results.sh using the kernel from above.

Let me know if I missed anything or did something wrong. Thanks.

--
Best Regards,
Yan Zi
Re: [RFC PATCH 01/31] mm: migrate: Add exchange_pages to exchange two lists of pages.
On 21 Feb 2019, at 13:10, Jerome Glisse wrote:

> On Fri, Feb 15, 2019 at 02:08:26PM -0800, Zi Yan wrote:
>> From: Zi Yan
>>
>> Instead of using two migrate_pages(), a single exchange_pages() would
>> be sufficient, without allocating new pages.
>
> So I believe it would be better to arrange the code differently: instead
> of having one function that special-cases each combination, define a
> function for each one, i.e.:
>
>   exchange_anon_to_share()
>   exchange_anon_to_anon()
>   exchange_share_to_share()
>
> Then you could define functions to test whether a page is in the correct
> state:
>
>   can_exchange_anon_page()   // return true if page can be exchanged
>   can_exchange_share_page()
>
> In fact, both of these functions can be factored out as common helpers
> shared with the existing migrate code in migrate.c. This way we would
> have only one place where we need to handle all the special cases, tests
> and exceptions.
>
> Other than that, I could not spot anything obviously wrong, but I did not
> spend enough time to check everything. Re-architecting the code as I
> propose above would make this a lot easier to review, I believe.

Thank you for reviewing the patch. Your suggestions are very helpful. I will restructure the code to help people review it.

>> +	from_page_count = page_count(from_page);
>> +	from_map_count = page_mapcount(from_page);
>> +	to_page_count = page_count(to_page);
>> +	to_map_count = page_mapcount(to_page);
>> +	from_flags = from_page->flags;
>> +	to_flags = to_page->flags;
>> +	from_mapping = from_page->mapping;
>> +	to_mapping = to_page->mapping;
>> +	from_index = from_page->index;
>> +	to_index = to_page->index;
>
> Those are not used anywhere ...

Will remove them. Thanks.

--
Best Regards,
Yan Zi
Re: [RFC PATCH 01/31] mm: migrate: Add exchange_pages to exchange two lists of pages.
On Fri, Feb 15, 2019 at 02:08:26PM -0800, Zi Yan wrote:
> From: Zi Yan
>
> Instead of using two migrate_pages(), a single exchange_pages() would
> be sufficient, without allocating new pages.

So I believe it would be better to arrange the code differently: instead of having one function that special-cases each combination, define a function for each one, i.e.:

  exchange_anon_to_share()
  exchange_anon_to_anon()
  exchange_share_to_share()

Then you could define functions to test whether a page is in the correct state:

  can_exchange_anon_page()   // return true if page can be exchanged
  can_exchange_share_page()

In fact, both of these functions can be factored out as common helpers shared with the existing migrate code in migrate.c. This way we would have only one place where we need to handle all the special cases, tests and exceptions.

Other than that, I could not spot anything obviously wrong, but I did not spend enough time to check everything. Re-architecting the code as I propose above would make this a lot easier to review, I believe.

Cheers,
Jérôme

> Signed-off-by: Zi Yan
> ---
>  include/linux/ksm.h |   5 +
>  mm/Makefile         |   1 +
>  mm/exchange.c       | 846 ++++++++++++++++++++++++++++++++++++++++++++
>  mm/internal.h       |   6 +
>  mm/ksm.c            |  35 ++
>  mm/migrate.c        |   4 +-
>  6 files changed, 895 insertions(+), 2 deletions(-)
>  create mode 100644 mm/exchange.c

[...]

> +	from_page_count = page_count(from_page);
> +	from_map_count = page_mapcount(from_page);
> +	to_page_count = page_count(to_page);
> +	to_map_count = page_mapcount(to_page);
> +	from_flags = from_page->flags;
> +	to_flags = to_page->flags;
> +	from_mapping = from_page->mapping;
> +	to_mapping = to_page->mapping;
> +	from_index = from_page->index;
> +	to_index = to_page->index;

Those are not used anywhere ...
Re: [RFC PATCH 01/31] mm: migrate: Add exchange_pages to exchange two lists of pages.
On 02/19/2019 06:26 PM, Matthew Wilcox wrote:
> On Tue, Feb 19, 2019 at 01:12:07PM +0530, Anshuman Khandual wrote:
>> But the location of this temp page matters as well because you would
>> like to saturate the inter-node interface. It needs to be on either of
>> the nodes where the source or destination page belongs. Any other node
>> would generate two inter-node copy processes, which is not what you
>> intend here I guess.
>
> That makes no sense. It should be allocated on the local node of the CPU
> performing the copy. If the CPU is in node A, the destination is in node
> B and the source is in node C, then you're doing 4k worth of reads from
> node C, 4k worth of reads from node B, 4k worth of writes to node C
> followed by 4k worth of writes to node B. Eventually the 4k of dirty
> cachelines on node A will be written back from cache to the local memory
> (... or not, if that page gets reused for some other purpose first).
>
> If you allocate the page on node B or node C, that's an extra 4k of
> writes to be sent across the inter-node link.

That's right, there will be an extra remote write. My assumption was that the CPU performing the copy belongs to either node B or node C.
Re: [RFC PATCH 01/31] mm: migrate: Add exchange_pages to exchange two lists of pages.
On Tue, Feb 19, 2019 at 01:12:07PM +0530, Anshuman Khandual wrote:
> But the location of this temp page matters as well because you would like
> to saturate the inter-node interface. It needs to be on either of the
> nodes where the source or destination page belongs. Any other node would
> generate two inter-node copy processes, which is not what you intend here
> I guess.

That makes no sense. It should be allocated on the local node of the CPU performing the copy. If the CPU is in node A, the destination is in node B and the source is in node C, then you're doing 4k worth of reads from node C, 4k worth of reads from node B, 4k worth of writes to node C followed by 4k worth of writes to node B. Eventually the 4k of dirty cachelines on node A will be written back from cache to the local memory (... or not, if that page gets reused for some other purpose first).

If you allocate the page on node B or node C, that's an extra 4k of writes to be sent across the inter-node link.
Re: [RFC PATCH 01/31] mm: migrate: Add exchange_pages to exchange two lists of pages.
On 02/18/2019 11:29 PM, Zi Yan wrote:
> On 18 Feb 2019, at 9:52, Matthew Wilcox wrote:
>
>> On Mon, Feb 18, 2019 at 09:51:33AM -0800, Zi Yan wrote:
>>> On 18 Feb 2019, at 9:42, Vlastimil Babka wrote:
>>>> On 2/18/19 6:31 PM, Zi Yan wrote:
>>>>> The purpose of proposing exchange_pages() is to avoid allocating any
>>>>> new page, so that we would not trigger any potential page reclaim or
>>>>> memory compaction. Allocating a temporary page defeats the purpose.
>>>>
>>>> Compaction can only happen for order > 0 temporary pages. Even if you
>>>> used a single order = 0 page to gradually exchange e.g. a THP, it
>>>> should be better than u64. Allocating order = 0 should be a non-issue.
>>>> If it's an issue, then the system is in a bad state and physically
>>>> contiguous layout is a secondary concern.
>>>
>>> You are right if we only need to allocate one order-0 page. But this
>>> also means we can only exchange two pages at a time. We need to add a
>>> lock to make sure the temporary page is used exclusively, or we need to
>>> keep allocating temporary pages when multiple exchange_pages() are
>>> happening at the same time.
>>
>> You allocate one temporary page per thread that's doing an
>> exchange_page().
>
> Yeah, you are right. I think at most I need NR_CPU order-0 pages. I will
> try it. Thanks.

But the location of this temp page matters as well, because you would like to saturate the inter-node interface. It needs to be on either of the nodes where the source or destination page belongs. Any other node would generate two inter-node copy processes, which is not what you intend here, I guess.
Re: [RFC PATCH 01/31] mm: migrate: Add exchange_pages to exchange two lists of pages.
On 18 Feb 2019, at 9:52, Matthew Wilcox wrote:

> On Mon, Feb 18, 2019 at 09:51:33AM -0800, Zi Yan wrote:
>> On 18 Feb 2019, at 9:42, Vlastimil Babka wrote:
>>> On 2/18/19 6:31 PM, Zi Yan wrote:
>>>> The purpose of proposing exchange_pages() is to avoid allocating any
>>>> new page, so that we would not trigger any potential page reclaim or
>>>> memory compaction. Allocating a temporary page defeats the purpose.
>>>
>>> Compaction can only happen for order > 0 temporary pages. Even if you
>>> used a single order = 0 page to gradually exchange e.g. a THP, it
>>> should be better than u64. Allocating order = 0 should be a non-issue.
>>> If it's an issue, then the system is in a bad state and physically
>>> contiguous layout is a secondary concern.
>>
>> You are right if we only need to allocate one order-0 page. But this
>> also means we can only exchange two pages at a time. We need to add a
>> lock to make sure the temporary page is used exclusively, or we need to
>> keep allocating temporary pages when multiple exchange_pages() are
>> happening at the same time.
>
> You allocate one temporary page per thread that's doing an
> exchange_page().

Yeah, you are right. I think at most I need NR_CPU order-0 pages. I will try it. Thanks.

--
Best Regards,
Yan Zi
Re: [RFC PATCH 01/31] mm: migrate: Add exchange_pages to exchange two lists of pages.
On Mon, Feb 18, 2019 at 09:51:33AM -0800, Zi Yan wrote:
> On 18 Feb 2019, at 9:42, Vlastimil Babka wrote:
>> On 2/18/19 6:31 PM, Zi Yan wrote:
>>> The purpose of proposing exchange_pages() is to avoid allocating any
>>> new page, so that we would not trigger any potential page reclaim or
>>> memory compaction. Allocating a temporary page defeats the purpose.
>>
>> Compaction can only happen for order > 0 temporary pages. Even if you
>> used a single order = 0 page to gradually exchange e.g. a THP, it should
>> be better than u64. Allocating order = 0 should be a non-issue. If it's
>> an issue, then the system is in a bad state and physically contiguous
>> layout is a secondary concern.
>
> You are right if we only need to allocate one order-0 page. But this also
> means we can only exchange two pages at a time. We need to add a lock to
> make sure the temporary page is used exclusively, or we need to keep
> allocating temporary pages when multiple exchange_pages() are happening
> at the same time.

You allocate one temporary page per thread that's doing an exchange_page().
Re: [RFC PATCH 01/31] mm: migrate: Add exchange_pages to exchange two lists of pages.
On 18 Feb 2019, at 9:42, Vlastimil Babka wrote:

> On 2/18/19 6:31 PM, Zi Yan wrote:
>> The purpose of proposing exchange_pages() is to avoid allocating any new
>> page, so that we would not trigger any potential page reclaim or memory
>> compaction. Allocating a temporary page defeats the purpose.
>
> Compaction can only happen for order > 0 temporary pages. Even if you
> used a single order = 0 page to gradually exchange e.g. a THP, it should
> be better than u64. Allocating order = 0 should be a non-issue. If it's
> an issue, then the system is in a bad state and physically contiguous
> layout is a secondary concern.

You are right if we only need to allocate one order-0 page. But this also means we can only exchange two pages at a time. We need to add a lock to make sure the temporary page is used exclusively, or we need to keep allocating temporary pages when multiple exchange_pages() are happening at the same time.

--
Best Regards,
Yan Zi
Re: [RFC PATCH 01/31] mm: migrate: Add exchange_pages to exchange two lists of pages.
On 2/18/19 6:31 PM, Zi Yan wrote:
> The purpose of proposing exchange_pages() is to avoid allocating any new
> page, so that we would not trigger any potential page reclaim or memory
> compaction. Allocating a temporary page defeats the purpose.

Compaction can only happen for order > 0 temporary pages. Even if you used a single order = 0 page to gradually exchange e.g. a THP, it should be better than u64. Allocating order = 0 should be a non-issue. If it's an issue, then the system is in a bad state and physically contiguous layout is a secondary concern.
Re: [RFC PATCH 01/31] mm: migrate: Add exchange_pages to exchange two lists of pages.
On 17 Feb 2019, at 3:29, Matthew Wilcox wrote:

> On Fri, Feb 15, 2019 at 02:08:26PM -0800, Zi Yan wrote:
>> +struct page_flags {
>> +	unsigned int page_error:1;
>> +	unsigned int page_referenced:1;
>> +	unsigned int page_uptodate:1;
>> +	unsigned int page_active:1;
>> +	unsigned int page_unevictable:1;
>> +	unsigned int page_checked:1;
>> +	unsigned int page_mappedtodisk:1;
>> +	unsigned int page_dirty:1;
>> +	unsigned int page_is_young:1;
>> +	unsigned int page_is_idle:1;
>> +	unsigned int page_swapcache:1;
>> +	unsigned int page_writeback:1;
>> +	unsigned int page_private:1;
>> +	unsigned int __pad:3;
>> +};
>
> I'm not sure how to feel about this. It's a bit fragile versus somebody
> adding new page flags. I don't know whether it's needed or whether you
> can just copy page->flags directly because you're holding PageLock.

I agree with you that the current way of copying page flags individually could miss new page flags. I will try to come up with something better. Copying page->flags as a whole might not simply work, since the upper part of page->flags holds the page's node information, which should not be changed. I think I need to add a helper function to copy/exchange all page flags, like calling migrate_page_states() twice.

>> +static void exchange_page(char *to, char *from)
>> +{
>> +	u64 tmp;
>> +	int i;
>> +
>> +	for (i = 0; i < PAGE_SIZE; i += sizeof(tmp)) {
>> +		tmp = *((u64 *)(from + i));
>> +		*((u64 *)(from + i)) = *((u64 *)(to + i));
>> +		*((u64 *)(to + i)) = tmp;
>> +	}
>> +}
>
> I have a suspicion you'd be better off allocating a temporary page and
> using copy_page(). Some architectures have put a lot of effort into
> making copy_page() run faster.

When I am doing exchange_pages() between two NUMA nodes on an x86_64 machine, I can actually saturate the QPI bandwidth with this operation. I think cache prefetching was doing its job.

The purpose of proposing exchange_pages() is to avoid allocating any new page, so that we would not trigger any potential page reclaim or memory compaction. Allocating a temporary page defeats the purpose.

>> +	xa_lock_irq(&to_mapping->i_pages);
>> +
>> +	to_pslot = radix_tree_lookup_slot(&to_mapping->i_pages,
>> +			page_index(to_page));
>
> This needs to be converted to the XArray. radix_tree_lookup_slot() is
> going away soon. You probably need:
>
>	XA_STATE(to_xas, &to_mapping->i_pages, page_index(to_page));

Thank you for pointing this out. I will do the change.

> This is a lot of code and I'm still trying to get my head around it all.
> Thanks for putting in this work; it's good to see this approach being
> explored.

Thank you for taking a look at the code.

--
Best Regards,
Yan Zi
Re: [RFC PATCH 01/31] mm: migrate: Add exchange_pages to exchange two lists of pages.
On Fri, Feb 15, 2019 at 02:08:26PM -0800, Zi Yan wrote:
> +struct page_flags {
> +	unsigned int page_error:1;
> +	unsigned int page_referenced:1;
> +	unsigned int page_uptodate:1;
> +	unsigned int page_active:1;
> +	unsigned int page_unevictable:1;
> +	unsigned int page_checked:1;
> +	unsigned int page_mappedtodisk:1;
> +	unsigned int page_dirty:1;
> +	unsigned int page_is_young:1;
> +	unsigned int page_is_idle:1;
> +	unsigned int page_swapcache:1;
> +	unsigned int page_writeback:1;
> +	unsigned int page_private:1;
> +	unsigned int __pad:3;
> +};

I'm not sure how to feel about this. It's a bit fragile versus somebody adding new page flags. I don't know whether it's needed or whether you can just copy page->flags directly because you're holding PageLock.

> +static void exchange_page(char *to, char *from)
> +{
> +	u64 tmp;
> +	int i;
> +
> +	for (i = 0; i < PAGE_SIZE; i += sizeof(tmp)) {
> +		tmp = *((u64 *)(from + i));
> +		*((u64 *)(from + i)) = *((u64 *)(to + i));
> +		*((u64 *)(to + i)) = tmp;
> +	}
> +}

I have a suspicion you'd be better off allocating a temporary page and using copy_page(). Some architectures have put a lot of effort into making copy_page() run faster.

> +	xa_lock_irq(&to_mapping->i_pages);
> +
> +	to_pslot = radix_tree_lookup_slot(&to_mapping->i_pages,
> +			page_index(to_page));

This needs to be converted to the XArray. radix_tree_lookup_slot() is going away soon. You probably need:

	XA_STATE(to_xas, &to_mapping->i_pages, page_index(to_page));

This is a lot of code and I'm still trying to get my head around it all. Thanks for putting in this work; it's good to see this approach being explored.
[RFC PATCH 01/31] mm: migrate: Add exchange_pages to exchange two lists of pages.
From: Zi Yan

Instead of using two migrate_pages(), a single exchange_pages() would be sufficient, without allocating new pages.

Signed-off-by: Zi Yan
---
 include/linux/ksm.h |   5 +
 mm/Makefile         |   1 +
 mm/exchange.c       | 846 ++++++++++++++++++++++++++++++++++++++++++++
 mm/internal.h       |   6 +
 mm/ksm.c            |  35 ++
 mm/migrate.c        |   4 +-
 6 files changed, 895 insertions(+), 2 deletions(-)
 create mode 100644 mm/exchange.c

diff --git a/include/linux/ksm.h b/include/linux/ksm.h
index 161e8164abcf..87c5b943a73c 100644
--- a/include/linux/ksm.h
+++ b/include/linux/ksm.h
@@ -53,6 +53,7 @@ struct page *ksm_might_need_to_copy(struct page *page,
 void rmap_walk_ksm(struct page *page, struct rmap_walk_control *rwc);
 void ksm_migrate_page(struct page *newpage, struct page *oldpage);
+void ksm_exchange_page(struct page *to_page, struct page *from_page);
 
 #else  /* !CONFIG_KSM */
 
@@ -86,6 +87,10 @@ static inline void rmap_walk_ksm(struct page *page,
 static inline void ksm_migrate_page(struct page *newpage, struct page *oldpage)
 {
 }
+static inline void ksm_exchange_page(struct page *to_page,
+				     struct page *from_page)
+{
+}
 #endif /* CONFIG_MMU */
 #endif /* !CONFIG_KSM */
diff --git a/mm/Makefile b/mm/Makefile
index d210cc9d6f80..1574ea5743e4 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -43,6 +43,7 @@ obj-y := filemap.o mempool.o oom_kill.o fadvise.o \
 
 obj-y += init-mm.o
 obj-y += memblock.o
+obj-y += exchange.o
 
 ifdef CONFIG_MMU
 	obj-$(CONFIG_ADVISE_SYSCALLS) += madvise.o
diff --git a/mm/exchange.c b/mm/exchange.c
new file mode 100644
index 000000000000..a607348cc6f4
--- /dev/null
+++ b/mm/exchange.c
@@ -0,0 +1,846 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2016 NVIDIA, Zi Yan
+ *
+ * Exchange two in-use pages. Page flags and page->mapping are exchanged
+ * as well. Only anonymous pages are supported.
+ */
+
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include /* buffer_migrate_page */
+#include
+
+
+#include "internal.h"
+
+struct exchange_page_info {
+	struct page *from_page;
+	struct page *to_page;
+
+	struct anon_vma *from_anon_vma;
+	struct anon_vma *to_anon_vma;
+
+	struct list_head list;
+};
+
+struct page_flags {
+	unsigned int page_error:1;
+	unsigned int page_referenced:1;
+	unsigned int page_uptodate:1;
+	unsigned int page_active:1;
+	unsigned int page_unevictable:1;
+	unsigned int page_checked:1;
+	unsigned int page_mappedtodisk:1;
+	unsigned int page_dirty:1;
+	unsigned int page_is_young:1;
+	unsigned int page_is_idle:1;
+	unsigned int page_swapcache:1;
+	unsigned int page_writeback:1;
+	unsigned int page_private:1;
+	unsigned int __pad:3;
+};
+
+
+static void exchange_page(char *to, char *from)
+{
+	u64 tmp;
+	int i;
+
+	for (i = 0; i < PAGE_SIZE; i += sizeof(tmp)) {
+		tmp = *((u64 *)(from + i));
+		*((u64 *)(from + i)) = *((u64 *)(to + i));
+		*((u64 *)(to + i)) = tmp;
+	}
+}
+
+static inline void exchange_highpage(struct page *to, struct page *from)
+{
+	char *vfrom, *vto;
+
+	vfrom = kmap_atomic(from);
+	vto = kmap_atomic(to);
+	exchange_page(vto, vfrom);
+	kunmap_atomic(vto);
+	kunmap_atomic(vfrom);
+}
+
+static void __exchange_gigantic_page(struct page *dst, struct page *src,
+				     int nr_pages)
+{
+	int i;
+	struct page *dst_base = dst;
+	struct page *src_base = src;
+
+	for (i = 0; i < nr_pages; ) {
+		cond_resched();
+		exchange_highpage(dst, src);
+
+		i++;
+		dst = mem_map_next(dst, dst_base, i);
+		src = mem_map_next(src, src_base, i);
+	}
+}
+
+static void exchange_huge_page(struct page *dst, struct page *src)
+{
+	int i;
+	int nr_pages;
+
+	if (PageHuge(src)) {
+		/* hugetlbfs page */
+		struct hstate *h = page_hstate(src);
+
+		nr_pages = pages_per_huge_page(h);
+
+		if (unlikely(nr_pages > MAX_ORDER_NR_PAGES)) {
+			__exchange_gigantic_page(dst, src, nr_pages);
+			return;
+		}
+	} else {
+		/* thp page */
+		VM_BUG_ON(!PageTransHuge(src));
+		nr_pages = hpage_nr_pages(src);
+	}
+
+	for (i = 0; i < nr_pages; i++) {
+		cond_resched();
+		exchange_highpage(dst + i, src + i);
+	}
+}
+
+/*
+ * Copy the page to its new location without polluting cache
+ */
+static void exchange_page_flags(struct page *to_page, struct page *from_page)
+{
+	int from_cpupid, to_cpupid;
+	struct page_flags from_page_flags, to_page_flags;
+