Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions

2018-12-08 Thread Dan Williams
On Sat, Dec 8, 2018 at 8:48 AM Christoph Hellwig  wrote:
>
> On Sat, Dec 08, 2018 at 11:33:53AM -0500, Jerome Glisse wrote:
> > Patchset to use HMM inside nouveau have already been posted, some
> > of the bits have already made upstream and more are line up for
> > next merge window.
>
> Even with that it is a relatively fringe feature compared to making
> something like get_user_pages() that is literally used everywhere to
> actually work properly.
>
> So I think we need to kick out HMM here and just find another place for
> it to store data.
>
> And just to make clear that I'm not picking just on this - the same is
> true to just a little smaller extent for the pgmap...

Fair enough, I cringed as I took a full pointer for that use case; I'm
happy to look at ways of consolidating or dropping that usage.

Another fix that may put pressure on 'struct page' is resolving the
untenable situation of dax being incompatible with reflink, i.e.
reflink currently requires page-cache pages. Dave has talked about
silently establishing page-cache entries when a dax-page is cow'd for
reflink, but I wonder if we could go the other way and introduce the
mechanism of a page belonging to multiple mappings simultaneously and
managed by the filesystem.

Both HMM and ZONE_DEVICE in general are guilty of side-stepping the mm,
and I'm in favor of undoing that as much as possible.


Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions

2018-12-07 Thread Dan Williams
On Fri, Dec 7, 2018 at 4:53 PM John Hubbard  wrote:
>
> On 12/7/18 11:16 AM, Jerome Glisse wrote:
> > On Thu, Dec 06, 2018 at 06:45:49PM -0800, John Hubbard wrote:
[..]
> I see. OK, HMM has done an efficient job of mopping up unused fields, and now
> we are completely out of space. At this point, after thinking about it
> carefully, it seems clear that it's time for a single, new field:
>
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 5ed8f6292a53..1c789e324da8 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -182,6 +182,9 @@ struct page {
> /* Usage count. *DO NOT USE DIRECTLY*. See page_ref.h */
> atomic_t _refcount;
>
> +   /* DMA usage count. See get_user_pages*(), put_user_page*(). */
> +   atomic_t _dma_pinned_count;
> +
>  #ifdef CONFIG_MEMCG
> struct mem_cgroup *mem_cgroup;
>  #endif
>
>
> ...because after all, the reason this is so difficult is that this fix has to
> work in pretty much every configuration. get_user_pages() use is widespread,
> it's a very general facility, and...it needs fixing.  And we're out of space.

HMM seems entirely too greedy in this regard. Especially with zero
upstream users. When can we start to delete the pieces of HMM that
have no upstream consumers? I would think that would be 4.21 / 5.0 as
there needs to be some forcing function. We can always re-add pieces
of HMM with its users when / if they arrive.


Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions

2018-12-07 Thread Dan Williams
On Fri, Dec 7, 2018 at 11:16 AM Jerome Glisse  wrote:
>
> On Thu, Dec 06, 2018 at 06:45:49PM -0800, John Hubbard wrote:
> > On 12/4/18 5:57 PM, John Hubbard wrote:
> > > On 12/4/18 5:44 PM, Jerome Glisse wrote:
> > >> On Tue, Dec 04, 2018 at 05:15:19PM -0800, Matthew Wilcox wrote:
> > >>> On Tue, Dec 04, 2018 at 04:58:01PM -0800, John Hubbard wrote:
> > >>>> On 12/4/18 3:03 PM, Dan Williams wrote:
> > >>>>> Except the LRU fields are already in use for ZONE_DEVICE pages... how
> > >>>>> does this proposal interact with those?
> > >>>>
> > >>>> Very badly: page->pgmap and page->hmm_data both get corrupted. Is there
> > >>>> an entire use case I'm missing: calling get_user_pages() on ZONE_DEVICE
> > >>>> pages? Said another way: is it reasonable to disallow calling
> > >>>> get_user_pages() on ZONE_DEVICE pages?
> > >>>>
> > >>>> If we have to support get_user_pages() on ZONE_DEVICE pages, then the
> > >>>> whole LRU field approach is unusable.
> > >>>
> > >>> We just need to rearrange ZONE_DEVICE pages.  Please excuse the
> > >>> whitespace damage:
> > >>>
> > >>> +++ b/include/linux/mm_types.h
> > >>> @@ -151,10 +151,12 @@ struct page {
> > >>>  #endif
> > >>> };
> > >>> struct {/* ZONE_DEVICE pages */
> > >>> +   unsigned long _zd_pad_2;/* LRU */
> > >>> +   unsigned long _zd_pad_3;/* LRU */
> > >>> +   unsigned long _zd_pad_1;/* uses mapping */
> > >>> /** @pgmap: Points to the hosting device page map. */
> > >>> struct dev_pagemap *pgmap;
> > >>> unsigned long hmm_data;
> > >>> -   unsigned long _zd_pad_1;/* uses mapping */
> > >>> };
> > >>>
> > >>> /** @rcu_head: You can use this to free a page by RCU. */
> > >>>
> > >>> You don't use page->private or page->index, do you Dan?
> > >>
> > >> page->private and page->index are use by HMM DEVICE page.
> > >>
> > >
> > > OK, so for the ZONE_DEVICE + HMM case, that leaves just one field
> > > remaining for dma-pinned information. Which might work. To recap, we need:
> > >
> > > -- 1 bit for PageDmaPinned
> > > -- 1 bit, if using LRU field(s), for PageDmaPinnedWasLru.
> > > -- N bits for a reference count
> > >
> > > Those *could* be packed into a single 64-bit field, if really necessary.
> > >
> >
> > ...actually, this needs to work on 32-bit systems, as well. And HMM is
> > using a lot. However, it is still possible for this to work.
> >
> > Matthew, can I have that bit now please? I'm about out of options, and now
> > it will actually solve the problem here.
> >
> > Given:
> >
> > 1) It's cheap to know if a page is ZONE_DEVICE, and ZONE_DEVICE means not
> > on the LRU. That, in turn, means only 1 bit instead of 2 bits (in addition
> > to a counter) is required, for that case.
> >
> > 2) There is an independent bit available (according to Matthew).
> >
> > 3) HMM uses 4 of the 5 struct page fields, so only one field is available
> >    for a counter in that case.
>
> To expand on this, HMM private pages are used for anonymous pages,
> so the index and mapping fields have the values you expect for
> such pages. Down the road I also want to support file-backed
> pages with HMM private (mapping, private, index).
>
> For HMM public, both anonymous and file-backed pages are supported
> today (HMM public is only useful on platforms with something like
> OpenCAPI, CCIX or NVLink ... so PowerPC for now).
>
> > 4) get_user_pages() must work on ZONE_DEVICE and HMM pages.
>
> get_user_pages() only needs to work with HMM public pages, not the
> private ones, as we cannot allow _anyone_ to pin HMM private pages.

How does HMM enforce that? Because the kernel should not allow *any*
memory management facility to arbitrarily fail direct-I/O operations.
That's why CONFIG_FS_DAX_LIMITED is a temporary / experimental hack
for S390 and ZONE_DEVICE was invented to bypass that hack for X86 and
any arch that plans to properly support DAX. I would classify any
memory management that can't support direct-I/O in the same
"experimental" category.


Re: [RFC PATCH 02/14] mm/hms: heterogenenous memory system (HMS) documentation

2018-12-05 Thread Dan Williams
On Wed, Dec 5, 2018 at 3:27 PM Jerome Glisse  wrote:
>
> On Wed, Dec 05, 2018 at 04:23:42PM -0700, Logan Gunthorpe wrote:
> >
> >
> > On 2018-12-05 4:20 p.m., Jerome Glisse wrote:
> > > And my proposal is under /sys/bus and has symlinks to all the existing
> > > devices it aggregates in there.
> >
> > That's so not the point. Use the existing buses don't invent some
> > virtual tree. I don't know how many times I have to say this or in how
> > many ways. I'm not responding anymore.
>
> And how do I express interactions between different buses? I just
> do not see how to do that in the existing scheme. It would be like
> teaching each bus about all the other buses, versus having each bus
> register itself under a common framework and having all the interactions
> between buses mediated through that common framework, avoiding code
> duplication across buses.
>
> >
> > > So you agree with my proposal? A sysfs directory describing all the
> > > buses, how they are connected to each other, and what is connected
> > > to each of them (device, CPU, memory).
> >
> > I'm fine with the motivation. What I'm arguing against is the
> > implementation and the fact you have to create a whole grand new
> > userspace API and hierarchy to accomplish it.

Right, GPUs show up in /sys today. Don't register a whole new
hierarchy as an alias to what already exists, add a new attribute
scheme to the existing hierarchy. This is what the HMAT enabling is
doing, this is what p2pdma is doing.
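
For illustration, the "new attribute on the existing hierarchy" approach needs
very little code. A minimal sketch, with a hypothetical attribute name and
value (this is not HMAT or p2pdma code), hung off a device the driver already
registers:

	#include <linux/device.h>
	#include <linux/kernel.h>
	#include <linux/sysfs.h>

	/* Hypothetical read-only attribute describing a device's link bandwidth. */
	static ssize_t link_bandwidth_show(struct device *dev,
					   struct device_attribute *attr, char *buf)
	{
		/* where the number comes from is out of scope for this sketch */
		return sprintf(buf, "%u\n", 42U);
	}
	static DEVICE_ATTR_RO(link_bandwidth);

	/* In the driver's probe path, alongside the existing device registration: */
	/*	device_create_file(dev, &dev_attr_link_bandwidth); */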


[tip:x86/mm] x86/mm: Drop usage of __flush_tlb_all() in kernel_physical_mapping_init()

2018-12-05 Thread tip-bot for Dan Williams
Commit-ID:  ba6f508d0ec4adb09f0a939af6d5e19cdfa8667d
Gitweb: https://git.kernel.org/tip/ba6f508d0ec4adb09f0a939af6d5e19cdfa8667d
Author: Dan Williams 
AuthorDate: Tue, 4 Dec 2018 13:37:27 -0800
Committer:  Ingo Molnar 
CommitDate: Wed, 5 Dec 2018 09:03:07 +0100

x86/mm: Drop usage of __flush_tlb_all() in kernel_physical_mapping_init()

Commit:

  f77084d96355 "x86/mm/pat: Disable preemption around __flush_tlb_all()"

addressed a case where __flush_tlb_all() is called without preemption
being disabled. It also left a warning to catch other cases where
preemption is not disabled.

That warning triggers for the memory hotplug path which is also used for
persistent memory enabling:

 WARNING: CPU: 35 PID: 911 at ./arch/x86/include/asm/tlbflush.h:460
 RIP: 0010:__flush_tlb_all+0x1b/0x3a
 [..]
 Call Trace:
  phys_pud_init+0x29c/0x2bb
  kernel_physical_mapping_init+0xfc/0x219
  init_memory_mapping+0x1a5/0x3b0
  arch_add_memory+0x2c/0x50
  devm_memremap_pages+0x3aa/0x610
  pmem_attach_disk+0x585/0x700 [nd_pmem]

Andy wondered why a path that can sleep was using __flush_tlb_all() [1]
and Dave confirmed the expectation for TLB flush is for modifying /
invalidating existing PTE entries, but not initial population [2]. Drop
the usage of __flush_tlb_all() in phys_{p4d,pud,pmd}_init() on the
expectation that this path is only ever populating empty entries for the
linear map. Note, at linear map teardown time there is a call to the
all-cpu flush_tlb_all() to invalidate the removed mappings.

[1]: 
https://lkml.kernel.org/r/9dfd717d-857d-493d-a606-b635d72ba...@amacapital.net
[2]: https://lkml.kernel.org/r/749919a4-cdb1-48a3-adb4-adb81a5fa...@intel.com

[ mingo: Minor readability edits. ]

Suggested-by: Dave Hansen 
Reported-by: Andy Lutomirski 
Signed-off-by: Dan Williams 
Acked-by: Peter Zijlstra (Intel) 
Acked-by: Kirill A. Shutemov 
Cc: 
Cc: Borislav Petkov 
Cc: H. Peter Anvin 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Rik van Riel 
Cc: Sebastian Andrzej Siewior 
Cc: Thomas Gleixner 
Cc: dave.han...@intel.com
Fixes: f77084d96355 ("x86/mm/pat: Disable preemption around __flush_tlb_all()")
Link: 
http://lkml.kernel.org/r/154395944713.32119.15611079023837132638.st...@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Ingo Molnar 
---
 arch/x86/mm/init_64.c | 6 --
 1 file changed, 6 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 3e25ac2793ef..484c1b92f078 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -584,7 +584,6 @@ phys_pud_init(pud_t *pud_page, unsigned long paddr, 
unsigned long paddr_end,
   paddr_end,
   page_size_mask,
   prot);
-   __flush_tlb_all();
continue;
}
/*
@@ -627,7 +626,6 @@ phys_pud_init(pud_t *pud_page, unsigned long paddr, 
unsigned long paddr_end,
pud_populate_safe(&init_mm, pud, pmd);
spin_unlock(&init_mm.page_table_lock);
}
-   __flush_tlb_all();
 
update_page_count(PG_LEVEL_1G, pages);
 
@@ -668,7 +666,6 @@ phys_p4d_init(p4d_t *p4d_page, unsigned long paddr, 
unsigned long paddr_end,
paddr_last = phys_pud_init(pud, paddr,
paddr_end,
page_size_mask);
-   __flush_tlb_all();
continue;
}
 
@@ -680,7 +677,6 @@ phys_p4d_init(p4d_t *p4d_page, unsigned long paddr, 
unsigned long paddr_end,
p4d_populate_safe(&init_mm, p4d, pud);
spin_unlock(&init_mm.page_table_lock);
}
-   __flush_tlb_all();
 
return paddr_last;
 }
@@ -733,8 +729,6 @@ kernel_physical_mapping_init(unsigned long paddr_start,
if (pgd_changed)
sync_global_pgds(vaddr_start, vaddr_end - 1);
 
-   __flush_tlb_all();
-
return paddr_last;
 }
 


[tip:x86/mm] generic/pgtable: Introduce set_pte_safe()

2018-12-05 Thread tip-bot for Dan Williams
Commit-ID:  4369deaa2f022ef92da45a0e7eec8a4a52e8e8a4
Gitweb: https://git.kernel.org/tip/4369deaa2f022ef92da45a0e7eec8a4a52e8e8a4
Author: Dan Williams 
AuthorDate: Tue, 4 Dec 2018 13:37:16 -0800
Committer:  Ingo Molnar 
CommitDate: Wed, 5 Dec 2018 09:03:06 +0100

generic/pgtable: Introduce set_pte_safe()

Commit:

  f77084d96355 "x86/mm/pat: Disable preemption around __flush_tlb_all()"

introduced a warning to capture cases __flush_tlb_all() is called without
pre-emption disabled. It triggers a false positive warning in the memory
hotplug path.

On investigation it was found that the __flush_tlb_all() calls are not
necessary. However, they are only "not necessary" in practice provided
the ptes are being initially populated from the !present state.

Introduce set_pte_safe() as a sanity check that the pte is being updated
in a way that does not require a TLB flush.

Forgive the macro, the availability of the various set_pte() levels
is hit and miss across architectures.

[ mingo: Minor readability edits. ]

Suggested-by: Peter Zijlstra 
Suggested-by: Dave Hansen 
Signed-off-by: Dan Williams 
Acked-by: Peter Zijlstra (Intel) 
Acked-by: Kirill A. Shutemov 
Cc: Andy Lutomirski 
Cc: Borislav Petkov 
Cc: Dave Hansen 
Cc: H. Peter Anvin 
Cc: Linus Torvalds 
Cc: Rik van Riel 
Cc: Sebastian Andrzej Siewior 
Cc: Thomas Gleixner 
Link: https://lkml.kernel.org/r/279dadae-9148-465c-7ec6-3f37e026c...@intel.com
Signed-off-by: Ingo Molnar 
---
 include/asm-generic/pgtable.h | 38 ++
 1 file changed, 38 insertions(+)

diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index dae7f98babed..a9cac82e9a7a 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -400,6 +400,44 @@ static inline int pgd_same(pgd_t pgd_a, pgd_t pgd_b)
 }
 #endif
 
+/*
+ * Use set_p*_safe(), and elide TLB flushing, when confident that *no*
+ * TLB flush will be required as a result of the "set". For example, use
+ * in scenarios where it is known ahead of time that the routine is
+ * setting non-present entries, or re-setting an existing entry to the
+ * same value. Otherwise, use the typical "set" helpers and flush the
+ * TLB.
+ */
+#define set_pte_safe(ptep, pte) \
+({ \
+   WARN_ON_ONCE(pte_present(*ptep) && !pte_same(*ptep, pte)); \
+   set_pte(ptep, pte); \
+})
+
+#define set_pmd_safe(pmdp, pmd) \
+({ \
+   WARN_ON_ONCE(pmd_present(*pmdp) && !pmd_same(*pmdp, pmd)); \
+   set_pmd(pmdp, pmd); \
+})
+
+#define set_pud_safe(pudp, pud) \
+({ \
+   WARN_ON_ONCE(pud_present(*pudp) && !pud_same(*pudp, pud)); \
+   set_pud(pudp, pud); \
+})
+
+#define set_p4d_safe(p4dp, p4d) \
+({ \
+   WARN_ON_ONCE(p4d_present(*p4dp) && !p4d_same(*p4dp, p4d)); \
+   set_p4d(p4dp, p4d); \
+})
+
+#define set_pgd_safe(pgdp, pgd) \
+({ \
+   WARN_ON_ONCE(pgd_present(*pgdp) && !pgd_same(*pgdp, pgd)); \
+   set_pgd(pgdp, pgd); \
+})
+
 #ifndef __HAVE_ARCH_DO_SWAP_PAGE
 /*
  * Some architectures support metadata associated with a page. When a
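
As a usage note, a sketch of a hypothetical call site (not code from this
series): an init path that only ever fills in not-present entries can document
and assert that assumption by using set_pte_safe() instead of set_pte():

	/* Populating a brand-new entry: no TLB flush required, and the macro
	 * WARNs once if the pte was already present with a different value.
	 * This function and its arguments are illustrative only. */
	static void __init example_populate_pte(pte_t *ptep, unsigned long paddr,
						pgprot_t prot)
	{
		set_pte_safe(ptep, pfn_pte(paddr >> PAGE_SHIFT, prot));
	}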


[tip:x86/mm] x86/mm: Validate kernel_physical_mapping_init() PTE population

2018-12-05 Thread tip-bot for Dan Williams
Commit-ID:  0a9fe8ca844d43f3f547f0e166122b6048121c8f
Gitweb: https://git.kernel.org/tip/0a9fe8ca844d43f3f547f0e166122b6048121c8f
Author: Dan Williams 
AuthorDate: Tue, 4 Dec 2018 13:37:21 -0800
Committer:  Ingo Molnar 
CommitDate: Wed, 5 Dec 2018 09:03:06 +0100

x86/mm: Validate kernel_physical_mapping_init() PTE population

The usage of __flush_tlb_all() in the kernel_physical_mapping_init()
path is not necessary. In general flushing the TLB is not required when
updating an entry from the !present state. However, to give confidence
in the future removal of TLB flushing in this path, use the new
set_pte_safe() family of helpers to assert that the !present assumption
is true in this path.

[ mingo: Minor readability edits. ]

Suggested-by: Peter Zijlstra 
Suggested-by: Dave Hansen 
Signed-off-by: Dan Williams 
Acked-by: Kirill A. Shutemov 
Acked-by: Peter Zijlstra (Intel) 
Cc: Andy Lutomirski 
Cc: Borislav Petkov 
Cc: Dave Hansen 
Cc: H. Peter Anvin 
Cc: Linus Torvalds 
Cc: Rik van Riel 
Cc: Sebastian Andrzej Siewior 
Cc: Thomas Gleixner 
Link: 
http://lkml.kernel.org/r/154395944177.32119.8524957429632012270.st...@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Ingo Molnar 
---
 arch/x86/include/asm/pgalloc.h   | 27 +++
 arch/x86/mm/init_64.c| 24 
 include/asm-generic/5level-fixup.h   |  1 +
 include/asm-generic/pgtable-nop4d-hack.h |  1 +
 include/asm-generic/pgtable-nop4d.h  |  1 +
 include/asm-generic/pgtable-nopud.h  |  1 +
 6 files changed, 43 insertions(+), 12 deletions(-)

diff --git a/arch/x86/include/asm/pgalloc.h b/arch/x86/include/asm/pgalloc.h
index ec7f43327033..1ea41aaef68b 100644
--- a/arch/x86/include/asm/pgalloc.h
+++ b/arch/x86/include/asm/pgalloc.h
@@ -80,6 +80,13 @@ static inline void pmd_populate_kernel(struct mm_struct *mm,
set_pmd(pmd, __pmd(__pa(pte) | _PAGE_TABLE));
 }
 
+static inline void pmd_populate_kernel_safe(struct mm_struct *mm,
+  pmd_t *pmd, pte_t *pte)
+{
+   paravirt_alloc_pte(mm, __pa(pte) >> PAGE_SHIFT);
+   set_pmd_safe(pmd, __pmd(__pa(pte) | _PAGE_TABLE));
+}
+
 static inline void pmd_populate(struct mm_struct *mm, pmd_t *pmd,
struct page *pte)
 {
@@ -132,6 +139,12 @@ static inline void pud_populate(struct mm_struct *mm, 
pud_t *pud, pmd_t *pmd)
paravirt_alloc_pmd(mm, __pa(pmd) >> PAGE_SHIFT);
set_pud(pud, __pud(_PAGE_TABLE | __pa(pmd)));
 }
+
+static inline void pud_populate_safe(struct mm_struct *mm, pud_t *pud, pmd_t 
*pmd)
+{
+   paravirt_alloc_pmd(mm, __pa(pmd) >> PAGE_SHIFT);
+   set_pud_safe(pud, __pud(_PAGE_TABLE | __pa(pmd)));
+}
 #endif /* CONFIG_X86_PAE */
 
 #if CONFIG_PGTABLE_LEVELS > 3
@@ -141,6 +154,12 @@ static inline void p4d_populate(struct mm_struct *mm, 
p4d_t *p4d, pud_t *pud)
set_p4d(p4d, __p4d(_PAGE_TABLE | __pa(pud)));
 }
 
+static inline void p4d_populate_safe(struct mm_struct *mm, p4d_t *p4d, pud_t 
*pud)
+{
+   paravirt_alloc_pud(mm, __pa(pud) >> PAGE_SHIFT);
+   set_p4d_safe(p4d, __p4d(_PAGE_TABLE | __pa(pud)));
+}
+
 static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr)
 {
gfp_t gfp = GFP_KERNEL_ACCOUNT;
@@ -173,6 +192,14 @@ static inline void pgd_populate(struct mm_struct *mm, 
pgd_t *pgd, p4d_t *p4d)
set_pgd(pgd, __pgd(_PAGE_TABLE | __pa(p4d)));
 }
 
+static inline void pgd_populate_safe(struct mm_struct *mm, pgd_t *pgd, p4d_t 
*p4d)
+{
+   if (!pgtable_l5_enabled())
+   return;
+   paravirt_alloc_p4d(mm, __pa(p4d) >> PAGE_SHIFT);
+   set_pgd_safe(pgd, __pgd(_PAGE_TABLE | __pa(p4d)));
+}
+
 static inline p4d_t *p4d_alloc_one(struct mm_struct *mm, unsigned long addr)
 {
gfp_t gfp = GFP_KERNEL_ACCOUNT;
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 5fab264948c2..3e25ac2793ef 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -432,7 +432,7 @@ phys_pte_init(pte_t *pte_page, unsigned long paddr, 
unsigned long paddr_end,
 E820_TYPE_RAM) &&
!e820__mapped_any(paddr & PAGE_MASK, paddr_next,
 E820_TYPE_RESERVED_KERN))
-   set_pte(pte, __pte(0));
+   set_pte_safe(pte, __pte(0));
continue;
}
 
@@ -452,7 +452,7 @@ phys_pte_init(pte_t *pte_page, unsigned long paddr, 
unsigned long paddr_end,
pr_info("   pte=%p addr=%lx pte=%016lx\n", pte, paddr,
pfn_pte(paddr >> PAGE_SHIFT, PAGE_KERNEL).pte);
pages++;
-   set_pte(pte, pfn_pte(paddr >> PAGE_SHIFT, prot));
+   set_pte_safe(pte, pfn_pte(paddr >> PAGE_SHIFT, prot));
paddr_last = (paddr &

[tip:x86/mm] generic/pgtable: Introduce {p4d,pgd}_same()

2018-12-05 Thread tip-bot for Dan Williams
Commit-ID:  0cebbb60f759a709dabb3c87b9704f9844878850
Gitweb: https://git.kernel.org/tip/0cebbb60f759a709dabb3c87b9704f9844878850
Author: Dan Williams 
AuthorDate: Tue, 4 Dec 2018 13:37:11 -0800
Committer:  Ingo Molnar 
CommitDate: Wed, 5 Dec 2018 09:03:06 +0100

generic/pgtable: Introduce {p4d,pgd}_same()

In preparation for introducing '_safe' versions of page table entry 'set'
helpers, introduce generic versions of p4d_same() and pgd_same().

Signed-off-by: Dan Williams 
Acked-by: Kirill A. Shutemov 
Cc: Andy Lutomirski 
Cc: Borislav Petkov 
Cc: Dave Hansen 
Cc: H. Peter Anvin 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Rik van Riel 
Cc: Sebastian Andrzej Siewior 
Cc: Thomas Gleixner 
Link: 
http://lkml.kernel.org/r/154395943153.32119.1733586547359626311.st...@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Ingo Molnar 
---
 include/asm-generic/pgtable.h | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index eea50ef8b8cd..dae7f98babed 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -386,6 +386,20 @@ static inline int pud_same(pud_t pud_a, pud_t pud_b)
 }
 #endif
 
+#ifndef __HAVE_ARCH_P4D_SAME
+static inline int p4d_same(p4d_t p4d_a, p4d_t p4d_b)
+{
+   return p4d_val(p4d_a) == p4d_val(p4d_b);
+}
+#endif
+
+#ifndef __HAVE_ARCH_PGD_SAME
+static inline int pgd_same(pgd_t pgd_a, pgd_t pgd_b)
+{
+   return pgd_val(pgd_a) == pgd_val(pgd_b);
+}
+#endif
+
 #ifndef __HAVE_ARCH_DO_SWAP_PAGE
 /*
  * Some architectures support metadata associated with a page. When a


[tip:x86/mm] generic/pgtable: Make {pmd, pud}_same() unconditionally available

2018-12-05 Thread tip-bot for Dan Williams
Commit-ID:  c683c37cd13246941924c48f6c6a9863425e0cec
Gitweb: https://git.kernel.org/tip/c683c37cd13246941924c48f6c6a9863425e0cec
Author: Dan Williams 
AuthorDate: Tue, 4 Dec 2018 13:37:06 -0800
Committer:  Ingo Molnar 
CommitDate: Wed, 5 Dec 2018 09:03:05 +0100

generic/pgtable: Make {pmd, pud}_same() unconditionally available

In preparation for {pmd,pud}_same() to be used outside of transparent
huge page code paths, make them unconditionally available. This enables
them to be used in the definition of a new family of
set_{pte,pmd,pud,p4d,pgd}_safe() helpers.

Signed-off-by: Dan Williams 
Acked-by: Kirill A. Shutemov 
Cc: Andy Lutomirski 
Cc: Borislav Petkov 
Cc: Dave Hansen 
Cc: H. Peter Anvin 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Rik van Riel 
Cc: Sebastian Andrzej Siewior 
Cc: Thomas Gleixner 
Link: 
http://lkml.kernel.org/r/154395942644.32119.10238934183949418128.st...@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Ingo Molnar 
---
 include/asm-generic/pgtable.h | 14 --
 1 file changed, 14 deletions(-)

diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 359fb935ded6..eea50ef8b8cd 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -375,7 +375,6 @@ static inline int pte_unused(pte_t pte)
 #endif
 
 #ifndef __HAVE_ARCH_PMD_SAME
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
 static inline int pmd_same(pmd_t pmd_a, pmd_t pmd_b)
 {
return pmd_val(pmd_a) == pmd_val(pmd_b);
@@ -385,19 +384,6 @@ static inline int pud_same(pud_t pud_a, pud_t pud_b)
 {
return pud_val(pud_a) == pud_val(pud_b);
 }
-#else /* CONFIG_TRANSPARENT_HUGEPAGE */
-static inline int pmd_same(pmd_t pmd_a, pmd_t pmd_b)
-{
-   BUILD_BUG();
-   return 0;
-}
-
-static inline int pud_same(pud_t pud_a, pud_t pud_b)
-{
-   BUILD_BUG();
-   return 0;
-}
-#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 #endif
 
 #ifndef __HAVE_ARCH_DO_SWAP_PAGE


Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions

2018-12-04 Thread Dan Williams
On Tue, Dec 4, 2018 at 5:15 PM Matthew Wilcox  wrote:
>
> On Tue, Dec 04, 2018 at 04:58:01PM -0800, John Hubbard wrote:
> > On 12/4/18 3:03 PM, Dan Williams wrote:
> > > Except the LRU fields are already in use for ZONE_DEVICE pages... how
> > > does this proposal interact with those?
> >
> > Very badly: page->pgmap and page->hmm_data both get corrupted. Is there an
> > entire use case I'm missing: calling get_user_pages() on ZONE_DEVICE pages?
> > Said another way: is it reasonable to disallow calling get_user_pages() on
> > ZONE_DEVICE pages?
> >
> > If we have to support get_user_pages() on ZONE_DEVICE pages, then the whole
> > LRU field approach is unusable.
>
> We just need to rearrange ZONE_DEVICE pages.  Please excuse the whitespace
> damage:
>
> +++ b/include/linux/mm_types.h
> @@ -151,10 +151,12 @@ struct page {
>  #endif
> };
> struct {/* ZONE_DEVICE pages */
> +   unsigned long _zd_pad_2;/* LRU */
> +   unsigned long _zd_pad_3;/* LRU */
> +   unsigned long _zd_pad_1;/* uses mapping */
> /** @pgmap: Points to the hosting device page map. */
> struct dev_pagemap *pgmap;
> unsigned long hmm_data;
> -   unsigned long _zd_pad_1;/* uses mapping */
> };
>
> /** @rcu_head: You can use this to free a page by RCU. */
>
> You don't use page->private or page->index, do you Dan?

I don't use page->private, but page->index is used by the
memory-failure path to do an rmap.


Re: [RFC PATCH 02/14] mm/hms: heterogenenous memory system (HMS) documentation

2018-12-04 Thread Dan Williams
On Tue, Dec 4, 2018 at 5:15 PM Logan Gunthorpe  wrote:
>
>
>
> On 2018-12-04 4:56 p.m., Jerome Glisse wrote:
> > One example I have is 4 nodes (CPU sockets), each node with 8 GPUs, and
> > two 8-GPU nodes connected to each other with a fast mesh (i.e. each GPU
> > can peer-to-peer with every other GPU at the same bandwidth). Then these
> > 2 blocks are connected to the other block through a shared link.
> >
> > So it looks like:
> > SOCKET0SOCKET1-SOCKET2SOCKET3
> > |  |   |  |
> > S0-GPU0S1-GPU0 S2-GPU0S1-GPU0
> > || \\//|| \\//
> > || //\\|| //\\
> > ......-......
> > || \\//|| \\//
> > || //\\|| //\\
> > S0-GPU7S1-GPU7 S2-GPU7S3-GPU7
>
> Well the existing NUMA node stuff tells userspace which GPU belongs to
> which socket (every device in sysfs already has a numa_node attribute).
> And if that's not good enough we should work to improve how that works
> for all devices. This problem isn't specific to GPUS or devices with
> memory and seems rather orthogonal to an API to bind to device memory.
>
> > How would the above example look? I fail to see how to do it
> > inside the current sysfs. Maybe by creating multiple virtual devices for
> > each of the interconnects? So something like
> >
> > link0 -> device:00 which itself has S0-GPU0 ... S0-GPU7 has child
> > link1 -> device:01 which itself has S1-GPU0 ... S1-GPU7 has child
> > link2 -> device:02 which itself has S2-GPU0 ... S2-GPU7 has child
> > link3 -> device:03 which itself has S3-GPU0 ... S3-GPU7 has child
>
> I think the "links" between GPUs themselves would be a bus. In the same
> way a NUMA node is a bus. Each device in sysfs would then need a
> directory or something to describe what "link bus(es)" they are a part
> of. Though there are other ways to do this: a GPU driver could simply
> create symlinks to other GPUs inside a "neighbours" directory under the
> device path or something like that.
>
> The point is that this seems like it is specific to GPUs and could
> easily be solved in the GPU community without any new universal concepts
> or big APIs.
>
> And for applications that need topology information, a lot of it is
> already there, we just need to fill in the gaps with small changes that
> would be much less controversial. Then if you want to create a libhms
> (or whatever) to help applications parse this information out of
> existing sysfs that would make sense.
>
> > My proposal is to do HMS behind staging for a while and also avoid
> > any disruption to existing code paths. See whether people living on the
> > bleeding edge get interested in that information. If not, then
> > I can strip my thing down to the bare minimum, which is about device
> > memory.
>
> This isn't my area or decision to make, but it seemed to me like this is
> not what staging is for. Staging is for introducing *drivers* that
> aren't up to the Kernel's quality level and they all reside under the
> drivers/staging path. It's not meant to introduce experimental APIs
> around the kernel that might be revoked at anytime.
>
> DAX introduced itself by marking the config option as EXPERIMENTAL and
> printing warnings to dmesg when someone tries to use it. But, to my
> knowledge, DAX also wasn't creating APIs with the intention of changing
> or revoking them -- it was introducing features using largely existing
> APIs that had many broken corner cases.
>
> Do you know of any precedents where big APIs were introduced and then
> later revoked or radically changed like you are proposing to do?

This came up before for APIs even better defined than HMS and with
more limited scope, i.e. experimental ABI availability only for -rc
kernels. Linus said this:

"There are no loopholes. No "but it's been only one release". No, no,
no. The whole point is that users are supposed to be able to *trust*
the kernel. If we do something, we keep on doing it.

And if it makes it harder to add new user-visible interfaces, then
that's a *good* thing." [1]

The takeaway being don't land work-in-progress ABIs in the kernel.
Once an application depends on it, there are no more incompatible
changes possible regardless of the warnings, experimental notices, or
"staging" designation. DAX is experimental because there are cases
where it currently does not work with respect to other kernel
features, like xfs-reflink and RDMA. The plan is to fix those, not continue
to hide behind an experimental designation, and fix them in a way that
preserves the user visible behavior that has already been exposed,
i.e. no regressions.

[1]: 
https://lists.linuxfoundation.org/pipermail/ksummit-discuss/2017-August/004742.html
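
For what it's worth, the per-device "neighbours" symlink idea Logan floats
above needs no new core API. A minimal sketch, with hypothetical names:

	#include <linux/device.h>
	#include <linux/kobject.h>
	#include <linux/sysfs.h>

	/*
	 * "neighbours" would be created once per GPU with something like
	 * kobject_create_and_add("neighbours", &gpu_dev->kobj); each peer is
	 * then just a sysfs symlink under it, e.g.
	 * /sys/devices/.../gpu0/neighbours/gpu1. Nothing here is a new
	 * userspace hierarchy, only attributes on the existing one.
	 */
	static int gpu_add_neighbour(struct kobject *neighbours, struct device *peer)
	{
		return sysfs_create_link(neighbours, &peer->kobj, dev_name(peer));
	}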


Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions

2018-12-04 Thread Dan Williams
On Tue, Dec 4, 2018 at 4:58 PM John Hubbard  wrote:
>
> On 12/4/18 3:03 PM, Dan Williams wrote:
> > On Tue, Dec 4, 2018 at 1:56 PM John Hubbard  wrote:
[..]
> > Ok, sorry, I mis-remembered. So, you're effectively trying to capture
> > the end of the page pin event separate from the final 'put' of the
> > page? Makes sense.
> >
>
> Yes, that's it exactly.
>
> >> I was not able to actually find any place where a single additional page
> >> bit would help our situation, which is why this still uses LRU fields for
> >> both the two bits required (the RFC [1] still applies), and the 
> >> dma_pinned_count.
> >
> > Except the LRU fields are already in use for ZONE_DEVICE pages... how
> > does this proposal interact with those?
>
> Very badly: page->pgmap and page->hmm_data both get corrupted. Is there an
> entire use case I'm missing: calling get_user_pages() on ZONE_DEVICE pages?
> Said another way: is it reasonable to disallow calling get_user_pages() on
> ZONE_DEVICE pages?
>
> If we have to support get_user_pages() on ZONE_DEVICE pages, then the whole
> LRU field approach is unusable.

Unfortunately, the entire motivation for ZONE_DEVICE was to support
get_user_pages() for persistent memory.


Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions

2018-12-04 Thread Dan Williams
On Tue, Dec 4, 2018 at 4:37 PM Jerome Glisse  wrote:
>
> On Tue, Dec 04, 2018 at 03:03:02PM -0800, Dan Williams wrote:
> > On Tue, Dec 4, 2018 at 1:56 PM John Hubbard  wrote:
> > >
> > > On 12/4/18 12:28 PM, Dan Williams wrote:
> > > > On Mon, Dec 3, 2018 at 4:17 PM  wrote:
> > > >>
> > > >> From: John Hubbard 
> > > >>
> > > >> Introduces put_user_page(), which simply calls put_page().
> > > >> This provides a way to update all get_user_pages*() callers,
> > > >> so that they call put_user_page(), instead of put_page().
> > > >>
> > > >> Also introduces put_user_pages(), and a few dirty/locked variations,
> > > >> as a replacement for release_pages(), and also as a replacement
> > > >> for open-coded loops that release multiple pages.
> > > >> These may be used for subsequent performance improvements,
> > > >> via batching of pages to be released.
> > > >>
> > > >> This is the first step of fixing the problem described in [1]. The 
> > > >> steps
> > > >> are:
> > > >>
> > > >> 1) (This patch): provide put_user_page*() routines, intended to be used
> > > >>for releasing pages that were pinned via get_user_pages*().
> > > >>
> > > >> 2) Convert all of the call sites for get_user_pages*(), to
> > > >>invoke put_user_page*(), instead of put_page(). This involves 
> > > >> dozens of
> > > >>call sites, and will take some time.
> > > >>
> > > >> 3) After (2) is complete, use get_user_pages*() and put_user_page*() to
> > > >>implement tracking of these pages. This tracking will be separate 
> > > >> from
> > > >>the existing struct page refcounting.
> > > >>
> > > >> 4) Use the tracking and identification of these pages, to implement
> > > >>special handling (especially in writeback paths) when the pages are
> > > >>backed by a filesystem. Again, [1] provides details as to why that 
> > > >> is
> > > >>desirable.
> > > >
> > > > I thought at Plumbers we talked about using a page bit to tag pages
> > > > that have had their reference count elevated by get_user_pages()? That
> > > > way there is no need to distinguish put_page() from put_user_page() it
> > > > just happens internally to put_page(). At the conference Matthew was
> > > > offering to free up a page bit for this purpose.
> > > >
> > >
> > > ...but then, upon further discussion in that same session, we realized 
> > > that
> > > that doesn't help. You need a reference count. Otherwise a random put_page
> > > could affect your dma-pinned pages, etc, etc.
> >
> > Ok, sorry, I mis-remembered. So, you're effectively trying to capture
> > the end of the page pin event separate from the final 'put' of the
> > page? Makes sense.
> >
> > > I was not able to actually find any place where a single additional page
> > > bit would help our situation, which is why this still uses LRU fields for
> > > both the two bits required (the RFC [1] still applies), and the 
> > > dma_pinned_count.
> >
> > Except the LRU fields are already in use for ZONE_DEVICE pages... how
> > does this proposal interact with those?
> >
> > > [1] https://lore.kernel.org/r/20181110085041.10071-7-jhubb...@nvidia.com
> > >
> > > >> [1] https://lwn.net/Articles/753027/ : "The Trouble with 
> > > >> get_user_pages()"
> > > >>
> > > >> Reviewed-by: Jan Kara 
> > > >
> > > > Wish, you could have been there Jan. I'm missing why it's safe to
> > > > assume that a single put_user_page() is paired with a get_user_page()?
> > > >
> > >
> > > A put_user_page() per page, or a put_user_pages() for an array of pages. 
> > > See
> > > patch 0002 for several examples.
> >
> > Yes, however I was more concerned about validation and trying to
> > locate missed places where put_page() is used instead of
> > put_user_page().
> >
> > It would be interesting to see if we could have a debug mode where
> > get_user_pages() returned dynamically allocated pages from a known
> > address range and catch drivers that operate on a user-pinned page
> > without using the proper helper to 'put' it. I think we might also
> > need a ref_user_page() for drivers that may do their own get_page()
> > and expect the dma_pinned_count to also increase.
>
> Total crazy idea for this, but this is the right time of day
> for this (for me at least it is beer time :)). What about mapping
> all struct pages into two different ranges of kernel virtual address
> space, and when get_user_pages() is used it returns a pointer from the
> second range of kernel virtual addresses to the struct page. Then in
> put_page you know for sure if the code putting the page got it from GUP
> or from somewhere else. page_to_pfn() would need some trickery to
> handle that.

Yes, exactly what I was thinking, if only as a debug mode since
instrumenting every pfn/page translation would be expensive.

> Dunno if we are running out of kernel virtual address space (outside
> of 32-bit, which I believe we are trying to shoot down quietly behind
> the bar).

There's room, KASAN is in a roughly similar place.
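
A userspace toy of the aliasing idea, purely to illustrate the provenance
check. In the kernel the two ranges would be virtual aliases of the same
struct pages and page_to_pfn()/pfn_to_page() would need the trickery Jerome
mentions; all names below are made up.

	/* Toy model of GUP returning pointers from a second "alias" range. */
	#include <stdio.h>

	struct page { int refcount; int pin_count; };

	#define NR_PAGES 8
	static struct page pages[NR_PAGES];      /* the "normal" map */
	static struct page pages_gup[NR_PAGES];  /* stand-in for the GUP alias range */

	static int is_gup_alias(struct page *p)
	{
		return p >= pages_gup && p < pages_gup + NR_PAGES;
	}

	/* get_user_pages() would hand out pointers from the alias range */
	static struct page *gup_get(int i)
	{
		pages[i].refcount++;
		pages[i].pin_count++;
		return &pages_gup[i];
	}

	/* put_page() can now tell how the reference was obtained */
	static void put_page_demo(struct page *p)
	{
		if (is_gup_alias(p)) {
			p = &pages[p - pages_gup];  /* back to the canonical page */
			p->pin_count--;
		}
		p->refcount--;
	}

	int main(void)
	{
		struct page *p = gup_get(3);

		put_page_demo(p);
		printf("refcount=%d pin_count=%d\n",
		       pages[3].refcount, pages[3].pin_count);
		return 0;
	}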


Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions

2018-12-04 Thread Dan Williams
On Tue, Dec 4, 2018 at 1:56 PM John Hubbard  wrote:
>
> On 12/4/18 12:28 PM, Dan Williams wrote:
> > On Mon, Dec 3, 2018 at 4:17 PM  wrote:
> >>
> >> From: John Hubbard 
> >>
> >> Introduces put_user_page(), which simply calls put_page().
> >> This provides a way to update all get_user_pages*() callers,
> >> so that they call put_user_page(), instead of put_page().
> >>
> >> Also introduces put_user_pages(), and a few dirty/locked variations,
> >> as a replacement for release_pages(), and also as a replacement
> >> for open-coded loops that release multiple pages.
> >> These may be used for subsequent performance improvements,
> >> via batching of pages to be released.
> >>
> >> This is the first step of fixing the problem described in [1]. The steps
> >> are:
> >>
> >> 1) (This patch): provide put_user_page*() routines, intended to be used
> >>for releasing pages that were pinned via get_user_pages*().
> >>
> >> 2) Convert all of the call sites for get_user_pages*(), to
> >>invoke put_user_page*(), instead of put_page(). This involves dozens of
> >>call sites, and will take some time.
> >>
> >> 3) After (2) is complete, use get_user_pages*() and put_user_page*() to
> >>implement tracking of these pages. This tracking will be separate from
> >>the existing struct page refcounting.
> >>
> >> 4) Use the tracking and identification of these pages, to implement
> >>special handling (especially in writeback paths) when the pages are
> >>backed by a filesystem. Again, [1] provides details as to why that is
> >>desirable.
> >
> > I thought at Plumbers we talked about using a page bit to tag pages
> > that have had their reference count elevated by get_user_pages()? That
> > way there is no need to distinguish put_page() from put_user_page() it
> > just happens internally to put_page(). At the conference Matthew was
> > offering to free up a page bit for this purpose.
> >
>
> ...but then, upon further discussion in that same session, we realized that
> that doesn't help. You need a reference count. Otherwise a random put_page
> could affect your dma-pinned pages, etc, etc.

Ok, sorry, I mis-remembered. So, you're effectively trying to capture
the end of the page pin event separate from the final 'put' of the
page? Makes sense.

> I was not able to actually find any place where a single additional page
> bit would help our situation, which is why this still uses LRU fields for
> both the two bits required (the RFC [1] still applies), and the 
> dma_pinned_count.

Except the LRU fields are already in use for ZONE_DEVICE pages... how
does this proposal interact with those?

> [1] https://lore.kernel.org/r/20181110085041.10071-7-jhubb...@nvidia.com
>
> >> [1] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()"
> >>
> >> Reviewed-by: Jan Kara 
> >
> > Wish, you could have been there Jan. I'm missing why it's safe to
> > assume that a single put_user_page() is paired with a get_user_page()?
> >
>
> A put_user_page() per page, or a put_user_pages() for an array of pages. See
> patch 0002 for several examples.

Yes, however I was more concerned about validation and trying to
locate missed places where put_page() is used instead of
put_user_page().

It would be interesting to see if we could have a debug mode where
get_user_pages() returned dynamically allocated pages from a known
address range and catch drivers that operate on a user-pinned page
without using the proper helper to 'put' it. I think we might also
need a ref_user_page() for drivers that may do their own get_page()
and expect the dma_pinned_count to also increase.


[PATCH v3 3/5] generic/pgtable: Introduce set_pte_safe()

2018-12-04 Thread Dan Williams
Commit f77084d96355 "x86/mm/pat: Disable preemption around
__flush_tlb_all()" introduced a warning to capture cases
__flush_tlb_all() is called without pre-emption disabled. It triggers a
false positive warning in the memory hotplug path. On investigation it
was found that the __flush_tlb_all() calls are not necessary. However,
they are only "not necessary" in practice provided the ptes are being
initially populated from the !present state. Introduce set_pte_safe() as
a sanity check that the pte is being updated in a way that does not
require a tlb flush.

Forgive the macro, the availability of the various set_pte() levels
is hit and miss across architectures.

Link: https://lkml.kernel.org/r/279dadae-9148-465c-7ec6-3f37e026c...@intel.com
Cc: Sebastian Andrzej Siewior 
Cc: Thomas Gleixner 
Cc: Borislav Petkov 
Suggested-by: Peter Zijlstra 
Suggested-by: Dave Hansen 
Acked-by: Peter Zijlstra (Intel) 
Acked-by: Kirill A. Shutemov 
Signed-off-by: Dan Williams 
---
 include/asm-generic/pgtable.h |   38 ++
 1 file changed, 38 insertions(+)

diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index dae7f98babed..a9cac82e9a7a 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -400,6 +400,44 @@ static inline int pgd_same(pgd_t pgd_a, pgd_t pgd_b)
 }
 #endif
 
+/*
+ * Use set_p*_safe(), and elide TLB flushing, when confident that *no*
+ * TLB flush will be required as a result of the "set". For example, use
+ * in scenarios where it is known ahead of time that the routine is
+ * setting non-present entries, or re-setting an existing entry to the
+ * same value. Otherwise, use the typical "set" helpers and flush the
+ * TLB.
+ */
+#define set_pte_safe(ptep, pte) \
+({ \
+   WARN_ON_ONCE(pte_present(*ptep) && !pte_same(*ptep, pte)); \
+   set_pte(ptep, pte); \
+})
+
+#define set_pmd_safe(pmdp, pmd) \
+({ \
+   WARN_ON_ONCE(pmd_present(*pmdp) && !pmd_same(*pmdp, pmd)); \
+   set_pmd(pmdp, pmd); \
+})
+
+#define set_pud_safe(pudp, pud) \
+({ \
+   WARN_ON_ONCE(pud_present(*pudp) && !pud_same(*pudp, pud)); \
+   set_pud(pudp, pud); \
+})
+
+#define set_p4d_safe(p4dp, p4d) \
+({ \
+   WARN_ON_ONCE(p4d_present(*p4dp) && !p4d_same(*p4dp, p4d)); \
+   set_p4d(p4dp, p4d); \
+})
+
+#define set_pgd_safe(pgdp, pgd) \
+({ \
+   WARN_ON_ONCE(pgd_present(*pgdp) && !pgd_same(*pgdp, pgd)); \
+   set_pgd(pgdp, pgd); \
+})
+
 #ifndef __HAVE_ARCH_DO_SWAP_PAGE
 /*
  * Some architectures support metadata associated with a page. When a



[PATCH v3 1/5] generic/pgtable: Make {pmd, pud}_same() unconditionally available

2018-12-04 Thread Dan Williams
In preparation for {pmd,pud}_same() to be used outside of transparent
huge page code paths, make them unconditionally available. This enables
them to be used in the definition of a new family of
set_{pte,pmd,pud,p4d,pgd}_safe() helpers.

Cc: Sebastian Andrzej Siewior 
Cc: Thomas Gleixner 
Cc: Borislav Petkov 
Cc: Peter Zijlstra 
Cc: Dave Hansen 
Acked-by: Kirill A. Shutemov 
Signed-off-by: Dan Williams 
---
 include/asm-generic/pgtable.h |   14 --
 1 file changed, 14 deletions(-)

diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 359fb935ded6..eea50ef8b8cd 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -375,7 +375,6 @@ static inline int pte_unused(pte_t pte)
 #endif
 
 #ifndef __HAVE_ARCH_PMD_SAME
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
 static inline int pmd_same(pmd_t pmd_a, pmd_t pmd_b)
 {
return pmd_val(pmd_a) == pmd_val(pmd_b);
@@ -385,19 +384,6 @@ static inline int pud_same(pud_t pud_a, pud_t pud_b)
 {
return pud_val(pud_a) == pud_val(pud_b);
 }
-#else /* CONFIG_TRANSPARENT_HUGEPAGE */
-static inline int pmd_same(pmd_t pmd_a, pmd_t pmd_b)
-{
-   BUILD_BUG();
-   return 0;
-}
-
-static inline int pud_same(pud_t pud_a, pud_t pud_b)
-{
-   BUILD_BUG();
-   return 0;
-}
-#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 #endif
 
 #ifndef __HAVE_ARCH_DO_SWAP_PAGE



[PATCH v3 2/5] generic/pgtable: Introduce {p4d,pgd}_same()

2018-12-04 Thread Dan Williams
In preparation for introducing '_safe' versions of page table entry 'set'
helpers, introduce generic versions of p4d_same() and pgd_same().

Cc: Sebastian Andrzej Siewior 
Cc: Thomas Gleixner 
Cc: Borislav Petkov 
Cc: Peter Zijlstra 
Cc: Dave Hansen 
Acked-by: Kirill A. Shutemov 
Signed-off-by: Dan Williams 
---
 include/asm-generic/pgtable.h |   14 ++
 1 file changed, 14 insertions(+)

diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index eea50ef8b8cd..dae7f98babed 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -386,6 +386,20 @@ static inline int pud_same(pud_t pud_a, pud_t pud_b)
 }
 #endif
 
+#ifndef __HAVE_ARCH_P4D_SAME
+static inline int p4d_same(p4d_t p4d_a, p4d_t p4d_b)
+{
+   return p4d_val(p4d_a) == p4d_val(p4d_b);
+}
+#endif
+
+#ifndef __HAVE_ARCH_PGD_SAME
+static inline int pgd_same(pgd_t pgd_a, pgd_t pgd_b)
+{
+   return pgd_val(pgd_a) == pgd_val(pgd_b);
+}
+#endif
+
 #ifndef __HAVE_ARCH_DO_SWAP_PAGE
 /*
  * Some architectures support metadata associated with a page. When a



[PATCH v3 0/5] x86/mm: Drop usage of __flush_tlb_all() in kernel_physical_mapping_init()

2018-12-04 Thread Dan Williams
Changes since v2 [1]:
* Fix links in the changelogs to reference lkml.kernel.org instead of
  lore.kernel.org (Peter)
* Collect Acked-by's from Kirill and Peter.
* Add more Cc's for patches 1 and 2.
* Strengthen the lead-in comment for the set_p*_safe() helpers (Dave)

[1]: https://lkml.org/lkml/2018/12/1/358

---

From patch 5:

Commit f77084d96355 "x86/mm/pat: Disable preemption around
__flush_tlb_all()" addressed a case where __flush_tlb_all() is called
without preemption being disabled. It also left a warning to catch other
cases where preemption is not disabled. That warning triggers for the
memory hotplug path which is also used for persistent memory enabling:

 WARNING: CPU: 35 PID: 911 at ./arch/x86/include/asm/tlbflush.h:460
 RIP: 0010:__flush_tlb_all+0x1b/0x3a
 [..]
 Call Trace:
  phys_pud_init+0x29c/0x2bb
  kernel_physical_mapping_init+0xfc/0x219
  init_memory_mapping+0x1a5/0x3b0
  arch_add_memory+0x2c/0x50
  devm_memremap_pages+0x3aa/0x610
  pmem_attach_disk+0x585/0x700 [nd_pmem]

Andy wondered why a path that can sleep was using __flush_tlb_all() [1]
and Dave confirmed the expectation for TLB flush is for modifying /
invalidating existing pte entries, but not initial population [2]. Drop
the usage of __flush_tlb_all() in phys_{p4d,pud,pmd}_init() on the
expectation that this path is only ever populating empty entries for the
linear map. Note, at linear map teardown time there is a call to the
all-cpu flush_tlb_all() to invalidate the removed mappings.

Additionally, Dave wanted some runtime assurances that
kernel_physical_mapping_init() is only populating and not changing
existing page table entries. Patches 1 - 4 are implementing a new
set_pte() family of helpers for that purpose.

Patch 5 is tagged for -stable because the false positive warning is now
showing up on 4.19-stable kernels. Patches 1 - 4 are not tagged for
-stable, but if the sanity checking is needed please mark them as such.

The hang that was observed while developing the sanity checking
implementation was resolved by Peter's suggestion to not trigger when
the same pte value is being rewritten.

---

Dan Williams (5):
  generic/pgtable: Make {pmd,pud}_same() unconditionally available
  generic/pgtable: Introduce {p4d,pgd}_same()
  generic/pgtable: Introduce set_pte_safe()
  x86/mm: Validate kernel_physical_mapping_init() pte population
  x86/mm: Drop usage of __flush_tlb_all() in kernel_physical_mapping_init()


 arch/x86/include/asm/pgalloc.h   |   27 ++
 arch/x86/mm/init_64.c|   30 ++--
 include/asm-generic/5level-fixup.h   |1 +
 include/asm-generic/pgtable-nop4d-hack.h |1 +
 include/asm-generic/pgtable-nop4d.h  |1 +
 include/asm-generic/pgtable-nopud.h  |1 +
 include/asm-generic/pgtable.h|   56 +-
 7 files changed, 90 insertions(+), 27 deletions(-)


[PATCH v3 5/5] x86/mm: Drop usage of __flush_tlb_all() in kernel_physical_mapping_init()

2018-12-04 Thread Dan Williams
Commit f77084d96355 "x86/mm/pat: Disable preemption around
__flush_tlb_all()" addressed a case where __flush_tlb_all() is called
without preemption being disabled. It also left a warning to catch other
cases where preemption is not disabled. That warning triggers for the
memory hotplug path which is also used for persistent memory enabling:

 WARNING: CPU: 35 PID: 911 at ./arch/x86/include/asm/tlbflush.h:460
 RIP: 0010:__flush_tlb_all+0x1b/0x3a
 [..]
 Call Trace:
  phys_pud_init+0x29c/0x2bb
  kernel_physical_mapping_init+0xfc/0x219
  init_memory_mapping+0x1a5/0x3b0
  arch_add_memory+0x2c/0x50
  devm_memremap_pages+0x3aa/0x610
  pmem_attach_disk+0x585/0x700 [nd_pmem]

Andy wondered why a path that can sleep was using __flush_tlb_all() [1]
and Dave confirmed the expectation for TLB flush is for modifying /
invalidating existing pte entries, but not initial population [2]. Drop
the usage of __flush_tlb_all() in phys_{p4d,pud,pmd}_init() on the
expectation that this path is only ever populating empty entries for the
linear map. Note, at linear map teardown time there is a call to the
all-cpu flush_tlb_all() to invalidate the removed mappings.

[1]: 
https://lkml.kernel.org/r/9dfd717d-857d-493d-a606-b635d72ba...@amacapital.net
[2]: https://lkml.kernel.org/r/749919a4-cdb1-48a3-adb4-adb81a5fa...@intel.com

Fixes: f77084d96355 ("x86/mm/pat: Disable preemption around __flush_tlb_all()")
Cc: Sebastian Andrzej Siewior 
Cc: Thomas Gleixner 
Cc: Borislav Petkov 
Cc: 
Reported-by: Andy Lutomirski 
Suggested-by: Dave Hansen 
Acked-by: Peter Zijlstra (Intel) 
Acked-by: Kirill A. Shutemov 
Signed-off-by: Dan Williams 
---
 arch/x86/mm/init_64.c |6 --
 1 file changed, 6 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 3e25ac2793ef..484c1b92f078 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -584,7 +584,6 @@ phys_pud_init(pud_t *pud_page, unsigned long paddr, 
unsigned long paddr_end,
   paddr_end,
   page_size_mask,
   prot);
-   __flush_tlb_all();
continue;
}
/*
@@ -627,7 +626,6 @@ phys_pud_init(pud_t *pud_page, unsigned long paddr, 
unsigned long paddr_end,
pud_populate_safe(&init_mm, pud, pmd);
spin_unlock(&init_mm.page_table_lock);
}
-   __flush_tlb_all();
 
update_page_count(PG_LEVEL_1G, pages);
 
@@ -668,7 +666,6 @@ phys_p4d_init(p4d_t *p4d_page, unsigned long paddr, 
unsigned long paddr_end,
paddr_last = phys_pud_init(pud, paddr,
paddr_end,
page_size_mask);
-   __flush_tlb_all();
continue;
}
 
@@ -680,7 +677,6 @@ phys_p4d_init(p4d_t *p4d_page, unsigned long paddr, 
unsigned long paddr_end,
p4d_populate_safe(&init_mm, p4d, pud);
spin_unlock(&init_mm.page_table_lock);
}
-   __flush_tlb_all();
 
return paddr_last;
 }
@@ -733,8 +729,6 @@ kernel_physical_mapping_init(unsigned long paddr_start,
if (pgd_changed)
sync_global_pgds(vaddr_start, vaddr_end - 1);
 
-   __flush_tlb_all();
-
return paddr_last;
 }
 



[PATCH v3 4/5] x86/mm: Validate kernel_physical_mapping_init() pte population

2018-12-04 Thread Dan Williams
The usage of __flush_tlb_all() in the kernel_physical_mapping_init()
path is not necessary. In general flushing the tlb is not required when
updating an entry from the !present state. However, to give confidence
in the future removal of tlb flushing in this path, use the new
set_pte_safe() family of helpers to assert that the !present assumption
is true in this path.

Cc: Sebastian Andrzej Siewior 
Cc: Thomas Gleixner 
Cc: Borislav Petkov 
Suggested-by: Peter Zijlstra 
Suggested-by: Dave Hansen 
Acked-by: Kirill A. Shutemov 
Acked-by: Peter Zijlstra (Intel) 
Signed-off-by: Dan Williams 
---
 arch/x86/include/asm/pgalloc.h   |   27 +++
 arch/x86/mm/init_64.c|   24 
 include/asm-generic/5level-fixup.h   |1 +
 include/asm-generic/pgtable-nop4d-hack.h |1 +
 include/asm-generic/pgtable-nop4d.h  |1 +
 include/asm-generic/pgtable-nopud.h  |1 +
 6 files changed, 43 insertions(+), 12 deletions(-)

diff --git a/arch/x86/include/asm/pgalloc.h b/arch/x86/include/asm/pgalloc.h
index ec7f43327033..1ea41aaef68b 100644
--- a/arch/x86/include/asm/pgalloc.h
+++ b/arch/x86/include/asm/pgalloc.h
@@ -80,6 +80,13 @@ static inline void pmd_populate_kernel(struct mm_struct *mm,
set_pmd(pmd, __pmd(__pa(pte) | _PAGE_TABLE));
 }
 
+static inline void pmd_populate_kernel_safe(struct mm_struct *mm,
+  pmd_t *pmd, pte_t *pte)
+{
+   paravirt_alloc_pte(mm, __pa(pte) >> PAGE_SHIFT);
+   set_pmd_safe(pmd, __pmd(__pa(pte) | _PAGE_TABLE));
+}
+
 static inline void pmd_populate(struct mm_struct *mm, pmd_t *pmd,
struct page *pte)
 {
@@ -132,6 +139,12 @@ static inline void pud_populate(struct mm_struct *mm, 
pud_t *pud, pmd_t *pmd)
paravirt_alloc_pmd(mm, __pa(pmd) >> PAGE_SHIFT);
set_pud(pud, __pud(_PAGE_TABLE | __pa(pmd)));
 }
+
+static inline void pud_populate_safe(struct mm_struct *mm, pud_t *pud, pmd_t 
*pmd)
+{
+   paravirt_alloc_pmd(mm, __pa(pmd) >> PAGE_SHIFT);
+   set_pud_safe(pud, __pud(_PAGE_TABLE | __pa(pmd)));
+}
 #endif /* CONFIG_X86_PAE */
 
 #if CONFIG_PGTABLE_LEVELS > 3
@@ -141,6 +154,12 @@ static inline void p4d_populate(struct mm_struct *mm, 
p4d_t *p4d, pud_t *pud)
set_p4d(p4d, __p4d(_PAGE_TABLE | __pa(pud)));
 }
 
+static inline void p4d_populate_safe(struct mm_struct *mm, p4d_t *p4d, pud_t 
*pud)
+{
+   paravirt_alloc_pud(mm, __pa(pud) >> PAGE_SHIFT);
+   set_p4d_safe(p4d, __p4d(_PAGE_TABLE | __pa(pud)));
+}
+
 static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr)
 {
gfp_t gfp = GFP_KERNEL_ACCOUNT;
@@ -173,6 +192,14 @@ static inline void pgd_populate(struct mm_struct *mm, 
pgd_t *pgd, p4d_t *p4d)
set_pgd(pgd, __pgd(_PAGE_TABLE | __pa(p4d)));
 }
 
+static inline void pgd_populate_safe(struct mm_struct *mm, pgd_t *pgd, p4d_t 
*p4d)
+{
+   if (!pgtable_l5_enabled())
+   return;
+   paravirt_alloc_p4d(mm, __pa(p4d) >> PAGE_SHIFT);
+   set_pgd_safe(pgd, __pgd(_PAGE_TABLE | __pa(p4d)));
+}
+
 static inline p4d_t *p4d_alloc_one(struct mm_struct *mm, unsigned long addr)
 {
gfp_t gfp = GFP_KERNEL_ACCOUNT;
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 5fab264948c2..3e25ac2793ef 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -432,7 +432,7 @@ phys_pte_init(pte_t *pte_page, unsigned long paddr, 
unsigned long paddr_end,
 E820_TYPE_RAM) &&
!e820__mapped_any(paddr & PAGE_MASK, paddr_next,
 E820_TYPE_RESERVED_KERN))
-   set_pte(pte, __pte(0));
+   set_pte_safe(pte, __pte(0));
continue;
}
 
@@ -452,7 +452,7 @@ phys_pte_init(pte_t *pte_page, unsigned long paddr, 
unsigned long paddr_end,
pr_info("   pte=%p addr=%lx pte=%016lx\n", pte, paddr,
pfn_pte(paddr >> PAGE_SHIFT, PAGE_KERNEL).pte);
pages++;
-   set_pte(pte, pfn_pte(paddr >> PAGE_SHIFT, prot));
+   set_pte_safe(pte, pfn_pte(paddr >> PAGE_SHIFT, prot));
paddr_last = (paddr & PAGE_MASK) + PAGE_SIZE;
}
 
@@ -487,7 +487,7 @@ phys_pmd_init(pmd_t *pmd_page, unsigned long paddr, 
unsigned long paddr_end,
 E820_TYPE_RAM) &&
!e820__mapped_any(paddr & PMD_MASK, paddr_next,
 E820_TYPE_RESERVED_KERN))
-   set_pmd(pmd, __pmd(0));
+   set_pmd_safe(pmd, __pmd(0));
continue;
}
 
@@ -524,7 +524,7 @@ phys_pmd_init(pmd_t *pmd_page, unsigned l

Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions

2018-12-04 Thread Dan Williams
On Mon, Dec 3, 2018 at 4:17 PM  wrote:
>
> From: John Hubbard 
>
> Introduces put_user_page(), which simply calls put_page().
> This provides a way to update all get_user_pages*() callers,
> so that they call put_user_page(), instead of put_page().
>
> Also introduces put_user_pages(), and a few dirty/locked variations,
> as a replacement for release_pages(), and also as a replacement
> for open-coded loops that release multiple pages.
> These may be used for subsequent performance improvements,
> via batching of pages to be released.
>
> This is the first step of fixing the problem described in [1]. The steps
> are:
>
> 1) (This patch): provide put_user_page*() routines, intended to be used
>for releasing pages that were pinned via get_user_pages*().
>
> 2) Convert all of the call sites for get_user_pages*(), to
>invoke put_user_page*(), instead of put_page(). This involves dozens of
>call sites, and will take some time.
>
> 3) After (2) is complete, use get_user_pages*() and put_user_page*() to
>implement tracking of these pages. This tracking will be separate from
>the existing struct page refcounting.
>
> 4) Use the tracking and identification of these pages, to implement
>special handling (especially in writeback paths) when the pages are
>backed by a filesystem. Again, [1] provides details as to why that is
>desirable.

I thought at Plumbers we talked about using a page bit to tag pages
that have had their reference count elevated by get_user_pages()? That
way there is no need to distinguish put_page() from put_user_page() it
just happens internally to put_page(). At the conference Matthew was
offering to free up a page bit for this purpose.

> [1] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()"
>
> Reviewed-by: Jan Kara 

Wish you could have been there, Jan. I'm missing why it's safe to
assume that a single put_user_page() is paired with a get_user_page().


Re: [RFC PATCH 02/14] mm/hms: heterogenenous memory system (HMS) documentation

2018-12-04 Thread Dan Williams
On Tue, Dec 4, 2018 at 10:58 AM Jerome Glisse  wrote:
>
> On Tue, Dec 04, 2018 at 10:31:17AM -0800, Dan Williams wrote:
> > On Tue, Dec 4, 2018 at 10:24 AM Jerome Glisse  wrote:
> > >
> > > On Tue, Dec 04, 2018 at 09:06:59AM -0800, Andi Kleen wrote:
> > > > jgli...@redhat.com writes:
> > > >
> > > > > +
> > > > > +To help with forward compatibility each object as a version value and
> > > > > +it is mandatory for user space to only use target or initiator with
> > > > > +version supported by the user space. For instance if user space only
> > > > > +knows about what version 1 means and sees a target with version 2 
> > > > > then
> > > > > +the user space must ignore that target as if it does not exist.
> > > >
> > > > So once v2 is introduced all applications that only support v1 break.
> > > >
> > > > That seems very un-Linux and will break Linus' "do not break existing
> > > > applications" rule.
> > > >
> > > > The standard approach that if you add something incompatible is to
> > > > add new field, but keep the old ones.
> > >
> > > No, that's not how it is supposed to work. So let's say it is 2018 and you
> > > have v1 memory (like your regular main DDR memory for instance); then it
> > > will always be exposed as v1 memory.
> > >
> > > Fast forward 2020 and you have this new type of memory that is not cache
> > > coherent and you want to expose this to userspace through HMS. What you
> > > do is a kernel patch that introduce the v2 type for target and define a
> > > set of new sysfs file to describe what v2 is. On this new computer you
> > > report your usual main memory as v1 and your new memory as v2.
> > >
> > > So the application that only knew about v1 will keep using any v1 memory
> > > on your new platform but it will not use any of the new memory v2 which
> > > is what you want to happen. You do not have to break existing application
> > > while allowing to add new type of memory.
> >
> > That sounds needlessly restrictive. Let the kernel arbitrate what
> > memory an application gets, don't design a system where applications
> > are hard coded to a memory type. Applications can hint, or optionally
> > specify an override and the kernel can react accordingly.
>
> You do not want to randomly use non cache coherent memory inside your
> application :)

The kernel arbitrates memory, it's a bug if it hands out something
that exotic to an unaware application.


Re: [RFC PATCH 02/14] mm/hms: heterogenenous memory system (HMS) documentation

2018-12-04 Thread Dan Williams
On Tue, Dec 4, 2018 at 10:24 AM Jerome Glisse  wrote:
>
> On Tue, Dec 04, 2018 at 09:06:59AM -0800, Andi Kleen wrote:
> > jgli...@redhat.com writes:
> >
> > > +
> > > +To help with forward compatibility each object as a version value and
> > > +it is mandatory for user space to only use target or initiator with
> > > +version supported by the user space. For instance if user space only
> > > +knows about what version 1 means and sees a target with version 2 then
> > > +the user space must ignore that target as if it does not exist.
> >
> > So once v2 is introduced all applications that only support v1 break.
> >
> > That seems very un-Linux and will break Linus' "do not break existing
> > applications" rule.
> >
> > The standard approach that if you add something incompatible is to
> > add new field, but keep the old ones.
>
> No, that's not how it is supposed to work. So let's say it is 2018 and you
> have v1 memory (like your regular main DDR memory for instance); then it
> will always be exposed as v1 memory.
>
> Fast forward 2020 and you have this new type of memory that is not cache
> coherent and you want to expose this to userspace through HMS. What you
> do is a kernel patch that introduce the v2 type for target and define a
> set of new sysfs file to describe what v2 is. On this new computer you
> report your usual main memory as v1 and your new memory as v2.
>
> So the application that only knew about v1 will keep using any v1 memory
> on your new platform but it will not use any of the new memory v2 which
> is what you want to happen. You do not have to break existing application
> while allowing to add new type of memory.

That sounds needlessly restrictive. Let the kernel arbitrate what
memory an application gets, don't design a system where applications
are hard coded to a memory type. Applications can hint, or optionally
specify an override and the kernel can react accordingly.
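
For what it's worth, the hint/override model already exists for conventional
NUMA. A minimal userspace sketch using the current mbind(2) API follows; the
node number and mapping size are just examples, and nothing here is
HMS-specific:

#include <numaif.h>		/* mbind(), MPOL_PREFERRED; link with -lnuma */
#include <sys/mman.h>
#include <stdio.h>

int main(void)
{
	size_t len = 4096 * 1024;
	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	unsigned long nodemask = 1UL << 1;	/* example: prefer node 1 */

	if (buf == MAP_FAILED)
		return 1;
	/* A hint, not a hard binding: the kernel may still fall back. */
	if (mbind(buf, len, MPOL_PREFERRED, &nodemask,
		  sizeof(nodemask) * 8, 0))
		perror("mbind");
	return 0;
}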


Re: [PATCH v2 3/5] generic/pgtable: Introduce set_pte_safe()

2018-12-03 Thread Dan Williams
On Mon, Dec 3, 2018 at 9:53 AM Dave Hansen  wrote:
>
> On 11/30/18 4:35 PM, Dan Williams wrote:
> > +/*
> > + * The _safe versions of set_{pte,pmd,pud,p4d,pgd} validate that the
> > + * entry was not populated previously. I.e. for cases where a flush-tlb
> > + * is elided, double-check that there is no stale mapping to shoot down.
> > + */
>
> Functionally these look great to me.
>
> The only thing I'd suggest is to make the comment more about when to use
> these, instead of what they do:
>
> Use the set_p*_safe() version when confident that *no*
> TLB flush will be required as a result of the "set", such
> as setting non-present entries or when possibly superfluously
> re-setting an entry.

The second sentence was meant to be a "why", but yes, it's entirely too subtle.


Re: [PATCH v2 5/5] x86/mm: Drop usage of __flush_tlb_all() in kernel_physical_mapping_init()

2018-12-02 Thread Dan Williams
On Sat, Dec 1, 2018 at 10:43 PM Sasha Levin  wrote:
>
> On Fri, Nov 30, 2018 at 04:35:32PM -0800, Dan Williams wrote:
> >Commit f77084d96355 "x86/mm/pat: Disable preemption around
> >__flush_tlb_all()" addressed a case where __flush_tlb_all() is called
> >without preemption being disabled. It also left a warning to catch other
> >cases where preemption is not disabled. That warning triggers for the
> >memory hotplug path which is also used for persistent memory enabling:
> >
> > WARNING: CPU: 35 PID: 911 at ./arch/x86/include/asm/tlbflush.h:460
> > RIP: 0010:__flush_tlb_all+0x1b/0x3a
> > [..]
> > Call Trace:
> >  phys_pud_init+0x29c/0x2bb
> >  kernel_physical_mapping_init+0xfc/0x219
> >  init_memory_mapping+0x1a5/0x3b0
> >  arch_add_memory+0x2c/0x50
> >  devm_memremap_pages+0x3aa/0x610
> >  pmem_attach_disk+0x585/0x700 [nd_pmem]
> >
> >Andy wondered why a path that can sleep was using __flush_tlb_all() [1]
> >and Dave confirmed the expectation for TLB flush is for modifying /
> >invalidating existing pte entries, but not initial population [2]. Drop
> >the usage of __flush_tlb_all() in phys_{p4d,pud,pmd}_init() on the
> >expectation that this path is only ever populating empty entries for the
> >linear map. Note, at linear map teardown time there is a call to the
> >all-cpu flush_tlb_all() to invalidate the removed mappings.
> >
> >[1]: 
> >https://lore.kernel.org/lkml/9dfd717d-857d-493d-a606-b635d72ba...@amacapital.net
> >[2]: 
> >https://lore.kernel.org/lkml/749919a4-cdb1-48a3-adb4-adb81a5fa...@intel.com
> >
> >Fixes: f77084d96355 ("x86/mm/pat: Disable preemption around 
> >__flush_tlb_all()")
> >Cc: Kirill A. Shutemov 
> >Cc: Sebastian Andrzej Siewior 
> >Cc: Thomas Gleixner 
> >Cc: Peter Zijlstra 
> >Cc: Borislav Petkov 
> >Cc: 
> >Reported-by: Andy Lutomirski 
> >Suggested-by: Dave Hansen 
> >Signed-off-by: Dan Williams 
>
> Hi Dan,
>
> This patch on it's own doesn't apply to any of the stable trees, does it
> maybe depend on some of the previous patches in this series?

It does not strictly depend on them, but it does need to be rebased
without them. The minimum patch for -stable is the set of
__flush_tlb_all() removals, without the
set_{pte,pmd,pud,p4d,pgd}_safe() conversion. It's also an option to
backport those helpers and the conversion, but I'd defer to the x86/mm
folks to make that call.


Re: [PATCH v2 0/5] x86/mm: Drop usage of __flush_tlb_all() in kernel_physical_mapping_init()

2018-12-01 Thread Dan Williams
On Sat, Dec 1, 2018 at 2:28 AM Peter Zijlstra  wrote:
>
> On Fri, Nov 30, 2018 at 04:35:06PM -0800, Dan Williams wrote:
> >
> > Dan Williams (5):
> >   generic/pgtable: Make {pmd,pud}_same() unconditionally available
> >   generic/pgtable: Introduce {p4d,pgd}_same()
> >   generic/pgtable: Introduce set_pte_safe()
> >   x86/mm: Validate kernel_physical_mapping_init() pte population
> >   x86/mm: Drop usage of __flush_tlb_all() in 
> > kernel_physical_mapping_init()
> >
>
> Since you failed to send me 1,2, only for 3-5:

Whoops, sorry about that, I'll add you to my "auto-cc on all if cc'd
on one" list.

> Acked-by: Peter Zijlstra (Intel) 

Thanks.

> although going by the subjects, the earlier two patches should be fine
> too.


Re: [PATCH] PCI/P2PDMA: Match interface changes to devm_memremap_pages()

2018-11-30 Thread Dan Williams
On Fri, Nov 30, 2018 at 2:59 PM Logan Gunthorpe  wrote:
>
> "mm-hmm-mark-hmm_devmem_add-add_resource-export_symbol_gpl.patch" in the
> mm tree breaks p2pdma. The patch was written and reviewed before p2pdma
> was merged so the necessary changes were not done to the call site in
> that code.
>
> Without this patch, all drivers will fail to register P2P resources
> because devm_memremap_pages() will return -EINVAL due to the 'kill'
> member of the pagemap structure not yet being set.
>
> Signed-off-by: Logan Gunthorpe 
> Cc: Andrew Morton 
> Cc: Dan Williams 

Reviewed-by: Dan Williams 


[PATCH v2 5/5] x86/mm: Drop usage of __flush_tlb_all() in kernel_physical_mapping_init()

2018-11-30 Thread Dan Williams
Commit f77084d96355 "x86/mm/pat: Disable preemption around
__flush_tlb_all()" addressed a case where __flush_tlb_all() is called
without preemption being disabled. It also left a warning to catch other
cases where preemption is not disabled. That warning triggers for the
memory hotplug path which is also used for persistent memory enabling:

 WARNING: CPU: 35 PID: 911 at ./arch/x86/include/asm/tlbflush.h:460
 RIP: 0010:__flush_tlb_all+0x1b/0x3a
 [..]
 Call Trace:
  phys_pud_init+0x29c/0x2bb
  kernel_physical_mapping_init+0xfc/0x219
  init_memory_mapping+0x1a5/0x3b0
  arch_add_memory+0x2c/0x50
  devm_memremap_pages+0x3aa/0x610
  pmem_attach_disk+0x585/0x700 [nd_pmem]

Andy wondered why a path that can sleep was using __flush_tlb_all() [1]
and Dave confirmed the expectation for TLB flush is for modifying /
invalidating existing pte entries, but not initial population [2]. Drop
the usage of __flush_tlb_all() in phys_{p4d,pud,pmd}_init() on the
expectation that this path is only ever populating empty entries for the
linear map. Note, at linear map teardown time there is a call to the
all-cpu flush_tlb_all() to invalidate the removed mappings.

[1]: 
https://lore.kernel.org/lkml/9dfd717d-857d-493d-a606-b635d72ba...@amacapital.net
[2]: https://lore.kernel.org/lkml/749919a4-cdb1-48a3-adb4-adb81a5fa...@intel.com

Fixes: f77084d96355 ("x86/mm/pat: Disable preemption around __flush_tlb_all()")
Cc: Kirill A. Shutemov 
Cc: Sebastian Andrzej Siewior 
Cc: Thomas Gleixner 
Cc: Peter Zijlstra 
Cc: Borislav Petkov 
Cc: 
Reported-by: Andy Lutomirski 
Suggested-by: Dave Hansen 
Signed-off-by: Dan Williams 
---
 arch/x86/mm/init_64.c |6 --
 1 file changed, 6 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 3e25ac2793ef..484c1b92f078 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -584,7 +584,6 @@ phys_pud_init(pud_t *pud_page, unsigned long paddr, 
unsigned long paddr_end,
   paddr_end,
   page_size_mask,
   prot);
-   __flush_tlb_all();
continue;
}
/*
@@ -627,7 +626,6 @@ phys_pud_init(pud_t *pud_page, unsigned long paddr, 
unsigned long paddr_end,
pud_populate_safe(&init_mm, pud, pmd);
spin_unlock(&init_mm.page_table_lock);
}
-   __flush_tlb_all();
 
update_page_count(PG_LEVEL_1G, pages);
 
@@ -668,7 +666,6 @@ phys_p4d_init(p4d_t *p4d_page, unsigned long paddr, 
unsigned long paddr_end,
paddr_last = phys_pud_init(pud, paddr,
paddr_end,
page_size_mask);
-   __flush_tlb_all();
continue;
}
 
@@ -680,7 +677,6 @@ phys_p4d_init(p4d_t *p4d_page, unsigned long paddr, 
unsigned long paddr_end,
p4d_populate_safe(&init_mm, p4d, pud);
spin_unlock(&init_mm.page_table_lock);
}
-   __flush_tlb_all();
 
return paddr_last;
 }
@@ -733,8 +729,6 @@ kernel_physical_mapping_init(unsigned long paddr_start,
if (pgd_changed)
sync_global_pgds(vaddr_start, vaddr_end - 1);
 
-   __flush_tlb_all();
-
return paddr_last;
 }
 



[PATCH v2 3/5] generic/pgtable: Introduce set_pte_safe()

2018-11-30 Thread Dan Williams
Commit f77084d96355 "x86/mm/pat: Disable preemption around
__flush_tlb_all()" introduced a warning to capture cases
__flush_tlb_all() is called without pre-emption disabled. It triggers a
false positive warning in the memory hotplug path. On investigation it
was found that the __flush_tlb_all() calls are not necessary. However,
they are only "not necessary" in practice provided the ptes are being
initially populated from the !present state. Introduce set_pte_safe() as
a sanity check that the pte is being updated in a way that does not
require a tlb flush.

Forgive the macro; the availability of the various set_pte() levels
is hit and miss across architectures.

Link: 
https://lore.kernel.org/lkml/279dadae-9148-465c-7ec6-3f37e026c...@intel.com
Cc: Kirill A. Shutemov 
Cc: Sebastian Andrzej Siewior 
Cc: Thomas Gleixner 
Cc: Peter Zijlstra 
Cc: Borislav Petkov 
Suggested-by: Peter Zijlstra 
Suggested-by: Dave Hansen 
Signed-off-by: Dan Williams 
---
 include/asm-generic/pgtable.h |   35 +++
 1 file changed, 35 insertions(+)

diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index dae7f98babed..62be0d5e1a9a 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -400,6 +400,41 @@ static inline int pgd_same(pgd_t pgd_a, pgd_t pgd_b)
 }
 #endif
 
+/*
+ * The _safe versions of set_{pte,pmd,pud,p4d,pgd} validate that the
+ * entry was not populated previously. I.e. for cases where a flush-tlb
+ * is elided, double-check that there is no stale mapping to shoot down.
+ */
+#define set_pte_safe(ptep, pte) \
+({ \
+   WARN_ON_ONCE(pte_present(*ptep) && !pte_same(*ptep, pte)); \
+   set_pte(ptep, pte); \
+})
+
+#define set_pmd_safe(pmdp, pmd) \
+({ \
+   WARN_ON_ONCE(pmd_present(*pmdp) && !pmd_same(*pmdp, pmd)); \
+   set_pmd(pmdp, pmd); \
+})
+
+#define set_pud_safe(pudp, pud) \
+({ \
+   WARN_ON_ONCE(pud_present(*pudp) && !pud_same(*pudp, pud)); \
+   set_pud(pudp, pud); \
+})
+
+#define set_p4d_safe(p4dp, p4d) \
+({ \
+   WARN_ON_ONCE(p4d_present(*p4dp) && !p4d_same(*p4dp, p4d)); \
+   set_p4d(p4dp, p4d); \
+})
+
+#define set_pgd_safe(pgdp, pgd) \
+({ \
+   WARN_ON_ONCE(pgd_present(*pgdp) && !pgd_same(*pgdp, pgd)); \
+   set_pgd(pgdp, pgd); \
+})
+
 #ifndef __HAVE_ARCH_DO_SWAP_PAGE
 /*
  * Some architectures support metadata associated with a page. When a



[PATCH v2 2/5] generic/pgtable: Introduce {p4d,pgd}_same()

2018-11-30 Thread Dan Williams
In preparation for introducing '_safe' versions of page table entry 'set'
helpers, introduce generic versions of p4d_same() and pgd_same().

Signed-off-by: Dan Williams 
---
 include/asm-generic/pgtable.h |   14 ++
 1 file changed, 14 insertions(+)

diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index eea50ef8b8cd..dae7f98babed 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -386,6 +386,20 @@ static inline int pud_same(pud_t pud_a, pud_t pud_b)
 }
 #endif
 
+#ifndef __HAVE_ARCH_P4D_SAME
+static inline int p4d_same(p4d_t p4d_a, p4d_t p4d_b)
+{
+   return p4d_val(p4d_a) == p4d_val(p4d_b);
+}
+#endif
+
+#ifndef __HAVE_ARCH_PGD_SAME
+static inline int pgd_same(pgd_t pgd_a, pgd_t pgd_b)
+{
+   return pgd_val(pgd_a) == pgd_val(pgd_b);
+}
+#endif
+
 #ifndef __HAVE_ARCH_DO_SWAP_PAGE
 /*
  * Some architectures support metadata associated with a page. When a



[PATCH v2 1/5] generic/pgtable: Make {pmd, pud}_same() unconditionally available

2018-11-30 Thread Dan Williams
In preparation for {pmd,pud}_same() to be used outside of transparent
huge page code paths, make them unconditionally available. This enables
them to be used in the definition of a new family of
set_{pte,pmd,pud,p4d,pgd}_safe() helpers.

Signed-off-by: Dan Williams 
---
 include/asm-generic/pgtable.h |   14 --
 1 file changed, 14 deletions(-)

diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 359fb935ded6..eea50ef8b8cd 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -375,7 +375,6 @@ static inline int pte_unused(pte_t pte)
 #endif
 
 #ifndef __HAVE_ARCH_PMD_SAME
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
 static inline int pmd_same(pmd_t pmd_a, pmd_t pmd_b)
 {
return pmd_val(pmd_a) == pmd_val(pmd_b);
@@ -385,19 +384,6 @@ static inline int pud_same(pud_t pud_a, pud_t pud_b)
 {
return pud_val(pud_a) == pud_val(pud_b);
 }
-#else /* CONFIG_TRANSPARENT_HUGEPAGE */
-static inline int pmd_same(pmd_t pmd_a, pmd_t pmd_b)
-{
-   BUILD_BUG();
-   return 0;
-}
-
-static inline int pud_same(pud_t pud_a, pud_t pud_b)
-{
-   BUILD_BUG();
-   return 0;
-}
-#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 #endif
 
 #ifndef __HAVE_ARCH_DO_SWAP_PAGE



[PATCH v2 0/5] x86/mm: Drop usage of __flush_tlb_all() in kernel_physical_mapping_init()

2018-11-30 Thread Dan Williams
Changes since v1 [1]:
* Introduce _safe versions of the set_pte family of helpers (Dave)
* Fix the validation check to accept rewriting the same entry (Peter)
* Fix up the email reference links in the changelogs (Peter)

[1]: https://lkml.org/lkml/2018/11/20/300

---

From patch 5:

Commit f77084d96355 "x86/mm/pat: Disable preemption around
__flush_tlb_all()" addressed a case where __flush_tlb_all() is called
without preemption being disabled. It also left a warning to catch other
cases where preemption is not disabled. That warning triggers for the
memory hotplug path which is also used for persistent memory enabling:

 WARNING: CPU: 35 PID: 911 at ./arch/x86/include/asm/tlbflush.h:460
 RIP: 0010:__flush_tlb_all+0x1b/0x3a
 [..]
 Call Trace:
  phys_pud_init+0x29c/0x2bb
  kernel_physical_mapping_init+0xfc/0x219
  init_memory_mapping+0x1a5/0x3b0
  arch_add_memory+0x2c/0x50
  devm_memremap_pages+0x3aa/0x610
  pmem_attach_disk+0x585/0x700 [nd_pmem]

Andy wondered why a path that can sleep was using __flush_tlb_all() [1]
and Dave confirmed the expectation for TLB flush is for modifying /
invalidating existing pte entries, but not initial population [2]. Drop
the usage of __flush_tlb_all() in phys_{p4d,pud,pmd}_init() on the
expectation that this path is only ever populating empty entries for the
linear map. Note, at linear map teardown time there is a call to the
all-cpu flush_tlb_all() to invalidate the removed mappings.

Additionally, Dave wanted some runtime assurances that
kernel_physical_mapping_init() is only populating, and not changing,
existing page table entries. Patches 1 - 4 implement the new '_safe'
versions of the set_pte() family of helpers for that purpose.

Patch 5 is tagged for -stable because the false positive warning is now
showing up on 4.19-stable kernels. Patches 1 - 4 are not tagged for
-stable, but if the sanity checking is needed please mark them as such.

The hang that was observed while developing the sanity checking
implementation was resolved by Peter's suggestion to not trigger when
the same pte value is being rewritten.

---

Dan Williams (5):
  generic/pgtable: Make {pmd,pud}_same() unconditionally available
  generic/pgtable: Introduce {p4d,pgd}_same()
  generic/pgtable: Introduce set_pte_safe()
  x86/mm: Validate kernel_physical_mapping_init() pte population
  x86/mm: Drop usage of __flush_tlb_all() in kernel_physical_mapping_init()


 arch/x86/include/asm/pgalloc.h   |   27 +++
 arch/x86/mm/init_64.c|   30 +++--
 include/asm-generic/5level-fixup.h   |1 +
 include/asm-generic/pgtable-nop4d-hack.h |1 +
 include/asm-generic/pgtable-nop4d.h  |1 +
 include/asm-generic/pgtable-nopud.h  |1 +
 include/asm-generic/pgtable.h|   53 +-
 7 files changed, 87 insertions(+), 27 deletions(-)


[PATCH v2 4/5] x86/mm: Validate kernel_physical_mapping_init() pte population

2018-11-30 Thread Dan Williams
The usage of __flush_tlb_all() in the kernel_physical_mapping_init()
path is not necessary. In general flushing the tlb is not required when
updating an entry from the !present state. However, to give confidence
in the future removal of tlb flushing in this path, use the new
set_pte_safe() family of helpers to assert that the !present assumption
is true in this path.

Cc: Kirill A. Shutemov 
Cc: Sebastian Andrzej Siewior 
Cc: Thomas Gleixner 
Cc: Peter Zijlstra 
Cc: Borislav Petkov 
Suggested-by: Peter Zijlstra 
Suggested-by: Dave Hansen 
Signed-off-by: Dan Williams 
---
 arch/x86/include/asm/pgalloc.h   |   27 +++
 arch/x86/mm/init_64.c|   24 
 include/asm-generic/5level-fixup.h   |1 +
 include/asm-generic/pgtable-nop4d-hack.h |1 +
 include/asm-generic/pgtable-nop4d.h  |1 +
 include/asm-generic/pgtable-nopud.h  |1 +
 6 files changed, 43 insertions(+), 12 deletions(-)

diff --git a/arch/x86/include/asm/pgalloc.h b/arch/x86/include/asm/pgalloc.h
index ec7f43327033..1ea41aaef68b 100644
--- a/arch/x86/include/asm/pgalloc.h
+++ b/arch/x86/include/asm/pgalloc.h
@@ -80,6 +80,13 @@ static inline void pmd_populate_kernel(struct mm_struct *mm,
set_pmd(pmd, __pmd(__pa(pte) | _PAGE_TABLE));
 }
 
+static inline void pmd_populate_kernel_safe(struct mm_struct *mm,
+  pmd_t *pmd, pte_t *pte)
+{
+   paravirt_alloc_pte(mm, __pa(pte) >> PAGE_SHIFT);
+   set_pmd_safe(pmd, __pmd(__pa(pte) | _PAGE_TABLE));
+}
+
 static inline void pmd_populate(struct mm_struct *mm, pmd_t *pmd,
struct page *pte)
 {
@@ -132,6 +139,12 @@ static inline void pud_populate(struct mm_struct *mm, 
pud_t *pud, pmd_t *pmd)
paravirt_alloc_pmd(mm, __pa(pmd) >> PAGE_SHIFT);
set_pud(pud, __pud(_PAGE_TABLE | __pa(pmd)));
 }
+
+static inline void pud_populate_safe(struct mm_struct *mm, pud_t *pud, pmd_t 
*pmd)
+{
+   paravirt_alloc_pmd(mm, __pa(pmd) >> PAGE_SHIFT);
+   set_pud_safe(pud, __pud(_PAGE_TABLE | __pa(pmd)));
+}
 #endif /* CONFIG_X86_PAE */
 
 #if CONFIG_PGTABLE_LEVELS > 3
@@ -141,6 +154,12 @@ static inline void p4d_populate(struct mm_struct *mm, 
p4d_t *p4d, pud_t *pud)
set_p4d(p4d, __p4d(_PAGE_TABLE | __pa(pud)));
 }
 
+static inline void p4d_populate_safe(struct mm_struct *mm, p4d_t *p4d, pud_t 
*pud)
+{
+   paravirt_alloc_pud(mm, __pa(pud) >> PAGE_SHIFT);
+   set_p4d_safe(p4d, __p4d(_PAGE_TABLE | __pa(pud)));
+}
+
 static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr)
 {
gfp_t gfp = GFP_KERNEL_ACCOUNT;
@@ -173,6 +192,14 @@ static inline void pgd_populate(struct mm_struct *mm, 
pgd_t *pgd, p4d_t *p4d)
set_pgd(pgd, __pgd(_PAGE_TABLE | __pa(p4d)));
 }
 
+static inline void pgd_populate_safe(struct mm_struct *mm, pgd_t *pgd, p4d_t 
*p4d)
+{
+   if (!pgtable_l5_enabled())
+   return;
+   paravirt_alloc_p4d(mm, __pa(p4d) >> PAGE_SHIFT);
+   set_pgd_safe(pgd, __pgd(_PAGE_TABLE | __pa(p4d)));
+}
+
 static inline p4d_t *p4d_alloc_one(struct mm_struct *mm, unsigned long addr)
 {
gfp_t gfp = GFP_KERNEL_ACCOUNT;
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 5fab264948c2..3e25ac2793ef 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -432,7 +432,7 @@ phys_pte_init(pte_t *pte_page, unsigned long paddr, 
unsigned long paddr_end,
 E820_TYPE_RAM) &&
!e820__mapped_any(paddr & PAGE_MASK, paddr_next,
 E820_TYPE_RESERVED_KERN))
-   set_pte(pte, __pte(0));
+   set_pte_safe(pte, __pte(0));
continue;
}
 
@@ -452,7 +452,7 @@ phys_pte_init(pte_t *pte_page, unsigned long paddr, 
unsigned long paddr_end,
pr_info("   pte=%p addr=%lx pte=%016lx\n", pte, paddr,
pfn_pte(paddr >> PAGE_SHIFT, PAGE_KERNEL).pte);
pages++;
-   set_pte(pte, pfn_pte(paddr >> PAGE_SHIFT, prot));
+   set_pte_safe(pte, pfn_pte(paddr >> PAGE_SHIFT, prot));
paddr_last = (paddr & PAGE_MASK) + PAGE_SIZE;
}
 
@@ -487,7 +487,7 @@ phys_pmd_init(pmd_t *pmd_page, unsigned long paddr, 
unsigned long paddr_end,
 E820_TYPE_RAM) &&
!e820__mapped_any(paddr & PMD_MASK, paddr_next,
 E820_TYPE_RESERVED_KERN))
-   set_pmd(pmd, __pmd(0));
+   set_pmd_safe(pmd, __pmd(0));
continue;
}
 
@@ -524,7 +524,7 @@ phys_pmd_init(pmd_t *pmd_page, unsigned l

[4.19-stable PATCH] dax: Avoid losing wakeup in dax_lock_mapping_entry

2018-11-28 Thread Dan Williams
From: Matthew Wilcox 

commit 25bbe21bf427a81b8e3ccd480ea0e1d940256156 upstream.

After calling get_unlocked_entry(), you have to call
put_unlocked_entry() to avoid subsequent waiters losing wakeups.

Fixes: c2a7d2a11552 ("filesystem-dax: Introduce dax_lock_mapping_entry()")
Cc: 
Signed-off-by: Matthew Wilcox 
Signed-off-by: Dan Williams 
---
Passes the nvdimm unit test suite which exercises the lock path.

 fs/dax.c |7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/fs/dax.c b/fs/dax.c
index 0fb270f0a0ef..b0cd1364c68f 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -217,6 +217,9 @@ static inline void *unlock_slot(struct address_space 
*mapping, void **slot)
return (void *)entry;
 }
 
+static void put_unlocked_mapping_entry(struct address_space *mapping,
+  pgoff_t index, void *entry);
+
 /*
  * Lookup entry in radix tree, wait for it to become unlocked if it is
  * exceptional entry and return it. The caller must call
@@ -256,8 +259,10 @@ static void *__get_unlocked_mapping_entry(struct 
address_space *mapping,
revalidate = wait_fn();
finish_wait(wq, &ewait.wait);
xa_lock_irq(&mapping->i_pages);
-   if (revalidate)
+   if (revalidate) {
+   put_unlocked_mapping_entry(mapping, index, entry);
return ERR_PTR(-EAGAIN);
+   }
}
 }
 



Re: [PATCH 2/7] node: Add heterogenous memory performance

2018-11-27 Thread Dan Williams
On Mon, Nov 26, 2018 at 11:00 PM Dan Williams  wrote:
>
> On Wed, Nov 14, 2018 at 2:53 PM Keith Busch  wrote:
> >
> > Heterogeneous memory systems provide memory nodes with latency
> > and bandwidth performance attributes that are different from other
> > nodes. Create an interface for the kernel to register these attributes
> > under the node that provides the memory. If the system provides this
> > information, applications can query the node attributes when deciding
> > which node to request memory.
> >
> > When multiple memory initiators exist, accessing the same memory target
> > from each may not perform the same as the other. The highest performing
> > initiator to a given target is considered to be a local initiator for
> > that target. The kernel provides performance attributes only for the
> > local initiators.
> >
> > The memory's compute node should be symlinked in sysfs as one of the
> > node's initiators.
> >
> > The following example shows the new sysfs hierarchy for a node exporting
> > performance attributes:
> >
> >   # tree /sys/devices/system/node/nodeY/initiator_access
> >   /sys/devices/system/node/nodeY/initiator_access
> >   |-- read_bandwidth
> >   |-- read_latency
> >   |-- write_bandwidth
> >   `-- write_latency
>
> With the expectation that there will be nodes that are initiator-only,
> target-only, or both I think this interface should indicate that. The
> 1:1 "local" designation of HMAT should not be directly encoded in the
> interface, it's just a shortcut for finding at least one initiator in
> the set that can realize the advertised performance. At least if the
> interface can enumerate the set of initiators then it becomes clear
> whether sysfs can answer a performance enumeration question or if the
> application needs to consult an interface with specific knowledge of a
> given initiator-target pairing.

Sorry, I misread patch 1; this series does allow publishing the case
where multiple initiators share the same performance profile to a
given target.

It seems a precursor to these patches is arranging for offline node
devices to be created for the ACPI proximity domains that are
offline by default for reserved memory ranges.

Likely still need this though because node devices don't tend to show
up until they have a cpu or online memory.


Re: [PATCH 0/7] ACPI HMAT memory sysfs representation

2018-11-27 Thread Dan Williams
On Tue, Nov 27, 2018 at 2:15 AM Anshuman Khandual
 wrote:
>
>
>
> On 11/26/2018 11:38 PM, Dan Williams wrote:
> > On Mon, Nov 26, 2018 at 8:42 AM Dave Hansen  wrote:
> >>
> >> On 11/23/18 1:13 PM, Dan Williams wrote:
> >>>> A new system call makes total sense to me.  I have the same concern
> >>>> about the completeness of what's exposed in sysfs, I just don't see a
> >>>> _route_ to completeness with sysfs itself.  Thus, the minimalist
> >>>> approach as a first step.
> >>> Outside of platform-firmware-id to Linux-numa-node-id what other
> >>> userspace API infrastructure does the kernel need to provide? It seems
> >>> userspace enumeration of memory attributes is fully enabled once the
> >>> firmware-to-Linux identification is established.
> >>
> >> It would be nice not to have each app need to know about each specific
> >> platform's firmware.
> >
> > The app wouldn't need to know if it uses a common library. Whether the
> > library calls into the kernel or not is an implementation detail. If
> > it is information that only the app cares about and the kernel does
> > not consume, why have a syscall?
>
> If we just care about platform-firmware-id <--> Linux-numa-node-id mapping
> and fetching memory attribute from the platform (and hiding implementation
> details in a library) then the following interface should be sufficient.
>
> /sys/devices/system/node/nodeX/platform_id
>
> But as the series proposes (and rightly so) kernel needs to start providing
> ABI interfaces for memory attributes instead of hiding them in libraries.

Yes, I can get on board with sysfs providing a subset of the
performance description for administrators to discover the common case
via scripting and leave the exhaustive attribute description to a
separate interface. I was pushing back on the notion that sysfs must
be that exhaustive interface... we're making progress.

I still think we need /sys/devices/system/node/nodeX/platform_id to
enable higher order platform enumeration tooling, but that need not be
the end of the kernel interface description.
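
As a sketch of what such tooling might do with a hypothetical platform_id
attribute (the attribute does not exist today, so the sysfs path below is an
assumption about how the proposal could land):

#include <stdio.h>

/*
 * Sketch only: "platform_id" is the proposed, not yet existing, per-node
 * attribute; this just shows how tooling could read the firmware handle
 * (e.g. an ACPI proximity domain) for a given Linux node.
 */
static long node_platform_id(int node)
{
	char path[128];
	unsigned long id;
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/devices/system/node/node%d/platform_id", node);
	f = fopen(path, "r");
	if (!f || fscanf(f, "%lu", &id) != 1)
		id = (unsigned long)-1;
	if (f)
		fclose(f);
	return (long)id;
}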


Re: [PATCH 2/7] node: Add heterogenous memory performance

2018-11-26 Thread Dan Williams
On Wed, Nov 14, 2018 at 2:53 PM Keith Busch  wrote:
>
> Heterogeneous memory systems provide memory nodes with latency
> and bandwidth performance attributes that are different from other
> nodes. Create an interface for the kernel to register these attributes
> under the node that provides the memory. If the system provides this
> information, applications can query the node attributes when deciding
> which node to request memory.
>
> When multiple memory initiators exist, accessing the same memory target
> from each may not perform the same as the other. The highest performing
> initiator to a given target is considered to be a local initiator for
> that target. The kernel provides performance attributes only for the
> local initiators.
>
> The memory's compute node should be symlinked in sysfs as one of the
> node's initiators.
>
> The following example shows the new sysfs hierarchy for a node exporting
> performance attributes:
>
>   # tree /sys/devices/system/node/nodeY/initiator_access
>   /sys/devices/system/node/nodeY/initiator_access
>   |-- read_bandwidth
>   |-- read_latency
>   |-- write_bandwidth
>   `-- write_latency

With the expectation that there will be nodes that are initiator-only,
target-only, or both I think this interface should indicate that. The
1:1 "local" designation of HMAT should not be directly encoded in the
interface, it's just a shortcut for finding at least one initiator in
the set that can realize the advertised performance. At least if the
interface can enumerate the set of initiators then it becomes clear
whether sysfs can answer a performance enumeration question or if the
application needs to consult an interface with specific knowledge of a
given initiator-target pairing.

It seems a precursor to these patches is arranging for offline node
devices to be created for the ACPI proximity domains that are
offline by default for reserved memory ranges.


Re: [driver-core PATCH v6 7/9] driver core: Attach devices on CPU local to device node

2018-11-26 Thread Dan Williams
On Thu, Nov 8, 2018 at 10:07 AM Alexander Duyck
 wrote:
>
> Call the asynchronous probe routines on a CPU local to the device node. By
> doing this we should be able to improve our initialization time
> significantly as we can avoid having to access the device from a remote
> node which may introduce higher latency.
>
> For example, in the case of initializing memory for NVDIMM this can have a
> significant impact as initializing 3TB on a remote node can take up to 39
> seconds while initializing it on a local node only takes 23 seconds. It is
> situations like this where we will see the biggest improvement.
>
> Reviewed-by: Bart Van Assche 
> Signed-off-by: Alexander Duyck 

Looks good,

Reviewed-by: Dan Williams 


Re: [PATCH 0/7] ACPI HMAT memory sysfs representation

2018-11-26 Thread Dan Williams
On Mon, Nov 26, 2018 at 8:42 AM Dave Hansen  wrote:
>
> On 11/23/18 1:13 PM, Dan Williams wrote:
> >> A new system call makes total sense to me.  I have the same concern
> >> about the completeness of what's exposed in sysfs, I just don't see a
> >> _route_ to completeness with sysfs itself.  Thus, the minimalist
> >> approach as a first step.
> > Outside of platform-firmware-id to Linux-numa-node-id what other
> > userspace API infrastructure does the kernel need to provide? It seems
> > userspace enumeration of memory attributes is fully enabled once the
> > firmware-to-Linux identification is established.
>
> It would be nice not to have each app need to know about each specific
> platform's firmware.

The app wouldn't need to know if it uses a common library. Whether the
library calls into the kernel or not is an implementation detail. If
it is information that only the app cares about and the kernel does
not consume, why have a syscall?


Re: [PATCH 0/7] ACPI HMAT memory sysfs representation

2018-11-23 Thread Dan Williams
On Fri, Nov 23, 2018 at 11:21 AM Dave Hansen  wrote:
>
> On 11/22/18 10:42 PM, Anshuman Khandual wrote:
> > Are we willing to go in the direction for inclusion of a new system
> > call, subset of it appears on sysfs etc ? My primary concern is not
> > how the attribute information appears on the sysfs but lack of it's
> > completeness.
>
> A new system call makes total sense to me.  I have the same concern
> about the completeness of what's exposed in sysfs, I just don't see a
> _route_ to completeness with sysfs itself.  Thus, the minimalist
> approach as a first step.

Outside of platform-firmware-id to Linux-numa-node-id what other
userspace API infrastructure does the kernel need to provide? It seems
userspace enumeration of memory attributes is fully enabled once the
firmware-to-Linux identification is established.


Re: [PATCH 0/7] ACPI HMAT memory sysfs representation

2018-11-23 Thread Dan Williams
On Thu, Nov 22, 2018 at 11:11 PM Anshuman Khandual
 wrote:
>
>
>
> On 11/22/2018 11:38 PM, Dan Williams wrote:
> > On Thu, Nov 22, 2018 at 3:52 AM Anshuman Khandual
> >  wrote:
> >>
> >>
> >>
> >> On 11/19/2018 11:07 PM, Dave Hansen wrote:
> >>> On 11/18/18 9:44 PM, Anshuman Khandual wrote:
> >>>> IIUC NUMA re-work in principle involves these functional changes
> >>>>
> >>>> 1. Enumerating compute and memory nodes in heterogeneous environment 
> >>>> (short/medium term)
> >>>
> >>> This patch set _does_ that, though.
> >>>
> >>>> 2. Enumerating memory node attributes as seen from the compute nodes 
> >>>> (short/medium term)
> >>>
> >>> It does that as well (a subset at least).
> >>>
> >>> It sounds like the subset that's being exposed is insufficient for yo
> >>> We did that because we think doing anything but a subset in sysfs will
> >>> just blow up sysfs:  MAX_NUMNODES is as high as 1024, so if we have 4
> >>> attributes, that's at _least_ 1024*1024*4 files if we expose *all*
> >>> combinations.
> >> Each permutation need not be a separate file inside all possible NODE X
> >> (/sys/devices/system/node/nodeX) directories. It can be a top level file
> >> enumerating various attribute values for a given (X, Y) node pair based
> >> on an offset something like /proc/pid/pagemap.
> >>
> >>>
> >>> Do we agree that sysfs is unsuitable for exposing attributes in this 
> >>> manner?
> >>>
> >>
> >> Yes, for individual files. But this can be worked around with an offset
> >> based access from a top level global attributes file as mentioned above.
> >> Is there any particular advantage of using individual files for each
> >> given attribute ? I was wondering that a single unsigned long (u64) will
> >> be able to pack 8 different attributes where each individual attribute
> >> values can be abstracted out in 8 bits.
> >
> > sysfs has a 4K limit, and in general I don't think there is much
> > incremental value to go describe the entirety of the system from sysfs
> > or anywhere else in the kernel for that matter. It's simply too much
> > information to reasonably consume. Instead the kernel can describe the
>
> I agree that it may be some amount of information to parse but is crucial
> for any task on a heterogeneous system to evaluate (probably re-evaluate
> if the task moves around) its memory and CPU binding at runtime to make
> sure it has got the right one.

Can you provide some more evidence for this statement? It seems that
not many applications even care about basic numa let alone specific
memory targeting, at least according to libnumactl users.

 dnf repoquery --whatrequires numactl-libs

The kernel is the arbiter of memory, something is broken if
applications *need* to take on this responsibility. Yes, there will be
applications that want to tune and override the default kernel
behavior, but this is the exception, not the rule. The applications
that tend to care about specific memories also tend to be purpose
built for a given platform, and that lessens their reliance on the
kernel to enumerate all properties.

> > coarse boundaries and some semblance of "best" access initiator for a
> > given target. That should cover the "80%" case of what applications
>
> The current proposal just assumes that the best one is the nearest one.
> This may be true for bandwidth and latency but may not be true for some
> other properties. This assumptions should not be there while defining
> new ABI.

In fact, I tend to agree with you, but in my opinion that's an
argument to expose even less, not more. If we start with something
minimal that can be extended over time we lessen the risk of over
exposing details that don't matter in practice.

We're in the middle of a bit of a disaster with the VmFlags export in
/proc/$pid/smaps precisely because the implementation was too
comprehensive and applications started depending on details that the
kernel does not want to guarantee going forward. So there is a real
risk of being too descriptive in an interface design.

> > want to discover, for the other "20%" we likely need some userspace
> > library that can go parse these platform specific information sources
> > and supplement the kernel view. I also think a simpler kernel starting
> > point gives us room to go pull in more commonly used attributes if it
> > turns out they are useful, and avoid going down the path of exporting
> > attrib

Re: [PATCH 0/7] ACPI HMAT memory sysfs representation

2018-11-22 Thread Dan Williams
On Thu, Nov 22, 2018 at 3:52 AM Anshuman Khandual
 wrote:
>
>
>
> On 11/19/2018 11:07 PM, Dave Hansen wrote:
> > On 11/18/18 9:44 PM, Anshuman Khandual wrote:
> >> IIUC NUMA re-work in principle involves these functional changes
> >>
> >> 1. Enumerating compute and memory nodes in heterogeneous environment 
> >> (short/medium term)
> >
> > This patch set _does_ that, though.
> >
> >> 2. Enumerating memory node attributes as seen from the compute nodes 
> >> (short/medium term)
> >
> > It does that as well (a subset at least).
> >
> > It sounds like the subset that's being exposed is insufficient for yo
> > We did that because we think doing anything but a subset in sysfs will
> > just blow up sysfs:  MAX_NUMNODES is as high as 1024, so if we have 4
> > attributes, that's at _least_ 1024*1024*4 files if we expose *all*
> > combinations.
> Each permutation need not be a separate file inside all possible NODE X
> (/sys/devices/system/node/nodeX) directories. It can be a top level file
> enumerating various attribute values for a given (X, Y) node pair based
> on an offset something like /proc/pid/pagemap.
>
> >
> > Do we agree that sysfs is unsuitable for exposing attributes in this manner?
> >
>
> Yes, for individual files. But this can be worked around with an offset
> based access from a top level global attributes file as mentioned above.
> Is there any particular advantage of using individual files for each
> given attribute ? I was wondering that a single unsigned long (u64) will
> be able to pack 8 different attributes where each individual attribute
> values can be abstracted out in 8 bits.

sysfs has a 4K limit, and in general I don't think there is much
incremental value to go describe the entirety of the system from sysfs
or anywhere else in the kernel for that matter. It's simply too much
information to reasonably consume. Instead the kernel can describe the
coarse boundaries and some semblance of "best" access initiator for a
given target. That should cover the "80%" case of what applications
want to discover, for the other "20%" we likely need some userspace
library that can go parse these platform specific information sources
and supplement the kernel view. I also think a simpler kernel starting
point gives us room to go pull in more commonly used attributes if it
turns out they are useful, and avoid going down the path of exporting
attributes that have questionable value in practice.
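
For reference, the u64 packing idea quoted above would look something like the
sketch below; the slot names and layout are made up purely for illustration:

#include <stdint.h>

/* Illustrative only: eight 8-bit attribute slots packed into one u64. */
enum mem_attr_slot {
	MEM_ATTR_READ_LAT,
	MEM_ATTR_WRITE_LAT,
	MEM_ATTR_READ_BW,
	MEM_ATTR_WRITE_BW,
	/* up to 8 slots fit in a u64 */
};

static inline uint64_t mem_attr_set(uint64_t packed, enum mem_attr_slot slot,
				    uint8_t val)
{
	packed &= ~(0xffULL << (slot * 8));	/* clear the slot */
	return packed | ((uint64_t)val << (slot * 8));
}

static inline uint8_t mem_attr_get(uint64_t packed, enum mem_attr_slot slot)
{
	return (packed >> (slot * 8)) & 0xff;
}

The encoding itself is trivial; the question raised above is how much of this
detail the kernel should export at all.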


Re: [PATCH] x86/mm: Drop usage of __flush_tlb_all() in kernel_physical_mapping_init()

2018-11-21 Thread Dan Williams
On Tue, Nov 20, 2018 at 1:03 AM Peter Zijlstra  wrote:
>
> On Tue, Nov 20, 2018 at 02:59:32AM +, Williams, Dan J wrote:
> > On Mon, 2018-11-19 at 15:43 -0800, Dave Hansen wrote:
> > > On 11/19/18 3:19 PM, Dan Williams wrote:
> > > > Andy wondered why a path that can sleep was using __flush_tlb_all()
> > > > [1]
> > > > and Dave confirmed the expectation for TLB flush is for modifying /
> > > > invalidating existing pte entries, but not initial population [2].
> > >
> > > I _think_ this is OK.
> > >
> > > But, could we sprinkle a few WARN_ON_ONCE(p*_present()) calls in
> > > there
> > > to help us sleep at night?
> >
> > Well, I'm having nightmares now because my naive patch to sprinkle some
> > WARN_ON_ONCE() calls is leading to my VM live locking at boot... no
> > backtrace. If I revert the patch below and just go with the
> > __flush_tlb_all() removal it seems fine.
> >
> > I'm going to set this aside for a bit, but if anyone has any thoughts
> > in the meantime I'd appreciate it.
>
> Have you tried using early_printk ?

No, it boots well past printk, and even gets past pivot root.
Eventually live locks with all cores spinning. It appears to be
correlated with the arrival of pmem, and independent of the tlb
flushes... I'll dig deeper.

> So kernel_physical_mapping_init() has a comment that states the virtual
> and physical addresses we create mappings for should be PMD aligned,
> which implies pud/p4d could have overlap between the mappings.
>
> But in that case, I would expect the new and old values to match.
>
> So maybe you should be checking something like:
>
> WARN_ON_ONCE(pud_present(*pud) && !pud_same(pud, new));

Yes, that looks better.


Re: [RFC PATCH 1/3] mm, proc: be more verbose about unstable VMA flags in /proc//smaps

2018-11-20 Thread Dan Williams
On Tue, Nov 20, 2018 at 2:35 AM Michal Hocko  wrote:
>
> From: Michal Hocko 
>
> Even though vma flags exported via /proc//smaps are explicitly
> documented to be not guaranteed for future compatibility the warning
> doesn't go far enough because it doesn't mention semantic changes to
> those flags. And they are important as well because these flags are
> a deep implementation internal to the MM code and the semantic might
> change at any time.
>
> Let's consider two recent examples:
> http://lkml.kernel.org/r/20181002100531.gc4...@quack2.suse.cz
> : commit e1fb4a086495 "dax: remove VM_MIXEDMAP for fsdax and device dax" has
> : removed VM_MIXEDMAP flag from DAX VMAs. Now our testing shows that in the
> : mean time certain customer of ours started poking into /proc//smaps
> : and looks at VMA flags there and if VM_MIXEDMAP is missing among the VMA
> : flags, the application just fails to start complaining that DAX support is
> : missing in the kernel.
>
> http://lkml.kernel.org/r/alpine.deb.2.21.1809241054050.224...@chino.kir.corp.google.com
> : Commit 1860033237d4 ("mm: make PR_SET_THP_DISABLE immediately active")
> : introduced a regression in that userspace cannot always determine the set
> : of vmas where thp is ineligible.
> : Userspace relies on the "nh" flag being emitted as part of /proc/pid/smaps
> : to determine if a vma is eligible to be backed by hugepages.
> : Previous to this commit, prctl(PR_SET_THP_DISABLE, 1) would cause thp to
> : be disabled and emit "nh" as a flag for the corresponding vmas as part of
> : /proc/pid/smaps.  After the commit, thp is disabled by means of an mm
> : flag and "nh" is not emitted.
> : This causes smaps parsing libraries to assume a vma is eligible for thp
> : and ends up puzzling the user on why its memory is not backed by thp.
>
> In both cases userspace was relying on a semantic of a specific VMA
> flag. The primary reason why that happened is a lack of a proper
> internface. While this has been worked on and it will be fixed properly,
> it seems that our wording could see some refinement and be more vocal
> about semantic aspect of these flags as well.
>
> Cc: Jan Kara 
> Cc: Dan Williams 
> Cc: David Rientjes 
> Signed-off-by: Michal Hocko 
> ---
>  Documentation/filesystems/proc.txt | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/filesystems/proc.txt 
> b/Documentation/filesystems/proc.txt
> index 12a5e6e693b6..b1fda309f067 100644
> --- a/Documentation/filesystems/proc.txt
> +++ b/Documentation/filesystems/proc.txt
> @@ -496,7 +496,9 @@ flags associated with the particular virtual memory area 
> in two letter encoded
>
>  Note that there is no guarantee that every flag and associated mnemonic will
>  be present in all further kernel releases. Things get changed, the flags may
> -be vanished or the reverse -- new added.
> +be vanished or the reverse -- new added. Interpretatation of their meaning
> +might change in future as well. So each consumnent of these flags have to
> +follow each specific kernel version for the exact semantic.

Can we start to claw some of this back? Perhaps with a config option
to hide the flags to put applications on notice? I recall that when I
introduced CONFIG_IO_STRICT_DEVMEM it caused enough regressions that
distros did not enable it, but now a few years out I'm finding that it
is enabled in more places.

In any event,

Acked-by: Dan Williams 


Re: [PATCH] x86/mm: Drop usage of __flush_tlb_all() in kernel_physical_mapping_init()

2018-11-19 Thread Dan Williams
On Mon, Nov 19, 2018 at 3:43 PM Dave Hansen  wrote:
>
> On 11/19/18 3:19 PM, Dan Williams wrote:
> > Andy wondered why a path that can sleep was using __flush_tlb_all() [1]
> > and Dave confirmed the expectation for TLB flush is for modifying /
> > invalidating existing pte entries, but not initial population [2].
>
> I _think_ this is OK.
>
> But, could we sprinkle a few WARN_ON_ONCE(p*_present()) calls in there
> to help us sleep at night?

Makes sense, I'll add those for v2.


[PATCH] x86/mm: Drop usage of __flush_tlb_all() in kernel_physical_mapping_init()

2018-11-19 Thread Dan Williams
Commit f77084d96355 "x86/mm/pat: Disable preemption around
__flush_tlb_all()" addressed a case where __flush_tlb_all() is called
without preemption being disabled. It also left a warning to catch other
cases where preemption is not disabled. That warning triggers for the
memory hotplug path which is also used for persistent memory enabling:

 WARNING: CPU: 35 PID: 911 at ./arch/x86/include/asm/tlbflush.h:460
 RIP: 0010:__flush_tlb_all+0x1b/0x3a
 [..]
 Call Trace:
  phys_pud_init+0x29c/0x2bb
  kernel_physical_mapping_init+0xfc/0x219
  init_memory_mapping+0x1a5/0x3b0
  arch_add_memory+0x2c/0x50
  devm_memremap_pages+0x3aa/0x610
  pmem_attach_disk+0x585/0x700 [nd_pmem]

Andy wondered why a path that can sleep was using __flush_tlb_all() [1]
and Dave confirmed the expectation for TLB flush is for modifying /
invalidating existing pte entries, but not initial population [2]. Drop
the usage of __flush_tlb_all() in phys_{p4d,pud,pmd}_init() on the
expectation that this path is only ever populating empty entries for the
linear map. Note, at linear map teardown time there is a call to the
all-cpu flush_tlb_all() to invalidate the removed mappings.

[1]: https://lore.kernel.org/patchwork/patch/1009434/#1193941
[2]: https://lore.kernel.org/patchwork/patch/1009434/#1194540

Fixes: f77084d96355 ("x86/mm/pat: Disable preemption around __flush_tlb_all()")
Cc: Kirill A. Shutemov 
Cc: Sebastian Andrzej Siewior 
Cc: Thomas Gleixner 
Cc: Peter Zijlstra 
Cc: Borislav Petkov 
Cc: 
Reported-by: Andy Lutomirski 
Suggested-by: Dave Hansen 
Signed-off-by: Dan Williams 
---
 arch/x86/mm/init_64.c |6 --
 1 file changed, 6 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 5fab264948c2..de95db8ac52f 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -584,7 +584,6 @@ phys_pud_init(pud_t *pud_page, unsigned long paddr, 
unsigned long paddr_end,
   paddr_end,
   page_size_mask,
   prot);
-   __flush_tlb_all();
continue;
}
/*
@@ -627,7 +626,6 @@ phys_pud_init(pud_t *pud_page, unsigned long paddr, 
unsigned long paddr_end,
pud_populate(&init_mm, pud, pmd);
spin_unlock(&init_mm.page_table_lock);
}
-   __flush_tlb_all();
 
update_page_count(PG_LEVEL_1G, pages);
 
@@ -668,7 +666,6 @@ phys_p4d_init(p4d_t *p4d_page, unsigned long paddr, 
unsigned long paddr_end,
paddr_last = phys_pud_init(pud, paddr,
paddr_end,
page_size_mask);
-   __flush_tlb_all();
continue;
}
 
@@ -680,7 +677,6 @@ phys_p4d_init(p4d_t *p4d_page, unsigned long paddr, 
unsigned long paddr_end,
p4d_populate(&init_mm, p4d, pud);
spin_unlock(&init_mm.page_table_lock);
}
-   __flush_tlb_all();
 
return paddr_last;
 }
@@ -733,8 +729,6 @@ kernel_physical_mapping_init(unsigned long paddr_start,
if (pgd_changed)
sync_global_pgds(vaddr_start, vaddr_end - 1);
 
-   __flush_tlb_all();
-
return paddr_last;
 }
 



Re: [PATCH 1/7] node: Link memory nodes to their compute nodes

2018-11-16 Thread Dan Williams
On Thu, Nov 15, 2018 at 12:37 PM Matthew Wilcox  wrote:
>
> On Thu, Nov 15, 2018 at 07:59:20AM -0700, Keith Busch wrote:
> > On Thu, Nov 15, 2018 at 05:57:10AM -0800, Matthew Wilcox wrote:
> > > On Wed, Nov 14, 2018 at 03:49:14PM -0700, Keith Busch wrote:
> > > > Memory-only nodes will often have affinity to a compute node, and
> > > > platforms have ways to express that locality relationship.
> > > >
> > > > A node containing CPUs or other DMA devices that can initiate memory
> > > > access are referred to as "memory initiators". A "memory target" is a
> > > > node that provides at least one physical address range accessible to a
> > > > memory initiator.
> > >
> > > I think I may be confused here.  If there is _no_ link from node X to
> > > node Y, does that mean that node X's CPUs cannot access the memory on
> > > node Y?  In my mind, all nodes can access all memory in the system,
> > > just not with uniform bandwidth/latency.
> >
> > The link is just about which nodes are "local". It's like how nodes have
> > a cpulist. Other CPUs not in the node's list can access that node's memory,
> > but the ones in the mask are local, and provide useful optimization hints.
>
> So ... let's imagine a hypothetical system (I've never seen one built like
> this, but it doesn't seem too implausible).  Connect four CPU sockets in
> a square, each of which has some regular DIMMs attached to it.  CPU A is
> 0 hops to Memory A, one hop to Memory B and Memory C, and two hops from
> Memory D (each CPU only has two "QPI" links).  Then maybe there's some
> special memory extender device attached on the PCIe bus.  Now there's
> Memory B1 and B2 that's attached to CPU B and it's local to CPU B, but
> not as local as Memory B is ... and we'd probably _prefer_ to allocate
> memory for CPU A from Memory B1 than from Memory D.  But ... *mumble*,
> this seems hard.
>
> I understand you're trying to reflect what the HMAT table is telling you,
> I'm just really fuzzy on who's ultimately consuming this information
> and what decisions they're trying to drive from it.

The singular "local" is a limitation of the HMAT, but I would expect
the Linux translation of "local" would allow for multiple initiators
that can achieve some semblance of the "best" performance. Anything
less than best is going to have a wide range of variance and will
likely devolve to looking at the platform firmware data table
directly. The expected 80% case is software wants to be able to ask
"which CPUs should I run on to get the best access to this memory?"


Re: [PATCH 1/7] node: Link memory nodes to their compute nodes

2018-11-15 Thread Dan Williams
On Thu, Nov 15, 2018 at 7:02 AM Keith Busch  wrote:
>
> On Thu, Nov 15, 2018 at 05:57:10AM -0800, Matthew Wilcox wrote:
> > On Wed, Nov 14, 2018 at 03:49:14PM -0700, Keith Busch wrote:
> > > Memory-only nodes will often have affinity to a compute node, and
> > > platforms have ways to express that locality relationship.
> > >
> > > A node containing CPUs or other DMA devices that can initiate memory
> > > access are referred to as "memory initiators". A "memory target" is a
> > > node that provides at least one physical address range accessible to a
> > > memory initiator.
> >
> > I think I may be confused here.  If there is _no_ link from node X to
> > node Y, does that mean that node X's CPUs cannot access the memory on
> > node Y?  In my mind, all nodes can access all memory in the system,
> > just not with uniform bandwidth/latency.
>
> The link is just about which nodes are "local". It's like how nodes have
> a cpulist. Other CPUs not in the node's list can access that node's memory,
> but the ones in the mask are local, and provide useful optimization hints.
>
> Would a node mask be preferred to symlinks?

I think that would be more flexible, because the set of initiators
that may have "best" or "local" access to a target may be more than 1.


Re: [PATCH 2/5] mm/memory_hotplug: Create add/del_device_memory functions

2018-11-12 Thread Dan Williams
On Mon, Oct 15, 2018 at 8:31 AM Oscar Salvador
 wrote:
>
> From: Oscar Salvador 
>
> HMM/devm have a particular handling of memory-hotplug.
> They do not go through the common path, and so, they do not
> call either offline_pages() or online_pages().
>
> The operations they perform are the following ones:
>
> 1) Create the linear mapping in case the memory is not private
> 2) Initialize the pages and add the sections
> 3) Move the pages to ZONE_DEVICE
>
> Due to this particular handling of hot-add/remove memory from HMM/devm,
> I think it would be nice to provide a helper function in order to
> make this cleaner, and not populate other regions with code
> that should belong to memory-hotplug.
>
> The helpers are named:
>
> del_device_memory
> add_device_memory
>
> The idea is that add_device_memory will be in charge of:
>
> a) call either arch_add_memory() or add_pages(), depending on whether
>we want a linear mapping
> b) online the memory sections that correspond to the pfn range
> c) call move_pfn_range_to_zone() being zone ZONE_DEVICE to
>expand zone/pgdat spanned pages and initialize its pages
>
> del_device_memory, on the other hand, will be in charge of:
>
> a) offline the memory sections that correspond to the pfn range
> b) call shrink_zone_pgdat_pages(), which shrinks node/zone spanned pages.
> c) call either arch_remove_memory() or __remove_pages(), depending on
>whether we need to tear down the linear mapping or not
>
> The reason behind step b) from add_device_memory() and step a)
> from del_device_memory is that now find_smallest/biggest_section_pfn
> will have to check for online sections, and not for valid sections as
> they used to do, because we call offline_mem_sections() in
> offline_pages().
>
> In order to split up better the patches and ease the review,
> this patch will only make a) case work for add_device_memory(),
> and case c) for del_device_memory.
>
> The other cases will be added in the next patch.
>
> These two functions have to be called from devm/HMM code:
>
> add_device_memory:
> - devm_memremap_pages()
> - hmm_devmem_pages_create()
>
> del_device_memory:
> - hmm_devmem_release
> - devm_memremap_pages_release
>
> One thing I do not know is whether we can move kasan calls out of the
> hotplug lock or not.
> If we can, we could move the hotplug lock within add/del_device_memory().
>
> Signed-off-by: Oscar Salvador 
> ---
>  include/linux/memory_hotplug.h | 11 +++
>  kernel/memremap.c  | 11 ---
>  mm/hmm.c   | 33 +
>  mm/memory_hotplug.c| 41 +
>  4 files changed, 73 insertions(+), 23 deletions(-)

This collides with the refactoring of hmm, to be done in terms of
devm_memremap_pages(). I'd rather not introduce another common
function *beneath* hmm and devm_memremap_pages(); I'd rather make
devm_memremap_pages() itself the common function.

I plan to resubmit that cleanup after Plumbers. So, unless I'm
misunderstanding some other benefit, it's a nak from me on this patch as
it stands currently.
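
For context, a minimal sketch of the calling convention being argued for,
i.e. a pgmap-style driver using devm_memremap_pages() directly rather than
another helper underneath it. This assumes the v4.20-era interface; the
driver, the field usage, and the error handling are illustrative only
(percpu_ref init/teardown elided):

#include <linux/device.h>
#include <linux/err.h>
#include <linux/memremap.h>
#include <linux/percpu-refcount.h>
#include <linux/string.h>

struct my_pgmap_dev {
	struct dev_pagemap pgmap;	/* embedded, so container_of() works */
	struct percpu_ref ref;		/* pins the pages while they are in use */
	void *virt;
};

static int my_pgmap_attach(struct device *dev, struct resource *res)
{
	struct my_pgmap_dev *md = devm_kzalloc(dev, sizeof(*md), GFP_KERNEL);

	if (!md)
		return -ENOMEM;

	/* describe the physical range that needs ZONE_DEVICE struct pages */
	memcpy(&md->pgmap.res, res, sizeof(*res));
	md->pgmap.ref = &md->ref;	/* percpu_ref_init() elided for brevity */
	md->pgmap.type = MEMORY_DEVICE_PRIVATE; /* illustrative; pick per use case */

	/* arch_add_memory()/add_pages() and page init happen inside here */
	md->virt = devm_memremap_pages(dev, &md->pgmap);
	return PTR_ERR_OR_ZERO(md->virt);
}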


Re: [PATCH v2 1/6] mm/gup: finish consolidating error handling

2018-11-12 Thread Dan Williams
On Mon, Nov 12, 2018 at 7:45 AM Keith Busch  wrote:
>
> On Sat, Nov 10, 2018 at 12:50:36AM -0800, john.hubb...@gmail.com wrote:
> > From: John Hubbard 
> >
> > An upcoming patch wants to be able to operate on each page that
> > get_user_pages has retrieved. In order to do that, it's best to
> > have a common exit point from the routine. Most of this has been
> > taken care of by commit df06b37ffe5a4 ("mm/gup: cache dev_pagemap while
> > pinning pages"), but there was one case remaining.
> >
> > Also, there was still an unnecessary shadow declaration (with a
> > different type) of the "ret" variable, which this commit removes.
> >
> > Cc: Keith Busch 
> > Cc: Dan Williams 
> > Cc: Kirill A. Shutemov 
> > Cc: Dave Hansen 
> > Signed-off-by: John Hubbard 
> > ---
> >  mm/gup.c | 3 +--
> >  1 file changed, 1 insertion(+), 2 deletions(-)
> >
> > diff --git a/mm/gup.c b/mm/gup.c
> > index f76e77a2d34b..55a41dee0340 100644
> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -696,12 +696,11 @@ static long __get_user_pages(struct task_struct *tsk, 
> > struct mm_struct *mm,
> >   if (!vma || start >= vma->vm_end) {
> >   vma = find_extend_vma(mm, start);
> >   if (!vma && in_gate_area(mm, start)) {
> > - int ret;
> >   ret = get_gate_page(mm, start & PAGE_MASK,
> >   gup_flags, &vma,
> >   pages ? &pages[i] : NULL);
> >   if (ret)
> > - return i ? : ret;
> > + goto out;
> >   ctx.page_mask = 0;
> >   goto next_page;
> >   }
>
> This also fixes a potentially leaked dev_pagemap reference count if a
> failure occurs when an iteration crosses a vma boundary. I don't think
> it's normal to have different vmas on a user's mapped zone device memory,
> but good to fix anyway.

Does not sound abnormal to me; we should promote this as a fix for the
current cycle with an updated changelog.


Re: [PATCH 02/11] libnvdimm/security: change clear text nvdimm keys to encrypted keys

2018-11-11 Thread Dan Williams
[ add keyrings and lkml ]

On Sun, Nov 11, 2018 at 9:28 AM Mimi Zohar  wrote:
>
> On Fri, 2018-11-09 at 15:13 -0700, Dave Jiang wrote:
> > In order to make nvdimm more secure, encrypted keys will be used instead of
> > clear text keys. A master key will be created to seal encrypted nvdimm
> > keys. The master key can be a trusted key generated from TPM 2.0 or a less
> > secure user key.
>
> Trusted keys also work for TPM 1.2.  Are you intentionally limiting
> the master key to TPM 2.0?

TPM 1.2 is supported from a software perspective, however the
intersection of hardware platforms deploying security enabled NVDIMMs
and TPM 1.2 might be a null set.

> Traditionally there is a single master key for the system, which would
> be sealed to a set of boot time PCR values.  After decrypting all of
> the encrypted keys, the master key would be removed from the keyring
> and a PCR extended.  Extending a PCR would prevent the master key from
> being unsealed again and used to decrypt encrypted keys, without
> rebooting the system.  Normally this would be done before pivoting
> root.
>
> If you're not referring to the system master key and are intentionally
> limiting usage to TPM 2.0, more details on the master key security
> requirements should be included.

Oh, interesting point. I think we had been assuming a local +
unsealed-at-runtime nvdimm master key rather than a system-wide master
key. Yes, we need to rethink this in terms of supporting a sealed
system-key. This would seem to limit security actions, outside of
unlock, to always requiring a reboot. I.e. the nominal case is that we
boot up and unlock the DIMMs, but any subsequent security operation
like erase, or change-passphrase would require rebooting into an
environment where the system-master key is unsealed. I do think
re-provisioning keys and erasing DIMM contents are sufficiently
exceptional events that a reboot requirement is tolerable.

Is there already existing tooling around this to be able to schedule
master-key related actions to be deferred to an initrd environment?

> Using trusted keys that are encrypted/decrypted using a user key
> should really be limited to testing environments.

Makes sense.
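
For readers following along, the userspace side being discussed uses the
stock trusted/encrypted key syntax from
Documentation/security/keys/trusted-encrypted.rst. The key names below are
examples only, a pcrinfo= option is what would seal the master key to boot
PCRs, and in the testing-only case a "user" type master key would replace
the "trusted" one (with a matching user: prefix in the second command):

    # TPM-sealed master key ("trusted" type)
    keyctl add trusted nvdimm-master "new 32" @u

    # per-DIMM secret wrapped by the master key ("encrypted" type)
    keyctl add encrypted nvdimm:dimm0 "new trusted:nvdimm-master 32" @u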

> >
> > In the process of this conversion, the kernel cached key will be removed
> > in order to simplify the verification process. The hardware will be used to
> > verify the decrypted user payload directly.
>
> Making this sort of change implies there is no concern in breaking
> existing userspace.  Either the code hasn't yet been upstreamed or
> there are not any users. If the code hasn't been upstreamed, it would
> make more sense to squash the git history:
>
> - making code review easier
> - making the git history bisect safe

Yes, the old scheme is not upstream. I'll do the squash once we've
finalized the key-management details.

Thanks for the help Mimi.


Re: [PATCH] x86/mm/pat: Fix missing preemption disable for __native_flush_tlb()

2018-11-10 Thread Dan Williams
[ added Kirill ]

On Sat, Nov 10, 2018 at 4:19 PM Andy Lutomirski  wrote:
> > On Nov 10, 2018, at 3:57 PM, Dan Williams  wrote:
> >
> >> On Fri, Nov 9, 2018 at 4:22 PM Andy Lutomirski  wrote:
> >>
> >>
> >>
> >>> On Nov 9, 2018, at 4:05 PM, Dan Williams  wrote:
> >>>
> >>> Commit f77084d96355 "x86/mm/pat: Disable preemption around
> >>> __flush_tlb_all()" addressed a case where __flush_tlb_all() is called
> >>> without preemption being disabled. It also left a warning to catch other
> >>> cases where preemption is not disabled. That warning triggers for the
> >>> memory hotplug path which is also used for persistent memory enabling:
> >>
> >> I don’t think I agree with the patch. If you call __flush_tlb_all() in a 
> >> context where you might be *migrated*, then there’s a bug. We could change 
> >> the code to allow this particular use by checking that we haven’t done SMP 
> >> init yet, perhaps.
> >
> > Hmm, are you saying the entire kernel_physical_mapping_init() sequence
> > needs to run with preemption disabled?
>
> If it indeed can run late in boot or after boot, then it sure looks buggy. 
> Either the __flush_tlb_all() should be removed or it should be replaced with 
> flush_tlb_kernel_range(). It’s unclear to me why a flush is needed at all, 
> but if it’s needed, surely all CPUs need flushing.

Yeah, I don't think __flush_tlb_all() is needed at
kernel_physical_mapping_init() time, and at
kernel_physical_mapping_remove() time we do a full flush_tlb_all().

Kirill?


Re: [PATCH] x86/mm/pat: Fix missing preemption disable for __native_flush_tlb()

2018-11-10 Thread Dan Williams
On Fri, Nov 9, 2018 at 4:22 PM Andy Lutomirski  wrote:
>
>
>
> > On Nov 9, 2018, at 4:05 PM, Dan Williams  wrote:
> >
> > Commit f77084d96355 "x86/mm/pat: Disable preemption around
> > __flush_tlb_all()" addressed a case where __flush_tlb_all() is called
> > without preemption being disabled. It also left a warning to catch other
> > cases where preemption is not disabled. That warning triggers for the
> > memory hotplug path which is also used for persistent memory enabling:
>
> I don’t think I agree with the patch. If you call __flush_tlb_all() in a 
> context where you might be *migrated*, then there’s a bug. We could change 
> the code to allow this particular use by checking that we haven’t done SMP 
> init yet, perhaps.

Hmm, are you saying the entire kernel_physical_mapping_init() sequence
needs to run with preemption disabled?


[PATCH] x86/mm/pat: Fix missing preemption disable for __native_flush_tlb()

2018-11-09 Thread Dan Williams
Commit f77084d96355 "x86/mm/pat: Disable preemption around
__flush_tlb_all()" addressed a case where __flush_tlb_all() is called
without preemption being disabled. It also left a warning to catch other
cases where preemption is not disabled. That warning triggers for the
memory hotplug path which is also used for persistent memory enabling:

 WARNING: CPU: 35 PID: 911 at ./arch/x86/include/asm/tlbflush.h:460
 RIP: 0010:__flush_tlb_all+0x1b/0x3a
 [..]
 Call Trace:
  phys_pud_init+0x29c/0x2bb
  kernel_physical_mapping_init+0xfc/0x219
  init_memory_mapping+0x1a5/0x3b0
  arch_add_memory+0x2c/0x50
  devm_memremap_pages+0x3aa/0x610
  pmem_attach_disk+0x585/0x700 [nd_pmem]

Rather than audit all __flush_tlb_all() callers and add preemption
disabling at each call site, just do it internally in __flush_tlb_all().

Fixes: f77084d96355 ("x86/mm/pat: Disable preemption around __flush_tlb_all()")
Cc: Sebastian Andrzej Siewior 
Cc: Thomas Gleixner 
Cc: Andy Lutomirski 
Cc: Dave Hansen 
Cc: Peter Zijlstra 
Cc: Borislav Petkov 
Cc: 
Signed-off-by: Dan Williams 
---
 arch/x86/include/asm/tlbflush.h |8 
 arch/x86/mm/pageattr.c  |6 +-
 2 files changed, 5 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index d760611cfc35..049e0aca0fb5 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -454,11 +454,10 @@ static inline void __native_flush_tlb_one_user(unsigned 
long addr)
 static inline void __flush_tlb_all(void)
 {
/*
-* This is to catch users with enabled preemption and the PGE feature
-* and don't trigger the warning in __native_flush_tlb().
+*  Preemption needs to be disabled around __flush_tlb* calls
+*  due to CR3 reload in __native_flush_tlb().
 */
-   VM_WARN_ON_ONCE(preemptible());
-
+   preempt_disable();
if (boot_cpu_has(X86_FEATURE_PGE)) {
__flush_tlb_global();
} else {
@@ -467,6 +466,7 @@ static inline void __flush_tlb_all(void)
 */
__flush_tlb();
}
+   preempt_enable();
 }
 
 /*
diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index db7a10082238..f799076e3d57 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -2309,13 +2309,9 @@ void __kernel_map_pages(struct page *page, int numpages, 
int enable)
 
/*
 * We should perform an IPI and flush all tlbs,
-* but that can deadlock->flush only current cpu.
-* Preemption needs to be disabled around __flush_tlb_all() due to
-* CR3 reload in __native_flush_tlb().
+* but that can deadlock->flush only current cpu:
 */
-   preempt_disable();
__flush_tlb_all();
-   preempt_enable();
 
arch_flush_lazy_mmu_mode();
 }



Re: [patch 2/2] Documentation/process: Add tip tree handbook

2018-11-08 Thread Dan Williams
On Thu, Nov 8, 2018 at 12:14 PM Theodore Y. Ts'o  wrote:
>
> On Thu, Nov 08, 2018 at 09:19:33AM -0800, Dan Williams wrote:
> >
> > I know at least StGit mail does not grok that "#" notation. I've
> > stopped using it in favor of a "Fixes:" tag. I would think "Fixes:" is
> > preferred over "# " if only because it can be used to track
> > fixes to commits that have been backported to stable. Is there any
> > reason for "# " to continue in a world where we have "Fixes:"?
>
> The main annoyance I have with Fixes is because it can be a pain to
> figure out what the "# " would be.  Something like:
>
> % tag --contains DEADBEEF | grep ^v | head
>
> doesn't work because kernel version numbers don't sort obviously.  So
> v4.10 comes before v4.3 using a normal sort, and even sort -n doesn't
> do the right thing.

Unless I'm misunderstanding, I think you want:

git describe --contains $COMMIT --match=v[345]*

...which should give you the latest tagged kernel according to that match spec.


Re: [patch 0/2] Documentation/process: Add subsystem/tree handbook

2018-11-08 Thread Dan Williams
On Thu, Nov 8, 2018 at 8:32 AM Mark Brown  wrote:
>
> On Thu, Nov 08, 2018 at 11:21:33AM -0500, Theodore Y. Ts'o wrote:
> > On Thu, Nov 08, 2018 at 04:05:17PM +0100, Peter Zijlstra wrote:
>
> > > My plumbers schedule is already 100% booked with MCs and other things.
> > > There is no kernel-summit schedule details available as of yet, but it
> > > is already almost certain that I will not have time for anything in that
> > > track anyway :/
>
> > I thought there was a slot already scheduled on the refereed track,
> > "Towards a Linux Kernel Mainainer Handbook" (Tuesday at 4:45pm) for
> > this purpose?
>
> I'm not 100% sure that people will pick up that the topic is about a
> handbook for working with maintainers rather than a handbook for being a
> maintainer from that title...

The intent is a handbook for being a maintainer. However, in the
process of describing how a given sub-system is maintained one also
needs to describe the local rules. So contributors should be able to
glean what matters to a sub-system from a description of how that
sub-system is maintained. In other words the direct audience is
maintainers, but it also hopefully makes the process more transparent
for contributors.


Re: [patch 2/2] Documentation/process: Add tip tree handbook

2018-11-08 Thread Dan Williams
On Thu, Nov 8, 2018 at 1:13 AM Peter Zijlstra  wrote:
>
> On Thu, Nov 08, 2018 at 08:40:12AM +0100, Ingo Molnar wrote:
> > > + - Cc: ``cc-ed-person ``
> > > +
> > > +   If the patch should be backported to stable, then please add a '``Cc:
> > > +   sta...@vger.kernel.org``' tag, but do not Cc stable when sending your
> > > +   mail.
> >
> > Can I suggest a more canonical form:
> >
> >   Cc:  # v4.18 and later kernels
> >
> > It would be nice if people adding Cc: stable lines would actually try to
> > figure out which exact kernel versions are affected.

I know at least StGit mail does not grok that "#" notation. I've
stopped using it in favor of a "Fixes:" tag. I would think "Fixes:" is
preferred over "# " if only because it can be used to track
fixes to commits that have been backported to stable. Is there any
reason for "# " to continue in a world where we have "Fixes:"?


Re: [patch 0/2] Documentation/process: Add subsystem/tree handbook

2018-11-07 Thread Dan Williams
On Wed, Nov 7, 2018 at 11:49 AM Jonathan Corbet  wrote:
>
> On Wed, 07 Nov 2018 18:10:10 +0100
> Thomas Gleixner  wrote:
>
> > Mark recently suggested in one of the ksummit discussions to add subsystem
> > or tree specific maintainer handbooks to document subsystem/tree specific
> > development process information.
> >
> > The following series adds the general section and the tip tree specific
> > handbook.
>
> So this is an idea that has gone around for a while; developers often get
> into trouble when wandering into an unfamiliar part of the kernel, so
> documenting the quaint local customs might help.  Assuming people actually
> read the documentation, of course.
>
> What's here seems generally good, but I do have an overall worry that we
> may want to consider:
>
>   - How much do we want to support and document subsystem-specific quirks
> vs. promoting reasonable and consistent rules kernel-wide?
>
> There is a *lot* of stuff in the new tip manual.  Much of it, regarding
> coding style and the writing of changelogs, really seems like it should be
> global; if we need better documentation of that stuff, I'd really rather
> see that advice folded into the central documents.  Having two (or more)
> extensive coding-style documents doesn't seem like it's going to help us.
>
> The stuff that is truly specific to tip seems fairly minimal:
>
>   - what goes into tip
>   - the reverse fir tree thing
>   - tail comments, or the distaste thereabouts
>   - subject-line prefixes
>
> Having a tip-specific document that contains only those (plus whatever
> else I forgot to list) would, IMO, make it much more likely that readers
> would actually notice (and follow) the stuff that's specific to tip.
>
> See what I'm getting at here?  Am I totally out to lunch on this?

Not at all, and this is one of the thrusts of my talk next week at
Plumbers. I *do* want to propose that sub-systems document all their
local quirks. Then we can refactor the common ones into a global
document, have some discussion fodder if some sub-system specific
rules can be unified, and otherwise leave the freedom for individual
sub-systems to be different as long as it's documented.


Re: [tip:x86/urgent] x86/mm/pat: Disable preemption around __flush_tlb_all()

2018-11-05 Thread Dan Williams
On Mon, Oct 29, 2018 at 11:12 AM tip-bot for Sebastian Andrzej Siewior
 wrote:
>
> Commit-ID:  f77084d96355f5fba8e2c1fb3a51a393b1570de7
> Gitweb: 
> https://git.kernel.org/tip/f77084d96355f5fba8e2c1fb3a51a393b1570de7
> Author: Sebastian Andrzej Siewior 
> AuthorDate: Wed, 17 Oct 2018 12:34:32 +0200
> Committer:  Thomas Gleixner 
> CommitDate: Mon, 29 Oct 2018 19:04:31 +0100
>
> x86/mm/pat: Disable preemption around __flush_tlb_all()
>
> The WARN_ON_ONCE(__read_cr3() != build_cr3()) in switch_mm_irqs_off()
> triggers every once in a while during a snapshotted system upgrade.
>
> The warning triggers since commit decab0888e6e ("x86/mm: Remove
> preempt_disable/enable() from __native_flush_tlb()"). The callchain is:
>
>   get_page_from_freelist() -> post_alloc_hook() -> __kernel_map_pages()
>
> with CONFIG_DEBUG_PAGEALLOC enabled.
>
> Disable preemption during CR3 reset / __flush_tlb_all() and add a comment
> why preemption has to be disabled so it won't be removed accidentaly.
>
> Add another preemptible() check in __flush_tlb_all() to catch callers with
> enabled preemption when PGE is enabled, because PGE enabled does not
> trigger the warning in __native_flush_tlb(). Suggested by Andy Lutomirski.
>
> Fixes: decab0888e6e ("x86/mm: Remove preempt_disable/enable() from 
> __native_flush_tlb()")
> Signed-off-by: Sebastian Andrzej Siewior 
> Signed-off-by: Thomas Gleixner 
> Cc: Andy Lutomirski 
> Cc: Dave Hansen 
> Cc: Peter Zijlstra 
> Cc: Borislav Petkov 
> Cc: sta...@vger.kernel.org
> Link: https://lkml.kernel.org/r/20181017103432.zgv46nlu3hc7k...@linutronix.de
> ---
>  arch/x86/include/asm/tlbflush.h | 6 ++
>  arch/x86/mm/pageattr.c  | 6 +-
>  2 files changed, 11 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
> index 323a313947e0..d760611cfc35 100644
> --- a/arch/x86/include/asm/tlbflush.h
> +++ b/arch/x86/include/asm/tlbflush.h
> @@ -453,6 +453,12 @@ static inline void __native_flush_tlb_one_user(unsigned 
> long addr)
>   */
>  static inline void __flush_tlb_all(void)
>  {
> +   /*
> +* This is to catch users with enabled preemption and the PGE feature
> +* and don't trigger the warning in __native_flush_tlb().
> +*/
> +   VM_WARN_ON_ONCE(preemptible());

This warning triggers 100% of the time for the pmem use case and it
seems it would also trigger for any memory hotplug use case that uses
arch_add_memory().

 WARNING: CPU: 35 PID: 911 at ./arch/x86/include/asm/tlbflush.h:460
__flush_tlb_all+0x1b/0x3a
 CPU: 35 PID: 911 Comm: systemd-udevd Tainted: G   OE
4.20.0-rc1+ #2583
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
rel-1.11.1-0-g0551a4be2c-prebuilt.qemu-project.org 04/01/2014
 RIP: 0010:__flush_tlb_all+0x1b/0x3a
 [..]
 Call Trace:
  phys_pud_init+0x29c/0x2bb
  kernel_physical_mapping_init+0xfc/0x219
  init_memory_mapping+0x1a5/0x3b0
  arch_add_memory+0x2c/0x50
  devm_memremap_pages+0x3aa/0x610
  pmem_attach_disk+0x585/0x700 [nd_pmem]

...could we just move the preempt_disable() inside __flush_tlb_all()?


Re: [PATCH] docs: Extend trusted keys documentation for TPM 2.0

2018-11-05 Thread Dan Williams
On Fri, Oct 19, 2018 at 3:19 AM Stefan Berger  wrote:
>
> Extend the documentation for trusted keys with documentation for how to
> set up a key for a TPM 2.0 so it can be used with a TPM 2.0 as well.
>
> Signed-off-by: Stefan Berger 
> Reviewed-by: Mimi Zohar 

Thanks for the updates:

Acked-by: Dan Williams 


Re: [PATCH v5 4/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap

2018-10-30 Thread Dan Williams
On Mon, Oct 29, 2018 at 11:29 PM Michal Hocko  wrote:
>
> On Mon 29-10-18 12:59:11, Alexander Duyck wrote:
> > On Mon, 2018-10-29 at 19:18 +0100, Michal Hocko wrote:
[..]
> > The patches Andrew pushed addressed the immediate issue so that now
> > systems with nvdimm/DAX memory can at least initialize quick enough
> > that systemd doesn't refuse to mount the root file system due to a
> > timeout.
>
> This is about the first time you actually mention that. I have re-read
> the cover letter and all changelogs of patches in this serious. Unless I
> have missed something there is nothing about real users hitting issues
> out there. nvdimm is still considered a toy because there is no real HW
> users can play with.

Yes, you have missed something, because that's incorrect. There have been
public articles about these parts sampling since May.


https://www.anandtech.com/show/12828/intel-launches-optane-dimms-up-to-512gb-apache-pass-is-here

That testing identified this initialization performance problem and
thankfully got it addressed in time for the current merge window.


Re: [PATCH v3] x86/numa_emulation: Fix uniform-split numa emulation

2018-10-25 Thread Dan Williams
On Thu, Oct 25, 2018 at 1:52 PM Dave Hansen  wrote:
>
> On 10/25/18 1:26 PM, Dan Williams wrote:
> > --- a/arch/x86/mm/numa_emulation.c
> > +++ b/arch/x86/mm/numa_emulation.c
> > @@ -400,9 +400,17 @@ void __init numa_emulation(struct numa_meminfo 
> > *numa_meminfo, int numa_dist_cnt)
> >   n = simple_strtoul(emu_cmdline, &emu_cmdline, 0);
> >   ret = -1;
> >   for_each_node_mask(i, physnode_mask) {
> > + /*
> > +  * The reason we pass in blk[0] is due to
> > +  * numa_remove_memblk_from() called by
> > +  * emu_setup_memblk() will delete entry 0
> > +  * and then move everything else up in the pi.blk
> > +  * array. Therefore we should always be looking
> > +  * at blk[0].
> > +  */
> >   ret = split_nodes_size_interleave_uniform(&ei, &pi,
> > - pi.blk[i].start, pi.blk[i].end, 0,
> > - n, &pi.blk[i], nid);
> > + pi.blk[0].start, pi.blk[0].end, 0,
> > + n, &pi.blk[0], nid);
>
> So, has this *ever* worked on a multi-socket configuration?  Or has it
> just never been run on a multi-socket configuration?

It happened to work on 2-socket. We only saw issues when moving to
4-socket and above, and sometimes only a grey failure with an
odd-sized node rather than an outright crash / boot failure.

> Either way, nice changelog, and nice comments.  I'd have some minor nits
> if you have to respin it, but otherwise:
>
> Reviewed-by: Dave Hansen 

Thanks.


[PATCH v3] x86/numa_emulation: Fix uniform-split numa emulation

2018-10-25 Thread Dan Williams
From: Dave Jiang 

The numa_emulation() routine in the 'uniform' case walks through all the
physical 'memblk' instances and divides them into N emulated nodes with
split_nodes_size_interleave_uniform(). As each physical node is consumed
it is removed from the physical memblk array in the
numa_remove_memblk_from() helper. Since
split_nodes_size_interleave_uniform() handles advancing the array as the
'memblk' is consumed it is expected that the base of the array is always
specified as the argument.

Otherwise, on multi-socket (> 2) configurations the uniform-split
capability can generate an invalid numa configuration leading to boot
failures with signatures like the following:

rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
Sending NMI from CPU 0 to CPUs 2:
NMI backtrace for cpu 2
CPU: 2 PID: 1332 Comm: pgdatinit0 Not tainted 
4.19.0-rc8-next-20181019-baseline #59
RIP: 0010:__init_single_page.isra.74+0x81/0x90
[..]
Call Trace:
 deferred_init_pages+0xaa/0xe3
 deferred_init_memmap+0x18f/0x318
 kthread+0xf8/0x130
 ? deferred_free_pages.isra.105+0xc9/0xc9
 ? kthread_stop+0x110/0x110
 ret_from_fork+0x35/0x40

Cc: x...@kernel.org
Cc: Borislav Petkov 
Cc: Ingo Molnar 
Cc: "H. Peter Anvin" 
Cc: Andy Lutomirski 
Cc: Thomas Gleixner 
Cc: Peter Zijlstra 
Cc: Dave Hansen 
Cc: 
Fixes: 1f6a2c6d9f121 ("x86/numa_emulation: Introduce uniform split capability")
Signed-off-by: Dave Jiang 
Tested-by: Alexander Duyck 
Signed-off-by: Dan Williams 
---
Changes since v2: https://lore.kernel.org/patchwork/patch/988541/

* Update the changelog with details from testing by Alex

 arch/x86/mm/numa_emulation.c |   12 ++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/numa_emulation.c b/arch/x86/mm/numa_emulation.c
index b54d52a2d00a..d71d72cf6c66 100644
--- a/arch/x86/mm/numa_emulation.c
+++ b/arch/x86/mm/numa_emulation.c
@@ -400,9 +400,17 @@ void __init numa_emulation(struct numa_meminfo 
*numa_meminfo, int numa_dist_cnt)
n = simple_strtoul(emu_cmdline, &emu_cmdline, 0);
ret = -1;
for_each_node_mask(i, physnode_mask) {
+   /*
+* The reason we pass in blk[0] is due to
+* numa_remove_memblk_from() called by
+* emu_setup_memblk() will delete entry 0
+* and then move everything else up in the pi.blk
+* array. Therefore we should always be looking
+* at blk[0].
+*/
ret = split_nodes_size_interleave_uniform(&ei, &pi,
-   pi.blk[i].start, pi.blk[i].end, 0,
-   n, &pi.blk[i], nid);
+   pi.blk[0].start, pi.blk[0].end, 0,
+   n, &pi.blk[0], nid);
if (ret < 0)
break;
if (ret < n) {



Re: [PATCH 0/9] Allow persistent memory to be used like normal RAM

2018-10-23 Thread Dan Williams
On Tue, Oct 23, 2018 at 11:17 AM Dave Hansen  wrote:
>
> >> This series adds a new "driver" to which pmem devices can be
> >> attached.  Once attached, the memory "owned" by the device is
> >> hot-added to the kernel and managed like any other memory.  On
> >
> > Would this memory be considered volatile (with the driver initializing
> > it to zeros), or persistent (contents are presented unchanged,
> > applications may guarantee persistence by using cache flush
> > instructions, fence instructions, and writing to flush hint addresses
> > per the persistent memory programming model)?
>
> Volatile.
>
> >> I expect udev can automate this by setting up a rule to watch for
> >> device-dax instances by UUID and call a script to do the detach /
> >> reattach dance.
> >
> > Where would that rule be stored? Storing it on another device
> > is problematic. If that rule is lost, it could confuse other
> > drivers trying to grab device DAX devices for use as persistent
> > memory.
>
> Well, we do lots of things like stable device naming from udev scripts.
>  We depend on them not being lost.  At least this "fails safe" so we'll
> default to persistence instead of defaulting to "eat your data".
>

Right, and at least for the persistent memory to volatile conversion
case we will have the UUID to positively identify the DAX device. So
it will indeed "fail safe" and just become a dax_pmem device again if
the configuration is lost. We'll likely need to create/use a "by-path"
scheme for non-pmem use cases.
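
As a sketch of the udev automation alluded to above (the rule file, helper
script, and match keys are hypothetical; a real rule would likely resolve
the namespace UUID via ndctl inside the helper before deciding what to do):

    # /etc/udev/rules.d/99-dax-reconfig.rules (hypothetical)
    ACTION=="add", SUBSYSTEM=="dax", KERNEL=="dax[0-9]*.[0-9]*", \
        RUN+="/usr/local/sbin/dax-reconfig %k"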


Re: [PATCH] pfn_t: force '~' to be parsed as an unary operator

2018-10-22 Thread Dan Williams
[ added Andrew ]

Patch here: https://lore.kernel.org/patchwork/patch/1002234/

On Sun, Oct 21, 2018 at 8:00 AM Sebastien Boisvert
 wrote:
>
> Tracing the event "fs_dax:dax_pmd_insert_mapping" with perf produces this
> warning:
>   [fs_dax:dax_pmd_insert_mapping] unknown op '~'
>
> It is printed in process_op (tools/lib/traceevent/event-parse.c) because '~'
> is parsed as a binary operator.
>
> perf reads the format of fs_dax:dax_pmd_insert_mapping ("print fmt") from
> /sys/kernel/debug/tracing/events/fs_dax/dax_pmd_insert_mapping/format .
>
> The format contains:
>
> ~(((u64) ~(~(((1UL) << 12)-1)))
>  ^
>  \ interpreted as a binary operator by process_op().
>
> This part is generated in the declaration of the event class
> dax_pmd_insert_mapping_class in include/trace/events/fs_dax.h :
>
> __print_flags_u64(__entry->pfn_val & PFN_FLAGS_MASK, "|",
> PFN_FLAGS_TRACE),
>
> This patch adds a pair of parentheses in the declaration of PFN_FLAGS_MASK to
> make sure that '~' is parsed as a unary operator by perf.
>
> The part of the format that was problematic is now:
>
> ~(((u64) (~(~(((1UL) << 12)-1
>
> Now, all the '~' are parsed as unary operators.
>
> Cc: Dan Williams 


Acked-by: Dan Williams 


> Cc: "Steven Rostedt (VMware)" 
> Cc: Arnaldo Carvalho de Melo 
> Cc: "Tzvetomir Stoyanov (VMware)" 
> Cc: Namhyung Kim 
> Cc: Ross Zwisler 
> Cc: Elenie Godzaridis 
> Cc: linux-perf-us...@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Signed-off-by: Sebastien Boisvert 
> ---
>  include/linux/pfn_t.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/include/linux/pfn_t.h b/include/linux/pfn_t.h
> index 21713dc14ce2..673546ba7342 100644
> --- a/include/linux/pfn_t.h
> +++ b/include/linux/pfn_t.h
> @@ -10,7 +10,7 @@
>   * PFN_DEV - pfn is not covered by system memmap by default
>   * PFN_MAP - pfn has a dynamic page mapping established by a device driver
>   */
> -#define PFN_FLAGS_MASK (((u64) ~PAGE_MASK) << (BITS_PER_LONG_LONG - PAGE_SHIFT))
> +#define PFN_FLAGS_MASK (((u64) (~PAGE_MASK)) << (BITS_PER_LONG_LONG - PAGE_SHIFT))
>  #define PFN_SG_CHAIN (1ULL << (BITS_PER_LONG_LONG - 1))
>  #define PFN_SG_LAST (1ULL << (BITS_PER_LONG_LONG - 2))
>  #define PFN_DEV (1ULL << (BITS_PER_LONG_LONG - 3))
> --
> 2.17.1
>


Re: [PATCH] pstore/ram: Clarify resource reservation labels

2018-10-18 Thread Dan Williams
On Thu, Oct 18, 2018 at 3:43 PM Luck, Tony  wrote:
>
> > If the filesystem existed on the namespace before the user specified
> > the ramoops command line then ramoops will clobber the filesystem and
> > the user will only find out when mount later fails. All the kernel
> > will say is:
> >
> > dev_warn(dev, "could not reserve region %pR\n", res);
> >
> > ...from the pmem driver, and then the only way to figure who the
> > conflict is with is to look at /proc/iomem, but the damage is already
> > likely done by that point.
>
> When you set up a ramoops region in pmem, write a magic header block
> at the start of the area.  Then when pstore/ramoops goes to find the
> region, it checks for the magic header.  Not there? Don't write to the
> region. Problem solved.

That's effectively what this registration proposal will do, with the
caveat that the user never gets to write that magic header without going
through the nvdimm infrastructure to set it up and assert there is
nothing there already.


Re: [PATCH] pstore/ram: Clarify resource reservation labels

2018-10-18 Thread Dan Williams
On Thu, Oct 18, 2018 at 3:26 PM Kees Cook  wrote:
>
> On Thu, Oct 18, 2018 at 3:23 PM, Dan Williams  
> wrote:
> > On Thu, Oct 18, 2018 at 3:19 PM Kees Cook  wrote:
> >>
> >> On Thu, Oct 18, 2018 at 2:35 PM, Dan Williams  
> >> wrote:
> >> > On Thu, Oct 18, 2018 at 1:31 PM Kees Cook  wrote:
> > [..]
> >> > I cringe at users picking addresses because someone is going to enable
> >> > ramoops on top of their persistent memory namespace and wonder why
> >> > their filesystem got clobbered. Should attempts to specify an explicit
> >> > ramoops range that intersects EfiPersistentMemory fail by default? The
> >> > memmap=ss!nn parameter has burned us many times with users picking the
> >> > wrong address, so I'd be inclined to hide this ramoops sharp edge from
> >> > them.
> >>
> >> Yeah, this is what I'm trying to solve. I'd like ramoops to find the
> >> address itself, but it has to do it really early, so if I can't have
> >> nvdimm handle it directly, will having regions already allocated with
> >> request_mem_region() "get along" with the rest of nvdimm?
> >
> > If the filesystem existed on the namespace before the user specified
> > the ramoops command line then ramoops will clobber the filesystem and
> > the user will only find out when mount later fails. All the kernel
> > will say is:
> >
> > dev_warn(dev, "could not reserve region %pR\n", res);
> >
> > ...from the pmem driver, and then the only way to figure who the
> > conflict is with is to look at /proc/iomem, but the damage is already
> > likely done by that point.
>
> Yeah, bleh. Okay, well, let's just skip this for now, since ramoops
> doesn't do _anything_ with pmem now. No need to go crazy right from
> the start. Instead, let's make it work "normally", and if someone
> needs it for very early boot, they can manually enter the mem_address.
>
> How should I attach a ramoops_probe() call to pmem?

To me this looks like it would be an nvdimm glue driver whose entire
job is to attach to the namespace, fill out some
ramoops_platform_data, and then register a "ramoops" platform_device
for the ramoops driver to find.
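
A sketch of how small that glue could be, assuming the existing
ramoops_platform_data / platform_device_register_data() interfaces; the
function name, the way the physical range is handed over by the (elided)
nvdimm attach path, and the record sizes are made up for illustration:

#include <linux/err.h>
#include <linux/platform_device.h>
#include <linux/pstore_ram.h>
#include <linux/sizes.h>

static struct platform_device *ramoops_pdev;

/* called by the (elided) nvdimm attach path with the namespace's range */
static int register_ramoops_range(phys_addr_t start, size_t size)
{
	struct ramoops_platform_data pdata = {
		.mem_address	= start,
		.mem_size	= size,
		.record_size	= SZ_64K,	/* arbitrary example sizes */
		.console_size	= SZ_64K,
		.pmsg_size	= SZ_64K,
		.dump_oops	= 1,
	};

	/* fs/pstore/ram.c binds to the "ramoops" platform device name */
	ramoops_pdev = platform_device_register_data(NULL, "ramoops", -1,
						     &pdata, sizeof(pdata));
	return PTR_ERR_OR_ZERO(ramoops_pdev);
}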


Re: [PATCH] pstore/ram: Clarify resource reservation labels

2018-10-18 Thread Dan Williams
On Thu, Oct 18, 2018 at 3:19 PM Kees Cook  wrote:
>
> On Thu, Oct 18, 2018 at 2:35 PM, Dan Williams  
> wrote:
> > On Thu, Oct 18, 2018 at 1:31 PM Kees Cook  wrote:
[..]
> > I cringe at users picking addresses because someone is going to enable
> > ramoops on top of their persistent memory namespace and wonder why
> > their filesystem got clobbered. Should attempts to specify an explicit
> > ramoops range that intersects EfiPersistentMemory fail by default? The
> > memmap=ss!nn parameter has burned us many times with users picking the
> > wrong address, so I'd be inclined to hide this ramoops sharp edge from
> > them.
>
> Yeah, this is what I'm trying to solve. I'd like ramoops to find the
> address itself, but it has to do it really early, so if I can't have
> nvdimm handle it directly, will having regions already allocated with
> request_mem_region() "get along" with the rest of nvdimm?

If the filesystem existed on the namespace before the user specified
the ramoops command line then ramoops will clobber the filesystem and
the user will only find out when mount later fails. All the kernel
will say is:

dev_warn(dev, "could not reserve region %pR\n", res);

...from the pmem driver, and then the only way to figure who the
conflict is with is to look at /proc/iomem, but the damage is already
likely done by that point.


Re: [PATCH] pstore/ram: Clarify resource reservation labels

2018-10-18 Thread Dan Williams
On Thu, Oct 18, 2018 at 1:31 PM Kees Cook  wrote:
>
> On Thu, Oct 18, 2018 at 8:33 AM, Dan Williams  
> wrote:
> > [ add Ross ]
>
> Hi Ross! :)
>
> > On Thu, Oct 18, 2018 at 12:15 AM Kees Cook  wrote:
> >> As for nvdimm specifically, yes, I'd love to get pstore hooked up
> >> correctly to nvdimm. How do the namespaces work? Right now pstore
> >> depends on one of platform driver data, device tree specification, or
> >> manual module parameters.
> >
> > From the userspace side we have the ndctl utility to wrap
> > personalities on top of namespaces. So for example, I envision we
> > would be able to do:
> >
> > ndctl create-namespace --mode=pstore --size=128M
> >
> > ...and create a small namespace that will register with the pstore 
> > sub-system.
> >
> > On the kernel side this would involve registering a 'pstore_dev' child
> > / seed device under each region device. The 'seed-device' sysfs scheme
> > is described in our documentation [1]. The short summary is ndctl
> > finds a seed device assigns a namespace to it and then binding that
> > device to a driver causes it to be initialized by the kernel.
> >
> > [1]: https://www.kernel.org/doc/Documentation/nvdimm/nvdimm.txt
>
> Interesting!
>
> Really, this would be a way to configure "ramoops" (the persistent RAM
> backend to pstore), rather than pstore itself (pstore is just the
> framework). From reading the ndctl man page it sounds like there isn't
> a way to store configuration information beyond just size?
>
> ramoops will auto-configure itself and fill available space using its
> default parameters, but it might be nice to have a way to store that
> somewhere (traditionally it's part of device tree or platform data).
> ramoops could grow a "header", but normally the regions are very small
> so I've avoided that.
>
> I'm not sure I understand the right way to glue ramoops_probe() to the
> "seed-device" stuff. (It needs to be probed VERY early to catch early
> crashes -- ramoops uses postcore_initcall() normally.)

Irk, yeah, that's early. On some configurations we can't delineate
namespaces until after ACPI has come up. Ideally the address range
would be reserved and communicated in the memory-map from the BIOS.

In EFI terms I think early ramoops is only suitable for
EfiACPIMemoryNVS, but we could certainly support a late arriving
ramoops for EfiPersistentMemory with this proposed namespace scheme.

I cringe at users picking addresses because someone is going to enable
ramoops on top of their persistent memory namespace and wonder why
their filesystem got clobbered. Should attempts to specify an explicit
ramoops range that intersects EfiPersistentMemory fail by default? The
memmap=ss!nn parameter has burned us many times with users picking the
wrong address, so I'd be inclined to hide this ramoops sharp edge from
them.
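
One way to implement that "fail by default" behavior would be a check along
these lines (a sketch using the existing region_intersects() helper; whether
to also cover IORES_DESC_PERSISTENT_MEMORY_LEGACY, or to offer an explicit
opt-out flag, is a policy question):

#include <linux/errno.h>
#include <linux/ioport.h>

/* reject a user-supplied ramoops range that the firmware map calls pmem */
static int ramoops_check_mem_address(phys_addr_t start, size_t size)
{
	if (region_intersects(start, size, IORESOURCE_MEM,
			      IORES_DESC_PERSISTENT_MEMORY) != REGION_DISJOINT)
		return -EBUSY;
	return 0;
}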


Re: [PATCH] pstore/ram: Clarify resource reservation labels

2018-10-18 Thread Dan Williams
[ add Ross ]

On Thu, Oct 18, 2018 at 12:15 AM Kees Cook  wrote:
>
> On Wed, Oct 17, 2018 at 5:49 PM, Dan Williams  
> wrote:
> > On Wed, Oct 17, 2018 at 5:29 PM Kees Cook  wrote:
> >>
> >> When ramoops reserved a memory region in the kernel, it had an unhelpful
> >> label of "persistent_memory". When reading /proc/iomem, it would be
> >> repeated many times, did not hint that it was ramoops in particular,
> >> and didn't clarify very much about what each was used for:
> >>
> >> 4-407ff : Persistent Memory (legacy)
> >>   4-40fff : persistent_memory
> >>   41000-41fff : persistent_memory
> >> ...
> >>   4000ff000-4000f : persistent_memory
> >>
> >> Instead, this adds meaningful labels for how the various regions are
> >> being used:
> >>
> >> 4-407ff : Persistent Memory (legacy)
> >>   4-40fff : ramoops:dump(0/252)
> >>   41000-41fff : ramoops:dump(1/252)
> >> ...
> >>   4000fc000-4000fcfff : ramoops:dump(252/252)
> >>   4000fd000-4000fdfff : ramoops:console
> >>   4000fe000-4000fe3ff : ramoops:ftrace(0/3)
> >>   4000fe400-4000fe7ff : ramoops:ftrace(1/3)
> >>   4000fe800-4000febff : ramoops:ftrace(2/3)
> >>   4000fec00-4000fefff : ramoops:ftrace(3/3)
> >>   4000ff000-4000f : ramoops:pmsg
> >
> > Hopefully ramoops is doing request_region() before trying to do
> > anything with its ranges, because it's going to collide with the pmem
> > driver doing a request_region(). If we want to have pstore use pmem as
> > a backing store that's a new drivers/nvdimm/ namespace personality
> > driver to turn around and register a persistent memory range with
> > pstore rather than the pmem block-device driver.
>
> Yup: it's using request_mem_region() (that's where the labels above
> are assigned).
>
> As for nvdimm specifically, yes, I'd love to get pstore hooked up
> correctly to nvdimm. How do the namespaces work? Right now pstore
> depends one of platform driver data, device tree specification, or
> manual module parameters.

From the userspace side we have the ndctl utility to wrap
personalities on top of namespaces. So for example, I envision we
would be able to do:

ndctl create-namespace --mode=pstore --size=128M

...and create a small namespace that will register with the pstore sub-system.

On the kernel side this would involve registering a 'pstore_dev' child
/ seed device under each region device. The 'seed-device' sysfs scheme
is described in our documentation [1]. The short summary is ndctl
finds a seed device assigns a namespace to it and then binding that
device to a driver causes it to be initialized by the kernel.

[1]: https://www.kernel.org/doc/Documentation/nvdimm/nvdimm.txt


Re: [PATCH] pstore/ram: Clarify resource reservation labels

2018-10-17 Thread Dan Williams
On Wed, Oct 17, 2018 at 5:29 PM Kees Cook  wrote:
>
> When ramoops reserved a memory region in the kernel, it had an unhelpful
> label of "persistent_memory". When reading /proc/iomem, it would be
> repeated many times, did not hint that it was ramoops in particular,
> and didn't clarify very much about what each was used for:
>
> 4-407ff : Persistent Memory (legacy)
>   4-40fff : persistent_memory
>   41000-41fff : persistent_memory
> ...
>   4000ff000-4000f : persistent_memory
>
> Instead, this adds meaningful labels for how the various regions are
> being used:
>
> 4-407ff : Persistent Memory (legacy)
>   4-40fff : ramoops:dump(0/252)
>   41000-41fff : ramoops:dump(1/252)
> ...
>   4000fc000-4000fcfff : ramoops:dump(252/252)
>   4000fd000-4000fdfff : ramoops:console
>   4000fe000-4000fe3ff : ramoops:ftrace(0/3)
>   4000fe400-4000fe7ff : ramoops:ftrace(1/3)
>   4000fe800-4000febff : ramoops:ftrace(2/3)
>   4000fec00-4000fefff : ramoops:ftrace(3/3)
>   4000ff000-4000f : ramoops:pmsg

Hopefully ramoops is doing request_region() before trying to do
anything with its ranges, because it's going to collide with the pmem
driver doing a request_region(). If we want to have pstore use pmem as
a backing store that's a new drivers/nvdimm/ namespace personality
driver to turn around and register a persistent memory range with
pstore rather than the pmem block-device driver.
>
> Signed-off-by: Kees Cook 
> ---
>  fs/pstore/ram.c| 16 +---
>  fs/pstore/ram_core.c   | 11 +++
>  include/linux/pstore_ram.h |  3 ++-
>  3 files changed, 22 insertions(+), 8 deletions(-)
>
> diff --git a/fs/pstore/ram.c b/fs/pstore/ram.c
> index 6ea9cd801044..ffcff6516e89 100644
> --- a/fs/pstore/ram.c
> +++ b/fs/pstore/ram.c
> @@ -587,9 +587,16 @@ static int ramoops_init_przs(const char *name,
> goto fail;
>
> for (i = 0; i < *cnt; i++) {
> +   char *label;
> +
> +   if (*cnt == 1)
> +   label = kasprintf(GFP_KERNEL, "ramoops:%s", name);
> +   else
> +   label = kasprintf(GFP_KERNEL, "ramoops:%s(%d/%d)",
> + name, i, *cnt - 1);
> prz_ar[i] = persistent_ram_new(*paddr, zone_sz, sig,
> - &cxt->ecc_info,
> - cxt->memtype, flags);
> +  &cxt->ecc_info,
> +  cxt->memtype, flags, label);
> if (IS_ERR(prz_ar[i])) {
> err = PTR_ERR(prz_ar[i]);
> dev_err(dev, "failed to request %s mem region 
> (0x%zx@0x%llx): %d\n",
> @@ -619,6 +626,8 @@ static int ramoops_init_prz(const char *name,
> struct persistent_ram_zone **prz,
> phys_addr_t *paddr, size_t sz, u32 sig)
>  {
> +   char *label;
> +
> if (!sz)
> return 0;
>
> @@ -629,8 +638,9 @@ static int ramoops_init_prz(const char *name,
> return -ENOMEM;
> }
>
> +   label = kasprintf(GFP_KERNEL, "ramoops:%s", name);
> *prz = persistent_ram_new(*paddr, sz, sig, &cxt->ecc_info,
> - cxt->memtype, 0);
> + cxt->memtype, 0, label);
> if (IS_ERR(*prz)) {
> int err = PTR_ERR(*prz);
>
> diff --git a/fs/pstore/ram_core.c b/fs/pstore/ram_core.c
> index 0792595ebcfb..12e21f789194 100644
> --- a/fs/pstore/ram_core.c
> +++ b/fs/pstore/ram_core.c
> @@ -438,11 +438,11 @@ static void *persistent_ram_vmap(phys_addr_t start, 
> size_t size,
>  }
>
>  static void *persistent_ram_iomap(phys_addr_t start, size_t size,
> -   unsigned int memtype)
> +   unsigned int memtype, char *label)
>  {
> void *va;
>
> -   if (!request_mem_region(start, size, "persistent_ram")) {
> +   if (!request_mem_region(start, size, label ?: "ramoops")) {
> pr_err("request mem region (0x%llx@0x%llx) failed\n",
> (unsigned long long)size, (unsigned long long)start);
> return NULL;
> @@ -470,7 +470,8 @@ static int persistent_ram_buffer_map(phys_addr_t start, 
> phys_addr_t size,
> if (pfn_valid(start >> PAGE_SHIFT))
> prz->vaddr = persistent_ram_vmap(start, size, memtype);
> else
> -   prz->vaddr = persistent_ram_iomap(start, size, memtype);
> +   prz->vaddr = persistent_ram_iomap(start, size, memtype,
> + prz->label);
>
> if (!prz->vaddr) {
> pr_err("%s: Failed to map 0x%llx pages at 0x%llx\n", __func__,
> @@ -541,12 +542,13 @@ void persistent_ram_free(struct persistent_ram_zone 
> *prz)
> prz->ecc_info.par = NULL;
>
> 

Re: [PATCH v7 1/7] mm, devm_memremap_pages: Mark devm_memremap_pages() EXPORT_SYMBOL_GPL

2018-10-17 Thread Dan Williams
On Wed, Oct 17, 2018 at 1:18 AM Michal Hocko  wrote:
>
> On Fri 12-10-18 10:49:37, Dan Williams wrote:
> > devm_memremap_pages() is a facility that can create struct page entries
> > for any arbitrary range and give drivers the ability to subvert core
> > aspects of page management.
> >
> > Specifically the facility is tightly integrated with the kernel's memory
> > hotplug functionality. It injects an altmap argument deep into the
> > architecture specific vmemmap implementation to allow allocating from
> > specific reserved pages, and it has Linux specific assumptions about
> > page structure reference counting relative to get_user_pages() and
> > get_user_pages_fast(). It was an oversight and a mistake that this was
> > not marked EXPORT_SYMBOL_GPL from the outset.
>
> One thing is still not clear to me. Both devm_memremap_* and
> hmm_devmem_add essentially do the same thing AFAICS. They both allow to
> hotplug a device memory. Both rely on the hotplug code (namely
> add_pages) which itself is not exported to modules. One is GPL only
> while the later is a general export. Is this mismatch desirable?

That is resolved by the last patch in this series.

> API exported by the core hotplug is ad-hoc to say the least. Symbols
> that we actually export are GPL mostly (only try_offline_node is
> EXPORT_SYMBOL without any explanation whatsoever). So I would call it a
> general mess tweaked for specific usecases.
>
> I personally do not care about EXPORT_SYMBOL vs. EXPORT_SYMBOL_GPL
> much to be honest. I understand an argument that we do not care about
> out-of-tree modules a wee bit. I would just be worried those will find a
> way around and my experience tells me that it would be much uglier than
> what the core kernel can provide. But this seems more political than
> technical discussion.

I'm not worried about people finding a way around; I'm sure
cringe-worthy workarounds of core kernel details happen on a regular
basis in out-of-tree code. EXPORT_SYMBOL_GPL in this instance is not a
political statement, it is a statement of the rate of evolution and
depth of the api.

>
> > Again, devm_memremap_pagex() exposes and relies upon core kernel
> > internal assumptions and will continue to evolve along with 'struct
> > page', memory hotplug, and support for new memory types / topologies.
> > Only an in-kernel GPL-only driver is expected to keep up with this
> > ongoing evolution. This interface, and functionality derived from this
> > interface, is not suitable for kernel-external drivers.
>
> I do not follow this line of argumentation though. We generally do not
> care about out-of-tree modules and breaking them if the interface has to
> be updated.

Exactly right. The EXPORT_SYMBOL_GPL is there to say that this api has
deep enough ties into the core kernel to lower the confidence that the
API will stay stable from one kernel revision to the next. It's also
an api that is attracting widening and varied reuse and the long term
health of the implementation depends on being able to peer deeply into
its users and refactor the interface and the core kernel as a result.

> Also what about GPL out of tree modules?

These are precisely the modules we want upstream.

> That being said, I do not mind this patch. You and Christoph are the
> authors and therefore it is you to decide. I just find the current
> situation confusing to say the least.

Hopefully I clarified, let me know if not.


Re: [PATCH v4 1/3] mm: Shuffle initial free memory

2018-10-15 Thread Dan Williams
On Mon, Oct 15, 2018 at 3:25 PM Kees Cook  wrote:
>
> On Wed, Oct 10, 2018 at 6:36 PM, Dan Williams  
> wrote:
> > While SLAB_FREELIST_RANDOM reduces the predictability of some local slab
> > caches it leaves vast bulk of memory to be predictably in order
> > allocated. That ordering can be detected by a memory side-cache.
> >
> > The shuffling is done in terms of CONFIG_SHUFFLE_PAGE_ORDER sized free
> > pages where the default CONFIG_SHUFFLE_PAGE_ORDER is MAX_ORDER-1 i.e.
> > 10, 4MB this trades off randomization granularity for time spent
> > shuffling.  MAX_ORDER-1 was chosen to be minimally invasive to the page
> > allocator while still showing memory-side cache behavior improvements,
> > and the expectation that the security implications of finer granularity
> > randomization is mitigated by CONFIG_SLAB_FREELIST_RANDOM.
>
> Perhaps it would help some of the detractors of this feature to make
> this a runtime choice? Some benchmarks show improvements, some show
> regressions. It could just be up to the admin to turn this on/off
> given their paranoia levels? (i.e. the shuffling could become a no-op
> with a given specific boot param?)

Yes, I think it's a valid concern to not turn this on for everybody
given the potential for performance regression. For the next version
I'll add some runtime detection for a memory-side-cache to set the
default on/off, and include a command line override for the paranoid
who want it on regardless of the presence of such a cache.
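
A minimal sketch of that combination, i.e. detection picking the default and
an explicit boot parameter winning either way (the "page_shuffle" parameter
name and the detection hook are made up for illustration):

#include <linux/init.h>
#include <linux/kernel.h>

static bool page_shuffle_forced __initdata;
static bool page_shuffle_enabled __ro_after_init;

static int __init page_shuffle_param(char *str)
{
	page_shuffle_forced = true;
	return kstrtobool(str, &page_shuffle_enabled);
}
early_param("page_shuffle", page_shuffle_param);

/* called once the platform can say whether a memory-side cache is present */
void __init page_shuffle_init(bool has_memory_side_cache)
{
	if (!page_shuffle_forced)
		page_shuffle_enabled = has_memory_side_cache;
}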


[PATCH v2] MAINTAINERS: Clarify UIO vs UIOVEC maintainer

2018-10-12 Thread Dan Williams
The UIO file mask in MAINTAINERS was incorrectly directing UIOVEC
(include/linux/uio.h) patches to Greg.

Tag Al as the UIOVEC maintainer as Ingo and others have explicitly
required his ack before taking architecture patches that touch
lib/iov_iter.c.

Cc: Al Viro 
Reported-by: Greg Kroah-Hartman 
Signed-off-by: Dan Williams 
---
Changes in v2:
* Rename UACCESS to UIOVEC

 MAINTAINERS |8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index d870cb57c887..c1841d58d7ae 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -15344,13 +15344,19 @@ F:arch/x86/um/
 F: fs/hostfs/
 F: fs/hppfs/
 
+USERSPACE COPYIN/COPYOUT (UIOVEC)
+M: Alexander Viro 
+S: Maintained
+F: lib/iov_iter.c
+F: include/linux/uio.h
+
 USERSPACE I/O (UIO)
 M: Greg Kroah-Hartman 
 S: Maintained
 T: git git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc.git
 F: Documentation/driver-api/uio-howto.rst
 F: drivers/uio/
-F: include/linux/uio*.h
+F: include/linux/uio_driver.h
 
 UTIL-LINUX PACKAGE
 M: Karel Zak 



[PATCH v7 7/7] mm, hmm: Mark hmm_devmem_{add, add_resource} EXPORT_SYMBOL_GPL

2018-10-12 Thread Dan Williams
The routines hmm_devmem_add() and hmm_devmem_add_resource() duplicated
devm_memremap_pages() and are now simple wrappers around the core
facility to inject a dev_pagemap instance into the global pgmap_radix
and hook page-idle events. The devm_memremap_pages() interface is base
infrastructure for HMM. HMM has more and deeper ties into the kernel
memory management implementation than base ZONE_DEVICE, which is itself an
EXPORT_SYMBOL_GPL facility.

Originally, the HMM page structure creation routines copied the
devm_memremap_pages() code and reused ZONE_DEVICE. A cleanup to unify
the implementations was discussed during the initial review:
http://lkml.iu.edu/hypermail/linux/kernel/1701.2/00812.html
Recent work to extend devm_memremap_pages() for the peer-to-peer-DMA
facility enabled this cleanup to move forward.

In addition to the integration with devm_memremap_pages() HMM depends on
other GPL-only symbols:

mmu_notifier_unregister_no_release
percpu_ref
region_intersects
__class_create

It goes further to consume / indirectly expose functionality that is not
exported to any other driver:

alloc_pages_vma
walk_page_range

HMM is derived from devm_memremap_pages(), and extends deep core-kernel
fundamentals. Similar to devm_memremap_pages(), mark its entry points
EXPORT_SYMBOL_GPL().

Cc: "Jérôme Glisse" 
Cc: Logan Gunthorpe 
Reviewed-by: Christoph Hellwig 
Signed-off-by: Dan Williams 
---
 mm/hmm.c |4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/hmm.c b/mm/hmm.c
index 2e72cb4188ca..90d1383c7e24 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -1062,7 +1062,7 @@ struct hmm_devmem *hmm_devmem_add(const struct 
hmm_devmem_ops *ops,
return result;
return devmem;
 }
-EXPORT_SYMBOL(hmm_devmem_add);
+EXPORT_SYMBOL_GPL(hmm_devmem_add);
 
 struct hmm_devmem *hmm_devmem_add_resource(const struct hmm_devmem_ops *ops,
   struct device *device,
@@ -1116,7 +1116,7 @@ struct hmm_devmem *hmm_devmem_add_resource(const struct 
hmm_devmem_ops *ops,
return result;
return devmem;
 }
-EXPORT_SYMBOL(hmm_devmem_add_resource);
+EXPORT_SYMBOL_GPL(hmm_devmem_add_resource);
 
 /*
  * A device driver that wants to handle multiple devices memory through a



[PATCH v7 5/7] mm, hmm: Use devm semantics for hmm_devmem_{add, remove}

2018-10-12 Thread Dan Williams
devm semantics arrange for resources to be torn down when
device-driver-probe fails or when device-driver-release completes.
Similar to devm_memremap_pages() there is no need to support an explicit
remove operation when the users properly adhere to devm semantics.

Note that devm_kzalloc() automatically handles allocating node-local
memory.

Reviewed-by: Christoph Hellwig 
Reviewed-by: Jérôme Glisse 
Cc: "Jérôme Glisse" 
Cc: Logan Gunthorpe 
Signed-off-by: Dan Williams 
---
 include/linux/hmm.h |4 --
 mm/hmm.c|  127 ++-
 2 files changed, 25 insertions(+), 106 deletions(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index dde947083d4e..5888ae9f6abf 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -499,8 +499,7 @@ struct hmm_devmem {
  * enough and allocate struct page for it.
  *
  * The device driver can wrap the hmm_devmem struct inside a private device
- * driver struct. The device driver must call hmm_devmem_remove() before the
- * device goes away and before freeing the hmm_devmem struct memory.
+ * driver struct.
  */
 struct hmm_devmem *hmm_devmem_add(const struct hmm_devmem_ops *ops,
  struct device *device,
@@ -508,7 +507,6 @@ struct hmm_devmem *hmm_devmem_add(const struct 
hmm_devmem_ops *ops,
 struct hmm_devmem *hmm_devmem_add_resource(const struct hmm_devmem_ops *ops,
   struct device *device,
   struct resource *res);
-void hmm_devmem_remove(struct hmm_devmem *devmem);
 
 /*
  * hmm_devmem_page_set_drvdata - set per-page driver data field
diff --git a/mm/hmm.c b/mm/hmm.c
index 774d684fa2b4..60e4b275ad78 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -939,7 +939,6 @@ static void hmm_devmem_ref_exit(void *data)
 
devmem = container_of(ref, struct hmm_devmem, ref);
percpu_ref_exit(ref);
-   devm_remove_action(devmem->device, &hmm_devmem_ref_exit, data);
 }
 
 static void hmm_devmem_ref_kill(void *data)
@@ -950,7 +949,6 @@ static void hmm_devmem_ref_kill(void *data)
devmem = container_of(ref, struct hmm_devmem, ref);
percpu_ref_kill(ref);
wait_for_completion(&devmem->completion);
-   devm_remove_action(devmem->device, &hmm_devmem_ref_kill, data);
 }
 
 static int hmm_devmem_fault(struct vm_area_struct *vma,
@@ -988,7 +986,7 @@ static void hmm_devmem_radix_release(struct resource 
*resource)
mutex_unlock(&hmm_devmem_lock);
 }
 
-static void hmm_devmem_release(struct device *dev, void *data)
+static void hmm_devmem_release(void *data)
 {
struct hmm_devmem *devmem = data;
struct resource *resource = devmem->resource;
@@ -996,11 +994,6 @@ static void hmm_devmem_release(struct device *dev, void 
*data)
struct zone *zone;
struct page *page;
 
-   if (percpu_ref_tryget_live(&devmem->ref)) {
-   dev_WARN(dev, "%s: page mapping is still live!\n", __func__);
-   percpu_ref_put(&devmem->ref);
-   }
-
/* pages are dead and unused, undo the arch mapping */
start_pfn = (resource->start & ~(PA_SECTION_SIZE - 1)) >> PAGE_SHIFT;
npages = ALIGN(resource_size(resource), PA_SECTION_SIZE) >> PAGE_SHIFT;
@@ -1126,19 +1119,6 @@ static int hmm_devmem_pages_create(struct hmm_devmem 
*devmem)
return ret;
 }
 
-static int hmm_devmem_match(struct device *dev, void *data, void *match_data)
-{
-   struct hmm_devmem *devmem = data;
-
-   return devmem->resource == match_data;
-}
-
-static void hmm_devmem_pages_remove(struct hmm_devmem *devmem)
-{
-   devres_release(devmem->device, &hmm_devmem_release,
-  &hmm_devmem_match, devmem->resource);
-}
-
 /*
  * hmm_devmem_add() - hotplug ZONE_DEVICE memory for device memory
  *
@@ -1166,8 +1146,7 @@ struct hmm_devmem *hmm_devmem_add(const struct 
hmm_devmem_ops *ops,
 
dev_pagemap_get_ops();
 
-   devmem = devres_alloc_node(&hmm_devmem_release, sizeof(*devmem),
-  GFP_KERNEL, dev_to_node(device));
+   devmem = devm_kzalloc(device, sizeof(*devmem), GFP_KERNEL);
if (!devmem)
return ERR_PTR(-ENOMEM);
 
@@ -1181,11 +1160,11 @@ struct hmm_devmem *hmm_devmem_add(const struct 
hmm_devmem_ops *ops,
ret = percpu_ref_init(&devmem->ref, &hmm_devmem_ref_release,
  0, GFP_KERNEL);
if (ret)
-   goto error_percpu_ref;
+   return ERR_PTR(ret);
 
-   ret = devm_add_action(device, hmm_devmem_ref_exit, &devmem->ref);
+   ret = devm_add_action_or_reset(device, hmm_devmem_ref_exit, &devmem->ref);
if (ret)
-   goto error_devm_add_action;
+   return ERR_PTR(ret);
 
size = ALIGN(size, PA_SECTION_SIZE);
addr = min((unsigned long)iomem_resource.end,
@@ -1205,16 +1184,12 @@ struct hmm_devmem *hmm_devmem_add(const struct hmm_devmem_ops *ops,
 
  

[PATCH v7 6/7] mm, hmm: Replace hmm_devmem_pages_create() with devm_memremap_pages()

2018-10-12 Thread Dan Williams
Commit e8d513483300 "memremap: change devm_memremap_pages interface to
use struct dev_pagemap" refactored devm_memremap_pages() to allow a
dev_pagemap instance to be supplied. Passing in a dev_pagemap interface
simplifies the design of pgmap type drivers in that they can rely on
container_of() to lookup any private data associated with the given
dev_pagemap instance.

In addition to the cleanups this also gives hmm users multi-order-radix
improvements that arrived with commit ab1b597ee0e4 "mm,
devm_memremap_pages: use multi-order radix for ZONE_DEVICE lookups"

As part of the conversion to the devm_memremap_pages() method of
handling the percpu_ref relative to when pages are put, the percpu_ref
completion needs to move to hmm_devmem_ref_exit(). See commit
71389703839e ("mm, zone_device: Replace {get, put}_zone_device_page...")
for details.
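
As an illustration of the container_of() pattern this enables, a minimal
sketch with hypothetical names (not code from the patch; only the embedded
'pagemap' field matches the mm/hmm.c hunks below):

#include <linux/memremap.h>
#include <linux/kernel.h>

/* a pgmap-type driver wraps its dev_pagemap in a private struct */
struct example_devmem {
	struct dev_pagemap pagemap;
	/* ... driver-private state ... */
};

/* any callback handed a dev_pagemap can recover the private struct */
static struct example_devmem *to_example_devmem(struct dev_pagemap *pgmap)
{
	return container_of(pgmap, struct example_devmem, pagemap);
}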

Reviewed-by: Christoph Hellwig 
Reviewed-by: Jérôme Glisse 
Acked-by: Balbir Singh 
Cc: Logan Gunthorpe 
Signed-off-by: Dan Williams 
---
 mm/hmm.c |  196 --
 1 file changed, 26 insertions(+), 170 deletions(-)

diff --git a/mm/hmm.c b/mm/hmm.c
index 60e4b275ad78..2e72cb4188ca 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -938,17 +938,16 @@ static void hmm_devmem_ref_exit(void *data)
struct hmm_devmem *devmem;
 
devmem = container_of(ref, struct hmm_devmem, ref);
+   wait_for_completion(&devmem->completion);
percpu_ref_exit(ref);
 }
 
-static void hmm_devmem_ref_kill(void *data)
+static void hmm_devmem_ref_kill(struct percpu_ref *ref)
 {
-   struct percpu_ref *ref = data;
struct hmm_devmem *devmem;
 
devmem = container_of(ref, struct hmm_devmem, ref);
percpu_ref_kill(ref);
-   wait_for_completion(&devmem->completion);
 }
 
 static int hmm_devmem_fault(struct vm_area_struct *vma,
@@ -971,154 +970,6 @@ static void hmm_devmem_free(struct page *page, void *data)
devmem->ops->free(devmem, page);
 }
 
-static DEFINE_MUTEX(hmm_devmem_lock);
-static RADIX_TREE(hmm_devmem_radix, GFP_KERNEL);
-
-static void hmm_devmem_radix_release(struct resource *resource)
-{
-   resource_size_t key;
-
-   mutex_lock(&hmm_devmem_lock);
-   for (key = resource->start;
-key <= resource->end;
-key += PA_SECTION_SIZE)
-   radix_tree_delete(&hmm_devmem_radix, key >> PA_SECTION_SHIFT);
-   mutex_unlock(&hmm_devmem_lock);
-}
-
-static void hmm_devmem_release(void *data)
-{
-   struct hmm_devmem *devmem = data;
-   struct resource *resource = devmem->resource;
-   unsigned long start_pfn, npages;
-   struct zone *zone;
-   struct page *page;
-
-   /* pages are dead and unused, undo the arch mapping */
-   start_pfn = (resource->start & ~(PA_SECTION_SIZE - 1)) >> PAGE_SHIFT;
-   npages = ALIGN(resource_size(resource), PA_SECTION_SIZE) >> PAGE_SHIFT;
-
-   page = pfn_to_page(start_pfn);
-   zone = page_zone(page);
-
-   mem_hotplug_begin();
-   if (resource->desc == IORES_DESC_DEVICE_PRIVATE_MEMORY)
-   __remove_pages(zone, start_pfn, npages, NULL);
-   else
-   arch_remove_memory(start_pfn << PAGE_SHIFT,
-  npages << PAGE_SHIFT, NULL);
-   mem_hotplug_done();
-
-   hmm_devmem_radix_release(resource);
-}
-
-static int hmm_devmem_pages_create(struct hmm_devmem *devmem)
-{
-   resource_size_t key, align_start, align_size, align_end;
-   struct device *device = devmem->device;
-   int ret, nid, is_ram;
-
-   align_start = devmem->resource->start & ~(PA_SECTION_SIZE - 1);
-   align_size = ALIGN(devmem->resource->start +
-  resource_size(devmem->resource),
-  PA_SECTION_SIZE) - align_start;
-
-   is_ram = region_intersects(align_start, align_size,
-  IORESOURCE_SYSTEM_RAM,
-  IORES_DESC_NONE);
-   if (is_ram == REGION_MIXED) {
-   WARN_ONCE(1, "%s attempted on mixed region %pr\n",
-   __func__, devmem->resource);
-   return -ENXIO;
-   }
-   if (is_ram == REGION_INTERSECTS)
-   return -ENXIO;
-
-   if (devmem->resource->desc == IORES_DESC_DEVICE_PUBLIC_MEMORY)
-   devmem->pagemap.type = MEMORY_DEVICE_PUBLIC;
-   else
-   devmem->pagemap.type = MEMORY_DEVICE_PRIVATE;
-
-   devmem->pagemap.res = *devmem->resource;
-   devmem->pagemap.page_fault = hmm_devmem_fault;
-   devmem->pagemap.page_free = hmm_devmem_free;
-   devmem->pagemap.dev = devmem->device;
-   devmem->pagemap.ref = &devmem->ref;
-   devmem->pagemap.data = devmem;
-
-   mutex_lock(&hmm_devmem_lock);
-   align_end = align_start + align_size - 1;
-  

[PATCH v7 2/7] mm, devm_memremap_pages: Kill mapping "System RAM" support

2018-10-12 Thread Dan Williams
Given the fact that devm_memremap_pages() requires a percpu_ref that is
torn down by devm_memremap_pages_release() the current support for
mapping RAM is broken.

Support for remapping "System RAM" has been broken since the beginning
and there is no existing user of this code path, so just kill the
support and make it an explicit error.

This cleanup also simplifies a follow-on patch to fix the error path
when setting a devm release action for devm_memremap_pages_release()
fails.

Reviewed-by: "Jérôme Glisse" 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Logan Gunthorpe 
Signed-off-by: Dan Williams 
---
 kernel/memremap.c |9 +++--
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/kernel/memremap.c b/kernel/memremap.c
index 1bbb2e892941..871d81bf0c69 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -167,15 +167,12 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap)
is_ram = region_intersects(align_start, align_size,
IORESOURCE_SYSTEM_RAM, IORES_DESC_NONE);
 
-   if (is_ram == REGION_MIXED) {
-   WARN_ONCE(1, "%s attempted on mixed region %pr\n",
-   __func__, res);
+   if (is_ram != REGION_DISJOINT) {
+   WARN_ONCE(1, "%s attempted on %s region %pr\n", __func__,
+   is_ram == REGION_MIXED ? "mixed" : "ram", res);
return ERR_PTR(-ENXIO);
}
 
-   if (is_ram == REGION_INTERSECTS)
-   return __va(res->start);
-
if (!pgmap->ref)
return ERR_PTR(-EINVAL);
 



[PATCH v7 3/7] mm, devm_memremap_pages: Fix shutdown handling

2018-10-12 Thread Dan Williams
The last step before devm_memremap_pages() returns success is to
allocate a release action, devm_memremap_pages_release(), to tear the
entire setup down. However, the result from devm_add_action() is not
checked.

Checking the error from devm_add_action() is not enough. The api
currently relies on the fact that the percpu_ref it is using is killed
by the time the devm_memremap_pages_release() is run. Rather than
continue this awkward situation, offload the responsibility of killing
the percpu_ref to devm_memremap_pages_release() directly. This allows
devm_memremap_pages() to do the right thing  relative to init failures
and shutdown.

Without this change we could fail to register the teardown of
devm_memremap_pages(). The likelihood of hitting this failure is tiny as
small memory allocations almost always succeed. However, the impact of
the failure is large given any future reconfiguration, or
disable/enable, of an nvdimm namespace will fail forever as subsequent
calls to devm_memremap_pages() will fail to setup the pgmap_radix since
there will be stale entries for the physical address range.

An argument could be made to require that the ->kill() operation be set
in the @pgmap arg rather than passed in separately. However, it helps
code readability, tracking the lifetime of a given instance, to be able
to grep the kill routine directly at the devm_memremap_pages() call
site.
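
For orientation, a minimal sketch of what a caller looks like after this
change (illustrative names only; the dax/pmem hunks below show the real
conversions, and percpu_ref/resource setup is elided):

#include <linux/device.h>
#include <linux/memremap.h>
#include <linux/percpu-refcount.h>
#include <linux/err.h>

struct example_dev {
	struct percpu_ref ref;		/* assumed initialized elsewhere */
	struct dev_pagemap pgmap;
};

static void example_kill(struct percpu_ref *ref)
{
	/* transition @ref to the dead state; the release path waits on it */
	percpu_ref_kill(ref);
}

static int example_map(struct device *dev, struct example_dev *edev)
{
	void *addr;

	/* devm_memremap_pages() now owns calling ->kill() at teardown */
	edev->pgmap.ref = &edev->ref;
	edev->pgmap.kill = example_kill;
	addr = devm_memremap_pages(dev, &edev->pgmap);
	if (IS_ERR(addr))
		return PTR_ERR(addr);
	return 0;
}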

Cc: 
Fixes: e8d513483300 ("memremap: change devm_memremap_pages interface...")
Reviewed-by: "Jérôme Glisse" 
Reported-by: Logan Gunthorpe 
Reviewed-by: Logan Gunthorpe 
Reviewed-by: Christoph Hellwig 
Signed-off-by: Dan Williams 
---
 drivers/dax/pmem.c|   14 +++---
 drivers/nvdimm/pmem.c |   13 +
 include/linux/memremap.h  |2 ++
 kernel/memremap.c |   30 ++
 tools/testing/nvdimm/test/iomap.c |   15 ++-
 5 files changed, 38 insertions(+), 36 deletions(-)

diff --git a/drivers/dax/pmem.c b/drivers/dax/pmem.c
index 99e2aace8078..2c1f459c0c63 100644
--- a/drivers/dax/pmem.c
+++ b/drivers/dax/pmem.c
@@ -48,9 +48,8 @@ static void dax_pmem_percpu_exit(void *data)
percpu_ref_exit(ref);
 }
 
-static void dax_pmem_percpu_kill(void *data)
+static void dax_pmem_percpu_kill(struct percpu_ref *ref)
 {
-   struct percpu_ref *ref = data;
struct dax_pmem *dax_pmem = to_dax_pmem(ref);
 
dev_dbg(dax_pmem->dev, "trace\n");
@@ -112,17 +111,10 @@ static int dax_pmem_probe(struct device *dev)
}
 
	dax_pmem->pgmap.ref = &dax_pmem->ref;
+   dax_pmem->pgmap.kill = dax_pmem_percpu_kill;
	addr = devm_memremap_pages(dev, &dax_pmem->pgmap);
-   if (IS_ERR(addr)) {
-   devm_remove_action(dev, dax_pmem_percpu_exit, &dax_pmem->ref);
-   percpu_ref_exit(&dax_pmem->ref);
+   if (IS_ERR(addr))
	return PTR_ERR(addr);
-   }
-
-   rc = devm_add_action_or_reset(dev, dax_pmem_percpu_kill,
-   &dax_pmem->ref);
-   if (rc)
-   return rc;

	/* adjust the dax_region resource to the start of data */
	memcpy(&res, &dax_pmem->pgmap.res, sizeof(res));
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index a75d10c23d80..c7548e712a16 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -309,8 +309,11 @@ static void pmem_release_queue(void *q)
blk_cleanup_queue(q);
 }
 
-static void pmem_freeze_queue(void *q)
+static void pmem_freeze_queue(struct percpu_ref *ref)
 {
+   struct request_queue *q;
+
+   q = container_of(ref, typeof(*q), q_usage_counter);
blk_freeze_queue_start(q);
 }
 
@@ -402,6 +405,7 @@ static int pmem_attach_disk(struct device *dev,
 
pmem->pfn_flags = PFN_DEV;
	pmem->pgmap.ref = &q->q_usage_counter;
+   pmem->pgmap.kill = pmem_freeze_queue;
	if (is_nd_pfn(dev)) {
		if (setup_pagemap_fsdax(dev, &pmem->pgmap))
return -ENOMEM;
@@ -425,13 +429,6 @@ static int pmem_attach_disk(struct device *dev,
addr = devm_memremap(dev, pmem->phys_addr,
pmem->size, ARCH_MEMREMAP_PMEM);
 
-   /*
-* At release time the queue must be frozen before
-* devm_memremap_pages is unwound
-*/
-   if (devm_add_action_or_reset(dev, pmem_freeze_queue, q))
-   return -ENOMEM;
-
if (IS_ERR(addr))
return PTR_ERR(addr);
pmem->virt_addr = addr;
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index f91f9e763557..a84572cdc438 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -106,6 +106,7 @@ typedef void (*dev_page_free_t)(struct page *page, void *data);
  * @altmap: pre-allocated/reserved memory for vmemmap allocations
  * @res: physical address range covered by @ref
  * @ref: reference

[PATCH v7 4/7] mm, devm_memremap_pages: Add MEMORY_DEVICE_PRIVATE support

2018-10-12 Thread Dan Williams
In preparation for consolidating all ZONE_DEVICE enabling via
devm_memremap_pages(), teach it how to handle the constraints of
MEMORY_DEVICE_PRIVATE ranges.

Reviewed-by: Jérôme Glisse 
[jglisse: call move_pfn_range_to_zone for MEMORY_DEVICE_PRIVATE]
Acked-by: Christoph Hellwig 
Reported-by: Logan Gunthorpe 
Reviewed-by: Logan Gunthorpe 
Signed-off-by: Dan Williams 
---
 kernel/memremap.c |   53 +
 1 file changed, 41 insertions(+), 12 deletions(-)

diff --git a/kernel/memremap.c b/kernel/memremap.c
index ec7d8ca8db2c..7275a227c4ec 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -98,9 +98,15 @@ static void devm_memremap_pages_release(void *data)
- align_start;
 
mem_hotplug_begin();
-   arch_remove_memory(align_start, align_size, pgmap->altmap_valid ?
-   &pgmap->altmap : NULL);
-   kasan_remove_zero_shadow(__va(align_start), align_size);
+   if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
+   pfn = align_start >> PAGE_SHIFT;
+   __remove_pages(page_zone(pfn_to_page(pfn)), pfn,
+   align_size >> PAGE_SHIFT, NULL);
+   } else {
+   arch_remove_memory(align_start, align_size,
+   pgmap->altmap_valid ? &pgmap->altmap : NULL);
+   kasan_remove_zero_shadow(__va(align_start), align_size);
+   }
mem_hotplug_done();
 
untrack_pfn(NULL, PHYS_PFN(align_start), align_size);
@@ -187,17 +193,40 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap)
goto err_pfn_remap;
 
mem_hotplug_begin();
-   error = kasan_add_zero_shadow(__va(align_start), align_size);
-   if (error) {
-   mem_hotplug_done();
-   goto err_kasan;
+
+   /*
+* For device private memory we call add_pages() as we only need to
+* allocate and initialize struct page for the device memory. More-
+* over the device memory is un-accessible thus we do not want to
+* create a linear mapping for the memory like arch_add_memory()
+* would do.
+*
+* For all other device memory types, which are accessible by
+* the CPU, we do want the linear mapping and thus use
+* arch_add_memory().
+*/
+   if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
+   error = add_pages(nid, align_start >> PAGE_SHIFT,
+   align_size >> PAGE_SHIFT, NULL, false);
+   } else {
+   error = kasan_add_zero_shadow(__va(align_start), align_size);
+   if (error) {
+   mem_hotplug_done();
+   goto err_kasan;
+   }
+
+   error = arch_add_memory(nid, align_start, align_size, altmap,
+   false);
+   }
+
+   if (!error) {
+   struct zone *zone;
+
+   zone = &NODE_DATA(nid)->node_zones[ZONE_DEVICE];
+   move_pfn_range_to_zone(zone, align_start >> PAGE_SHIFT,
+   align_size >> PAGE_SHIFT, altmap);
}
 
-   error = arch_add_memory(nid, align_start, align_size, altmap, false);
-   if (!error)
-   move_pfn_range_to_zone(&NODE_DATA(nid)->node_zones[ZONE_DEVICE],
-   align_start >> PAGE_SHIFT,
-   align_size >> PAGE_SHIFT, altmap);
mem_hotplug_done();
if (error)
goto err_add_memory;



[PATCH v7 0/7] mm: Merge hmm into devm_memremap_pages, mark GPL-only

2018-10-12 Thread Dan Williams
[ apologies for the resend, script error ]

Changes since v6 [1]:
* Rebase on next-20181008 and fixup conflicts with the xarray conversion
  and hotplug optimizations
* It has soaked on a 0day visible branch for a few days without any
  reports.

[1]: https://lkml.org/lkml/2018/9/13/104

---

Hi Andrew,

Jérôme has reviewed the cleanups, thanks Jérôme. We still disagree on
the EXPORT_SYMBOL_GPL status of the core HMM implementation, but Logan,
Christoph and I continue to support marking all devm_memremap_pages()
derivatives EXPORT_SYMBOL_GPL.

HMM has been upstream for over a year; with no in-tree users it is clear
it was designed first and foremost for out of tree drivers. It takes
advantage of a facility Christoph and I spearheaded to support
persistent memory. It continues to see expanding use cases with no clear
end date when it will stop attracting features / revisions. It is not
suitable to export devm_memremap_pages() as a stable 3rd party driver
api.

devm_memremap_pages() is a facility that can create struct page entries
for any arbitrary range and give out-of-tree drivers the ability to
subvert core aspects of page management. It, and anything derived from
it (e.g. hmm, pcip2p, etc...), is a deep integration point into the core
kernel, and an EXPORT_SYMBOL_GPL() interface. 

Commit 31c5bda3a656 "mm: fix exports that inadvertently make put_page()
EXPORT_SYMBOL_GPL" was merged ahead of this series to relieve some of
the pressure from innocent consumers of put_page(), but now we need this
series to address *producers* of device pages.

More details and justification in the changelogs. The 0day
infrastructure has reported success across 152 configs and this survives
the libnvdimm unit test suite. Aside from the controversial bits the
diffstat is compelling at:

7 files changed, 126 insertions(+), 323 deletions(-)

Note that the series has some minor collisions with Alex's recent series
to improve devm_memremap_pages() scalability [2]. So, whichever you take
first the other will need a minor rebase.

[2]: https://www.lkml.org/lkml/2018/9/11/10

Dan Williams (7):
  mm, devm_memremap_pages: Mark devm_memremap_pages() EXPORT_SYMBOL_GPL
  mm, devm_memremap_pages: Kill mapping "System RAM" support
  mm, devm_memremap_pages: Fix shutdown handling
  mm, devm_memremap_pages: Add MEMORY_DEVICE_PRIVATE support
  mm, hmm: Use devm semantics for hmm_devmem_{add,remove}
  mm, hmm: Replace hmm_devmem_pages_create() with devm_memremap_pages()
  mm, hmm: Mark hmm_devmem_{add,add_resource} EXPORT_SYMBOL_GPL


 drivers/dax/pmem.c|   14 --
 drivers/nvdimm/pmem.c |   13 +-
 include/linux/hmm.h   |4 
 include/linux/memremap.h  |2 
 kernel/memremap.c |   94 +++
 mm/hmm.c  |  305 +
 tools/testing/nvdimm/test/iomap.c |   17 ++
 7 files changed, 126 insertions(+), 323 deletions(-)


[PATCH v7 1/7] mm, devm_memremap_pages: Mark devm_memremap_pages() EXPORT_SYMBOL_GPL

2018-10-12 Thread Dan Williams
devm_memremap_pages() is a facility that can create struct page entries
for any arbitrary range and give drivers the ability to subvert core
aspects of page management.

Specifically the facility is tightly integrated with the kernel's memory
hotplug functionality. It injects an altmap argument deep into the
architecture specific vmemmap implementation to allow allocating from
specific reserved pages, and it has Linux specific assumptions about
page structure reference counting relative to get_user_pages() and
get_user_pages_fast(). It was an oversight and a mistake that this was
not marked EXPORT_SYMBOL_GPL from the outset.

Again, devm_memremap_pages() exposes and relies upon core kernel
internal assumptions and will continue to evolve along with 'struct
page', memory hotplug, and support for new memory types / topologies.
Only an in-kernel GPL-only driver is expected to keep up with this
ongoing evolution. This interface, and functionality derived from this
interface, is not suitable for kernel-external drivers.

Cc: Michal Hocko 
Cc: "Jérôme Glisse" 
Reviewed-by: Christoph Hellwig 
Signed-off-by: Dan Williams 
---
 kernel/memremap.c |2 +-
 tools/testing/nvdimm/test/iomap.c |2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/memremap.c b/kernel/memremap.c
index 6ec81e0d7a33..1bbb2e892941 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -232,7 +232,7 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap)
  err_array:
return ERR_PTR(error);
 }
-EXPORT_SYMBOL(devm_memremap_pages);
+EXPORT_SYMBOL_GPL(devm_memremap_pages);
 
 unsigned long vmem_altmap_offset(struct vmem_altmap *altmap)
 {
diff --git a/tools/testing/nvdimm/test/iomap.c b/tools/testing/nvdimm/test/iomap.c
index ff9d3a5825e1..ed18a0cbc0c8 100644
--- a/tools/testing/nvdimm/test/iomap.c
+++ b/tools/testing/nvdimm/test/iomap.c
@@ -113,7 +113,7 @@ void *__wrap_devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap)
return nfit_res->buf + offset - nfit_res->res.start;
return devm_memremap_pages(dev, pgmap);
 }
-EXPORT_SYMBOL(__wrap_devm_memremap_pages);
+EXPORT_SYMBOL_GPL(__wrap_devm_memremap_pages);
 
 pfn_t __wrap_phys_to_pfn_t(phys_addr_t addr, unsigned long flags)
 {



Re: [PATCHv2] mm/gup: Cache dev_pagemap while pinning pages

2018-10-12 Thread Dan Williams
On Fri, Oct 12, 2018 at 4:00 AM Kirill A. Shutemov  wrote:
[..]
> > Does this have defined behavior? I would feel better with " = { 0 }"
> > to be explicit.
>
> Well, it's not allowed by the standard, but GCC allows this.
> You can see a warning with -pedantic.
>
> We use empty-list initializers a lot in the kernel:
> $ git grep 'struct .*= {};' | wc -l
> 997
>
> It should be fine.

Ah, ok. I would still say we should be consistent between the init
syntax for 'ctx' in follow_page() and __get_user_pages(), and why not
go with '= { 0 }', one less unnecessary gcc'ism.
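
For anyone skimming the thread, the two initializer spellings side by side
(a standalone sketch with a hypothetical struct, not the kernel's
follow_page_context):

struct ctx {
	void *pgmap;
	unsigned int page_mask;
};

/* GNU C extension: empty initializer list; -pedantic warns about it */
struct ctx a = {};

/* ISO C: first member initialized explicitly, the rest zero-filled */
struct ctx b = { 0 };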


Re: [PATCHv2] mm/gup: Cache dev_pagemap while pinning pages

2018-10-11 Thread Dan Williams
On Thu, Oct 11, 2018 at 11:00 AM Keith Busch  wrote:
>
> Getting pages from ZONE_DEVICE memory needs to check the backing device's
> live-ness, which is tracked in the device's dev_pagemap metadata. This
> metadata is stored in a radix tree and looking it up adds measurable
> software overhead.
>
> This patch avoids repeating this relatively costly operation when
> dev_pagemap is used by caching the last dev_pagemap while getting user
> pages. The gup_benchmark kernel self test reports this reduces time to
> get user pages to as low as 1/3 of the previous time.
>
> Cc: Kirill Shutemov 
> Cc: Dave Hansen 
> Cc: Dan Williams 
> Signed-off-by: Keith Busch 

Other than the 2 comments below, this looks good to me:

Reviewed-by: Dan Williams 

[..]
> diff --git a/mm/gup.c b/mm/gup.c
> index 1abc8b4afff6..d2700dff6f66 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
[..]
> @@ -431,7 +430,22 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
> return no_page_table(vma, flags);
> }
>
> -   return follow_p4d_mask(vma, address, pgd, flags, page_mask);
> +   return follow_p4d_mask(vma, address, pgd, flags, ctx);
> +}
> +
> +struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
> +unsigned int foll_flags)
> +{
> +   struct page *page;
> +   struct follow_page_context ctx = {
> +   .pgmap = NULL,
> +   .page_mask = 0,
> +   };

You don't need to init all members. It is defined that if you init at
least one member then all non initialized members are set to zero, so
you should be able to do " = { 0 }".

> +
> +   page = follow_page_mask(vma, address, foll_flags, &ctx);
> +   if (ctx.pgmap)
> +   put_dev_pagemap(ctx.pgmap);
> +   return page;
>  }
>
>  static int get_gate_page(struct mm_struct *mm, unsigned long address,
> @@ -659,9 +673,9 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
> unsigned int gup_flags, struct page **pages,
> struct vm_area_struct **vmas, int *nonblocking)
>  {
> -   long i = 0;
> -   unsigned int page_mask;
> +   long ret = 0, i = 0;
> struct vm_area_struct *vma = NULL;
> +   struct follow_page_context ctx = {};

Does this have defined behavior? I would feel better with " = { 0 }"
to be explicit.

>
> if (!nr_pages)
> return 0;
> @@ -691,12 +705,14 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
> pages ? &pages[i] : NULL);
> if (ret)
> return i ? : ret;
> -   page_mask = 0;
> +   ctx.page_mask = 0;
> goto next_page;
> }
>
> -   if (!vma || check_vma_flags(vma, gup_flags))
> -   return i ? : -EFAULT;
> +   if (!vma || check_vma_flags(vma, gup_flags)) {
> +   ret = -EFAULT;
> +   goto out;
> +   }
> if (is_vm_hugetlb_page(vma)) {
> i = follow_hugetlb_page(mm, vma, pages, vmas,
> &start, &nr_pages, i,
> @@ -709,23 +725,26 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
>  * If we have a pending SIGKILL, don't keep faulting pages and
>  * potentially allocating memory.
>  */
> -   if (unlikely(fatal_signal_pending(current)))
> -   return i ? i : -ERESTARTSYS;
> +   if (unlikely(fatal_signal_pending(current))) {
> +   ret = -ERESTARTSYS;
> +   goto out;
> +   }
> cond_resched();
> -   page = follow_page_mask(vma, start, foll_flags, &page_mask);
> +
> +   page = follow_page_mask(vma, start, foll_flags, &ctx);
> if (!page) {
> -   int ret;
> ret = faultin_page(tsk, vma, start, &foll_flags,
> nonblocking);
> switch (ret) {
> case 0:
> goto retry;
> +   case -EBUSY:
> +   ret = 0;
> +   /* FALLTHRU */
> case -EFAULT:
> 

Re: [mm PATCH v2 4/6] mm: Do not set reserved flag for hotplug memory

2018-10-11 Thread Dan Williams
On Thu, Oct 11, 2018 at 3:18 PM Alexander Duyck
 wrote:
>
> The general suspicion at this point is that the setting of the reserved bit
> is not really needed for hotplug memory. In addition the setting of this
> bit results in issues for DAX in that it is not possible to assign the
> region to KVM if the reserved bit is set in each page.
>
> For now we can try just not setting the bit since we suspect it isn't
> adding value in setting it. If at a later time we find that it is needed we
> can come back through and re-add it for the hotplug paths.
>
> Suggested-by: Michael Hocko 
> Reported-by: Dan Williams 
> Signed-off-by: Alexander Duyck 
> ---
>  mm/page_alloc.c |   11 ---
>  1 file changed, 11 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 3603d5444865..e435223e2ddb 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5571,8 +5571,6 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
>
> page = pfn_to_page(pfn);
> __init_single_page(page, pfn, zone, nid);
> -   if (context == MEMMAP_HOTPLUG)
> -   __SetPageReserved(page);

At a minimum I think we need to do this before removing PageReserved,
to make sure zone_device pages are not tracked in the hibernation
image.

diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c
index 3d37c279c090..c0613137d726 100644
--- a/kernel/power/snapshot.c
+++ b/kernel/power/snapshot.c
@@ -1285,6 +1285,9 @@ static struct page *saveable_page(struct zone *zone, unsigned long pfn)
if (swsusp_page_is_forbidden(page) || swsusp_page_is_free(page))
return NULL;

+   if (is_zone_device_page(page))
+   return NULL;
+
if (PageReserved(page)
&& (!kernel_page_present(page) || pfn_is_nosave(pfn)))
return NULL;


Re: [PATCH v2 0/3] Randomize free memory

2018-10-11 Thread Dan Williams
On Thu, Oct 11, 2018 at 4:56 AM Michal Hocko  wrote:
>
> On Wed 10-10-18 17:13:14, Dan Williams wrote:
> [...]
> > On Wed, Oct 10, 2018 at 1:48 AM Michal Hocko  wrote:
> > ...and now that I've made that argument I think I've come around to
> > your point about the shuffle_page_order parameter. The only entity
> > that might have a better clue about "safer" shuffle orders than
> > MAX_ORDER is the distribution provider.
>
> And how is somebody providing a kernel for large variety of workloads
> supposed to know?

True, this would be a much easier discussion with a wider / deeper data set.

>
> [...]
>
> > Note, you can also think about this just on pure architecture terms.
> > I.e. that for a direct mapped cache anywhere in a system you can have
> > a near zero cache conflict rate on a first run of a workload and high
> > conflict rate on a second run based on how lucky you are with memory
> > allocation placement relative to the first run. Randomization keeps
> > you out of such performance troughs and provides more reliable average
> > performance.
>
> I am not disagreeing here. That reliable average might be worse than
> what you get with the non-randomized case. And that might be a fair
> deal for some workloads. You are, however, providing a functionality
> which is enabled by default without any actual numbers (well except for
> _a_java_ workload that seems to benefit) so you should really do your
> homework stop handwaving and give us some numbers and/or convincing
> arguments please.

The latest version of the patches no longer enable it by default. I'm
giving you the data I can give with respect to pre-production
hardware.

> > With the numa emulation patch I referenced an
> > administrator could constrain a workload to run in a cache-sized
> > subset of the available memory if they really know what they are doing
> > and need firmer guarantees.
>
> Then mention how and what you can achieve by that in the changelog.

The numa_emulation aspect is orthogonal to the randomization
implementation. It does not belong in the randomization changelog.

> > The risk if Linux does not have this capability is unstable hacks like
> > zonesort and rebooting, as referenced in that KNL article, which are
> > not suitable for a general purpose kernel / platform.
>
> We could have lived without those for quite some time so this doesn't
> seem to be anything super urgent to push through without a proper
> justification.

We lived without them previously because memory-side-caches were
limited to niche hardware, now this is moving into general purpose
server platforms and the urgency / impact goes up accordingly.

> > > > > Many years back while at a university I was playing
> > > > > with page coloring as a method to reach a more stable performance
> > > > > results due to reduced cache conflicts. It was not always a 
> > > > > performance
> > > > > gain but it definitely allowed for more stable run-to-run comparable
> > > > > results. I can imagine that a randomization might lead to a similar 
> > > > > effect
> > > > > although I am not sure how much and it would be more interesting to 
> > > > > hear
> > > > > about that effect.
> > > >
> > > > Cache coloring is effective up until your workload no longer fits in
> > > > that color.
> > >
> > > Yes, that was my observation back then more or less. But even when you
> > > do not fit into the cache a color aware strategy (I was playing with bin
> > > hoping as well) produced a more deterministic/stable results. But that
> > > is just a side note as it doesn't directly relate to your change.
> > >
> > > > Randomization helps to attenuate the cache conflict rate
> > > > when that happens.
> > >
> > > I can imagine that. Do we have any numbers to actually back that claim
> > > though?
> > >
> >
> > Yes, 2.5X cache conflict rate reduction, in the change log.
>
> Which is a single benchmark result which is not even described in detail
> to be able to reproduce that measurement. I am sorry for nagging
> here but I would expect something less obscure.

No need to apologize.

> How does this behave for
> usual workloads that we test cache sensitive workloads. I myself am not
> a benchmark person but I am pretty sure there are people who can help
> you to find proper ones to run and evaluate.

I wouldn't pick benchmarks that are cpu-cache sensitive since those
are small number of MBs in size, a memory-side cache is on the order
of 10s of GBs.

>

[PATCH v4 2/3] mm: Move buddy list manipulations into helpers

2018-10-10 Thread Dan Williams
In preparation for runtime randomization of the zone lists, take all
(well, most of) the list_*() functions in the buddy allocator and put
them in helper functions. Provide a common control point for injecting
additional behavior when freeing pages.

Cc: Michal Hocko 
Cc: Dave Hansen 
Signed-off-by: Dan Williams 
---
 include/linux/mm.h   |3 --
 include/linux/mm_types.h |3 ++
 include/linux/mmzone.h   |   51 ++
 mm/compaction.c  |4 +--
 mm/page_alloc.c  |   70 ++
 5 files changed, 84 insertions(+), 47 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5891bd4e5d29..856b0530c55d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -473,9 +473,6 @@ static inline void vma_set_anonymous(struct vm_area_struct *vma)
 struct mmu_gather;
 struct inode;
 
-#define page_private(page) ((page)->private)
-#define set_page_private(page, v)  ((page)->private = (v))
-
 #if !defined(__HAVE_ARCH_PTE_DEVMAP) || !defined(CONFIG_TRANSPARENT_HUGEPAGE)
 static inline int pmd_devmap(pmd_t pmd)
 {
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 5ed8f6292a53..72f37ea6dedb 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -209,6 +209,9 @@ struct page {
 #define PAGE_FRAG_CACHE_MAX_SIZE   __ALIGN_MASK(32768, ~PAGE_MASK)
 #define PAGE_FRAG_CACHE_MAX_ORDER  get_order(PAGE_FRAG_CACHE_MAX_SIZE)
 
+#define page_private(page) ((page)->private)
+#define set_page_private(page, v)  ((page)->private = (v))
+
 struct page_frag_cache {
void * va;
 #if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 15029fedbfe6..0b91ce871895 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -18,6 +18,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 #include 
 
 /* Free memory management - zoned buddy allocator.  */
@@ -98,6 +100,55 @@ struct free_area {
unsigned long   nr_free;
 };
 
+/* Used for pages not on another list */
+static inline void add_to_free_area(struct page *page, struct free_area *area,
+int migratetype)
+{
+   list_add(&page->lru, &area->free_list[migratetype]);
+   area->nr_free++;
+}
+
+/* Used for pages not on another list */
+static inline void add_to_free_area_tail(struct page *page, struct free_area *area,
+ int migratetype)
+{
+   list_add_tail(&page->lru, &area->free_list[migratetype]);
+   area->nr_free++;
+}
+
+/* Used for pages which are on another list */
+static inline void move_to_free_area(struct page *page, struct free_area *area,
+int migratetype)
+{
+   list_move(&page->lru, &area->free_list[migratetype]);
+}
+
+static inline struct page *get_page_from_free_area(struct free_area *area,
+   int migratetype)
+{
+   return list_first_entry_or_null(&area->free_list[migratetype],
+   struct page, lru);
+}
+
+static inline void rmv_page_order(struct page *page)
+{
+   __ClearPageBuddy(page);
+   set_page_private(page, 0);
+}
+
+static inline void del_page_from_free_area(struct page *page,
+   struct free_area *area, int migratetype)
+{
+   list_del(&page->lru);
+   rmv_page_order(page);
+   area->nr_free--;
+}
+
+static inline bool free_area_empty(struct free_area *area, int migratetype)
+{
+   return list_empty(&area->free_list[migratetype]);
+}
+
 struct pglist_data;
 
 /*
diff --git a/mm/compaction.c b/mm/compaction.c
index 7c607479de4a..44adbfa073b3 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1359,13 +1359,13 @@ static enum compact_result __compact_finished(struct zone *zone,
bool can_steal;
 
/* Job done if page is free of the right migratetype */
-   if (!list_empty(&area->free_list[migratetype]))
+   if (!free_area_empty(area, migratetype))
return COMPACT_SUCCESS;
 
 #ifdef CONFIG_CMA
/* MIGRATE_MOVABLE can fallback on MIGRATE_CMA */
if (migratetype == MIGRATE_MOVABLE &&
-   !list_empty(>free_list[MIGRATE_CMA]))
+   !free_area_empty(area, MIGRATE_CMA))
return COMPACT_SUCCESS;
 #endif
/*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9b295b2287da..e1e0b54423f0 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -704,12 +704,6 @@ static inline void set_page_order(struct page *page, unsigned int order)
__SetPageBuddy(page);
 }
 
-static inline void rmv_page_order(struct page *page)
-{
-   __ClearPageBuddy(page);
-   set_page_private(page, 0);
-}
-
 /*
  * This function checks whether a page is free && is the buddy
  * we can coalesce a page 

[PATCH v4 1/3] mm: Shuffle initial free memory

2018-10-10 Thread Dan Williams
Some data exfiltration and return-oriented-programming attacks rely on
the ability to infer the location of sensitive data objects. The kernel
page allocator, especially early in system boot, has predictable
first-in-first out behavior for physical pages. Pages are freed in
physical address order when first onlined.

Introduce shuffle_free_memory(), and its helper shuffle_zone(), to
perform a Fisher-Yates shuffle of the page allocator 'free_area' lists
when they are initially populated with free memory at boot and at
hotplug time.

Quoting Kees:
"While we already have a base-address randomization
 (CONFIG_RANDOMIZE_MEMORY), attacks against the same hardware and
 memory layouts would certainly be using the predictability of
 allocation ordering (i.e. for attacks where the base address isn't
 important: only the relative positions between allocated memory).
 This is common in lots of heap-style attacks. They try to gain
 control over ordering by spraying allocations, etc.

 I'd really like to see this because it gives us something similar
 to CONFIG_SLAB_FREELIST_RANDOM but for the page allocator."

Another motivation for this change is performance in the presence of a
memory-side cache. In the future, memory-side-cache technology will be
available on generally available server platforms. The proposed
randomization approach has been measured to improve the cache conflict
rate by a factor of 2.5X on a well-known Java benchmark. It avoids
performance peaks and valleys to provide more predictable performance.

While SLAB_FREELIST_RANDOM reduces the predictability of some local slab
caches it leaves vast bulk of memory to be predictably in order
allocated. That ordering can be detected by a memory side-cache.

The shuffling is done in terms of CONFIG_SHUFFLE_PAGE_ORDER sized free
pages, where the default CONFIG_SHUFFLE_PAGE_ORDER is MAX_ORDER-1, i.e.
order 10 (4MB); this trades off randomization granularity for time spent
shuffling. MAX_ORDER-1 was chosen to be minimally invasive to the page
allocator while still showing memory-side cache behavior improvements,
with the expectation that the security implications of finer granularity
randomization are mitigated by CONFIG_SLAB_FREELIST_RANDOM.
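
To make the default granularity concrete (simple arithmetic, assuming the
common 4K PAGE_SIZE):

  shuffle unit = 2^CONFIG_SHUFFLE_PAGE_ORDER pages
               = 2^(MAX_ORDER - 1) pages
               = 2^10 pages = 1024 * 4KB = 4MB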

The performance impact of the shuffling appears to be in the noise
compared to other memory initialization work. Also the bulk of the work
is done in the background as a part of deferred_init_memmap().

This initial randomization can be undone over time so a follow-on patch
is introduced to inject entropy on page free decisions. It is reasonable
to ask if the page free entropy is sufficient, but it is not enough due
to the in-order initial freeing of pages. At the start of that process
putting page1 in front or behind page0 still keeps them close together,
page2 is still near page1 and has a high chance of being adjacent. As
more pages are added ordering diversity improves, but there is still
high page locality for the low address pages and this leads to no
significant impact to the cache conflict rate.

Cc: Michal Hocko 
Cc: Kees Cook 
Cc: Dave Hansen 
Signed-off-by: Dan Williams 
---
 include/linux/list.h   |   17 +
 include/linux/mm.h |   17 +
 include/linux/mmzone.h |4 +
 init/Kconfig   |   32 +
 mm/Makefile|1 
 mm/memblock.c  |9 ++-
 mm/memory_hotplug.c|2 +
 mm/page_alloc.c|2 +
 mm/shuffle.c   |  170 
 9 files changed, 253 insertions(+), 1 deletion(-)
 create mode 100644 mm/shuffle.c

diff --git a/include/linux/list.h b/include/linux/list.h
index de04cc5ed536..43f963328d7c 100644
--- a/include/linux/list.h
+++ b/include/linux/list.h
@@ -150,6 +150,23 @@ static inline void list_replace_init(struct list_head *old,
INIT_LIST_HEAD(old);
 }
 
+/**
+ * list_swap - replace entry1 with entry2 and re-add entry1 at entry2's position
+ * @entry1: the location to place entry2
+ * @entry2: the location to place entry1
+ */
+static inline void list_swap(struct list_head *entry1,
+struct list_head *entry2)
+{
+   struct list_head *pos = entry2->prev;
+
+   list_del(entry2);
+   list_replace(entry1, entry2);
+   if (pos == entry1)
+   pos = entry2;
+   list_add(entry1, pos);
+}
+
 /**
  * list_del_init - deletes entry from list and reinitialize it.
  * @entry: the element to delete from the list.
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 273d4dbd3883..5891bd4e5d29 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2043,6 +2043,23 @@ extern void mem_init_print_info(const char *str);
 
 extern void reserve_bootmem_region(phys_addr_t start, phys_addr_t end);
 
+#ifdef CONFIG_SHUFFLE_PAGE_ALLOCATOR
+extern void shuffle_free_memory(pg_data_t *pgdat, unsigned long start_pfn,
+   unsigned long end_pfn);
+extern void shuffle_zone(struct zone *z, unsigned l

[PATCH v4 3/3] mm: Maintain randomization of page free lists

2018-10-10 Thread Dan Williams
When freeing a page with an order >= shuffle_page_order randomly select
the front or back of the list for insertion.

While the mm tries to defragment physical pages into huge pages this can
tend to make the page allocator more predictable over time. Inject the
front-back randomness to preserve the initial randomness established by
shuffle_free_memory() when the kernel was booted.

The overhead of this manipulation is constrained by only being applied
for MAX_ORDER sized pages by default.

Cc: Michal Hocko 
Cc: Kees Cook 
Cc: Dave Hansen 
Signed-off-by: Dan Williams 
---
 include/linux/mm.h |   10 ++
 include/linux/mmzone.h |   10 ++
 mm/page_alloc.c|   11 +--
 mm/shuffle.c   |   16 
 4 files changed, 45 insertions(+), 2 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 856b0530c55d..91a1e7fb465a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2045,6 +2045,11 @@ extern void shuffle_free_memory(pg_data_t *pgdat, unsigned long start_pfn,
unsigned long end_pfn);
 extern void shuffle_zone(struct zone *z, unsigned long start_pfn,
unsigned long end_pfn);
+
+static inline bool is_shuffle_order(int order)
+{
+   return order >= CONFIG_SHUFFLE_PAGE_ORDER;
+}
 #else
 static inline void shuffle_free_memory(pg_data_t *pgdat, unsigned long start_pfn,
unsigned long end_pfn)
@@ -2055,6 +2060,11 @@ static inline void shuffle_zone(struct zone *z, unsigned long start_pfn,
unsigned long end_pfn)
 {
 }
+
+static inline bool is_shuffle_order(int order)
+{
+   return false;
+}
 #endif
 
 /* Free the reserved page into the buddy system, so it gets managed. */
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 0b91ce871895..c7abf21ed9f4 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -98,6 +98,8 @@ extern int page_group_by_mobility_disabled;
 struct free_area {
	struct list_head free_list[MIGRATE_TYPES];
unsigned long   nr_free;
+   u64 rand;
+   u8  rand_bits;
 };
 
 /* Used for pages not on another list */
@@ -116,6 +118,14 @@ static inline void add_to_free_area_tail(struct page *page, struct free_area *ar
area->nr_free++;
 }
 
+#ifdef CONFIG_SHUFFLE_PAGE_ALLOCATOR
+/* Used to preserve page allocation order entropy */
+void add_to_free_area_random(struct page *page, struct free_area *area,
+   int migratetype);
+#else
+#define add_to_free_area_random add_to_free_area
+#endif
+
 /* Used for pages which are on another list */
 static inline void move_to_free_area(struct page *page, struct free_area *area,
 int migratetype)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e1e0b54423f0..eef241ceb2c4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -42,6 +42,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -850,7 +851,8 @@ static inline void __free_one_page(struct page *page,
 * so it's less likely to be used soon and more likely to be merged
 * as a higher order page
 */
-   if ((order < MAX_ORDER-2) && pfn_valid_within(buddy_pfn)) {
+   if ((order < MAX_ORDER-2) && pfn_valid_within(buddy_pfn)
+   && !is_shuffle_order(order)) {
struct page *higher_page, *higher_buddy;
combined_pfn = buddy_pfn & pfn;
higher_page = page + (combined_pfn - pfn);
@@ -864,7 +866,12 @@ static inline void __free_one_page(struct page *page,
}
}
 
-   add_to_free_area(page, &zone->free_area[order], migratetype);
+   if (is_shuffle_order(order))
+   add_to_free_area_random(page, &zone->free_area[order],
+   migratetype);
+   else
+   add_to_free_area(page, &zone->free_area[order], migratetype);
+
 }
 
 /*
diff --git a/mm/shuffle.c b/mm/shuffle.c
index 5ed91b5b8441..3937d0bc3670 100644
--- a/mm/shuffle.c
+++ b/mm/shuffle.c
@@ -168,3 +168,19 @@ void __meminit shuffle_free_memory(pg_data_t *pgdat, unsigned long start_pfn,
for (z = pgdat->node_zones; z < pgdat->node_zones + MAX_NR_ZONES; z++)
shuffle_zone(z, start_pfn, end_pfn);
 }
+
+void add_to_free_area_random(struct page *page, struct free_area *area,
+   int migratetype)
+{
+   if (area->rand_bits == 0) {
+   area->rand_bits = 64;
+   area->rand = get_random_u64();
+   }
+
+   if (area->rand & 1)
+   add_to_free_area(page, area, migratetype);
+   else
+   add_to_free_area_tail(page, area, migratetype);
+   area->rand_bits--;
+   area->rand >>= 1;
+}



[PATCH v4 0/3] Randomize free memory

2018-10-10 Thread Dan Williams
Changes since v3 [1]:
* Replace runtime 'shuffle_page_order' parameter with a compile-time
  CONFIG_SHUFFLE_PAGE_ALLOCATOR on/off switch and a
  CONFIG_SHUFFLE_PAGE_ORDER if a distro decides that the default 4MB
  shuffling boundary is not sufficient. Administrators will not be
  burdened with making this decision. (Michal)

* Move shuffle related code into a new mm/shuffle.c file.

[1]: https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1783262.html

---

Some data exfiltration and return-oriented-programming attacks rely on
the ability to infer the location of sensitive data objects. The kernel
page allocator, especially early in system boot, has predictable
first-in-first out behavior for physical pages. Pages are freed in
physical address order when first onlined.

Quoting Kees:
"While we already have a base-address randomization
 (CONFIG_RANDOMIZE_MEMORY), attacks against the same hardware and
 memory layouts would certainly be using the predictability of
 allocation ordering (i.e. for attacks where the base address isn't
 important: only the relative positions between allocated memory).
 This is common in lots of heap-style attacks. They try to gain
 control over ordering by spraying allocations, etc.

 I'd really like to see this because it gives us something similar
 to CONFIG_SLAB_FREELIST_RANDOM but for the page allocator."

Another motivation for this change is performance in the presence of a
memory-side cache. In the future, memory-side-cache technology will be
available on generally available server platforms. The proposed
randomization approach has been measured to improve the cache conflict
rate by a factor of 2.5X on a well-known Java benchmark. It avoids
performance peaks and valleys to provide more predictable performance.

The initial randomization in patch1 can be undone over time so patch3 is
introduced to inject entropy on page free decisions. It is reasonable to
ask if the page free entropy is sufficient, but it is not enough due to
the in-order initial freeing of pages. At the start of that process
putting page1 in front or behind page0 still keeps them close together,
page2 is still near page1 and has a high chance of being adjacent. As
more pages are added ordering diversity improves, but there is still
high page locality for the low address pages and this leads to no
significant impact to the cache conflict rate.

More details in the patch1 commit message.

---

Dan Williams (3):
  mm: Shuffle initial free memory
  mm: Move buddy list manipulations into helpers
  mm: Maintain randomization of page free lists


 include/linux/list.h |   17 
 include/linux/mm.h   |   30 +++
 include/linux/mm_types.h |3 +
 include/linux/mmzone.h   |   65 
 init/Kconfig |   32 
 mm/Makefile  |1 
 mm/compaction.c  |4 -
 mm/memblock.c|9 ++
 mm/memory_hotplug.c  |2 
 mm/page_alloc.c  |   81 +---
 mm/shuffle.c |  186 ++
 11 files changed, 381 insertions(+), 49 deletions(-)
 create mode 100644 mm/shuffle.c


Re: [PATCH v2 0/3] Randomize free memory

2018-10-10 Thread Dan Williams
On Wed, Oct 10, 2018 at 1:48 AM Michal Hocko  wrote:
>
> On Tue 09-10-18 10:34:55, Dan Williams wrote:
> > On Tue, Oct 9, 2018 at 4:28 AM Michal Hocko  wrote:
> > >
> > > On Thu 04-10-18 09:44:35, Dan Williams wrote:
> > > > Hi Michal,
> > > >
> > > > On Thu, Oct 4, 2018 at 12:53 AM Michal Hocko  wrote:
> > > > >
> > > > > On Wed 03-10-18 19:15:18, Dan Williams wrote:
> > > > > > Changes since v1:
> > > > > > * Add support for shuffling hot-added memory (Andrew)
> > > > > > * Update cover letter and commit message to clarify the performance 
> > > > > > impact
> > > > > >   and relevance to future platforms
> > > > >
> > > > > I believe this hasn't addressed my questions in
> > > > > http://lkml.kernel.org/r/20181002143015.gx18...@dhcp22.suse.cz. Namely
> > > > > "
> > > > > It is the more general idea that I am not really sure about. First of
> > > > > all. Does it make _any_ sense to randomize 4MB blocks by default? Why
> > > > > cannot we simply have it disabled?
> > > >
> > > > I'm not aware of any CVE that this would directly preclude, but that
> > > > said the entropy injected at 4MB boundaries raises the bar on heap
> > > > attacks. Environments that want more can adjust that with the boot
> > > > parameter. Given the potential benefits I think it would only make
> > > > sense to default disable it if there was a significant runtime impact,
> > > > from what I have seen there isn't.
> > > >
> > > > > Then and more concerning question is,
> > > > > does it even make sense to have this randomization applied to higher
> > > > > orders than 0? Attacker might fragment the memory and keep recycling 
> > > > > the
> > > > > lowest order and get the predictable behavior that we have right now.
> > > >
> > > > Certainly I expect there are attacks that can operate within a 4MB
> > > > window, as I expect there are attacks that could operate within a 4K
> > > > window that would need sub-page randomization to deter. In fact I
> > > > believe that is the motivation for CONFIG_SLAB_FREELIST_RANDOM.
> > > > Combining that with page allocator randomization makes the kernel less
> > > > predictable.
> > >
> > > I am sorry but this hasn't explained anything (at least to me). I can
> > > still see a way to bypass this randomization by fragmenting the memory.
> > > With that possibility in place this doesn't really provide the promised
> > > additional security. So either I am missing something or the per-order
> > > threshold is simply a wrong interface to a broken security misfeature.
> >
> > I think a similar argument can be made against
> > CONFIG_SLAB_FREELIST_RANDOM the randomization benefits can be defeated
> > with more effort, and more effort is the entire point.
>
> If there is relatively simple way to achieve that (which I dunno about
> the slab free list randomization because I am not familiar with the
> implementation) then the feature is indeed questionable. I would
> understand an argument about feasibility if bypassing was extremely hard
> but fragmenting the memory is relatively a simple task.
>
> > > > Is that enough justification for this patch on its own?
> > >
> > > I do not think so from what I have heard so far.
> >
> > I'm missing what bar you are judging the criteria for these patches,
> > my bar is increased protection against allocation ordering attacks as
> > seconded by Kees, and the memory side caching effects.
>
> As said above, if it is quite easy to bypass the randomization then
> calling and advertising this as a security feature is dubious. Not
> enough to outright nak it of course but also not something I would put my
> stamp on. And arguments would be much more solid if they were backed by
> some numbers (not only for the security aspect but also the side caching
> effects).

In fact you don't even need to fragment since you'll have 4MB
contiguous targets by default, but that's not the point. We'll now
have more entropy in the allocation order to compliment the entropy
introduced at the per-SLAB level with CONFIG_SLAB_FREELIST_RANDOM.

...and now that I've made that argument I think I've come around to
your point about the shuffle_page_order parameter. The only entity
that might have a better clue about "safer" shuffle orders than
MAX_ORDER is the di

[PATCH v3 2/3] mm: Move buddy list manipulations into helpers

2018-10-10 Thread Dan Williams
In preparation for runtime randomization of the zone lists, take all
(well, most of) the list_*() functions in the buddy allocator and put
them in helper functions. Provide a common control point for injecting
additional behavior when freeing pages.

Cc: Michal Hocko 
Cc: Dave Hansen 
Signed-off-by: Dan Williams 
---
 include/linux/mm.h   |3 --
 include/linux/mm_types.h |3 ++
 include/linux/mmzone.h   |   51 ++
 mm/compaction.c  |4 +--
 mm/page_alloc.c  |   70 ++
 5 files changed, 84 insertions(+), 47 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1692c1a36883..02f8f3eff9c4 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -473,9 +473,6 @@ static inline void vma_set_anonymous(struct vm_area_struct *vma)
 struct mmu_gather;
 struct inode;
 
-#define page_private(page) ((page)->private)
-#define set_page_private(page, v)  ((page)->private = (v))
-
 #if !defined(__HAVE_ARCH_PTE_DEVMAP) || !defined(CONFIG_TRANSPARENT_HUGEPAGE)
 static inline int pmd_devmap(pmd_t pmd)
 {
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 5ed8f6292a53..72f37ea6dedb 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -209,6 +209,9 @@ struct page {
 #define PAGE_FRAG_CACHE_MAX_SIZE   __ALIGN_MASK(32768, ~PAGE_MASK)
 #define PAGE_FRAG_CACHE_MAX_ORDER  get_order(PAGE_FRAG_CACHE_MAX_SIZE)
 
+#define page_private(page) ((page)->private)
+#define set_page_private(page, v)  ((page)->private = (v))
+
 struct page_frag_cache {
void * va;
 #if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 15029fedbfe6..0b91ce871895 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -18,6 +18,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 #include 
 
 /* Free memory management - zoned buddy allocator.  */
@@ -98,6 +100,55 @@ struct free_area {
unsigned long   nr_free;
 };
 
+/* Used for pages not on another list */
+static inline void add_to_free_area(struct page *page, struct free_area *area,
+int migratetype)
+{
+   list_add(&page->lru, &area->free_list[migratetype]);
+   area->nr_free++;
+}
+
+/* Used for pages not on another list */
+static inline void add_to_free_area_tail(struct page *page, struct free_area *area,
+ int migratetype)
+{
+   list_add_tail(&page->lru, &area->free_list[migratetype]);
+   area->nr_free++;
+}
+
+/* Used for pages which are on another list */
+static inline void move_to_free_area(struct page *page, struct free_area *area,
+int migratetype)
+{
+   list_move(&page->lru, &area->free_list[migratetype]);
+}
+
+static inline struct page *get_page_from_free_area(struct free_area *area,
+   int migratetype)
+{
+   return list_first_entry_or_null(&area->free_list[migratetype],
+   struct page, lru);
+}
+
+static inline void rmv_page_order(struct page *page)
+{
+   __ClearPageBuddy(page);
+   set_page_private(page, 0);
+}
+
+static inline void del_page_from_free_area(struct page *page,
+   struct free_area *area, int migratetype)
+{
+   list_del(&page->lru);
+   rmv_page_order(page);
+   area->nr_free--;
+}
+
+static inline bool free_area_empty(struct free_area *area, int migratetype)
+{
+   return list_empty(&area->free_list[migratetype]);
+}
+
 struct pglist_data;
 
 /*
diff --git a/mm/compaction.c b/mm/compaction.c
index 7c607479de4a..44adbfa073b3 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1359,13 +1359,13 @@ static enum compact_result __compact_finished(struct zone *zone,
bool can_steal;
 
/* Job done if page is free of the right migratetype */
-   if (!list_empty(&area->free_list[migratetype]))
+   if (!free_area_empty(area, migratetype))
return COMPACT_SUCCESS;
 
 #ifdef CONFIG_CMA
/* MIGRATE_MOVABLE can fallback on MIGRATE_CMA */
if (migratetype == MIGRATE_MOVABLE &&
-   !list_empty(&area->free_list[MIGRATE_CMA]))
+   !free_area_empty(area, MIGRATE_CMA))
return COMPACT_SUCCESS;
 #endif
/*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 14a9a8273ab9..c3ee504f1a91 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -712,12 +712,6 @@ static inline void set_page_order(struct page *page, unsigned int order)
__SetPageBuddy(page);
 }
 
-static inline void rmv_page_order(struct page *page)
-{
-   __ClearPageBuddy(page);
-   set_page_private(page, 0);
-}
-
 /*
  * This function checks whether a page is free && is the buddy
  * we can coalesce a page 

[PATCH v3 0/3] Randomize free memory

2018-10-10 Thread Dan Williams
Changes since v2
* Clarify in the changelog why front-back randomization is not
  sufficient (Michal)
* Rebase on mmotm which results in dropping bootmem randomization and
  adding memblock randomization.

---

Some data exfiltration and return-oriented-programming attacks rely on
the ability to infer the location of sensitive data objects. The kernel
page allocator, especially early in system boot, has predictable
first-in-first out behavior for physical pages. Pages are freed in
physical address order when first onlined.

Quoting Kees:
"While we already have a base-address randomization
 (CONFIG_RANDOMIZE_MEMORY), attacks against the same hardware and
 memory layouts would certainly be using the predictability of
 allocation ordering (i.e. for attacks where the base address isn't
 important: only the relative positions between allocated memory).
 This is common in lots of heap-style attacks. They try to gain
 control over ordering by spraying allocations, etc.

 I'd really like to see this because it gives us something similar
 to CONFIG_SLAB_FREELIST_RANDOM but for the page allocator."

Another motivation for this change is performance in the presence of a
memory-side cache. In the future, memory-side-cache technology will be
available on generally available server platforms. The proposed
randomization approach has been measured to improve the cache conflict
rate by a factor of 2.5X on a well-known Java benchmark. It avoids
performance peaks and valleys to provide more predictable performance.

The initial randomization in patch1 can be undone over time, so patch3 is
introduced to inject entropy on page-free decisions. It is reasonable to
ask whether that page-free entropy is sufficient on its own; it is not,
due to the in-order initial freeing of pages. At the start of that
process, putting page1 in front of or behind page0 still keeps them close
together, page2 is still near page1 and has a high chance of being
adjacent. As more pages are added, ordering diversity improves, but there
is still high page locality for the low address pages, and this leads to
no significant impact on the cache conflict rate.

More details in the patch1 commit message.

Dan Williams (3):
  mm: Shuffle initial free memory
  mm: Move buddy list manipulations into helpers
  mm: Maintain randomization of page free lists


 include/linux/list.h |   17 +++
 include/linux/mm.h   |8 +
 include/linux/mm_types.h |3 +
 include/linux/mmzone.h   |   57 ++
 mm/compaction.c  |4 -
 mm/memblock.c|9 +-
 mm/memory_hotplug.c  |2 
 mm/page_alloc.c  |  267 +++---
 8 files changed, 317 insertions(+), 50 deletions(-)


[PATCH v3 1/3] mm: Shuffle initial free memory

2018-10-10 Thread Dan Williams
Some data exfiltration and return-oriented-programming attacks rely on
the ability to infer the location of sensitive data objects. The kernel
page allocator, especially early in system boot, has predictable
first-in-first out behavior for physical pages. Pages are freed in
physical address order when first onlined.

Introduce shuffle_free_memory(), and its helper shuffle_zone(), to
perform a Fisher-Yates shuffle of the page allocator 'free_area' lists
when they are initially populated with free memory at boot and at
hotplug time.
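
The shuffle implementation itself is not included in the hunks quoted
below, but conceptually it is a Fisher-Yates pass over each free list:
walk the entries from the back toward the front and swap each one with a
randomly chosen entry at or before it. A minimal sketch using the
list_swap() helper added by this patch; shuffle_one_list() and
list_nth_entry() are illustrative names only, and the real code selects
pages by pfn and handles migratetypes rather than walking list indices:

/* Illustrative sketch, not part of the posted patch. */
static struct list_head *list_nth_entry(struct list_head *head, int n)
{
	struct list_head *pos = head->next;

	while (n--)
		pos = pos->next;
	return pos;
}

static void shuffle_one_list(struct list_head *head, int nr_entries)
{
	int i;

	/* Fisher-Yates: swap entry i with a random entry j, 0 <= j <= i */
	for (i = nr_entries - 1; i > 0; i--) {
		struct list_head *a = list_nth_entry(head, i);
		struct list_head *b = list_nth_entry(head,
				get_random_u32() % (i + 1));

		if (a != b)
			list_swap(a, b);
	}
}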

Quoting Kees:
"While we already have a base-address randomization
 (CONFIG_RANDOMIZE_MEMORY), attacks against the same hardware and
 memory layouts would certainly be using the predictability of
 allocation ordering (i.e. for attacks where the base address isn't
 important: only the relative positions between allocated memory).
 This is common in lots of heap-style attacks. They try to gain
 control over ordering by spraying allocations, etc.

 I'd really like to see this because it gives us something similar
 to CONFIG_SLAB_FREELIST_RANDOM but for the page allocator."

Another motivation for this change is performance in the presence of a
memory-side cache. In the future, memory-side-cache technology will be
available on generally available server platforms. The proposed
randomization approach has been measured to improve the cache conflict
rate by a factor of 2.5X on a well-known Java benchmark. It avoids
performance peaks and valleys to provide more predictable performance.

While SLAB_FREELIST_RANDOM reduces the predictability of some local slab
caches, it leaves the vast bulk of memory to be allocated in a
predictable order. That ordering can be detected by a memory-side cache.

The shuffling is done in terms of 'shuffle_page_order'-sized free pages,
where the default shuffle_page_order is MAX_ORDER-1, i.e. order 10 (4MB);
this trades off randomization granularity for time spent shuffling.
MAX_ORDER-1 was chosen to be minimally invasive to the page allocator
while still showing memory-side cache behavior improvements.
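
For reference, a boot-time knob of this shape could be wired up roughly
as follows. This is only a sketch; the exact parameter plumbing is not
shown in the quoted hunks, and set_shuffle_page_order() is an
illustrative name:

/* Sketch of the shuffle granularity knob; illustrative only. */
int shuffle_page_order __read_mostly = MAX_ORDER - 1;

static int __init set_shuffle_page_order(char *str)
{
	int order;

	if (kstrtoint(str, 0, &order) || order < 0 || order >= MAX_ORDER)
		return -EINVAL;
	shuffle_page_order = order;
	return 0;
}
early_param("shuffle_page_order", set_shuffle_page_order);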

The performance impact of the shuffling appears to be in the noise
compared to other memory initialization work. Also the bulk of the work
is done in the background as a part of deferred_init_memmap().

This initial randomization can be undone over time, so a follow-on patch
is introduced to inject entropy on page-free decisions. It is reasonable
to ask whether that page-free entropy is sufficient on its own; it is
not, due to the in-order initial freeing of pages. At the start of that
process, putting page1 in front of or behind page0 still keeps them close
together, page2 is still near page1 and has a high chance of being
adjacent. As more pages are added, ordering diversity improves, but there
is still high page locality for the low address pages, and this leads to
no significant impact on the cache conflict rate.

Cc: Michal Hocko 
Cc: Kees Cook 
Cc: Dave Hansen 
Signed-off-by: Dan Williams 
---
 include/linux/list.h   |   17 +
 include/linux/mm.h |5 +
 include/linux/mmzone.h |4 +
 mm/memblock.c  |9 ++-
 mm/memory_hotplug.c|2 +
 mm/page_alloc.c|  172 
 6 files changed, 207 insertions(+), 2 deletions(-)

diff --git a/include/linux/list.h b/include/linux/list.h
index de04cc5ed536..43f963328d7c 100644
--- a/include/linux/list.h
+++ b/include/linux/list.h
@@ -150,6 +150,23 @@ static inline void list_replace_init(struct list_head *old,
INIT_LIST_HEAD(old);
 }
 
+/**
+ * list_swap - replace entry1 with entry2 and re-add entry1 at entry2's position
+ * @entry1: the location to place entry2
+ * @entry2: the location to place entry1
+ */
+static inline void list_swap(struct list_head *entry1,
+struct list_head *entry2)
+{
+   struct list_head *pos = entry2->prev;
+
+   list_del(entry2);
+   list_replace(entry1, entry2);
+   if (pos == entry1)
+   pos = entry2;
+   list_add(entry1, pos);
+}
+
 /**
  * list_del_init - deletes entry from list and reinitialize it.
  * @entry: the element to delete from the list.
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 273d4dbd3883..1692c1a36883 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2042,7 +2042,10 @@ extern void adjust_managed_page_count(struct page *page, long count);
 extern void mem_init_print_info(const char *str);
 
 extern void reserve_bootmem_region(phys_addr_t start, phys_addr_t end);
-
+extern void shuffle_free_memory(pg_data_t *pgdat, unsigned long start_pfn,
+   unsigned long end_pfn);
+extern void shuffle_zone(struct zone *z, unsigned long start_pfn,
+   unsigned long end_pfn);
 /* Free the reserved page into the buddy system, so it gets managed. */
 static inline void __free_reserved_page(struct page *page)
 {
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index

[PATCH v3 3/3] mm: Maintain randomization of page free lists

2018-10-10 Thread Dan Williams
When freeing a page with an order >= shuffle_page_order, randomly select
the front or the back of the free list for insertion.

While the mm tries to defragment physical pages into huge pages, this can
tend to make the page allocator more predictable over time. Inject the
front/back randomness to preserve the initial randomness established by
shuffle_free_memory() when the kernel was booted.

The overhead of this manipulation is constrained by only applying it to
MAX_ORDER-sized pages by default.

Cc: Michal Hocko 
Cc: Kees Cook 
Cc: Dave Hansen 
Signed-off-by: Dan Williams 
---
 include/linux/mmzone.h |2 ++
 mm/page_alloc.c|   27 +--
 2 files changed, 27 insertions(+), 2 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 0b91ce871895..4fae2f7042aa 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -98,6 +98,8 @@ extern int page_group_by_mobility_disabled;
 struct free_area {
struct list_headfree_list[MIGRATE_TYPES];
unsigned long   nr_free;
+   u64 rand;
+   u8  rand_bits;
 };
 
 /* Used for pages not on another list */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c3ee504f1a91..d4b375cf5b22 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -42,6 +42,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -753,6 +754,22 @@ static inline int page_is_buddy(struct page *page, struct page *buddy,
return 0;
 }
 
+static void add_to_free_area_random(struct page *page, struct free_area *area,
+   int migratetype)
+{
+   if (area->rand_bits == 0) {
+   area->rand_bits = 64;
+   area->rand = get_random_u64();
+   }
+
+   if (area->rand & 1)
+   add_to_free_area(page, area, migratetype);
+   else
+   add_to_free_area_tail(page, area, migratetype);
+   area->rand_bits--;
+   area->rand >>= 1;
+}
+
 /*
  * Freeing function for a buddy system allocator.
  *
@@ -858,7 +875,8 @@ static inline void __free_one_page(struct page *page,
 * so it's less likely to be used soon and more likely to be merged
 * as a higher order page
 */
-   if ((order < MAX_ORDER-2) && pfn_valid_within(buddy_pfn)) {
+   if ((order < MAX_ORDER-2) && pfn_valid_within(buddy_pfn)
+   && order < shuffle_page_order) {
struct page *higher_page, *higher_buddy;
combined_pfn = buddy_pfn & pfn;
higher_page = page + (combined_pfn - pfn);
@@ -872,7 +890,12 @@ static inline void __free_one_page(struct page *page,
}
}
 
-   add_to_free_area(page, &zone->free_area[order], migratetype);
+   if (order < shuffle_page_order)
+   add_to_free_area(page, &zone->free_area[order], migratetype);
+   else
+   add_to_free_area_random(page, &zone->free_area[order],
+   migratetype);
+
 }
 
 /*



Re: [PATCH v5 4/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap

2018-10-09 Thread Dan Williams
On Tue, Oct 9, 2018 at 1:34 PM Alexander Duyck
 wrote:
>
> On 10/9/2018 11:04 AM, Dan Williams wrote:
> > On Tue, Oct 9, 2018 at 3:21 AM Yi Zhang  wrote:
[..]
> > That comment is incorrect; device-pages are never onlined. So I think
> > we can just skip that call to __SetPageReserved() unless the memory
> > range is MEMORY_DEVICE_{PRIVATE,PUBLIC}.
> >
>
> When pages are "onlined" via __free_pages_boot_core they clear the
> reserved bit; that is the reason for the comment. The reserved bit is
> meant to indicate that the page cannot be swapped out or moved based on
> the description of the bit.

...but ZONE_DEVICE pages are never onlined so I would expect
memmap_init_zone_device() to know that detail.

> I would think with that being the case we still probably need the call
> to __SetPageReserved to set the bit with the expectation that it will
> not be cleared for device-pages since the pages are not onlined.
> Removing the call to __SetPageReserved would probably introduce a number
> of regressions as there are multiple spots that use the reserved bit to
> determine if a page can be swapped out to disk, mapped as system memory,
> or migrated.

Right, this is what Yi is working on... the PageReserved flag is
problematic for KVM. Auditing those locations, it seems that as long as
we teach hibernation to avoid ZONE_DEVICE ranges we can safely not set
the reserved flag for DAX pages. What I'm trying to avoid is a local
KVM hack to check for DAX pages when the Reserved flag is not
otherwise needed.
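
Concretely, the direction being discussed would be something like the
following in memmap_init_zone_device(). This is a sketch under the
assumption that the pgmap type is available at that point;
maybe_set_page_reserved() is an illustrative name, not an existing
helper:

/*
 * Sketch only: mark ZONE_DEVICE pages reserved just for the
 * device-private/public cases, leaving DAX pages unreserved so KVM
 * does not need a PageReserved special case.
 */
static void maybe_set_page_reserved(struct page *page,
				    struct dev_pagemap *pgmap)
{
	if (pgmap->type == MEMORY_DEVICE_PRIVATE ||
	    pgmap->type == MEMORY_DEVICE_PUBLIC)
		__SetPageReserved(page);
}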


Re: [PATCH v2 1/3] mm: Shuffle initial free memory

2018-10-09 Thread Dan Williams
On Tue, Oct 9, 2018 at 4:16 AM Michal Hocko  wrote:
>
> On Thu 04-10-18 09:51:37, Dan Williams wrote:
> > On Thu, Oct 4, 2018 at 12:48 AM Michal Hocko  wrote:
[..]
> > So the reason front-back randomization is not enough is due to the
> > in-order initial freeing of pages. At the start of that process
> > putting page1 in front of or behind page0 still keeps them close
> > together, page2 is still near page1 and has a high chance of being
> > adjacent. As more pages are added ordering diversity improves, but
> > there is still high page locality for the low address pages and this
> > leads to no significant impact on the cache conflict rate. Patch3 is
> > enough to keep the entropy sustained over time, but it's not enough
> > initially.
>
> That should be in the changelog IMHO.

Fair enough, I'll fold that in when I rebase on top of -next.


Re: [PATCH v2 0/3] Randomize free memory

2018-10-09 Thread Dan Williams
On Tue, Oct 9, 2018 at 4:28 AM Michal Hocko  wrote:
>
> On Thu 04-10-18 09:44:35, Dan Williams wrote:
> > Hi Michal,
> >
> > On Thu, Oct 4, 2018 at 12:53 AM Michal Hocko  wrote:
> > >
> > > On Wed 03-10-18 19:15:18, Dan Williams wrote:
> > > > Changes since v1:
> > > > * Add support for shuffling hot-added memory (Andrew)
> > > > * Update cover letter and commit message to clarify the performance impact
> > > >   and relevance to future platforms
> > >
> > > I believe this hasn't addressed my questions in
> > > http://lkml.kernel.org/r/20181002143015.gx18...@dhcp22.suse.cz. Namely
> > > "
> > > It is the more general idea that I am not really sure about. First of
> > > all, does it make _any_ sense to randomize 4MB blocks by default? Why
> > > cannot we simply have it disabled?
> >
> > I'm not aware of any CVE that this would directly preclude, but that
> > said, the entropy injected at 4MB boundaries raises the bar on heap
> > attacks. Environments that want more can adjust that with the boot
> > parameter. Given the potential benefits I think it would only make
> > sense to default-disable it if there was a significant runtime impact;
> > from what I have seen there isn't.
> >
> > > Then, a more concerning question is:
> > > does it even make sense to have this randomization applied to orders
> > > higher than 0? An attacker might fragment the memory, keep recycling the
> > > lowest order, and get the predictable behavior that we have right now.
> >
> > Certainly I expect there are attacks that can operate within a 4MB
> > window, as I expect there are attacks that could operate within a 4K
> > window that would need sub-page randomization to deter. In fact I
> > believe that is the motivation for CONFIG_SLAB_FREELIST_RANDOM.
> > Combining that with page allocator randomization makes the kernel less
> > predictable.
>
> I am sorry but this hasn't explained anything (at least to me). I can
> still see a way to bypass this randomization by fragmenting the memory.
> With that possibility in place this doesn't really provide the promised
> additional security. So either I am missing something or the per-order
> threshold is simply a wrong interface to a broken security misfeature.

I think a similar argument can be made against
CONFIG_SLAB_FREELIST_RANDOM: the randomization benefits can be defeated
with more effort, and more effort is the entire point.

> > Is that enough justification for this patch on its own?
>
> I do not think so from what I have heard so far.

I'm missing what bar you are judging these patches against;
my bar is increased protection against allocation-ordering attacks, as
seconded by Kees, and the memory-side caching effects. That said, I
don't have a known CVE in my mind that would be mitigated by 4MB page
shuffling.

> > It's
> > debatable. Combine that though with the wider availability of
> > platforms with memory-side-cache and I think it's a reasonable default
> > behavior for the kernel to deploy.
>
> OK, this sounds a bit more interesting. I am going to speculate because
> memory-side-cache is way too generic of a term for me to imagine
> anything specific.

No need to imagine; a memory-side cache shipped on a previous product,
as Robert linked in his comments.

> Many years back while at a university I was playing
> with page coloring as a method to reach more stable performance
> results due to reduced cache conflicts. It was not always a performance
> gain but it definitely allowed for more stable run-to-run comparable
> results. I can imagine that a randomization might lead to a similar effect
> although I am not sure how much and it would be more interesting to hear
> about that effect.

Cache coloring is effective up until your workload no longer fits in
that color. Randomization helps to attenuate the cache conflict rate
when that happens. For workloads that may fit in the cache, and/or
environments that need more explicit cache control we have the recent
changes to numa_emulation [1] to arrange for cache sized numa nodes.
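
For reference, the uniform-split capability added by that commit is
requested on the kernel command line, e.g. something like:

    numa=fake=4U

which splits each physical node into four equally sized emulated nodes;
the node count would be chosen so that each emulated node approximates
the memory-side cache size (four is just an example, the right value is
platform dependent).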

> If this is really the case then I would assume an on/off
> knob to control the randomization without something as specific as
> order.

Are we only debating the enabling knob at this point? I'm not opposed
to changing that, but I do think we want to keep the rest of the
infrastructure to allow for shuffling on a variable page size boundary
in case there are enhanced security benefits at smaller buddy-page
sizes.

[1]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=cc9aec03e58f

