Re: [PATCH v2 5/9] mm: Initialize struct vm_unmapped_area_info

2024-02-28 Thread Kirill A. Shutemov
On Mon, Feb 26, 2024 at 11:09:47AM -0800, Rick Edgecombe wrote:
> diff --git a/arch/alpha/kernel/osf_sys.c b/arch/alpha/kernel/osf_sys.c
> index 5db88b627439..dd6801bb9240 100644
> --- a/arch/alpha/kernel/osf_sys.c
> +++ b/arch/alpha/kernel/osf_sys.c
> @@ -1218,7 +1218,7 @@ static unsigned long
>  arch_get_unmapped_area_1(unsigned long addr, unsigned long len,
>unsigned long limit)
>  {
> - struct vm_unmapped_area_info info;
> + struct vm_unmapped_area_info info = {};
>  
>   info.flags = 0;
>   info.length = len;

Can we make a step forward and actually move the initialization inside the
initializer? Something like below.

I understand that it is substantially more work, but I think it is useful.

diff --git a/arch/alpha/kernel/osf_sys.c b/arch/alpha/kernel/osf_sys.c
index 5db88b627439..c40ddede3b13 100644
--- a/arch/alpha/kernel/osf_sys.c
+++ b/arch/alpha/kernel/osf_sys.c
@@ -1218,14 +1218,12 @@ static unsigned long
 arch_get_unmapped_area_1(unsigned long addr, unsigned long len,
 unsigned long limit)
 {
-   struct vm_unmapped_area_info info;
+   struct vm_unmapped_area_info info = {
+   .length = len,
+   .low_limit = addr,
+   .high_limit = limit,
+   };

-   info.flags = 0;
-   info.length = len;
-   info.low_limit = addr;
-   info.high_limit = limit;
-   info.align_mask = 0;
-   info.align_offset = 0;
	return vm_unmapped_area(&info);
 }

-- 
  Kiryl Shutsemau / Kirill A. Shutemov


Re: [RFC PATCH v11 10/29] mm: Add AS_UNMOVABLE to mark mapping as completely unmovable

2023-07-26 Thread Kirill A . Shutemov
On Tue, Jul 25, 2023 at 01:51:55PM +0100, Matthew Wilcox wrote:
> On Tue, Jul 25, 2023 at 01:24:03PM +0300, Kirill A . Shutemov wrote:
> > On Tue, Jul 18, 2023 at 04:44:53PM -0700, Sean Christopherson wrote:
> > > diff --git a/mm/compaction.c b/mm/compaction.c
> > > index dbc9f86b1934..a3d2b132df52 100644
> > > --- a/mm/compaction.c
> > > +++ b/mm/compaction.c
> > > @@ -1047,6 +1047,10 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
> > > 	if (!mapping && (folio_ref_count(folio) - 1) > folio_mapcount(folio))
> > >   goto isolate_fail_put;
> > >  
> > > + /* The mapping truly isn't movable. */
> > > + if (mapping && mapping_unmovable(mapping))
> > > + goto isolate_fail_put;
> > > +
> > 
> > I doubt that it is safe to dereference mapping here. I believe the folio
> > can be truncated from under us and the mapping freed with the inode.
> > 
> > The folio has to be locked to dereference mapping safely (given that the
> > mapping is still tied to the folio).
> 
> There's even a comment to that effect later on in the function:
> 
> /*
>  * Only pages without mappings or that have a
>  * ->migrate_folio callback are possible to migrate
>  * without blocking. However, we can be racing with
>  * truncation so it's necessary to lock the page
>  * to stabilise the mapping as truncation holds
>  * the page lock until after the page is removed
>  * from the page cache.
>  */
> 
> (that could be reworded to make it clear how dangerous dereferencing
> ->mapping is without the lock ... and it does need to be changed to say
> "folio lock" instead of "page lock", so ...)
> 
> How does this look?
> 
> /*
>  * Only folios without mappings or that have
>  * a ->migrate_folio callback are possible to
>  * migrate without blocking. However, we can
>  * be racing with truncation, which can free
>  * the mapping.  Truncation holds the folio lock
>  * until after the folio is removed from the page
>  * cache so holding it ourselves is sufficient.
>  */
> 

Looks good to me.
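
For reference, a minimal sketch of what doing the check under the folio lock
could look like (mapping_unmovable() is the helper added by this series; the
label and exact placement inside isolate_migratepages_block() are
illustrative only):

	if (unlikely(!folio_trylock(folio)))
		goto isolate_fail_put;

	/*
	 * Dereference ->mapping for the unmovable check only now that the
	 * folio lock has stabilised it, so a racing truncation cannot free
	 * the mapping under us.
	 */
	mapping = folio_mapping(folio);
	if (mapping && mapping_unmovable(mapping)) {
		folio_unlock(folio);
		goto isolate_fail_put;
	}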

-- 
  Kiryl Shutsemau / Kirill A. Shutemov


Re: [RFC PATCH v11 10/29] mm: Add AS_UNMOVABLE to mark mapping as completely unmovable

2023-07-25 Thread Kirill A . Shutemov
On Tue, Jul 18, 2023 at 04:44:53PM -0700, Sean Christopherson wrote:
> diff --git a/mm/compaction.c b/mm/compaction.c
> index dbc9f86b1934..a3d2b132df52 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -1047,6 +1047,10 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
> 	if (!mapping && (folio_ref_count(folio) - 1) > folio_mapcount(folio))
>   goto isolate_fail_put;
>  
> + /* The mapping truly isn't movable. */
> + if (mapping && mapping_unmovable(mapping))
> + goto isolate_fail_put;
> +

I doubt that it is safe to dereference mapping here. I believe the folio
can be truncated from under us and the mapping freed with the inode.

The folio has to be locked to dereference mapping safely (given that the
mapping is still tied to the folio).

Vlastimil, any comments?

-- 
  Kiryl Shutsemau / Kirill A. Shutemov


Re: [PATCH 00/14] arch,mm: cleanup Kconfig entries for ARCH_FORCE_MAX_ORDER

2023-03-23 Thread Kirill A. Shutemov
On Thu, Mar 23, 2023 at 11:21:42AM +0200, Mike Rapoport wrote:
> From: "Mike Rapoport (IBM)" 
> 
> Hi,
> 
> Several architectures have ARCH_FORCE_MAX_ORDER in their Kconfig and
> they all have wrong and misleading prompt and help text for this option.
> 
> Besides, some define insane limits for possible values of
> ARCH_FORCE_MAX_ORDER, some carefully define ranges only for a subset of
> possible configurations, some make this option configurable by users for no
> good reason.
> 
> This set updates the prompt and help text everywhere and does its best to
> update actual definitions of ranges where applicable.
> 
> Mike Rapoport (IBM) (14):
>   arm: reword ARCH_FORCE_MAX_ORDER prompt and help text
>   arm64: drop ranges in definition of ARCH_FORCE_MAX_ORDER
>   arm64: reword ARCH_FORCE_MAX_ORDER prompt and help text
>   csky: drop ARCH_FORCE_MAX_ORDER
>   ia64: don't allow users to override ARCH_FORCE_MAX_ORDER
>   m68k: reword ARCH_FORCE_MAX_ORDER prompt and help text
>   nios2: reword ARCH_FORCE_MAX_ORDER prompt and help text
>   nios2: drop ranges for definition of ARCH_FORCE_MAX_ORDER
>   powerpc: reword ARCH_FORCE_MAX_ORDER prompt and help text
>   powerpc: drop ranges for definition of ARCH_FORCE_MAX_ORDER
>   sh: reword ARCH_FORCE_MAX_ORDER prompt and help text
>   sh: drop ranges for definition of ARCH_FORCE_MAX_ORDER
>   sparc: reword ARCH_FORCE_MAX_ORDER prompt and help text
>   xtensa: reword ARCH_FORCE_MAX_ORDER prompt and help text
> 
>  arch/arm/Kconfig  | 16 +---
>  arch/arm64/Kconfig| 27 ---
>  arch/csky/Kconfig |  4 
>  arch/ia64/Kconfig |  3 +--
>  arch/m68k/Kconfig.cpu | 16 +---
>  arch/nios2/Kconfig| 17 +
>  arch/powerpc/Kconfig  | 22 +-
>  arch/sh/mm/Kconfig| 19 +--
>  arch/sparc/Kconfig| 16 +++++---
>  arch/xtensa/Kconfig   | 16 +---
>  10 files changed, 76 insertions(+), 80 deletions(-)

Acked-by: Kirill A. Shutemov 

-- 
  Kiryl Shutsemau / Kirill A. Shutemov


Re: [PATCH v3 5/8] x86/sme: Replace occurrences of sme_active() with cc_platform_has()

2021-09-24 Thread Kirill A. Shutemov
On Thu, Sep 23, 2021 at 08:21:03PM +0200, Borislav Petkov wrote:
> On Thu, Sep 23, 2021 at 12:05:58AM +0300, Kirill A. Shutemov wrote:
> > Unless we find other way to guarantee RIP-relative access, we must use
> > fixup_pointer() to access any global variables.
> 
> Yah, I've asked compiler folks about any guarantees we have wrt
> rip-relative addresses but it doesn't look good. Worst case, we'd have
> to do the fixup_pointer() thing.
> 
> In the meantime, Tom and I did some more poking at this and here's a
> diff ontop.
> 
> The direction being that we'll stick both the AMD and Intel
> *cc_platform_has() call into cc_platform.c for which instrumentation
> will be disabled so no issues with that.
> 
> And that will keep all that querying all together in a single file.

And still do cc_platform_has() calls in the __startup_64() codepath?

It's broken.

Intel detection in cc_platform_has() relies on boot_cpu_data.x86_vendor,
which is not initialized until early_cpu_init() in setup_arch(). Given
that X86_VENDOR_INTEL is 0, this leads to a false positive.

I think open-coding these two calls is the way forward. Maybe also move the
check from sme_encrypt_kernel() to __startup_64().
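
For illustration, a stand-alone sketch (not kernel code; the struct and
helper below are made up) of why the vendor check misfires this early:

#include <stdio.h>

#define X86_VENDOR_INTEL 0	/* same numeric value as in the kernel */

struct fake_cpuinfo { int x86_vendor; };
static struct fake_cpuinfo boot_cpu_data;	/* zero-filled, like BSS before early_cpu_init() */

int main(void)
{
	/* True on any vendor until x86_vendor is actually filled in. */
	printf("looks like Intel: %d\n",
	       boot_cpu_data.x86_vendor == X86_VENDOR_INTEL);
	return 0;
}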

-- 
 Kirill A. Shutemov


Re: [PATCH v3 5/8] x86/sme: Replace occurrences of sme_active() with cc_platform_has()

2021-09-22 Thread Kirill A. Shutemov
On Wed, Sep 22, 2021 at 09:52:07PM +0200, Borislav Petkov wrote:
> On Wed, Sep 22, 2021 at 05:30:15PM +0300, Kirill A. Shutemov wrote:
> > Not fine, but waiting to blow up with a random build environment change.
> 
> Why is it not fine?
> 
> Are you suspecting that the compiler might generate something else and
> not a rip-relative access?

Yes. We had it before for __supported_pte_mask and other users of
fixup_pointer().

See for instance 4a09f0210c8b ("x86/boot/64/clang: Use fixup_pointer() to
access '__supported_pte_mask'")

Unless we find other way to guarantee RIP-relative access, we must use
fixup_pointer() to access any global variables.
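
For context, the fixup_pointer() pattern looks roughly like this (sketched
from memory of arch/x86/kernel/head64.c; the clear_global() wrapper is made
up for illustration):

static void __head *fixup_pointer(void *ptr, unsigned long physaddr)
{
	/* Translate a link-time virtual address into the identity mapping
	 * that is live while __startup_64() runs. */
	return ptr - (void *)_text + (void *)physaddr;
}

static void __head clear_global(unsigned long physaddr)
{
	/* e.g. what 4a09f0210c8b does for __supported_pte_mask */
	unsigned long *mask = fixup_pointer(&__supported_pte_mask, physaddr);

	*mask &= ~_PAGE_GLOBAL;
}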

-- 
 Kirill A. Shutemov


Re: [PATCH v3 5/8] x86/sme: Replace occurrences of sme_active() with cc_platform_has()

2021-09-22 Thread Kirill A. Shutemov
On Wed, Sep 22, 2021 at 08:40:43AM -0500, Tom Lendacky wrote:
> On 9/21/21 4:58 PM, Kirill A. Shutemov wrote:
> > On Tue, Sep 21, 2021 at 04:43:59PM -0500, Tom Lendacky wrote:
> > > On 9/21/21 4:34 PM, Kirill A. Shutemov wrote:
> > > > On Tue, Sep 21, 2021 at 11:27:17PM +0200, Borislav Petkov wrote:
> > > > > On Wed, Sep 22, 2021 at 12:20:59AM +0300, Kirill A. Shutemov wrote:
> > > > > > I still believe calling cc_platform_has() from __startup_64() is 
> > > > > > totally
> > > > > > broken as it lacks proper wrapping while accessing global variables.
> > > > > 
> > > > > Well, one of the issues on the AMD side was using boot_cpu_data too
> > > > > early and the Intel side uses it too. Can you replace those checks 
> > > > > with
> > > > > is_tdx_guest() or whatever was the helper's name which would check
> > > > > whether the the kernel is running as a TDX guest, and see if that 
> > > > > helps?
> > > > 
> > > > There's no need for the Intel check this early. Only AMD needs it. Maybe just
> > > > open-code them?
> > > 
> > > Any way you can put a gzipped/bzipped copy of your vmlinux file somewhere 
> > > I
> > > can grab it from and take a look at it?
> > 
> > You can find broken vmlinux and bzImage here:
> > 
> > https://drive.google.com/drive/folders/1n74vUQHOGebnF70Im32qLFY8iS3wvjIs?usp=sharing
> > 
> > Let me know when I can remove it.
> 
> Looking at everything, it is all RIP relative addressing, so those
> accesses should be fine.

Not fine, but waiting to blow up with a random build environment change.

> Your image has the intel_cc_platform_has()
> function, does it work if you remove that call? Because I think it may be
> the early call into that function which looks like it has instrumentation
> that uses %gs in __sanitizer_cov_trace_pc and %gs is not setup properly
> yet. And since boot_cpu_data.x86_vendor will likely be zero this early it
> will match X86_VENDOR_INTEL and call into that function.

Right, removing the call to intel_cc_platform_has() or moving it to
cc_platform.c fixes the issue.

-- 
 Kirill A. Shutemov


Re: [PATCH v3 5/8] x86/sme: Replace occurrences of sme_active() with cc_platform_has()

2021-09-21 Thread Kirill A. Shutemov
On Tue, Sep 21, 2021 at 04:43:59PM -0500, Tom Lendacky wrote:
> On 9/21/21 4:34 PM, Kirill A. Shutemov wrote:
> > On Tue, Sep 21, 2021 at 11:27:17PM +0200, Borislav Petkov wrote:
> > > On Wed, Sep 22, 2021 at 12:20:59AM +0300, Kirill A. Shutemov wrote:
> > > > I still believe calling cc_platform_has() from __startup_64() is totally
> > > > broken as it lacks proper wrapping while accessing global variables.
> > > 
> > > Well, one of the issues on the AMD side was using boot_cpu_data too
> > > early and the Intel side uses it too. Can you replace those checks with
> > > is_tdx_guest() or whatever was the helper's name which would check
> > > whether the the kernel is running as a TDX guest, and see if that helps?
> > 
> > There's no need for the Intel check this early. Only AMD needs it. Maybe just
> > open-code them?
> 
> Any way you can put a gzipped/bzipped copy of your vmlinux file somewhere I
> can grab it from and take a look at it?

You can find broken vmlinux and bzImage here:

https://drive.google.com/drive/folders/1n74vUQHOGebnF70Im32qLFY8iS3wvjIs?usp=sharing

Let me know when I can remove it.

-- 
 Kirill A. Shutemov


Re: [PATCH v3 5/8] x86/sme: Replace occurrences of sme_active() with cc_platform_has()

2021-09-21 Thread Kirill A. Shutemov
On Tue, Sep 21, 2021 at 11:27:17PM +0200, Borislav Petkov wrote:
> On Wed, Sep 22, 2021 at 12:20:59AM +0300, Kirill A. Shutemov wrote:
> > I still believe calling cc_platform_has() from __startup_64() is totally
> > broken as it lacks proper wrapping while accessing global variables.
> 
> Well, one of the issues on the AMD side was using boot_cpu_data too
> early and the Intel side uses it too. Can you replace those checks with
> is_tdx_guest() or whatever was the helper's name which would check
> whether the the kernel is running as a TDX guest, and see if that helps?

There's no need for the Intel check this early. Only AMD needs it. Maybe just
open-code them?

-- 
 Kirill A. Shutemov


Re: [PATCH v3 5/8] x86/sme: Replace occurrences of sme_active() with cc_platform_has()

2021-09-21 Thread Kirill A. Shutemov
On Tue, Sep 21, 2021 at 07:47:15PM +0200, Borislav Petkov wrote:
> On Tue, Sep 21, 2021 at 12:04:58PM -0500, Tom Lendacky wrote:
> > Looks like instrumentation during early boot. I worked with Boris offline to
> > exclude arch/x86/kernel/cc_platform.c from some of the instrumentation and
> > that allowed an allyesconfig to boot.
> 
> And here's the lineup I have so far, I'd appreciate it if ppc and s390 folks
> could run it too:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/bp/bp.git/log/?h=rc2-cc

Still broken for me with allyesconfig.

gcc version 11.2.0 (Gentoo 11.2.0 p1)
GNU ld (Gentoo 2.37_p1 p0) 2.37

I still believe calling cc_platform_has() from __startup_64() is totally
broken as it lacks proper wrapping while accessing global variables.

I think sme_get_me_mask() has the same problem. It just happens to work
(until the next compiler update).

This hack makes the kernel boot again:

diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index f98c76a1d16c..e9110a44bf1b 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -285,7 +285,7 @@ unsigned long __head __startup_64(unsigned long physaddr,
 * there is no need to zero it after changing the memory encryption
 * attribute.
 */
-   if (cc_platform_has(CC_ATTR_MEM_ENCRYPT)) {
+   if (0 && cc_platform_has(CC_ATTR_MEM_ENCRYPT)) {
vaddr = (unsigned long)__start_bss_decrypted;
vaddr_end = (unsigned long)__end_bss_decrypted;
for (; vaddr < vaddr_end; vaddr += PMD_SIZE) {
diff --git a/arch/x86/mm/mem_encrypt_identity.c b/arch/x86/mm/mem_encrypt_identity.c
index eff4d19f9cb4..91638ed0b1db 100644
--- a/arch/x86/mm/mem_encrypt_identity.c
+++ b/arch/x86/mm/mem_encrypt_identity.c
@@ -288,7 +288,7 @@ void __init sme_encrypt_kernel(struct boot_params *bp)
unsigned long pgtable_area_len;
unsigned long decrypted_base;
 
-   if (!cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT))
+   if (1 || !cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT))
    return;
 
/*
-- 
 Kirill A. Shutemov


Re: [PATCH v3 5/8] x86/sme: Replace occurrences of sme_active() with cc_platform_has()

2021-09-20 Thread Kirill A. Shutemov
On Wed, Sep 08, 2021 at 05:58:36PM -0500, Tom Lendacky wrote:
> diff --git a/arch/x86/mm/mem_encrypt_identity.c 
> b/arch/x86/mm/mem_encrypt_identity.c
> index 470b20208430..eff4d19f9cb4 100644
> --- a/arch/x86/mm/mem_encrypt_identity.c
> +++ b/arch/x86/mm/mem_encrypt_identity.c
> @@ -30,6 +30,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include 
>  #include 
> @@ -287,7 +288,7 @@ void __init sme_encrypt_kernel(struct boot_params *bp)
>   unsigned long pgtable_area_len;
>   unsigned long decrypted_base;
>  
> - if (!sme_active())
> + if (!cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT))
>   return;
>  
>   /*

This change breaks boot for me (in KVM on an Intel host). It only reproduces
with allyesconfig. A more reasonable config works fine, but I didn't try to
find the exact cause in the config.

Conversion to cc_platform_has() in __startup_64() in 8/8 has the same
effect.

I believe it is caused by the sme_me_mask access from __startup_64() without
the fixup_pointer() magic. I think __startup_64() requires special treatment
and we should avoid cc_platform_has() there (or have a special version of
the helper). Note that only AMD requires these cc_platform_has() calls to
return true.

-- 
 Kirill A. Shutemov


Re: [PATCH 07/11] treewide: Replace the use of mem_encrypt_active() with prot_guest_has()

2021-08-12 Thread Kirill A. Shutemov
On Wed, Aug 11, 2021 at 10:52:55AM -0500, Tom Lendacky wrote:
> On 8/11/21 7:19 AM, Kirill A. Shutemov wrote:
> > On Tue, Aug 10, 2021 at 02:48:54PM -0500, Tom Lendacky wrote:
> >> On 8/10/21 1:45 PM, Kuppuswamy, Sathyanarayanan wrote:
> >>>
> >>>
> >>> On 7/27/21 3:26 PM, Tom Lendacky wrote:
> >>>> diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
> >>>> index de01903c3735..cafed6456d45 100644
> >>>> --- a/arch/x86/kernel/head64.c
> >>>> +++ b/arch/x86/kernel/head64.c
> >>>> @@ -19,7 +19,7 @@
> >>>>   #include 
> >>>>   #include 
> >>>>   #include 
> >>>> -#include 
> >>>> +#include 
> >>>>   #include 
> >>>>     #include 
> >>>> @@ -285,7 +285,7 @@ unsigned long __head __startup_64(unsigned long
> >>>> physaddr,
> >>>>    * there is no need to zero it after changing the memory encryption
> >>>>    * attribute.
> >>>>    */
> >>>> -    if (mem_encrypt_active()) {
> >>>> +    if (prot_guest_has(PATTR_MEM_ENCRYPT)) {
> >>>>   vaddr = (unsigned long)__start_bss_decrypted;
> >>>>   vaddr_end = (unsigned long)__end_bss_decrypted;
> >>>
> >>>
> >>> Since this change is specific to AMD, can you replace PATTR_MEM_ENCRYPT 
> >>> with
> >>> prot_guest_has(PATTR_SME) || prot_guest_has(PATTR_SEV). It is not used in
> >>> TDX.
> >>
> >> This is a direct replacement for now.
> > 
> > With the current implementation of prot_guest_has() for TDX it breaks boot
> > for me.
> > 
> > Looking at the code again, now I *think* the reason is accessing a global
> > variable from __startup_64() inside the TDX version of prot_guest_has().
> > 
> > __startup_64() is special. If you access any global variable you need to
> > use fixup_pointer(). See the comment before __startup_64().
> > 
> > I'm not sure how you get away with accessing sme_me_mask directly from
> > there. Any clues? Maybe it's just luck and the compiler generates code just
> > right for your case, I don't know.
> 
> Hmm... yeah, could be that the compiler is using rip-relative addressing
> for it because it lives in the .data section?

I guess. It has to be fixed. It may break with a compiler upgrade or any
random change around the code.

BTW, does it work with clang for you?

> For the static variables in mem_encrypt_identity.c I did an assembler rip
> relative LEA, but probably could have passed physaddr to sme_enable() and
> used a fixup_pointer() style function, instead.

Sounds like a plan.
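
The RIP-relative LEA Tom refers to looks roughly like this (pattern
reproduced from memory of arch/x86/mm/mem_encrypt_identity.c; the wrapper
name is made up):

static u64 *sme_me_mask_ptr(void)
{
	u64 *p;

	/* Force a PC-relative address so the pointer is correct even
	 * before relocations have been applied. */
	asm ("lea sme_me_mask(%%rip), %0"
	     : "=r" (p)
	     : "p" (&sme_me_mask));
	return p;
}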

> > A separate point is that the TDX version of prot_guest_has() relies on
> > cpu_feature_enabled(), which is not ready at this point.
> 
> Does TDX have to do anything special to make memory able to be shared with
> the hypervisor?

Yes. But there's nothing that requires any changes in early boot. It is
handled in ioremap/set_memory.

> You might have to use something that is available earlier
> than cpu_feature_enabled() in that case (should you eventually support
> kvmclock).

Maybe.

> > I think the __bss_decrypted fixup has to be done if sme_me_mask is non-zero.
> > Or just do it unconditionally because it's a NOP for sme_me_mask == 0.
> 
> For SNP, we'll have to additionally call the HV to update the RMP to make
> the memory shared. But that could also be done unconditionally since the
> early_snp_set_memory_shared() routine will check for SNP before doing
> anything.

-- 
 Kirill A. Shutemov


Re: [PATCH 07/11] treewide: Replace the use of mem_encrypt_active() with prot_guest_has()

2021-08-11 Thread Kirill A. Shutemov
On Tue, Aug 10, 2021 at 02:48:54PM -0500, Tom Lendacky wrote:
> On 8/10/21 1:45 PM, Kuppuswamy, Sathyanarayanan wrote:
> > 
> > 
> > On 7/27/21 3:26 PM, Tom Lendacky wrote:
> >> diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
> >> index de01903c3735..cafed6456d45 100644
> >> --- a/arch/x86/kernel/head64.c
> >> +++ b/arch/x86/kernel/head64.c
> >> @@ -19,7 +19,7 @@
> >>   #include 
> >>   #include 
> >>   #include 
> >> -#include 
> >> +#include 
> >>   #include 
> >>     #include 
> >> @@ -285,7 +285,7 @@ unsigned long __head __startup_64(unsigned long
> >> physaddr,
> >>    * there is no need to zero it after changing the memory encryption
> >>    * attribute.
> >>    */
> >> -    if (mem_encrypt_active()) {
> >> +    if (prot_guest_has(PATTR_MEM_ENCRYPT)) {
> >>   vaddr = (unsigned long)__start_bss_decrypted;
> >>   vaddr_end = (unsigned long)__end_bss_decrypted;
> > 
> > 
> > Since this change is specific to AMD, can you replace PATTR_MEM_ENCRYPT with
> > prot_guest_has(PATTR_SME) || prot_guest_has(PATTR_SEV). It is not used in
> > TDX.
> 
> This is a direct replacement for now.

With the current implementation of prot_guest_has() for TDX it breaks boot
for me.

Looking at the code again, now I *think* the reason is accessing a global
variable from __startup_64() inside the TDX version of prot_guest_has().

__startup_64() is special. If you access any global variable you need to
use fixup_pointer(). See the comment before __startup_64().

I'm not sure how you get away with accessing sme_me_mask directly from
there. Any clues? Maybe it's just luck and the compiler generates code just
right for your case, I don't know.

A separate point is that the TDX version of prot_guest_has() relies on
cpu_feature_enabled(), which is not ready at this point.

I think the __bss_decrypted fixup has to be done if sme_me_mask is non-zero.
Or just do it unconditionally because it's a NOP for sme_me_mask == 0.

> I think the change you're requesting
> should be done as part of the TDX support patches so it's clear why it is
> being changed.
> 
> But, wouldn't TDX still need to do something with this shared/unencrypted
> area, though? Or since it is shared, there's actually nothing you need to
> do (the bss decrpyted section exists even if CONFIG_AMD_MEM_ENCRYPT is not
> configured)?

AFAICS, only kvmclock uses __bss_decrypted. We don't enable kvmclock in
TDX at the moment. It may change in the future.

-- 
 Kirill A. Shutemov


Re: [PATCH v7 01/11] mm/mremap: Fix race between MOVE_PMD mremap and pageout

2021-06-08 Thread Kirill A. Shutemov
On Tue, Jun 08, 2021 at 04:47:19PM +0530, Aneesh Kumar K.V wrote:
> On 6/8/21 3:12 PM, Kirill A. Shutemov wrote:
> > On Tue, Jun 08, 2021 at 01:22:23PM +0530, Aneesh Kumar K.V wrote:
> > > 
> > > Hi Hugh,
> > > 
> > > Hugh Dickins  writes:
> > > 
> > > > On Mon, 7 Jun 2021, Aneesh Kumar K.V wrote:
> > > > 
> > > > > CPU 1                          CPU 2                          CPU 3
> > > > > 
> > > > > mremap(old_addr, new_addr)     page_shrinker/try_to_unmap_one
> > > > > 
> > > > > mmap_write_lock_killable()
> > > > > 
> > > > >                                addr = old_addr
> > > > >                                lock(pte_ptl)
> > > > > lock(pmd_ptl)
> > > > > pmd = *old_pmd
> > > > > pmd_clear(old_pmd)
> > > > > flush_tlb_range(old_addr)
> > > > > 
> > > > > *new_pmd = pmd
> > > > >                                                               *new_addr = 10; and fills
> > > > >                                                               TLB with new addr
> > > > >                                                               and old pfn
> > > > > 
> > > > > unlock(pmd_ptl)
> > > > >                                ptep_clear_flush()
> > > > >                                old pfn is free.
> > > > >                                                               Stale TLB entry
> > > > > 
> > > > > Fix this race by holding pmd lock in pageout. This still doesn't handle the race
> > > > > between MOVE_PUD and pageout.
> > > > > 
> > > > > Fixes: 2c91bd4a4e2e ("mm: speed up mremap by 20x on large regions")
> > > > > Link: 
> > > > > https://lore.kernel.org/linux-mm/CAHk-=wgxvr04ebntxqfevontwnp6fdm+oj5vauqxp3s-huw...@mail.gmail.com
> > > > > Signed-off-by: Aneesh Kumar K.V 
> > > > 
> > > > This seems very wrong to me, to require another level of locking in the
> > > > rmap lookup, just to fix some new pagetable games in mremap.
> > > > 
> > > > But Linus asked "Am I missing something?": neither of you have mentioned
> > > > mremap's take_rmap_locks(), so I hope that already meets your need.  And
> > > > if it needs to be called more often than before (see "need_rmap_locks"),
> > > > that's probably okay.
> > > > 
> > > > Hugh
> > > > 
> > > 
> > > Thanks for reviewing the change. I missed the rmap lock in the code
> > > path. How about the below change?
> > > 
> > >  mm/mremap: hold the rmap lock in write mode when moving page table entries.
> > > 
> > >  To avoid a race between rmap walk and mremap, mremap does take_rmap_locks().
> > >  The lock was taken to ensure that rmap walk don't miss a page table entry due to
> > >  PTE moves via move_pagetables(). The kernel does further optimization of
> > >  this lock such that if we are going to find the newly added vma after the
> > >  old vma, the rmap lock is not taken. This is because rmap walk would find the
> > >  vmas in the same order and if we don't find the page table attached to
> > >  older vma we would find it with the new vma which we would iterate later.
> > >  The actual lifetime of the page is still controlled by the PTE lock.
> > >  This patch updates the locking requirement to handle another race condition
> > >  explained below with optimized mremap::
> > > 
> > >  Optimized PMD move
> > > 
> > >  CPU 1                          CPU 2                          CPU 3
> > > 
> > >  mremap(old_addr, new_addr)     page_shrinker/try_to_unmap_one
> > > 
> > >  mmap_write_lock_killable()
> > > 
> > >                                 addr = old_addr
> > >                                 lock(pte_ptl)
> > >  lock(pmd_ptl)
> > >  pmd = *old_pmd
> > >  pmd_clear(old_pmd)
> > >  flush_tlb_range(old_addr)
> >

Re: [PATCH v7 01/11] mm/mremap: Fix race between MOVE_PMD mremap and pageout

2021-06-08 Thread Kirill A. Shutemov
>                                lock(pte_ptl)
> lock(pud_ptl)
> pud = *old_pud
> pud_clear(old_pud)
> flush_tlb_range(old_addr)
> 
> *new_pud = pud
>                                                               *new_addr = 10; and fills
>                                                               TLB with new addr
>                                                               and old pfn
> 
> unlock(pud_ptl)
>                                ptep_clear_flush()
>                                old pfn is free.
>                                                               Stale TLB entry
> 
> Both the above race conditions can be fixed if we force mremap path to take rmap lock.
> 
> Signed-off-by: Aneesh Kumar K.V 

Looks like it should be enough to address the race.

It would be nice to understand what the performance overhead of the
additional locking is. Is it still faster to move a single PMD page table
under these locks compared to moving PTE page table entries without the locks?
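
For reference, the locking pattern being added is roughly the following
(simplified; take_rmap_locks()/drop_rmap_locks() and move_normal_pmd() as in
mm/mremap.c, details from memory):

	/*
	 * Hold the rmap locks across the whole PMD move so a parallel rmap
	 * walk cannot observe the window where the page table is detached
	 * from the old address but not yet visible at the new one.
	 */
	take_rmap_locks(vma);
	moved = move_normal_pmd(vma, old_addr, new_addr, old_pmd, new_pmd);
	drop_rmap_locks(vma);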

-- 
 Kirill A. Shutemov


Re: [PATCH] Raise the minimum GCC version to 5.2

2021-05-03 Thread Kirill A. Shutemov
On Sun, May 02, 2021 at 02:08:31PM -0700, Linus Torvalds wrote:
> Last year, Arnd and Kirill (maybe others were involved too) made a
> list of distros and older gcc versions. But I don't think anybody
> actually _maintains_ such a list.

Distrowatch does. I used it for checking. But you need to check it per
distro. For Debian it would be here:

https://distrowatch.com/table.php?distribution=debian

-- 
 Kirill A. Shutemov


Re: [PATCH 0/5] perf/mm: Fix PERF_SAMPLE_*_PAGE_SIZE

2020-11-16 Thread Kirill A. Shutemov
On Fri, Nov 13, 2020 at 12:19:01PM +0100, Peter Zijlstra wrote:
> Hi,
> 
> These patches provide generic infrastructure to determine TLB page size from
> page table entries alone. Perf will use this (for either data or code address)
> to aid in profiling TLB issues.

I'm not sure it's an issue, but strictly speaking, the size of a page
according to the page table tree doesn't mean the page walk would fill a TLB
entry of that size. A CPU may support 1G pages in the page table tree without
having any 1G TLB entries at all.

IIRC, current Intel CPUs still don't have any 1G iTLB entries and fill 2M
iTLB entries instead.

-- 
 Kirill A. Shutemov


Re: [PATCH v3 2/4] PM: hibernate: make direct map manipulations more explicit

2020-11-03 Thread Kirill A. Shutemov
On Tue, Nov 03, 2020 at 02:13:50PM +0200, Mike Rapoport wrote:
> On Tue, Nov 03, 2020 at 02:08:16PM +0300, Kirill A. Shutemov wrote:
> > On Sun, Nov 01, 2020 at 07:08:13PM +0200, Mike Rapoport wrote:
> > > diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c
> > > index 46b1804c1ddf..054c8cce4236 100644
> > > --- a/kernel/power/snapshot.c
> > > +++ b/kernel/power/snapshot.c
> > > @@ -76,6 +76,32 @@ static inline void hibernate_restore_protect_page(void *page_address) {}
> > >  static inline void hibernate_restore_unprotect_page(void *page_address) {}
> > >  #endif /* CONFIG_STRICT_KERNEL_RWX  && CONFIG_ARCH_HAS_SET_MEMORY */
> > >  
> > > +static inline void hibernate_map_page(struct page *page, int enable)
> > > +{
> > > + if (IS_ENABLED(CONFIG_ARCH_HAS_SET_DIRECT_MAP)) {
> > > + unsigned long addr = (unsigned long)page_address(page);
> > > + int ret;
> > > +
> > > + /*
> > > +  * This should not fail because remapping a page here means
> > > +  * that we only update protection bits in an existing PTE.
> > > +  * It is still worth to have WARN_ON() here if something
> > > +  * changes and this will no longer be the case.
> > > +  */
> > > + if (enable)
> > > + ret = set_direct_map_default_noflush(page);
> > > + else
> > > + ret = set_direct_map_invalid_noflush(page);
> > > +
> > > + if (WARN_ON(ret))
> > 
> > _ONCE?
> 
> I've changed it to pr_warn() after David said people enable panic on
> warn in production kernels.

pr_warn_once()? :P

-- 
 Kirill A. Shutemov


Re: [PATCH v3 0/4] arch, mm: improve robustness of direct map manipulation

2020-11-03 Thread Kirill A. Shutemov
On Sun, Nov 01, 2020 at 07:08:11PM +0200, Mike Rapoport wrote:
> Mike Rapoport (4):
>   mm: introduce debug_pagealloc_map_pages() helper
>   PM: hibernate: make direct map manipulations more explicit
>   arch, mm: restore dependency of __kernel_map_pages() of DEBUG_PAGEALLOC
>   arch, mm: make kernel_page_present() always available

The series looks good to me (apart from the minor nit):

Acked-by: Kirill A. Shutemov 

-- 
 Kirill A. Shutemov


Re: [PATCH v3 2/4] PM: hibernate: make direct map manipulations more explicit

2020-11-03 Thread Kirill A. Shutemov
On Sun, Nov 01, 2020 at 07:08:13PM +0200, Mike Rapoport wrote:
> diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c
> index 46b1804c1ddf..054c8cce4236 100644
> --- a/kernel/power/snapshot.c
> +++ b/kernel/power/snapshot.c
> @@ -76,6 +76,32 @@ static inline void hibernate_restore_protect_page(void *page_address) {}
>  static inline void hibernate_restore_unprotect_page(void *page_address) {}
>  #endif /* CONFIG_STRICT_KERNEL_RWX  && CONFIG_ARCH_HAS_SET_MEMORY */
>  
> +static inline void hibernate_map_page(struct page *page, int enable)
> +{
> + if (IS_ENABLED(CONFIG_ARCH_HAS_SET_DIRECT_MAP)) {
> + unsigned long addr = (unsigned long)page_address(page);
> + int ret;
> +
> + /*
> +  * This should not fail because remapping a page here means
> +  * that we only update protection bits in an existing PTE.
> +  * It is still worth to have WARN_ON() here if something
> +  * changes and this will no longer be the case.
> +  */
> + if (enable)
> + ret = set_direct_map_default_noflush(page);
> + else
> + ret = set_direct_map_invalid_noflush(page);
> +
> + if (WARN_ON(ret))

_ONCE?
> + return;
> +
> + flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
> + } else {
> + debug_pagealloc_map_pages(page, 1, enable);
> + }
> +}
> +
>  static int swsusp_page_is_free(struct page *);
>  static void swsusp_set_page_forbidden(struct page *);
>  static void swsusp_unset_page_forbidden(struct page *);

-- 
 Kirill A. Shutemov


Re: [PATCH v11 01/25] mm/gup: factor out duplicate code from four routines

2019-12-18 Thread Kirill A. Shutemov
On Wed, Dec 18, 2019 at 02:15:53PM -0800, John Hubbard wrote:
> On 12/18/19 7:52 AM, Kirill A. Shutemov wrote:
> > On Mon, Dec 16, 2019 at 02:25:13PM -0800, John Hubbard wrote:
> > > +static void put_compound_head(struct page *page, int refs)
> > > +{
> > > + /* Do a get_page() first, in case refs == page->_refcount */
> > > + get_page(page);
> > > + page_ref_sub(page, refs);
> > > + put_page(page);
> > > +}
> > 
> > It's not terribly efficient. Maybe something like:
> > 
> > VM_BUG_ON_PAGE(page_ref_count(page) < ref, page);
> > if (refs > 2)
> > page_ref_sub(page, refs - 1);
> > put_page(page);
> > 
> > ?
> 
> OK, but how about this instead? I don't see the need for a "2", as that
> is a magic number that requires explanation. Whereas "1" is not a magic
> number--here it means: either there are "many" (>1) refs, or not.

Yeah, it's my thinko. Sure, it has to be '1' (or >= 2, which is less readable).

> And the routine won't be called with refs less than about 32 (2MB huge
> page, 64KB base page == 32 subpages) anyway.

It's hard to make predictions about the future :P

>   VM_BUG_ON_PAGE(page_ref_count(page) < refs, page);
>   /*
>* Calling put_page() for each ref is unnecessarily slow. Only the last
>* ref needs a put_page().
>*/
>   if (refs > 1)
>   page_ref_sub(page, refs - 1);
>   put_page(page);

Looks good to me.
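
Putting the snippets above together, the agreed-on helper would read
(sketch):

static void put_compound_head(struct page *page, int refs)
{
	VM_BUG_ON_PAGE(page_ref_count(page) < refs, page);
	/*
	 * Calling put_page() for each ref is unnecessarily slow. Only the
	 * last ref needs a put_page().
	 */
	if (refs > 1)
		page_ref_sub(page, refs - 1);
	put_page(page);
}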

-- 
 Kirill A. Shutemov


Re: [PATCH v11 06/25] mm: fix get_user_pages_remote()'s handling of FOLL_LONGTERM

2019-12-18 Thread Kirill A. Shutemov
On Mon, Dec 16, 2019 at 02:25:18PM -0800, John Hubbard wrote:
> As it says in the updated comment in gup.c: current FOLL_LONGTERM
> behavior is incompatible with FAULT_FLAG_ALLOW_RETRY because of the
> FS DAX check requirement on vmas.
> 
> However, the corresponding restriction in get_user_pages_remote() was
> slightly stricter than is actually required: it forbade all
> FOLL_LONGTERM callers, but we can actually allow FOLL_LONGTERM callers
> that do not set the "locked" arg.
> 
> Update the code and comments to loosen the restriction, allowing
> FOLL_LONGTERM in some cases.
> 
> Also, copy the DAX check ("if a VMA is DAX, don't allow long term
> pinning") from the VFIO call site, all the way into the internals
> of get_user_pages_remote() and __gup_longterm_locked(). That is:
> get_user_pages_remote() calls __gup_longterm_locked(), which in turn
> calls check_dax_vmas(). This check will then be removed from the VFIO
> call site in a subsequent patch.
> 
> Thanks to Jason Gunthorpe for pointing out a clean way to fix this,
> and to Dan Williams for helping clarify the DAX refactoring.
> 
> Tested-by: Alex Williamson 
> Acked-by: Alex Williamson 
> Reviewed-by: Jason Gunthorpe 
> Reviewed-by: Ira Weiny 
> Suggested-by: Jason Gunthorpe 
> Cc: Dan Williams 
> Cc: Jerome Glisse 
> Signed-off-by: John Hubbard 
> ---
>  mm/gup.c | 27 ++-
>  1 file changed, 22 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/gup.c b/mm/gup.c
> index 3ecce297a47f..c0c56888e7cc 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -29,6 +29,13 @@ struct follow_page_context {
>   unsigned int page_mask;
>  };
>  
> +static __always_inline long __gup_longterm_locked(struct task_struct *tsk,
> +   struct mm_struct *mm,
> +   unsigned long start,
> +   unsigned long nr_pages,
> +   struct page **pages,
> +   struct vm_area_struct **vmas,
> +   unsigned int flags);

Any particular reason for the forward declaration? Maybe move
get_user_pages_remote() down?

>  /*
>   * Return the compound head page with ref appropriately incremented,
>   * or NULL if that failed.
> @@ -1179,13 +1186,23 @@ long get_user_pages_remote(struct task_struct *tsk, 
> struct mm_struct *mm,
>   struct vm_area_struct **vmas, int *locked)
>  {
>   /*
> -  * FIXME: Current FOLL_LONGTERM behavior is incompatible with
> +  * Parts of FOLL_LONGTERM behavior are incompatible with
>* FAULT_FLAG_ALLOW_RETRY because of the FS DAX check requirement on
> -  * vmas.  As there are no users of this flag in this call we simply
> -  * disallow this option for now.
> +  * vmas. However, this only comes up if locked is set, and there are
> +  * callers that do request FOLL_LONGTERM, but do not set locked. So,
> +  * allow what we can.
>*/
> - if (WARN_ON_ONCE(gup_flags & FOLL_LONGTERM))
> - return -EINVAL;
> + if (gup_flags & FOLL_LONGTERM) {
> + if (WARN_ON_ONCE(locked))
> + return -EINVAL;
> + /*
> +  * This will check the vmas (even if our vmas arg is NULL)
> +  * and return -ENOTSUPP if DAX isn't allowed in this case:
> +  */
> + return __gup_longterm_locked(tsk, mm, start, nr_pages, pages,
> +  vmas, gup_flags | FOLL_TOUCH |
> +  FOLL_REMOTE);
> + }
>  
>   return __get_user_pages_locked(tsk, mm, start, nr_pages, pages, vmas,
>  locked,
> -- 
> 2.24.1
> 

-- 
 Kirill A. Shutemov


Re: [PATCH v11 04/25] mm: devmap: refactor 1-based refcounting for ZONE_DEVICE pages

2019-12-18 Thread Kirill A. Shutemov
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -1102,3 +1102,27 @@ void __init swap_setup(void)
>* _really_ don't want to cluster much more
>*/
>  }
> +
> +#ifdef CONFIG_DEV_PAGEMAP_OPS
> +bool put_devmap_managed_page(struct page *page)
> +{
> + bool is_devmap = page_is_devmap_managed(page);
> +
> + if (is_devmap) {

Reversing the condition would save you an indentation level (see the sketch
below the quoted hunk).

> + int count = page_ref_dec_return(page);
> +
> + /*
> +  * devmap page refcounts are 1-based, rather than 0-based: if
> +  * refcount is 1, then the page is free and the refcount is
> +  * stable because nobody holds a reference on the page.
> +  */
> + if (count == 1)
> + free_devmap_managed_page(page);
> + else if (!count)
> + __put_page(page);
> + }
> +
> + return is_devmap;
> +}
> +EXPORT_SYMBOL(put_devmap_managed_page);
> +#endif
> -- 
> 2.24.1
> 
> 
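
Reversed as suggested above, the function would look roughly like this
(sketch, same logic as the quoted hunk):

#ifdef CONFIG_DEV_PAGEMAP_OPS
bool put_devmap_managed_page(struct page *page)
{
	int count;

	if (!page_is_devmap_managed(page))
		return false;

	count = page_ref_dec_return(page);
	/*
	 * devmap page refcounts are 1-based, rather than 0-based: if
	 * refcount is 1, then the page is free and the refcount is
	 * stable because nobody holds a reference on the page.
	 */
	if (count == 1)
		free_devmap_managed_page(page);
	else if (!count)
		__put_page(page);

	return true;
}
EXPORT_SYMBOL(put_devmap_managed_page);
#endif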

-- 
 Kirill A. Shutemov


Re: [PATCH v11 01/25] mm/gup: factor out duplicate code from four routines

2019-12-18 Thread Kirill A. Shutemov
On Mon, Dec 16, 2019 at 02:25:13PM -0800, John Hubbard wrote:
> +static void put_compound_head(struct page *page, int refs)
> +{
> + /* Do a get_page() first, in case refs == page->_refcount */
> + get_page(page);
> + page_ref_sub(page, refs);
> + put_page(page);
> +}

It's not terribly efficient. Maybe something like:

VM_BUG_ON_PAGE(page_ref_count(page) < ref, page);
if (refs > 2)
page_ref_sub(page, refs - 1);
    put_page(page);

?

-- 
 Kirill A. Shutemov


Re: [PATCH v5 01/11] asm-generic/pgtable: Adds generic functions to monitor lockless pgtable walks

2019-10-08 Thread Kirill A. Shutemov
On Sat, Oct 05, 2019 at 02:05:29PM +0530, Aneesh Kumar K.V wrote:
> MADV_DONTNEED has caused us multiple issues due to the fact that it can run
> in parallel to page fault. I am not sure whether we have a known/noticeable
> performance gain in allowing that with mmap_sem held in read mode.

Yes, we do. MADV_DONTNEED is used a lot by userspace memory allocators and
it would be a very noticeable performance regression if we switched it to
down_write(mmap_sem).
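
For illustration, a minimal userspace sketch of the pattern in question (an
allocator handing a free run of pages back while other threads keep faulting
pages in; the helper name is made up):

#include <stddef.h>
#include <sys/mman.h>

static void allocator_release_run(void *start, size_t len)
{
	/* The range stays mapped; the pages are dropped and will be
	 * zero-fill faulted on the next touch. */
	madvise(start, len, MADV_DONTNEED);
}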

-- 
 Kirill A. Shutemov


Re: [PATCH V4 2/2] mm/pgtable/debug: Add test validating architecture page table helpers

2019-10-07 Thread Kirill A. Shutemov
On Mon, Oct 07, 2019 at 03:51:58PM +0200, Ingo Molnar wrote:
> 
> * Kirill A. Shutemov  wrote:
> 
> > On Mon, Oct 07, 2019 at 03:06:17PM +0200, Ingo Molnar wrote:
> > > 
> > > * Anshuman Khandual  wrote:
> > > 
> > > > This adds a test module which will validate architecture page table 
> > > > helpers
> > > > and accessors regarding compliance with generic MM semantics 
> > > > expectations.
> > > > This will help various architectures in validating changes to the 
> > > > existing
> > > > page table helpers or addition of new ones.
> > > > 
> > > > Test page table and memory pages creating it's entries at various level 
> > > > are
> > > > all allocated from system memory with required alignments. If memory 
> > > > pages
> > > > with required size and alignment could not be allocated, then all 
> > > > depending
> > > > individual tests are skipped.
> > > 
> > > > diff --git a/arch/x86/include/asm/pgtable_64_types.h 
> > > > b/arch/x86/include/asm/pgtable_64_types.h
> > > > index 52e5f5f2240d..b882792a3999 100644
> > > > --- a/arch/x86/include/asm/pgtable_64_types.h
> > > > +++ b/arch/x86/include/asm/pgtable_64_types.h
> > > > @@ -40,6 +40,8 @@ static inline bool pgtable_l5_enabled(void)
> > > >  #define pgtable_l5_enabled() 0
> > > >  #endif /* CONFIG_X86_5LEVEL */
> > > >  
> > > > +#define mm_p4d_folded(mm) (!pgtable_l5_enabled())
> > > > +
> > > >  extern unsigned int pgdir_shift;
> > > >  extern unsigned int ptrs_per_p4d;
> > > 
> > > Any deep reason this has to be a macro instead of proper C?
> > 
> > It's a way to override the generic mm_p4d_folded(). It can be rewritten
> > as inline function + define. Something like:
> > 
> > #define mm_p4d_folded mm_p4d_folded
> > static inline bool mm_p4d_folded(struct mm_struct *mm)
> > {
> > return !pgtable_l5_enabled();
> > }
> > 
> > But I don't see much reason to be more verbose here than needed.
> 
> C type checking? Documentation? Yeah, I know it's just a one-liner, but 
> the principle of the death by a thousand cuts applies here.

Okay, if you think it's worth it. Anshuman, could you fix it up for the next
submission?


> BTW., any reason this must be in the low level pgtable_64_types.h type 
> header, instead of one of the API level header files?

I defined it next to pgtable_l5_enabled(). What would be a more appropriate
place in your view? pgtable_64.h? Yeah, that makes sense.

-- 
 Kirill A. Shutemov


Re: [PATCH V4 2/2] mm/pgtable/debug: Add test validating architecture page table helpers

2019-10-07 Thread Kirill A. Shutemov
On Mon, Oct 07, 2019 at 03:06:17PM +0200, Ingo Molnar wrote:
> 
> * Anshuman Khandual  wrote:
> 
> > This adds a test module which will validate architecture page table helpers
> > and accessors regarding compliance with generic MM semantics expectations.
> > This will help various architectures in validating changes to the existing
> > page table helpers or addition of new ones.
> > 
> > Test page table and memory pages creating it's entries at various level are
> > all allocated from system memory with required alignments. If memory pages
> > with required size and alignment could not be allocated, then all depending
> > individual tests are skipped.
> 
> > diff --git a/arch/x86/include/asm/pgtable_64_types.h 
> > b/arch/x86/include/asm/pgtable_64_types.h
> > index 52e5f5f2240d..b882792a3999 100644
> > --- a/arch/x86/include/asm/pgtable_64_types.h
> > +++ b/arch/x86/include/asm/pgtable_64_types.h
> > @@ -40,6 +40,8 @@ static inline bool pgtable_l5_enabled(void)
> >  #define pgtable_l5_enabled() 0
> >  #endif /* CONFIG_X86_5LEVEL */
> >  
> > +#define mm_p4d_folded(mm) (!pgtable_l5_enabled())
> > +
> >  extern unsigned int pgdir_shift;
> >  extern unsigned int ptrs_per_p4d;
> 
> Any deep reason this has to be a macro instead of proper C?

It's a way to override the generic mm_p4d_folded(). It can be rewritten
as inline function + define. Something like:

#define mm_p4d_folded mm_p4d_folded
static inline bool mm_p4d_folded(struct mm_struct *mm)
{
return !pgtable_l5_enabled();
}

But I don't see much reason to be more verbose here than needed.

-- 
 Kirill A. Shutemov


Re: [PATCH v4 03/11] mm/gup: Applies counting method to monitor gup_pgd_range

2019-09-30 Thread Kirill A. Shutemov
On Fri, Sep 27, 2019 at 08:40:00PM -0300, Leonardo Bras wrote:
> As decribed, gup_pgd_range is a lockless pagetable walk. So, in order to
 ^ typo

-- 
 Kirill A. Shutemov


Re: [PATCH V3 0/2] mm/debug: Add tests for architecture exported page table helpers

2019-09-24 Thread Kirill A. Shutemov
On Fri, Sep 20, 2019 at 12:03:21PM +0530, Anshuman Khandual wrote:
> This series adds a test validation for architecture exported page table
> helpers. Patch in the series adds basic transformation tests at various
> levels of the page table. Before that it exports gigantic page allocation
> function from HugeTLB.
> 
> This test was originally suggested by Catalin during arm64 THP migration
> RFC discussion earlier. Going forward it can include more specific tests
> with respect to various generic MM functions like THP, HugeTLB etc and
> platform specific tests.
> 
> https://lore.kernel.org/linux-mm/20190628102003.ga56...@arrakis.emea.arm.com/
> 
> Testing:
> 
> Successfully build and boot tested on both arm64 and x86 platforms without
> any test failing. Only build tested on some other platforms. Build failed
> on some platforms (known) in pud_clear_tests() as there were no available
> __pgd() definitions.
> 
> - ARM32
> - IA64

Hm. Grep shows __pgd() definitions for both of them. Is it for a specific
config?


-- 
 Kirill A. Shutemov


Re: [PATCH V2 2/2] mm/pgtable/debug: Add test validating architecture page table helpers

2019-09-13 Thread Kirill A. Shutemov
On Fri, Sep 13, 2019 at 02:32:04PM +0530, Anshuman Khandual wrote:
> 
> On 09/12/2019 10:44 PM, Christophe Leroy wrote:
> > 
> > 
> > Le 12/09/2019 à 08:02, Anshuman Khandual a écrit :
> >> This adds a test module which will validate architecture page table helpers
> >> and accessors regarding compliance with generic MM semantics expectations.
> >> This will help various architectures in validating changes to the existing
> >> page table helpers or addition of new ones.
> >>
> >> Test page table and memory pages creating it's entries at various level are
> >> all allocated from system memory with required alignments. If memory pages
> >> with required size and alignment could not be allocated, then all depending
> >> individual tests are skipped.
> >>
> > 
> > [...]
> > 
> >>
> >> Suggested-by: Catalin Marinas 
> >> Signed-off-by: Anshuman Khandual 
> >> ---
> >>   arch/x86/include/asm/pgtable_64_types.h |   2 +
> >>   mm/Kconfig.debug    |  14 +
> >>   mm/Makefile |   1 +
> >>   mm/arch_pgtable_test.c  | 429 
> >>   4 files changed, 446 insertions(+)
> >>   create mode 100644 mm/arch_pgtable_test.c
> >>
> >> diff --git a/arch/x86/include/asm/pgtable_64_types.h 
> >> b/arch/x86/include/asm/pgtable_64_types.h
> >> index 52e5f5f2240d..b882792a3999 100644
> >> --- a/arch/x86/include/asm/pgtable_64_types.h
> >> +++ b/arch/x86/include/asm/pgtable_64_types.h
> >> @@ -40,6 +40,8 @@ static inline bool pgtable_l5_enabled(void)
> >>   #define pgtable_l5_enabled() 0
> >>   #endif /* CONFIG_X86_5LEVEL */
> >>   +#define mm_p4d_folded(mm) (!pgtable_l5_enabled())
> >> +
> > 
> > This is specific to x86, should go in a separate patch.
> 
> Thought about it but its just a single line. Kirill suggested this in the
> previous version. There is a generic fallback definition but s390 has it's
> own. This change overrides the generic one for x86 probably as a fix or as
> an improvement. Kirill should be able to help classify it in which case it
> can be a separate patch.

I don't think it's worth a separate patch.

-- 
 Kirill A. Shutemov


Re: [PATCH] mm/pgtable/debug: Fix test validating architecture page table helpers

2019-09-13 Thread Kirill A. Shutemov
On Fri, Sep 13, 2019 at 02:12:45PM +0530, Anshuman Khandual wrote:
> 
> 
> On 09/13/2019 12:41 PM, Christophe Leroy wrote:
> > 
> > 
> > Le 13/09/2019 à 09:03, Christophe Leroy a écrit :
> >>
> >>
> >> Le 13/09/2019 à 08:58, Anshuman Khandual a écrit :
> >>> On 09/13/2019 11:53 AM, Christophe Leroy wrote:
> >>>> Fix build failure on powerpc.
> >>>>
> >>>> Fix preemption imbalance.
> >>>>
> >>>> Signed-off-by: Christophe Leroy 
> >>>> ---
> >>>>   mm/arch_pgtable_test.c | 3 +++
> >>>>   1 file changed, 3 insertions(+)
> >>>>
> >>>> diff --git a/mm/arch_pgtable_test.c b/mm/arch_pgtable_test.c
> >>>> index 8b4a92756ad8..f2b3c9ec35fa 100644
> >>>> --- a/mm/arch_pgtable_test.c
> >>>> +++ b/mm/arch_pgtable_test.c
> >>>> @@ -24,6 +24,7 @@
> >>>>   #include 
> >>>>   #include 
> >>>>   #include 
> >>>> +#include 
> >>>
> >>> This is okay.
> >>>
> >>>>   #include 
> >>>>   #include 
> >>>> @@ -400,6 +401,8 @@ static int __init arch_pgtable_tests_init(void)
> >>>>   p4d_clear_tests(p4dp);
> >>>>   pgd_clear_tests(mm, pgdp);
> >>>> +    pte_unmap(ptep);
> >>>> +
> >>>
> >>> Now the preemption imbalance via pte_alloc_map() path i.e
> >>>
> >>> pte_alloc_map() -> pte_offset_map() -> kmap_atomic()
> >>>
> >>> Is not this very much powerpc 32 specific or this will be applicable
> >>> for all platform which uses kmap_XXX() to map high memory ?
> >>>
> >>
> >> See 
> >> https://elixir.bootlin.com/linux/v5.3-rc8/source/include/linux/highmem.h#L91
> >>
> >> I think it applies at least to all arches using the generic implementation.
> >>
> >> Applies also to arm:
> >> https://elixir.bootlin.com/linux/v5.3-rc8/source/arch/arm/mm/highmem.c#L52
> >>
> >> Applies also to mips:
> >> https://elixir.bootlin.com/linux/v5.3-rc8/source/arch/mips/mm/highmem.c#L47
> >>
> >> Same on sparc:
> >> https://elixir.bootlin.com/linux/v5.3-rc8/source/arch/sparc/mm/highmem.c#L52
> >>
> >> Same on x86:
> >> https://elixir.bootlin.com/linux/v5.3-rc8/source/arch/x86/mm/highmem_32.c#L34
> >>
> >> I have not checked others, but I guess it is like that for all.
> >>
> > 
> > 
> > Seems like I answered too quickly. All kmap_atomic() do preempt_disable(), 
> > but not all pte_alloc_map() call kmap_atomic().
> > 
> > However, for instance ARM does:
> > 
> > https://elixir.bootlin.com/linux/v5.3-rc8/source/arch/arm/include/asm/pgtable.h#L200
> > 
> > And X86 as well:
> > 
> > https://elixir.bootlin.com/linux/v5.3-rc8/source/arch/x86/include/asm/pgtable_32.h#L51
> > 
> > Microblaze also:
> > 
> > https://elixir.bootlin.com/linux/v5.3-rc8/source/arch/microblaze/include/asm/pgtable.h#L495
> 
> All the above platforms checks out to be using k[un]map_atomic(). I am 
> wondering whether
> any of the intermediate levels will have similar problems on any these 32 bit 
> platforms
> or any other platforms which might be using generic k[un]map_atomic().

No. The kernel only allocates pte page tables from highmem. All other page
tables are always visible in the kernel address space.
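
A sketch of the pairing this implies for the test (illustrative snippet, not
the posted patch):

	/* On 32-bit configs the pte page may be in highmem, so this may go
	 * through kmap_atomic() and disable preemption... */
	ptep = pte_offset_map(pmdp, vaddr);

	/* ... run the pte-level tests ... */

	/* ... which is why every mapping must be balanced before returning. */
	pte_unmap(ptep);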

-- 
 Kirill A. Shutemov


Re: [PATCH V2 2/2] mm/pgtable/debug: Add test validating architecture page table helpers

2019-09-12 Thread Kirill A. Shutemov
On Thu, Sep 12, 2019 at 11:32:53AM +0530, Anshuman Khandual wrote:
> +MODULE_LICENSE("GPL v2");
> +MODULE_AUTHOR("Anshuman Khandual ");
> +MODULE_DESCRIPTION("Test architecture page table helpers");

It's not a module. Why?

BTW, I think we should make all the code here __init (or its variants) so it
can be discarded after boot. It has no use after that.

-- 
 Kirill A. Shutemov


Re: [PATCH 1/1] mm/pgtable/debug: Add test validating architecture page table helpers

2019-09-09 Thread Kirill A. Shutemov
On Mon, Sep 09, 2019 at 11:56:50AM +0530, Anshuman Khandual wrote:
> 
> 
> On 09/07/2019 12:33 AM, Gerald Schaefer wrote:
> > On Fri, 6 Sep 2019 11:58:59 +0530
> > Anshuman Khandual  wrote:
> > 
> >> On 09/05/2019 10:36 PM, Gerald Schaefer wrote:
> >>> On Thu, 5 Sep 2019 14:48:14 +0530
> >>> Anshuman Khandual  wrote:
> >>>   
> >>>>> [...]
> >>>>>> +
> >>>>>> +#if !defined(__PAGETABLE_PMD_FOLDED) && 
> >>>>>> !defined(__ARCH_HAS_4LEVEL_HACK)
> >>>>>> +static void pud_clear_tests(pud_t *pudp)
> >>>>>> +{
> >>>>>> +  memset(pudp, RANDOM_NZVALUE, sizeof(pud_t));
> >>>>>> +  pud_clear(pudp);
> >>>>>> +  WARN_ON(!pud_none(READ_ONCE(*pudp)));
> >>>>>> +}
> >>>>>
> >>>>> For pgd/p4d/pud_clear(), we only clear if the page table level is 
> >>>>> present
> >>>>> and not folded. The memset() here overwrites the table type bits, so
> >>>>> pud_clear() will not clear anything on s390 and the pud_none() check 
> >>>>> will
> >>>>> fail.
> >>>>> Would it be possible to OR a (larger) random value into the table, so 
> >>>>> that
> >>>>> the lower 12 bits would be preserved?
> >>>>
> >>>> So the suggestion is instead of doing memset() on entry with 
> >>>> RANDOM_NZVALUE,
> >>>> it should OR a large random value preserving lower 12 bits. Hmm, this 
> >>>> should
> >>>> still do the trick for other platforms, they just need non zero value. 
> >>>> So on
> >>>> s390, the lower 12 bits on the page table entry already has valid value 
> >>>> while
> >>>> entering this function which would make sure that pud_clear() really does
> >>>> clear the entry ?  
> >>>
> >>> Yes, in theory the table entry on s390 would have the type set in the last
> >>> 4 bits, so preserving those would be enough. If it does not conflict with
> >>> others, I would still suggest preserving all 12 bits since those would 
> >>> contain
> >>> arch-specific flags in general, just to be sure. For s390, the pte/pmd 
> >>> tests
> >>> would also work with the memset, but for consistency I think the same 
> >>> logic
> >>> should be used in all pxd_clear_tests.  
> >>
> >> Makes sense but..
> >>
> >> There is a small challenge with this. Modifying individual bits on a given
> >> page table entry from generic code like this test case is bit tricky. That
> >> is because there are not enough helpers to create entries with an absolute
> >> value. This would have been easier if all the platforms provided functions
> >> like __pxx() which is not the case now. Otherwise something like this 
> >> should
> >> have worked.
> >>
> >>
> >> pud_t pud = READ_ONCE(*pudp);
> >> pud = __pud(pud_val(pud) | RANDOM_VALUE (keeping lower 12 bits 0))
> >> WRITE_ONCE(*pudp, pud);
> >>
> >> But __pud() will fail to build in many platforms.
> > 
> > Hmm, I simply used this on my system to make pud_clear_tests() work, not
> > sure if it works on all archs:
> > 
> > pud_val(*pudp) |= RANDOM_NZVALUE;
> 
> Which compiles on arm64 but then fails on x86 because of the way pmd_val()
> has been defined there.

Use instead

*pudp = __pud(pud_val(*pudp) | RANDOM_NZVALUE);

It *should* be more portable.

-- 
 Kirill A. Shutemov


Re: [PATCH 1/1] mm/pgtable/debug: Add test validating architecture page table helpers

2019-09-05 Thread Kirill A. Shutemov
On Thu, Sep 05, 2019 at 01:48:27PM +0530, Anshuman Khandual wrote:
> >> +#define VADDR_TEST	(PGDIR_SIZE + PUD_SIZE + PMD_SIZE + PAGE_SIZE)
> > 
> > What is special about this address? How do you know if it is not occupied
> > yet?
> 
> We are creating the page table from scratch after allocating an mm_struct
> for a given random virtual address 'VADDR_TEST'. Hence nothing is occupied
> just yet. There is nothing special about this address, just that it tries
> to ensure the page table entries are being created with some offset from
> beginning of respective page table page at all levels ? The idea is to
> have a more representative form of page table structure for test.

Why is P4D_SIZE missing?

Are you sure it will not land in the kernel address space on any arch?

I think a more robust way to deal with this would be using
get_unmapped_area() instead of a fixed address.
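
One possible shape of that suggestion (hypothetical helper; assumes the test
runs late enough that current->mm is usable for the search):

static unsigned long __init pgtable_test_vaddr(void)
{
	unsigned long vaddr;

	/* Let the normal unmapped-area search pick a free range instead of
	 * hard-coding an address that might already be in use. */
	vaddr = get_unmapped_area(NULL, 0UL, PUD_SIZE, 0UL, 0UL);

	return IS_ERR_VALUE(vaddr) ? 0UL : vaddr;
}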

> This makes sense for runtime cases but there is a problem here.
> 
> On arm64, pgd_populate() which takes (pud_t *) as last argument instead of
> (p4d_t *) will fail to build when not wrapped in !__PAGETABLE_P4D_FOLDED
> on certain configurations.
> 
> ./arch/arm64/include/asm/pgalloc.h:81:75: note:
> expected ‘pud_t *’ {aka ‘struct  *’}
> but argument is of type ‘pgd_t *’ {aka ‘struct  *’}
> static inline void pgd_populate(struct mm_struct *mm, pgd_t *pgdp, pud_t 
> *pudp)
>~~~^~~~
> Wondering if this is something to be fixed on arm64 or its more general
> problem. Will look into this further.

I think you need to wrap this in #ifndef __ARCH_HAS_5LEVEL_HACK.

> >> +  pmd_populate_tests(mm, pmdp, (pgtable_t) page);
> > 
> > This is not correct for architectures that defines pgtable_t as pte_t
> > pointer, not struct page pointer.
> 
> Right, a grep on the source confirms that.
> 
> These platforms define pgtable_t as struct page *
> 
> arch/alpha/include/asm/page.h:typedef struct page *pgtable_t;
> arch/arm/include/asm/page.h:typedef struct page *pgtable_t;
> arch/arm64/include/asm/page.h:typedef struct page *pgtable_t;
> arch/csky/include/asm/page.h:typedef struct page *pgtable_t;
> arch/hexagon/include/asm/page.h:typedef struct page *pgtable_t;
> arch/ia64/include/asm/page.h:  typedef struct page *pgtable_t;
> arch/ia64/include/asm/page.h:typedef struct page *pgtable_t;
> arch/m68k/include/asm/page.h:typedef struct page *pgtable_t;
> arch/microblaze/include/asm/page.h:typedef struct page *pgtable_t;
> arch/mips/include/asm/page.h:typedef struct page *pgtable_t;
> arch/nds32/include/asm/page.h:typedef struct page *pgtable_t;
> arch/nios2/include/asm/page.h:typedef struct page *pgtable_t;
> arch/openrisc/include/asm/page.h:typedef struct page *pgtable_t;
> arch/parisc/include/asm/page.h:typedef struct page *pgtable_t;
> arch/riscv/include/asm/page.h:typedef struct page *pgtable_t;
> arch/sh/include/asm/page.h:typedef struct page *pgtable_t;
> arch/sparc/include/asm/page_32.h:typedef struct page *pgtable_t;
> arch/um/include/asm/page.h:typedef struct page *pgtable_t;
> arch/unicore32/include/asm/page.h:typedef struct page *pgtable_t;
> arch/x86/include/asm/pgtable_types.h:typedef struct page *pgtable_t;
> arch/xtensa/include/asm/page.h:typedef struct page *pgtable_t;
> 
> These platforms define pgtable_t as pte_t *
> 
> arch/arc/include/asm/page.h:typedef pte_t * pgtable_t;
> arch/powerpc/include/asm/mmu.h:typedef pte_t *pgtable_t;
> arch/s390/include/asm/page.h:typedef pte_t *pgtable_t;
> arch/sparc/include/asm/page_64.h:typedef pte_t *pgtable_t;
> 
> Should we need have two pmd_populate_tests() definitions with
> different arguments (struct page pointer or pte_t pointer) and then
> call either one after detecting the given platform ?

Use pte_alloc_one() instead of alloc_mapped_page() to allocate the page
table.
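A sketch of that, reusing the names from the patch -- pte_alloc_one() returns
pgtable_t, so it works whether the architecture defines pgtable_t as a
struct page pointer or as a pte_t pointer:

	pgtable_t pgtable;

	pgtable = pte_alloc_one(mm);
	if (!pgtable)
		return 1;

	pmd_populate_tests(mm, pmdp, pgtable);
	/* and later, instead of freeing a raw page: */
	pte_free(mm, pgtable);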

> >> +  pud_populate_tests(mm, pudp, pmdp);
> >> +  p4d_populate_tests(mm, p4dp, pudp);
> >> +  pgd_populate_tests(mm, pgdp, p4dp);
> > 
> > This is wrong. All p?dp points to the second entry in page table entry.
> > This is not valid pointer for page table and triggers p?d_bad() on x86.
> 
> Yeah these are second entries because of the way we create the page table.
> But I guess its applicable only to the second argument in all these above
> cases because the first argument can be any valid entry on previous page
> table level.

Yes:

@@ -397,9 +396,9 @@ static int __init arch_pgtable_tests_init(void)
pgd_clear_tests(pgdp);
 
pmd_populate_tests(mm, pmdp, (pgtable_t) page);
-   pud_populate_tests(mm, pudp, pmdp);
-   p4d_populate_tests(mm, p4dp, pudp);
-   pgd_populate_tests(mm, pgdp, p4dp);
+   pud_populate_tests(mm, pudp, saved_pmdp);
+   p4d_populate_tests(mm, p4dp, saved_pudp);
+   pgd_populate_tests(mm, pgdp, saved_p4dp);
 
p4d_free(mm, saved_p4dp);
pud_free(mm, saved_pudp);

-- 
 Kirill A. Shutemov


Re: [PATCH 1/1] mm/pgtable/debug: Add test validating architecture page table helpers

2019-09-04 Thread Kirill A. Shutemov
> +static struct page *alloc_gigantic_page(nodemask_t *nodemask,
> + int nid, gfp_t gfp_mask, int order)
> +{
> + struct zonelist *zonelist;
> + struct zone *zone;
> + struct zoneref *z;
> + enum zone_type zonesel;
> + unsigned long ret, pfn, flags, nr_pages;
> +
> + nr_pages = 1UL << order;
> + zonesel = gfp_zone(gfp_mask);
> + zonelist = node_zonelist(nid, gfp_mask);
> + for_each_zone_zonelist_nodemask(zone, z, zonelist, zonesel, nodemask) {
> + spin_lock_irqsave(&zone->lock, flags);
> + pfn = ALIGN(zone->zone_start_pfn, nr_pages);
> + while (zone_spans_pfn(zone, pfn + nr_pages - 1)) {
> + if (pfn_range_valid(zone, pfn, nr_pages)) {
> + spin_unlock_irqrestore(&zone->lock, flags);
> + ret = alloc_contig_range(pfn, pfn + nr_pages,
> +  MIGRATE_MOVABLE,
> +  gfp_mask);
> + if (!ret)
> + return pfn_to_page(pfn);
> + spin_lock_irqsave(&zone->lock, flags);
> + }
> + pfn += nr_pages;
> + }
> + spin_unlock_irqrestore(&zone->lock, flags);
> + }
> + return NULL;
> +}
> +
> +static struct page *alloc_mapped_page(void)
> +{
> + gfp_t gfp_mask = GFP_KERNEL | __GFP_ZERO;
> + struct page *page = NULL;
> +
> + page = alloc_gigantic_page(&node_states[N_MEMORY], first_memory_node,
> +gfp_mask, get_order(PUD_SIZE));
> + if (page) {
> + pud_aligned = true;
> + pmd_aligned = true;
> + return page;
> + }
> +
> + page = alloc_pages(gfp_mask, get_order(PMD_SIZE));
> + if (page) {
> + pmd_aligned = true;
> + return page;
> + }
> + return alloc_page(gfp_mask);
> +}
> +
> +static void free_mapped_page(struct page *page)
> +{
> + if (pud_aligned) {
> + unsigned long pfn = page_to_pfn(page);
> +
> + free_contig_range(pfn, 1ULL << get_order(PUD_SIZE));
> + return;
> + }
> +
> + if (pmd_aligned) {
> + int order = get_order(PMD_SIZE);
> +
> + free_pages((unsigned long)page_address(page), order);
> + return;
> + }
> + free_page((unsigned long)page_address(page));
> +}
> +
> +static int __init arch_pgtable_tests_init(void)
> +{
> + struct mm_struct *mm;
> + struct page *page;
> + pgd_t *pgdp;
> + p4d_t *p4dp, *saved_p4dp;
> + pud_t *pudp, *saved_pudp;
> + pmd_t *pmdp, *saved_pmdp;
> + pte_t *ptep, *saved_ptep;
> + pgprot_t prot = vm_get_page_prot(VMA_TEST_FLAGS);
> + unsigned long vaddr = VADDR_TEST;
> +
> + mm = mm_alloc();
> + if (!mm) {
> + pr_err("mm_struct allocation failed\n");
> + return 1;
> + }
> +
> + page = alloc_mapped_page();
> + if (!page) {
> + pr_err("memory allocation failed\n");
> + return 1;
> + }
> +
> + pgdp = pgd_offset(mm, vaddr);
> + p4dp = p4d_alloc(mm, pgdp, vaddr);
> + pudp = pud_alloc(mm, p4dp, vaddr);
> + pmdp = pmd_alloc(mm, pudp, vaddr);
> + ptep = pte_alloc_map(mm, pmdp, vaddr);
> +
> + /*
> +  * Save all the page table page addresses as the page table
> +  * entries will be used for testing with random or garbage
> +  * values. These saved addresses will be used for freeing
> +  * page table pages.
> +  */
> + saved_p4dp = p4d_offset(pgdp, 0UL);
> +     saved_pudp = pud_offset(p4dp, 0UL);
> + saved_pmdp = pmd_offset(pudp, 0UL);
> + saved_ptep = pte_offset_map(pmdp, 0UL);
> +
> + pte_basic_tests(page, prot);
> + pmd_basic_tests(page, prot);
> + pud_basic_tests(page, prot);
> + p4d_basic_tests(page, prot);
> + pgd_basic_tests(page, prot);
> +
> + pte_clear_tests(ptep);
> + pmd_clear_tests(pmdp);
> + pud_clear_tests(pudp);
> + p4d_clear_tests(p4dp);
> + pgd_clear_tests(pgdp);
> +
> + pmd_populate_tests(mm, pmdp, (pgtable_t) page);

This is not correct for architectures that define pgtable_t as a pte_t
pointer, not a struct page pointer.

> + pud_populate_tests(mm, pudp, pmdp);
> + p4d_populate_tests(mm, p4dp, pudp);
> + pgd_populate_tests(mm, pgdp, p4dp);

This is wrong. All the p?dp pointers point to the second entry in their
page table page. That is not a valid pointer for a page table and triggers
p?d_bad() on x86.

Use saved_p?dp instead.

> +
> + p4d_free(mm, saved_p4dp);
> + pud_free(mm, saved_pudp);
> + pmd_free(mm, saved_pmdp);
> + pte_free(mm, (pgtable_t) virt_to_page(saved_ptep));
> +
> + mm_dec_nr_puds(mm);
> + mm_dec_nr_pmds(mm);
> + mm_dec_nr_ptes(mm);
> + __mmdrop(mm);
> +
> + free_mapped_page(page);
> + return 0;
> +}
> +
> +static void __exit arch_pgtable_tests_exit(void) { }
> +
> +module_init(arch_pgtable_tests_init);
> +module_exit(arch_pgtable_tests_exit);
> +
> +MODULE_LICENSE("GPL v2");
> +MODULE_AUTHOR("Anshuman Khandual ");
> +MODULE_DESCRIPTION("Test archicture page table helpers");
> -- 
> 2.20.1
> 
> 

-- 
 Kirill A. Shutemov


Re: [PATCH 2/2] mm/dax: Don't enable huge dax mapping by default

2019-03-19 Thread Kirill A. Shutemov
On Wed, Mar 13, 2019 at 09:07:13AM -0700, Dan Williams wrote:
> On Wed, Mar 6, 2019 at 4:46 AM Aneesh Kumar K.V
>  wrote:
> >
> > On 3/6/19 5:14 PM, Michal Suchánek wrote:
> > > On Wed, 06 Mar 2019 14:47:33 +0530
> > > "Aneesh Kumar K.V"  wrote:
> > >
> > >> Dan Williams  writes:
> > >>
> > >>> On Thu, Feb 28, 2019 at 1:40 AM Oliver  wrote:
> > >>>>
> > >>>> On Thu, Feb 28, 2019 at 7:35 PM Aneesh Kumar K.V
> > >>>>  wrote:
> > >
> > >> Also even if the user decided to not use THP, by
> > >> echo "never" > transparent_hugepage/enabled , we should continue to map
> > >> dax fault using huge page on platforms that can support huge pages.
> > >
> > > Is this a good idea?
> > >
> > > This knob is there for a reason. In some situations having huge pages
> > > can severely impact performance of the system (due to host-guest
> > > interaction or whatever) and the ability to really turn off all THP
> > > would be important in those cases, right?
> > >
> >
> > My understanding was that is not true for dax pages? These are not
> > regular memory that got allocated. They are allocated out of /dev/dax/
> > or /dev/pmem*. Do we have a reason not to use hugepages for mapping
> > pages in that case?
> 
> The problem with the transparent_hugepage/enabled interface is that it
> conflates performing compaction work to produce THP-pages with the
> ability to map huge pages at all.

That's not [entirely] true. transparent_hugepage/defrag gates heavy-duty
compaction. We do only very limited compaction if it's not advised by
transparent_hugepage/defrag.

I believe DAX has to respect transparent_hugepage/enabled. Or not
advertise its huge pages as THP. It's confusing for users.
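Not from the thread -- a sketch of the kind of gate being asked for, assuming
the DAX PMD fault path has the vma at hand and that the
transparent_hugepage_enabled() helper from huge_mm.h is an acceptable check:

	/* hypothetical check before installing a PMD-sized DAX mapping */
	if (!transparent_hugepage_enabled(vma))
		return VM_FAULT_FALLBACK;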

-- 
 Kirill A. Shutemov


Re: [PATCH 2/2] mm/dax: Don't enable huge dax mapping by default

2019-03-06 Thread Kirill A. Shutemov
On Wed, Mar 06, 2019 at 06:15:25PM +0530, Aneesh Kumar K.V wrote:
> On 3/6/19 5:14 PM, Michal Suchánek wrote:
> > On Wed, 06 Mar 2019 14:47:33 +0530
> > "Aneesh Kumar K.V"  wrote:
> > 
> > > Dan Williams  writes:
> > > 
> > > > On Thu, Feb 28, 2019 at 1:40 AM Oliver  wrote:
> > > > > 
> > > > > On Thu, Feb 28, 2019 at 7:35 PM Aneesh Kumar K.V
> > > > >  wrote:
> > > Also even if the user decided to not use THP, by
> > > echo "never" > transparent_hugepage/enabled , we should continue to map
> > > dax fault using huge page on platforms that can support huge pages.
> > 
> > Is this a good idea?
> > 
> > This knob is there for a reason. In some situations having huge pages
> > can severely impact performance of the system (due to host-guest
> > interaction or whatever) and the ability to really turn off all THP
> > would be important in those cases, right?
> > 
> 
> My understanding was that is not true for dax pages? These are not regular
> memory that got allocated. They are allocated out of /dev/dax/ or
> /dev/pmem*. Do we have a reason not to use hugepages for mapping pages in
> that case?

Yes. Like when you don't want dax to compete for TLB with a mission-critical
application (which uses hugetlb, for instance).

-- 
 Kirill A. Shutemov


Re: [PATCH] mmap.2: describe the 5level paging hack

2019-02-12 Thread Kirill A. Shutemov
On Mon, Feb 11, 2019 at 05:36:53PM +0100, Jann Horn wrote:
> The manpage is missing information about the compatibility hack for
> 5-level paging that went in in 4.14, around commit ee00f4a32a76 ("x86/mm:
> Allow userspace have mappings above 47-bit"). Add some information about
> that.
> 
> While I don't think any hardware supporting this is shipping yet (?), I
> think it's useful to try to write a manpage for this API, partly to
> figure out how usable that API actually is, and partly because when this
> hardware does ship, it'd be nice if distro manpages had information about
> how to use it.
> 
> Signed-off-by: Jann Horn 

Thanks for doing this.

> ---
> This patch goes on top of the patch "[PATCH] mmap.2: fix description of
> treatment of the hint" that I just sent, but I'm not sending them in a
> series because I want the first one to go in, and I think this one might
> be a bit more controversial.
> 
> It would be nice if the architecture maintainers and mm folks could have
> a look at this and check that what I wrote is right - I only looked at
> the source for this, I haven't tried it.
> 
>  man2/mmap.2 | 15 +++
>  1 file changed, 15 insertions(+)
> 
> diff --git a/man2/mmap.2 b/man2/mmap.2
> index 8556bbfeb..977782fa8 100644
> --- a/man2/mmap.2
> +++ b/man2/mmap.2
> @@ -67,6 +67,8 @@ is NULL,
>  then the kernel chooses the (page-aligned) address
>  at which to create the mapping;
>  this is the most portable method of creating a new mapping.
> +On Linux, in this case, the kernel may limit the maximum address that can be
> +used for allocations to a legacy limit for compatibility reasons.
>  If
>  .I addr
>  is not NULL,
> @@ -77,6 +79,19 @@ or equal to the value specified by
>  and attempt to create the mapping there.
>  If another mapping already exists there, the kernel picks a new
>  address, independent of the hint.
> +However, if a hint above the architecture's legacy address limit is provided
> +(on x86-64: above 0x7000, on arm64: above 0x1, on ppc64 with
> +book3s: above 0x7fff or 0x3fff, depending on page size), the
> +kernel is permitted to allocate mappings beyond the architecture's legacy
> +address limit. The availability of such addresses is hardware-dependent.
> +Therefore, if you want to be able to use the full virtual address space of
> +hardware that supports addresses beyond the legacy range, you need to specify an
> +address above that limit; however, for security reasons, you should avoid
> +specifying a fixed valid address outside the compatibility range,
> +since that would reduce the value of userspace address space layout
> +randomization. Therefore, it is recommended to specify an address
> +.I beyond
> +the end of the userspace address space.

It's probably worth recommending (void *) -1 as such an address.
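A sketch of that recommendation (illustrative values only):

	size_t len = 2 * 1024 * 1024;
	void *p = mmap((void *) -1, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	/* on hardware/kernels that support it, p may now be above the
	 * legacy 47-bit window; otherwise the usual lower range is used */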

>  .\" Before Linux 2.6.24, the address was rounded up to the next page
>  .\" boundary; since 2.6.24, it is rounded down!
>  The address of the new mapping is returned as the result of the call.
> -- 
> 2.20.1.791.gb4d0f1c61a-goog
> 

-- 
 Kirill A. Shutemov


Re: [PATCH] mm/zsmalloc.c: Fix zsmalloc 32-bit PAE support

2018-12-10 Thread Kirill A. Shutemov
On Mon, Dec 10, 2018 at 12:21:05PM -0200, Rafael David Tinoco wrote:
> diff --git a/arch/x86/include/asm/pgtable_64_types.h 
> b/arch/x86/include/asm/pgtable_64_types.h
> index 84bd9bdc1987..d808cfde3d19 100644
> --- a/arch/x86/include/asm/pgtable_64_types.h
> +++ b/arch/x86/include/asm/pgtable_64_types.h
> @@ -64,8 +64,6 @@ extern unsigned int ptrs_per_p4d;
>  #define P4D_SIZE (_AC(1, UL) << P4D_SHIFT)
>  #define P4D_MASK (~(P4D_SIZE - 1))
>  
> -#define MAX_POSSIBLE_PHYSMEM_BITS	52
> -
>  #else /* CONFIG_X86_5LEVEL */
>  
>  /*
> @@ -154,4 +152,6 @@ extern unsigned int ptrs_per_p4d;
>  
>  #define PGD_KERNEL_START ((PAGE_SIZE / 2) / sizeof(pgd_t))
>  
> +#define MAX_POSSIBLE_PHYSMEM_BITS (pgtable_l5_enabled() ? 52 : 46)
> +

...

>  #endif /* _ASM_X86_PGTABLE_64_DEFS_H */
> diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
> index 0787d33b80d8..132c20b6fd4f 100644
> --- a/mm/zsmalloc.c
> +++ b/mm/zsmalloc.c

...

> @@ -116,6 +100,25 @@
>   */
>  #define OBJ_ALLOCATED_TAG 1
>  #define OBJ_TAG_BITS 1
> +
> +/*
> + * MAX_POSSIBLE_PHYSMEM_BITS should be defined by all archs using zsmalloc:
> + * Trying to guess it from MAX_PHYSMEM_BITS, or considering it BITS_PER_LONG,
> + * proved to be wrong by not considering PAE capabilities, or using SPARSEMEM
> + * only headers, leading to bad object encoding due to object index overflow.
> + */
> +#ifndef MAX_POSSIBLE_PHYSMEM_BITS
> + #define MAX_POSSIBLE_PHYSMEM_BITS BITS_PER_LONG
> + #error "MAX_POSSIBLE_PHYSMEM_BITS HAS to be defined by arch using zsmalloc";
> +#else
> + #ifndef CONFIG_64BIT
> +  #if (MAX_POSSIBLE_PHYSMEM_BITS >= (BITS_PER_LONG + PAGE_SHIFT - 
> OBJ_TAG_BITS))
> +   #error "MAX_POSSIBLE_PHYSMEM_BITS is wrong for this arch";
> +  #endif
> + #endif
> +#endif
> +
> +#define _PFN_BITS (MAX_POSSIBLE_PHYSMEM_BITS - PAGE_SHIFT)
>  #define OBJ_INDEX_BITS   (BITS_PER_LONG - _PFN_BITS - OBJ_TAG_BITS)
>  #define OBJ_INDEX_MASK   ((_AC(1, UL) << OBJ_INDEX_BITS) - 1)

Have you tested it with CONFIG_X86_5LEVEL=y?

AFAICS, the patch makes OBJ_INDEX_BITS and everything that depends on it
dynamic -- it depends on what paging mode we are booting in. ZS_SIZE_CLASSES
depends indirectly on OBJ_INDEX_BITS and I don't see how the struct zs_pool
definition can compile with a dynamic ZS_SIZE_CLASSES.

Hm?
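For reference, the dependency chain being referred to looks roughly like this
in mm/zsmalloc.c of that era (paraphrased from memory, so treat it as a
sketch):

	#define ZS_MIN_ALLOC_SIZE \
		MAX(32, (ZS_MAX_PAGES_PER_ZSPAGE << PAGE_SHIFT >> OBJ_INDEX_BITS))
	#define ZS_SIZE_CLASSES	(DIV_ROUND_UP(ZS_MAX_ALLOC_SIZE - ZS_MIN_ALLOC_SIZE, \
					      ZS_SIZE_CLASS_DELTA) + 1)

	struct zs_pool {
		/* ... */
		struct size_class *size_class[ZS_SIZE_CLASSES];	/* needs a compile-time constant */
		/* ... */
	};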

-- 
 Kirill A. Shutemov


Re: [PATCH 2/4] mm: speed up mremap by 500x on large regions (v2)

2018-10-25 Thread Kirill A. Shutemov
On Wed, Oct 24, 2018 at 07:09:07PM -0700, Joel Fernandes wrote:
> On Wed, Oct 24, 2018 at 03:57:24PM +0300, Kirill A. Shutemov wrote:
> > On Wed, Oct 24, 2018 at 10:57:33PM +1100, Balbir Singh wrote:
> > > On Wed, Oct 24, 2018 at 01:12:56PM +0300, Kirill A. Shutemov wrote:
> > > > On Fri, Oct 12, 2018 at 06:31:58PM -0700, Joel Fernandes (Google) wrote:
> > > > > diff --git a/mm/mremap.c b/mm/mremap.c
> > > > > index 9e68a02a52b1..2fd163cff406 100644
> > > > > --- a/mm/mremap.c
> > > > > +++ b/mm/mremap.c
> > > > > @@ -191,6 +191,54 @@ static void move_ptes(struct vm_area_struct 
> > > > > *vma, pmd_t *old_pmd,
> > > > >   drop_rmap_locks(vma);
> > > > >  }
> > > > >  
> > > > > +static bool move_normal_pmd(struct vm_area_struct *vma, unsigned 
> > > > > long old_addr,
> > > > > +   unsigned long new_addr, unsigned long old_end,
> > > > > +   pmd_t *old_pmd, pmd_t *new_pmd, bool *need_flush)
> > > > > +{
> > > > > + spinlock_t *old_ptl, *new_ptl;
> > > > > + struct mm_struct *mm = vma->vm_mm;
> > > > > +
> > > > > + if ((old_addr & ~PMD_MASK) || (new_addr & ~PMD_MASK)
> > > > > + || old_end - old_addr < PMD_SIZE)
> > > > > + return false;
> > > > > +
> > > > > + /*
> > > > > +  * The destination pmd shouldn't be established, free_pgtables()
> > > > > +  * should have release it.
> > > > > +  */
> > > > > + if (WARN_ON(!pmd_none(*new_pmd)))
> > > > > + return false;
> > > > > +
> > > > > + /*
> > > > > +  * We don't have to worry about the ordering of src and dst
> > > > > +  * ptlocks because exclusive mmap_sem prevents deadlock.
> > > > > +  */
> > > > > + old_ptl = pmd_lock(vma->vm_mm, old_pmd);
> > > > > + if (old_ptl) {
> > > > 
> > > > How can it ever be false?
> 
> Kirill,
> It cannot, you are right. I'll remove the test.
> 
> By the way, there are new changes upstream by Linus which flush the TLB
> before releasing the ptlock instead of after. I'm guessing that patch came
> about because of reviews of this patch and someone spotted an issue in the
> existing code :)
> 
> Anyway the patch in concern is:
> eb66ae030829 ("mremap: properly flush TLB before releasing the page")
> 
> I need to rebase on top of that with appropriate modifications, but I worry
> that this patch will slow down performance since we have to flush at every
> PMD/PTE move before releasing the ptlock. Where as with my patch, the
> intention is to flush only at once in the end of move_page_tables. When I
> tried to flush TLB on every PMD move, it was quite slow on my arm64 device 
> [2].
> 
> Further observation [1] is, it seems like the move_huge_pmds and move_ptes 
> code
> is a bit sub optimal in the sense, we are acquiring and releasing the same
> ptlock for a bunch of PMDs if the said PMDs are on the same page-table page
> right? Instead we can do better by acquiring and release the ptlock less
> often.
> 
> I think this observation [1] and the frequent TLB flush issue [2] can be 
> solved
> by acquiring the ptlock once for a bunch of PMDs, move them all, then flush
> the tlb and then release the ptlock, and then proceed to doing the same thing
> for the PMDs in the next page-table page. What do you think?

Yeah, that's a viable optimization.

The tricky part is that one PMD page table can have PMD entries of
different types: THP, page tables that you can move as a whole, and ones
that you cannot (for any reason).

If we cannot move the PMD entry as a whole and must go to the PTE page table,
we would need to drop the PMD ptl and take the PTE ptl (it might be the same
lock in some configurations).

Also, we don't want to take the PMD lock unless it's required.

I expect it will not be trivial to get everything right. But take a
shot :)

-- 
 Kirill A. Shutemov


Re: [PATCH 1/4] treewide: remove unused address argument from pte_alloc functions (v2)

2018-10-25 Thread Kirill A. Shutemov
On Wed, Oct 24, 2018 at 10:37:16AM +0200, Peter Zijlstra wrote:
> On Fri, Oct 12, 2018 at 06:31:57PM -0700, Joel Fernandes (Google) wrote:
> > This series speeds up mremap(2) syscall by copying page tables at the
> > PMD level even for non-THP systems. There is concern that the extra
> > 'address' argument that mremap passes to pte_alloc may do something
> > subtle architecture related in the future that may make the scheme not
> > work.  Also we find that there is no point in passing the 'address' to
> > pte_alloc since its unused. So this patch therefore removes this
> > argument tree-wide resulting in a nice negative diff as well. Also
> > ensuring along the way that the enabled architectures do not do anything
> > funky with 'address' argument that goes unnoticed by the optimization.
> 
> Did you happen to look at the history of where that address argument
> came from? -- just being curious here. ISTR something vague about
> architectures having different paging structure for different memory
> ranges.

I see some architectures (e.g. sparc and, I believe, power) used the address
for coloring. It's not needed anymore. The page allocator and SL?B are good
enough now.

See 3c936465249f ("[SPARC64]: Kill pgtable quicklists and use SLAB.")

-- 
 Kirill A. Shutemov


Re: [PATCH 2/4] mm: speed up mremap by 500x on large regions (v2)

2018-10-24 Thread Kirill A. Shutemov
On Wed, Oct 24, 2018 at 10:57:33PM +1100, Balbir Singh wrote:
> On Wed, Oct 24, 2018 at 01:12:56PM +0300, Kirill A. Shutemov wrote:
> > On Fri, Oct 12, 2018 at 06:31:58PM -0700, Joel Fernandes (Google) wrote:
> > > diff --git a/mm/mremap.c b/mm/mremap.c
> > > index 9e68a02a52b1..2fd163cff406 100644
> > > --- a/mm/mremap.c
> > > +++ b/mm/mremap.c
> > > @@ -191,6 +191,54 @@ static void move_ptes(struct vm_area_struct *vma, 
> > > pmd_t *old_pmd,
> > >   drop_rmap_locks(vma);
> > >  }
> > >  
> > > +static bool move_normal_pmd(struct vm_area_struct *vma, unsigned long 
> > > old_addr,
> > > +   unsigned long new_addr, unsigned long old_end,
> > > +   pmd_t *old_pmd, pmd_t *new_pmd, bool *need_flush)
> > > +{
> > > + spinlock_t *old_ptl, *new_ptl;
> > > + struct mm_struct *mm = vma->vm_mm;
> > > +
> > > + if ((old_addr & ~PMD_MASK) || (new_addr & ~PMD_MASK)
> > > + || old_end - old_addr < PMD_SIZE)
> > > + return false;
> > > +
> > > + /*
> > > +  * The destination pmd shouldn't be established, free_pgtables()
> > > +  * should have release it.
> > > +  */
> > > + if (WARN_ON(!pmd_none(*new_pmd)))
> > > + return false;
> > > +
> > > + /*
> > > +  * We don't have to worry about the ordering of src and dst
> > > +  * ptlocks because exclusive mmap_sem prevents deadlock.
> > > +  */
> > > + old_ptl = pmd_lock(vma->vm_mm, old_pmd);
> > > + if (old_ptl) {
> > 
> > How can it ever be false?
> > 
> > > + pmd_t pmd;
> > > +
> > > + new_ptl = pmd_lockptr(mm, new_pmd);
> 
> 
> Looks like this is largely inspired by move_huge_pmd(), I guess a lot of
> the code applies, why not just reuse as much as possible? The same comments
> w.r.t mmap_sem helping protect against lock order issues applies as well.

pmd_lock() cannot fail, but __pmd_trans_huge_lock() can. We should not
copy the code blindly.
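For context, pmd_lock() in include/linux/mm.h is essentially:

	static inline spinlock_t *pmd_lock(struct mm_struct *mm, pmd_t *pmd)
	{
		spinlock_t *ptl = pmd_lockptr(mm, pmd);
		spin_lock(ptl);
		return ptl;		/* always a held, non-NULL lock */
	}

whereas __pmd_trans_huge_lock() returns NULL when the pmd turns out not to be
a huge/devmap entry, which is why callers like move_huge_pmd() must check its
return value.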

-- 
 Kirill A. Shutemov


Re: [PATCH 2/4] mm: speed up mremap by 500x on large regions (v2)

2018-10-24 Thread Kirill A. Shutemov
On Fri, Oct 12, 2018 at 06:31:58PM -0700, Joel Fernandes (Google) wrote:
> diff --git a/mm/mremap.c b/mm/mremap.c
> index 9e68a02a52b1..2fd163cff406 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -191,6 +191,54 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t 
> *old_pmd,
>   drop_rmap_locks(vma);
>  }
>  
> +static bool move_normal_pmd(struct vm_area_struct *vma, unsigned long 
> old_addr,
> +   unsigned long new_addr, unsigned long old_end,
> +   pmd_t *old_pmd, pmd_t *new_pmd, bool *need_flush)
> +{
> + spinlock_t *old_ptl, *new_ptl;
> + struct mm_struct *mm = vma->vm_mm;
> +
> + if ((old_addr & ~PMD_MASK) || (new_addr & ~PMD_MASK)
> + || old_end - old_addr < PMD_SIZE)
> + return false;
> +
> + /*
> +  * The destination pmd shouldn't be established, free_pgtables()
> +  * should have release it.
> +  */
> + if (WARN_ON(!pmd_none(*new_pmd)))
> + return false;
> +
> + /*
> +  * We don't have to worry about the ordering of src and dst
> +  * ptlocks because exclusive mmap_sem prevents deadlock.
> +  */
> + old_ptl = pmd_lock(vma->vm_mm, old_pmd);
> + if (old_ptl) {

How can it ever be false?

> + pmd_t pmd;
> +
> + new_ptl = pmd_lockptr(mm, new_pmd);
> + if (new_ptl != old_ptl)
> + spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
> +
> + /* Clear the pmd */
> + pmd = *old_pmd;
> + pmd_clear(old_pmd);
> +
> + VM_BUG_ON(!pmd_none(*new_pmd));
> +
> + /* Set the new pmd */
> + set_pmd_at(mm, new_addr, new_pmd, pmd);
> + if (new_ptl != old_ptl)
> + spin_unlock(new_ptl);
> + spin_unlock(old_ptl);
> +
> + *need_flush = true;
> + return true;
> + }
> + return false;
> +}
> +
-- 
 Kirill A. Shutemov


Re: [PATCH v2 2/2] mm: speed up mremap by 500x on large regions

2018-10-12 Thread Kirill A. Shutemov
On Fri, Oct 12, 2018 at 05:42:24PM +0100, Anton Ivanov wrote:
> 
> On 10/12/18 3:48 PM, Anton Ivanov wrote:
> > On 12/10/2018 15:37, Kirill A. Shutemov wrote:
> > > On Fri, Oct 12, 2018 at 03:09:49PM +0100, Anton Ivanov wrote:
> > > > On 10/12/18 2:37 AM, Joel Fernandes (Google) wrote:
> > > > > Android needs to mremap large regions of memory during
> > > > > memory management
> > > > > related operations. The mremap system call can be really
> > > > > slow if THP is
> > > > > not enabled. The bottleneck is move_page_tables, which is copying each
> > > > > pte at a time, and can be really slow across a large map.
> > > > > Turning on THP
> > > > > may not be a viable option, and is not for us. This patch
> > > > > speeds up the
> > > > > performance for non-THP system by copying at the PMD level
> > > > > when possible.
> > > > > 
> > > > > The speed up is three orders of magnitude. On a 1GB mremap, the mremap
> > > > > completion times drops from 160-250 millesconds to 380-400
> > > > > microseconds.
> > > > > 
> > > > > Before:
> > > > > Total mremap time for 1GB data: 242321014 nanoseconds.
> > > > > Total mremap time for 1GB data: 196842467 nanoseconds.
> > > > > Total mremap time for 1GB data: 167051162 nanoseconds.
> > > > > 
> > > > > After:
> > > > > Total mremap time for 1GB data: 385781 nanoseconds.
> > > > > Total mremap time for 1GB data: 388959 nanoseconds.
> > > > > Total mremap time for 1GB data: 402813 nanoseconds.
> > > > > 
> > > > > Incase THP is enabled, the optimization is skipped. I also flush the
> > > > > tlb every time we do this optimization since I couldn't find a way to
> > > > > determine if the low-level PTEs are dirty. It is seen that the cost of
> > > > > doing so is not much compared the improvement, on both
> > > > > x86-64 and arm64.
> > > > > 
> > > > > Cc: minc...@kernel.org
> > > > > Cc: pan...@google.com
> > > > > Cc: hu...@google.com
> > > > > Cc: lokeshgi...@google.com
> > > > > Cc: dan...@google.com
> > > > > Cc: mho...@kernel.org
> > > > > Cc: kir...@shutemov.name
> > > > > Cc: a...@linux-foundation.org
> > > > > Signed-off-by: Joel Fernandes (Google) 
> > > > > ---
> > > > >    mm/mremap.c | 62
> > > > > +
> > > > >    1 file changed, 62 insertions(+)
> > > > > 
> > > > > diff --git a/mm/mremap.c b/mm/mremap.c
> > > > > index 9e68a02a52b1..d82c485822ef 100644
> > > > > --- a/mm/mremap.c
> > > > > +++ b/mm/mremap.c
> > > > > @@ -191,6 +191,54 @@ static void move_ptes(struct
> > > > > vm_area_struct *vma, pmd_t *old_pmd,
> > > > >    drop_rmap_locks(vma);
> > > > >    }
> > > > > +static bool move_normal_pmd(struct vm_area_struct *vma,
> > > > > unsigned long old_addr,
> > > > > +  unsigned long new_addr, unsigned long old_end,
> > > > > +  pmd_t *old_pmd, pmd_t *new_pmd, bool *need_flush)
> > > > > +{
> > > > > +    spinlock_t *old_ptl, *new_ptl;
> > > > > +    struct mm_struct *mm = vma->vm_mm;
> > > > > +
> > > > > +    if ((old_addr & ~PMD_MASK) || (new_addr & ~PMD_MASK)
> > > > > +    || old_end - old_addr < PMD_SIZE)
> > > > > +    return false;
> > > > > +
> > > > > +    /*
> > > > > + * The destination pmd shouldn't be established, free_pgtables()
> > > > > + * should have release it.
> > > > > + */
> > > > > +    if (WARN_ON(!pmd_none(*new_pmd)))
> > > > > +    return false;
> > > > > +
> > > > > +    /*
> > > > > + * We don't have to worry about the ordering of src and dst
> > > > > + * ptlocks because exclusive mmap_sem prevents deadlock.
> > > > > + */
> > > > > +    old_ptl = pmd_lock(vma->vm_mm, old_pmd);
> > > > > +    if (old_ptl) {
> > > > > +    pmd_t pmd;
> > > > > +

Re: [PATCH v2 2/2] mm: speed up mremap by 500x on large regions

2018-10-12 Thread Kirill A. Shutemov
On Fri, Oct 12, 2018 at 09:57:19AM -0700, Joel Fernandes wrote:
> On Fri, Oct 12, 2018 at 04:19:46PM +0300, Kirill A. Shutemov wrote:
> > On Fri, Oct 12, 2018 at 05:50:46AM -0700, Joel Fernandes wrote:
> > > On Fri, Oct 12, 2018 at 02:30:56PM +0300, Kirill A. Shutemov wrote:
> > > > On Thu, Oct 11, 2018 at 06:37:56PM -0700, Joel Fernandes (Google) wrote:
> > > > > Android needs to mremap large regions of memory during memory 
> > > > > management
> > > > > related operations. The mremap system call can be really slow if THP 
> > > > > is
> > > > > not enabled. The bottleneck is move_page_tables, which is copying each
> > > > > pte at a time, and can be really slow across a large map. Turning on 
> > > > > THP
> > > > > may not be a viable option, and is not for us. This patch speeds up 
> > > > > the
> > > > > performance for non-THP system by copying at the PMD level when 
> > > > > possible.
> > > > > 
> > > > > The speed up is three orders of magnitude. On a 1GB mremap, the mremap
> > > > > completion times drops from 160-250 millesconds to 380-400 
> > > > > microseconds.
> > > > > 
> > > > > Before:
> > > > > Total mremap time for 1GB data: 242321014 nanoseconds.
> > > > > Total mremap time for 1GB data: 196842467 nanoseconds.
> > > > > Total mremap time for 1GB data: 167051162 nanoseconds.
> > > > > 
> > > > > After:
> > > > > Total mremap time for 1GB data: 385781 nanoseconds.
> > > > > Total mremap time for 1GB data: 388959 nanoseconds.
> > > > > Total mremap time for 1GB data: 402813 nanoseconds.
> > > > > 
> > > > > Incase THP is enabled, the optimization is skipped. I also flush the
> > > > > tlb every time we do this optimization since I couldn't find a way to
> > > > > determine if the low-level PTEs are dirty. It is seen that the cost of
> > > > > doing so is not much compared the improvement, on both x86-64 and 
> > > > > arm64.
> > > > 
> > > > I looked into the code more and noticed move_pte() helper called from
> > > > move_ptes(). It changes PTE entry to suite new address.
> > > > 
> > > > It is only defined in non-trivial way on Sparc. I don't know much about
> > > > Sparc and it's hard for me to say if the optimization will break 
> > > > anything
> > > > there.
> > > 
> > > Sparc's move_pte seems to be flushing the D-cache to prevent aliasing. It 
> > > is
> > > not modifying the PTE itself AFAICS:
> > > 
> > > #ifdef DCACHE_ALIASING_POSSIBLE
> > > #define __HAVE_ARCH_MOVE_PTE
> > > #define move_pte(pte, prot, old_addr, new_addr) \
> > > ({  \
> > > pte_t newpte = (pte);   \
> > > if (tlb_type != hypervisor && pte_present(pte)) {   \
> > > unsigned long this_pfn = pte_pfn(pte);  \
> > > \
> > > if (pfn_valid(this_pfn) &&  \
> > > (((old_addr) ^ (new_addr)) & (1 << 13)))\
> > > flush_dcache_page_all(current->mm,  \
> > >   pfn_to_page(this_pfn));   \
> > > }   \
> > > newpte; \
> > > })
> > > #endif
> > > 
> > > If its an issue, then how do transparent huge pages work on Sparc?  I 
> > > don't
> > > see the huge page code (move_huge_pages) during mremap doing anything 
> > > special
> > > for Sparc architecture when moving PMDs..
> > 
> > My *guess* is that it will work fine on Sparc as it apprarently it only
> > cares about change in bit 13 of virtual address. It will never happen for
> > huge pages or when PTE page tables move.
> > 
> > But I just realized that the problem is bigger: since we pass new_addr to
> > the set_pte_at() we would need to audit all implementations that they are
> > safe with just moving PTE page table.
> > 
> > I would rather go with per-architecture enabling. It's much safer.
> 
> I'm Ok with the per-arch enabling, I agree its safer. So I should be adding a
> a new __HAVE_ARCH_MOVE_PMD right, or did you have a better name for that?

I believe a Kconfig option is the more canonical way to do this nowadays.
So CONFIG_HAVE_ARCH_MOVE_PMD, I guess. Or CONFIG_HAVE_MOVE_PMD.
An arch that supports it would select the option.
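A sketch of how that could look (using the second name), not from the thread:

config HAVE_MOVE_PMD
	bool
	help
	  Architectures that select this can move page tables at the PMD level.

and then, in the Kconfig of an architecture that has been audited:

	select HAVE_MOVE_PMD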

> Also, do you feel we should still need to remove the address argument from
> set_pte_alloc? Or should we leave that alone if we do per-arch?
> I figure I spent a bunch of time on that already anyway, and its a clean up
> anyway, so may as well do it. But perhaps that "pte_alloc cleanup" can then
> be a separate patch independent of this series?

Yeah. The cleanup makes sense anyway.

-- 
 Kirill A. Shutemov


Re: [PATCH v2 2/2] mm: speed up mremap by 500x on large regions

2018-10-12 Thread Kirill A. Shutemov
On Fri, Oct 12, 2018 at 03:09:49PM +0100, Anton Ivanov wrote:
> On 10/12/18 2:37 AM, Joel Fernandes (Google) wrote:
> > Android needs to mremap large regions of memory during memory management
> > related operations. The mremap system call can be really slow if THP is
> > not enabled. The bottleneck is move_page_tables, which is copying each
> > pte at a time, and can be really slow across a large map. Turning on THP
> > may not be a viable option, and is not for us. This patch speeds up the
> > performance for non-THP system by copying at the PMD level when possible.
> > 
> > The speed up is three orders of magnitude. On a 1GB mremap, the mremap
> > completion times drops from 160-250 millesconds to 380-400 microseconds.
> > 
> > Before:
> > Total mremap time for 1GB data: 242321014 nanoseconds.
> > Total mremap time for 1GB data: 196842467 nanoseconds.
> > Total mremap time for 1GB data: 167051162 nanoseconds.
> > 
> > After:
> > Total mremap time for 1GB data: 385781 nanoseconds.
> > Total mremap time for 1GB data: 388959 nanoseconds.
> > Total mremap time for 1GB data: 402813 nanoseconds.
> > 
> > Incase THP is enabled, the optimization is skipped. I also flush the
> > tlb every time we do this optimization since I couldn't find a way to
> > determine if the low-level PTEs are dirty. It is seen that the cost of
> > doing so is not much compared the improvement, on both x86-64 and arm64.
> > 
> > Cc: minc...@kernel.org
> > Cc: pan...@google.com
> > Cc: hu...@google.com
> > Cc: lokeshgi...@google.com
> > Cc: dan...@google.com
> > Cc: mho...@kernel.org
> > Cc: kir...@shutemov.name
> > Cc: a...@linux-foundation.org
> > Signed-off-by: Joel Fernandes (Google) 
> > ---
> >   mm/mremap.c | 62 +
> >   1 file changed, 62 insertions(+)
> > 
> > diff --git a/mm/mremap.c b/mm/mremap.c
> > index 9e68a02a52b1..d82c485822ef 100644
> > --- a/mm/mremap.c
> > +++ b/mm/mremap.c
> > @@ -191,6 +191,54 @@ static void move_ptes(struct vm_area_struct *vma, 
> > pmd_t *old_pmd,
> > drop_rmap_locks(vma);
> >   }
> > +static bool move_normal_pmd(struct vm_area_struct *vma, unsigned long 
> > old_addr,
> > + unsigned long new_addr, unsigned long old_end,
> > + pmd_t *old_pmd, pmd_t *new_pmd, bool *need_flush)
> > +{
> > +   spinlock_t *old_ptl, *new_ptl;
> > +   struct mm_struct *mm = vma->vm_mm;
> > +
> > +   if ((old_addr & ~PMD_MASK) || (new_addr & ~PMD_MASK)
> > +   || old_end - old_addr < PMD_SIZE)
> > +   return false;
> > +
> > +   /*
> > +* The destination pmd shouldn't be established, free_pgtables()
> > +* should have release it.
> > +*/
> > +   if (WARN_ON(!pmd_none(*new_pmd)))
> > +   return false;
> > +
> > +   /*
> > +* We don't have to worry about the ordering of src and dst
> > +* ptlocks because exclusive mmap_sem prevents deadlock.
> > +*/
> > +   old_ptl = pmd_lock(vma->vm_mm, old_pmd);
> > +   if (old_ptl) {
> > +   pmd_t pmd;
> > +
> > +   new_ptl = pmd_lockptr(mm, new_pmd);
> > +   if (new_ptl != old_ptl)
> > +   spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
> > +
> > +   /* Clear the pmd */
> > +   pmd = *old_pmd;
> > +   pmd_clear(old_pmd);
> > +
> > +   VM_BUG_ON(!pmd_none(*new_pmd));
> > +
> > +   /* Set the new pmd */
> > +   set_pmd_at(mm, new_addr, new_pmd, pmd);
> 
> UML does not have set_pmd_at at all

Every architecture does. :)

But it may not come from the arch code.

> If I read the code right, MIPS completely ignores the address argument so
> set_pmd_at there may not have the effect which this patch is trying to
> achieve.

Ignoring the address is fine. Most architectures do that.
The idea is to move the page table to the new pmd slot. It has nothing to do
with the address passed to set_pmd_at().
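For illustration -- on many architectures set_pmd_at() boils down to something
like the following, so the address argument is purely informational there:

	static inline void set_pmd_at(struct mm_struct *mm, unsigned long addr,
				      pmd_t *pmdp, pmd_t pmd)
	{
		set_pmd(pmdp, pmd);	/* mm and addr are unused */
	}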

-- 
 Kirill A. Shutemov


Re: [PATCH v2 2/2] mm: speed up mremap by 500x on large regions

2018-10-12 Thread Kirill A. Shutemov
On Fri, Oct 12, 2018 at 05:50:46AM -0700, Joel Fernandes wrote:
> On Fri, Oct 12, 2018 at 02:30:56PM +0300, Kirill A. Shutemov wrote:
> > On Thu, Oct 11, 2018 at 06:37:56PM -0700, Joel Fernandes (Google) wrote:
> > > Android needs to mremap large regions of memory during memory management
> > > related operations. The mremap system call can be really slow if THP is
> > > not enabled. The bottleneck is move_page_tables, which is copying each
> > > pte at a time, and can be really slow across a large map. Turning on THP
> > > may not be a viable option, and is not for us. This patch speeds up the
> > > performance for non-THP system by copying at the PMD level when possible.
> > > 
> > > The speed up is three orders of magnitude. On a 1GB mremap, the mremap
> > > completion times drops from 160-250 millesconds to 380-400 microseconds.
> > > 
> > > Before:
> > > Total mremap time for 1GB data: 242321014 nanoseconds.
> > > Total mremap time for 1GB data: 196842467 nanoseconds.
> > > Total mremap time for 1GB data: 167051162 nanoseconds.
> > > 
> > > After:
> > > Total mremap time for 1GB data: 385781 nanoseconds.
> > > Total mremap time for 1GB data: 388959 nanoseconds.
> > > Total mremap time for 1GB data: 402813 nanoseconds.
> > > 
> > > Incase THP is enabled, the optimization is skipped. I also flush the
> > > tlb every time we do this optimization since I couldn't find a way to
> > > determine if the low-level PTEs are dirty. It is seen that the cost of
> > > doing so is not much compared the improvement, on both x86-64 and arm64.
> > 
> > I looked into the code more and noticed move_pte() helper called from
> > move_ptes(). It changes PTE entry to suite new address.
> > 
> > It is only defined in non-trivial way on Sparc. I don't know much about
> > Sparc and it's hard for me to say if the optimization will break anything
> > there.
> 
> Sparc's move_pte seems to be flushing the D-cache to prevent aliasing. It is
> not modifying the PTE itself AFAICS:
> 
> #ifdef DCACHE_ALIASING_POSSIBLE
> #define __HAVE_ARCH_MOVE_PTE
> #define move_pte(pte, prot, old_addr, new_addr) \
> ({  \
> pte_t newpte = (pte);   \
> if (tlb_type != hypervisor && pte_present(pte)) {   \
> unsigned long this_pfn = pte_pfn(pte);  \
> \
> if (pfn_valid(this_pfn) &&  \
> (((old_addr) ^ (new_addr)) & (1 << 13)))\
> flush_dcache_page_all(current->mm,  \
>   pfn_to_page(this_pfn));   \
> }   \
> newpte; \
> })
> #endif
> 
> If its an issue, then how do transparent huge pages work on Sparc?  I don't
> see the huge page code (move_huge_pages) during mremap doing anything special
> for Sparc architecture when moving PMDs..

My *guess* is that it will work fine on Sparc, as apparently it only
cares about a change in bit 13 of the virtual address. That will never happen
for huge pages or when PTE page tables move.

But I just realized that the problem is bigger: since we pass new_addr to
set_pte_at(), we would need to audit all implementations to make sure they are
safe with just moving the PTE page table.

I would rather go with per-architecture enabling. It's much safer.

> Also, do we not flush the caches from any path when we munmap address space?
> We do call do_munmap on the old mapping from mremap after moving to the new 
> one.

Are you sure about that? It can be hidden deeper in architecture-specific
code.

-- 
 Kirill A. Shutemov


Re: [PATCH v2 2/2] mm: speed up mremap by 500x on large regions

2018-10-12 Thread Kirill A. Shutemov
On Fri, Oct 12, 2018 at 02:30:56PM +0300, Kirill A. Shutemov wrote:
> On Thu, Oct 11, 2018 at 06:37:56PM -0700, Joel Fernandes (Google) wrote:
> > @@ -239,7 +287,21 @@ unsigned long move_page_tables(struct vm_area_struct 
> > *vma,
> > split_huge_pmd(vma, old_pmd, old_addr);
> > if (pmd_trans_unstable(old_pmd))
> > continue;
> > +   } else if (extent == PMD_SIZE) {
> 
> Hm. What guarantees that new_addr is PMD_SIZE-aligned?
> It's not obvious to me.

Ignore this :)

-- 
 Kirill A. Shutemov


Re: [PATCH v2 2/2] mm: speed up mremap by 500x on large regions

2018-10-12 Thread Kirill A. Shutemov
On Thu, Oct 11, 2018 at 06:37:56PM -0700, Joel Fernandes (Google) wrote:
> Android needs to mremap large regions of memory during memory management
> related operations. The mremap system call can be really slow if THP is
> not enabled. The bottleneck is move_page_tables, which is copying each
> pte at a time, and can be really slow across a large map. Turning on THP
> may not be a viable option, and is not for us. This patch speeds up the
> performance for non-THP system by copying at the PMD level when possible.
> 
> The speed up is three orders of magnitude. On a 1GB mremap, the mremap
> completion times drops from 160-250 millesconds to 380-400 microseconds.
> 
> Before:
> Total mremap time for 1GB data: 242321014 nanoseconds.
> Total mremap time for 1GB data: 196842467 nanoseconds.
> Total mremap time for 1GB data: 167051162 nanoseconds.
> 
> After:
> Total mremap time for 1GB data: 385781 nanoseconds.
> Total mremap time for 1GB data: 388959 nanoseconds.
> Total mremap time for 1GB data: 402813 nanoseconds.
> 
> Incase THP is enabled, the optimization is skipped. I also flush the
> tlb every time we do this optimization since I couldn't find a way to
> determine if the low-level PTEs are dirty. It is seen that the cost of
> doing so is not much compared the improvement, on both x86-64 and arm64.

I looked into the code more and noticed the move_pte() helper called from
move_ptes(). It changes the PTE entry to suit the new address.

It is only defined in a non-trivial way on Sparc. I don't know much about
Sparc and it's hard for me to say if the optimization will break anything
there.

I think it's worth disabling the optimization if __HAVE_ARCH_MOVE_PTE is
defined. Or make architectures state explicitly that the optimization is
safe.

> @@ -239,7 +287,21 @@ unsigned long move_page_tables(struct vm_area_struct 
> *vma,
>   split_huge_pmd(vma, old_pmd, old_addr);
>   if (pmd_trans_unstable(old_pmd))
>   continue;
> + } else if (extent == PMD_SIZE) {

Hm. What guarantees that new_addr is PMD_SIZE-aligned?
It's not obvious to me.

-- 
 Kirill A. Shutemov


Re: [PATCH v2 1/2] treewide: remove unused address argument from pte_alloc functions

2018-10-12 Thread Kirill A. Shutemov
On Thu, Oct 11, 2018 at 06:37:55PM -0700, Joel Fernandes (Google) wrote:
> diff --git a/arch/m68k/include/asm/mcf_pgalloc.h 
> b/arch/m68k/include/asm/mcf_pgalloc.h
> index 12fe700632f4..4399d712f6db 100644
> --- a/arch/m68k/include/asm/mcf_pgalloc.h
> +++ b/arch/m68k/include/asm/mcf_pgalloc.h
> @@ -12,8 +12,7 @@ extern inline void pte_free_kernel(struct mm_struct *mm, 
> pte_t *pte)
>  
>  extern const char bad_pmd_string[];
>  
> -extern inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm,
> - unsigned long address)
> +extern inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm)
>  {
>   unsigned long page = __get_free_page(GFP_DMA);
>  
> @@ -32,8 +31,6 @@ extern inline pmd_t *pmd_alloc_kernel(pgd_t *pgd, unsigned 
> long address)
>  #define pmd_alloc_one_fast(mm, address) ({ BUG(); ((pmd_t *)1); })
>  #define pmd_alloc_one(mm, address)  ({ BUG(); ((pmd_t *)2); })
>  
> -#define pte_alloc_one_fast(mm, addr) pte_alloc_one(mm, addr)
> -

I believe this one was done manually, right?
Please explicitly state everything you did on top of the semantic patch.

...

> diff --git a/arch/microblaze/include/asm/pgalloc.h 
> b/arch/microblaze/include/asm/pgalloc.h
> index 7c89390c0c13..f4cc9ffc449e 100644
> --- a/arch/microblaze/include/asm/pgalloc.h
> +++ b/arch/microblaze/include/asm/pgalloc.h
> @@ -108,10 +108,9 @@ static inline void free_pgd_slow(pgd_t *pgd)
>  #define pmd_alloc_one_fast(mm, address)  ({ BUG(); ((pmd_t *)1); })
>  #define pmd_alloc_one(mm, address)   ({ BUG(); ((pmd_t *)2); })
>  
> -extern pte_t *pte_alloc_one_kernel(struct mm_struct *mm, unsigned long addr);
> +extern pte_t *pte_alloc_one_kernel(struct mm_struct *mm);
>  
> -static inline struct page *pte_alloc_one(struct mm_struct *mm,
> - unsigned long address)
> +static inline struct page *pte_alloc_one(struct mm_struct *mm)
>  {
>   struct page *ptepage;
>  
> @@ -132,20 +131,6 @@ static inline struct page *pte_alloc_one(struct 
> mm_struct *mm,
>   return ptepage;
>  }
>  
> -static inline pte_t *pte_alloc_one_fast(struct mm_struct *mm,
> - unsigned long address)
> -{
> - unsigned long *ret;
> -
> - ret = pte_quicklist;
> - if (ret != NULL) {
> - pte_quicklist = (unsigned long *)(*ret);
> - ret[0] = 0;
> - pgtable_cache_size--;
> - }
> - return (pte_t *)ret;
> -}
> -

Ditto.

-- 
 Kirill A. Shutemov


Re: [Update] Regression in 4.18 - 32-bit PowerPC crashes on boot - bisected to commit 1d40a5ea01d5

2018-06-29 Thread Kirill A. Shutemov
On Fri, Jun 29, 2018 at 02:01:46PM -0700, Linus Torvalds wrote:
> On Fri, Jun 29, 2018 at 1:42 PM Larry Finger  
> wrote:
> >
> > I have more information regarding this BUG. Line 700 of page-flags.h is the
> > macro PAGE_TYPE_OPS(Table, table). For further debugging, I manually 
> > expanded
> > the macro, and found that the bug line is VM_BUG_ON_PAGE(!PageTable(page), 
> > page)
> > in routine __ClearPageTable(), which is called from pgtable_page_dtor() in
> > include/linux/mm.h. I also added a printk call to PageTable() that logs
> > page->page_type. The routine was called twice. The first had page_type of
> > 0xfbff, which would have been expected for a . The second call had
> > 0x, which led to the BUG.
> 
> So it looks to me like the tear-down of the page tables first found a
> page that is indeed a page table, and cleared the page table bit
> (well, it set it - the bits are reversed).
> 
> Then it took an exception (that "interrupt: 700") and that causes
> do_exit() again, and it tries to free the same page table - and now
> it's no longer marked as a page table, because it already went through
> the __ClearPageTable() dance once.
> 
> So on the second path through, it catches that "the bit already said
> it wasn't a page table" and does the BUG.
> 
> But the real question is what the problem was the *first* time around.

+Aneesh.

Looks like pgtable_page_dtor() gets called twice in the __pte_free_tlb() path:
once in __pte_free_tlb() itself and a second time in pgtable_free().

Would this help?

diff --git a/arch/powerpc/include/asm/book3s/32/pgalloc.h 
b/arch/powerpc/include/asm/book3s/32/pgalloc.h
index 6a6673907e45..e7a2f0e6b695 100644
--- a/arch/powerpc/include/asm/book3s/32/pgalloc.h
+++ b/arch/powerpc/include/asm/book3s/32/pgalloc.h
@@ -137,7 +137,6 @@ static inline void pgtable_free_tlb(struct mmu_gather *tlb,
 static inline void __pte_free_tlb(struct mmu_gather *tlb, pgtable_t table,
  unsigned long address)
 {
-   pgtable_page_dtor(table);
pgtable_free_tlb(tlb, page_address(table), 0);
 }
 #endif /* _ASM_POWERPC_BOOK3S_32_PGALLOC_H */
diff --git a/arch/powerpc/include/asm/nohash/32/pgalloc.h 
b/arch/powerpc/include/asm/nohash/32/pgalloc.h
index 1707781d2f20..30a13b80fd58 100644
--- a/arch/powerpc/include/asm/nohash/32/pgalloc.h
+++ b/arch/powerpc/include/asm/nohash/32/pgalloc.h
@@ -139,7 +139,6 @@ static inline void __pte_free_tlb(struct mmu_gather *tlb, 
pgtable_t table,
  unsigned long address)
 {
tlb_flush_pgtable(tlb, address);
-   pgtable_page_dtor(table);
pgtable_free_tlb(tlb, page_address(table), 0);
 }
 #endif /* _ASM_POWERPC_PGALLOC_32_H */
-- 
 Kirill A. Shutemov


Re: [PATCH] selftests/vm: Update max va test to check for high address return.

2018-02-28 Thread Kirill A. Shutemov
On Wed, Feb 28, 2018 at 09:28:30AM +0530, Aneesh Kumar K.V wrote:
> mmap(-1,..) is expected to search from max supported VA top down. It should 
> find
> an address above ADDR_SWITCH_HINT. Explicitly check for this.

Hm. I don't think this is correct. -1 means the application supports any
address, not restricted to the 47-bit address space. It doesn't mean the
application *requires* the address to be above 47 bits.

At least on x86, -1 just shifts the upper boundary of the address range where
we can look for an unmapped area.

-- 
 Kirill A. Shutemov


Re: [mainline][Memory off/on][83e3c48] kernel Oops with memory hot-unplug on ppc

2018-02-19 Thread Kirill A. Shutemov
On Mon, Feb 19, 2018 at 02:47:35PM +0100, Michal Hocko wrote:
> [CC Kirill - I have a vague recollection that there were some follow ups
>  for 83e3c48729d9 ("mm/sparsemem: Allocate mem_section at runtime for
>  CONFIG_SPARSEMEM_EXTREME=y"). Does any of them apply to this issue?]

All fixups are in v4.15.

> On Mon 05-02-18 19:54:24, Abdul Haleem wrote:
> > 
> > Greetings,
> > 
> > Kernel Oops seen when memory hot-unplug on powerpc mainline kernel.
> > 
> > Machine: Power6 PowerVM ppc64
> > Kernel: 4.15.0
> > Config: attached
> > gcc: 4.8.2
> > Test: Memory hot-unplug of a memory block
> > echo offline > /sys/devices/system/memory/memory/state
> > 
> > The faulty instruction address points to the code path:
> > 
> > # gdb -batch vmlinux -ex 'list *(0xc0238330)'
> > 0xc0238330 is in get_pfnblock_flags_mask
> > (./include/linux/mmzone.h:1157).
> > 1152#endif
> > 1153
> > 1154static inline struct mem_section *__nr_to_section(unsigned long 
> > nr)
> > 1155{
> > 1156#ifdef CONFIG_SPARSEMEM_EXTREME
> > 1157if (!mem_section)
> > 1158return NULL;
> > 1159#endif
> > 1160if (!mem_section[SECTION_NR_TO_ROOT(nr)])
> > 1161return NULL;
> > 
> > 
> > The code was first introduced with commit( 83e3c48: mm/sparsemem:
> > Allocate mem_section at runtime for CONFIG_SPARSEMEM_EXTREME=y)

Any chance to bisect it?

Could you check if the commit just before 83e3c48729d9 is fine?

-- 
 Kirill A. Shutemov


Re: [PATCH v6 00/24] Speculative page faults

2018-01-16 Thread Kirill A. Shutemov
On Fri, Jan 12, 2018 at 06:25:44PM +0100, Laurent Dufour wrote:
> --
> Benchmarks results
> 
> Base kernel is 4.15-rc6-mmotm-2018-01-04-16-19
> SPF is BASE + this series

Do you have THP=always here? Lack of THP support worries me.

What is the performance in the worst-case scenario? Like when we go far enough
into the speculative code path on every page fault and then fall back to the
normal page fault?

-- 
 Kirill A. Shutemov


Re: POWER: Unexpected fault when writing to brk-allocated memory

2017-11-08 Thread Kirill A. Shutemov
On Wed, Nov 08, 2017 at 03:56:06PM +1100, Michael Ellerman wrote:
> "Aneesh Kumar K.V" <aneesh.ku...@linux.vnet.ibm.com> writes:
> 
> >> 
> >> If it is decided to keep these kind of heuristics, can we get just a
> >> small but reasonably precise description of each change to the
> >> interface and ways for using the new functionality, such that would be
> >> suitable for the man page? I couldn't fix powerpc because nothing
> >> matches and even Aneesh and you differ on some details (MAP_FIXED
> >> behaviour).
> >
> >
> > I would consider MAP_FIXED as my mistake. We never discussed this 
> > explicitly and I kind of assumed it to behave the same way. ie, we 
> > search in lower address space (128TB) if the hint addr is below 128TB.
> >
> > IIUC we agree on the below.
> >
> > 1) MAP_FIXED allow the addr to be used, even if hint addr is below 128TB 
> > but hint_addr + len is > 128TB.
> 
> So:
>   mmap(0x7000, 0x2000, ..., MAP_FIXED ...) = 0x7000
> 
> > 2) For everything else we search in < 128TB space if hint addr is below 
> > 128TB
> 
>   mmap((x < 128T), 0x1000, ...) = (y < 128T)
>   ...
>   mmap(0x7000, 0x1000, ...) = 0x7000
>   mmap(0x8000, 0x1000, ...) = 0x8000
>   ...
>   mmap((x >= 128T), 0x1000, ...) = (y >= 128T)
> 
> > 3) We don't switch to large address space if hint_addr + len > 128TB. 
> > The decision to switch to large address space is primarily based on hint 
> > addr
> 
> But does the mmap succeed in that case or not?
> 
> ie:  mmap(0x7000, 0x2000, ...) = ?

It does, but the resulting address doesn't match the hint. It's somewhere
below the 47-bit border.

-- 
 Kirill A. Shutemov


Re: POWER: Unexpected fault when writing to brk-allocated memory

2017-11-07 Thread Kirill A. Shutemov
On Tue, Nov 07, 2017 at 07:15:58PM +0530, Aneesh Kumar K.V wrote:
> 
> > 
> > If it is decided to keep these kind of heuristics, can we get just a
> > small but reasonably precise description of each change to the
> > interface and ways for using the new functionality, such that would be
> > suitable for the man page? I couldn't fix powerpc because nothing
> > matches and even Aneesh and you differ on some details (MAP_FIXED
> > behaviour).
> 
> 
> I would consider MAP_FIXED as my mistake. We never discussed this explicitly
> and I kind of assumed it to behave the same way. ie, we search in lower
> address space (128TB) if the hint addr is below 128TB.
> 
> IIUC we agree on the below.
> 
> 1) MAP_FIXED allow the addr to be used, even if hint addr is below 128TB but
> hint_addr + len is > 128TB.
> 
> 2) For everything else we search in < 128TB space if hint addr is below
> 128TB
> 
> 3) We don't switch to large address space if hint_addr + len > 128TB. The
> decision to switch to large address space is primarily based on hint addr
> 
> Is there any other rule we need to outline? Or is any of the above not
> correct?

That's correct.

-- 
 Kirill A. Shutemov



Re: POWER: Unexpected fault when writing to brk-allocated memory

2017-11-07 Thread Kirill A. Shutemov
On Tue, Nov 07, 2017 at 02:05:42PM +0100, Florian Weimer wrote:
> On 11/07/2017 12:44 PM, Kirill A. Shutemov wrote:
> > On Tue, Nov 07, 2017 at 12:26:12PM +0100, Florian Weimer wrote:
> > > On 11/07/2017 12:15 PM, Kirill A. Shutemov wrote:
> > > 
> > > > > First of all, using addr and MAP_FIXED to develop our heuristic can
> > > > > never really give unchanged ABI. It's an in-band signal. brk() is a
> > > > > good example that steadily keeps incrementing address, so depending
> > > > > on malloc usage and address space randomization, you will get a brk()
> > > > > that ends exactly at 128T, then the next one will be >
> > > > > DEFAULT_MAP_WINDOW, and it will switch you to 56 bit address space.
> > > > 
> > > > No, it won't. You will hit stack first.
> > > 
> > > That's not actually true on POWER in some cases.  See the process maps I
> > > posted here:
> > > 
> > > <https://marc.info/?l=linuxppc-embedded&m=150988538106263&w=2>
> > 
> > Hm? I see that in all three cases the [stack] is the last mapping.
> > Do I miss something?
> 
> Hah, I had not noticed.  Occasionally, the order of heap and stack is
> reversed.  This happens in approximately 15% of the runs.

Heh. I guess ASLR on Power is too fancy :)

That's a strange layout. It doesn't give that much (relatively speaking)
virtual address space for both stack and heap to grow.

-- 
 Kirill A. Shutemov


Re: POWER: Unexpected fault when writing to brk-allocated memory

2017-11-07 Thread Kirill A. Shutemov
On Tue, Nov 07, 2017 at 10:56:36PM +1100, Nicholas Piggin wrote:
> > No, it won't. You will hit stack first.
> 
> I guess so. Florian's bug didn't crash there for some reason, okay
> but I suppose my point about brk is not exactly where the standard
> heap is, but the pattern of allocations. An allocator that uses
> mmap for managing its address space might do the same thing, e.g.,
> incrementally expand existing mmaps as necessary.

With MAP_FIXED? I don't think so.

> > > Second, the kernel can never completely solve the problem this way.
> > > How do we know a malloc library will not ask for > 128TB addresses
> > > and pass them to an unknowing application?  
> > 
> > The idea is that an application can provide hint (mallopt() ?) to malloc
> > implementation that it's ready to full address space. In this case, malloc
> > can use mmap((void *) -1,...) for its allocations and get full address
> > space this way.
> 
> Point is, there's nothing stopping an allocator library or runtime
> from asking for mmap anywhere and returning it to userspace.

Right. Nobody would stop it from doing stupid things. There are many
things that a library may do that an application would not be happy about.

> Do > 128TB pointers matter so much that we should add this heuristic
> to prevent breakage, but little enough that we can accept some rare
> cases getting through? Genuine question.

At the end of the day, what matters is whether the heuristic helps prevent
breakage of existing userspace and doesn't stand in the way of legitimate use
of the full address space.

So far, it looks okay to me.

> > The idea was we shouldn't allow to slip above 47-bits by accidentally.
> > 
> > Correctly functioning program would never request addr+len above 47-bit
> > with MAP_FIXED, unless it's ready to handle such addresses. Otherwise the
> > request would simply fail on machine that doesn't support large VA.
> > 
> > In contrast, addr+len above 47-bit without MAP_FIXED will not fail on
> > machine that doesn't support large VA, kernel will find another place
> > under 47-bit. And I can imagine a reasonable application that does
> > something like this.
> > 
> > So we cannot rely that application is ready to handle large
> > addresses if we see addr+len without MAP_FIXED.
> 
> By the same logic, a request for addr > 128TB without MAP_FIXED will
> not fail, therefore we can't rely on that either.
> 
> Or an app that links to a library that attempts MAP_FIXED allocation
> of addr + len above 128TB might use high bits of pointer returned by
> that library because those are never satisfied today and the library
> would fall back.

If you want to point out that it's an ABI break, then yes, it is.

But we allow ABI breaks as long as nobody notices. I think it's reasonable
to expect that nobody relies on such corner cases.

If we find any piece of software affected by the change, we will need
to reconsider.

-- 
 Kirill A. Shutemov


Re: POWER: Unexpected fault when writing to brk-allocated memory

2017-11-07 Thread Kirill A. Shutemov
On Tue, Nov 07, 2017 at 12:26:12PM +0100, Florian Weimer wrote:
> On 11/07/2017 12:15 PM, Kirill A. Shutemov wrote:
> 
> > > First of all, using addr and MAP_FIXED to develop our heuristic can
> > > never really give unchanged ABI. It's an in-band signal. brk() is a
> > > good example that steadily keeps incrementing address, so depending
> > > on malloc usage and address space randomization, you will get a brk()
> > > that ends exactly at 128T, then the next one will be >
> > > DEFAULT_MAP_WINDOW, and it will switch you to 56 bit address space.
> > 
> > No, it won't. You will hit stack first.
> 
> That's not actually true on POWER in some cases.  See the process maps I
> posted here:
> 
>   <https://marc.info/?l=linuxppc-embedded=150988538106263=2>

Hm? I see that in all three cases the [stack] is the last mapping.
Do I miss something?

-- 
 Kirill A. Shutemov


Re: POWER: Unexpected fault when writing to brk-allocated memory

2017-11-07 Thread Kirill A. Shutemov
On Tue, Nov 07, 2017 at 09:15:21AM +0100, Florian Weimer wrote:
> MAP_FIXED is near-impossible to use correctly.  I hope you don't expect
> applications to do that.  If you want address-based opt in, it should work
> without MAP_FIXED.  Sure, in obscure cases, applications might still see
> out-of-range addresses, but I expected a full opt-out based on RLIMIT_AS
> would be sufficient for them.

Just use mmap(-1) without MAP_FIXED to get the full address space.

-- 
 Kirill A. Shutemov


Re: POWER: Unexpected fault when writing to brk-allocated memory

2017-11-07 Thread Kirill A. Shutemov
On Tue, Nov 07, 2017 at 04:07:05PM +1100, Nicholas Piggin wrote:
> C'ing everyone who was on the x86 56-bit user virtual address patch.
> 
> I think we need more time to discuss this behaviour, in light of the
> regression Florian uncovered. I would propose we turn off the 56-bit
> user virtual address support for x86 for 4.14, and powerpc would
> follow and turn off its 512T support until we can get a better handle
> on the problems. (Actually Florian initially hit a couple of bugs in
> powerpc implementation, but pulling that string uncovers a whole lot
> of difficulties.)
> 
> The bi-modal behavior switched based on a combination of mmap address
> hint and MAP_FIXED just sucks. It's segregating our VA space with
> some non-standard heuristics, and it doesn't seem to work very well.
> 
> What are we trying to do? Allow SAP HANA etc use huge address spaces
> by coding to these specific mmap heuristics we're going to add,
> rather than solving it properly in a way that requires adding a new
> syscall or personality or prctl or sysctl. Okay, but the cost is that
> despite best efforts, it still changes ABI behaviour for existing
> applications and these heuristics will become baked into the ABI that
> we will have to support. Not a good tradeoff IMO.
> 
> First of all, using addr and MAP_FIXED to develop our heuristic can
> never really give unchanged ABI. It's an in-band signal. brk() is a
> good example that steadily keeps incrementing address, so depending
> on malloc usage and address space randomization, you will get a brk()
> that ends exactly at 128T, then the next one will be >
> DEFAULT_MAP_WINDOW, and it will switch you to 56 bit address space.

No, it won't. You will hit stack first.

> Second, the kernel can never completely solve the problem this way.
> How do we know a malloc library will not ask for > 128TB addresses
> and pass them to an unknowing application?

The idea is that an application can provide a hint (mallopt()?) to the malloc
implementation that it's ready to use the full address space. In this case,
malloc can use mmap((void *) -1, ...) for its allocations and get the full
address space that way.
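
As a rough illustration, here is a userspace sketch of that idea. The
mallopt() knob is purely hypothetical (no such option exists today); only
the mmap() hint part reflects the proposed kernel behaviour, mirroring the
mmap((void *) -1, ...) suggestion above.

#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 1UL << 30;

	/* hypothetical opt-in knob, does not exist: */
	/* mallopt(M_FULL_ADDRESS_SPACE, 1); */

	/*
	 * A hint above the default 47-bit window (here simply -1) tells the
	 * kernel the caller can handle large addresses, so the mapping may
	 * land above 128TB.  Without such a hint the kernel keeps
	 * allocations below the 47-bit boundary.
	 */
	void *p = mmap((void *) -1UL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	printf("mapped at %p\n", p);
	return 0;
}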

> And lastly, there are a fair few bugs and places where description
> in changelogs and mailing lists does not match code. You don't want
> to know the mess in powerpc, but even x86 has two I can see:
> MAP_FIXED succeeds even when crossing 128TB addresses (where changelog
> indicated it should not),

Hm. I don't see where the changelog indicated that MAP_FIXED across 128TB
shouldn't work. My intention was that it should, although I haven't stated
it in the changelog.

The idea was that we shouldn't let applications slip above 47 bits accidentally.

A correctly functioning program would never request addr+len above 47 bits
with MAP_FIXED unless it's ready to handle such addresses: on a machine that
doesn't support large VA, the request would simply fail.

In contrast, addr+len above 47 bits without MAP_FIXED will not fail on a
machine that doesn't support large VA; the kernel will just find another
place below 47 bits. And I can imagine a reasonable application doing
something like this.

So we cannot conclude that an application is ready to handle large
addresses just because we see addr+len above 47 bits without MAP_FIXED.
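
To spell that distinction out as code (an illustrative sketch only; the
constants assume x86-64 with 56-bit virtual addresses):

#include <sys/mman.h>

#define BIT47 (1UL << 47)

/* addr + len crosses the 47-bit boundary */
static void *cross_example(int use_fixed)
{
	void *hint = (void *)(BIT47 - (2UL << 20));
	size_t len = 4UL << 20;
	int flags  = MAP_PRIVATE | MAP_ANONYMOUS | (use_fixed ? MAP_FIXED : 0);

	/*
	 * use_fixed == 1: on a machine without large VA this request simply
	 * fails, so a program issuing it must already cope with >47-bit
	 * addresses; honouring it on large-VA machines is therefore safe.
	 *
	 * use_fixed == 0: on a machine without large VA the kernel silently
	 * places the mapping somewhere below 47 bits, so the request tells
	 * us nothing about the program, and the kernel keeps it below the
	 * boundary on large-VA machines too.
	 */
	return mmap(hint, len, PROT_READ | PROT_WRITE, flags, -1, 0);
}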

> arch_get_unmapped_area_topdown() with an address hint is checking
> against TASK_SIZE rather than the limited 128TB address, so it looks
> like it won't follow the heuristics.

You are right, this is broken: if a user requests a mapping above the vdso
but below DEFAULT_MAP_WINDOW, it will succeed.

I'll send a patch to fix this. But it doesn't look like a show-stopper to me.

Re-checking things for this reply, I found an actual bug, see:

http://lkml.kernel.org/r/20171107103804.47341-1-kirill.shute...@linux.intel.com

> So unless everyone else thinks I'm crazy and disagrees, I'd ask for
> a bit more time to make sure we get this interface right. I would
> hope for something like prctl PR_SET_MM which can be used to set
> our user virtual address bits on a fine grained basis. Maybe a
> sysctl, maybe a personality. Something out-of-band. I don't wan to
> get too far into that discussion yet. First we need to agree whether
> or not the code in the tree today is a problem.

Well, we've already discussed all the options you are proposing.
Linus wanted a minimalistic interface, so we took this path for now.
We can always add more ways to get access to the full address space later.

-- 
 Kirill A. Shutemov


Re: [PATCH] mm: remove unnecessary WARN_ONCE in page_vma_mapped_walk().

2017-10-03 Thread Kirill A. Shutemov
On Tue, Oct 03, 2017 at 10:26:06AM -0400, Zi Yan wrote:
> From: Zi Yan <zi@cs.rutgers.edu>
> 
> A non present pmd entry can appear after pmd_lock is taken in
> page_vma_mapped_walk(), even if THP migration is not enabled.
> The WARN_ONCE is unnecessary.
> 
> Fixes: 616b8371539a ("mm: thp: enable thp migration in generic path")
> Reported-and-tested-by: Abdul Haleem <abdha...@linux.vnet.ibm.com>
> Signed-off-by: Zi Yan <zi@cs.rutgers.edu>
> Cc: "Kirill A. Shutemov" <kirill.shute...@linux.intel.com>
> Cc: Anshuman Khandual <khand...@linux.vnet.ibm.com>
> Cc: Andrew Morton <a...@linux-foundation.org>

Acked-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>

-- 
 Kirill A. Shutemov


Re: [PATCH v2 14/20] mm: Provide speculative fault infrastructure

2017-08-26 Thread Kirill A. Shutemov
Do we need to look at how we can hand over from the speculative to the
non-speculative path without starting from scratch (when possible)?

> + /* Transparent huge pages are not supported. */
> + if (unlikely(pmd_trans_huge(*pmd)))
> + goto out_walk;

That looks like a blocker to me.

Is there any problem with making it supported (besides the plain coding work)?

> +
> + vmf.vma = vma;
> + vmf.pmd = pmd;
> + vmf.pgoff = linear_page_index(vma, address);
> + vmf.gfp_mask = __get_fault_gfp_mask(vma);
> + vmf.sequence = seq;
> + vmf.flags = flags;
> +
> + local_irq_enable();
> +
> + /*
> +  * We need to re-validate the VMA after checking the bounds, otherwise
> +  * we might have a false positive on the bounds.
> +  */
> + if (read_seqcount_retry(>vm_sequence, seq))
> + goto unlock;
> +
> + ret = handle_pte_fault();
> +
> +unlock:
> + srcu_read_unlock(_srcu, idx);
> + return ret;
> +
> +out_walk:
> + local_irq_enable();
> + goto unlock;
> +}
> +#endif /* __HAVE_ARCH_CALL_SPF */
> +
>  /*
>   * By the time we get here, we already hold the mm semaphore
>   *
> -- 
> 2.7.4
> 

-- 
 Kirill A. Shutemov


Re: [PATCH 05/16] mm: Protect VMA modifications using VMA sequence count

2017-08-10 Thread Kirill A. Shutemov
On Thu, Aug 10, 2017 at 10:27:50AM +0200, Laurent Dufour wrote:
> On 10/08/2017 02:58, Kirill A. Shutemov wrote:
> > On Wed, Aug 09, 2017 at 12:43:33PM +0200, Laurent Dufour wrote:
> >> On 09/08/2017 12:12, Kirill A. Shutemov wrote:
> >>> On Tue, Aug 08, 2017 at 04:35:38PM +0200, Laurent Dufour wrote:
> >>>> The VMA sequence count has been introduced to allow fast detection of
> >>>> VMA modification when running a page fault handler without holding
> >>>> the mmap_sem.
> >>>>
> >>>> This patch provides protection agains the VMA modification done in :
> >>>>  - madvise()
> >>>>  - mremap()
> >>>>  - mpol_rebind_policy()
> >>>>  - vma_replace_policy()
> >>>>  - change_prot_numa()
> >>>>  - mlock(), munlock()
> >>>>  - mprotect()
> >>>>  - mmap_region()
> >>>>  - collapse_huge_page()
> >>>
> >>> I don't thinks it's anywhere near complete list of places where we touch
> >>> vm_flags. What is your plan for the rest?
> >>
> >> The goal is only to protect places where change to the VMA is impacting the
> >> page fault handling. If you think I missed one, please advise.
> > 
> > That's very fragile approach. We rely here too much on specific compiler 
> > behaviour.
> > 
> > Any write access to vm_flags can, in theory, be translated to several
> > write accesses. For instance with setting vm_flags to 0 in the middle,
> > which would result in sigfault on page fault to the vma.
> 
> Indeed, just setting vm_flags to 0 will not result in sigfault, the real
> job is done when the pte are updated and the bits allowing access are
> cleared. Access to the pte is controlled by the pte lock.
> Page fault handler is triggered based on the pte bits, not the content of
> vm_flags and the speculative page fault is checking for the vma again once
> the pte lock is held. So there is no concurrency when dealing with the pte
> bits.

Suppose we get a page fault on a readable VMA and the pte is clear at the
time of the fault. In this case we need to consult vm_flags to check whether
the vma is read-accessible.

If, by the time of that check, vm_flags happened to be '0', we would get
SIGSEGV as the vma appears to be non-readable.

Where is my logic faulty?

> Regarding the compiler behaviour, there are memory barriers and locking
> which should prevent that.

Which locks and barriers are you talking about?

We need at least READ_ONCE/WRITE_ONCE to access vm_flags everywhere.
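
Something along these lines is what I mean; a minimal sketch only, and the
helper names are made up:

/* writer side, under mmap_sem (and vm_sequence) */
static inline void spf_set_vm_flags(struct vm_area_struct *vma,
				    unsigned long flags)
{
	/* a single store the compiler is not allowed to tear */
	WRITE_ONCE(vma->vm_flags, flags);
}

/* speculative fault path, no mmap_sem held */
static inline bool spf_vma_readable(struct vm_area_struct *vma)
{
	return READ_ONCE(vma->vm_flags) & VM_READ;
}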

-- 
 Kirill A. Shutemov


Re: [PATCH 05/16] mm: Protect VMA modifications using VMA sequence count

2017-08-09 Thread Kirill A. Shutemov
On Wed, Aug 09, 2017 at 12:43:33PM +0200, Laurent Dufour wrote:
> On 09/08/2017 12:12, Kirill A. Shutemov wrote:
> > On Tue, Aug 08, 2017 at 04:35:38PM +0200, Laurent Dufour wrote:
> >> The VMA sequence count has been introduced to allow fast detection of
> >> VMA modification when running a page fault handler without holding
> >> the mmap_sem.
> >>
> >> This patch provides protection agains the VMA modification done in :
> >>- madvise()
> >>- mremap()
> >>- mpol_rebind_policy()
> >>- vma_replace_policy()
> >>- change_prot_numa()
> >>- mlock(), munlock()
> >>- mprotect()
> >>- mmap_region()
> >>- collapse_huge_page()
> > 
> > I don't thinks it's anywhere near complete list of places where we touch
> > vm_flags. What is your plan for the rest?
> 
> The goal is only to protect places where change to the VMA is impacting the
> page fault handling. If you think I missed one, please advise.

That's a very fragile approach. We rely too much here on specific compiler
behaviour.

Any write access to vm_flags can, in theory, be translated into several
write accesses, for instance with vm_flags transiently set to 0 in the
middle, which would result in a spurious SIGSEGV on a page fault against
the vma.

Nothing (apart from common sense) prevents the compiler from generating
this kind of pattern.
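
To make the concern concrete, this is the kind of (perfectly legal)
translation I have in mind; an illustration, not the output of any
particular compiler:

/* C source, e.g. in mprotect_fixup(): */
vma->vm_flags = newflags;

/*
 * One possible translation of the plain store is to build the value
 * in place:
 *
 *	vma->vm_flags = 0;		<- transient state
 *	vma->vm_flags |= VM_READ;
 *	vma->vm_flags |= VM_WRITE;
 *	...
 *
 * A lockless reader (such as the speculative fault path) that samples
 * vm_flags between those stores sees a VMA that allows nothing and
 * raises a spurious SIGSEGV.
 */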

-- 
 Kirill A. Shutemov


Re: [PATCH 05/16] mm: Protect VMA modifications using VMA sequence count

2017-08-09 Thread Kirill A. Shutemov
On Tue, Aug 08, 2017 at 04:35:38PM +0200, Laurent Dufour wrote:
> The VMA sequence count has been introduced to allow fast detection of
> VMA modification when running a page fault handler without holding
> the mmap_sem.
> 
> This patch provides protection agains the VMA modification done in :
>   - madvise()
>   - mremap()
>   - mpol_rebind_policy()
>   - vma_replace_policy()
>   - change_prot_numa()
>   - mlock(), munlock()
>   - mprotect()
>   - mmap_region()
>   - collapse_huge_page()

I don't think it's anywhere near a complete list of places where we touch
vm_flags. What is your plan for the rest?
-- 
 Kirill A. Shutemov


Re: [PATCH 02/16] mm: Prepare for FAULT_FLAG_SPECULATIVE

2017-08-09 Thread Kirill A. Shutemov
On Tue, Aug 08, 2017 at 04:35:35PM +0200, Laurent Dufour wrote:
> @@ -2295,7 +2302,11 @@ static int wp_page_copy(struct vm_fault *vmf)
>   /*
>* Re-check the pte - we dropped the lock
>*/
> - vmf->pte = pte_offset_map_lock(mm, vmf->pmd, vmf->address, >ptl);
> + if (!pte_map_lock(vmf)) {
> + mem_cgroup_cancel_charge(new_page, memcg, false);
> + ret = VM_FAULT_RETRY;
> + goto oom_free_new;

With this change, the label is misleading.

> + }
>   if (likely(pte_same(*vmf->pte, vmf->orig_pte))) {
>   if (old_page) {
>   if (!PageAnon(old_page)) {

-- 
 Kirill A. Shutemov


Re: [RFC PATCH 1/3] powerpc/mm: update pmdp_invalidate to return old pmd value

2017-07-27 Thread Kirill A. Shutemov
On Thu, Jul 27, 2017 at 02:54:49PM +0200, Michal Hocko wrote:
> EMISSING_CHANGELOG
> 
> besides that no user actually uses the return value. Please fold this
> into the patch which uses the new functionality.

That's for a patchset I'm working on [1].

[1] 
http://lkml.kernel.org/r/20170615145224.66200-1-kirill.shute...@linux.intel.com

-- 
 Kirill A. Shutemov


Re: 5-level pagetable patches break ppc64le

2017-03-13 Thread Kirill A. Shutemov
On Mon, Mar 13, 2017 at 09:05:50PM +1100, Anton Blanchard wrote:
> Hi,
> 
> My ppc64le boot tests stopped working as of commit c2febafc6773 ("mm:
> convert generic code to 5-level paging")
> 
> We hang part way during boot, just before bringing up the network. I
> haven't had a chance to narrow it down yet.

Please check whether the patch at this link helps:

http://lkml.kernel.org/r/20170313052213.11411-1-kirill.shute...@linux.intel.com

-- 
 Kirill A. Shutemov


Re: [PATCH] mm: stop leaking PageTables

2017-01-08 Thread Kirill A. Shutemov
On Sat, Jan 07, 2017 at 03:37:31PM -0800, Hugh Dickins wrote:
> 4.10-rc loadtest (even on x86, even without THPCache) fails with
> "fork: Cannot allocate memory" or some such; and /proc/meminfo
> shows PageTables growing.
> 
> rc1 removed the freeing of an unused preallocated pagetable after
> do_fault_around() has called map_pages(): which is usually a good
> optimization, so that the followup doesn't have to reallocate one;
> but it's not sufficient to shift the freeing into alloc_set_pte(),
> since there are failure cases (most commonly VM_FAULT_RETRY) which
> never reach finish_fault().
> 
> Check and free it at the outer level in do_fault(), then we don't
> need to worry in alloc_set_pte(), and can restore that to how it was
> (I cannot find any reason to pte_free() under lock as it was doing).
> 
> And fix a separate pagetable leak, or crash, introduced by the same
> change, that could only show up on some ppc64: why does do_set_pmd()'s
> failure case attempt to withdraw a pagetable when it never deposited
> one, at the same time overwriting (so leaking) the vmf->prealloc_pte?
> Residue of an earlier implementation, perhaps?  Delete it.
> 
> Fixes: 953c66c2b22a ("mm: THP page cache support for ppc64")
> Signed-off-by: Hugh Dickins <hu...@google.com>

Sorry that I missed this initially.

Acked-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>

-- 
 Kirill A. Shutemov


Re: [RFC 1/4] mm: remove unused TASK_SIZE_OF()

2017-01-02 Thread Kirill A. Shutemov
On Fri, Dec 30, 2016 at 06:56:31PM +0300, Dmitry Safonov wrote:
> All users of TASK_SIZE_OF(tsk) have migrated to mm->task_size or
> TASK_SIZE_MAX since:
> commit d696ca016d57 ("x86/fsgsbase/64: Use TASK_SIZE_MAX for
> FSBASE/GSBASE upper limits"),
> commit a06db751c321 ("pagemap: check permissions and capabilities at
> open time"),
> 
> Cc: Catalin Marinas <catalin.mari...@arm.com>
> Cc: Will Deacon <will.dea...@arm.com>
> Cc: Ralf Baechle <r...@linux-mips.org>
> Cc: "James E.J. Bottomley" <j...@parisc-linux.org>
> Cc: Helge Deller <del...@gmx.de>
> Cc: Benjamin Herrenschmidt <b...@kernel.crashing.org>
> Cc: Paul Mackerras <pau...@samba.org>
> Cc: Michael Ellerman <m...@ellerman.id.au>
> Cc: Martin Schwidefsky <schwidef...@de.ibm.com>
> Cc: Heiko Carstens <heiko.carst...@de.ibm.com>
> Cc: "David S. Miller" <da...@davemloft.net>
> Cc: Peter Zijlstra <pet...@infradead.org>
> Cc: linux-arm-ker...@lists.infradead.org
> Cc: linux-m...@linux-mips.org
> Cc: linux-par...@vger.kernel.org
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: linux-s...@vger.kernel.org
> Cc: sparcli...@vger.kernel.org
> Signed-off-by: Dmitry Safonov <dsafo...@virtuozzo.com>

I've noticed this too.

Acked-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>

-- 
 Kirill A. Shutemov


Re: [PATCH V3 2/2] mm: THP page cache support for ppc64

2016-11-13 Thread Kirill A. Shutemov
On Sun, Nov 13, 2016 at 08:30:25PM +0530, Aneesh Kumar K.V wrote:
> Add arch specific callback in the generic THP page cache code that will
> deposit and withdarw preallocated page table. Archs like ppc64 use
> this preallocated table to store the hash pte slot information.
> 
> Testing:
> kernel build of the patch series on tmpfs mounted with option huge=always
> 
> The related thp stat:
> thp_fault_alloc 72939
> thp_fault_fallback 60547
> thp_collapse_alloc 603
> thp_collapse_alloc_failed 0
> thp_file_alloc 253763
> thp_file_mapped 4251
> thp_split_page 51518
> thp_split_page_failed 1
> thp_deferred_split_page 73566
> thp_split_pmd 665
> thp_zero_page_alloc 3
> thp_zero_page_alloc_failed 0
> 
> Signed-off-by: Aneesh Kumar K.V <aneesh.ku...@linux.vnet.ibm.com>

One nit-pick below, but otherwise

Acked-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>

> @@ -2975,6 +3004,13 @@ static int do_set_pmd(struct fault_env *fe, struct 
> page *page)
>   ret = 0;
>   count_vm_event(THP_FILE_MAPPED);
>  out:
> + /*
> +  * If we are going to fallback to pte mapping, do a
> +  * withdraw with pmd lock held.
> +  */
> + if (arch_needs_pgtable_deposit() && (ret == VM_FAULT_FALLBACK))

Parentheses are redundant around the ret check.

> + fe->prealloc_pte = pgtable_trans_huge_withdraw(vma->vm_mm,
> +        fe->pmd);
>   spin_unlock(fe->ptl);
>   return ret;
>  }

-- 
 Kirill A. Shutemov


Re: [PATCH 2/2] mm: THP page cache support for ppc64

2016-11-11 Thread Kirill A. Shutemov
On Fri, Nov 11, 2016 at 05:42:11PM +0530, Aneesh Kumar K.V wrote:
> "Kirill A. Shutemov" <kir...@shutemov.name> writes:
> 
> > On Mon, Nov 07, 2016 at 02:04:41PM +0530, Aneesh Kumar K.V wrote:
> >> @@ -2953,6 +2966,13 @@ static int do_set_pmd(struct fault_env *fe, struct 
> >> page *page)
> >>ret = VM_FAULT_FALLBACK;
> >>page = compound_head(page);
> >>  
> >> +  /*
> >> +   * Archs like ppc64 need additonal space to store information
> >> +   * related to pte entry. Use the preallocated table for that.
> >> +   */
> >> +  if (arch_needs_pgtable_deposit() && !fe->prealloc_pte)
> >> +  fe->prealloc_pte = pte_alloc_one(vma->vm_mm, fe->address);
> >> +
> >
> > -ENOMEM handling?
> 
> How about
> 
>   if (arch_needs_pgtable_deposit() && !fe->prealloc_pte) {
>   fe->prealloc_pte = pte_alloc_one(vma->vm_mm, fe->address);
>   if (!fe->prealloc_pte)
>   return VM_FAULT_OOM;
>   }
> 
> 
> 
> >
> > I think we should do this way before this point. Maybe in do_fault() or
> > something.
> 
> doing this in do_set_pmd keeps this closer to where we set the pmd. Any
> reason you thing we should move it higher up the stack. We already do
> pte_alloc() at the same level for a non transhuge case in
> alloc_set_pte().

I vaguely remember Hugh mentioning a deadlock between allocation under the
page lock and the OOM killer (or something else?).

If the deadlock is still there, fixing it would be a matter of making the
preallocation unconditional.

But what you propose doesn't make the situation any worse. I'm fine with
that.

-- 
 Kirill A. Shutemov


Re: [PATCH 2/2] mm: THP page cache support for ppc64

2016-11-11 Thread Kirill A. Shutemov
On Mon, Nov 07, 2016 at 02:04:41PM +0530, Aneesh Kumar K.V wrote:
> @@ -2953,6 +2966,13 @@ static int do_set_pmd(struct fault_env *fe, struct 
> page *page)
>   ret = VM_FAULT_FALLBACK;
>   page = compound_head(page);
>  
> + /*
> +  * Archs like ppc64 need additonal space to store information
> +  * related to pte entry. Use the preallocated table for that.
> +  */
> + if (arch_needs_pgtable_deposit() && !fe->prealloc_pte)
> + fe->prealloc_pte = pte_alloc_one(vma->vm_mm, fe->address);
> +

-ENOMEM handling?

I think we should do this much earlier than this point. Maybe in do_fault()
or something.
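
Roughly what I mean, as an untested sketch (field names follow the
fault_env used in this series):

/* at the top of do_fault(), before we take any locks: */
if (arch_needs_pgtable_deposit() && !fe->prealloc_pte) {
	fe->prealloc_pte = pte_alloc_one(fe->vma->vm_mm, fe->address);
	if (!fe->prealloc_pte)
		return VM_FAULT_OOM;
}

That way returning VM_FAULT_OOM is trivial, instead of having to unwind
from deep inside do_set_pmd() with the page locked.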

-- 
 Kirill A. Shutemov


Re: [PATCH V2] mm: move vma_is_anonymous check within pmd_move_must_withdraw

2016-11-11 Thread Kirill A. Shutemov
On Mon, Nov 07, 2016 at 08:16:39PM +0530, Aneesh Kumar K.V wrote:
> Architectures like ppc64 want to use page table deposit/withraw
> even with huge pmd dax entries. Allow arch to override the
> vma_is_anonymous check by moving that to pmd_move_must_withdraw
> function
> 
> Signed-off-by: Aneesh Kumar K.V <aneesh.ku...@linux.vnet.ibm.com>

Acked-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>

-- 
 Kirill A. Shutemov


Re: [RFC PATCH] powerpc/mm: THP page cache support

2016-09-26 Thread Kirill A. Shutemov
On Thu, Sep 22, 2016 at 09:32:40PM +0530, Aneesh Kumar K.V wrote:
> Update arch hook in the generic THP page cache code, that will
> deposit and withdarw preallocated page table. Archs like ppc64 use
> this preallocated table to store the hash pte slot information.
> 
> This is an RFC patch and I am sharing this early to get feedback on the
> approach taken. I have used stress-ng mmap-file operation and that
> resulted in some thp_file_mmap as show below.
> 
> [/mnt/stress]$ grep thp_file /proc/vmstat
> thp_file_alloc 25403
> thp_file_mapped 16967
> [/mnt/stress]$
> 
> I did observe wrong nr_ptes count once. I need to recreate the problem
> again.

I don't see anything that could cause that.

The patch looks good to me (apart from the nr_ptes issue). A few minor
nitpicks below.

> Signed-off-by: Aneesh Kumar K.V <aneesh.ku...@linux.vnet.ibm.com>
> ---
>  arch/powerpc/include/asm/book3s/64/pgtable.h |  3 ++
>  include/asm-generic/pgtable.h|  8 +++-
>  mm/Kconfig   |  6 +--
>  mm/huge_memory.c | 19 +-
>  mm/khugepaged.c  | 21 ++-
>  mm/memory.c  | 56 
> +++-
>  6 files changed, 93 insertions(+), 20 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h 
> b/arch/powerpc/include/asm/book3s/64/pgtable.h
> index 263bf39ced40..1f45b06ce78e 100644
> --- a/arch/powerpc/include/asm/book3s/64/pgtable.h
> +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
> @@ -1017,6 +1017,9 @@ static inline int pmd_move_must_withdraw(struct 
> spinlock *new_pmd_ptl,
>*/
>   return true;
>  }
> +
> +#define arch_needs_pgtable_deposit() (true)
> +
>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>  #endif /* __ASSEMBLY__ */
>  #endif /* _ASM_POWERPC_BOOK3S_64_PGTABLE_H_ */
> diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
> index d4458b6dbfb4..0d1e400e82a2 100644
> --- a/include/asm-generic/pgtable.h
> +++ b/include/asm-generic/pgtable.h
> @@ -660,11 +660,17 @@ static inline int pmd_move_must_withdraw(spinlock_t 
> *new_pmd_ptl,
>   /*
>* With split pmd lock we also need to move preallocated
>* PTE page table if new_pmd is on different PMD page table.
> +  *
> +  * We also don't deposit and withdraw tables for file pages.
>*/
> - return new_pmd_ptl != old_pmd_ptl;
> + return (new_pmd_ptl != old_pmd_ptl) && vma_is_anonymous(vma);
>  }
>  #endif
>  
> +#ifndef arch_needs_pgtable_deposit
> +#define arch_needs_pgtable_deposit() (false)
> +#endif
> +
>  /*
>   * This function is meant to be used by sites walking pagetables with
>   * the mmap_sem hold in read mode to protect against MADV_DONTNEED and
> diff --git a/mm/Kconfig b/mm/Kconfig
> index be0ee11fa0d9..0a279d399722 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -447,13 +447,9 @@ choice
> benefit.
>  endchoice
>  
> -#
> -# We don't deposit page tables on file THP mapping,
> -# but Power makes use of them to address MMU quirk.
> -#
>  config   TRANSPARENT_HUGE_PAGECACHE
>   def_bool y
> - depends on TRANSPARENT_HUGEPAGE && !PPC
> + depends on TRANSPARENT_HUGEPAGE
>  
>  #
>  # UP and nommu archs use km based percpu allocator
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index a6abd76baa72..37176f455d16 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1320,6 +1320,14 @@ out_unlocked:
>   return ret;
>  }
>  
> +void zap_deposited_table(struct mm_struct *mm, pmd_t *pmd)

static?

> +{
> + pgtable_t pgtable;
> + pgtable = pgtable_trans_huge_withdraw(mm, pmd);
> + pte_free(mm, pgtable);
> + atomic_long_dec(>nr_ptes);
> +}
> +
>  int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>pmd_t *pmd, unsigned long addr)
>  {
> @@ -1359,6 +1367,8 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct 
> vm_area_struct *vma,
>   atomic_long_dec(>mm->nr_ptes);
>   add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
>   } else {
> + if (arch_needs_pgtable_deposit())

Just hide the arch_needs_pgtable_deposit() check inside zap_deposited_table()
(see the sketch after the quoted hunk).

> + zap_deposited_table(tlb->mm, pmd);
>   add_mm_counter(tlb->mm, MM_FILEPAGES, -HPAGE_PMD_NR);
>   }
>   spin_unlock(ptl);
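
To be explicit about the two nitpicks, something like this (untested
sketch) is what I mean:

static void zap_deposited_table(struct mm_struct *mm, pmd_t *pmd)
{
	pgtable_t pgtable;

	if (!arch_needs_pgtable_deposit())
		return;

	pgtable = pgtable_trans_huge_withdraw(mm, pmd);
	pte_free(mm, pgtable);
	atomic_long_dec(&mm->nr_ptes);
}

and then the file-THP branch in zap_huge_pmd() can call
zap_deposited_table(tlb->mm, pmd) unconditionally (the anon branch keeps
its own unconditional withdraw).
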
-- 
 Kirill A. Shutemov


Re: [BUG] random kernel crashes after THP rework on s390 (maybe also on PowerPC and ARM)

2016-02-25 Thread Kirill A. Shutemov
On Thu, Feb 25, 2016 at 03:49:33PM +, Steve Capper wrote:
> On 23 February 2016 at 18:47, Will Deacon <will.dea...@arm.com> wrote:
> > [adding Steve, since he worked on THP for 32-bit ARM]
> 
> Apologies for my late reply...
> 
> >
> > On Tue, Feb 23, 2016 at 07:19:07PM +0100, Gerald Schaefer wrote:
> >> On Tue, 23 Feb 2016 13:32:21 +0300
> >> "Kirill A. Shutemov" <kir...@shutemov.name> wrote:
> >> > The theory is that the splitting bit effetely masked bogus pmd_present():
> >> > we had pmd_trans_splitting() in all code path and that prevented mm from
> >> > touching the pmd. Once pmd_trans_splitting() has gone, mm proceed with 
> >> > the
> >> > pmd where it shouldn't and here's a boom.
> >>
> >> Well, I don't think pmd_present() == true is bogus for a trans_huge pmd 
> >> under
> >> splitting, after all there is a page behind the the pmd. Also, if it was
> >> bogus, and it would need to be false, why should it be marked 
> >> !pmd_present()
> >> only at the pmdp_invalidate() step before the pmd_populate()? It clearly
> >> is pmd_present() before that, on all architectures, and if there was any
> >> problem/race with that, setting it to !pmd_present() at this stage would
> >> only (marginally) reduce the race window.
> >>
> >> BTW, PowerPC and Sparc seem to do the same thing in pmdp_invalidate(),
> >> i.e. they do not set pmd_present() == false, only mark it so that it would
> >> not generate a new TLB entry, just like on s390. After all, the function
> >> is called pmdp_invalidate(), and I think the comment in mm/huge_memory.c
> >> before that call is just a little ambiguous in its wording. When it says
> >> "mark the pmd notpresent" it probably means "mark it so that it will not
> >> generate a new TLB entry", which is also what the comment is really about:
> >> prevent huge and small entries in the TLB for the same page at the same
> >> time.
> >>
> >> FWIW, and since the ARM arch-list is already on cc, I think there is
> >> an issue with pmdp_invalidate() on ARM, since it also seems to clear
> >> the trans_huge (and formerly trans_splitting) bit, which actually makes
> >> the pmd !pmd_present(), but it violates the other requirement from the
> >> comment:
> >> "the pmd_trans_huge and pmd_trans_splitting must remain set at all times
> >> on the pmd until the split is complete for this pmd"
> >
> > I've only been testing this for arm64 (where I'm yet to see a problem),
> > but we use the generic pmdp_invalidate implementation from
> > mm/pgtable-generic.c there. On arm64, pmd_trans_huge will return true
> > after pmd_mknotpresent. On arm, it does look to be buggy, since it nukes
> > the entire entry... Steve?
> 
> pmd_mknotpresent on arm looks inconsistent with the other
> architectures and can be changed.
> 
> Having had a look at the usage, I can't see it causing an immediate
> problem (that needs to be addressed by an emergency patch).
> We don't have a notion of splitting pmds (so there is no splitting
> information to lose), and the only usage I could see of
> pmd_mknotpresent was:
> 
> pmdp_invalidate(vma, haddr, pmd);
> pmd_populate(mm, pmd, pgtable);
> 
> In mm/huge_memory.c, around line 3588.
> 
> So we invalidate the entry (which puts down a faulting entry from
> pmd_mknotpresent and invalidates tlb), then immediately put down a
> table entry with pmd_populate.
> 
> I have run a 32-bit ARM test kernel and exacerbated THP splits (that's
> what took me time), and I didn't notice any problems with 4.5-rc5.

If I read the code correctly, your pmd_mknotpresent() makes the pmd
pmd_none(), right? If so, it's a problem.

It introduces the race I've described here:

https://marc.info/?l=linux-mm=144723658100512=4

Basically, if zap_pmd_range() sees pmd_none() between pmd_mknotpresent()
and pmd_populate(), we're screwed.

The race window is small, but it's there.
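
Schematically (not literal code):

/*
 * CPU 0: __split_huge_pmd_locked()        CPU 1: e.g. zap_pmd_range()
 *
 * pmdp_invalidate(vma, haddr, pmd);
 *   on 32-bit ARM pmd_mknotpresent()
 *   clears the entry, so it is now
 *   pmd_none()
 *                                          a lockless check sees pmd_none(),
 *                                          treats the range as unmapped and
 *                                          skips it
 * pmd_populate(mm, pmd, pgtable);
 *                                          the ptes that should have been
 *                                          zapped are still mapped
 */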

-- 
 Kirill A. Shutemov

Re: [BUG] random kernel crashes after THP rework on s390 (maybe also on PowerPC and ARM)

2016-02-23 Thread Kirill A. Shutemov
On Tue, Feb 23, 2016 at 07:19:07PM +0100, Gerald Schaefer wrote:
> I'll check with Martin, maybe it is actually trivial, then we can
> do a quick test it to rule that one out.

Oh. I found a bug in __split_huge_pmd_locked(), although I'm not sure it's
_the_ bug.

pmdp_invalidate() is called for the wrong address :-/
I guess that can be destructive on your architecture, right?

Could you check this?

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1c317b85ea7d..4246bc70e55a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2865,7 +2865,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct 
*vma, pmd_t *pmd,
pgtable = pgtable_trans_huge_withdraw(mm, pmd);
pmd_populate(mm, &_pmd, pgtable);
 
-   for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
+   for (i = 0; i < HPAGE_PMD_NR; i++) {
pte_t entry, *pte;
/*
 * Note that NUMA hinting access restrictions are not
@@ -2886,9 +2886,9 @@ static void __split_huge_pmd_locked(struct vm_area_struct 
*vma, pmd_t *pmd,
}
if (dirty)
SetPageDirty(page + i);
-   pte = pte_offset_map(&_pmd, haddr);
+   pte = pte_offset_map(&_pmd, haddr + i * PAGE_SIZE);
BUG_ON(!pte_none(*pte));
-   set_pte_at(mm, haddr, pte, entry);
+   set_pte_at(mm, haddr + i * PAGE_SIZE, pte, entry);
atomic_inc(&page[i]._mapcount);
pte_unmap(pte);
}
@@ -2938,7 +2938,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct 
*vma, pmd_t *pmd,
pmd_populate(mm, pmd, pgtable);
 
if (freeze) {
-   for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
+   for (i = 0; i < HPAGE_PMD_NR; i++) {
page_remove_rmap(page + i, false);
put_page(page + i);
    }
-- 
 Kirill A. Shutemov

Re: Question on follow_page_mask

2016-02-23 Thread Kirill A. Shutemov
On Tue, Feb 23, 2016 at 06:45:05PM +0530, Anshuman Khandual wrote:
> Not able to understand the first code block of follow_page_mask
> function. follow_huge_addr function is expected to find the page
> struct for the given address if it turns out to be a HugeTLB page
> but then when it finds the page we bug on if it had been called
> with FOLL_GET flag.
> 
>   page = follow_huge_addr(mm, address, flags & FOLL_WRITE);
>   if (!IS_ERR(page)) {
>   BUG_ON(flags & FOLL_GET);
>   return page;
>   }
> 
> do_move_page_to_node_array calls follow_page with FOLL_GET which
> in turn calls follow_page_mask with FOLL_GET. On POWER, the
> function follow_huge_addr is defined and does not return -EINVAL
> like the generic one. It returns the page struct if its a HugeTLB
> page. Just curious to know what is the purpose behind the BUG_ON.

I would guess requesting a pin on a non-reclaimable page is considered
useless, i.e. suspicious behaviour. BUG_ON() is overkill, I think;
WARN_ON_ONCE() would do.

Note that this follow_huge_addr() on Power is not reachable via
do_move_page_to_node_array(), because the vma there is !vma_migratable().
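
I.e. something like this (a sketch; whether the page should still be
returned when FOLL_GET was requested is a separate question):

	page = follow_huge_addr(mm, address, flags & FOLL_WRITE);
	if (!IS_ERR(page)) {
		/* asking for a pin here is suspicious, but not fatal */
		WARN_ON_ONCE(flags & FOLL_GET);
		return page;
	}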

-- 
 Kirill A. Shutemov

Re: [BUG] random kernel crashes after THP rework on s390 (maybe also on PowerPC and ARM)

2016-02-23 Thread Kirill A. Shutemov
On Fri, Feb 12, 2016 at 06:16:40PM +0100, Gerald Schaefer wrote:
> On Fri, 12 Feb 2016 16:57:27 +0100
> Christian Borntraeger <borntrae...@de.ibm.com> wrote:
> 
> > > I'm also confused by pmd_none() is equal to !pmd_present() on s390. Hm?
> > 
> > Don't know, Gerald or Martin?
> 
> The implementation frequently changes depending on how many new bits Martin
> needs to squeeze out :-)
> We don't have a _PAGE_PRESENT bit for pmds, so pmd_present() just checks if 
> the
> entry is not empty. pmd_none() of course does the opposite, it checks if it is
> empty.

I still worry about pmd_present(). It looks wrong to me. I wonder if the
patch below makes a difference.

The theory is that the splitting bit effectively masked the bogus
pmd_present(): we had pmd_trans_splitting() in all code paths and that
prevented mm from touching the pmd. Once pmd_trans_splitting() was gone,
mm proceeded with the pmd where it shouldn't, and here's the boom.

I'm not sure that the patch is correct wrt young/old pmds, and I have no
way to test it...

diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 64ead8091248..2eeb17ab68ac 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -490,7 +490,7 @@ static inline int pud_bad(pud_t pud)
 
 static inline int pmd_present(pmd_t pmd)
 {
-   return pmd_val(pmd) != _SEGMENT_ENTRY_INVALID;
+   return !(pmd_val(pmd) & _SEGMENT_ENTRY_INVALID);
 }
 
 static inline int pmd_none(pmd_t pmd)
-- 
 Kirill A. Shutemov

Re: [BUG] random kernel crashes after THP rework on s390 (maybe also on PowerPC and ARM)

2016-02-18 Thread Kirill A. Shutemov
On Thu, Feb 18, 2016 at 04:00:37PM +0100, Gerald Schaefer wrote:
> On Thu, 18 Feb 2016 01:58:08 +0200
> "Kirill A. Shutemov" <kir...@shutemov.name> wrote:
> 
> > On Wed, Feb 17, 2016 at 08:13:40PM +0100, Gerald Schaefer wrote:
> > > On Sat, 13 Feb 2016 12:58:31 +0100 (CET)
> > > Sebastian Ott <seb...@linux.vnet.ibm.com> wrote:
> > > 
> > > > [   59.875935] [ cut here ]
> > > > [   59.875937] kernel BUG at mm/huge_memory.c:2884!
> > > > [   59.875979] illegal operation: 0001 ilc:1 [#1] PREEMPT SMP 
> > > > DEBUG_PAGEALLOC
> > > > [   59.875986] Modules linked in: bridge stp llc btrfs xor mlx4_en 
> > > > vxlan ip6_udp_tunnel udp_tunnel mlx4_ib ptp pps_core ib_sa ib_mad 
> > > > ib_core ib_addr ghash_s390 prng raid6_pq ecb aes_s390 des_s390 
> > > > des_generic sha512_s390 sha256_s390 sha1_s390 mlx4_core sha_common 
> > > > genwqe_card scm_block crc_itu_t vhost_net tun vhost dm_mod macvtap 
> > > > eadm_sch macvlan kvm autofs4
> > > > [   59.876033] CPU: 2 PID: 5402 Comm: git Tainted: GW   
> > > > 4.4.0-07794-ga4eff16-dirty #77
> > > > [   59.876036] task: d2312948 ti: cfecc000 task.ti: 
> > > > cfecc000
> > > > [   59.876039] Krnl PSW : 0704d0018000 002bf3aa 
> > > > (__split_huge_pmd_locked+0x562/0xa10)
> > > > [   59.876045]R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:1 
> > > > PM:0 EA:3
> > > >Krnl GPRS: 01a7a1cf 03d10177c000 
> > > > 00044068 5df00215
> > > > [   59.876051]0001 0001 
> > > >  774e6900
> > > > [   59.876054]03ff5200 6d403b10 
> > > > 6e1eb800 03ff51f0
> > > > [   59.876058]03d10177c000 00715190 
> > > > 002bf234 cfecfb58
> > > > [   59.876068] Krnl Code: 002bf39c: d507d010a000clc 
> > > > 16(8,%%r13),0(%%r10)
> > > >   002bf3a2: a7840004brc 
> > > > 8,2bf3aa
> > > >  #002bf3a6: a7f40001brc 
> > > > 15,2bf3a8
> > > >  >002bf3aa: 91407440tm  
> > > > 1088(%%r7),64
> > > >   002bf3ae: a7840208brc 
> > > > 8,2bf7be
> > > >   002bf3b2: a7f401e9brc 
> > > > 15,2bf784
> > > >   002bf3b6: 9104a006tm  
> > > > 6(%%r10),4
> > > >   002bf3ba: a7740004brc 
> > > > 7,2bf3c2
> > > > [   59.876089] Call Trace:
> > > > [   59.876092] ([<002bf234>] 
> > > > __split_huge_pmd_locked+0x3ec/0xa10)
> > > > [   59.876095]  [<002c4310>] __split_huge_pmd+0x118/0x218
> > > > [   59.876099]  [<002810e8>] unmap_single_vma+0x2d8/0xb40
> > > > [   59.876102]  [<00282d66>] zap_page_range+0x116/0x318
> > > > [   59.876105]  [<0029b834>] SyS_madvise+0x23c/0x5e8
> > > > [   59.876108]  [<006f9f56>] system_call+0xd6/0x258
> > > > [   59.876111]  [<03ff9bbfd282>] 0x3ff9bbfd282
> > > > [   59.876113] INFO: lockdep is turned off.
> > > > [   59.876115] Last Breaking-Event-Address:
> > > > [   59.876118]  [<002bf3a6>] __split_huge_pmd_locked+0x55e/0xa10
> > > 
> > > The BUG at mm/huge_memory.c:2884 is interesting, it's the 
> > > BUG_ON(!pte_none(*pte))
> > > check in __split_huge_pmd_locked(). Obviously we expect the pre-allocated
> > > pagetables to be empty, but in collapse_huge_page() we deposit the 
> > > original
> > > pagetable instead of allocating a new (empty) one. This saves an 
> > > allocation,
> > > which is good, but doesn't that mean that if such a collapsed hugepage 
> > > will
> > > ever be split, we will always run into the BUG_ON(!pte_none(*pte)), or one
> > > of the two other VM_BUG_ONs in mm/huge_memory.c that check the same?
> > > 
> > > This behavior is not new, it was the same before the THP rework, so I do 
> > > not
> > > assume 

Re: [BUG] random kernel crashes after THP rework on s390 (maybe also on PowerPC and ARM)

2016-02-17 Thread Kirill A. Shutemov
On Wed, Feb 17, 2016 at 08:13:40PM +0100, Gerald Schaefer wrote:
> On Sat, 13 Feb 2016 12:58:31 +0100 (CET)
> Sebastian Ott <seb...@linux.vnet.ibm.com> wrote:
> 
> > [   59.875935] [ cut here ]
> > [   59.875937] kernel BUG at mm/huge_memory.c:2884!
> > [   59.875979] illegal operation: 0001 ilc:1 [#1] PREEMPT SMP 
> > DEBUG_PAGEALLOC
> > [   59.875986] Modules linked in: bridge stp llc btrfs xor mlx4_en vxlan 
> > ip6_udp_tunnel udp_tunnel mlx4_ib ptp pps_core ib_sa ib_mad ib_core ib_addr 
> > ghash_s390 prng raid6_pq ecb aes_s390 des_s390 des_generic sha512_s390 
> > sha256_s390 sha1_s390 mlx4_core sha_common genwqe_card scm_block crc_itu_t 
> > vhost_net tun vhost dm_mod macvtap eadm_sch macvlan kvm autofs4
> > [   59.876033] CPU: 2 PID: 5402 Comm: git Tainted: GW   
> > 4.4.0-07794-ga4eff16-dirty #77
> > [   59.876036] task: d2312948 ti: cfecc000 task.ti: 
> > cfecc000
> > [   59.876039] Krnl PSW : 0704d0018000 002bf3aa 
> > (__split_huge_pmd_locked+0x562/0xa10)
> > [   59.876045]R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:1 
> > PM:0 EA:3
> >Krnl GPRS: 01a7a1cf 03d10177c000 
> > 00044068 5df00215
> > [   59.876051]0001 0001 
> >  774e6900
> > [   59.876054]03ff5200 6d403b10 
> > 6e1eb800 03ff51f0
> > [   59.876058]03d10177c000 00715190 
> > 002bf234 cfecfb58
> > [   59.876068] Krnl Code: 002bf39c: d507d010a000clc 
> > 16(8,%%r13),0(%%r10)
> >   002bf3a2: a7840004brc 
> > 8,2bf3aa
> >  #002bf3a6: a7f40001brc 
> > 15,2bf3a8
> >  >002bf3aa: 91407440tm  
> > 1088(%%r7),64
> >   002bf3ae: a7840208brc 
> > 8,2bf7be
> >   002bf3b2: a7f401e9brc 
> > 15,2bf784
> >   002bf3b6: 9104a006tm  
> > 6(%%r10),4
> >   002bf3ba: a7740004brc 
> > 7,2bf3c2
> > [   59.876089] Call Trace:
> > [   59.876092] ([<002bf234>] __split_huge_pmd_locked+0x3ec/0xa10)
> > [   59.876095]  [<002c4310>] __split_huge_pmd+0x118/0x218
> > [   59.876099]  [<002810e8>] unmap_single_vma+0x2d8/0xb40
> > [   59.876102]  [<00282d66>] zap_page_range+0x116/0x318
> > [   59.876105]  [<0029b834>] SyS_madvise+0x23c/0x5e8
> > [   59.876108]  [<006f9f56>] system_call+0xd6/0x258
> > [   59.876111]  [<03ff9bbfd282>] 0x3ff9bbfd282
> > [   59.876113] INFO: lockdep is turned off.
> > [   59.876115] Last Breaking-Event-Address:
> > [   59.876118]  [<002bf3a6>] __split_huge_pmd_locked+0x55e/0xa10
> 
> The BUG at mm/huge_memory.c:2884 is interesting, it's the 
> BUG_ON(!pte_none(*pte))
> check in __split_huge_pmd_locked(). Obviously we expect the pre-allocated
> pagetables to be empty, but in collapse_huge_page() we deposit the original
> pagetable instead of allocating a new (empty) one. This saves an allocation,
> which is good, but doesn't that mean that if such a collapsed hugepage will
> ever be split, we will always run into the BUG_ON(!pte_none(*pte)), or one
> of the two other VM_BUG_ONs in mm/huge_memory.c that check the same?
> 
> This behavior is not new, it was the same before the THP rework, so I do not
> assume that it is related to the current problems, maybe with the exception
> of this specific crash. I never saw the BUG at mm/huge_memory.c:2884 myself,
> and the other crashes probably cannot be explained with this. Maybe I am
> also missing something, but I do not see how collapse_huge_page() and the
> (non-empty) pgtable deposit there can work out with the 
> BUG_ON(!pte_none(*pte))
> checks. Any thoughts?

I don't think there's a problem: ptes in the pgtable are cleared with
pte_clear() in __collapse_huge_page_copy().
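
For reference, the relevant loop, heavily trimmed (zero-page handling,
counters and sanity checks dropped), as I read it:

static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
				      struct vm_area_struct *vma,
				      unsigned long address, spinlock_t *ptl)
{
	pte_t *_pte;

	for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
	     _pte++, page++, address += PAGE_SIZE) {
		pte_t pteval = *_pte;

		if (pte_none(pteval)) {
			clear_user_highpage(page, address);
		} else {
			struct page *src_page = pte_page(pteval);

			copy_user_highpage(page, src_page, address, vma);
			release_pte_page(src_page);
			spin_lock(ptl);
			/* this is what keeps the deposited table empty */
			pte_clear(vma->vm_mm, address, _pte);
			page_remove_rmap(src_page, false);
			spin_unlock(ptl);
			free_page_and_swap_cache(src_page);
		}
	}
}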

-- 
 Kirill A. Shutemov

Re: [BUG] random kernel crashes after THP rework on s390 (maybe also on PowerPC and ARM)

2016-02-17 Thread Kirill A. Shutemov
On Tue, Feb 16, 2016 at 05:24:44PM +0100, Gerald Schaefer wrote:
> On Mon, 15 Feb 2016 23:35:26 +0200
> "Kirill A. Shutemov" <kir...@shutemov.name> wrote:
> 
> > Is there any chance that I'll be able to trigger the bug using QEMU?
> > Does anybody have an QEMU image I can use?
> > 
> 
> I have no image, but trying to reproduce this under virtualization may
> help to trigger this also on other architectures. After ruling out IPI
> vs. fast_gup I do not really see why this should be arch-specific, and
> it wouldn't be the first time that we hit subtle races first on s390, due
> to our virtualized environment (my test case is make -j20 with 10 CPUs and
> 4GB of memory, no swap).

Could you post your kernel config?

It would also be nice to check whether disabling split_huge_page() makes
any difference:

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a75081ca31cf..26d2b7b21021 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3364,6 +3364,8 @@ int split_huge_page_to_list(struct page *page, struct 
list_head *list)
bool mlocked;
unsigned long flags;
 
+   return -EBUSY;
+
VM_BUG_ON_PAGE(is_huge_zero_page(page), page);
VM_BUG_ON_PAGE(!PageAnon(page), page);
VM_BUG_ON_PAGE(!PageLocked(page), page);
-- 
 Kirill A. Shutemov

Re: [BUG] random kernel crashes after THP rework on s390 (maybe also on PowerPC and ARM)

2016-02-15 Thread Kirill A. Shutemov
On Mon, Feb 15, 2016 at 07:37:02PM +0100, Gerald Schaefer wrote:
> On Mon, 15 Feb 2016 13:31:59 +0200
> "Kirill A. Shutemov" <kir...@shutemov.name> wrote:
> 
> > On Sat, Feb 13, 2016 at 12:58:31PM +0100, Sebastian Ott wrote:
> > > 
> > > On Sat, 13 Feb 2016, Kirill A. Shutemov wrote:
> > > > Could you check if revert of fecffad25458 helps?
> > > 
> > > I reverted fecffad25458 on top of 721675fcf277cf - it oopsed with:
> > > 
> > > ¢ 1851.721062! Unable to handle kernel pointer dereference in virtual 
> > > kernel address space
> > > ¢ 1851.721075! failing address:  TEID: 0483
> > > ¢ 1851.721078! Fault in home space mode while using kernel ASCE.
> > > ¢ 1851.721085! AS:00d5c007 R3:0007 S:a800 
> > > P:003d
> > > ¢ 1851.721128! Oops: 0004 ilc:3 ¢#1! PREEMPT SMP DEBUG_PAGEALLOC
> > > ¢ 1851.721135! Modules linked in: bridge stp llc btrfs mlx4_ib mlx4_en 
> > > ib_sa ib_mad vxlan xor ip6_udp_tunnel ib_core udp_tunnel ptp pps_core 
> > > ib_addr ghash_s390raid6_pq prng ecb aes_s390 mlx4_core des_s390 
> > > des_generic genwqe_card sha512_s390 sha256_s390 sha1_s390 sha_common 
> > > crc_itu_t dm_mod scm_block vhost_net tun vhost eadm_sch macvtap macvlan 
> > > kvm autofs4
> > > ¢ 1851.721183! CPU: 7 PID: 256422 Comm: bash Not tainted 
> > > 4.5.0-rc3-00058-g07923d7-dirty #178
> > > ¢ 1851.721186! task: 7fbfd290 ti: 8c604000 task.ti: 
> > > 8c604000
> > > ¢ 1851.721189! Krnl PSW : 0704d0018000 0045d3b8 
> > > (__rb_erase_color+0x280/0x308)
> > > ¢ 1851.721200!R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:1 
> > > PM:0 EA:3
> > >Krnl GPRS: 0001 0020 
> > >  bd07eff1
> > > ¢ 1851.721205!0027ca10  
> > > 83e45898 77b61198
> > > ¢ 1851.721207!7ce1a490 bd07eff0 
> > > 7ce1a548 0027ca10
> > > ¢ 1851.721210!bd07c350 bd07eff0 
> > > 8c607aa8 8c607a68
> > > ¢ 1851.721221! Krnl Code: 0045d3aa: e3c0d0080024   stg 
> > > %%r12,8(%%r13)
> > >   0045d3b0: b9040039   lgr 
> > > %%r3,%%r9
> > >  #0045d3b4: a53b0001   oill
> > > %%r3,1
> > >  >0045d3b8: e3301024   stg 
> > > %%r3,0(%%r1)
> > >   0045d3be: ec28000e007c   cgij
> > > %%r2,0,8,45d3da
> > >   0045d3c4: e3402004   lg  
> > > %%r4,0(%%r2)
> > >   0045d3ca: b904001c   lgr 
> > > %%r1,%%r12
> > >   0045d3ce: ec143f3f0056   rosbg   
> > > %%r1,%%r4,63,63,0
> > > ¢ 1851.721269! Call Trace:
> > > ¢ 1851.721273! (¢<83e45898>! 0x83e45898)
> > > ¢ 1851.721279!  ¢<0029342a>! unlink_anon_vmas+0x9a/0x1d8
> > > ¢ 1851.721282!  ¢<00283f34>! free_pgtables+0xcc/0x148
> > > ¢ 1851.721285!  ¢<0028c376>! exit_mmap+0xd6/0x300
> > > ¢ 1851.721289!  ¢<00134db8>! mmput+0x90/0x118
> > > ¢ 1851.721294!  ¢<002d76bc>! flush_old_exec+0x5d4/0x700
> > > ¢ 1851.721298!  ¢<003369f4>! load_elf_binary+0x2f4/0x13e8
> > > ¢ 1851.721301!  ¢<002d6e4a>! search_binary_handler+0x9a/0x1f8
> > > ¢ 1851.721304!  ¢<002d8970>! 
> > > do_execveat_common.isra.32+0x668/0x9a0
> > > ¢ 1851.721307!  ¢<002d8cec>! do_execve+0x44/0x58
> > > ¢ 1851.721310!  ¢<002d8f92>! SyS_execve+0x3a/0x48
> > > ¢ 1851.721315!  ¢<006fb096>! system_call+0xd6/0x258
> > > ¢ 1851.721317!  ¢<03ff997436d6>! 0x3ff997436d6
> > > ¢ 1851.721319! INFO: lockdep is turned off.
> > > ¢ 1851.721321! Last Breaking-Event-Address:
> > > ¢ 1851.721323!  ¢<0045d31a>! __rb_erase_color+0x1e2/0x308
> > > ¢ 1851.721327!
> > > ¢ 1851.721329! ---¢ end trace 0d80041ac00cfae2 !---
> > > 
> > > 
> > > > 
> > > > And could you share how crashes looks like? I haven't seen backtraces 
> > > > yet.
> > > > 
> > > 
> 

Re: [BUG] random kernel crashes after THP rework on s390 (maybe also on PowerPC and ARM)

2016-02-15 Thread Kirill A. Shutemov
On Sat, Feb 13, 2016 at 12:58:31PM +0100, Sebastian Ott wrote:
> 
> On Sat, 13 Feb 2016, Kirill A. Shutemov wrote:
> > Could you check if revert of fecffad25458 helps?
> 
> I reverted fecffad25458 on top of 721675fcf277cf - it oopsed with:
> 
> ¢ 1851.721062! Unable to handle kernel pointer dereference in virtual kernel 
> address space
> ¢ 1851.721075! failing address:  TEID: 0483
> ¢ 1851.721078! Fault in home space mode while using kernel ASCE.
> ¢ 1851.721085! AS:00d5c007 R3:0007 S:a800 
> P:003d
> ¢ 1851.721128! Oops: 0004 ilc:3 ¢#1! PREEMPT SMP DEBUG_PAGEALLOC
> ¢ 1851.721135! Modules linked in: bridge stp llc btrfs mlx4_ib mlx4_en ib_sa 
> ib_mad vxlan xor ip6_udp_tunnel ib_core udp_tunnel ptp pps_core ib_addr 
> ghash_s390raid6_pq prng ecb aes_s390 mlx4_core des_s390 des_generic 
> genwqe_card sha512_s390 sha256_s390 sha1_s390 sha_common crc_itu_t dm_mod 
> scm_block vhost_net tun vhost eadm_sch macvtap macvlan kvm autofs4
> ¢ 1851.721183! CPU: 7 PID: 256422 Comm: bash Not tainted 
> 4.5.0-rc3-00058-g07923d7-dirty #178
> ¢ 1851.721186! task: 7fbfd290 ti: 8c604000 task.ti: 
> 8c604000
> ¢ 1851.721189! Krnl PSW : 0704d0018000 0045d3b8 
> (__rb_erase_color+0x280/0x308)
> ¢ 1851.721200!R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:1 PM:0 
> EA:3
>Krnl GPRS: 0001 0020  
> bd07eff1
> ¢ 1851.721205!0027ca10  83e45898 
> 77b61198
> ¢ 1851.721207!7ce1a490 bd07eff0 7ce1a548 
> 0027ca10
> ¢ 1851.721210!bd07c350 bd07eff0 8c607aa8 
> 8c607a68
> ¢ 1851.721221! Krnl Code: 0045d3aa: e3c0d0080024   stg 
> %%r12,8(%%r13)
>   0045d3b0: b9040039   lgr 
> %%r3,%%r9
>  #0045d3b4: a53b0001   oill%%r3,1
>  >0045d3b8: e3301024   stg 
> %%r3,0(%%r1)
>   0045d3be: ec28000e007c   cgij
> %%r2,0,8,45d3da
>   0045d3c4: e3402004   lg  
> %%r4,0(%%r2)
>   0045d3ca: b904001c   lgr 
> %%r1,%%r12
>   0045d3ce: ec143f3f0056   rosbg   
> %%r1,%%r4,63,63,0
> ¢ 1851.721269! Call Trace:
> ¢ 1851.721273! (¢<83e45898>! 0x83e45898)
> ¢ 1851.721279!  ¢<0029342a>! unlink_anon_vmas+0x9a/0x1d8
> ¢ 1851.721282!  ¢<00283f34>! free_pgtables+0xcc/0x148
> ¢ 1851.721285!  ¢<0028c376>! exit_mmap+0xd6/0x300
> ¢ 1851.721289!  ¢<00134db8>! mmput+0x90/0x118
> ¢ 1851.721294!  ¢<002d76bc>! flush_old_exec+0x5d4/0x700
> ¢ 1851.721298!  ¢<003369f4>! load_elf_binary+0x2f4/0x13e8
> ¢ 1851.721301!  ¢<002d6e4a>! search_binary_handler+0x9a/0x1f8
> ¢ 1851.721304!  ¢<002d8970>! do_execveat_common.isra.32+0x668/0x9a0
> ¢ 1851.721307!  ¢<002d8cec>! do_execve+0x44/0x58
> ¢ 1851.721310!  ¢<002d8f92>! SyS_execve+0x3a/0x48
> ¢ 1851.721315!  ¢<006fb096>! system_call+0xd6/0x258
> ¢ 1851.721317!  ¢<03ff997436d6>! 0x3ff997436d6
> ¢ 1851.721319! INFO: lockdep is turned off.
> ¢ 1851.721321! Last Breaking-Event-Address:
> ¢ 1851.721323!  ¢<0045d31a>! __rb_erase_color+0x1e2/0x308
> ¢ 1851.721327!
> ¢ 1851.721329! ---¢ end trace 0d80041ac00cfae2 !---
> 
> 
> > 
> > And could you share how crashes looks like? I haven't seen backtraces yet.
> > 
> 
> Sure. I didn't because they really looked random to me. Most of the time
> in rcu or list debugging but I thought these have just been the messenger
> observing a corruption first. Anyhow, here is an older one that might look
> interesting:
> 
> [   59.851421] list_del corruption. next->prev should be 6e1eb000, 
> but was 0400

This is kind of interesting: 0x400 is TAIL_MAPPING... Hm.

Could you check whether you see the problem on commit 1c290f642101 and its
immediate parent?

-- 
 Kirill A. Shutemov

Re: [BUG] random kernel crashes after THP rework on s390 (maybe also on PowerPC and ARM)

2016-02-12 Thread Kirill A. Shutemov
On Thu, Feb 11, 2016 at 08:57:02PM +0100, Gerald Schaefer wrote:
> On Thu, 11 Feb 2016 21:09:42 +0200
> "Kirill A. Shutemov" <kir...@shutemov.name> wrote:
> 
> > On Thu, Feb 11, 2016 at 07:22:23PM +0100, Gerald Schaefer wrote:
> > > Hi,
> > > 
> > > Sebastian Ott reported random kernel crashes beginning with v4.5-rc1 and
> > > he also bisected this to commit 61f5d698 "mm: re-enable THP". Further
> > > review of the THP rework patches, which cannot be bisected, revealed
> > > commit fecffad "s390, thp: remove infrastructure for handling splitting 
> > > PMDs"
> > > (and also similar commits for other archs).
> > > 
> > > This commit removes the THP splitting bit and also the architecture
> > > implementation of pmdp_splitting_flush(), which took care of the IPI for
> > > fast_gup serialization. The commit message says
> > > 
> > > pmdp_splitting_flush() is not needed too: on splitting PMD we will do
> > > pmdp_clear_flush() + set_pte_at().  pmdp_clear_flush() will do IPI as
> > > needed for fast_gup
> > > 
> > > The assumption that a TLB flush will also produce an IPI is wrong on s390,
> > > and maybe also on other architectures, and I thought that this was 
> > > actually
> > > the main reason for having an arch-specific pmdp_splitting_flush().
> > > 
> > > At least PowerPC and ARM also had an individual implementation of
> > > pmdp_splitting_flush() that used kick_all_cpus_sync() instead of a TLB
> > > flush to send the IPI, and those were also removed. Putting the arch
> > > maintainers and mailing lists on cc to verify.
> > > 
> > > On s390 this will break the IPI serialization against fast_gup, which
> > > would certainly explain the random kernel crashes, please revert or fix
> > > the pmdp_splitting_flush() removal.
> > 
> > Sorry for that.
> > 
> > I believe, the problem was already addressed for PowerPC:
> > 
> > http://lkml.kernel.org/g/454980831-16631-1-git-send-email-aneesh.ku...@linux.vnet.ibm.com
> > 
> > I think kick_all_cpus_sync() in arch-specific pmdp_invalidate() would do
> > the trick, right?
> 
> Hmm, not sure about that. After pmdp_invalidate(), a pmd_none() check in
> fast_gup will still return false, because the pmd is not empty (at least
> on s390). So I don't see spontaneously how it will help fast_gup to break
> out to the slow path in case of THP splitting.

What does pmdp_flush_direct() do in pmdp_invalidate()? It's hard for me to
unwrap :-/
Does it make the pmd !pmd_present()?

I'm also confused by pmd_none() being equal to !pmd_present() on s390. Hm?
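
For context, this is roughly how the two helpers read to me (paraphrased
from arch/s390/include/asm/pgtable.h, so take it with a grain of salt):

static inline int pmd_present(pmd_t pmd)
{
	return pmd_val(pmd) != _SEGMENT_ENTRY_INVALID;
}

static inline int pmd_none(pmd_t pmd)
{
	return pmd_val(pmd) == _SEGMENT_ENTRY_INVALID;
}

which makes pmd_none() literally the negation of pmd_present(), hence my
confusion.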

-- 
 Kirill A. Shutemov

Re: [BUG] random kernel crashes after THP rework on s390 (maybe also on PowerPC and ARM)

2016-02-12 Thread Kirill A. Shutemov
On Fri, Feb 12, 2016 at 06:16:40PM +0100, Gerald Schaefer wrote:
> On Fri, 12 Feb 2016 16:57:27 +0100
> Christian Borntraeger <borntrae...@de.ibm.com> wrote:
> 
> > On 02/12/2016 04:41 PM, Kirill A. Shutemov wrote:
> > > On Thu, Feb 11, 2016 at 08:57:02PM +0100, Gerald Schaefer wrote:
> > >> On Thu, 11 Feb 2016 21:09:42 +0200
> > >> "Kirill A. Shutemov" <kir...@shutemov.name> wrote:
> > >>
> > >>> On Thu, Feb 11, 2016 at 07:22:23PM +0100, Gerald Schaefer wrote:
> > >>>> Hi,
> > >>>>
> > >>>> Sebastian Ott reported random kernel crashes beginning with v4.5-rc1 
> > >>>> and
> > >>>> he also bisected this to commit 61f5d698 "mm: re-enable THP". Further
> > >>>> review of the THP rework patches, which cannot be bisected, revealed
> > >>>> commit fecffad "s390, thp: remove infrastructure for handling 
> > >>>> splitting PMDs"
> > >>>> (and also similar commits for other archs).
> > >>>>
> > >>>> This commit removes the THP splitting bit and also the architecture
> > >>>> implementation of pmdp_splitting_flush(), which took care of the IPI 
> > >>>> for
> > >>>> fast_gup serialization. The commit message says
> > >>>>
> > >>>> pmdp_splitting_flush() is not needed too: on splitting PMD we will 
> > >>>> do
> > >>>> pmdp_clear_flush() + set_pte_at().  pmdp_clear_flush() will do IPI 
> > >>>> as
> > >>>> needed for fast_gup
> > >>>>
> > >>>> The assumption that a TLB flush will also produce an IPI is wrong on 
> > >>>> s390,
> > >>>> and maybe also on other architectures, and I thought that this was 
> > >>>> actually
> > >>>> the main reason for having an arch-specific pmdp_splitting_flush().
> > >>>>
> > >>>> At least PowerPC and ARM also had an individual implementation of
> > >>>> pmdp_splitting_flush() that used kick_all_cpus_sync() instead of a TLB
> > >>>> flush to send the IPI, and those were also removed. Putting the arch
> > >>>> maintainers and mailing lists on cc to verify.
> > >>>>
> > >>>> On s390 this will break the IPI serialization against fast_gup, which
> > >>>> would certainly explain the random kernel crashes, please revert or fix
> > >>>> the pmdp_splitting_flush() removal.
> > >>>
> > >>> Sorry for that.
> > >>>
> > >>> I believe, the problem was already addressed for PowerPC:
> > >>>
> > >>> http://lkml.kernel.org/g/454980831-16631-1-git-send-email-aneesh.ku...@linux.vnet.ibm.com
> > >>>
> > >>> I think kick_all_cpus_sync() in arch-specific pmdp_invalidate() would do
> > >>> the trick, right?
> > >>
> > >> Hmm, not sure about that. After pmdp_invalidate(), a pmd_none() check in
> > >> fast_gup will still return false, because the pmd is not empty (at least
> > >> on s390). So I don't see spontaneously how it will help fast_gup to break
> > >> out to the slow path in case of THP splitting.
> > > 
> > > What does pmdp_flush_direct() do in pmdp_invalidate()? It's hard for me
> > > to unwrap :-/
> > > Does it make the pmd !pmd_present()?
> > 
> > It uses the idte instruction, which in an atomic fashion flushes the
> > associated TLB entry and changes the value of the pmd entry to invalid.
> > This comes from the HW requirement not to change a PTE/PMD that might
> > still be in use, other than with special instructions that do the TLB
> > handling and the invalidation together.
> 
> Correct, and it does _not_ make the pmd !pmd_present(); that would only be
> the case after a _clear_flush(). It only marks the pmd as invalid and
> flushes, so that it cannot generate a new TLB entry before the following
> pmd_populate(), but it keeps its other content. This is to fulfill the
> requirements outlined in the comment in mm/huge_memory.c before the call to
> pmdp_invalidate(). And independently of that comment, we would need such an
> _invalidate() or _clear_flush() on s390 before the pmd_populate() because of
> the HW details that Christian described.
> 
> Reading the comment again, I do now notice that it also says

Re: [BUG] random kernel crashes after THP rework on s390 (maybe also on PowerPC and ARM)

2016-02-11 Thread Kirill A. Shutemov
On Thu, Feb 11, 2016 at 09:09:42PM +0200, Kirill A. Shutemov wrote:
> On Thu, Feb 11, 2016 at 07:22:23PM +0100, Gerald Schaefer wrote:
> > Hi,
> > 
> > Sebastian Ott reported random kernel crashes beginning with v4.5-rc1 and
> > he also bisected this to commit 61f5d698 "mm: re-enable THP". Further
> > review of the THP rework patches, which cannot be bisected, revealed
> > commit fecffad "s390, thp: remove infrastructure for handling splitting 
> > PMDs"
> > (and also similar commits for other archs).
> > 
> > This commit removes the THP splitting bit and also the architecture
> > implementation of pmdp_splitting_flush(), which took care of the IPI for
> > fast_gup serialization. The commit message says
> > 
> > pmdp_splitting_flush() is not needed too: on splitting PMD we will do
> > pmdp_clear_flush() + set_pte_at().  pmdp_clear_flush() will do IPI as
> > needed for fast_gup
> > 
> > The assumption that a TLB flush will also produce an IPI is wrong on s390,
> > and maybe also on other architectures, and I thought that this was actually
> > the main reason for having an arch-specific pmdp_splitting_flush().
> > 
> > At least PowerPC and ARM also had an individual implementation of
> > pmdp_splitting_flush() that used kick_all_cpus_sync() instead of a TLB
> > flush to send the IPI, and those were also removed. Putting the arch
> > maintainers and mailing lists on cc to verify.
> > 
> > On s390 this will break the IPI serialization against fast_gup, which
> > would certainly explain the random kernel crashes, please revert or fix
> > the pmdp_splitting_flush() removal.
> 
> Sorry for that.
> 
> I believe, the problem was already addressed for PowerPC:
> 
> http://lkml.kernel.org/g/454980831-16631-1-git-send-email-aneesh.ku...@linux.vnet.ibm.com

Correct link is

http://lkml.kernel.org/g/1454980831-16631-1-git-send-email-aneesh.ku...@linux.vnet.ibm.com

-- 
 Kirill A. Shutemov

Re: [BUG] random kernel crashes after THP rework on s390 (maybe also on PowerPC and ARM)

2016-02-11 Thread Kirill A. Shutemov
On Thu, Feb 11, 2016 at 07:22:23PM +0100, Gerald Schaefer wrote:
> Hi,
> 
> Sebastian Ott reported random kernel crashes beginning with v4.5-rc1 and
> he also bisected this to commit 61f5d698 "mm: re-enable THP". Further
> review of the THP rework patches, which cannot be bisected, revealed
> commit fecffad "s390, thp: remove infrastructure for handling splitting PMDs"
> (and also similar commits for other archs).
> 
> This commit removes the THP splitting bit and also the architecture
> implementation of pmdp_splitting_flush(), which took care of the IPI for
> fast_gup serialization. The commit message says
> 
> pmdp_splitting_flush() is not needed too: on splitting PMD we will do
> pmdp_clear_flush() + set_pte_at().  pmdp_clear_flush() will do IPI as
> needed for fast_gup
> 
> The assumption that a TLB flush will also produce an IPI is wrong on s390,
> and maybe also on other architectures, and I thought that this was actually
> the main reason for having an arch-specific pmdp_splitting_flush().
> 
> At least PowerPC and ARM also had an individual implementation of
> pmdp_splitting_flush() that used kick_all_cpus_sync() instead of a TLB
> flush to send the IPI, and those were also removed. Putting the arch
> maintainers and mailing lists on cc to verify.
> 
> On s390 this will break the IPI serialization against fast_gup, which
> would certainly explain the random kernel crashes, please revert or fix
> the pmdp_splitting_flush() removal.

Sorry for that.

I believe, the problem was already addressed for PowerPC:

http://lkml.kernel.org/g/454980831-16631-1-git-send-email-aneesh.ku...@linux.vnet.ibm.com

I think kick_all_cpus_sync() in arch-specific pmdp_invalidate() would do
the trick, right?
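
For illustration only (not a tested patch), the rough shape of such an arch
override; pmd_hugepage_update() is the powerpc helper quoted elsewhere in this
archive, and s390 would need the analogous change in its own pmdp_invalidate():

void pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
                     pmd_t *pmdp)
{
        /* Arch-specific clear/invalidate + TLB flush of the huge pmd. */
        pmd_hugepage_update(vma->vm_mm, address, pmdp, ~0UL, 0);
        /*
         * kick_all_cpus_sync() waits for an IPI on every CPU; since
         * fast_gup() runs with local interrupts disabled, this cannot
         * complete until any concurrent fast_gup() critical section has
         * finished, i.e. before we repopulate the pmd with a page table
         * pointer.
         */
        kick_all_cpus_sync();
}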

If yes, I'll prepare patch tomorrow (some sleep required).

-- 
 Kirill A. Shutemov

Re: [PATCH V2] powerpc/mm: Fix Multi hit ERAT cause by recent THP update

2016-02-07 Thread Kirill A. Shutemov
->vm_mm, address, pmdp, _PAGE_USER, 0);
> +}
> +
> +
>  /*
>   * set a new huge pmd. We should not be called for updating
>   * an existing pmd entry. That should go via pmd_hugepage_update.
> @@ -663,10 +687,19 @@ void set_pmd_at(struct mm_struct *mm, unsigned long 
> addr,
>   return set_pte_at(mm, addr, pmdp_ptep(pmdp), pmd_pte(pmd));
>  }
>  
> +/*
> + * We use this to invalidate a pmdp entry before switching from a
> + * hugepte to regular pmd entry.
> + */
>  void pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
>pmd_t *pmdp)
>  {
> - pmd_hugepage_update(vma->vm_mm, address, pmdp, _PAGE_PRESENT, 0);
> + pmd_hugepage_update(vma->vm_mm, address, pmdp, ~0UL, 0);
> + /*
> +  * This ensures that generic code that rely on IRQ disabling
> +  * to prevent a parallel THP split work as expected.
> +  */
> + kick_all_cpus_sync();
>  }
>  
>  /*
> diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
> index 0b3c0d39ef75..93a0937652ec 100644
> --- a/include/asm-generic/pgtable.h
> +++ b/include/asm-generic/pgtable.h
> @@ -239,6 +239,14 @@ extern void pmdp_invalidate(struct vm_area_struct *vma, 
> unsigned long address,
>   pmd_t *pmdp);
>  #endif
>  
> +#ifndef __HAVE_ARCH_PMDP_HUGE_SPLITTING_FLUSH
> +static inline void pmdp_huge_splitting_flush(struct vm_area_struct *vma,
> +  unsigned long address, pmd_t *pmdp)
> +{
> +
> +}
> +#endif
> +
>  #ifndef __HAVE_ARCH_PTE_SAME
>  static inline int pte_same(pte_t pte_a, pte_t pte_b)
>  {
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 36c070167b71..b52d16a86e91 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2860,6 +2860,7 @@ static void __split_huge_pmd_locked(struct 
> vm_area_struct *vma, pmd_t *pmd,
>   young = pmd_young(*pmd);
>   dirty = pmd_dirty(*pmd);
>  
> + pmdp_huge_splitting_flush(vma, haddr, pmd);

Let's call it pmdp_huge_split_prepare().

"_flush" part is ppc-specific implementation detail and generic code
should not expect tlb to be flushed there.
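
Concretely, just a sketch of the suggested rename, mirroring the no-op stub
quoted above (the guard macro name here is assumed, following the suggested
function name):

#ifndef __HAVE_ARCH_PMDP_HUGE_SPLIT_PREPARE
static inline void pmdp_huge_split_prepare(struct vm_area_struct *vma,
                                           unsigned long address, pmd_t *pmdp)
{
        /* Generic default: nothing to do; ppc would override this. */
}
#endif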

Otherwise,

Acked-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>

>   pgtable = pgtable_trans_huge_withdraw(mm, pmd);
>   pmd_populate(mm, &_pmd, pgtable);
>  
> -- 
> 2.5.0
> 

-- 
 Kirill A. Shutemov

Re: [PATCH] powerpc/mm: Fix Multi hit ERAT cause by recent THP update

2016-02-05 Thread Kirill A. Shutemov
nsures that generic code that rely on IRQ disabling
> +  * to prevent a parallel THP split work as expected.
> +  */
> + kick_all_cpus_sync();
>  }
>  
>  /*
> diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
> index 0b3c0d39ef75..388065c79795 100644
> --- a/include/asm-generic/pgtable.h
> +++ b/include/asm-generic/pgtable.h
> @@ -239,6 +239,14 @@ extern void pmdp_invalidate(struct vm_area_struct *vma, 
> unsigned long address,
>   pmd_t *pmdp);
>  #endif
>  
> +#ifndef __HAVE_ARCH_PMDP_HUGE_SPLITTING_FLUSH
> +static inline void pmdp_huge_splitting_flush(struct vm_area_struct *vma,
> +  unsigned long address, pmd_t *pmdp)
> +{
> + return;
> +}
> +#endif
> +
>  #ifndef __HAVE_ARCH_PTE_SAME
>  static inline int pte_same(pte_t pte_a, pte_t pte_b)
>  {
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 36c070167b71..b52d16a86e91 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2860,6 +2860,7 @@ static void __split_huge_pmd_locked(struct 
> vm_area_struct *vma, pmd_t *pmd,
>   young = pmd_young(*pmd);
>   dirty = pmd_dirty(*pmd);
>  
> + pmdp_huge_splitting_flush(vma, haddr, pmd);
>   pgtable = pgtable_trans_huge_withdraw(mm, pmd);
>   pmd_populate(mm, &_pmd, pgtable);
>  
> -- 
> 2.5.0
> 

-- 
 Kirill A. Shutemov

Re: [PATCH V5 5/7] mm: mmap: Add mmap flag to request VM_LOCKONFAULT

2015-07-27 Thread Kirill A. Shutemov
On Mon, Jul 27, 2015 at 09:41:26AM -0400, Eric B Munson wrote:
 On Mon, 27 Jul 2015, Kirill A. Shutemov wrote:
 
  On Fri, Jul 24, 2015 at 05:28:43PM -0400, Eric B Munson wrote:
   The cost of faulting in all memory to be locked can be very high when
   working with large mappings.  If only portions of the mapping will be
   used this can incur a high penalty for locking.
   
   Now that we have the new VMA flag for the locked but not present state,
   expose it as an mmap option like MAP_LOCKED -> VM_LOCKED.
  
  As I mentioned before, I don't think this interface is justified.
  
  MAP_LOCKED has known issues[1]. The MAP_LOCKED problem is not necessary
  affects MAP_LOCKONFAULT, but still.
  
  Let's not add new interface unless it's demonstrably useful.
  
  [1] http://lkml.kernel.org/g/20150114095019.gc4...@dhcp22.suse.cz
 
 I understand and should have been more explicit.  This patch is still
 included because I have an internal user who wants to see it added.
 The problem discussed in the thread you point out does not affect
 MAP_LOCKONFAULT because we do not attempt to populate the region with
 MAP_LOCKONFAULT.
 
 As I told Vlastimil, if this is a hard NAK with the patch I can work
 with that.  Otherwise I prefer it stays.

That's not how it works.

Once an ABI is added to the kernel, it stays there practically forever.
Therefore it must be useful enough to justify the maintenance cost. I don't
see that demonstrated.

So, NAK.

-- 
 Kirill A. Shutemov

Re: [PATCH V5 2/7] mm: mlock: Add new mlock system call

2015-07-27 Thread Kirill A. Shutemov
On Fri, Jul 24, 2015 at 05:28:40PM -0400, Eric B Munson wrote:
 With the refactored mlock code, introduce a new system call for mlock.
 The new call will allow the user to specify what lock states are being
 added.  mlock2 is trivial at the moment, but a follow on patch will add
 a new mlock state making it useful.
 
 Signed-off-by: Eric B Munson emun...@akamai.com
 Cc: Michal Hocko mho...@suse.cz
 Cc: Vlastimil Babka vba...@suse.cz
 Cc: Heiko Carstens heiko.carst...@de.ibm.com
 Cc: Geert Uytterhoeven ge...@linux-m68k.org
 Cc: Catalin Marinas catalin.mari...@arm.com
 Cc: Stephen Rothwell s...@canb.auug.org.au
 Cc: Guenter Roeck li...@roeck-us.net
 Cc: linux-al...@vger.kernel.org
 Cc: linux-ker...@vger.kernel.org
 Cc: linux-arm-ker...@lists.infradead.org
 Cc: adi-buildroot-de...@lists.sourceforge.net
 Cc: linux-cris-ker...@axis.com
 Cc: linux-i...@vger.kernel.org
 Cc: linux-m...@lists.linux-m68k.org
 Cc: linux-am33-l...@redhat.com
 Cc: linux-par...@vger.kernel.org
 Cc: linuxppc-dev@lists.ozlabs.org
 Cc: linux-s...@vger.kernel.org
 Cc: linux...@vger.kernel.org
 Cc: sparcli...@vger.kernel.org
 Cc: linux-xte...@linux-xtensa.org
 Cc: linux-...@vger.kernel.org
 Cc: linux-a...@vger.kernel.org
 Cc: linux...@kvack.org
 ---
 Changes from V4:
 * Drop all architectures except x86[_64] from this patch, MIPS is added
   later in the series.  All others will be left to their maintainers.
 
 Changes from V3:
 * Do a (hopefully) complete job of adding the new system calls
  arch/alpha/include/uapi/asm/mman.h | 2 ++
  arch/mips/include/uapi/asm/mman.h  | 5 +
  arch/parisc/include/uapi/asm/mman.h| 2 ++
  arch/powerpc/include/uapi/asm/mman.h   | 2 ++
  arch/sparc/include/uapi/asm/mman.h | 2 ++
  arch/tile/include/uapi/asm/mman.h  | 5 +
  arch/x86/entry/syscalls/syscall_32.tbl | 1 +
  arch/x86/entry/syscalls/syscall_64.tbl | 1 +
  arch/xtensa/include/uapi/asm/mman.h| 5 +

Define MLOCK_LOCKED in include/uapi/asm-generic/mman-common.h.
This way you can drop changes in powerpc, sparc and tile.
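
Something along these lines, as a sketch only (the value here is just a
placeholder, not necessarily what the series picks):

/* include/uapi/asm-generic/mman-common.h */
#define MLOCK_LOCKED    0x01    /* Lock and populate the pages now */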

Otherwise looks good.

-- 
 Kirill A. Shutemov

Re: [PATCH V5 4/7] mm: mlock: Add mlock flags to enable VM_LOCKONFAULT usage

2015-07-27 Thread Kirill A. Shutemov
On Fri, Jul 24, 2015 at 05:28:42PM -0400, Eric B Munson wrote:
 The previous patch introduced a flag that specified pages in a VMA
 should be placed on the unevictable LRU, but they should not be made
 present when the area is created.  This patch adds the ability to set
 this state via the new mlock system calls.
 
 We add MLOCK_ONFAULT for mlock2 and MCL_ONFAULT for mlockall.
 MLOCK_ONFAULT will set the VM_LOCKONFAULT flag as well as the VM_LOCKED
 flag for the target region.  MCL_CURRENT and MCL_ONFAULT are used to
 lock current mappings.  With MCL_CURRENT all pages are made present and
 with MCL_ONFAULT they are locked when faulted in.  When specified with
 MCL_FUTURE all new mappings will be marked with VM_LOCKONFAULT.
 
 Currently, mlockall() clears all VMA lock flags and then sets the
 requested flags.  For instance, if a process has MCL_FUTURE and
 MCL_CURRENT set, but they want to clear MCL_FUTURE this would be
 accomplished by calling mlockall(MCL_CURRENT).  This still holds with
 the introduction of MCL_ONFAULT.  Each call to mlockall() resets all
 VMA flags to the values specified in the current call.  The new mlock2
 system call behaves in the same way.  If a region is locked with
 MLOCK_ONFAULT and a user wants to force it to be populated now, a second
 call to mlock2(MLOCK_LOCKED) will accomplish this.
 
 munlock() will unconditionally clear both vma flags.  munlockall()
 unconditionally clears both VMA flags on all VMAs and in the
 mm->def_flags field.
 
 Signed-off-by: Eric B Munson emun...@akamai.com
 Cc: Michal Hocko mho...@suse.cz
 Cc: Vlastimil Babka vba...@suse.cz
 Cc: Jonathan Corbet cor...@lwn.net
 Cc: Kirill A. Shutemov kir...@shutemov.name
 Cc: linux-al...@vger.kernel.org
 Cc: linux-ker...@vger.kernel.org
 Cc: linux-m...@linux-mips.org
 Cc: linux-par...@vger.kernel.org
 Cc: linuxppc-dev@lists.ozlabs.org
 Cc: sparcli...@vger.kernel.org
 Cc: linux-xte...@linux-xtensa.org
 Cc: linux-a...@vger.kernel.org
 Cc: linux-...@vger.kernel.org
 Cc: linux...@kvack.org
 ---
 Changes from V4:
 * Split addition of VMA flag
 
 Changes from V3:
 * Do extensive search for VM_LOCKED and ensure that VM_LOCKONFAULT is also 
 handled
  where appropriate
  arch/alpha/include/uapi/asm/mman.h   |  2 ++
  arch/mips/include/uapi/asm/mman.h|  2 ++
  arch/parisc/include/uapi/asm/mman.h  |  2 ++
  arch/powerpc/include/uapi/asm/mman.h |  2 ++
  arch/sparc/include/uapi/asm/mman.h   |  2 ++
  arch/tile/include/uapi/asm/mman.h|  3 +++
  arch/xtensa/include/uapi/asm/mman.h  |  2 ++

Again, you can save a few lines by moving some code into mman-common.h.
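i.e., roughly, as a sketch (again, the value is only a placeholder):

/* include/uapi/asm-generic/mman-common.h */
#define MLOCK_ONFAULT   0x02    /* Lock pages as they are faulted in */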

Otherwise looks good.

-- 
 Kirill A. Shutemov

Re: [PATCH V5 5/7] mm: mmap: Add mmap flag to request VM_LOCKONFAULT

2015-07-27 Thread Kirill A. Shutemov
On Fri, Jul 24, 2015 at 05:28:43PM -0400, Eric B Munson wrote:
 The cost of faulting in all memory to be locked can be very high when
 working with large mappings.  If only portions of the mapping will be
 used this can incur a high penalty for locking.
 
 Now that we have the new VMA flag for the locked but not present state,
 expose it as an mmap option like MAP_LOCKED -> VM_LOCKED.

As I mentioned before, I don't think this interface is justified.

MAP_LOCKED has known issues[1]. The MAP_LOCKED problem is not necessary
affects MAP_LOCKONFAULT, but still.

Let's not add new interface unless it's demonstrably useful.

[1] http://lkml.kernel.org/g/20150114095019.gc4...@dhcp22.suse.cz

-- 
 Kirill A. Shutemov

Re: [PATCH V4 5/6] mm: mmap: Add mmap flag to request VM_LOCKONFAULT

2015-07-22 Thread Kirill A. Shutemov
On Wed, Jul 22, 2015 at 10:32:20AM -0400, Eric B Munson wrote:
 On Wed, 22 Jul 2015, Kirill A. Shutemov wrote:
 
  On Tue, Jul 21, 2015 at 03:59:40PM -0400, Eric B Munson wrote:
   The cost of faulting in all memory to be locked can be very high when
   working with large mappings.  If only portions of the mapping will be
   used this can incur a high penalty for locking.
   
   Now that we have the new VMA flag for the locked but not present state,
   expose it as an mmap option like MAP_LOCKED -> VM_LOCKED.
  
  What is the advantage over mmap() + mlock(MLOCK_ONFAULT)?
 
 There isn't one, it was added to maintain parity with the
 mlock(MLOCK_LOCK) - mmap(MAP_LOCKED) set.  I think not having it will lead
 to confusion, because we have MAP_LOCKED, so why don't we support
 LOCKONFAULT from mmap as well?

I don't think it's a good idea to spend bits in flags unless we have a
reason for that.

BTW, you have a typo on sparc: s/0x8000/0x8/.


-- 
 Kirill A. Shutemov

Re: [PATCH V4 5/6] mm: mmap: Add mmap flag to request VM_LOCKONFAULT

2015-07-22 Thread Kirill A. Shutemov
On Tue, Jul 21, 2015 at 03:59:40PM -0400, Eric B Munson wrote:
 The cost of faulting in all memory to be locked can be very high when
 working with large mappings.  If only portions of the mapping will be
 used this can incur a high penalty for locking.
 
 Now that we have the new VMA flag for the locked but not present state,
 expose it as an mmap option like MAP_LOCKED -> VM_LOCKED.

What is the advantage over mmap() + mlock(MLOCK_ONFAULT)?
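
For comparison, a sketch of the two-call sequence this refers to. mlock2()
and MLOCK_ONFAULT come from this series (there is no libc wrapper and no
released ABI), so __NR_mlock2 is assumed to come from the series' syscall
table updates and the flag value below is only a placeholder:

#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef MLOCK_ONFAULT
#define MLOCK_ONFAULT   0x02            /* placeholder value */
#endif

/* Minimal wrapper; __NR_mlock2 is provided by the series, not by the
 * currently installed headers. */
static int mlock2(const void *addr, size_t len, int flags)
{
        return syscall(__NR_mlock2, addr, len, flags);
}

int main(void)
{
        size_t len = 64UL << 20;        /* large mapping, mostly untouched */
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (p == MAP_FAILED)
                return 1;

        /*
         * Lock the range, but let pages be faulted in (and then pinned)
         * on first touch instead of populating everything up front.
         */
        if (mlock2(p, len, MLOCK_ONFAULT))
                return 1;

        return 0;
}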

-- 
 Kirill A. Shutemov

Re: [PATCH V4 1/3] mm/thp: Split out pmd collapse flush into a separate function

2015-05-12 Thread Kirill A. Shutemov
On Tue, May 12, 2015 at 11:38:32AM +0530, Aneesh Kumar K.V wrote:
 Architectures like ppc64 [1] need to do special things while clearing
 pmd before a collapse. For them this operation is largely different
 from a normal hugepage pte clear. Hence add a separate function
 to clear pmd before collapse. After this patch pmdp_* functions
 operate only on hugepage pte, and not on regular pmd_t values
 pointing to page table.
 
 [1] ppc64 needs to invalidate all the normal page pte mappings we
 already have inserted in the hardware hash page table. But before
 doing that we need to make sure there are no parallel hash page
 table insert going on. So we need to do a kick_all_cpus_sync()
 before flushing the older hash table entries. By moving this to
 a separate function we capture these details and mention how it
 is different from a hugepage pte clear.
 
 This patch is a cleanup and only does code movement for clarity.
 There should not be any change in functionality.
 
 Signed-off-by: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com

For the patchset:

Acked-by: Kirill A. Shutemov kirill.shute...@linux.intel.com

-- 
 Kirill A. Shutemov

Re: [PATCH V3] powerpc/thp: Serialize pmd clear against a linux page table walk.

2015-05-11 Thread Kirill A. Shutemov
On Mon, May 11, 2015 at 11:56:01AM +0530, Aneesh Kumar K.V wrote:
 Serialize against find_linux_pte_or_hugepte which does lock-less
 lookup in page tables with local interrupts disabled. For huge pages
 it casts pmd_t to pte_t. Since format of pte_t is different from
 pmd_t we want to prevent transit from pmd pointing to page table
 to pmd pointing to huge page (and back) while interrupts are disabled.
 We clear pmd to possibly replace it with page table pointer in
 different code paths. So make sure we wait for the parallel
 find_linux_pte_or_hugepte to finish.
 
 Without this patch, a find_linux_pte_or_hugepte running in parallel to
 __split_huge_zero_page_pmd or do_huge_pmd_wp_page_fallback or zap_huge_pmd
 can run into the above issue. With __split_huge_zero_page_pmd and
 do_huge_pmd_wp_page_fallback we clear the hugepage pte before inserting
 the pmd entry with a regular pgtable address. Such a clear needs to
 wait for the parallel find_linux_pte_or_hugepte to finish.
 
 With zap_huge_pmd, we can run into issues, with a hugepage pte
 getting zapped due to a MADV_DONTNEED while another cpu faults it
 in as small pages.
 
 Reported-by: Kirill A. Shutemov kirill.shute...@linux.intel.com
 Signed-off-by: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com

Reviewed-by: Kirill A. Shutemov kirill.shute...@linux.intel.com

CC: stable@ ?

-- 
 Kirill A. Shutemov

Re: [PATCH V2 1/2] mm/thp: Split out pmd collapse flush into a separate function

2015-05-07 Thread Kirill A. Shutemov
();
 + /*
 +  * Now invalidate the hpte entries in the range
 +  * covered by pmd. This make sure we take a
 +  * fault and will find the pmd as none, which will
 +  * result in a major fault which takes mmap_sem and
 +  * hence wait for collapse to complete. Without this
 +  * the __collapse_huge_page_copy can result in copying
 +  * the old content.
 +  */
  + flush_tlb_pmd_range(vma->vm_mm, pmd, address);
   return pmd;
  }
  
 diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
 index 39f1d6a2b04d..80e6d415cd57 100644
 --- a/include/asm-generic/pgtable.h
 +++ b/include/asm-generic/pgtable.h
 @@ -189,6 +189,25 @@ extern void pmdp_splitting_flush(struct vm_area_struct 
 *vma,
unsigned long address, pmd_t *pmdp);
  #endif
  
 +#ifndef __HAVE_ARCH_PMDP_COLLAPSE_FLUSH
 +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
 +static inline pmd_t pmdp_collapse_flush(struct vm_area_struct *vma,
 +unsigned long address,
 +pmd_t *pmdp)
 +{
 + return pmdp_clear_flush(vma, address, pmdp);
 +}
 +#else
 +static inline pmd_t pmdp_collapse_flush(struct vm_area_struct *vma,
 +unsigned long address,
 +pmd_t *pmdp)
 +{
 + BUILD_BUG();
 + return __pmd(0);
 +}
 +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 +#endif
 +
  #ifndef __HAVE_ARCH_PGTABLE_DEPOSIT
  extern void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
  pgtable_t pgtable);
 diff --git a/mm/huge_memory.c b/mm/huge_memory.c
 index 078832cf3636..88f695a4e38b 100644
 --- a/mm/huge_memory.c
 +++ b/mm/huge_memory.c
 @@ -2499,7 +2499,7 @@ static void collapse_huge_page(struct mm_struct *mm,
* huge and small TLB entries for the same virtual address
* to avoid the risk of CPU bugs in that area.
*/
 - _pmd = pmdp_clear_flush(vma, address, pmd);
 + _pmd = pmdp_collapse_flush(vma, address, pmd);

Why? pmdp_clear_flush() does kick_all_cpus_sync() already.

   spin_unlock(pmd_ptl);
   mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
  
 -- 
 2.1.4
 

-- 
 Kirill A. Shutemov
