Re: [PATCH v13 14/14] powerpc/64s/radix: Enable huge vmalloc mappings

2021-04-16 Thread Nicholas Piggin
Excerpts from Andrew Morton's message of April 16, 2021 4:55 am:
> On Thu, 15 Apr 2021 12:23:55 +0200 Christophe Leroy wrote:
>> > +   * is done. STRICT_MODULE_RWX may require extra work to support this
>> > +   * too.
>> > +   */
>> >   
>> > -  return __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END, GFP_KERNEL,
>> > -  PAGE_KERNEL_EXEC, VM_FLUSH_RESET_PERMS, NUMA_NO_NODE,
>> 
>> 
>> I think you should add the following in 
>> 
>> #ifndef MODULES_VADDR
>> #define MODULES_VADDR VMALLOC_START
>> #define MODULES_END VMALLOC_END
>> #endif
>> 
>> And leave module_alloc() as is (just removing the enclosing #ifdef
>> MODULES_VADDR and adding the VM_NO_HUGE_VMAP flag)
>> 
>> This would minimise the conflicts with the changes I did in powerpc/next 
>> reported by Stephen R.
>> 
> 
> I'll drop powerpc-64s-radix-enable-huge-vmalloc-mappings.patch for now,
> make life simpler.

Yeah that's fine.

> Nick, a redo on top of Christophe's changes in linux-next would be best
> please.

Will do.

Thanks,
Nick
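
(For reference only: with Christophe's suggested #ifndef MODULES_VADDR fallback in
place, the redone module_alloc() would presumably end up looking something like the
sketch below, with the enclosing #ifdef gone and only the VM_NO_HUGE_VMAP flag added.
This is an illustration, not a posted patch.)

void *module_alloc(unsigned long size)
{
	BUILD_BUG_ON(TASK_SIZE > MODULES_VADDR);

	return __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END,
				    GFP_KERNEL, PAGE_KERNEL_EXEC,
				    VM_NO_HUGE_VMAP | VM_FLUSH_RESET_PERMS,
				    NUMA_NO_NODE,
				    __builtin_return_address(0));
}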


Re: [PATCH v1 1/2] powerpc/bitops: Use immediate operand when possible

2021-04-13 Thread Nicholas Piggin
Excerpts from Segher Boessenkool's message of April 14, 2021 7:58 am:
> On Tue, Apr 13, 2021 at 06:33:19PM +0200, Christophe Leroy wrote:
>> Le 12/04/2021 à 23:54, Segher Boessenkool a écrit :
>> >On Thu, Apr 08, 2021 at 03:33:44PM +, Christophe Leroy wrote:
>> >>For clear bits, on 32 bits 'rlwinm' can be used instead of 'andc'
>> >>when all bits to be cleared are consecutive.
>> >
>> >Also on 64-bits, as long as both the top and bottom bits are in the low
>> >32-bit half (for 32 bit mode, it can wrap as well).
>> 
>> Yes. But here we are talking about clearing a few bits, all other ones must 
>> remain unchanged. An rlwinm on PPC64 will always clear the upper part, 
>> which is unlikely to be what we want.
> 
> No, it does not.  It takes the low 32 bits of the source reg, duplicated
> to the top half as well, then rotated, then ANDed with the mask (which
> can wrap around).  This isn't very often very useful, but :-)
> 
> (One useful operation is splatting 32 bits to both halves of a 64-bit
> register, which is just rlwinm d,s,0,1,0).
> 
> If you only look at the low 32 bits, it does exactly the same as on
> 32-bit implementations.
> 
>> >>For the time being only
>> >>handle the single bit case, which we detect by checking whether the
>> >>mask is a power of two.
>> >
>> >You could look at rs6000_is_valid_mask in GCC:
>> >   
>> > 
>> >used by rs6000_is_valid_and_mask immediately after it.  You probably
>> >want to allow only rlwinm in your case, and please note this checks if
>> >something is a valid mask, not the inverse of a valid mask (as you
>> >want here).
>> 
>> This check looks more complex than what I need. It is used for both rlw... 
>> and rld..., and it calculates the operands.  The only thing I need is to 
>> validate the mask.
> 
> It has to do exactly the same thing for rlwinm as for all 64-bit
> variants (rldicl, rldicr, rldic).
> 
> One side effect of calculating the bit positions with exact_log2 is that
> it returns negative if the argument is not a power of two.
> 
> Here is a simpler way, that handles all cases:  input in "u32 val":
> 
>   if (!val)
>   return nonono;
>   if (val & 1)
>   val = ~val; // make the mask non-wrapping
>   val += val & -val;  // adding the low set bit should result in
>   // at most one bit set
>   if (!(val & (val - 1)))
>   return okidoki_all_good;
> 
>> I found a way: by ANDing the mask with the complement of itself rotated
>> left by one bit, we identify the transitions from 0 to 1. If the result is a
>> power of 2, it means there's only one transition, so the mask is as expected.
> 
> That does not handle all cases (it misses all bits set at least).  Which
> isn't all that interesting of course, but is a valid mask (but won't
> clear any bits, so not too interesting for your specific case :-) )
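
(For reference, a compilable stand-alone version of the check Segher sketched
above; the function name is made up for illustration.)

#include <stdbool.h>

/*
 * Return true if 'val' is a mask of consecutive 1-bits (possibly wrapping
 * around the word), i.e. something rlwinm's mask operand can express.
 */
static bool is_consecutive_mask(unsigned int val)
{
	if (!val)
		return false;
	if (val & 1)
		val = ~val;		/* make the mask non-wrapping */
	val += val & -val;		/* adding the lowest set bit leaves at
					 * most one bit set for a valid mask */
	return !(val & (val - 1));
}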

Would be nice if we could let the compiler deal with it all...

static inline unsigned long lr(unsigned long *mem)
{
unsigned long val;

/*
 * This doesn't clobber memory but want to avoid memory operations
 * moving ahead of it
 */
asm volatile("ldarx %0, %y1" : "=r"(val) : "Z"(*mem) : "memory");

return val;
}

static inline bool stc(unsigned long *mem, unsigned long val)
{
/*
 * This doesn't really clobber memory but same as above, also can't
 * specify output in asm goto.
 */
asm volatile goto(
"stdcx. %0, %y1 \n\t"
"bne-   %l[fail]\n\t"
: : "r"(val), "Z"(*mem) : "cr0", "memory" : fail);

return true;
fail: __attribute__((cold))
return false;
}

static inline void atomic_add(unsigned long *mem, unsigned long val)
{
unsigned long old, new;

do {
old = lr(mem);
new = old + val;
} while (unlikely(!stc(mem, new)));
}
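
(Continuing the sketch, not part of the original mail: a bit-clearing helper in
the same style. With a constant mask the compiler would then be free to pick
rlwinm/rldicl/andc itself, which is the point of the thread.)

static inline void clear_bits(unsigned long *mem, unsigned long mask)
{
	unsigned long old, new;

	do {
		old = lr(mem);
		new = old & ~mask;	/* constant masks can be folded into
					 * rlwinm/rldicl/andi. by the compiler */
	} while (unlikely(!stc(mem, new)));
}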



[tip: irq/core] genirq: Reduce irqdebug cacheline bouncing

2021-04-10 Thread tip-bot2 for Nicholas Piggin
The following commit has been merged into the irq/core branch of tip:

Commit-ID: 7c07012eb1be8b4a95d3502fd30795849007a40e
Gitweb:        https://git.kernel.org/tip/7c07012eb1be8b4a95d3502fd30795849007a40e
Author:        Nicholas Piggin 
AuthorDate:    Fri, 02 Apr 2021 23:20:37 +10:00
Committer: Thomas Gleixner 
CommitterDate: Sat, 10 Apr 2021 13:35:54 +02:00

genirq: Reduce irqdebug cacheline bouncing

note_interrupt() increments desc->irq_count for each interrupt even for
percpu interrupt handlers, even when they are handled successfully. This
causes cacheline bouncing and limits scalability.

Instead of incrementing irq_count every time, only start incrementing it
after seeing an unhandled irq, which should avoid the cache line
bouncing in the common path.

This actually should give better consistency in handling misbehaving
irqs too, because instead of the first unhandled irq arriving at an
arbitrary point in the irq_count cycle, its arrival will begin the
irq_count cycle.

Cédric reports the result of his IPI throughput test:

                      Millions of IPIs/s
 -----------   -----------------------------------------
               upstream     upstream     patched
 chips  cpus   default      noirqdebug   default (irqdebug)
 -----------   -----------------------------------------
 1      0-15     4.061        4.153        4.084
        0-31     7.937        8.186        8.158
        0-47    11.018       11.392       11.233
        0-63    11.460       13.907       14.022
 2      0-79     8.376       18.105       18.084
        0-95     7.338       22.101       22.266
        0-111    6.716       25.306       25.473
        0-127    6.223       27.814       28.029

Signed-off-by: Nicholas Piggin 
Signed-off-by: Thomas Gleixner 
Link: https://lore.kernel.org/r/20210402132037.574661-1-npig...@gmail.com

---
 kernel/irq/spurious.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/kernel/irq/spurious.c b/kernel/irq/spurious.c
index f865e5f..c481d84 100644
--- a/kernel/irq/spurious.c
+++ b/kernel/irq/spurious.c
@@ -403,6 +403,10 @@ void note_interrupt(struct irq_desc *desc, irqreturn_t action_ret)
desc->irqs_unhandled -= ok;
}
 
+   if (likely(!desc->irqs_unhandled))
+   return;
+
+   /* Now getting into unhandled irq detection */
desc->irq_count++;
if (likely(desc->irq_count < 10))
return;


Re: [PATCH v4] powerpc/traps: Enhance readability for trap types

2021-04-09 Thread Nicholas Piggin
Thanks for working on this, I think it's a nice cleanup and helps
non-powerpc people understand the code a bit better.

Excerpts from Xiongwei Song's message of April 10, 2021 12:28 am:
> From: Xiongwei Song 
> 
> Create a new header named traps.h, define macros to list ppc interrupt
> types in traps.h, replace the references of the trap hex values with these
> macros.
> 
> Referred the hex numbers in arch/powerpc/kernel/exceptions-64e.S,
> arch/powerpc/kernel/exceptions-64s.S and
> arch/powerpc/include/asm/kvm_asm.h.
> 
> Reported-by: kernel test robot 

It now looks like lkp asked for this whole cleanup patch. I would
put [kernel test robot ] in your v3->v4 changelog
item.

> Signed-off-by: Xiongwei Song 
> ---
> 
> v3-v4:
> Fix compile issue:
> arch/powerpc/kernel/process.c:1473:14: error: 'INTERRUPT_MACHINE_CHECK' 
> undeclared (first use in this function); did you mean 'TAINT_MACHINE_CHECK'?
> I didn't add "Reported-by: kernel test robot " here,
> because it's improper for this patch.

[...]

> diff --git a/arch/powerpc/include/asm/traps.h 
> b/arch/powerpc/include/asm/traps.h
> new file mode 100644
> index ..2e64e10afcef
> --- /dev/null
> +++ b/arch/powerpc/include/asm/traps.h
> @@ -0,0 +1,32 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _ASM_PPC_TRAPS_H
> +#define _ASM_PPC_TRAPS_H

These could go in interrupt.h.

> +#if defined(CONFIG_BOOKE) || defined(CONFIG_4xx)
> +#define INTERRUPT_MACHINE_CHECK   0x000
> +#define INTERRUPT_CRITICAL_INPUT  0x100
> +#define INTERRUPT_ALTIVEC_UNAVAIL 0x200
> +#define INTERRUPT_PERFMON 0x260
> +#define INTERRUPT_DOORBELL0x280
> +#define INTERRUPT_DEBUG   0xd00
> +#else
> +#define INTERRUPT_SYSTEM_RESET0x100
> +#define INTERRUPT_MACHINE_CHECK   0x200

[...]

> @@ -1469,7 +1470,9 @@ static void __show_regs(struct pt_regs *regs)
>   trap = TRAP(regs);
>   if (!trap_is_syscall(regs) && cpu_has_feature(CPU_FTR_CFAR))
>   pr_cont("CFAR: "REG" ", regs->orig_gpr3);
> - if (trap == 0x200 || trap == 0x300 || trap == 0x600) {
> + if (trap == INTERRUPT_MACHINE_CHECK ||
> + trap == INTERRUPT_DATA_STORAGE ||
> + trap == INTERRUPT_ALIGNMENT) {
>   if (IS_ENABLED(CONFIG_4xx) || IS_ENABLED(CONFIG_BOOKE))
>   pr_cont("DEAR: "REG" ESR: "REG" ", regs->dar, 
> regs->dsisr);
>   else

This is now a change in behaviour because previously BOOKE/4xx tested
0x200, but now it tests 0.

That looks wrong for 4xx. 64e does put 0x000 there but I wonder if it 
should use 0x200 instead. Bit difficult to test this stuff, I do have
some MCE injection patches for QEMU for 64s, might be able to look at
porting them to 64e although I have no idea about booke machine checks.

Anyway I don't think this patch should change generated code at all.
Either change the code first with smaller patches, or make sure you
keep the tests the same.

Thanks,
Nick


[PATCH] genirq: reduce irqdebug bouncing cachelines

2021-04-02 Thread Nicholas Piggin
note_interrupt increments desc->irq_count for each interrupt even for
percpu interrupt handlers, even when they are handled successfully. This
causes cacheline bouncing and limits scalability.

Instead of incrementing irq_count every time, only start incrementing it
after seeing an unhandled irq, which should avoid the cache line
bouncing in the common path.

This actually should give better consistency in handling misbehaving
irqs too, because instead of the first unhandled irq arriving at an
arbitrary point in the irq_count cycle, its arrival will begin the
irq_count cycle.

Cédric reports the result of his IPI throughput test:

                      Millions of IPIs/s
 -----------   -----------------------------------------
               upstream     upstream     patched
 chips  cpus   default      noirqdebug   default (irqdebug)
 -----------   -----------------------------------------
 1      0-15     4.061        4.153        4.084
        0-31     7.937        8.186        8.158
        0-47    11.018       11.392       11.233
        0-63    11.460       13.907       14.022
 2      0-79     8.376       18.105       18.084
        0-95     7.338       22.101       22.266
        0-111    6.716       25.306       25.473
        0-127    6.223       27.814       28.029

Signed-off-by: Nicholas Piggin 
---
 kernel/irq/spurious.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/kernel/irq/spurious.c b/kernel/irq/spurious.c
index f865e5f4d382..c481d8458325 100644
--- a/kernel/irq/spurious.c
+++ b/kernel/irq/spurious.c
@@ -403,6 +403,10 @@ void note_interrupt(struct irq_desc *desc, irqreturn_t action_ret)
desc->irqs_unhandled -= ok;
}
 
+   if (likely(!desc->irqs_unhandled))
+   return;
+
+   /* Now getting into unhandled irq detection */
desc->irq_count++;
if (likely(desc->irq_count < 10))
return;
-- 
2.23.0



Re: [PATCH v2] powerpc/traps: Enhance readability for trap types

2021-04-01 Thread Nicholas Piggin
Excerpts from Segher Boessenkool's message of April 2, 2021 2:11 am:
> On Thu, Apr 01, 2021 at 10:55:58AM +0800, Xiongwei Song wrote:
>> Segher Boessenkool  于2021年4月1日周四 上午6:15写道:
>> 
>> > On Wed, Mar 31, 2021 at 08:58:17PM +1100, Michael Ellerman wrote:
>> > > So perhaps:
>> > >
>> > >   EXC_SYSTEM_RESET
>> > >   EXC_MACHINE_CHECK
>> > >   EXC_DATA_STORAGE
>> > >   EXC_DATA_SEGMENT
>> > >   EXC_INST_STORAGE
>> > >   EXC_INST_SEGMENT
>> > >   EXC_EXTERNAL_INTERRUPT
>> > >   EXC_ALIGNMENT
>> > >   EXC_PROGRAM_CHECK
>> > >   EXC_FP_UNAVAILABLE
>> > >   EXC_DECREMENTER
>> > >   EXC_HV_DECREMENTER
>> > >   EXC_SYSTEM_CALL
>> > >   EXC_HV_DATA_STORAGE
>> > >   EXC_PERF_MONITOR
>> >
>> > These are interrupt (vectors), not exceptions.  It doesn't matter all
>> > that much, but confusing things more isn't useful either!  There can be
>> > multiple exceptions that all can trigger the same interrupt.
>> >
>> >  When looking at the reference manual of e500 and e600 from NXP
>>  official, they call them as interrupts.While looking at the "The
>> Programming Environments"
>>  that is also from NXP, they call them exceptions. Looks like there is
>>  no explicit distinction between interrupts and exceptions.
> 
> The architecture documents have always called it interrupts.  The PEM
> says it calls them exceptions instead, but they are called interrupts in
> the architecture (and the PEM says that, too).
> 
>> Here is the "The Programming Environments" link:
>> https://www.nxp.com.cn/docs/en/user-guide/MPCFPE_AD_R1.pdf
> 
> That document is 24 years old.  The architecture is still published,
> new versions regularly.
> 
>> As far as I know, the values of interrupts or exceptions above are defined
>> explicitly in reference manual or the programming environments.
> 
> They are defined in the architecture.
> 
>> Could
>> you please provide more details about multiple exceptions with the same
>> interrupts?
> 
> The simplest example is 700, program interrupt.  There are many causes
> for it, including all the exceptions in FPSCR: VX, ZX, OX, UX, XX, and
> VX is actually divided into nine separate cases itself.  There also are
> the various causes of privileged instruction type program interrupts,
> and  the trap type program interrupt, but the FEX ones are most obvious
> here.

Also:

* Some interrupts have no corresponding exception (system call and 
system call vectored). This is not just semantics or a bug in the ISA
because it is different from other synchronous interrupts: instructions 
which cause exceptions (e.g., a page fault) do not complete before 
taking the interrupt whereas sc does.

* It's quite usual for an exception to not cause an interrupt 
immediately (MSR[EE]=0, HMEER) or never cause one and be cleared by 
other means (msgclr, mtDEC, mtHMER, etc).

* It's possible for an exception to cause different interrupts!
A decrementer exception usually causes a decrementer interrupt, but it
can cause a system reset interrupt if the processor was in a power
saving mode. A data storage exception can cause a DSI or HDSI interrupt
depending on LPCR settings, and many other examples.

So I agree with Segher on this. We should use interrupt for interrupts, 
reduce exception except where we really mean it, and move away from vec 
and trap (I've got this wrong in the past too I admit). We don't have to 
do it all immediately, but new code should go in this direction.

Thanks,
Nick


Re: [PATCH v2] powerpc/traps: Enhance readability for trap types

2021-04-01 Thread Nicholas Piggin
Excerpts from Michael Ellerman's message of April 1, 2021 12:39 pm:
> Segher Boessenkool  writes:
>> On Wed, Mar 31, 2021 at 08:58:17PM +1100, Michael Ellerman wrote:
>>> So perhaps:
>>> 
>>>   EXC_SYSTEM_RESET
>>>   EXC_MACHINE_CHECK
>>>   EXC_DATA_STORAGE
>>>   EXC_DATA_SEGMENT
>>>   EXC_INST_STORAGE
>>>   EXC_INST_SEGMENT
>>>   EXC_EXTERNAL_INTERRUPT
>>>   EXC_ALIGNMENT
>>>   EXC_PROGRAM_CHECK
>>>   EXC_FP_UNAVAILABLE
>>>   EXC_DECREMENTER
>>>   EXC_HV_DECREMENTER
>>>   EXC_SYSTEM_CALL
>>>   EXC_HV_DATA_STORAGE
>>>   EXC_PERF_MONITOR
>>
>> These are interrupt (vectors), not exceptions.  It doesn't matter all
>> that much, but confusing things more isn't useful either!  There can be
>> multiple exceptions that all can trigger the same interrupt.
> 
> Yeah I know, but I think that ship has already sailed as far as the
> naming we have in the kernel.

It has, but there are also several other ships sailing in different 
directions. It could be worse though, at least they are not sideways in 
the Suez.

> We have over 250 uses of "exc", and several files called "exception"
> something.
> 
> Using "interrupt" can also be confusing because Linux uses that to mean
> "external interrupt".
> 
> But I dunno, maybe INT or VEC is clearer? .. or TRAP :)

We actually already have defines that follow Segher's suggestion, it's 
just that they're hidden away in a KVM header.

#define BOOK3S_INTERRUPT_SYSTEM_RESET   0x100
#define BOOK3S_INTERRUPT_MACHINE_CHECK  0x200
#define BOOK3S_INTERRUPT_DATA_STORAGE   0x300
#define BOOK3S_INTERRUPT_DATA_SEGMENT   0x380
#define BOOK3S_INTERRUPT_INST_STORAGE   0x400
#define BOOK3S_INTERRUPT_INST_SEGMENT   0x480
#define BOOK3S_INTERRUPT_EXTERNAL   0x500
#define BOOK3S_INTERRUPT_EXTERNAL_HV0x502
#define BOOK3S_INTERRUPT_ALIGNMENT  0x600

It would take just a small amount of work to move these to general 
powerpc header, add #ifdefs for Book E/S where the numbers differ,
and remove the BOOK3S_ prefix.
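
(As a rough illustration only, not something posted in this thread: the generic
header might start out like the sketch below, with the Book3S values taken from
the KVM header quoted above and Book E getting its own block where the numbers
differ.)

/* hypothetical sketch of a generic powerpc interrupt-vector header */
#ifdef CONFIG_PPC_BOOK3S
#define INTERRUPT_SYSTEM_RESET	0x100
#define INTERRUPT_MACHINE_CHECK	0x200
#define INTERRUPT_DATA_STORAGE	0x300
#define INTERRUPT_ALIGNMENT	0x600
#else	/* Book E, e.g. machine check lives at 0x000 */
#define INTERRUPT_MACHINE_CHECK	0x000
#endif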

I don't mind INTERRUPT_ but INT_ would be okay too. VEC_ actually
doesn't match what Book E does (which is some weirdness to map some
of them to match Book S but not all, arguably we should clean that
up too and just use vector numbers consistently, but the INTERRUPT_
prefix would still be valid if we did that).

BookE KVM entry will still continue to use a different convention
there so I would leave all those KVM defines in place for now, we
might do another pass on them later.

Thanks,
Nick


Re: linux-next: build failure after merge of the akpm-current tree

2021-03-23 Thread Nicholas Piggin
Excerpts from Stephen Rothwell's message of March 24, 2021 6:58 am:
> Hi all,
> 
> On Thu, 18 Mar 2021 20:56:07 +1100 Stephen Rothwell  
> wrote:
>> 
>> After merging the akpm-current tree, today's linux-next build (sparc
>> defconfig) failed like this:
>> 
>> In file included from arch/sparc/include/asm/pgtable_32.h:25:0,
>>  from arch/sparc/include/asm/pgtable.h:7,
>>  from include/linux/pgtable.h:6,
>>  from include/linux/mm.h:33,
>>  from mm/vmalloc.c:12:
>> mm/vmalloc.c: In function 'vmalloc_to_page':
>> include/asm-generic/pgtable-nopud.h:51:27: error: implicit declaration of 
>> function 'pud_page'; did you mean 'put_page'? 
>> [-Werror=implicit-function-declaration]
>>  #define p4d_page(p4d)(pud_page((pud_t){ p4d }))
>>^
>> mm/vmalloc.c:643:10: note: in expansion of macro 'p4d_page'
>>return p4d_page(*p4d) + ((addr & ~P4D_MASK) >> PAGE_SHIFT);
>>   ^~~~
>> mm/vmalloc.c:643:25: warning: return makes pointer from integer without a 
>> cast [-Wint-conversion]
>>return p4d_page(*p4d) + ((addr & ~P4D_MASK) >> PAGE_SHIFT);
>> mm/vmalloc.c:651:25: warning: return makes pointer from integer without a 
>> cast [-Wint-conversion]
>>return pud_page(*pud) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
>>   ~~~^~~~
>> 
>> Caused by commit
>> 
>>   70d18d470920 ("mm/vmalloc: fix HUGE_VMAP regression by enabling huge pages 
>> in vmalloc_to_page")
>> 
>> I have applied the following hack path for today (hopefully someone can
>> come up with something better):
>> 
>> From: Stephen Rothwell 
>> Date: Thu, 18 Mar 2021 18:32:58 +1100
>> Subject: [PATCH] hack to make SPARC32 build
>> 
>> Signed-off-by: Stephen Rothwell 
>> ---
>>  mm/vmalloc.c | 8 
>>  1 file changed, 8 insertions(+)
>> 
>> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
>> index 57b7f62d25a7..96444d64129a 100644
>> --- a/mm/vmalloc.c
>> +++ b/mm/vmalloc.c
>> @@ -640,7 +640,11 @@ struct page *vmalloc_to_page(const void *vmalloc_addr)
>>  if (p4d_none(*p4d))
>>  return NULL;
>>  if (p4d_leaf(*p4d))
>> +#ifdef CONFIG_SPARC32
>> +return NULL;
>> +#else
>>  return p4d_page(*p4d) + ((addr & ~P4D_MASK) >> PAGE_SHIFT);
>> +#endif
>>  if (WARN_ON_ONCE(p4d_bad(*p4d)))
>>  return NULL;
>>  
>> @@ -648,7 +652,11 @@ struct page *vmalloc_to_page(const void *vmalloc_addr)
>>  if (pud_none(*pud))
>>  return NULL;
>>  if (pud_leaf(*pud))
>> +#ifdef CONFIG_SPARC32
>> +return NULL;
>> +#else
>>  return pud_page(*pud) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
>> +#endif
>>  if (WARN_ON_ONCE(pud_bad(*pud)))
>>  return NULL;
>>  
>> -- 
>> 2.30.0
> 
> I am still applying this hack.

Oh I missed your first mail, thanks for the ping. I'll have a look 
today.

Thanks,
Nick


Re: [PATCH v4 01/25] mm: Introduce struct folio

2021-03-21 Thread Nicholas Piggin
Excerpts from Matthew Wilcox's message of March 19, 2021 11:25 am:
> On Fri, Mar 19, 2021 at 10:56:45AM +1100, Balbir Singh wrote:
>> On Fri, Mar 05, 2021 at 04:18:37AM +, Matthew Wilcox (Oracle) wrote:
>> > A struct folio refers to an entire (possibly compound) page.  A function
>> > which takes a struct folio argument declares that it will operate on the
>> > entire compound page, not just PAGE_SIZE bytes.  In return, the caller
>> > guarantees that the pointer it is passing does not point to a tail page.
>> >
>> 
>> Is this a part of a larger use case or general cleanup/refactor where
>> the split between page and folio simplify programming?
> 
> The goal here is to manage memory in larger chunks.  Pages are now too
> small for just about every workload.  Even compiling the kernel sees a 7%
> performance improvement just by doing readahead using relatively small
> THPs (16k-256k).  You can see that work here:
> https://git.infradead.org/users/willy/pagecache.git/shortlog/refs/heads/master

The 7% improvement comes from cache cold kbuild by improving IO
patterns?

Just wondering what kind of readahead is enabled by this that can't
be done with base page size.

Thanks,
Nick


Re: [PATCH][next] mm/vmalloc: fix read of uninitialized pointer area

2021-03-21 Thread Nicholas Piggin
Excerpts from Colin King's message of March 19, 2021 1:59 am:
> From: Colin Ian King 
> 
> There is a corner case where the sanity check of variable size fails
> and branches to the fail label while shift is not greater than PAGE_SHIFT,
> causing area to never be assigned. This was picked up by static
> analysis as follows:
> 
> 1. var_decl: Declaring variable area without initializer.
>struct vm_struct *area;
> 
>...
> 
> 2. Condition !size, taking true branch.
>if (!size || (size >> PAGE_SHIFT) > totalram_pages())
> 3. Jumping to label fail.
>goto fail;
> 
> ...
> 
> 4. Condition shift > 12, taking false branch.
>   fail:
>   if (shift > PAGE_SHIFT) {
>   shift = PAGE_SHIFT;
>   align = real_align;
>   size = real_size;
>   goto again;
>   }
> 
>  Uninitialized pointer read (UNINIT)
>  5. uninit_use: Using uninitialized value area.
>   if (!area) {
>   ...
>   }
> 
> Fix this by setting area to NULL to avoid the uninitialized read
> of area.
> 
> Addresses-Coverity: ("Uninitialized pointer read")
> Fixes: 92db9fec381b ("mm/vmalloc: hugepage vmalloc mappings")
> Signed-off-by: Colin Ian King 

Looks good to me.

Acked-by: Nicholas Piggin 

Thanks,
Nick

> ---
>  mm/vmalloc.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 96444d64129a..4b415b4bb7ae 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -2888,8 +2888,10 @@ void *__vmalloc_node_range(unsigned long size, 
> unsigned long align,
>   unsigned long real_align = align;
>   unsigned int shift = PAGE_SHIFT;
>  
> - if (!size || (size >> PAGE_SHIFT) > totalram_pages())
> + if (!size || (size >> PAGE_SHIFT) > totalram_pages()) {
> + area = NULL;
>   goto fail;
> + }
>  
>   if (vmap_allow_huge && !(vm_flags & VM_NO_HUGE_VMAP) &&
>   arch_vmap_pmd_supported(prot)) {
> -- 
> 2.30.2
> 
> 


Re: [PATCH v13 00/14] huge vmalloc mappings

2021-03-17 Thread Nicholas Piggin
Excerpts from Andrew Morton's message of March 18, 2021 8:58 am:
> On Wed, 17 Mar 2021 16:23:48 +1000 Nicholas Piggin  wrote:
> 
>> 
>> *** BLURB HERE ***
>> 
> 
> That's really not what it means ;)
 
Sigh, wasn't having a good day yesterday.

> Could we please get a nice description for the [0/n]?  What's it all
> about, what's the benefit, what are potential downsides.
>
> And performance testing results!  Because if it ain't faster, there's
> no point in merging it?
> 

It's supposed to have a bit of description in patch 13, and has some
performance results in patch 14. Is it better to put a bigger writeup
in 0? I thought that tends to get lost.

I'll write something here to discuss for now, and can fit it into the 
appropriate place in the series after that.

The kernel virtual mapping layer grew support for mapping memory with > 
PAGE_SIZE ptes with 0ddab1d2ed664 ("lib/ioremap.c: add huge I/O map 
capability interfaces"), and implemented support for using those huge
page mappings with ioremap.

According to the submission, the use-case is mapping very large 
non-volatile memory devices, which could be GB or TB.
https://lore.kernel.org/lkml/1425404664-19675-1-git-send-email-toshi.k...@hp.com/
The benefit is said to be in the overhead of maintaining the mapping,
perhaps both in memory overhead and setup / teardown time. Memory
overhead for the mapping with a 4kB page and 8 byte page table is 2GB
per TB of mapping, down to 4MB / TB with 2MB pages.
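
(The arithmetic behind those numbers, as a tiny userspace sketch rather than
kernel code, assuming 8-byte page table entries:)

#include <stdio.h>

int main(void)
{
	unsigned long long map = 1ULL << 40;	/* 1 TB of mapping */
	unsigned long long pte = 8;		/* bytes per page table entry */

	/* 4kB pages: 2^28 PTEs x 8 bytes = 2GB of page tables per TB */
	printf("4kB pages: %llu MB of page tables per TB\n",
	       (map / (4ULL << 10)) * pte >> 20);
	/* 2MB pages: 2^19 PMDs x 8 bytes = 4MB of page tables per TB */
	printf("2MB pages: %llu MB of page tables per TB\n",
	       (map / (2ULL << 20)) * pte >> 20);
	return 0;
}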

The same huge page vmap infrastructure can be quite easily adapted and
used for mapping vmalloc memory pages without more complexity for arch
or core vmap code. However unlike ioremap, vmalloc page table overhead 
is not a real problem, so the advantage to justify this is performance.

Several of the most important structures in the kernel (e.g., vfs and network hash 
tables) are allocated with vmalloc on NUMA machines, in order to 
distribute access bandwidth over the machine. Mapping these with larger
pages can improve TLB usage significantly, for example this reduces TLB 
misses by nearly 30x on a `git diff` workload on a 2-node POWER9 (59,800 
-> 2,100) and reduces CPU cycles by 0.54%, due to vfs hashes being 
allocated with 2MB pages.

[ Other numbers?
  - The difference is even larger in a guest due to more costly TLB 
misses.
  - Eric Dumazet was keen on the network hash performance possibilities.
  - Other archs? Ding was doing x86 testing. ]

The kernel module allocator also uses vmalloc to map module images even 
on non-NUMA, which can result in high iTLB pressure on highly modular 
distro type of kernels. This series does not implement huge mappings for 
modules yet, but it's a step along the way. Rick Edgecombe was looking 
at that IIRC.

The per-cpu allocator similarly might be able to take advantage of this.
Also on the todo list.

The disadvantages of this I can see are:
* Memory fragmentation can waste some physical memory because it will 
  attempt to allocate larger pages to fit the required size, rounding up 
  (once the requested size is >= 2MB).
  - I don't see it being a big problem in practice unless some user 
crops up that allocates thousands of 2.5MB ranges. We can tweak 
heuristics a bit there if needed to reduce peak waste.
* Less granular mappings can make the NUMA distribution less balanced.
  - Similar to the above.
  - Could also allocate all major system hashes with one allocation
up-front and spread them all across the one block, which should help
overall NUMA distribution and reduce fragmentation waste.
* Callers might expect something about the underlying allocated pages.
  - Tried to keep the appearance of base PAGE_SIZE pages throughout the 
APIs and exposed data structures.
  - Added a VM_NO_HUGE_VMAP flag to hammer troublesome cases with.

- Finally, added a nohugevmalloc boot option to turn it off (independent
  of nohugeiomap).

Is that helpful?

Thanks,
Nick


Re: [PATCH v2] Increase page and bit waitqueue hash size

2021-03-17 Thread Nicholas Piggin
Excerpts from Linus Torvalds's message of March 18, 2021 5:26 am:
> On Wed, Mar 17, 2021 at 3:44 AM Nicholas Piggin  wrote:
>>
>> Argh, because I didn't test small. Sorry I had the BASE_SMALL setting in
>> another patch and thought it would be a good idea to mash them together.
>> In hindsight probably not even if it did build.
> 
> I was going to complain about that code in general.
> 
> First complaining about the hash being small, and then adding a config
> option to make it ridiculously much *smaller* seemed wrong to begin
> with, and didn't make any sense.
> 
> So no, please don't smash together.

Fair point, fixed.

> 
> In fact, I'd like to see this split up, and with more numbers:
> 
>  - separate out the bit_waitqueue thing that is almost certainly not
> remotely as critical (and maybe not needed at all)
> 
>  - show the profile number _after_ the patch(es)

Might take some time to get a system and run tests. We actually had 
difficulty recreating it before this patch too, so it's kind of
hard to say _that_ was the exact case that previously ran badly and
is now fixed. We thought just the statistical nature of collisions
and page / lock contention made things occasionally line up and
tank.

>  - explain why you picked the random scaling numbers (21 and 22 for
> the two different cases)?
> 
>  - give an estimate of how big the array now ends up being for
> different configurations.
> 
> I think it ends up using that "scale" factor of 21, and basically
> being "memory size >> 21" and then rounding up to a power of two.
> 
> And honestly, I'm not sure that makes much sense. So for a 1GB machine
> we get the same as we used to for the bit waitqueue (twice as many for
> the page waitqueue) , but if you run on some smaller setup, you
> apparently can end up with just a couple of buckets.
> 
> So I'd feel a lot better about this if I saw the numbers, and got the
> feeling that the patch actually tries to take legacy machines into
> account.
>
> And even on a big machine, what's the advantage of scaling perfectly
> with memory. If you have a terabyte of RAM, why would you need half a
> million hash entries (if I did the math right), and use 4GB of memory
> on it? The contention doesn't go up by amount of memory, it goes up
> roughly by number of threads, and the two are very seldom really all
> that linearly connected.
> 
> So honestly, I'd like to see more reasonable numbers. I'd like to see
> what the impact of just raising the hash bit size from 8 to 16 is on
> that big machine. Maybe still using alloc_large_system_hash(), but
> using a low-limit of 8 (our traditional very old number that hasn't
> been a problem even on small machines), and a high-limit of 16 or
> something.
> 
> And if you want even more, I really really want that justified by the
> performance / profile numbers.

Yes all good points I'll add those numbers. It may need a floor and
ceiling or something like that. We may not need quite so many entries.
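
(A sketch of what a floor and ceiling could look like, reusing the existing
low/high limit arguments that alloc_large_system_hash() already takes; the 2^8
and 2^16 bounds below are just Linus's suggested range, not settled numbers.)

	page_wait_table = alloc_large_system_hash("page waitqueue hash",
					sizeof(wait_queue_head_t),
					0,		/* numentries: scale from memory size */
					21,		/* scale factor */
					0,		/* flags */
					&page_wait_table_bits,
					NULL,		/* don't need the hash mask */
					1UL << 8,	/* low limit: 256 entries */
					1UL << 16);	/* high limit: 64k entries */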

> 
> And does does that "bit_waitqueue" really merit updating AT ALL? It's
> almost entirely unused these days.

I updated it mainly because keeping the code more similar ends up being 
easier than unnecessarily diverging. The memory cost is no big deal (once 
limits are fixed) so I prefer not to encounter some case where it falls 
over.

> I think maybe the page lock code
> used to use that, but then realized it had more specialized needs, so
> now it's separate.
> 
> So can we split that bit-waitqueue thing up from the page waitqueue
> changes? They have basically nothing in common except for a history,
> and I think they should be treated separately (including the
> explanation for what actually hits the bottleneck).

It's still used. Buffer heads being an obvious and widely used one that
follows similar usage pattern as page lock / writeback in some cases.
Several other filesystems seem to use it for similar block / IO
tracking structures by the looks (md, btrfs, nfs).

Thanks,
Nick


[tip: sched/core] sched/wait_bit, mm/filemap: Increase page and bit waitqueue hash size

2021-03-17 Thread tip-bot2 for Nicholas Piggin
The following commit has been merged into the sched/core branch of tip:

Commit-ID: 873d7c4c6a920d43ff82e44121e54053d4edba93
Gitweb:        https://git.kernel.org/tip/873d7c4c6a920d43ff82e44121e54053d4edba93
Author:        Nicholas Piggin 
AuthorDate:    Wed, 17 Mar 2021 17:54:27 +10:00
Committer: Ingo Molnar 
CommitterDate: Wed, 17 Mar 2021 09:32:30 +01:00

sched/wait_bit, mm/filemap: Increase page and bit waitqueue hash size

The page waitqueue hash is a bit small (256 entries) on very big systems. A
16 socket 1536 thread POWER9 system was found to encounter hash collisions
and excessive time in waitqueue locking at times. This was intermittent and
hard to reproduce easily with the setup we had (very little real IO
capacity). The theory is that sometimes (depending on allocation luck)
important pages would happen to collide a lot in the hash, slowing down page
locking, causing the problem to snowball.

A small test case was made where threads would write and fsync different
pages, generating just a small amount of contention across many pages.

Increasing page waitqueue hash size to 262144 entries increased throughput
by 182% while also reducing standard deviation 3x. perf before the increase:

  36.23%  [k] _raw_spin_lock_irqsave-  -
  |
  |--34.60%--wake_up_page_bit
  |  0
  |  iomap_write_end.isra.38
  |  iomap_write_actor
  |  iomap_apply
  |  iomap_file_buffered_write
  |  xfs_file_buffered_aio_write
  |  new_sync_write

  17.93%  [k] native_queued_spin_lock_slowpath  -  -
  |
  |--16.74%--_raw_spin_lock_irqsave
  |  |
  |   --16.44%--wake_up_page_bit
  | iomap_write_end.isra.38
  | iomap_write_actor
  | iomap_apply
  | iomap_file_buffered_write
  | xfs_file_buffered_aio_write

This patch uses alloc_large_system_hash to allocate a bigger system hash
that scales somewhat with memory size. The bit/var wait-queue is also
changed to keep code matching, albeit with a smaller scale factor.

A very small CONFIG_BASE_SMALL option is also added because these are two
of the biggest static objects in the image on very small systems.

This hash could be made per-node, which may help reduce remote accesses
on well localised workloads, but that adds some complexity with indexing
and hotplug, so until we get a less artificial workload to test with,
keep it simple.

Signed-off-by: Nicholas Piggin 
Signed-off-by: Ingo Molnar 
Acked-by: Peter Zijlstra 
Link: https://lore.kernel.org/r/20210317075427.587806-1-npig...@gmail.com
---
 kernel/sched/wait_bit.c | 30 +++---
 mm/filemap.c| 24 +---
 2 files changed, 44 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/wait_bit.c b/kernel/sched/wait_bit.c
index 02ce292..dba73de 100644
--- a/kernel/sched/wait_bit.c
+++ b/kernel/sched/wait_bit.c
@@ -2,19 +2,24 @@
 /*
  * The implementation of the wait_bit*() and related waiting APIs:
  */
+#include 
 #include "sched.h"
 
-#define WAIT_TABLE_BITS 8
-#define WAIT_TABLE_SIZE (1 << WAIT_TABLE_BITS)
-
-static wait_queue_head_t bit_wait_table[WAIT_TABLE_SIZE] __cacheline_aligned;
+#define BIT_WAIT_TABLE_SIZE (1 << bit_wait_table_bits)
+#if CONFIG_BASE_SMALL
+static const unsigned int bit_wait_table_bits = 3;
+static wait_queue_head_t bit_wait_table[BIT_WAIT_TABLE_SIZE] 
__cacheline_aligned;
+#else
+static unsigned int bit_wait_table_bits __ro_after_init;
+static wait_queue_head_t *bit_wait_table __ro_after_init;
+#endif
 
 wait_queue_head_t *bit_waitqueue(void *word, int bit)
 {
const int shift = BITS_PER_LONG == 32 ? 5 : 6;
unsigned long val = (unsigned long)word << shift | bit;
 
-   return bit_wait_table + hash_long(val, WAIT_TABLE_BITS);
+   return bit_wait_table + hash_long(val, bit_wait_table_bits);
 }
 EXPORT_SYMBOL(bit_waitqueue);
 
@@ -152,7 +157,7 @@ EXPORT_SYMBOL(wake_up_bit);
 
 wait_queue_head_t *__var_waitqueue(void *p)
 {
-   return bit_wait_table + hash_ptr(p, WAIT_TABLE_BITS);
+   return bit_wait_table + hash_ptr(p, bit_wait_table_bits);
 }
 EXPORT_SYMBOL(__var_waitqueue);
 
@@ -246,6 +251,17 @@ void __init wait_bit_init(void)
 {
int i;
 
-   for (i = 0; i < WAIT_TABLE_SIZE; i++)
+   if (!CONFIG_BASE_SMALL) {
+   bit_wait_table = alloc_large_system_hash("bit waitqueue hash",
+   
sizeof(wait_queue_head_t),
+   0,
+   22,
+   

Re: [PATCH v2] Increase page and bit waitqueue hash size

2021-03-17 Thread Nicholas Piggin
Excerpts from Rasmus Villemoes's message of March 17, 2021 8:12 pm:
> On 17/03/2021 08.54, Nicholas Piggin wrote:
> 
>> +#if CONFIG_BASE_SMALL
>> +static const unsigned int page_wait_table_bits = 4;
>>  static wait_queue_head_t page_wait_table[PAGE_WAIT_TABLE_SIZE] 
>> __cacheline_aligned;
> 
>>  
>> +if (!CONFIG_BASE_SMALL) {
>> +page_wait_table = alloc_large_system_hash("page waitqueue hash",
>> +
>> sizeof(wait_queue_head_t),
>> +0,
> 
> So, how does the compiler not scream at you for assigning to an array,
> even if it's inside an if (0)?
> 

Argh, because I didn't test small. Sorry I had the BASE_SMALL setting in 
another patch and thought it would be a good idea to mash them together. 
In hindsight probably not even if it did build.

Thanks,
Nick

--
[PATCH v3] Increase page and bit waitqueue hash size

The page waitqueue hash is a bit small (256 entries) on very big systems. A
16 socket 1536 thread POWER9 system was found to encounter hash collisions
and excessive time in waitqueue locking at times. This was intermittent and
hard to reproduce easily with the setup we had (very little real IO
capacity). The theory is that sometimes (depending on allocation luck)
important pages would happen to collide a lot in the hash, slowing down page
locking, causing the problem to snowball.

A small test case was made where threads would write and fsync different
pages, generating just a small amount of contention across many pages.

Increasing page waitqueue hash size to 262144 entries increased throughput
by 182% while also reducing standard deviation 3x. perf before the increase:

  36.23%  [k] _raw_spin_lock_irqsave-  -
  |
  |--34.60%--wake_up_page_bit
  |  0
  |  iomap_write_end.isra.38
  |  iomap_write_actor
  |  iomap_apply
  |  iomap_file_buffered_write
  |  xfs_file_buffered_aio_write
  |  new_sync_write

  17.93%  [k] native_queued_spin_lock_slowpath  -  -
  |
  |--16.74%--_raw_spin_lock_irqsave
  |  |
  |   --16.44%--wake_up_page_bit
  | iomap_write_end.isra.38
  | iomap_write_actor
  | iomap_apply
  | iomap_file_buffered_write
  | xfs_file_buffered_aio_write

This patch uses alloc_large_system_hash to allocate a bigger system hash
that scales somewhat with memory size. The bit/var wait-queue is also
changed to keep code matching, albeit with a smaller scale factor.

This hash could be made per-node, which may help reduce remote accesses
on well localised workloads, but that adds some complexity with indexing
and hotplug, so until we get a less artificial workload to test with,
keep it simple.

Signed-off-by: Nicholas Piggin 
---
 kernel/sched/wait_bit.c | 25 ++---
 mm/filemap.c| 16 ++--
 2 files changed, 32 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/wait_bit.c b/kernel/sched/wait_bit.c
index 02ce292b9bc0..3cc5fa552516 100644
--- a/kernel/sched/wait_bit.c
+++ b/kernel/sched/wait_bit.c
@@ -2,19 +2,20 @@
 /*
  * The implementation of the wait_bit*() and related waiting APIs:
  */
+#include 
 #include "sched.h"
 
-#define WAIT_TABLE_BITS 8
-#define WAIT_TABLE_SIZE (1 << WAIT_TABLE_BITS)
-
-static wait_queue_head_t bit_wait_table[WAIT_TABLE_SIZE] __cacheline_aligned;
+#define BIT_WAIT_TABLE_SIZE (1 << BIT_WAIT_TABLE_BITS)
+#define BIT_WAIT_TABLE_BITS bit_wait_table_bits
+static unsigned int bit_wait_table_bits __ro_after_init;
+static wait_queue_head_t *bit_wait_table __ro_after_init;
 
 wait_queue_head_t *bit_waitqueue(void *word, int bit)
 {
const int shift = BITS_PER_LONG == 32 ? 5 : 6;
unsigned long val = (unsigned long)word << shift | bit;
 
-   return bit_wait_table + hash_long(val, WAIT_TABLE_BITS);
+   return bit_wait_table + hash_long(val, BIT_WAIT_TABLE_BITS);
 }
 EXPORT_SYMBOL(bit_waitqueue);
 
@@ -152,7 +153,7 @@ EXPORT_SYMBOL(wake_up_bit);
 
 wait_queue_head_t *__var_waitqueue(void *p)
 {
-   return bit_wait_table + hash_ptr(p, WAIT_TABLE_BITS);
+   return bit_wait_table + hash_ptr(p, BIT_WAIT_TABLE_BITS);
 }
 EXPORT_SYMBOL(__var_waitqueue);
 
@@ -246,6 +247,16 @@ void __init wait_bit_init(void)
 {
int i;
 
-   for (i = 0; i < WAIT_TABLE_SIZE; i++)
+   bit_wait_table = alloc_large_system_hash("bit waitqueue hash",
+   sizeof(wait_queue_head_t),
+   

Re: [PATCH v2] Increase page and bit waitqueue hash size

2021-03-17 Thread Nicholas Piggin
Excerpts from Ingo Molnar's message of March 17, 2021 6:38 pm:
> 
> * Nicholas Piggin  wrote:
> 
>> The page waitqueue hash is a bit small (256 entries) on very big systems. A
>> 16 socket 1536 thread POWER9 system was found to encounter hash collisions
>> and excessive time in waitqueue locking at times. This was intermittent and
>> hard to reproduce easily with the setup we had (very little real IO
>> capacity). The theory is that sometimes (depending on allocation luck)
>> important pages would happen to collide a lot in the hash, slowing down page
>> locking, causing the problem to snowball.
>> 
>> A small test case was made where threads would write and fsync different
>> pages, generating just a small amount of contention across many pages.
>> 
>> Increasing page waitqueue hash size to 262144 entries increased throughput
>> by 182% while also reducing standard deviation 3x. perf before the increase:
>> 
>>   36.23%  [k] _raw_spin_lock_irqsave-  -
>>   |
>>   |--34.60%--wake_up_page_bit
>>   |  0
>>   |  iomap_write_end.isra.38
>>   |  iomap_write_actor
>>   |  iomap_apply
>>   |  iomap_file_buffered_write
>>   |  xfs_file_buffered_aio_write
>>   |  new_sync_write
>> 
>>   17.93%  [k] native_queued_spin_lock_slowpath  -  -
>>   |
>>   |--16.74%--_raw_spin_lock_irqsave
>>   |  |
>>   |   --16.44%--wake_up_page_bit
>>   | iomap_write_end.isra.38
>>   | iomap_write_actor
>>   | iomap_apply
>>   | iomap_file_buffered_write
>>   | xfs_file_buffered_aio_write
>> 
>> This patch uses alloc_large_system_hash to allocate a bigger system hash
>> that scales somewhat with memory size. The bit/var wait-queue is also
>> changed to keep code matching, albeit with a smaller scale factor.
>> 
>> A very small CONFIG_BASE_SMALL option is also added because these are two
>> of the biggest static objects in the image on very small systems.
>> 
>> This hash could be made per-node, which may help reduce remote accesses
>> on well localised workloads, but that adds some complexity with indexing
>> and hotplug, so until we get a less artificial workload to test with,
>> keep it simple.
>> 
>> Signed-off-by: Nicholas Piggin 
>> ---
>>  kernel/sched/wait_bit.c | 30 +++---
>>  mm/filemap.c| 24 +---
>>  2 files changed, 44 insertions(+), 10 deletions(-)
>> 
>> diff --git a/kernel/sched/wait_bit.c b/kernel/sched/wait_bit.c
>> index 02ce292b9bc0..dba73dec17c4 100644
>> --- a/kernel/sched/wait_bit.c
>> +++ b/kernel/sched/wait_bit.c
>> @@ -2,19 +2,24 @@
>>  /*
>>   * The implementation of the wait_bit*() and related waiting APIs:
>>   */
>> +#include 
>>  #include "sched.h"
>>  
>> -#define WAIT_TABLE_BITS 8
>> -#define WAIT_TABLE_SIZE (1 << WAIT_TABLE_BITS)
> 
> Ugh, 256 entries is almost embarrassingly small indeed.
> 
> I've put your patch into sched/core, unless Andrew is objecting.

Thanks. Andrew and Linus might have some opinions on it, but if it's 
just in a testing branch for now that's okay.


> 
>> -for (i = 0; i < WAIT_TABLE_SIZE; i++)
>> +if (!CONFIG_BASE_SMALL) {
>> +bit_wait_table = alloc_large_system_hash("bit waitqueue hash",
>> +
>> sizeof(wait_queue_head_t),
>> +0,
>> +22,
>> +0,
>> +&bit_wait_table_bits,
>> +NULL,
>> +0,
>> +0);
>> +}
>> +for (i = 0; i < BIT_WAIT_TABLE_SIZE; i++)
>>  init_waitqueue_head(bit_wait_table + i);
> 
> 
> Meta suggestion: maybe the CONFIG_BASE_SMALL ugliness could be folded 
> into alloc_large_system_hash() itself?

I don't like the ugliness and that's a good suggestion in some ways, but 
having the constant size and table is 

[PATCH v2] Increase page and bit waitqueue hash size

2021-03-17 Thread Nicholas Piggin
The page waitqueue hash is a bit small (256 entries) on very big systems. A
16 socket 1536 thread POWER9 system was found to encounter hash collisions
and excessive time in waitqueue locking at times. This was intermittent and
hard to reproduce easily with the setup we had (very little real IO
capacity). The theory is that sometimes (depending on allocation luck)
important pages would happen to collide a lot in the hash, slowing down page
locking, causing the problem to snowball.

A small test case was made where threads would write and fsync different
pages, generating just a small amount of contention across many pages.

Increasing page waitqueue hash size to 262144 entries increased throughput
by 182% while also reducing standard deviation 3x. perf before the increase:

  36.23%  [k] _raw_spin_lock_irqsave-  -
  |
  |--34.60%--wake_up_page_bit
  |  0
  |  iomap_write_end.isra.38
  |  iomap_write_actor
  |  iomap_apply
  |  iomap_file_buffered_write
  |  xfs_file_buffered_aio_write
  |  new_sync_write

  17.93%  [k] native_queued_spin_lock_slowpath  -  -
  |
  |--16.74%--_raw_spin_lock_irqsave
  |  |
  |   --16.44%--wake_up_page_bit
  | iomap_write_end.isra.38
  | iomap_write_actor
  | iomap_apply
  | iomap_file_buffered_write
  | xfs_file_buffered_aio_write

This patch uses alloc_large_system_hash to allocate a bigger system hash
that scales somewhat with memory size. The bit/var wait-queue is also
changed to keep code matching, albeit with a smaller scale factor.

A very small CONFIG_BASE_SMALL option is also added because these are two
of the biggest static objects in the image on very small systems.

This hash could be made per-node, which may help reduce remote accesses
on well localised workloads, but that adds some complexity with indexing
and hotplug, so until we get a less artificial workload to test with,
keep it simple.

Signed-off-by: Nicholas Piggin 
---
 kernel/sched/wait_bit.c | 30 +++---
 mm/filemap.c| 24 +---
 2 files changed, 44 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/wait_bit.c b/kernel/sched/wait_bit.c
index 02ce292b9bc0..dba73dec17c4 100644
--- a/kernel/sched/wait_bit.c
+++ b/kernel/sched/wait_bit.c
@@ -2,19 +2,24 @@
 /*
  * The implementation of the wait_bit*() and related waiting APIs:
  */
+#include 
 #include "sched.h"
 
-#define WAIT_TABLE_BITS 8
-#define WAIT_TABLE_SIZE (1 << WAIT_TABLE_BITS)
-
-static wait_queue_head_t bit_wait_table[WAIT_TABLE_SIZE] __cacheline_aligned;
+#define BIT_WAIT_TABLE_SIZE (1 << bit_wait_table_bits)
+#if CONFIG_BASE_SMALL
+static const unsigned int bit_wait_table_bits = 3;
+static wait_queue_head_t bit_wait_table[BIT_WAIT_TABLE_SIZE] 
__cacheline_aligned;
+#else
+static unsigned int bit_wait_table_bits __ro_after_init;
+static wait_queue_head_t *bit_wait_table __ro_after_init;
+#endif
 
 wait_queue_head_t *bit_waitqueue(void *word, int bit)
 {
const int shift = BITS_PER_LONG == 32 ? 5 : 6;
unsigned long val = (unsigned long)word << shift | bit;
 
-   return bit_wait_table + hash_long(val, WAIT_TABLE_BITS);
+   return bit_wait_table + hash_long(val, bit_wait_table_bits);
 }
 EXPORT_SYMBOL(bit_waitqueue);
 
@@ -152,7 +157,7 @@ EXPORT_SYMBOL(wake_up_bit);
 
 wait_queue_head_t *__var_waitqueue(void *p)
 {
-   return bit_wait_table + hash_ptr(p, WAIT_TABLE_BITS);
+   return bit_wait_table + hash_ptr(p, bit_wait_table_bits);
 }
 EXPORT_SYMBOL(__var_waitqueue);
 
@@ -246,6 +251,17 @@ void __init wait_bit_init(void)
 {
int i;
 
-   for (i = 0; i < WAIT_TABLE_SIZE; i++)
+   if (!CONFIG_BASE_SMALL) {
+   bit_wait_table = alloc_large_system_hash("bit waitqueue hash",
+   
sizeof(wait_queue_head_t),
+   0,
+   22,
+   0,
+   &bit_wait_table_bits,
+   NULL,
+   0,
+   0);
+   }
+   for (i = 0; i < BIT_WAIT_TABLE_SIZE; i++)
init_waitqueue_head(bit_wait_table + i);
 }
diff --git a/mm/filemap.c b/mm/filemap.c
index 43700480d897..dbbb5b9d951d 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -34,6 +34,7 @@
 #include 
 #include 
 #include 
+#include 
 #inclu

[PATCH v13 13/14] mm/vmalloc: Hugepage vmalloc mappings

2021-03-17 Thread Nicholas Piggin
Support huge page vmalloc mappings. Config option HAVE_ARCH_HUGE_VMALLOC
enables support on architectures that define HAVE_ARCH_HUGE_VMAP and
supports PMD sized vmap mappings.

vmalloc will attempt to allocate PMD-sized pages if allocating PMD size
or larger, and fall back to small pages if that was unsuccessful.

Architectures must ensure that any arch specific vmalloc allocations
that require PAGE_SIZE mappings (e.g., module allocations vs strict
module rwx) use the VM_NO_HUGE_VMAP flag to inhibit larger mappings.

This can result in more internal fragmentation and memory overhead for a
given allocation; an option nohugevmalloc is added to disable it at boot.
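
(For illustration, a hypothetical caller that must keep PAGE_SIZE ptes would opt
out the same way the powerpc module_alloc() change later in this series does:)

	void *p = __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END,
				       GFP_KERNEL, PAGE_KERNEL, VM_NO_HUGE_VMAP,
				       NUMA_NO_NODE, __builtin_return_address(0));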

Signed-off-by: Nicholas Piggin 
---
 arch/Kconfig|  11 ++
 include/linux/vmalloc.h |  21 
 mm/page_alloc.c |   5 +-
 mm/vmalloc.c| 216 +++-
 4 files changed, 206 insertions(+), 47 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index ecfd3520b676..b347102f2984 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -785,6 +785,17 @@ config HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
 config HAVE_ARCH_HUGE_VMAP
bool
 
+#
+#  Archs that select this would be capable of PMD-sized vmaps (i.e.,
+#  arch_vmap_pmd_supported() returns true), and they must make no assumptions
+#  that vmalloc memory is mapped with PAGE_SIZE ptes. The VM_NO_HUGE_VMAP flag
+#  can be used to prohibit arch-specific allocations from using hugepages to
+#  help with this (e.g., modules may require it).
+#
+config HAVE_ARCH_HUGE_VMALLOC
+   depends on HAVE_ARCH_HUGE_VMAP
+   bool
+
 config ARCH_WANT_HUGE_PMD_SHARE
bool
 
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 1f6844e2670a..8341964e6eb5 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -26,6 +26,7 @@ struct notifier_block;/* in notifier.h */
 #define VM_KASAN   0x0080  /* has allocated kasan shadow memory */
 #define VM_FLUSH_RESET_PERMS   0x0100  /* reset direct map and flush TLB on unmap, can't be freed in atomic context */
 #define VM_MAP_PUT_PAGES   0x0200  /* put pages and free array in vfree */
+#define VM_NO_HUGE_VMAP    0x0400  /* force PAGE_SIZE pte mapping */
 
 /*
  * VM_KASAN is used slighly differently depending on CONFIG_KASAN_VMALLOC.
@@ -54,6 +55,9 @@ struct vm_struct {
unsigned long   size;
unsigned long   flags;
struct page **pages;
+#ifdef CONFIG_HAVE_ARCH_HUGE_VMALLOC
+   unsigned intpage_order;
+#endif
unsigned intnr_pages;
phys_addr_t phys_addr;
const void  *caller;
@@ -188,6 +192,22 @@ void free_vm_area(struct vm_struct *area);
 extern struct vm_struct *remove_vm_area(const void *addr);
 extern struct vm_struct *find_vm_area(const void *addr);
 
+static inline bool is_vm_area_hugepages(const void *addr)
+{
+   /*
+* This may not 100% tell if the area is mapped with > PAGE_SIZE
+* page table entries, if for some reason the architecture indicates
+* larger sizes are available but decides not to use them, nothing
+* prevents that. This only indicates the size of the physical page
+* allocated in the vmalloc layer.
+*/
+#ifdef CONFIG_HAVE_ARCH_HUGE_VMALLOC
+   return find_vm_area(addr)->page_order > 0;
+#else
+   return false;
+#endif
+}
+
 #ifdef CONFIG_MMU
 int vmap_range(unsigned long addr, unsigned long end,
phys_addr_t phys_addr, pgprot_t prot,
@@ -205,6 +225,7 @@ static inline void set_vm_flush_reset_perms(void *addr)
if (vm)
vm->flags |= VM_FLUSH_RESET_PERMS;
 }
+
 #else
 static inline int
 map_kernel_range_noflush(unsigned long start, unsigned long size,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index cfc72873961d..2e2042f39b8b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -72,6 +72,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -8222,6 +8223,7 @@ void *__init alloc_large_system_hash(const char 
*tablename,
void *table = NULL;
gfp_t gfp_flags;
bool virt;
+   bool huge;
 
/* allow the kernel cmdline to have a say */
if (!numentries) {
@@ -8289,6 +8291,7 @@ void *__init alloc_large_system_hash(const char 
*tablename,
} else if (get_order(size) >= MAX_ORDER || hashdist) {
table = __vmalloc(size, gfp_flags);
virt = true;
+   huge = is_vm_area_hugepages(table);
} else {
/*
 * If bucketsize is not a power-of-two, we may free
@@ -8305,7 +8308,7 @@ void *__init alloc_large_system_hash(const char 
*tablename,
 
pr_info("%s hash table entries: %ld (order: %d, %lu bytes, %s)\n",
tablename, 1

[PATCH v13 14/14] powerpc/64s/radix: Enable huge vmalloc mappings

2021-03-17 Thread Nicholas Piggin
This reduces TLB misses by nearly 30x on a `git diff` workload on a
2-node POWER9 (59,800 -> 2,100) and reduces CPU cycles by 0.54%, due
to vfs hashes being allocated with 2MB pages.

Cc: linuxppc-...@lists.ozlabs.org
Acked-by: Michael Ellerman 
Signed-off-by: Nicholas Piggin 
---
 .../admin-guide/kernel-parameters.txt |  2 ++
 arch/powerpc/Kconfig  |  1 +
 arch/powerpc/kernel/module.c  | 22 +++
 3 files changed, 21 insertions(+), 4 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index 04545725f187..1f481f904895 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3243,6 +3243,8 @@
 
nohugeiomap [KNL,X86,PPC,ARM64] Disable kernel huge I/O mappings.
 
+   nohugevmalloc   [PPC] Disable kernel huge vmalloc mappings.
+
nosmt   [KNL,S390] Disable symmetric multithreading (SMT).
Equivalent to smt=1.
 
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 386ae12d8523..b7cade9566da 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -181,6 +181,7 @@ config PPC
select GENERIC_GETTIMEOFDAY
select HAVE_ARCH_AUDITSYSCALL
select HAVE_ARCH_HUGE_VMAP  if PPC_BOOK3S_64 && 
PPC_RADIX_MMU
+   select HAVE_ARCH_HUGE_VMALLOC   if HAVE_ARCH_HUGE_VMAP
select HAVE_ARCH_JUMP_LABEL
select HAVE_ARCH_KASAN  if PPC32 && PPC_PAGE_SHIFT <= 14
select HAVE_ARCH_KASAN_VMALLOC  if PPC32 && PPC_PAGE_SHIFT <= 14
diff --git a/arch/powerpc/kernel/module.c b/arch/powerpc/kernel/module.c
index a211b0253cdb..cdb2d88c54e7 100644
--- a/arch/powerpc/kernel/module.c
+++ b/arch/powerpc/kernel/module.c
@@ -8,6 +8,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -87,13 +88,26 @@ int module_finalize(const Elf_Ehdr *hdr,
return 0;
 }
 
-#ifdef MODULES_VADDR
 void *module_alloc(unsigned long size)
 {
+   unsigned long start = VMALLOC_START;
+   unsigned long end = VMALLOC_END;
+
+#ifdef MODULES_VADDR
BUILD_BUG_ON(TASK_SIZE > MODULES_VADDR);
+   start = MODULES_VADDR;
+   end = MODULES_END;
+#endif
+
+   /*
+* Don't do huge page allocations for modules yet until more testing
+* is done. STRICT_MODULE_RWX may require extra work to support this
+* too.
+*/
 
-   return __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END, GFP_KERNEL,
-   PAGE_KERNEL_EXEC, VM_FLUSH_RESET_PERMS, NUMA_NO_NODE,
+   return __vmalloc_node_range(size, 1, start, end, GFP_KERNEL,
+   PAGE_KERNEL_EXEC,
+   VM_NO_HUGE_VMAP | VM_FLUSH_RESET_PERMS,
+   NUMA_NO_NODE,
__builtin_return_address(0));
 }
-#endif
-- 
2.23.0



[PATCH v13 10/14] mm/vmalloc: provide fallback arch huge vmap support functions

2021-03-17 Thread Nicholas Piggin
If an architecture doesn't support a particular page table level as
a huge vmap page size then allow it to skip defining the support
query function.
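
(For example, an architecture that only supports PMD-sized huge vmaps would now
need just the following in its asm/vmalloc.h. Hypothetical arch shown only to
illustrate the fallback: the p4d/pud queries it leaves undefined fall back to the
generic "return false" stubs.)

#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP

#define arch_vmap_pmd_supported arch_vmap_pmd_supported
static inline bool arch_vmap_pmd_supported(pgprot_t prot)
{
	return true;	/* PMD-sized vmaps OK; no p4d/pud overrides needed */
}

#endif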

Suggested-by: Christoph Hellwig 
Signed-off-by: Nicholas Piggin 
---
 arch/arm64/include/asm/vmalloc.h   |  7 +++
 arch/powerpc/include/asm/vmalloc.h |  7 +++
 arch/x86/include/asm/vmalloc.h | 13 +
 include/linux/vmalloc.h| 24 
 4 files changed, 31 insertions(+), 20 deletions(-)

diff --git a/arch/arm64/include/asm/vmalloc.h b/arch/arm64/include/asm/vmalloc.h
index fc9a12d6cc1a..7a22aeea9bb5 100644
--- a/arch/arm64/include/asm/vmalloc.h
+++ b/arch/arm64/include/asm/vmalloc.h
@@ -4,11 +4,8 @@
 #include 
 
 #ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
-static inline bool arch_vmap_p4d_supported(pgprot_t prot)
-{
-   return false;
-}
 
+#define arch_vmap_pud_supported arch_vmap_pud_supported
 static inline bool arch_vmap_pud_supported(pgprot_t prot)
 {
/*
@@ -19,11 +16,13 @@ static inline bool arch_vmap_pud_supported(pgprot_t prot)
   !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS);
 }
 
+#define arch_vmap_pmd_supported arch_vmap_pmd_supported
 static inline bool arch_vmap_pmd_supported(pgprot_t prot)
 {
/* See arch_vmap_pud_supported() */
return !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS);
 }
+
 #endif
 
 #endif /* _ASM_ARM64_VMALLOC_H */
diff --git a/arch/powerpc/include/asm/vmalloc.h 
b/arch/powerpc/include/asm/vmalloc.h
index 3f0c153befb0..4c69ece52a31 100644
--- a/arch/powerpc/include/asm/vmalloc.h
+++ b/arch/powerpc/include/asm/vmalloc.h
@@ -5,21 +5,20 @@
 #include 
 
 #ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
-static inline bool arch_vmap_p4d_supported(pgprot_t prot)
-{
-   return false;
-}
 
+#define arch_vmap_pud_supported arch_vmap_pud_supported
 static inline bool arch_vmap_pud_supported(pgprot_t prot)
 {
/* HPT does not cope with large pages in the vmalloc area */
return radix_enabled();
 }
 
+#define arch_vmap_pmd_supported arch_vmap_pmd_supported
 static inline bool arch_vmap_pmd_supported(pgprot_t prot)
 {
return radix_enabled();
 }
+
 #endif
 
 #endif /* _ASM_POWERPC_VMALLOC_H */
diff --git a/arch/x86/include/asm/vmalloc.h b/arch/x86/include/asm/vmalloc.h
index e714b00fc0ca..49ce331f3ac6 100644
--- a/arch/x86/include/asm/vmalloc.h
+++ b/arch/x86/include/asm/vmalloc.h
@@ -6,24 +6,21 @@
 #include 
 
 #ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
-static inline bool arch_vmap_p4d_supported(pgprot_t prot)
-{
-   return false;
-}
 
+#ifdef CONFIG_X86_64
+#define arch_vmap_pud_supported arch_vmap_pud_supported
 static inline bool arch_vmap_pud_supported(pgprot_t prot)
 {
-#ifdef CONFIG_X86_64
return boot_cpu_has(X86_FEATURE_GBPAGES);
-#else
-   return false;
-#endif
 }
+#endif
 
+#define arch_vmap_pmd_supported arch_vmap_pmd_supported
 static inline bool arch_vmap_pmd_supported(pgprot_t prot)
 {
return boot_cpu_has(X86_FEATURE_PSE);
 }
+
 #endif
 
 #endif /* _ASM_X86_VMALLOC_H */
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 4b897a4a408b..82b45e1f28ff 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -78,10 +78,26 @@ struct vmap_area {
};
 };
 
-#ifndef CONFIG_HAVE_ARCH_HUGE_VMAP
-static inline bool arch_vmap_p4d_supported(pgprot_t prot) { return false; }
-static inline bool arch_vmap_pud_supported(pgprot_t prot) { return false; }
-static inline bool arch_vmap_pmd_supported(pgprot_t prot) { return false; }
+/* archs that select HAVE_ARCH_HUGE_VMAP should override one or more of these 
*/
+#ifndef arch_vmap_p4d_supported
+static inline bool arch_vmap_p4d_supported(pgprot_t prot)
+{
+   return false;
+}
+#endif
+
+#ifndef arch_vmap_pud_supported
+static inline bool arch_vmap_pud_supported(pgprot_t prot)
+{
+   return false;
+}
+#endif
+
+#ifndef arch_vmap_pmd_supported
+static inline bool arch_vmap_pmd_supported(pgprot_t prot)
+{
+   return false;
+}
 #endif
 
 /*
-- 
2.23.0



[PATCH v13 12/14] mm/vmalloc: add vmap_range_noflush variant

2021-03-17 Thread Nicholas Piggin
As a side-effect, the order of the flush_cache_vmap() and
arch_sync_kernel_mappings() calls is switched, but that now matches
the other callers in this file.

Reviewed-by: Christoph Hellwig 
Signed-off-by: Nicholas Piggin 
---
 mm/vmalloc.c | 16 +---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 53414959845d..9455dba58b0e 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -240,7 +240,7 @@ static int vmap_p4d_range(pgd_t *pgd, unsigned long addr, 
unsigned long end,
return 0;
 }
 
-int vmap_range(unsigned long addr, unsigned long end,
+static int vmap_range_noflush(unsigned long addr, unsigned long end,
phys_addr_t phys_addr, pgprot_t prot,
unsigned int max_page_shift)
 {
@@ -263,14 +263,24 @@ int vmap_range(unsigned long addr, unsigned long end,
break;
} while (pgd++, phys_addr += (next - addr), addr = next, addr != end);
 
-   flush_cache_vmap(start, end);
-
if (mask & ARCH_PAGE_TABLE_SYNC_MASK)
arch_sync_kernel_mappings(start, end);
 
return err;
 }
 
+int vmap_range(unsigned long addr, unsigned long end,
+   phys_addr_t phys_addr, pgprot_t prot,
+   unsigned int max_page_shift)
+{
+   int err;
+
+   err = vmap_range_noflush(addr, end, phys_addr, prot, max_page_shift);
+   flush_cache_vmap(addr, end);
+
+   return err;
+}
+
 static void vunmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 pgtbl_mod_mask *mask)
 {
-- 
2.23.0



[PATCH v13 11/14] mm: Move vmap_range from mm/ioremap.c to mm/vmalloc.c

2021-03-17 Thread Nicholas Piggin
This is a generic kernel virtual memory mapper, not specific to ioremap.

Code is unchanged other than making vmap_range non-static.

Reviewed-by: Christoph Hellwig 
Signed-off-by: Nicholas Piggin 
---
 include/linux/vmalloc.h |   3 +
 mm/ioremap.c| 203 
 mm/vmalloc.c| 202 +++
 3 files changed, 205 insertions(+), 203 deletions(-)

diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 82b45e1f28ff..1f6844e2670a 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -189,6 +189,9 @@ extern struct vm_struct *remove_vm_area(const void *addr);
 extern struct vm_struct *find_vm_area(const void *addr);
 
 #ifdef CONFIG_MMU
+int vmap_range(unsigned long addr, unsigned long end,
+   phys_addr_t phys_addr, pgprot_t prot,
+   unsigned int max_page_shift);
 extern int map_kernel_range_noflush(unsigned long start, unsigned long size,
pgprot_t prot, struct page **pages);
 int map_kernel_range(unsigned long start, unsigned long size, pgprot_t prot,
diff --git a/mm/ioremap.c b/mm/ioremap.c
index 3264d0203785..d1dcc7e744ac 100644
--- a/mm/ioremap.c
+++ b/mm/ioremap.c
@@ -28,209 +28,6 @@ early_param("nohugeiomap", set_nohugeiomap);
 static const bool iomap_max_page_shift = PAGE_SHIFT;
 #endif /* CONFIG_HAVE_ARCH_HUGE_VMAP */
 
-static int vmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
-   phys_addr_t phys_addr, pgprot_t prot,
-   pgtbl_mod_mask *mask)
-{
-   pte_t *pte;
-   u64 pfn;
-
-   pfn = phys_addr >> PAGE_SHIFT;
-   pte = pte_alloc_kernel_track(pmd, addr, mask);
-   if (!pte)
-   return -ENOMEM;
-   do {
-   BUG_ON(!pte_none(*pte));
-   set_pte_at(_mm, addr, pte, pfn_pte(pfn, prot));
-   pfn++;
-   } while (pte++, addr += PAGE_SIZE, addr != end);
-   *mask |= PGTBL_PTE_MODIFIED;
-   return 0;
-}
-
-static int vmap_try_huge_pmd(pmd_t *pmd, unsigned long addr, unsigned long end,
-   phys_addr_t phys_addr, pgprot_t prot,
-   unsigned int max_page_shift)
-{
-   if (max_page_shift < PMD_SHIFT)
-   return 0;
-
-   if (!arch_vmap_pmd_supported(prot))
-   return 0;
-
-   if ((end - addr) != PMD_SIZE)
-   return 0;
-
-   if (!IS_ALIGNED(addr, PMD_SIZE))
-   return 0;
-
-   if (!IS_ALIGNED(phys_addr, PMD_SIZE))
-   return 0;
-
-   if (pmd_present(*pmd) && !pmd_free_pte_page(pmd, addr))
-   return 0;
-
-   return pmd_set_huge(pmd, phys_addr, prot);
-}
-
-static int vmap_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
-   phys_addr_t phys_addr, pgprot_t prot,
-   unsigned int max_page_shift, pgtbl_mod_mask *mask)
-{
-   pmd_t *pmd;
-   unsigned long next;
-
-   pmd = pmd_alloc_track(_mm, pud, addr, mask);
-   if (!pmd)
-   return -ENOMEM;
-   do {
-   next = pmd_addr_end(addr, end);
-
-   if (vmap_try_huge_pmd(pmd, addr, next, phys_addr, prot,
-   max_page_shift)) {
-   *mask |= PGTBL_PMD_MODIFIED;
-   continue;
-   }
-
-   if (vmap_pte_range(pmd, addr, next, phys_addr, prot, mask))
-   return -ENOMEM;
-   } while (pmd++, phys_addr += (next - addr), addr = next, addr != end);
-   return 0;
-}
-
-static int vmap_try_huge_pud(pud_t *pud, unsigned long addr, unsigned long end,
-   phys_addr_t phys_addr, pgprot_t prot,
-   unsigned int max_page_shift)
-{
-   if (max_page_shift < PUD_SHIFT)
-   return 0;
-
-   if (!arch_vmap_pud_supported(prot))
-   return 0;
-
-   if ((end - addr) != PUD_SIZE)
-   return 0;
-
-   if (!IS_ALIGNED(addr, PUD_SIZE))
-   return 0;
-
-   if (!IS_ALIGNED(phys_addr, PUD_SIZE))
-   return 0;
-
-   if (pud_present(*pud) && !pud_free_pmd_page(pud, addr))
-   return 0;
-
-   return pud_set_huge(pud, phys_addr, prot);
-}
-
-static int vmap_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
-   phys_addr_t phys_addr, pgprot_t prot,
-   unsigned int max_page_shift, pgtbl_mod_mask *mask)
-{
-   pud_t *pud;
-   unsigned long next;
-
-   pud = pud_alloc_track(_mm, p4d, addr, mask);
-   if (!pud)
-   return -ENOMEM;
-   do {
-   next = pud_addr_end(addr, end);
-
-   if (vmap_try_huge_pud(pud, addr, next, phys_addr, prot,
-   max_page_shift)) {
-   *mask |= PGTBL_PUD_MODI

[PATCH v13 04/14] mm/vmalloc: rename vmap_*_range vmap_pages_*_range

2021-03-17 Thread Nicholas Piggin
The vmalloc mapper operates on a struct page * array rather than a
linear physical address; re-name it to make this distinction clear.

Reviewed-by: Miaohe Lin 
Reviewed-by: Christoph Hellwig 
Signed-off-by: Nicholas Piggin 
---
 mm/vmalloc.c | 16 
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 98e697ac764c..4693fab4f42a 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -189,7 +189,7 @@ void unmap_kernel_range_noflush(unsigned long start, 
unsigned long size)
arch_sync_kernel_mappings(start, end);
 }
 
-static int vmap_pte_range(pmd_t *pmd, unsigned long addr,
+static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr,
unsigned long end, pgprot_t prot, struct page **pages, int *nr,
pgtbl_mod_mask *mask)
 {
@@ -217,7 +217,7 @@ static int vmap_pte_range(pmd_t *pmd, unsigned long addr,
return 0;
 }
 
-static int vmap_pmd_range(pud_t *pud, unsigned long addr,
+static int vmap_pages_pmd_range(pud_t *pud, unsigned long addr,
unsigned long end, pgprot_t prot, struct page **pages, int *nr,
pgtbl_mod_mask *mask)
 {
@@ -229,13 +229,13 @@ static int vmap_pmd_range(pud_t *pud, unsigned long addr,
return -ENOMEM;
do {
next = pmd_addr_end(addr, end);
-   if (vmap_pte_range(pmd, addr, next, prot, pages, nr, mask))
+   if (vmap_pages_pte_range(pmd, addr, next, prot, pages, nr, 
mask))
return -ENOMEM;
} while (pmd++, addr = next, addr != end);
return 0;
 }
 
-static int vmap_pud_range(p4d_t *p4d, unsigned long addr,
+static int vmap_pages_pud_range(p4d_t *p4d, unsigned long addr,
unsigned long end, pgprot_t prot, struct page **pages, int *nr,
pgtbl_mod_mask *mask)
 {
@@ -247,13 +247,13 @@ static int vmap_pud_range(p4d_t *p4d, unsigned long addr,
return -ENOMEM;
do {
next = pud_addr_end(addr, end);
-   if (vmap_pmd_range(pud, addr, next, prot, pages, nr, mask))
+   if (vmap_pages_pmd_range(pud, addr, next, prot, pages, nr, 
mask))
return -ENOMEM;
} while (pud++, addr = next, addr != end);
return 0;
 }
 
-static int vmap_p4d_range(pgd_t *pgd, unsigned long addr,
+static int vmap_pages_p4d_range(pgd_t *pgd, unsigned long addr,
unsigned long end, pgprot_t prot, struct page **pages, int *nr,
pgtbl_mod_mask *mask)
 {
@@ -265,7 +265,7 @@ static int vmap_p4d_range(pgd_t *pgd, unsigned long addr,
return -ENOMEM;
do {
next = p4d_addr_end(addr, end);
-   if (vmap_pud_range(p4d, addr, next, prot, pages, nr, mask))
+   if (vmap_pages_pud_range(p4d, addr, next, prot, pages, nr, 
mask))
return -ENOMEM;
} while (p4d++, addr = next, addr != end);
return 0;
@@ -306,7 +306,7 @@ int map_kernel_range_noflush(unsigned long addr, unsigned 
long size,
next = pgd_addr_end(addr, end);
if (pgd_bad(*pgd))
mask |= PGTBL_PGD_MODIFIED;
-   err = vmap_p4d_range(pgd, addr, next, prot, pages, , );
+   err = vmap_pages_p4d_range(pgd, addr, next, prot, pages, , 
);
if (err)
return err;
} while (pgd++, addr = next, addr != end);
-- 
2.23.0



[PATCH v13 08/14] arm64: inline huge vmap supported functions

2021-03-17 Thread Nicholas Piggin
This allows unsupported levels to be constant folded away, and so
p4d_free_pud_page can be removed because it's no longer linked to.

Cc: Catalin Marinas 
Cc: Will Deacon 
Cc: linux-arm-ker...@lists.infradead.org
Acked-by: Catalin Marinas 
Signed-off-by: Nicholas Piggin 
---
 arch/arm64/include/asm/vmalloc.h | 23 ---
 arch/arm64/mm/mmu.c  | 26 --
 2 files changed, 20 insertions(+), 29 deletions(-)

diff --git a/arch/arm64/include/asm/vmalloc.h b/arch/arm64/include/asm/vmalloc.h
index 597b40405319..fc9a12d6cc1a 100644
--- a/arch/arm64/include/asm/vmalloc.h
+++ b/arch/arm64/include/asm/vmalloc.h
@@ -4,9 +4,26 @@
 #include 
 
 #ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
-bool arch_vmap_p4d_supported(pgprot_t prot);
-bool arch_vmap_pud_supported(pgprot_t prot);
-bool arch_vmap_pmd_supported(pgprot_t prot);
+static inline bool arch_vmap_p4d_supported(pgprot_t prot)
+{
+   return false;
+}
+
+static inline bool arch_vmap_pud_supported(pgprot_t prot)
+{
+   /*
+* Only 4k granule supports level 1 block mappings.
+* SW table walks can't handle removal of intermediate entries.
+*/
+   return IS_ENABLED(CONFIG_ARM64_4K_PAGES) &&
+  !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS);
+}
+
+static inline bool arch_vmap_pmd_supported(pgprot_t prot)
+{
+   /* See arch_vmap_pud_supported() */
+   return !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS);
+}
 #endif
 
 #endif /* _ASM_ARM64_VMALLOC_H */
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 639b9de61b1d..1fb0035b0777 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -1316,27 +1316,6 @@ void *__init fixmap_remap_fdt(phys_addr_t dt_phys, int 
*size, pgprot_t prot)
return dt_virt;
 }
 
-bool arch_vmap_p4d_supported(pgprot_t prot)
-{
-   return false;
-}
-
-bool arch_vmap_pud_supported(pgprot_t prot)
-{
-   /*
-* Only 4k granule supports level 1 block mappings.
-* SW table walks can't handle removal of intermediate entries.
-*/
-   return IS_ENABLED(CONFIG_ARM64_4K_PAGES) &&
-  !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS);
-}
-
-bool arch_vmap_pmd_supported(pgprot_t prot)
-{
-   /* See arch_vmap_pud_supported() */
-   return !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS);
-}
-
 int pud_set_huge(pud_t *pudp, phys_addr_t phys, pgprot_t prot)
 {
pud_t new_pud = pfn_pud(__phys_to_pfn(phys), mk_pud_sect_prot(prot));
@@ -1428,11 +1407,6 @@ int pud_free_pmd_page(pud_t *pudp, unsigned long addr)
return 1;
 }
 
-int p4d_free_pud_page(p4d_t *p4d, unsigned long addr)
-{
-   return 0;   /* Don't attempt a block mapping */
-}
-
 #ifdef CONFIG_MEMORY_HOTPLUG
 static void __remove_pgd_mapping(pgd_t *pgdir, unsigned long start, u64 size)
 {
-- 
2.23.0



[PATCH v13 06/14] mm: HUGE_VMAP arch support cleanup

2021-03-17 Thread Nicholas Piggin
This changes the awkward approach where architectures provide init
functions to determine which levels they can provide large mappings for,
to one where the arch is queried for each call.

This removes code and indirection, and allows constant-folding of dead
code for unsupported levels.

This also adds a prot argument to the arch query. This is unused
currently but could help with some architectures (e.g., some powerpc
processors can't map uncacheable memory with large pages).
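
As a purely illustrative sketch of how that prot argument might be used
later (nothing in this patch does this, and the _PAGE_NO_CACHE test is
only a stand-in flag for the example), an arch query could refuse huge
mappings for cache-inhibited memory:

	static inline bool arch_vmap_pmd_supported(pgprot_t prot)
	{
		/* Hypothetical: no huge pages for uncacheable mappings. */
		if (pgprot_val(prot) & _PAGE_NO_CACHE)
			return false;
		return radix_enabled();
	}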

Cc: linuxppc-...@lists.ozlabs.org
Cc: Catalin Marinas 
Cc: Will Deacon 
Cc: linux-arm-ker...@lists.infradead.org
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: x...@kernel.org
Cc: "H. Peter Anvin" 
Reviewed-by: Ding Tianhong 
Acked-by: Catalin Marinas  [arm64]
Signed-off-by: Nicholas Piggin 
---
 arch/arm64/include/asm/vmalloc.h |  8 ++
 arch/arm64/mm/mmu.c  | 10 +--
 arch/powerpc/include/asm/vmalloc.h   |  8 ++
 arch/powerpc/mm/book3s64/radix_pgtable.c |  8 +-
 arch/x86/include/asm/vmalloc.h   |  7 ++
 arch/x86/mm/ioremap.c| 12 +--
 include/linux/io.h   |  9 ---
 include/linux/vmalloc.h  |  6 ++
 init/main.c  |  1 -
 mm/debug_vm_pgtable.c|  4 +-
 mm/ioremap.c | 94 ++--
 11 files changed, 87 insertions(+), 80 deletions(-)

diff --git a/arch/arm64/include/asm/vmalloc.h b/arch/arm64/include/asm/vmalloc.h
index 2ca708ab9b20..597b40405319 100644
--- a/arch/arm64/include/asm/vmalloc.h
+++ b/arch/arm64/include/asm/vmalloc.h
@@ -1,4 +1,12 @@
 #ifndef _ASM_ARM64_VMALLOC_H
 #define _ASM_ARM64_VMALLOC_H
 
+#include 
+
+#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
+bool arch_vmap_p4d_supported(pgprot_t prot);
+bool arch_vmap_pud_supported(pgprot_t prot);
+bool arch_vmap_pmd_supported(pgprot_t prot);
+#endif
+
 #endif /* _ASM_ARM64_VMALLOC_H */
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 7484ea4f6ba0..639b9de61b1d 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -1316,12 +1316,12 @@ void *__init fixmap_remap_fdt(phys_addr_t dt_phys, int 
*size, pgprot_t prot)
return dt_virt;
 }
 
-int __init arch_ioremap_p4d_supported(void)
+bool arch_vmap_p4d_supported(pgprot_t prot)
 {
-   return 0;
+   return false;
 }
 
-int __init arch_ioremap_pud_supported(void)
+bool arch_vmap_pud_supported(pgprot_t prot)
 {
/*
 * Only 4k granule supports level 1 block mappings.
@@ -1331,9 +1331,9 @@ int __init arch_ioremap_pud_supported(void)
   !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS);
 }
 
-int __init arch_ioremap_pmd_supported(void)
+bool arch_vmap_pmd_supported(pgprot_t prot)
 {
-   /* See arch_ioremap_pud_supported() */
+   /* See arch_vmap_pud_supported() */
return !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS);
 }
 
diff --git a/arch/powerpc/include/asm/vmalloc.h 
b/arch/powerpc/include/asm/vmalloc.h
index b992dfaaa161..105abb73f075 100644
--- a/arch/powerpc/include/asm/vmalloc.h
+++ b/arch/powerpc/include/asm/vmalloc.h
@@ -1,4 +1,12 @@
 #ifndef _ASM_POWERPC_VMALLOC_H
 #define _ASM_POWERPC_VMALLOC_H
 
+#include 
+
+#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
+bool arch_vmap_p4d_supported(pgprot_t prot);
+bool arch_vmap_pud_supported(pgprot_t prot);
+bool arch_vmap_pmd_supported(pgprot_t prot);
+#endif
+
 #endif /* _ASM_POWERPC_VMALLOC_H */
diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c 
b/arch/powerpc/mm/book3s64/radix_pgtable.c
index 98f0b243c1ab..743807fc210f 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -1082,13 +1082,13 @@ void radix__ptep_modify_prot_commit(struct 
vm_area_struct *vma,
set_pte_at(mm, addr, ptep, pte);
 }
 
-int __init arch_ioremap_pud_supported(void)
+bool arch_vmap_pud_supported(pgprot_t prot)
 {
/* HPT does not cope with large pages in the vmalloc area */
return radix_enabled();
 }
 
-int __init arch_ioremap_pmd_supported(void)
+bool arch_vmap_pmd_supported(pgprot_t prot)
 {
return radix_enabled();
 }
@@ -1182,7 +1182,7 @@ int pmd_free_pte_page(pmd_t *pmd, unsigned long addr)
return 1;
 }
 
-int __init arch_ioremap_p4d_supported(void)
+bool arch_vmap_p4d_supported(pgprot_t prot)
 {
-   return 0;
+   return false;
 }
diff --git a/arch/x86/include/asm/vmalloc.h b/arch/x86/include/asm/vmalloc.h
index 29837740b520..094ea2b565f3 100644
--- a/arch/x86/include/asm/vmalloc.h
+++ b/arch/x86/include/asm/vmalloc.h
@@ -1,6 +1,13 @@
 #ifndef _ASM_X86_VMALLOC_H
 #define _ASM_X86_VMALLOC_H
 
+#include 
 #include 
 
+#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
+bool arch_vmap_p4d_supported(pgprot_t prot);
+bool arch_vmap_pud_supported(pgprot_t prot);
+bool arch_vmap_pmd_supported(pgprot_t prot);
+#endif
+
 #endif /* _ASM_X86_VMALLOC_H */
diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
index 9e5ccc56f8e0..fbaf0c447986 100644
--- a/arch/x86/mm/ioremap.c
+++ b/arch/x86/mm

[PATCH v13 07/14] powerpc: inline huge vmap supported functions

2021-03-17 Thread Nicholas Piggin
This allows unsupported levels to be constant folded away, and so
p4d_free_pud_page can be removed because it's no longer linked to.

Cc: linuxppc-...@lists.ozlabs.org
Acked-by: Michael Ellerman 
Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/include/asm/vmalloc.h   | 19 ---
 arch/powerpc/mm/book3s64/radix_pgtable.c | 21 -
 2 files changed, 16 insertions(+), 24 deletions(-)

diff --git a/arch/powerpc/include/asm/vmalloc.h 
b/arch/powerpc/include/asm/vmalloc.h
index 105abb73f075..3f0c153befb0 100644
--- a/arch/powerpc/include/asm/vmalloc.h
+++ b/arch/powerpc/include/asm/vmalloc.h
@@ -1,12 +1,25 @@
 #ifndef _ASM_POWERPC_VMALLOC_H
 #define _ASM_POWERPC_VMALLOC_H
 
+#include 
 #include 
 
 #ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
-bool arch_vmap_p4d_supported(pgprot_t prot);
-bool arch_vmap_pud_supported(pgprot_t prot);
-bool arch_vmap_pmd_supported(pgprot_t prot);
+static inline bool arch_vmap_p4d_supported(pgprot_t prot)
+{
+   return false;
+}
+
+static inline bool arch_vmap_pud_supported(pgprot_t prot)
+{
+   /* HPT does not cope with large pages in the vmalloc area */
+   return radix_enabled();
+}
+
+static inline bool arch_vmap_pmd_supported(pgprot_t prot)
+{
+   return radix_enabled();
+}
 #endif
 
 #endif /* _ASM_POWERPC_VMALLOC_H */
diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c 
b/arch/powerpc/mm/book3s64/radix_pgtable.c
index 743807fc210f..8da62afccee5 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -1082,22 +1082,6 @@ void radix__ptep_modify_prot_commit(struct 
vm_area_struct *vma,
set_pte_at(mm, addr, ptep, pte);
 }
 
-bool arch_vmap_pud_supported(pgprot_t prot)
-{
-   /* HPT does not cope with large pages in the vmalloc area */
-   return radix_enabled();
-}
-
-bool arch_vmap_pmd_supported(pgprot_t prot)
-{
-   return radix_enabled();
-}
-
-int p4d_free_pud_page(p4d_t *p4d, unsigned long addr)
-{
-   return 0;
-}
-
 int pud_set_huge(pud_t *pud, phys_addr_t addr, pgprot_t prot)
 {
pte_t *ptep = (pte_t *)pud;
@@ -1181,8 +1165,3 @@ int pmd_free_pte_page(pmd_t *pmd, unsigned long addr)
 
return 1;
 }
-
-bool arch_vmap_p4d_supported(pgprot_t prot)
-{
-   return false;
-}
-- 
2.23.0



[PATCH v13 09/14] x86: inline huge vmap supported functions

2021-03-17 Thread Nicholas Piggin
This allows unsupported levels to be constant folded away, and so
p4d_free_pud_page can be removed because it's no longer linked to.

Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: x...@kernel.org
Cc: "H. Peter Anvin" 
Signed-off-by: Nicholas Piggin 
---
 arch/x86/include/asm/vmalloc.h | 22 +++---
 arch/x86/mm/ioremap.c  | 21 -
 arch/x86/mm/pgtable.c  | 13 -
 3 files changed, 19 insertions(+), 37 deletions(-)

diff --git a/arch/x86/include/asm/vmalloc.h b/arch/x86/include/asm/vmalloc.h
index 094ea2b565f3..e714b00fc0ca 100644
--- a/arch/x86/include/asm/vmalloc.h
+++ b/arch/x86/include/asm/vmalloc.h
@@ -1,13 +1,29 @@
 #ifndef _ASM_X86_VMALLOC_H
 #define _ASM_X86_VMALLOC_H
 
+#include 
 #include 
 #include 
 
 #ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
-bool arch_vmap_p4d_supported(pgprot_t prot);
-bool arch_vmap_pud_supported(pgprot_t prot);
-bool arch_vmap_pmd_supported(pgprot_t prot);
+static inline bool arch_vmap_p4d_supported(pgprot_t prot)
+{
+   return false;
+}
+
+static inline bool arch_vmap_pud_supported(pgprot_t prot)
+{
+#ifdef CONFIG_X86_64
+   return boot_cpu_has(X86_FEATURE_GBPAGES);
+#else
+   return false;
+#endif
+}
+
+static inline bool arch_vmap_pmd_supported(pgprot_t prot)
+{
+   return boot_cpu_has(X86_FEATURE_PSE);
+}
 #endif
 
 #endif /* _ASM_X86_VMALLOC_H */
diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
index fbaf0c447986..12c686c65ea9 100644
--- a/arch/x86/mm/ioremap.c
+++ b/arch/x86/mm/ioremap.c
@@ -481,27 +481,6 @@ void iounmap(volatile void __iomem *addr)
 }
 EXPORT_SYMBOL(iounmap);
 
-#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
-bool arch_vmap_p4d_supported(pgprot_t prot)
-{
-   return false;
-}
-
-bool arch_vmap_pud_supported(pgprot_t prot)
-{
-#ifdef CONFIG_X86_64
-   return boot_cpu_has(X86_FEATURE_GBPAGES);
-#else
-   return false;
-#endif
-}
-
-bool arch_vmap_pmd_supported(pgprot_t prot)
-{
-   return boot_cpu_has(X86_FEATURE_PSE);
-}
-#endif
-
 /*
  * Convert a physical pointer to a virtual kernel pointer for /dev/mem
  * access
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index f6a9e2e36642..d27cf69e811d 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -780,14 +780,6 @@ int pmd_clear_huge(pmd_t *pmd)
return 0;
 }
 
-/*
- * Until we support 512GB pages, skip them in the vmap area.
- */
-int p4d_free_pud_page(p4d_t *p4d, unsigned long addr)
-{
-   return 0;
-}
-
 #ifdef CONFIG_X86_64
 /**
  * pud_free_pmd_page - Clear pud entry and free pmd page.
@@ -861,11 +853,6 @@ int pmd_free_pte_page(pmd_t *pmd, unsigned long addr)
 
 #else /* !CONFIG_X86_64 */
 
-int pud_free_pmd_page(pud_t *pud, unsigned long addr)
-{
-   return pud_none(*pud);
-}
-
 /*
  * Disable free page handling on x86-PAE. This assures that ioremap()
  * does not update sync'd pmd entries. See vmalloc_sync_one().
-- 
2.23.0



[PATCH v13 05/14] mm/ioremap: rename ioremap_*_range to vmap_*_range

2021-03-17 Thread Nicholas Piggin
This will be used as a generic kernel virtual mapping function, so
re-name it in preparation.

Reviewed-by: Miaohe Lin 
Reviewed-by: Christoph Hellwig 
Signed-off-by: Nicholas Piggin 
---
 mm/ioremap.c | 64 +++-
 1 file changed, 33 insertions(+), 31 deletions(-)

diff --git a/mm/ioremap.c b/mm/ioremap.c
index 5fa1ab41d152..3f4d36f9745a 100644
--- a/mm/ioremap.c
+++ b/mm/ioremap.c
@@ -61,9 +61,9 @@ static inline int ioremap_pud_enabled(void) { return 0; }
 static inline int ioremap_pmd_enabled(void) { return 0; }
 #endif /* CONFIG_HAVE_ARCH_HUGE_VMAP */
 
-static int ioremap_pte_range(pmd_t *pmd, unsigned long addr,
-   unsigned long end, phys_addr_t phys_addr, pgprot_t prot,
-   pgtbl_mod_mask *mask)
+static int vmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
+   phys_addr_t phys_addr, pgprot_t prot,
+   pgtbl_mod_mask *mask)
 {
pte_t *pte;
u64 pfn;
@@ -81,9 +81,8 @@ static int ioremap_pte_range(pmd_t *pmd, unsigned long addr,
return 0;
 }
 
-static int ioremap_try_huge_pmd(pmd_t *pmd, unsigned long addr,
-   unsigned long end, phys_addr_t phys_addr,
-   pgprot_t prot)
+static int vmap_try_huge_pmd(pmd_t *pmd, unsigned long addr, unsigned long end,
+   phys_addr_t phys_addr, pgprot_t prot)
 {
if (!ioremap_pmd_enabled())
return 0;
@@ -103,9 +102,9 @@ static int ioremap_try_huge_pmd(pmd_t *pmd, unsigned long 
addr,
return pmd_set_huge(pmd, phys_addr, prot);
 }
 
-static inline int ioremap_pmd_range(pud_t *pud, unsigned long addr,
-   unsigned long end, phys_addr_t phys_addr, pgprot_t prot,
-   pgtbl_mod_mask *mask)
+static int vmap_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
+   phys_addr_t phys_addr, pgprot_t prot,
+   pgtbl_mod_mask *mask)
 {
pmd_t *pmd;
unsigned long next;
@@ -116,20 +115,19 @@ static inline int ioremap_pmd_range(pud_t *pud, unsigned 
long addr,
do {
next = pmd_addr_end(addr, end);
 
-   if (ioremap_try_huge_pmd(pmd, addr, next, phys_addr, prot)) {
+   if (vmap_try_huge_pmd(pmd, addr, next, phys_addr, prot)) {
*mask |= PGTBL_PMD_MODIFIED;
continue;
}
 
-   if (ioremap_pte_range(pmd, addr, next, phys_addr, prot, mask))
+   if (vmap_pte_range(pmd, addr, next, phys_addr, prot, mask))
return -ENOMEM;
} while (pmd++, phys_addr += (next - addr), addr = next, addr != end);
return 0;
 }
 
-static int ioremap_try_huge_pud(pud_t *pud, unsigned long addr,
-   unsigned long end, phys_addr_t phys_addr,
-   pgprot_t prot)
+static int vmap_try_huge_pud(pud_t *pud, unsigned long addr, unsigned long end,
+   phys_addr_t phys_addr, pgprot_t prot)
 {
if (!ioremap_pud_enabled())
return 0;
@@ -149,9 +147,9 @@ static int ioremap_try_huge_pud(pud_t *pud, unsigned long 
addr,
return pud_set_huge(pud, phys_addr, prot);
 }
 
-static inline int ioremap_pud_range(p4d_t *p4d, unsigned long addr,
-   unsigned long end, phys_addr_t phys_addr, pgprot_t prot,
-   pgtbl_mod_mask *mask)
+static int vmap_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
+   phys_addr_t phys_addr, pgprot_t prot,
+   pgtbl_mod_mask *mask)
 {
pud_t *pud;
unsigned long next;
@@ -162,20 +160,19 @@ static inline int ioremap_pud_range(p4d_t *p4d, unsigned 
long addr,
do {
next = pud_addr_end(addr, end);
 
-   if (ioremap_try_huge_pud(pud, addr, next, phys_addr, prot)) {
+   if (vmap_try_huge_pud(pud, addr, next, phys_addr, prot)) {
*mask |= PGTBL_PUD_MODIFIED;
continue;
}
 
-   if (ioremap_pmd_range(pud, addr, next, phys_addr, prot, mask))
+   if (vmap_pmd_range(pud, addr, next, phys_addr, prot, mask))
return -ENOMEM;
} while (pud++, phys_addr += (next - addr), addr = next, addr != end);
return 0;
 }
 
-static int ioremap_try_huge_p4d(p4d_t *p4d, unsigned long addr,
-   unsigned long end, phys_addr_t phys_addr,
-   pgprot_t prot)
+static int vmap_try_huge_p4d(p4d_t *p4d, unsigned long addr, unsigned long end,
+   phys_addr_t phys_addr, pgprot_t prot)
 {
if (!ioremap_p4d_enabled())
return 0;
@@ -195,9 +192,9 @@ static int ioremap_try_huge_p4d(p4d_t *p4d, unsigned long 
addr,
return p4d_set_huge(p4d, phys_addr, prot);
 }
 
-static inline int

[PATCH v13 03/14] mm: apply_to_pte_range warn and fail if a large pte is encountered

2021-03-17 Thread Nicholas Piggin
apply_to_pte_range might mistake a large pte for bad, or treat it as a
page table, resulting in a crash or corruption. Add a test to warn and
return error if large entries are found.
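
Concretely, each level of the walk gains a leaf check of this form
(taken from the pmd hunk below), so a huge entry is reported instead of
being dereferenced as a page table:

	if (WARN_ON_ONCE(pmd_leaf(*pmd)))
		return -EINVAL;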

Reviewed-by: Miaohe Lin 
Reviewed-by: Christoph Hellwig 
Signed-off-by: Nicholas Piggin 
---
 mm/memory.c | 66 +++--
 1 file changed, 49 insertions(+), 17 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 5efa07fb6cdc..ccaf74f070c0 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2446,13 +2446,21 @@ static int apply_to_pmd_range(struct mm_struct *mm, 
pud_t *pud,
}
do {
next = pmd_addr_end(addr, end);
-   if (create || !pmd_none_or_clear_bad(pmd)) {
-   err = apply_to_pte_range(mm, pmd, addr, next, fn, data,
-create, mask);
-   if (err)
-   break;
+   if (pmd_none(*pmd) && !create)
+   continue;
+   if (WARN_ON_ONCE(pmd_leaf(*pmd)))
+   return -EINVAL;
+   if (!pmd_none(*pmd) && WARN_ON_ONCE(pmd_bad(*pmd))) {
+   if (!create)
+   continue;
+   pmd_clear_bad(pmd);
}
+   err = apply_to_pte_range(mm, pmd, addr, next,
+fn, data, create, mask);
+   if (err)
+   break;
} while (pmd++, addr = next, addr != end);
+
return err;
 }
 
@@ -2474,13 +2482,21 @@ static int apply_to_pud_range(struct mm_struct *mm, 
p4d_t *p4d,
}
do {
next = pud_addr_end(addr, end);
-   if (create || !pud_none_or_clear_bad(pud)) {
-   err = apply_to_pmd_range(mm, pud, addr, next, fn, data,
-create, mask);
-   if (err)
-   break;
+   if (pud_none(*pud) && !create)
+   continue;
+   if (WARN_ON_ONCE(pud_leaf(*pud)))
+   return -EINVAL;
+   if (!pud_none(*pud) && WARN_ON_ONCE(pud_bad(*pud))) {
+   if (!create)
+   continue;
+   pud_clear_bad(pud);
}
+   err = apply_to_pmd_range(mm, pud, addr, next,
+fn, data, create, mask);
+   if (err)
+   break;
} while (pud++, addr = next, addr != end);
+
return err;
 }
 
@@ -2502,13 +2518,21 @@ static int apply_to_p4d_range(struct mm_struct *mm, 
pgd_t *pgd,
}
do {
next = p4d_addr_end(addr, end);
-   if (create || !p4d_none_or_clear_bad(p4d)) {
-   err = apply_to_pud_range(mm, p4d, addr, next, fn, data,
-create, mask);
-   if (err)
-   break;
+   if (p4d_none(*p4d) && !create)
+   continue;
+   if (WARN_ON_ONCE(p4d_leaf(*p4d)))
+   return -EINVAL;
+   if (!p4d_none(*p4d) && WARN_ON_ONCE(p4d_bad(*p4d))) {
+   if (!create)
+   continue;
+   p4d_clear_bad(p4d);
}
+   err = apply_to_pud_range(mm, p4d, addr, next,
+fn, data, create, mask);
+   if (err)
+   break;
} while (p4d++, addr = next, addr != end);
+
return err;
 }
 
@@ -2528,9 +2552,17 @@ static int __apply_to_page_range(struct mm_struct *mm, 
unsigned long addr,
pgd = pgd_offset(mm, addr);
do {
next = pgd_addr_end(addr, end);
-   if (!create && pgd_none_or_clear_bad(pgd))
+   if (pgd_none(*pgd) && !create)
continue;
-   err = apply_to_p4d_range(mm, pgd, addr, next, fn, data, create, 
);
+   if (WARN_ON_ONCE(pgd_leaf(*pgd)))
+   return -EINVAL;
+   if (!pgd_none(*pgd) && WARN_ON_ONCE(pgd_bad(*pgd))) {
+   if (!create)
+   continue;
+   pgd_clear_bad(pgd);
+   }
+   err = apply_to_p4d_range(mm, pgd, addr, next,
+fn, data, create, );
if (err)
break;
} while (pgd++, addr = next, addr != end);
-- 
2.23.0



[PATCH v13 02/14] mm/vmalloc: fix HUGE_VMAP regression by enabling huge pages in vmalloc_to_page

2021-03-17 Thread Nicholas Piggin
vmalloc_to_page returns NULL for addresses mapped by larger pages[*].
Whether or not a vmap is huge depends on the architecture details,
alignments, boot options, etc., which the caller can not be expected
to know. Therefore HUGE_VMAP is a regression for vmalloc_to_page.

This change teaches vmalloc_to_page about larger pages, and makes it
return the struct page that corresponds to the offset within the large
page. This makes the API agnostic to mapping implementation details.

[*] As explained by commit 029c54b095995 ("mm/vmalloc.c: huge-vmap:
fail gracefully on unexpected huge vmap mappings")

Reviewed-by: Miaohe Lin 
Reviewed-by: Christoph Hellwig 
Signed-off-by: Nicholas Piggin 
---
 mm/vmalloc.c | 41 ++---
 1 file changed, 26 insertions(+), 15 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 4f5f8c907897..98e697ac764c 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -34,7 +34,7 @@
 #include 
 #include 
 #include 
-
+#include 
 #include 
 #include 
 #include 
@@ -343,7 +343,9 @@ int is_vmalloc_or_module_addr(const void *x)
 }
 
 /*
- * Walk a vmap address to the struct page it maps.
+ * Walk a vmap address to the struct page it maps. Huge vmap mappings will
+ * return the tail page that corresponds to the base page address, which
+ * matches small vmap mappings.
  */
 struct page *vmalloc_to_page(const void *vmalloc_addr)
 {
@@ -363,25 +365,33 @@ struct page *vmalloc_to_page(const void *vmalloc_addr)
 
if (pgd_none(*pgd))
return NULL;
+   if (WARN_ON_ONCE(pgd_leaf(*pgd)))
+   return NULL; /* XXX: no allowance for huge pgd */
+   if (WARN_ON_ONCE(pgd_bad(*pgd)))
+   return NULL;
+
p4d = p4d_offset(pgd, addr);
if (p4d_none(*p4d))
return NULL;
-   pud = pud_offset(p4d, addr);
+   if (p4d_leaf(*p4d))
+   return p4d_page(*p4d) + ((addr & ~P4D_MASK) >> PAGE_SHIFT);
+   if (WARN_ON_ONCE(p4d_bad(*p4d)))
+   return NULL;
 
-   /*
-* Don't dereference bad PUD or PMD (below) entries. This will also
-* identify huge mappings, which we may encounter on architectures
-* that define CONFIG_HAVE_ARCH_HUGE_VMAP=y. Such regions will be
-* identified as vmalloc addresses by is_vmalloc_addr(), but are
-* not [unambiguously] associated with a struct page, so there is
-* no correct value to return for them.
-*/
-   WARN_ON_ONCE(pud_bad(*pud));
-   if (pud_none(*pud) || pud_bad(*pud))
+   pud = pud_offset(p4d, addr);
+   if (pud_none(*pud))
+   return NULL;
+   if (pud_leaf(*pud))
+   return pud_page(*pud) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
+   if (WARN_ON_ONCE(pud_bad(*pud)))
return NULL;
+
pmd = pmd_offset(pud, addr);
-   WARN_ON_ONCE(pmd_bad(*pmd));
-   if (pmd_none(*pmd) || pmd_bad(*pmd))
+   if (pmd_none(*pmd))
+   return NULL;
+   if (pmd_leaf(*pmd))
+   return pmd_page(*pmd) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
+   if (WARN_ON_ONCE(pmd_bad(*pmd)))
return NULL;
 
ptep = pte_offset_map(pmd, addr);
@@ -389,6 +399,7 @@ struct page *vmalloc_to_page(const void *vmalloc_addr)
if (pte_present(pte))
page = pte_page(pte);
pte_unmap(ptep);
+
return page;
 }
 EXPORT_SYMBOL(vmalloc_to_page);
-- 
2.23.0



[PATCH v13 01/14] ARM: mm: add missing pud_page define to 2-level page tables

2021-03-17 Thread Nicholas Piggin
ARM uses its own PMD folding scheme, which is missing pud_page; it
should just pass through to pmd_page. Move this from the 3-level
page table header to the common header.
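
With the 2-level (folded) layout the definition is just a pass-through,
as the hunk below adds to the common header:

	#define pud_page(pud)	pmd_page(__pmd(pud_val(pud)))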

Cc: Russell King 
Cc: Ding Tianhong 
Cc: linux-arm-ker...@lists.infradead.org
Signed-off-by: Nicholas Piggin 
---
 arch/arm/include/asm/pgtable-3level.h | 2 --
 arch/arm/include/asm/pgtable.h| 3 +++
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/arm/include/asm/pgtable-3level.h 
b/arch/arm/include/asm/pgtable-3level.h
index 2b85d175e999..d4edab51a77c 100644
--- a/arch/arm/include/asm/pgtable-3level.h
+++ b/arch/arm/include/asm/pgtable-3level.h
@@ -186,8 +186,6 @@ static inline pte_t pte_mkspecial(pte_t pte)
 
 #define pmd_write(pmd) (pmd_isclear((pmd), L_PMD_SECT_RDONLY))
 #define pmd_dirty(pmd) (pmd_isset((pmd), L_PMD_SECT_DIRTY))
-#define pud_page(pud)  pmd_page(__pmd(pud_val(pud)))
-#define pud_write(pud) pmd_write(__pmd(pud_val(pud)))
 
 #define pmd_hugewillfault(pmd) (!pmd_young(pmd) || !pmd_write(pmd))
 #define pmd_thp_or_huge(pmd)   (pmd_huge(pmd) || pmd_trans_huge(pmd))
diff --git a/arch/arm/include/asm/pgtable.h b/arch/arm/include/asm/pgtable.h
index c02f24400369..d63a5bb6bd0c 100644
--- a/arch/arm/include/asm/pgtable.h
+++ b/arch/arm/include/asm/pgtable.h
@@ -166,6 +166,9 @@ extern struct page *empty_zero_page;
 
 extern pgd_t swapper_pg_dir[PTRS_PER_PGD];
 
+#define pud_page(pud)  pmd_page(__pmd(pud_val(pud)))
+#define pud_write(pud) pmd_write(__pmd(pud_val(pud)))
+
 #define pmd_none(pmd)  (!pmd_val(pmd))
 
 static inline pte_t *pmd_page_vaddr(pmd_t pmd)
-- 
2.23.0



[PATCH v13 00/14] huge vmalloc mappings

2021-03-17 Thread Nicholas Piggin
Important compound page fix thanks to Ding Tianhong. 

Thanks,
Nick

Since v12:
- Use compound pages so it works with remap_vmalloc_range [noticed by Ding]
- Fix debug_vm_pgtable.c compile error.

Since v11:
- ARM compile fix (patch 1)
- debug_vm_pgtable compile fix

Since v10:
- Fixed code style, most > 80 columns, tweaked patch titles, etc [thanks Christoph]
- Made huge vmalloc code and data structure compile away if unselected
  [Christoph]
- Archs only have to provide arch_vmap_p?d_supported for levels they
  implement [Christoph]

Since v9:
- Fixed intermediate build breakage on x86-32 !PAE [thanks Ding]
- Fixed small page fallback case vm_struct double-free [thanks Ding]

Since v8:
- Fixed nommu compile.
- Added Kconfig option help text
- Added VM_NOHUGE which should help archs implement it [suggested by Rick]

Since v7:
- Rebase, added some acks, compile fix
- Removed "order=" from vmallocinfo, it's a bit confusing (nr_pages
  is in small page size for compatibility).
- Added arch_vmap_pmd_supported() test before starting to allocate
  the large page, rather than only testing it when doing the map, to
  avoid unsupported configs trying to allocate huge pages for no
  reason.

Since v6:
- Fixed a false positive warning introduced in patch 2, found by
  kbuild test robot.

Since v5:
- Split arch changes out better and make the constant folding work
- Avoid most of the 80 column wrap, fix a reference to lib/ioremap.c
- Fix compile error on some archs

*** BLURB HERE ***

Nicholas Piggin (14):
  ARM: mm: add missing pud_page define to 2-level page tables
  mm/vmalloc: fix HUGE_VMAP regression by enabling huge pages in
vmalloc_to_page
  mm: apply_to_pte_range warn and fail if a large pte is encountered
  mm/vmalloc: rename vmap_*_range vmap_pages_*_range
  mm/ioremap: rename ioremap_*_range to vmap_*_range
  mm: HUGE_VMAP arch support cleanup
  powerpc: inline huge vmap supported functions
  arm64: inline huge vmap supported functions
  x86: inline huge vmap supported functions
  mm/vmalloc: provide fallback arch huge vmap support functions
  mm: Move vmap_range from mm/ioremap.c to mm/vmalloc.c
  mm/vmalloc: add vmap_range_noflush variant
  mm/vmalloc: Hugepage vmalloc mappings
  powerpc/64s/radix: Enable huge vmalloc mappings

 .../admin-guide/kernel-parameters.txt |   2 +
 arch/Kconfig  |  11 +
 arch/arm/include/asm/pgtable-3level.h |   2 -
 arch/arm/include/asm/pgtable.h|   3 +
 arch/arm64/include/asm/vmalloc.h  |  24 +
 arch/arm64/mm/mmu.c   |  26 -
 arch/powerpc/Kconfig  |   1 +
 arch/powerpc/include/asm/vmalloc.h|  20 +
 arch/powerpc/kernel/module.c  |  22 +-
 arch/powerpc/mm/book3s64/radix_pgtable.c  |  21 -
 arch/x86/include/asm/vmalloc.h|  20 +
 arch/x86/mm/ioremap.c |  19 -
 arch/x86/mm/pgtable.c |  13 -
 include/linux/io.h|   9 -
 include/linux/vmalloc.h   |  46 ++
 init/main.c   |   1 -
 mm/debug_vm_pgtable.c |   4 +-
 mm/ioremap.c  | 225 +---
 mm/memory.c   |  66 ++-
 mm/page_alloc.c   |   5 +-
 mm/vmalloc.c  | 485 +++---
 21 files changed, 621 insertions(+), 404 deletions(-)

-- 
2.23.0



Re: [PATCH 3/3] powerpc/qspinlock: Use generic smp_cond_load_relaxed

2021-03-15 Thread Nicholas Piggin
Excerpts from Davidlohr Bueso's message of March 9, 2021 11:59 am:
> 49a7d46a06c3 (powerpc: Implement smp_cond_load_relaxed()) added
> busy-waiting pausing with a preferred SMT priority pattern, lowering
> the priority (reducing decode cycles) during the whole loop slowpath.
> 
> However, data shows that while this pattern works well with simple
> spinlocks, queued spinlocks benefit more being kept in medium priority,
> with a cpu_relax() instead, being a low+medium combo on powerpc.

Thanks for tracking this down and the comprehensive results, great
work.

It's only a relatively recent patch, so I think the revert is a
good idea (i.e., don't keep it around for possibly other code to
hit problems with).

One request, could you add a comment in place that references
smp_cond_load_relaxed() so this commit can be found again if
someone looks at it? Something like this

/*
 * smp_cond_load_relaxed was found to have performance problems if
 * implemented with spin_begin()/spin_end().
 */

I wonder if it should have a Fixes: tag to the original commit as
well.

Otherwise,

Acked-by: Nicholas Piggin 

Thanks,
Nick

> 
> Data is from three benchmarks on a Power9: 9008-22L 64 CPUs with
> 2 sockets and 8 threads per core.
> 
> 1. locktorture.
> 
> This is data for the lowest and most artificial/pathological level,
> with increasing thread counts pounding on the lock. Metrics are total
> ops/minute. Despite some small hits in the 4-8 range, scenarios are
> either neutral or favorable to this patch.
> 
> +=+==+==+===+
> | # tasks | vanilla  | dirty| %diff |
> +=+==+==+===+
> | 2   | 46718565 | 48751350 | 4.35  |
> +-+--+--+---+
> | 4   | 51740198 | 50369082 | -2.65 |
> +-+--+--+---+
> | 8   | 63756510 | 62568821 | -1.86 |
> +-+--+--+---+
> | 16  | 67824531 | 70966546 | 4.63  |
> +-+--+--+---+
> | 32  | 53843519 | 61155508 | 13.58 |
> +-+--+--+---+
> | 64  | 53005778 | 53104412 | 0.18  |
> +-+--+--+---+
> | 128 | 53331980 | 54606910 | 2.39  |
> +=+==+==+===+
> 
> 2. sockperf (tcp throughput)
> 
> Here a client will do one-way throughput tests to a localhost server, with
> increasing message sizes, dealing with the sk_lock. This patch shows to put
> the performance of the qspinlock back to par with that of the simple lock:
> 
>simple-spinlock   vanilla  dirty
> Hmean 1473.50 (   0.00%)   54.44 * -25.93%*   73.45 * 
> -0.07%*
> Hmean 100  654.47 (   0.00%)  385.61 * -41.08%*  771.43 * 
> 17.87%*
> Hmean 300 2719.39 (   0.00%) 2181.67 * -19.77%* 2666.50 * 
> -1.94%*
> Hmean 500 4400.59 (   0.00%) 3390.77 * -22.95%* 4322.14 * 
> -1.78%*
> Hmean 850 6726.21 (   0.00%) 5264.03 * -21.74%* 6863.12 * 
> 2.04%*
> 
> 3. dbench (tmpfs)
> 
> Configured to run with up to ncpusx8 clients, it shows both latency and
> throughput metrics. For the latency, with the exception of the 64 case,
> there is really nothing to go by:
>vanilladirty
> Amean latency-1  1.67 (   0.00%)1.67 *   0.09%*
> Amean latency-2  2.15 (   0.00%)2.08 *   3.36%*
> Amean latency-4  2.50 (   0.00%)2.56 *  -2.27%*
> Amean latency-8  2.49 (   0.00%)2.48 *   0.31%*
> Amean latency-16 2.69 (   0.00%)2.72 *  -1.37%*
> Amean latency-32 2.96 (   0.00%)3.04 *  -2.60%*
> Amean latency-64 7.78 (   0.00%)8.17 *  -5.07%*
> Amean latency-512  186.91 (   0.00%)  186.41 *   0.27%*
> 
> For the dbench4 Throughput (misleading but traditional) there's a small
> but rather constant improvement:
> 
>vanilladirty
> Hmean 1849.13 (   0.00%)  851.51 *   0.28%*
> Hmean 2   1664.03 (   0.00%) 1663.94 *  -0.01%*
> Hmean 4   3073.70 (   0.00%) 3104.29 *   1.00%*
> Hmean 8   5624.02 (   0.00%) 5694.16 *   1.25%*
> Hmean 16  9169.49 (   0.00%) 9324.43 *   1.69%*
> Hmean 32 11969.37 (   0.00%)12127.09 *   1.32%*
> Hmean 64 15021.12 (   0.00%)15243.14 *   1.48%*
> Hmean 51214891.27 (   0.00%)15162.11 *   1.82%*
> 
> Measuring the dbench4 Per-VFS Operation latency, shows some very minor
> differences within the noise level, around the 0-1% ranges.
> 
> Signed-off-by: Davidlohr Bueso 
> ---
&

Re: [PATCH v2 36/43] powerpc/32: Set current->thread.regs in C interrupt entry

2021-03-11 Thread Nicholas Piggin
Excerpts from Christophe Leroy's message of March 11, 2021 10:38 pm:
> 
> 
> Le 11/03/2021 à 11:38, Christophe Leroy a écrit :
>> 
>> 
>> Le 10/03/2021 à 02:33, Nicholas Piggin a écrit :
>>> Excerpts from Christophe Leroy's message of March 9, 2021 10:10 pm:
>>>> No need to do that in assembly, do it in C.
>>>
>>> Hmm. No issues with the patch as such, but why does ppc32 need this but
>>> not 64? AFAIKS 64 sets this when a thread is created.
>> 
>> Looks like ppc64 was doing the same in function save_remaining_regs() in 
>> arch/ppc64/kernel/head.S 
>> until commit https://github.com/mpe/linux-fullhistory/commit/e5bb080d
>> 
>> But I can't find what happend to it in that commit.
>> 
>> Where is it done now ? Maybe that's also already done for ppc32.
>> 
> 
> I digged a bit more and found a later bug fix which adds that setting of 
> current->thread.regs at 
> task creation: https://github.com/mpe/linux-fullhistory/commit/3eac1897
> 
> That was in the ppc64 tree only at that time, and was merged into the common 
> powerpc tree via commit 
> https://github.com/mpe/linux-fullhistory/commit/06d67d54

Nice archaeology!

> So we have it for both ppc32 and ppc64 and ppc32 doesn't need to do it at 
> exception entry anymore. 
> I'll remove it.

Good, that's what I hoped (otherwise ppc64 would have been missing 
something).

Thanks,
Nick


Re: [PATCH v2 40/43] powerpc/64s: Make kuap_check_amr() and kuap_get_and_check_amr() generic

2021-03-09 Thread Nicholas Piggin
Excerpts from Christophe Leroy's message of March 9, 2021 10:10 pm:
> In preparation of porting powerpc32 to C syscall entry/exit,
> rename kuap_check_amr() and kuap_get_and_check_amr() as kuap_check()
> and kuap_get_and_check(), and move in the generic asm/kup.h the stub
> for when CONFIG_PPC_KUAP is not selected.

Looks pretty straightforward to me.

While you're renaming things, could kuap_check_amr() be changed to
kuap_assert_locked() or similar? Otherwise,

Reviewed-by: Nicholas Piggin 

> 
> Signed-off-by: Christophe Leroy 
> ---
>  arch/powerpc/include/asm/book3s/64/kup.h | 24 ++--
>  arch/powerpc/include/asm/kup.h   | 10 +-
>  arch/powerpc/kernel/interrupt.c  | 12 ++--
>  arch/powerpc/kernel/irq.c|  2 +-
>  4 files changed, 18 insertions(+), 30 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/book3s/64/kup.h 
> b/arch/powerpc/include/asm/book3s/64/kup.h
> index 8bd905050896..d9b07e9998be 100644
> --- a/arch/powerpc/include/asm/book3s/64/kup.h
> +++ b/arch/powerpc/include/asm/book3s/64/kup.h
> @@ -287,7 +287,7 @@ static inline void kuap_kernel_restore(struct pt_regs 
> *regs,
>*/
>  }
>  
> -static inline unsigned long kuap_get_and_check_amr(void)
> +static inline unsigned long kuap_get_and_check(void)
>  {
>   if (mmu_has_feature(MMU_FTR_BOOK3S_KUAP)) {
>   unsigned long amr = mfspr(SPRN_AMR);
> @@ -298,27 +298,7 @@ static inline unsigned long kuap_get_and_check_amr(void)
>   return 0;
>  }
>  
> -#else /* CONFIG_PPC_PKEY */
> -
> -static inline void kuap_user_restore(struct pt_regs *regs)
> -{
> -}
> -
> -static inline void kuap_kernel_restore(struct pt_regs *regs, unsigned long 
> amr)
> -{
> -}
> -
> -static inline unsigned long kuap_get_and_check_amr(void)
> -{
> - return 0;
> -}
> -
> -#endif /* CONFIG_PPC_PKEY */
> -
> -
> -#ifdef CONFIG_PPC_KUAP
> -
> -static inline void kuap_check_amr(void)
> +static inline void kuap_check(void)
>  {
>   if (IS_ENABLED(CONFIG_PPC_KUAP_DEBUG) && 
> mmu_has_feature(MMU_FTR_BOOK3S_KUAP))
>   WARN_ON_ONCE(mfspr(SPRN_AMR) != AMR_KUAP_BLOCKED);
> diff --git a/arch/powerpc/include/asm/kup.h b/arch/powerpc/include/asm/kup.h
> index 25671f711ec2..b7efa46b3109 100644
> --- a/arch/powerpc/include/asm/kup.h
> +++ b/arch/powerpc/include/asm/kup.h
> @@ -74,7 +74,15 @@ bad_kuap_fault(struct pt_regs *regs, unsigned long 
> address, bool is_write)
>   return false;
>  }
>  
> -static inline void kuap_check_amr(void) { }
> +static inline void kuap_check(void) { }
> +static inline void kuap_save_and_lock(struct pt_regs *regs) { }
> +static inline void kuap_user_restore(struct pt_regs *regs) { }
> +static inline void kuap_kernel_restore(struct pt_regs *regs, unsigned long 
> amr) { }
> +
> +static inline unsigned long kuap_get_and_check(void)
> +{
> + return 0;
> +}
>  
>  /*
>   * book3s/64/kup-radix.h defines these functions for the !KUAP case to flush
> diff --git a/arch/powerpc/kernel/interrupt.c b/arch/powerpc/kernel/interrupt.c
> index 727b7848c9cc..40ed55064e54 100644
> --- a/arch/powerpc/kernel/interrupt.c
> +++ b/arch/powerpc/kernel/interrupt.c
> @@ -76,7 +76,7 @@ notrace long system_call_exception(long r3, long r4, long 
> r5,
>   } else
>  #endif
>  #ifdef CONFIG_PPC64
> - kuap_check_amr();
> + kuap_check();
>  #endif
>  
>   booke_restore_dbcr0();
> @@ -254,7 +254,7 @@ notrace unsigned long syscall_exit_prepare(unsigned long 
> r3,
>   CT_WARN_ON(ct_state() == CONTEXT_USER);
>  
>  #ifdef CONFIG_PPC64
> - kuap_check_amr();
> + kuap_check();
>  #endif
>  
>   regs->result = r3;
> @@ -380,7 +380,7 @@ notrace unsigned long interrupt_exit_user_prepare(struct 
> pt_regs *regs, unsigned
>* AMR can only have been unlocked if we interrupted the kernel.
>*/
>  #ifdef CONFIG_PPC64
> - kuap_check_amr();
> + kuap_check();
>  #endif
>  
>   local_irq_save(flags);
> @@ -451,7 +451,7 @@ notrace unsigned long 
> interrupt_exit_kernel_prepare(struct pt_regs *regs, unsign
>   unsigned long flags;
>   unsigned long ret = 0;
>  #ifdef CONFIG_PPC64
> - unsigned long amr;
> + unsigned long kuap;
>  #endif
>  
>   if (!IS_ENABLED(CONFIG_BOOKE) && !IS_ENABLED(CONFIG_40x) &&
> @@ -467,7 +467,7 @@ notrace unsigned long 
> interrupt_exit_kernel_prepare(struct pt_regs *regs, unsign
>   CT_WARN_ON(ct_state() == CONTEXT_USER);
>  
>  #ifdef CONFIG_PPC64
> - amr = kuap_get_and_check_amr();
> + kuap = kuap_get_and_check();
>  #endi

Re: [PATCH v2 36/43] powerpc/32: Set current->thread.regs in C interrupt entry

2021-03-09 Thread Nicholas Piggin
Excerpts from Christophe Leroy's message of March 9, 2021 10:10 pm:
> No need to do that in assembly, do it in C.

Hmm. No issues with the patch as such, but why does ppc32 need this but 
not 64? AFAIKS 64 sets this when a thread is created.

Thanks,
Nick

> 
> Signed-off-by: Christophe Leroy 
> ---
>  arch/powerpc/include/asm/interrupt.h | 4 +++-
>  arch/powerpc/kernel/entry_32.S   | 3 +--
>  2 files changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/interrupt.h 
> b/arch/powerpc/include/asm/interrupt.h
> index 861e6eadc98c..e6d71c2e3aa2 100644
> --- a/arch/powerpc/include/asm/interrupt.h
> +++ b/arch/powerpc/include/asm/interrupt.h
> @@ -33,8 +33,10 @@ static inline void interrupt_enter_prepare(struct pt_regs 
> *regs, struct interrup
>   if (!arch_irq_disabled_regs(regs))
>   trace_hardirqs_off();
>  
> - if (user_mode(regs))
> + if (user_mode(regs)) {
> + current->thread.regs = regs;
>   account_cpu_user_entry();
> + }
>  #endif
>   /*
>* Book3E reconciles irq soft mask in asm
> diff --git a/arch/powerpc/kernel/entry_32.S b/arch/powerpc/kernel/entry_32.S
> index 8fe1c3fdfa6e..815a4ff1ba76 100644
> --- a/arch/powerpc/kernel/entry_32.S
> +++ b/arch/powerpc/kernel/entry_32.S
> @@ -52,8 +52,7 @@
>  prepare_transfer_to_handler:
>   andi.   r0,r9,MSR_PR
>   addir12, r2, THREAD
> - beq 2f  /* if from user, fix up THREAD.regs */
> - stw r3,PT_REGS(r12)
> + beq 2f
>  #ifdef CONFIG_PPC_BOOK3S_32
>   kuep_lock r11, r12
>  #endif
> -- 
> 2.25.0
> 
> 


Re: [PATCH v2 28/43] powerpc/64e: Call bad_page_fault() from do_page_fault()

2021-03-09 Thread Nicholas Piggin
Excerpts from Christophe Leroy's message of March 9, 2021 10:09 pm:
> book3e/64 is the last one calling __bad_page_fault()
> from assembly.
> 
> Save non volatile registers before calling do_page_fault()
> and modify do_page_fault() to call __bad_page_fault()
> for all platforms.
> 
> Then it can be refactored by the call of bad_page_fault()
> which avoids the duplication of the exception table search.

This can go in with the 64e change after your series. I think it should
be ready for the next merge window as well.

Thanks,
Nick

> 
> Signed-off-by: Christophe Leroy 
> ---
>  arch/powerpc/kernel/exceptions-64e.S |  8 +---
>  arch/powerpc/mm/fault.c  | 17 -
>  2 files changed, 5 insertions(+), 20 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/exceptions-64e.S 
> b/arch/powerpc/kernel/exceptions-64e.S
> index e8eb9992a270..b60f89078a3f 100644
> --- a/arch/powerpc/kernel/exceptions-64e.S
> +++ b/arch/powerpc/kernel/exceptions-64e.S
> @@ -1010,15 +1010,9 @@ storage_fault_common:
>   addir3,r1,STACK_FRAME_OVERHEAD
>   ld  r14,PACA_EXGEN+EX_R14(r13)
>   ld  r15,PACA_EXGEN+EX_R15(r13)
> + bl  save_nvgprs
>   bl  do_page_fault
> - cmpdi   r3,0
> - bne-1f
>   b   ret_from_except_lite
> -1:   bl  save_nvgprs
> - mr  r4,r3
> - addir3,r1,STACK_FRAME_OVERHEAD
> - bl  __bad_page_fault
> - b   ret_from_except
>  
>  /*
>   * Alignment exception doesn't fit entirely in the 0x100 bytes so it
> diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
> index 2e54bac99a22..7bcff3fca110 100644
> --- a/arch/powerpc/mm/fault.c
> +++ b/arch/powerpc/mm/fault.c
> @@ -541,24 +541,15 @@ NOKPROBE_SYMBOL(___do_page_fault);
>  
>  static long __do_page_fault(struct pt_regs *regs)
>  {
> - const struct exception_table_entry *entry;
>   long err;
>  
>   err = ___do_page_fault(regs, regs->dar, regs->dsisr);
>   if (likely(!err))
> - return err;
> -
> - entry = search_exception_tables(regs->nip);
> - if (likely(entry)) {
> - instruction_pointer_set(regs, extable_fixup(entry));
>   return 0;
> - } else if (!IS_ENABLED(CONFIG_PPC_BOOK3E_64)) {
> - __bad_page_fault(regs, err);
> - return 0;
> - } else {
> - /* 32 and 64e handle the bad page fault in asm */
> - return err;
> - }
> +
> + bad_page_fault(regs, err);
> +
> + return 0;
>  }
>  NOKPROBE_SYMBOL(__do_page_fault);
>  
> -- 
> 2.25.0
> 
> 


Re: [PATCH v2 02/43] powerpc/traps: Declare unrecoverable_exception() as __noreturn

2021-03-09 Thread Nicholas Piggin
Excerpts from Christophe Leroy's message of March 9, 2021 10:09 pm:
> unrecoverable_exception() is never expected to return; most callers
> have an infinite loop in case it returns.
> 
> Ensure it really never returns by terminating it with a BUG(), and
> declare it __noreturn.
> 
> It allows GCC to really simplify functions calling it. In the example
> below, it avoids the stack frame in the likely fast path and avoids
> code duplication for the exit.
> 
> With this patch:

[snip]

Nice.

> diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
> index a44a30b0688c..d5c9d9ddd186 100644
> --- a/arch/powerpc/kernel/traps.c
> +++ b/arch/powerpc/kernel/traps.c
> @@ -2170,11 +2170,15 @@ 
> DEFINE_INTERRUPT_HANDLER(SPEFloatingPointRoundException)
>   * in the MSR is 0.  This indicates that SRR0/1 are live, and that
>   * we therefore lost state by taking this exception.
>   */
> -void unrecoverable_exception(struct pt_regs *regs)
> +void __noreturn unrecoverable_exception(struct pt_regs *regs)
>  {
>   pr_emerg("Unrecoverable exception %lx at %lx (msr=%lx)\n",
>regs->trap, regs->nip, regs->msr);
>   die("Unrecoverable exception", regs, SIGABRT);
> + /* die() should not return */
> + WARN(true, "die() unexpectedly returned");
> + for (;;)
> + ;
>  }

I don't think the WARN should be added because that will cause another
interrupt after something is already badly wrong, so this might just
make it harder to debug.

For example if die() is falling through for some reason, we warn and
cause a program check here, and that might also be unrecoverable so it
might come through here and fall through again and warn again, etc.

Putting in the infinite loop is good enough I think (and better than
what was there previously).

Otherwise

Reviewed-by: Nicholas Piggin 

Thanks,
Nick


Re: [PATCH v2 01/43] powerpc/traps: unrecoverable_exception() is not an interrupt handler

2021-03-09 Thread Nicholas Piggin
Excerpts from Christophe Leroy's message of March 9, 2021 10:09 pm:
> unrecoverable_exception() is called from interrupt handlers or
> after an interrupt handler has failed.
> 
> Make it a standard function to avoid doubling the actions
> performed on interrupt entry (e.g.: user time accounting).
> 
> Fixes: 3a96570ffceb ("powerpc: convert interrupt handlers to use wrappers")
> Signed-off-by: Christophe Leroy 

Reviewed-by: Nicholas Piggin 

This should go in as a fix for this release I think.

> ---
>  arch/powerpc/include/asm/interrupt.h | 3 ++-
>  arch/powerpc/kernel/interrupt.c  | 1 -
>  arch/powerpc/kernel/traps.c  | 2 +-
>  3 files changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/interrupt.h 
> b/arch/powerpc/include/asm/interrupt.h
> index aedfba29e43a..e8d09a841373 100644
> --- a/arch/powerpc/include/asm/interrupt.h
> +++ b/arch/powerpc/include/asm/interrupt.h
> @@ -410,7 +410,6 @@ DECLARE_INTERRUPT_HANDLER(altivec_assist_exception);
>  DECLARE_INTERRUPT_HANDLER(CacheLockingException);
>  DECLARE_INTERRUPT_HANDLER(SPEFloatingPointException);
>  DECLARE_INTERRUPT_HANDLER(SPEFloatingPointRoundException);
> -DECLARE_INTERRUPT_HANDLER(unrecoverable_exception);
>  DECLARE_INTERRUPT_HANDLER(WatchdogException);
>  DECLARE_INTERRUPT_HANDLER(kernel_bad_stack);
>  
> @@ -437,6 +436,8 @@ DECLARE_INTERRUPT_HANDLER_NMI(hmi_exception_realmode);
>  
>  DECLARE_INTERRUPT_HANDLER_ASYNC(TAUException);
>  
> +void unrecoverable_exception(struct pt_regs *regs);
> +
>  void replay_system_reset(void);
>  void replay_soft_interrupts(void);
>  
> diff --git a/arch/powerpc/kernel/interrupt.c b/arch/powerpc/kernel/interrupt.c
> index 398cd86b6ada..b8e7d25be31b 100644
> --- a/arch/powerpc/kernel/interrupt.c
> +++ b/arch/powerpc/kernel/interrupt.c
> @@ -436,7 +436,6 @@ notrace unsigned long interrupt_exit_user_prepare(struct 
> pt_regs *regs, unsigned
>   return ret;
>  }
>  
> -void unrecoverable_exception(struct pt_regs *regs);
>  void preempt_schedule_irq(void);
>  
>  notrace unsigned long interrupt_exit_kernel_prepare(struct pt_regs *regs, 
> unsigned long msr)
> diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
> index 1583fd1c6010..a44a30b0688c 100644
> --- a/arch/powerpc/kernel/traps.c
> +++ b/arch/powerpc/kernel/traps.c
> @@ -2170,7 +2170,7 @@ DEFINE_INTERRUPT_HANDLER(SPEFloatingPointRoundException)
>   * in the MSR is 0.  This indicates that SRR0/1 are live, and that
>   * we therefore lost state by taking this exception.
>   */
> -DEFINE_INTERRUPT_HANDLER(unrecoverable_exception)
> +void unrecoverable_exception(struct pt_regs *regs)
>  {
>   pr_emerg("Unrecoverable exception %lx at %lx (msr=%lx)\n",
>regs->trap, regs->nip, regs->msr);
> -- 
> 2.25.0
> 
> 


Re: [PATCH v5 05/22] powerpc/irq: Add helper to set regs->softe

2021-03-08 Thread Nicholas Piggin
Excerpts from Christophe Leroy's message of March 5, 2021 6:54 pm:
> 
> 
> Le 09/02/2021 à 08:49, Nicholas Piggin a écrit :
>> Excerpts from Christophe Leroy's message of February 9, 2021 4:18 pm:
>>>
>>>
>>> Le 09/02/2021 à 02:11, Nicholas Piggin a écrit :
>>>> Excerpts from Christophe Leroy's message of February 9, 2021 1:10 am:
>>>>> regs->softe doesn't exist on PPC32.
>>>>>
>>>>> Add irq_soft_mask_regs_set_state() helper to set regs->softe.
>>>>> This helper will void on PPC32.
>>>>>
>>>>> Signed-off-by: Christophe Leroy 
>>>>> ---
>>>>
>>>> You could do the same with the kuap_ functions to change some ifdefs
>>>> to IS_ENABLED.
>>>>
>>>> That's just my preference but if you prefer this way I guess that's
>>>> okay.
>>>>
>>>
>>>
>>> That's also my preference on the long term.
>>>
>>> Here it is ephemeral, I have a follow up series implementing interrupt 
>>> exit/entry in C and getting
>>> rid of all the assembly kuap hence getting rid of those ifdefs.
>> 
>> I thought it might have been because you hate ifdefs more than most :)
>>   
>>> The issue I see when using IS_ENABLED() is that you have to indent to the 
>>> right, then you interfere
>>> with the file history and 'git blame'
>> 
>> Valid point if it's just going to indent back the other way in your next
>> series.
>> 
>>> Thanks for reviewing my series and looking forward to your feedback on my 
>>> series on the interrupt
>>> entry/exit that I will likely release later today.
>> 
>> Cool, I'm eager to see them.
>> 
> 
> Hi Nick, have you been able to look at it ?
> 
> https://patchwork.ozlabs.org/project/linuxppc-dev/cover/cover.1612864003.git.christophe.le...@csgroup.eu/

Hi Christophe,

I had a look at it. It's mostly ppc32 code, which I don't know well, but
it looks like a very nice cleanup and it's good to be sharing the C
code here. All the common code changes look fine to me.

I'll take a closer look if you can rebase and repost the series; I need
to create a tree and base the 64e conversion on top of yours as they
touch the same common places.

Thanks,
Nick


Re: [PATCH] [RFC] arm64: enable HAVE_LD_DEAD_CODE_DATA_ELIMINATION

2021-02-28 Thread Nicholas Piggin
Excerpts from Arnd Bergmann's message of February 27, 2021 7:49 pm:
> On Fri, Feb 26, 2021 at 10:13 PM 'Fangrui Song' via Clang Built Linux
>  wrote:
>>
>> For folks who are interested in --gc-sections on metadata sections,
>> I want to bring you awareness of the implication of __start_/__stop_ symbols 
>> and C identifier name sections.
>> You can see https://github.com/ClangBuiltLinux/linux/issues/1307 for a 
>> summary.
>> (Its linked blog article has some examples.)
>>
>> In the kernel linker scripts, most C identifier name sections begin with 
>> double-underscore __.
>> Some are surrounded by `KEEP(...)`, some are not.
>>
>> * A `KEEP` keyword has GC root semantics and makes ld --gc-sections 
>> ineffectful.
>> * Without `KEEP`, __start_/__stop_ references from a live input section
>>can unnecessarily retain all the associated C identifier name input
>>sections. The new ld.lld option `-z start-stop-gc` can defeat this rule.
>>
>> As an example, a __start___jump_table reference from a live section
>> causes all `__jump_table` input section to be retained, even if you
>> change `KEEP(__jump_table)` to `(__jump_table)`.
>> (If you change the symbol name from `__start_${section}` to something
>> else (e.g. `__start${section}`), the rule will not apply.)
> 
> I suspect the __start_* symbols are cargo-culted by many developers
> copying stuff around between kernel linker scripts, that's certainly how I
> approach making changes to it normally without a deeper understanding
> of how the linker actually works or what the different bits of syntax mean
> there.
> 
> I see the original vmlinux.lds linker script showed up in linux-2.1.23, and
> it contained
> 
> +  . = ALIGN(16);   /* Exception table */
> +  __start___ex_table = .;
> +  __ex_table : { *(__ex_table) }
> +  __stop___ex_table = .;
> +
> +  __start___ksymtab = .;   /* Kernel symbol table */
> +  __ksymtab : { *(__ksymtab) }
> +  __stop___ksymtab = .;
> 
> originally for arch/sparc, and shortly afterwards for i386. The magic
> __ex_table section was first used in linux-2.1.7 without a linker
> script. It's probably a good idea to try cleaning these up by using
> non-magic start/stop symbols for all sections, and relying on KEEP()
> instead where needed.
> 
>> There are a lot of KEEP usage. Perhaps some can be dropped to facilitate
>> ld --gc-sections.
> 
> I see a lot of these were added by Nick Piggin (added to Cc) in this commit:
> 
> commit 266ff2a8f51f02b429a987d87634697eb0d01d6a
> Author: Nicholas Piggin 
> Date:   Wed May 9 22:59:58 2018 +1000
> 
> kbuild: Fix asm-generic/vmlinux.lds.h for LD_DEAD_CODE_DATA_ELIMINATION
> 
> KEEP more tables, and add the function/data section wildcard to more
> section selections.
> 
> This is a little ad-hoc at the moment, but kernel code should be moved
> to consistently use .text..x (note: double dots) for explicit sections
> and all references to it in the linker script can be made with
> TEXT_MAIN, and similarly for other sections.
> 
> For now, let's see if major architectures move to enabling this option
> then we can do some refactoring passes. Otherwise if it remains unused
> or superseded by LTO, this may not be required.
> 
> Signed-off-by: Nicholas Piggin 
> Signed-off-by: Masahiro Yamada 
> 
> which apparently was intentionally cautious.
> 
> Unlike what Nick expected in his submission, I now think the annotations
> will be needed for LTO just like they are for --gc-sections.

Yeah, I wasn't sure exactly what LTO looks like or how it would work.
I thought perhaps LTO might be able to find dead code with circular /
back references; we could put references from the code back to these
tables or something so they would be kept without KEEP. I don't know, I
was handwaving!

I managed to get powerpc (and IIRC x86?) working with gc sections with
those KEEP annotations, but the effectiveness is of course far worse than
what Nicolas was able to achieve with all his techniques and tricks.

But yes, unless there is some other mechanism to handle these tables,
KEEP probably has to stay. I suggest this wants a very explicit and
systematic way of handling it (maybe with some toolchain support) rather
than trying to just remove things case by case and seeing what breaks.
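
For reference, the pattern these KEEPs (and the __start_/__stop_
references) protect is the usual table walk, roughly like this (type and
helper names are only illustrative):

extern struct exception_table_entry __start___ex_table[];
extern struct exception_table_entry __stop___ex_table[];

static void walk_ex_table(void)
{
	struct exception_table_entry *e;

	for (e = __start___ex_table; e < __stop___ex_table; e++)
		consume_entry(e);	/* illustrative only */
}

As Fangrui points out above, that __start___ex_table reference from live
code is what ends up retaining the whole input section even without KEEP.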

I don't know if Nicolas has still been working on his shrinking patches
recently, but he probably knows more than anyone about this stuff.

Thanks,
Nick



Re: [PATCH] powerpc/syscall: Force inlining of __prep_irq_for_enabled_exit()

2021-02-25 Thread Nicholas Piggin
Excerpts from Christophe Leroy's message of February 24, 2021 4:34 pm:
> As reported by kernel test robot, a randconfig with high amount of
> debuging options can lead to build failure for undefined reference
> to replay_soft_interrupts() on ppc32.
> 
> This is due to gcc not seeing that __prep_irq_for_enabled_exit()
> always returns true on ppc32 because it doesn't inline it for
> some reason.
> 
> Force inlining of __prep_irq_for_enabled_exit() to fix the build.
> 
> Reported-by: kernel test robot 
> Signed-off-by: Christophe Leroy 

Acked-by: Nicholas Piggin 

> Fixes: 344bb20b159d ("powerpc/syscall: Make interrupt.c buildable on PPC32")
> ---
>  arch/powerpc/kernel/interrupt.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/kernel/interrupt.c b/arch/powerpc/kernel/interrupt.c
> index 398cd86b6ada..2ef3c4051bb9 100644
> --- a/arch/powerpc/kernel/interrupt.c
> +++ b/arch/powerpc/kernel/interrupt.c
> @@ -149,7 +149,7 @@ notrace long system_call_exception(long r3, long r4, long 
> r5,
>   * enabled when the interrupt handler returns (indicating a process-context /
>   * synchronous interrupt) then irqs_enabled should be true.
>   */
> -static notrace inline bool __prep_irq_for_enabled_exit(bool clear_ri)
> +static notrace __always_inline bool __prep_irq_for_enabled_exit(bool 
> clear_ri)
>  {
>   /* This must be done with RI=1 because tracing may touch vmaps */
>   trace_hardirqs_on();
> -- 
> 2.25.0
> 
> 


Re: [PATCH v12 13/14] mm/vmalloc: Hugepage vmalloc mappings

2021-02-18 Thread Nicholas Piggin
Excerpts from Ding Tianhong's message of February 19, 2021 1:45 pm:
> Hi Nicholas:
> 
> I met some problem for this patch, like this:
> 
> kva = vmalloc(3*1024k);
> 
> remap_vmalloc_range(xxx, kva, xxx)
> 
> It failed because that the check for page_count(page) is null so return, it 
> break the some logic for current modules.
> because the new huge page is not valid for composed page.

Hey Ding, that's a good catch. How are you testing this stuff? Do you
have a particular driver that does this?

> I think some guys really don't get used to the changes for the vmalloc that 
> the small pages was transparency to the hugepage
> when the size is bigger than the PMD_SIZE.

I think in this case vmalloc could allocate the large page as a compound
page, which would solve this problem? (without having actually
tested it)
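
Just so I'm sure I understand the failing pattern, I assume it's roughly
this (illustrative only, not from a particular driver; vmalloc_user()
stands in for however the area gets VM_USERMAP, and error/lifetime
handling is omitted):

static int example_mmap(struct file *file, struct vm_area_struct *vma)
{
	/* > PMD_SIZE, so may now be backed by a huge mapping */
	void *kva = vmalloc_user(3 * 1024 * 1024);

	if (!kva)
		return -ENOMEM;

	/*
	 * Now fails the page_count() check you mention, presumably because
	 * the huge allocation is not a compound page so its sub-pages don't
	 * look like normal refcounted pages.
	 */
	return remap_vmalloc_range(vma, kva, 0);
}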

> can we think about give a new static huge page to fix it? just like use a a 
> new vmalloc_huge_xxx function to disginguish the current function,
> the user could choose to use the transparent hugepage or static hugepage for 
> vmalloc.

Yeah, that's a good question. There are a few things in the huge vmalloc
code that account things as small pages, and you can't assume large or
small. If there is a benefit from forcing large pages, that could
certainly be added.

Interestingly, remap_vmalloc_range in theory could map the pages as 
large in userspace as well. That takes more work but if something
really needs that for performance, it could be done.

Thanks,
Nick


Re: [PATCH] powerpc/bug: Remove specific powerpc BUG_ON()

2021-02-11 Thread Nicholas Piggin
Excerpts from Segher Boessenkool's message of February 11, 2021 9:50 pm:
> On Thu, Feb 11, 2021 at 08:04:55PM +1000, Nicholas Piggin wrote:
>> It would be nice if we could have a __builtin_trap_if that gcc would use 
>> conditional traps with, (and which never assumes following code is 
>> unreachable even for constant true, so we can use it with WARN and put 
>> explicit unreachable for BUG).
> 
> It automatically does that with just __builtin_trap, see my other mail :-)

If that is generated without branches (or at least with no more
branches than the existing asm implementation), then it could be usable
without trashing CFAR.
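
That is, if

	if (x)
		__builtin_trap();

really does come out as a bare twnei/tdnei with no conditional branch,
as you describe, that would do the job.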

Unfortunately I don't think we will be parsing the DWARF information
to get line numbers from it any time soon, so it's not a drop-in
replacement, but maybe one day someone will find a solution.

Thanks,
Nick


Re: [PATCH] powerpc/bug: Remove specific powerpc BUG_ON()

2021-02-11 Thread Nicholas Piggin
Excerpts from Christophe Leroy's message of February 11, 2021 5:41 pm:
> powerpc BUG_ON() is based on using twnei or tdnei instruction,
> which obliges gcc to format the condition into a 0 or 1 value
> in a register.
> 
> By using a generic implementation, gcc will generate a branch
> to the unconditional trap generated by BUG().

We don't want to do this on 64s because that will lose the useful CFAR
contents.

Unfortunately the code generation is not great and the registers that 
give some useful information about the condition are often mangled :(

It would be nice if we could have a __builtin_trap_if that gcc would use
conditional traps with (and which never assumes the following code is
unreachable even for constant true, so we could use it with WARN and put
an explicit unreachable for BUG).
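
To be clear about the shape I mean (this builtin does not exist today,
it's purely hypothetical):

/* Emit a conditional trap; never treat the following code as unreachable. */
#define BUG()		do { __builtin_trap_if(1); unreachable(); } while (0)
#define BUG_ON(x)	__builtin_trap_if(x)
#define WARN_ON(x)	({ bool __w = !!(x); __builtin_trap_if(__w); __w; })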

> 
> As modern powerpc implement branch folding, that's even more efficient.

I think POWER will always speculate conditional traps as non-faulting,
so it should be just as good as the branch, if not better.

Thanks,
Nick


Re: linux-next: build failure after merge of the powerpc tree

2021-02-10 Thread Nicholas Piggin
Excerpts from Stephen Rothwell's message of February 9, 2021 8:19 pm:
> Hi all,
> 
> After merging the powerpc tree, today's linux-next build (powerpc
> allyesconfig) failed like this:
> 
> arch/powerpc/kernel/head_64.o:(__ftr_alt_97+0x0): relocation truncated to 
> fit: R_PPC64_REL24 (OPD) against symbol `do_page_fault' defined in .opd 
> section in arch/powerpc/mm/fault.o
> arch/powerpc/kernel/head_64.o:(__ftr_alt_97+0x8): relocation truncated to 
> fit: R_PPC64_REL24 (OPD) against symbol `do_page_fault' defined in .opd 
> section in arch/powerpc/mm/fault.o
> arch/powerpc/kernel/head_64.o:(__ftr_alt_97+0x28): relocation truncated to 
> fit: R_PPC64_REL24 (OPD) against symbol `unknown_exception' defined in .opd 
> section in arch/powerpc/kernel/traps.o
> 
> Not sure exactly which commit caused this, but it is most likkely part
> of a series in the powerpc tree.
> 
> I have left the allyesconfig build broken for today.

Hey Stephen,

Thanks for that. It's due to the .noinstr section being placed on the
other side of .text, so our interrupt handler asm code can't reach the
handlers with direct branches anymore (the R_PPC64_REL24 relocations
can't span that distance) since the ppc interrupt wrappers patch added
the noinstr attribute.

That's not strictly required though; we've used NOKPROBE_SYMBOL okay
until now. If you can take this patch for now, it should get
allyesconfig to build again. I'll fix it in the powerpc tree before the
merge window.

Thanks,
Nick
--

diff --git a/arch/powerpc/include/asm/interrupt.h 
b/arch/powerpc/include/asm/interrupt.h
index 4badb3e51c19..fee1e4dd1e84 100644
--- a/arch/powerpc/include/asm/interrupt.h
+++ b/arch/powerpc/include/asm/interrupt.h
@@ -172,6 +172,8 @@ static inline void interrupt_nmi_exit_prepare(struct 
pt_regs *regs, struct inter
 #define DECLARE_INTERRUPT_HANDLER_RAW(func)\
__visible long func(struct pt_regs *regs)
 
+#define ppc_noinstr noinline notrace __no_kcsan __no_sanitize_address
+
 /**
  * DEFINE_INTERRUPT_HANDLER_RAW - Define raw interrupt handler function
  * @func:  Function name of the entry point
@@ -198,7 +200,7 @@ static inline void interrupt_nmi_exit_prepare(struct 
pt_regs *regs, struct inter
 #define DEFINE_INTERRUPT_HANDLER_RAW(func) \
 static __always_inline long ____##func(struct pt_regs *regs);  \
\
-__visible noinstr long func(struct pt_regs *regs)  \
+__visible ppc_noinstr long func(struct pt_regs *regs)  \
 {  \
long ret;   \
\
@@ -228,7 +230,7 @@ static __always_inline long ____##func(struct pt_regs *regs)
 #define DEFINE_INTERRUPT_HANDLER(func) \
 static __always_inline void ____##func(struct pt_regs *regs);  \
\
-__visible noinstr void func(struct pt_regs *regs)  \
+__visible ppc_noinstr void func(struct pt_regs *regs)  \
 {  \
struct interrupt_state state;   \
\
@@ -262,7 +264,7 @@ static __always_inline void ____##func(struct pt_regs *regs)
 #define DEFINE_INTERRUPT_HANDLER_RET(func) \
 static __always_inline long ____##func(struct pt_regs *regs);  \
\
-__visible noinstr long func(struct pt_regs *regs)  \
+__visible ppc_noinstr long func(struct pt_regs *regs)  \
 {  \
struct interrupt_state state;   \
long ret;   \
@@ -297,7 +299,7 @@ static __always_inline long ____##func(struct pt_regs *regs)
 #define DEFINE_INTERRUPT_HANDLER_ASYNC(func)   \
 static __always_inline void ____##func(struct pt_regs *regs);  \
\
-__visible noinstr void func(struct pt_regs *regs)  \
+__visible ppc_noinstr void func(struct pt_regs *regs)  \
 {  \
struct interrupt_state state;   \
\
@@ -331,7 +333,7 @@ static __always_inline void ____##func(struct pt_regs *regs)
 #define DEFINE_INTERRUPT_HANDLER_NMI(func) \
 static __always_inline long ____##func(struct pt_regs *regs);  

Re: [PATCH v5 20/22] powerpc/syscall: Avoid storing 'current' in another pointer

2021-02-09 Thread Nicholas Piggin
Excerpts from Christophe Leroy's message of February 10, 2021 3:03 am:
> 
> 
> Le 09/02/2021 à 15:31, David Laight a écrit :
>> From: Segher Boessenkool
>>> Sent: 09 February 2021 13:51
>>>
>>> On Tue, Feb 09, 2021 at 12:36:20PM +1000, Nicholas Piggin wrote:
>>>> What if you did this?
>>>
>>>> +static inline struct task_struct *get_current(void)
>>>> +{
>>>> +  register struct task_struct *task asm ("r2");
>>>> +
>>>> +  return task;
>>>> +}
>>>
>>> Local register asm variables are *only* guaranteed to live in that
>>> register as operands to an asm.  See
>>>
>>> https://gcc.gnu.org/onlinedocs/gcc/Local-Register-Variables.html#Local-Register-Variables
>>> ("The only supported use" etc.)
>>>
>>> You can do something like
>>>
>>> static inline struct task_struct *get_current(void)
>>> {
>>> register struct task_struct *task asm ("r2");
>>>
>>> asm("" : "+r"(task));
>>>
>>> return task;
>>> }
>>>
>>> which makes sure that "task" actually is in r2 at the point of that asm.
>> 
>> If "r2" always contains current (and is never assigned by the compiler)
>> why not use a global register variable for it?
>> 
> 
> 
> The change proposed by Nick doesn't solve the issue.

It seemed to change code generation in a simple test case, oh well.

> 
> The problem is that at the begining of the function we have:
> 
>   unsigned long *ti_flagsp = _thread_info()->flags;
> 
> When the function uses ti_flagsp for the first time, it does use 112(r2)
> 
> Then the function calls some other functions.
> 
> Most likely because the function could update 'current', GCC copies r2 into 
> r30, so that if r2 get 
> changed by the called function, ti_flagsp is still based on the previous 
> value of current.
> 
> Allthough we know r2 wont change, GCC doesn't know it. And in order to save 
> r2 into r30, it needs to 
> save r30 in the stack.
> 
> 
> By using _thread_info()->flags directly instead of this intermediaite 
> ti_flagsp pointer, GCC 
> uses r2 instead instead of doing a copy.
> 
> 
> Nick, I don't understand the reason why you need that 'ti_flagsp' local var.

Just to save typing. I don't mind your patch; I was just wondering if
'current' could be improved in general.

Thanks,
Nick


Re: [PATCH v5 18/22] powerpc/syscall: Remove FULL_REGS verification in system_call_exception

2021-02-09 Thread Nicholas Piggin
Excerpts from Christophe Leroy's message of February 10, 2021 12:31 am:
> 
> 
> Le 09/02/2021 à 03:02, Nicholas Piggin a écrit :
>> Excerpts from Christophe Leroy's message of February 9, 2021 1:10 am:
>>> For book3s/64, FULL_REGS() is 'true' at all time, so the test voids.
>>> For others, non volatile registers are saved inconditionally.
>>>
>>> So the verification is pointless.
>>>
>>> Should one fail to do it, it would anyway be caught by the
>>> CHECK_FULL_REGS() in copy_thread() as we have removed the
>>> special versions ppc_fork() and friends.
>>>
>>> null_syscall benchmark reduction 4 cycles (332 => 328 cycles)
>> 
>> I wonder if we rather make a CONFIG option for a bunch of these simpler
>> debug checks here (and also in interrupt exit, wrappers, etc) rather
>> than remove them entirely.
> 
> We can drop this patch if you prefer. Anyway, like book3s/64, once ppc32 also 
> do interrupt 
> entry/exit in C, FULL_REGS() will already return true.

Sure let's do that.

Thanks,
Nick



Re: [PATCH v5 16/22] powerpc/syscall: Avoid stack frame in likely part of system_call_exception()

2021-02-09 Thread Nicholas Piggin
Excerpts from Christophe Leroy's message of February 10, 2021 2:13 am:
> 
> 
> Le 09/02/2021 à 02:55, Nicholas Piggin a écrit :
>> Excerpts from Christophe Leroy's message of February 9, 2021 1:10 am:
>>> When r3 is not modified, reload it from regs->orig_r3 to free
>>> volatile registers. This avoids a stack frame for the likely part
>>> of system_call_exception()
>> 
>> This doesn't on my 64s build, but it does reduce one non volatile
>> register save/restore. With quite a bit more register pressure
>> reduction 64s can avoid the stack frame as well.
> 
> The stack frame is not due to the registers because on PPC64 you have the 
> redzone that you don't 
> have on PPC32.
> 
> As far as I can see, this is due to a call to .arch_local_irq_restore().
> 
> On ppc32 arch_local_irq_restore() is just a write to MSR.

Oh, you're right there. We can actually inline the fast paths of that;
I have a patch somewhere, but I'm not sure if it's worthwhile.
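
The fast path is roughly this (from memory, glossing over the replay /
irq_happened corner cases the real slow path deals with; the out-of-line
slow-path name is made up):

static inline void arch_local_irq_restore(unsigned long mask)
{
	irq_soft_mask_set(mask);
	if (mask)
		return;

	if (likely(!local_paca->irq_happened)) {
		__hard_irq_enable();
		return;
	}

	arch_local_irq_restore_slow(mask);	/* out of line, made-up name */
}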

>> It's a cool trick but quite code and compiler specific so I don't know
>> how worthwhile it is to keep considering we're calling out into random
>> kernel C code after this.
>> 
>> Maybe just keep it PPC32 specific for the moment, will have to do more
>> tuning for 64 and we have other stuff to do there first.
>> 
>> If you are happy to make it 32-bit only then
> 
> I think we can leave without this, that's only one or two cycles won.

Okay for this round let's drop it for now.

Thanks,
Nick


Re: [PATCH v5 17/22] powerpc/syscall: Do not check unsupported scv vector on PPC32

2021-02-08 Thread Nicholas Piggin
Excerpts from Christophe Leroy's message of February 9, 2021 4:13 pm:
> 
> 
> Le 09/02/2021 à 03:00, Nicholas Piggin a écrit :
>> Excerpts from Christophe Leroy's message of February 9, 2021 1:10 am:
>>> Only PPC64 has scv. No need to check the 0x7ff0 trap on PPC32.
>>> For that, add a helper trap_is_unsupported_scv() similar to
>>> trap_is_scv().
>>>
>>> And ignore the scv parameter in syscall_exit_prepare (Save 14 cycles
>>> 346 => 332 cycles)
>>>
>>> Signed-off-by: Christophe Leroy 
>>> ---
>>> v5: Added a helper trap_is_unsupported_scv()
>>> ---
>>>   arch/powerpc/include/asm/ptrace.h | 5 +
>>>   arch/powerpc/kernel/entry_32.S| 1 -
>>>   arch/powerpc/kernel/interrupt.c   | 7 +--
>>>   3 files changed, 10 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/arch/powerpc/include/asm/ptrace.h 
>>> b/arch/powerpc/include/asm/ptrace.h
>>> index 58f9dc060a7b..2c842b11a924 100644
>>> --- a/arch/powerpc/include/asm/ptrace.h
>>> +++ b/arch/powerpc/include/asm/ptrace.h
>>> @@ -229,6 +229,11 @@ static inline bool trap_is_scv(struct pt_regs *regs)
>>> return (IS_ENABLED(CONFIG_PPC_BOOK3S_64) && TRAP(regs) == 0x3000);
>>>   }
>>>   
>>> +static inline bool trap_is_unsupported_scv(struct pt_regs *regs)
>>> +{
>>> +   return (IS_ENABLED(CONFIG_PPC_BOOK3S_64) && TRAP(regs) == 0x7ff0);
>>> +}
>> 
>> This change is good.
>> 
>>> +
>>>   static inline bool trap_is_syscall(struct pt_regs *regs)
>>>   {
>>> return (trap_is_scv(regs) || TRAP(regs) == 0xc00);
>>> diff --git a/arch/powerpc/kernel/entry_32.S b/arch/powerpc/kernel/entry_32.S
>>> index cffe58e63356..7c824e8928d0 100644
>>> --- a/arch/powerpc/kernel/entry_32.S
>>> +++ b/arch/powerpc/kernel/entry_32.S
>>> @@ -344,7 +344,6 @@ transfer_to_syscall:
>>>   
>>>   ret_from_syscall:
>>> addir4,r1,STACK_FRAME_OVERHEAD
>>> -   li  r5,0
>>> bl  syscall_exit_prepare
>> 
>> For this one, I think it would be nice to do the "right" thing and make
>> the function prototypes different on !64S. They could then declare a
>> local const bool scv = 0.
>> 
>> We could have syscall_exit_prepare and syscall_exit_prepare_maybe_scv
>> or something like that, 64s can use the latter one and the former can be
>> a wrapper that passes constant 0 for scv. Then we don't have different
>> prototypes for the same function, but you just have to make the 32-bit
>> version static inline and the 64-bit version exported to asm.
> 
> You can't call a static inline function from ASM, I don't understand you.

I mean

#ifdef CONFIG_PPC_BOOK3S_64
notrace unsigned long syscall_exit_prepare_scv(unsigned long r3,
   struct pt_regs *regs,
   long scv)
#else
static inline unsigned long syscall_exit_prepare_scv(unsigned long r3,
   struct pt_regs *regs,
   long scv)
#endif

#ifndef CONFIG_PPC_BOOK3S_64
notrace unsigned long syscall_exit_prepare(unsigned long r3,
   struct pt_regs *regs)
{
return syscall_exit_prepare_scv(r3, regs, 0);
}
#endif


> 
> What is wrong for you really here ? Is that the fact we leave scv random, or 
> is that the below 
> IS_ENABLED() ?

That scv arg is random. I know the generated code would be essentially
no different and there's no possibility of tracing, but I would just
prefer to call the C "correctly" if possible.

> I don't mind keeping the 'li r5,0' before calling the function if you find it 
> cleaner, the real 
> performance gain is with setting scv to 0 below for PPC32 (and maybe it 
> should be set to zero for 
> book3e/64 too ?).

Yes 64e would like this optimisation.

Thanks,
Nick


Re: [PATCH v5 09/22] powerpc/syscall: Make interrupt.c buildable on PPC32

2021-02-08 Thread Nicholas Piggin
Excerpts from Christophe Leroy's message of February 9, 2021 4:02 pm:
> 
> 
> Le 09/02/2021 à 02:27, Nicholas Piggin a écrit :
>> Excerpts from Christophe Leroy's message of February 9, 2021 1:10 am:
>>> To allow building interrupt.c on PPC32, ifdef out specific PPC64
>>> code or use helpers which are available on both PP32 and PPC64
>>>
>>> Modify Makefile to always build interrupt.o
>>>
>>> Signed-off-by: Christophe Leroy 
>>> ---
>>> v5:
>>> - Also for interrupt exit preparation
>>> - Opted out kuap related code, ppc32 keeps it in ASM for the time being
>>> ---
>>>   arch/powerpc/kernel/Makefile|  4 ++--
>>>   arch/powerpc/kernel/interrupt.c | 31 ---
>>>   2 files changed, 26 insertions(+), 9 deletions(-)
>>>
> 
>>> diff --git a/arch/powerpc/kernel/interrupt.c 
>>> b/arch/powerpc/kernel/interrupt.c
>>> index d6be4f9a67e5..2dac4d2bb1cf 100644
>>> --- a/arch/powerpc/kernel/interrupt.c
>>> +++ b/arch/powerpc/kernel/interrupt.c
>>> @@ -39,7 +39,7 @@ notrace long system_call_exception(long r3, long r4, long 
>>> r5,
>>> BUG_ON(!(regs->msr & MSR_RI));
>>> BUG_ON(!(regs->msr & MSR_PR));
>>> BUG_ON(!FULL_REGS(regs));
>>> -   BUG_ON(regs->softe != IRQS_ENABLED);
>>> +   BUG_ON(arch_irq_disabled_regs(regs));
>>>   
>>>   #ifdef CONFIG_PPC_PKEY
>>> if (mmu_has_feature(MMU_FTR_PKEY)) {
>>> @@ -65,7 +65,9 @@ notrace long system_call_exception(long r3, long r4, long 
>>> r5,
>>> isync();
>>> } else
>>>   #endif
>>> +#ifdef CONFIG_PPC64
>>> kuap_check_amr();
>>> +#endif
>> 
>> Wouldn't mind trying to get rid of these ifdefs at some point, but
>> there's some kuap / keys changes going on recently so I'm happy enough
>> to let this settle then look at whether we can refactor.
> 
> I have a follow up series that implements interrupts entries/exits in C and 
> that removes all kuap 
> assembly, I will likely release it as RFC later today.
> 
>> 
>>>   
>>> account_cpu_user_entry();
>>>   
>>> @@ -318,7 +323,7 @@ notrace unsigned long syscall_exit_prepare(unsigned 
>>> long r3,
>>> return ret;
>>>   }
>>>   
>>> -#ifdef CONFIG_PPC_BOOK3S /* BOOK3E not yet using this */
>>> +#ifndef CONFIG_PPC_BOOK3E_64 /* BOOK3E not yet using this */
>>>   notrace unsigned long interrupt_exit_user_prepare(struct pt_regs *regs, 
>>> unsigned long msr)
>>>   {
>>>   #ifdef CONFIG_PPC_BOOK3E
>> 
>> Why are you building this for 32? I don't mind if it's just to keep
>> things similar and make it build for now, but you're not using it yet,
>> right?
> 
> The series using that will follow, I thought it would be worth doing this at 
> once.

Yeah that's fine by me then.

Thanks,
Nick


Re: [PATCH v5 05/22] powerpc/irq: Add helper to set regs->softe

2021-02-08 Thread Nicholas Piggin
Excerpts from Christophe Leroy's message of February 9, 2021 3:57 pm:
> 
> 
> Le 09/02/2021 à 02:11, Nicholas Piggin a écrit :
>> Excerpts from Christophe Leroy's message of February 9, 2021 1:10 am:
>>> regs->softe doesn't exist on PPC32.
>>>
>>> Add irq_soft_mask_regs_set_state() helper to set regs->softe.
>>> This helper will void on PPC32.
>>>
>>> Signed-off-by: Christophe Leroy 
>>> ---
>>>   arch/powerpc/include/asm/hw_irq.h | 11 +--
>>>   1 file changed, 9 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/arch/powerpc/include/asm/hw_irq.h 
>>> b/arch/powerpc/include/asm/hw_irq.h
>>> index 614957f74cee..ed0c3b049dfd 100644
>>> --- a/arch/powerpc/include/asm/hw_irq.h
>>> +++ b/arch/powerpc/include/asm/hw_irq.h
>>> @@ -38,6 +38,8 @@
>>>   #define PACA_IRQ_MUST_HARD_MASK   (PACA_IRQ_EE)
>>>   #endif
>>>   
>>> +#endif /* CONFIG_PPC64 */
>>> +
>>>   /*
>>>* flags for paca->irq_soft_mask
>>>*/
>>> @@ -46,8 +48,6 @@
>>>   #define IRQS_PMI_DISABLED 2
>>>   #define IRQS_ALL_DISABLED (IRQS_DISABLED | IRQS_PMI_DISABLED)
>>>   
>>> -#endif /* CONFIG_PPC64 */
>>> -
>>>   #ifndef __ASSEMBLY__
>>>   
>>>   #ifdef CONFIG_PPC64
>>> @@ -287,6 +287,10 @@ extern void irq_set_pending_from_srr1(unsigned long 
>>> srr1);
>>>   
>>>   extern void force_external_irq_replay(void);
>>>   
>>> +static inline void irq_soft_mask_regs_set_state(struct pt_regs *regs, 
>>> unsigned long val)
>>> +{
>>> +   regs->softe = val;
>>> +}
>>>   #else /* CONFIG_PPC64 */
>>>   
>>>   static inline unsigned long arch_local_save_flags(void)
>>> @@ -355,6 +359,9 @@ static inline bool arch_irq_disabled_regs(struct 
>>> pt_regs *regs)
>>>   
>>>   static inline void may_hard_irq_enable(void) { }
>>>   
>>> +static inline void irq_soft_mask_regs_set_state(struct pt_regs *regs, 
>>> unsigned long val)
>>> +{
>>> +}
>>>   #endif /* CONFIG_PPC64 */
>>>   
>>>   #define ARCH_IRQ_INIT_FLAGS   IRQ_NOREQUEST
>> 
>> What I don't like about this where you use it is it kind of pollutes
>> the ppc32 path with this function which is not valid to use.
>> 
>> I would prefer if you had this purely so it could compile with:
>> 
>>if (IS_ENABLED(CONFIG_PPC64)))
>>irq_soft_mask_regs_set_state(regs, blah);
>> 
>> And then you could make the ppc32 cause a link error if it did not
>> get eliminated at compile time (e.g., call an undefined function).
>> 
>> You could do the same with the kuap_ functions to change some ifdefs
>> to IS_ENABLED.
>> 
>> That's just my preference but if you prefer this way I guess that's
>> okay.
> 
> I see you didn't change your mind since last April :)
> 
> I'll see what I can do.

If you have more patches in the works and will do some cleanup passes,
I don't mind so much.

Thanks,
Nick


Re: [PATCH v5 05/22] powerpc/irq: Add helper to set regs->softe

2021-02-08 Thread Nicholas Piggin
Excerpts from Christophe Leroy's message of February 9, 2021 4:18 pm:
> 
> 
> Le 09/02/2021 à 02:11, Nicholas Piggin a écrit :
>> Excerpts from Christophe Leroy's message of February 9, 2021 1:10 am:
>>> regs->softe doesn't exist on PPC32.
>>>
>>> Add irq_soft_mask_regs_set_state() helper to set regs->softe.
>>> This helper will void on PPC32.
>>>
>>> Signed-off-by: Christophe Leroy 
>>> ---
>> 
>> You could do the same with the kuap_ functions to change some ifdefs
>> to IS_ENABLED.
>> 
>> That's just my preference but if you prefer this way I guess that's
>> okay.
>> 
> 
> 
> That's also my preference on the long term.
> 
> Here it is ephemeral, I have a follow up series implementing interrupt 
> exit/entry in C and getting 
> rid of all the assembly kuap hence getting rid of those ifdefs.

I thought it might have been because you hate ifdefs more than most :)
 
> The issue I see when using IS_ENABLED() is that you have to indent to the 
> right, then you interfere 
> with the file history and 'git blame'

Valid point if it's just going to indent back the other way in your next 
series.

> Thanks for reviewing my series and looking forward to your feedback on my 
> series on the interrupt 
> entry/exit that I will likely release later today.

Cool, I'm eager to see them.

Thanks,
Nick


Re: [PATCH v5 20/22] powerpc/syscall: Avoid storing 'current' in another pointer

2021-02-08 Thread Nicholas Piggin
Excerpts from Christophe Leroy's message of February 9, 2021 1:10 am:
> By saving the pointer pointing to thread_info.flags, gcc copies r2
> in a non-volatile register.
> 
> We know 'current' doesn't change, so avoid that intermediaite pointer.
> 
> Reduces null_syscall benchmark by 2 cycles (322 => 320 cycles)
> 
> On PPC64, gcc seems to know that 'current' is not changing, and it keeps
> it in a non volatile register to avoid multiple read of 'current' in paca.
> 
> Signed-off-by: Christophe Leroy 

What if you did this?

---
 arch/powerpc/include/asm/current.h | 13 -
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/include/asm/current.h 
b/arch/powerpc/include/asm/current.h
index bbfb94800415..59ab327972a5 100644
--- a/arch/powerpc/include/asm/current.h
+++ b/arch/powerpc/include/asm/current.h
@@ -23,16 +23,19 @@ static inline struct task_struct *get_current(void)
 
return task;
 }
-#define currentget_current()
 
 #else
 
-/*
- * We keep `current' in r2 for speed.
- */
-register struct task_struct *current asm ("r2");
+static inline struct task_struct *get_current(void)
+{
+   register struct task_struct *task asm ("r2");
+
+   return task;
+}
 
 #endif
 
+#define currentget_current()
+
 #endif /* __KERNEL__ */
 #endif /* _ASM_POWERPC_CURRENT_H */
-- 


Re: [PATCH v5 19/22] powerpc/syscall: Optimise checks in beginning of system_call_exception()

2021-02-08 Thread Nicholas Piggin
Excerpts from Christophe Leroy's message of February 9, 2021 1:10 am:
> Combine all tests of regs->msr into a single logical one.

Okay by me, unless we choose to do the config option and put these all
under it. I think I would prefer that, because sometimes the registers
are in a state where you can't easily see what the values in the
expression were. In this case it doesn't matter so much because they
should be in regs in the interrupt frame.

Thanks,
Nick

> 
> Before the patch:
> 
>0: 81 6a 00 84 lwz r11,132(r10)
>4: 90 6a 00 88 stw r3,136(r10)
>8: 69 60 00 02 xorir0,r11,2
>c: 54 00 ff fe rlwinm  r0,r0,31,31,31
>   10: 0f 00 00 00 twnei   r0,0
>   14: 69 63 40 00 xorir3,r11,16384
>   18: 54 63 97 fe rlwinm  r3,r3,18,31,31
>   1c: 0f 03 00 00 twnei   r3,0
>   20: 69 6b 80 00 xorir11,r11,32768
>   24: 55 6b 8f fe rlwinm  r11,r11,17,31,31
>   28: 0f 0b 00 00 twnei   r11,0
> 
> After the patch:
> 
>0: 81 6a 00 84 lwz r11,132(r10)
>4: 90 6a 00 88 stw r3,136(r10)
>8: 7d 6b 58 f8 not r11,r11
>c: 71 6b c0 02 andi.   r11,r11,49154
>   10: 0f 0b 00 00 twnei   r11,0
> 
> 6 cycles less on powerpc 8xx (328 => 322 cycles).
> 
> Signed-off-by: Christophe Leroy 
> ---
>  arch/powerpc/kernel/interrupt.c | 10 +++---
>  1 file changed, 7 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/interrupt.c b/arch/powerpc/kernel/interrupt.c
> index 55e1aa18cdb9..8c38e8c95be2 100644
> --- a/arch/powerpc/kernel/interrupt.c
> +++ b/arch/powerpc/kernel/interrupt.c
> @@ -28,6 +28,7 @@ notrace long system_call_exception(long r3, long r4, long 
> r5,
>  unsigned long r0, struct pt_regs *regs)
>  {
>   syscall_fn f;
> + unsigned long expected_msr;
>  
>   regs->orig_gpr3 = r3;
>  
> @@ -39,10 +40,13 @@ notrace long system_call_exception(long r3, long r4, long 
> r5,
>  
>   trace_hardirqs_off(); /* finish reconciling */
>  
> + expected_msr = MSR_PR;
>   if (!IS_ENABLED(CONFIG_BOOKE) && !IS_ENABLED(CONFIG_40x))
> - BUG_ON(!(regs->msr & MSR_RI));
> - BUG_ON(!(regs->msr & MSR_PR));
> - BUG_ON(arch_irq_disabled_regs(regs));
> + expected_msr |= MSR_RI;
> + if (IS_ENABLED(CONFIG_PPC32))
> + expected_msr |= MSR_EE;
> + BUG_ON((regs->msr & expected_msr) ^ expected_msr);
> + BUG_ON(IS_ENABLED(CONFIG_PPC64) && arch_irq_disabled_regs(regs));
>  
>  #ifdef CONFIG_PPC_PKEY
>   if (mmu_has_feature(MMU_FTR_PKEY)) {
> -- 
> 2.25.0
> 
> 


Re: [PATCH v5 18/22] powerpc/syscall: Remove FULL_REGS verification in system_call_exception

2021-02-08 Thread Nicholas Piggin
Excerpts from Christophe Leroy's message of February 9, 2021 1:10 am:
> For book3s/64, FULL_REGS() is 'true' at all time, so the test voids.
> For others, non volatile registers are saved inconditionally.
> 
> So the verification is pointless.
> 
> Should one fail to do it, it would anyway be caught by the
> CHECK_FULL_REGS() in copy_thread() as we have removed the
> special versions ppc_fork() and friends.
> 
> null_syscall benchmark reduction 4 cycles (332 => 328 cycles)

I wonder if we should rather make a CONFIG option for a bunch of these
simpler debug checks here (and also in interrupt exit, wrappers, etc.)
rather than removing them entirely.
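
Something like this, roughly (the config and macro names are invented,
just to show the idea):

#ifdef CONFIG_PPC_INTERRUPT_SANITY_CHECKS
#define INT_SANITY_BUG_ON(cond)	BUG_ON(cond)
#else
#define INT_SANITY_BUG_ON(cond)	do { } while (0)
#endif

so the entry/exit paths could keep, e.g.,

	INT_SANITY_BUG_ON(!(regs->msr & MSR_PR));
	INT_SANITY_BUG_ON(!FULL_REGS(regs));

and production builds would pay nothing for them.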

Thanks,
Nick

> 
> Signed-off-by: Christophe Leroy 
> ---
>  arch/powerpc/kernel/interrupt.c | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/arch/powerpc/kernel/interrupt.c b/arch/powerpc/kernel/interrupt.c
> index 8fafca727b8b..55e1aa18cdb9 100644
> --- a/arch/powerpc/kernel/interrupt.c
> +++ b/arch/powerpc/kernel/interrupt.c
> @@ -42,7 +42,6 @@ notrace long system_call_exception(long r3, long r4, long 
> r5,
>   if (!IS_ENABLED(CONFIG_BOOKE) && !IS_ENABLED(CONFIG_40x))
>   BUG_ON(!(regs->msr & MSR_RI));
>   BUG_ON(!(regs->msr & MSR_PR));
> - BUG_ON(!FULL_REGS(regs));
>   BUG_ON(arch_irq_disabled_regs(regs));
>  
>  #ifdef CONFIG_PPC_PKEY
> -- 
> 2.25.0
> 
> 


Re: [PATCH v5 17/22] powerpc/syscall: Do not check unsupported scv vector on PPC32

2021-02-08 Thread Nicholas Piggin
Excerpts from Christophe Leroy's message of February 9, 2021 1:10 am:
> Only PPC64 has scv. No need to check the 0x7ff0 trap on PPC32.
> For that, add a helper trap_is_unsupported_scv() similar to
> trap_is_scv().
> 
> And ignore the scv parameter in syscall_exit_prepare (Save 14 cycles
> 346 => 332 cycles)
> 
> Signed-off-by: Christophe Leroy 
> ---
> v5: Added a helper trap_is_unsupported_scv()
> ---
>  arch/powerpc/include/asm/ptrace.h | 5 +
>  arch/powerpc/kernel/entry_32.S| 1 -
>  arch/powerpc/kernel/interrupt.c   | 7 +--
>  3 files changed, 10 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/ptrace.h 
> b/arch/powerpc/include/asm/ptrace.h
> index 58f9dc060a7b..2c842b11a924 100644
> --- a/arch/powerpc/include/asm/ptrace.h
> +++ b/arch/powerpc/include/asm/ptrace.h
> @@ -229,6 +229,11 @@ static inline bool trap_is_scv(struct pt_regs *regs)
>   return (IS_ENABLED(CONFIG_PPC_BOOK3S_64) && TRAP(regs) == 0x3000);
>  }
>  
> +static inline bool trap_is_unsupported_scv(struct pt_regs *regs)
> +{
> + return (IS_ENABLED(CONFIG_PPC_BOOK3S_64) && TRAP(regs) == 0x7ff0);
> +}

This change is good.

> +
>  static inline bool trap_is_syscall(struct pt_regs *regs)
>  {
>   return (trap_is_scv(regs) || TRAP(regs) == 0xc00);
> diff --git a/arch/powerpc/kernel/entry_32.S b/arch/powerpc/kernel/entry_32.S
> index cffe58e63356..7c824e8928d0 100644
> --- a/arch/powerpc/kernel/entry_32.S
> +++ b/arch/powerpc/kernel/entry_32.S
> @@ -344,7 +344,6 @@ transfer_to_syscall:
>  
>  ret_from_syscall:
>   addir4,r1,STACK_FRAME_OVERHEAD
> - li  r5,0
>   bl  syscall_exit_prepare

For this one, I think it would be nice to do the "right" thing and make 
the function prototypes different on !64S. They could then declare a
local const bool scv = 0.

We could have syscall_exit_prepare and syscall_exit_prepare_maybe_scv
or something like that; 64s can use the latter one and the former can be
a wrapper that passes a constant 0 for scv. Then we don't have different
prototypes for the same function, but you just have to make the 32-bit
version static inline and the 64-bit version exported to asm.

Thanks,
Nick

>  #if defined(CONFIG_4xx) || defined(CONFIG_BOOKE)
>   /* If the process has its own DBCR0 value, load it up.  The internal
> diff --git a/arch/powerpc/kernel/interrupt.c b/arch/powerpc/kernel/interrupt.c
> index 205902052112..8fafca727b8b 100644
> --- a/arch/powerpc/kernel/interrupt.c
> +++ b/arch/powerpc/kernel/interrupt.c
> @@ -88,7 +88,7 @@ notrace long system_call_exception(long r3, long r4, long 
> r5,
>   local_irq_enable();
>  
>   if (unlikely(current_thread_info()->flags & _TIF_SYSCALL_DOTRACE)) {
> - if (unlikely(regs->trap == 0x7ff0)) {
> + if (unlikely(trap_is_unsupported_scv(regs))) {
>   /* Unsupported scv vector */
>   _exception(SIGILL, regs, ILL_ILLOPC, regs->nip);
>   return regs->gpr[3];
> @@ -111,7 +111,7 @@ notrace long system_call_exception(long r3, long r4, long 
> r5,
>   r8 = regs->gpr[8];
>  
>   } else if (unlikely(r0 >= NR_syscalls)) {
> - if (unlikely(regs->trap == 0x7ff0)) {
> + if (unlikely(trap_is_unsupported_scv(regs))) {
>   /* Unsupported scv vector */
>   _exception(SIGILL, regs, ILL_ILLOPC, regs->nip);
>   return regs->gpr[3];
> @@ -224,6 +224,9 @@ notrace unsigned long syscall_exit_prepare(unsigned long 
> r3,
>   unsigned long ti_flags;
>   unsigned long ret = 0;
>  
> + if (IS_ENABLED(CONFIG_PPC32))
> + scv = 0;
> +
>   CT_WARN_ON(ct_state() == CONTEXT_USER);
>  
>  #ifdef CONFIG_PPC64
> -- 
> 2.25.0
> 
> 


Re: [PATCH v5 16/22] powerpc/syscall: Avoid stack frame in likely part of system_call_exception()

2021-02-08 Thread Nicholas Piggin
Excerpts from Christophe Leroy's message of February 9, 2021 1:10 am:
> When r3 is not modified, reload it from regs->orig_r3 to free
> volatile registers. This avoids a stack frame for the likely part
> of system_call_exception()

This doesn't avoid the stack frame on my 64s build, but it does reduce
one non-volatile register save/restore. With quite a bit more register
pressure reduction, 64s could avoid the stack frame as well.

It's a cool trick, but quite code- and compiler-specific, so I don't
know how worthwhile it is to keep, considering we're calling out into
random kernel C code after this.

Maybe just keep it PPC32-specific for the moment; we will have to do
more tuning for 64, and we have other stuff to do there first.

If you are happy to make it 32-bit only then

Reviewed-by: Nicholas Piggin 

> 
> Before the patch:
> 
> c000b4d4 :
> c000b4d4: 7c 08 02 a6 mflrr0
> c000b4d8: 94 21 ff e0 stwur1,-32(r1)
> c000b4dc: 93 e1 00 1c stw r31,28(r1)
> c000b4e0: 90 01 00 24 stw r0,36(r1)
> c000b4e4: 90 6a 00 88 stw r3,136(r10)
> c000b4e8: 81 6a 00 84 lwz r11,132(r10)
> c000b4ec: 69 6b 00 02 xorir11,r11,2
> c000b4f0: 55 6b ff fe rlwinm  r11,r11,31,31,31
> c000b4f4: 0f 0b 00 00 twnei   r11,0
> c000b4f8: 81 6a 00 a0 lwz r11,160(r10)
> c000b4fc: 55 6b 07 fe clrlwi  r11,r11,31
> c000b500: 0f 0b 00 00 twnei   r11,0
> c000b504: 7c 0c 42 e6 mftbr0
> c000b508: 83 e2 00 08 lwz r31,8(r2)
> c000b50c: 81 82 00 28 lwz r12,40(r2)
> c000b510: 90 02 00 24 stw r0,36(r2)
> c000b514: 7d 8c f8 50 subfr12,r12,r31
> c000b518: 7c 0c 02 14 add r0,r12,r0
> c000b51c: 90 02 00 08 stw r0,8(r2)
> c000b520: 7c 10 13 a6 mtspr   80,r0
> c000b524: 81 62 00 70 lwz r11,112(r2)
> c000b528: 71 60 86 91 andi.   r0,r11,34449
> c000b52c: 40 82 00 34 bne c000b560 
> c000b530: 2b 89 01 b6 cmplwi  cr7,r9,438
> c000b534: 41 9d 00 64 bgt cr7,c000b598 
> 
> c000b538: 3d 40 c0 5c lis r10,-16292
> c000b53c: 55 29 10 3a rlwinm  r9,r9,2,0,29
> c000b540: 39 4a 41 e8 addir10,r10,16872
> c000b544: 80 01 00 24 lwz r0,36(r1)
> c000b548: 7d 2a 48 2e lwzxr9,r10,r9
> c000b54c: 7c 08 03 a6 mtlrr0
> c000b550: 7d 29 03 a6 mtctr   r9
> c000b554: 83 e1 00 1c lwz r31,28(r1)
> c000b558: 38 21 00 20 addir1,r1,32
> c000b55c: 4e 80 04 20 bctr
> 
> After the patch:
> 
> c000b4d4 :
> c000b4d4: 81 6a 00 84 lwz r11,132(r10)
> c000b4d8: 90 6a 00 88 stw r3,136(r10)
> c000b4dc: 69 6b 00 02 xorir11,r11,2
> c000b4e0: 55 6b ff fe rlwinm  r11,r11,31,31,31
> c000b4e4: 0f 0b 00 00 twnei   r11,0
> c000b4e8: 80 6a 00 a0 lwz r3,160(r10)
> c000b4ec: 54 63 07 fe clrlwi  r3,r3,31
> c000b4f0: 0f 03 00 00 twnei   r3,0
> c000b4f4: 7d 6c 42 e6 mftbr11
> c000b4f8: 81 82 00 08 lwz r12,8(r2)
> c000b4fc: 80 02 00 28 lwz r0,40(r2)
> c000b500: 91 62 00 24 stw r11,36(r2)
> c000b504: 7c 00 60 50 subfr0,r0,r12
> c000b508: 7d 60 5a 14 add r11,r0,r11
> c000b50c: 91 62 00 08 stw r11,8(r2)
> c000b510: 7c 10 13 a6 mtspr   80,r0
> c000b514: 80 62 00 70 lwz r3,112(r2)
> c000b518: 70 6b 86 91 andi.   r11,r3,34449
> c000b51c: 40 82 00 28 bne c000b544 
> c000b520: 2b 89 01 b6 cmplwi  cr7,r9,438
> c000b524: 41 9d 00 84 bgt cr7,c000b5a8 
> 
> c000b528: 80 6a 00 88 lwz r3,136(r10)
> c000b52c: 3d 40 c0 5c lis r10,-16292
> c000b530: 55 29 10 3a rlwinm  r9,r9,2,0,29
> c000b534: 39 4a 41 e4 addir10,r10,16868
> c000b538: 7d 2a 48 2e lwzxr9,r10,r9
> c000b53c: 7d 29 03 a6 mtctr   r9
> c000b540: 4e 80 04 20 bctr
> 
> Signed-off-by: Christophe Leroy 
> ---
>  arch/powerpc/kernel/interrupt.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/arch/powerpc/kernel/interrupt.c b/arch/powerpc/kernel/interrupt.c
> index 107ec39f05cb..205902052112 100644
> --- a/arch/powerpc/kernel/interrupt.c
> +++ b/arch/powerpc/kernel/interrupt.c
> @@ -117,6 +117,9 @@ notrace long system_call_exception(long r3, long r4, long 
> r5,
>   return regs->gpr[3];
>   }
>   return -ENOSYS;
> + } else {
> + /* Restore r3 from orig_gpr3 to free up a volatile reg */
> + r3 = regs->orig_gpr3;
>   }
>  
>   /* May be faster to do array_index_nospec? */
> -- 
> 2.25.0
> 
> 


Re: [PATCH v5 12/22] powerpc/syscall: Change condition to check MSR_RI

2021-02-08 Thread Nicholas Piggin
Excerpts from Christophe Leroy's message of February 9, 2021 1:10 am:
> In system_call_exception(), MSR_RI also needs to be checked on 8xx.
> Only booke and 40x doesn't have MSR_RI.

Reviewed-by: Nicholas Piggin 

...
> 
> Signed-off-by: Christophe Leroy 
> ---
> v5: Also in interrupt exit prepare
> ---
>  arch/powerpc/kernel/interrupt.c | 7 ---
>  1 file changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/interrupt.c b/arch/powerpc/kernel/interrupt.c
> index 1a2dec49f811..107ec39f05cb 100644
> --- a/arch/powerpc/kernel/interrupt.c
> +++ b/arch/powerpc/kernel/interrupt.c
> @@ -39,7 +39,7 @@ notrace long system_call_exception(long r3, long r4, long 
> r5,
>  
>   trace_hardirqs_off(); /* finish reconciling */
>  
> - if (IS_ENABLED(CONFIG_PPC_BOOK3S))
> + if (!IS_ENABLED(CONFIG_BOOKE) && !IS_ENABLED(CONFIG_40x))
>   BUG_ON(!(regs->msr & MSR_RI));
>   BUG_ON(!(regs->msr & MSR_PR));
>   BUG_ON(!FULL_REGS(regs));
> @@ -338,7 +338,7 @@ notrace unsigned long interrupt_exit_user_prepare(struct 
> pt_regs *regs, unsigned
>   unsigned long flags;
>   unsigned long ret = 0;
>  
> - if (IS_ENABLED(CONFIG_PPC_BOOK3S))
> + if (!IS_ENABLED(CONFIG_BOOKE) && !IS_ENABLED(CONFIG_40x))
>   BUG_ON(!(regs->msr & MSR_RI));
>   BUG_ON(!(regs->msr & MSR_PR));
>   BUG_ON(!FULL_REGS(regs));
> @@ -436,7 +436,8 @@ notrace unsigned long 
> interrupt_exit_kernel_prepare(struct pt_regs *regs, unsign
>   unsigned long amr;
>  #endif
>  
> - if (IS_ENABLED(CONFIG_PPC_BOOK3S) && unlikely(!(regs->msr & MSR_RI)))
> + if (!IS_ENABLED(CONFIG_BOOKE) && !IS_ENABLED(CONFIG_40x) &&
> + unlikely(!(regs->msr & MSR_RI)))
>   unrecoverable_exception(regs);
>   BUG_ON(regs->msr & MSR_PR);
>   BUG_ON(!FULL_REGS(regs));
> -- 
> 2.25.0
> 
> 


Re: [PATCH v5 11/22] powerpc/syscall: Save r3 in regs->orig_r3

2021-02-08 Thread Nicholas Piggin
Excerpts from Christophe Leroy's message of February 9, 2021 1:10 am:
> Save r3 in regs->orig_r3 in system_call_exception()
> 
> Signed-off-by: Christophe Leroy 

Reviewed-by: Nicholas Piggin 

> ---
> v5: Removed the assembly one on SCV type system call
> ---
>  arch/powerpc/kernel/entry_64.S  | 2 --
>  arch/powerpc/kernel/interrupt.c | 2 ++
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
> index 33ddfeef4fe9..a91c2def165d 100644
> --- a/arch/powerpc/kernel/entry_64.S
> +++ b/arch/powerpc/kernel/entry_64.S
> @@ -108,7 +108,6 @@ END_FTR_SECTION_IFSET(CPU_FTR_TM)
>   li  r11,\trapnr
>   std r11,_TRAP(r1)
>   std r12,_CCR(r1)
> - std r3,ORIG_GPR3(r1)
>   addir10,r1,STACK_FRAME_OVERHEAD
>   ld  r11,exception_marker@toc(r2)
>   std r11,-16(r10)/* "regshere" marker */
> @@ -278,7 +277,6 @@ END_BTB_FLUSH_SECTION
>   std r10,_LINK(r1)
>   std r11,_TRAP(r1)
>   std r12,_CCR(r1)
> - std r3,ORIG_GPR3(r1)
>   addir10,r1,STACK_FRAME_OVERHEAD
>   ld  r11,exception_marker@toc(r2)
>   std r11,-16(r10)/* "regshere" marker */
> diff --git a/arch/powerpc/kernel/interrupt.c b/arch/powerpc/kernel/interrupt.c
> index 46fd195ca659..1a2dec49f811 100644
> --- a/arch/powerpc/kernel/interrupt.c
> +++ b/arch/powerpc/kernel/interrupt.c
> @@ -29,6 +29,8 @@ notrace long system_call_exception(long r3, long r4, long 
> r5,
>  {
>   syscall_fn f;
>  
> + regs->orig_gpr3 = r3;
> +
>   if (IS_ENABLED(CONFIG_PPC_IRQ_SOFT_MASK_DEBUG))
>   BUG_ON(irq_soft_mask_return() != IRQS_ALL_DISABLED);
>  
> -- 
> 2.25.0
> 
> 


Re: [PATCH v5 10/22] powerpc/syscall: Use is_compat_task()

2021-02-08 Thread Nicholas Piggin
Excerpts from Christophe Leroy's message of February 9, 2021 1:10 am:
> Instead of hard comparing task flags with _TIF_32BIT, use
> is_compat_task(). The advantage is that it returns 0 on PPC32
> allthough _TIF_32BIT is always set.
> 
> Signed-off-by: Christophe Leroy 

Reviewed-by: Nicholas Piggin 


> ---
>  arch/powerpc/kernel/interrupt.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/kernel/interrupt.c b/arch/powerpc/kernel/interrupt.c
> index 2dac4d2bb1cf..46fd195ca659 100644
> --- a/arch/powerpc/kernel/interrupt.c
> +++ b/arch/powerpc/kernel/interrupt.c
> @@ -2,6 +2,8 @@
>  
>  #include 
>  #include 
> +#include 
> +
>  #include 
>  #include 
>  #include 
> @@ -118,7 +120,7 @@ notrace long system_call_exception(long r3, long r4, long 
> r5,
>   /* May be faster to do array_index_nospec? */
>   barrier_nospec();
>  
> - if (unlikely(is_32bit_task())) {
> + if (unlikely(is_compat_task())) {
>   f = (void *)compat_sys_call_table[r0];
>  
>   r3 &= 0xULL;
> -- 
> 2.25.0
> 
> 


Re: [PATCH v5 09/22] powerpc/syscall: Make interrupt.c buildable on PPC32

2021-02-08 Thread Nicholas Piggin
Excerpts from Christophe Leroy's message of February 9, 2021 1:10 am:
> To allow building interrupt.c on PPC32, ifdef out specific PPC64
> code or use helpers which are available on both PP32 and PPC64
> 
> Modify Makefile to always build interrupt.o
> 
> Signed-off-by: Christophe Leroy 
> ---
> v5:
> - Also for interrupt exit preparation
> - Opted out kuap related code, ppc32 keeps it in ASM for the time being
> ---
>  arch/powerpc/kernel/Makefile|  4 ++--
>  arch/powerpc/kernel/interrupt.c | 31 ---
>  2 files changed, 26 insertions(+), 9 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile
> index 26ff8c6e06b7..163755b1cef4 100644
> --- a/arch/powerpc/kernel/Makefile
> +++ b/arch/powerpc/kernel/Makefile
> @@ -57,10 +57,10 @@ obj-y := cputable.o 
> syscalls.o \
>  prom.o traps.o setup-common.o \
>  udbg.o misc.o io.o misc_$(BITS).o \
>  of_platform.o prom_parse.o firmware.o \
> -hw_breakpoint_constraints.o
> +hw_breakpoint_constraints.o interrupt.o
>  obj-y+= ptrace/
>  obj-$(CONFIG_PPC64)  += setup_64.o \
> -paca.o nvram_64.o note.o interrupt.o
> +paca.o nvram_64.o note.o
>  obj-$(CONFIG_COMPAT) += sys_ppc32.o signal_32.o
>  obj-$(CONFIG_VDSO32) += vdso32_wrapper.o
>  obj-$(CONFIG_PPC_WATCHDOG)   += watchdog.o
> diff --git a/arch/powerpc/kernel/interrupt.c b/arch/powerpc/kernel/interrupt.c
> index d6be4f9a67e5..2dac4d2bb1cf 100644
> --- a/arch/powerpc/kernel/interrupt.c
> +++ b/arch/powerpc/kernel/interrupt.c
> @@ -39,7 +39,7 @@ notrace long system_call_exception(long r3, long r4, long 
> r5,
>   BUG_ON(!(regs->msr & MSR_RI));
>   BUG_ON(!(regs->msr & MSR_PR));
>   BUG_ON(!FULL_REGS(regs));
> - BUG_ON(regs->softe != IRQS_ENABLED);
> + BUG_ON(arch_irq_disabled_regs(regs));
>  
>  #ifdef CONFIG_PPC_PKEY
>   if (mmu_has_feature(MMU_FTR_PKEY)) {
> @@ -65,7 +65,9 @@ notrace long system_call_exception(long r3, long r4, long 
> r5,
>   isync();
>   } else
>  #endif
> +#ifdef CONFIG_PPC64
>   kuap_check_amr();
> +#endif

Wouldn't mind trying to get rid of these ifdefs at some point, but there
are some kuap / keys changes going on recently, so I'm happy enough
to let this settle and then look at whether we can refactor.

>  
>   account_cpu_user_entry();
>  
> @@ -77,7 +79,7 @@ notrace long system_call_exception(long r3, long r4, long 
> r5,
>* frame, or if the unwinder was taught the first stack frame always
>* returns to user with IRQS_ENABLED, this store could be avoided!
>*/
> - regs->softe = IRQS_ENABLED;
> + irq_soft_mask_regs_set_state(regs, IRQS_ENABLED);
>  
>   local_irq_enable();
>  
> @@ -151,6 +153,7 @@ static notrace inline bool 
> __prep_irq_for_enabled_exit(bool clear_ri)
>   __hard_EE_RI_disable();
>   else
>   __hard_irq_disable();
> +#ifdef CONFIG_PPC64
>   if (unlikely(lazy_irq_pending_nocheck())) {
>   /* Took an interrupt, may have more exit work to do. */
>   if (clear_ri)
> @@ -162,7 +165,7 @@ static notrace inline bool 
> __prep_irq_for_enabled_exit(bool clear_ri)
>   }
>   local_paca->irq_happened = 0;
>   irq_soft_mask_set(IRQS_ENABLED);
> -
> +#endif
>   return true;
>  }
>  

Do we prefer a blank line before the return, except in trivial cases?

> @@ -216,7 +219,9 @@ notrace unsigned long syscall_exit_prepare(unsigned long 
> r3,
>  
>   CT_WARN_ON(ct_state() == CONTEXT_USER);
>  
> +#ifdef CONFIG_PPC64
>   kuap_check_amr();
> +#endif
>  
>   regs->result = r3;
>  
> @@ -309,7 +314,7 @@ notrace unsigned long syscall_exit_prepare(unsigned long 
> r3,
>  
>   account_cpu_user_exit();
>  
> -#ifdef CONFIG_PPC_BOOK3S /* BOOK3E not yet using this */
> +#ifdef CONFIG_PPC_BOOK3S_64 /* BOOK3E and ppc32 not using this */
>   /*
>* We do this at the end so that we do context switch with KERNEL AMR
>*/
> @@ -318,7 +323,7 @@ notrace unsigned long syscall_exit_prepare(unsigned long 
> r3,
>   return ret;
>  }
>  
> -#ifdef CONFIG_PPC_BOOK3S /* BOOK3E not yet using this */
> +#ifndef CONFIG_PPC_BOOK3E_64 /* BOOK3E not yet using this */
>  notrace unsigned long interrupt_exit_user_prepare(struct pt_regs *regs, 
> unsigned long msr)
>  {
>  #ifdef CONFIG

Re: [PATCH v5 08/22] powerpc/syscall: Rename syscall_64.c into interrupt.c

2021-02-08 Thread Nicholas Piggin
Excerpts from Christophe Leroy's message of February 9, 2021 1:10 am:
> syscall_64.c will be reused almost as is for PPC32.
> 
> As this file also contains functions to handle other types
> of interrupts rename it interrupt.c
> 
> Signed-off-by: Christophe Leroy 

Reviewed-by: Nicholas Piggin 

> ---
>  arch/powerpc/kernel/Makefile  | 2 +-
>  arch/powerpc/kernel/{syscall_64.c => interrupt.c} | 0
>  2 files changed, 1 insertion(+), 1 deletion(-)
>  rename arch/powerpc/kernel/{syscall_64.c => interrupt.c} (100%)
> 
> diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile
> index c173efd66c00..26ff8c6e06b7 100644
> --- a/arch/powerpc/kernel/Makefile
> +++ b/arch/powerpc/kernel/Makefile
> @@ -60,7 +60,7 @@ obj-y   := cputable.o 
> syscalls.o \
>  hw_breakpoint_constraints.o
>  obj-y+= ptrace/
>  obj-$(CONFIG_PPC64)  += setup_64.o \
> -paca.o nvram_64.o note.o syscall_64.o
> +paca.o nvram_64.o note.o interrupt.o
>  obj-$(CONFIG_COMPAT) += sys_ppc32.o signal_32.o
>  obj-$(CONFIG_VDSO32) += vdso32_wrapper.o
>  obj-$(CONFIG_PPC_WATCHDOG)   += watchdog.o
> diff --git a/arch/powerpc/kernel/syscall_64.c 
> b/arch/powerpc/kernel/interrupt.c
> similarity index 100%
> rename from arch/powerpc/kernel/syscall_64.c
> rename to arch/powerpc/kernel/interrupt.c
> -- 
> 2.25.0
> 
> 


Re: [PATCH v5 07/22] powerpc/irq: Add stub irq_soft_mask_return() for PPC32

2021-02-08 Thread Nicholas Piggin
Excerpts from Christophe Leroy's message of February 9, 2021 1:10 am:
> To allow building syscall_64.c smoothly on PPC32, add stub version
> of irq_soft_mask_return().
> 
> Signed-off-by: Christophe Leroy 

Same kind of comment as the other soft mask stuff. Again, not a big
deal, but there might be a way to improve it. For example, make a
debug_syscall_entry(regs) function that ppc64 could put the soft mask
checks into.
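
Something like this is what I have in mind (sketch only); then the
ppc32 stub for irq_soft_mask_return() wouldn't be needed at all:

#ifdef CONFIG_PPC64
static inline void debug_syscall_entry(struct pt_regs *regs)
{
	if (IS_ENABLED(CONFIG_PPC_IRQ_SOFT_MASK_DEBUG))
		BUG_ON(irq_soft_mask_return() != IRQS_ALL_DISABLED);
	BUG_ON(arch_irq_disabled_regs(regs));
}
#else
static inline void debug_syscall_entry(struct pt_regs *regs) { }
#endif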

No big deal; if you don't make any changes now, I might see about doing
something like that after your series goes in.

Reviewed-by: Nicholas Piggin 

> ---
>  arch/powerpc/include/asm/hw_irq.h | 5 +
>  1 file changed, 5 insertions(+)
> 
> diff --git a/arch/powerpc/include/asm/hw_irq.h 
> b/arch/powerpc/include/asm/hw_irq.h
> index 4739f61e632c..56a98936a6a9 100644
> --- a/arch/powerpc/include/asm/hw_irq.h
> +++ b/arch/powerpc/include/asm/hw_irq.h
> @@ -330,6 +330,11 @@ static inline void irq_soft_mask_regs_set_state(struct 
> pt_regs *regs, unsigned l
>  }
>  #else /* CONFIG_PPC64 */
>  
> +static inline notrace unsigned long irq_soft_mask_return(void)
> +{
> + return 0;
> +}
> +
>  static inline unsigned long arch_local_save_flags(void)
>  {
>   return mfmsr();
> -- 
> 2.25.0
> 
> 


Re: [PATCH v5 06/22] powerpc/irq: Rework helpers that manipulate MSR[EE/RI]

2021-02-08 Thread Nicholas Piggin
Excerpts from Christophe Leroy's message of February 9, 2021 1:10 am:
> In preparation of porting PPC32 to C syscall entry/exit,
> rewrite the following helpers as static inline functions and
> add support for PPC32 in them:
>   __hard_irq_enable()
>   __hard_irq_disable()
>   __hard_EE_RI_disable()
>   __hard_RI_enable()
> 
> Then use them in PPC32 version of arch_local_irq_disable()
> and arch_local_irq_enable() to avoid code duplication.
> 

Reviewed-by: Nicholas Piggin 

> Signed-off-by: Christophe Leroy 
> ---
>  arch/powerpc/include/asm/hw_irq.h | 75 +--
>  arch/powerpc/include/asm/reg.h|  1 +
>  2 files changed, 52 insertions(+), 24 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/hw_irq.h 
> b/arch/powerpc/include/asm/hw_irq.h
> index ed0c3b049dfd..4739f61e632c 100644
> --- a/arch/powerpc/include/asm/hw_irq.h
> +++ b/arch/powerpc/include/asm/hw_irq.h
> @@ -50,6 +50,55 @@
>  
>  #ifndef __ASSEMBLY__
>  
> +static inline void __hard_irq_enable(void)
> +{
> + if (IS_ENABLED(CONFIG_BOOKE) || IS_ENABLED(CONFIG_40x))
> + wrtee(MSR_EE);
> + else if (IS_ENABLED(CONFIG_PPC_8xx))
> + wrtspr(SPRN_EIE);
> + else if (IS_ENABLED(CONFIG_PPC_BOOK3S_64))
> + __mtmsrd(MSR_EE | MSR_RI, 1);
> + else
> + mtmsr(mfmsr() | MSR_EE);
> +}
> +
> +static inline void __hard_irq_disable(void)
> +{
> + if (IS_ENABLED(CONFIG_BOOKE) || IS_ENABLED(CONFIG_40x))
> + wrtee(0);
> + else if (IS_ENABLED(CONFIG_PPC_8xx))
> + wrtspr(SPRN_EID);
> + else if (IS_ENABLED(CONFIG_PPC_BOOK3S_64))
> + __mtmsrd(MSR_RI, 1);
> + else
> + mtmsr(mfmsr() & ~MSR_EE);
> +}
> +
> +static inline void __hard_EE_RI_disable(void)
> +{
> + if (IS_ENABLED(CONFIG_BOOKE) || IS_ENABLED(CONFIG_40x))
> + wrtee(0);
> + else if (IS_ENABLED(CONFIG_PPC_8xx))
> + wrtspr(SPRN_NRI);
> + else if (IS_ENABLED(CONFIG_PPC_BOOK3S_64))
> + __mtmsrd(0, 1);
> + else
> + mtmsr(mfmsr() & ~(MSR_EE | MSR_RI));
> +}
> +
> +static inline void __hard_RI_enable(void)
> +{
> + if (IS_ENABLED(CONFIG_BOOKE) || IS_ENABLED(CONFIG_40x))
> + return;
> +
> + if (IS_ENABLED(CONFIG_PPC_8xx))
> + wrtspr(SPRN_EID);
> + else if (IS_ENABLED(CONFIG_PPC_BOOK3S_64))
> + __mtmsrd(MSR_RI, 1);
> + else
> + mtmsr(mfmsr() | MSR_RI);
> +}
> +
>  #ifdef CONFIG_PPC64
>  #include 
>  
> @@ -212,18 +261,6 @@ static inline bool arch_irqs_disabled(void)
>  
>  #endif /* CONFIG_PPC_BOOK3S */
>  
> -#ifdef CONFIG_PPC_BOOK3E
> -#define __hard_irq_enable()  wrtee(MSR_EE)
> -#define __hard_irq_disable() wrtee(0)
> -#define __hard_EE_RI_disable()   wrtee(0)
> -#define __hard_RI_enable()   do { } while (0)
> -#else
> -#define __hard_irq_enable()  __mtmsrd(MSR_EE|MSR_RI, 1)
> -#define __hard_irq_disable() __mtmsrd(MSR_RI, 1)
> -#define __hard_EE_RI_disable()   __mtmsrd(0, 1)
> -#define __hard_RI_enable()   __mtmsrd(MSR_RI, 1)
> -#endif
> -
>  #define hard_irq_disable()   do {\
>   unsigned long flags;\
>   __hard_irq_disable();   \
> @@ -322,22 +359,12 @@ static inline unsigned long arch_local_irq_save(void)
>  
>  static inline void arch_local_irq_disable(void)
>  {
> - if (IS_ENABLED(CONFIG_BOOKE))
> - wrtee(0);
> - else if (IS_ENABLED(CONFIG_PPC_8xx))
> - wrtspr(SPRN_EID);
> - else
> - mtmsr(mfmsr() & ~MSR_EE);
> + __hard_irq_disable();
>  }
>  
>  static inline void arch_local_irq_enable(void)
>  {
> - if (IS_ENABLED(CONFIG_BOOKE))
> - wrtee(MSR_EE);
> - else if (IS_ENABLED(CONFIG_PPC_8xx))
> - wrtspr(SPRN_EIE);
> - else
> - mtmsr(mfmsr() | MSR_EE);
> + __hard_irq_enable();
>  }
>  
>  static inline bool arch_irqs_disabled_flags(unsigned long flags)
> diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
> index c5a3e856191c..bc4305ba00d0 100644
> --- a/arch/powerpc/include/asm/reg.h
> +++ b/arch/powerpc/include/asm/reg.h
> @@ -1375,6 +1375,7 @@
>  #define mtmsr(v) asm volatile("mtmsr %0" : \
>: "r" ((unsigned long)(v)) \
>: "memory")
> +#define __mtmsrd(v, l)   BUILD_BUG()
>  #define __MTMSR  "mtmsr"
>  #endif
>  
> -- 
> 2.25.0
> 
> 


Re: [PATCH v5 05/22] powerpc/irq: Add helper to set regs->softe

2021-02-08 Thread Nicholas Piggin
Excerpts from Christophe Leroy's message of February 9, 2021 1:10 am:
> regs->softe doesn't exist on PPC32.
> 
> Add irq_soft_mask_regs_set_state() helper to set regs->softe.
> This helper will be a no-op on PPC32.
> 
> Signed-off-by: Christophe Leroy 
> ---
>  arch/powerpc/include/asm/hw_irq.h | 11 +--
>  1 file changed, 9 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/hw_irq.h 
> b/arch/powerpc/include/asm/hw_irq.h
> index 614957f74cee..ed0c3b049dfd 100644
> --- a/arch/powerpc/include/asm/hw_irq.h
> +++ b/arch/powerpc/include/asm/hw_irq.h
> @@ -38,6 +38,8 @@
>  #define PACA_IRQ_MUST_HARD_MASK  (PACA_IRQ_EE)
>  #endif
>  
> +#endif /* CONFIG_PPC64 */
> +
>  /*
>   * flags for paca->irq_soft_mask
>   */
> @@ -46,8 +48,6 @@
>  #define IRQS_PMI_DISABLED2
>  #define IRQS_ALL_DISABLED(IRQS_DISABLED | IRQS_PMI_DISABLED)
>  
> -#endif /* CONFIG_PPC64 */
> -
>  #ifndef __ASSEMBLY__
>  
>  #ifdef CONFIG_PPC64
> @@ -287,6 +287,10 @@ extern void irq_set_pending_from_srr1(unsigned long 
> srr1);
>  
>  extern void force_external_irq_replay(void);
>  
> +static inline void irq_soft_mask_regs_set_state(struct pt_regs *regs, 
> unsigned long val)
> +{
> + regs->softe = val;
> +}
>  #else /* CONFIG_PPC64 */
>  
>  static inline unsigned long arch_local_save_flags(void)
> @@ -355,6 +359,9 @@ static inline bool arch_irq_disabled_regs(struct pt_regs 
> *regs)
>  
>  static inline void may_hard_irq_enable(void) { }
>  
> +static inline void irq_soft_mask_regs_set_state(struct pt_regs *regs, 
> unsigned long val)
> +{
> +}
>  #endif /* CONFIG_PPC64 */
>  
>  #define ARCH_IRQ_INIT_FLAGS  IRQ_NOREQUEST

What I don't like about this, where you use it, is that it kind of
pollutes the ppc32 path with a function which is not valid to use there.

I would prefer if you had this purely so it could compile with:

  if (IS_ENABLED(CONFIG_PPC64)))
  irq_soft_mask_regs_set_state(regs, blah);

And then you could make the ppc32 cause a link error if it did not
get eliminated at compile time (e.g., call an undefined function).
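
i.e. something like this (sketch):

#ifdef CONFIG_PPC64
static inline void irq_soft_mask_regs_set_state(struct pt_regs *regs,
						unsigned long val)
{
	regs->softe = val;
}
#else
/* No ppc32 version: if a call survives to link time, that's a bug. */
void irq_soft_mask_regs_set_state(struct pt_regs *regs, unsigned long val);
#endif

with the caller doing:

	if (IS_ENABLED(CONFIG_PPC64))
		irq_soft_mask_regs_set_state(regs, IRQS_ALL_DISABLED);

so the call is eliminated at compile time on ppc32 and the undefined
declaration is never referenced.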

You could do the same with the kuap_ functions to change some ifdefs
to IS_ENABLED.

That's just my preference but if you prefer this way I guess that's
okay.

Thanks,
Nick


Re: [PATCH v5 00/22] powerpc/32: Implement C syscall entry/exit

2021-02-08 Thread Nicholas Piggin
Excerpts from Christophe Leroy's message of February 9, 2021 1:10 am:
> This series implements C syscall entry/exit for PPC32. It reuses
> the work already done for PPC64.
> 
> This series is based on today's merge-test 
> (b6f72fc05389e3fc694bf5a5fa1bbd33f61879e0)
> 
> In terms on performance we have the following number of cycles on an
> 8xx running null_syscall benchmark:
> - mainline: 296 cycles
> - after patch 4: 283 cycles
> - after patch 16: 304 cycles
> - after patch 17: 348 cycles
> - at the end of the series: 320 cycles
> 
> So in summary, we have a degradation of performance of 8% on null_syscall.
> 
> I think it is not a big degradation, it is worth it.

I guess it's 13% from 283. But it's very nice to use the shared C code.
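
(For the record: 320 / 296 - 1 is the ~8% against mainline, while
320 / 283 - 1 is the ~13% against the best point after patch 4.)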

There might be a few more percent speedup in there we can find later.

Thanks,
Nick



Re: [PATCH] powerpc/8xx: Fix software emulation interrupt

2021-02-05 Thread Nicholas Piggin
Excerpts from Christophe Leroy's message of February 5, 2021 6:56 pm:
> For unimplemented instructions or unimplemented SPRs, the 8xx triggers
> a "Software Emulation Exception" (0x1000). That interrupt doesn't set
> reason bits in SRR1 as the "Program Check Exception" does.
> 
> Go through emulation_assist_interrupt() to set REASON_ILLEGAL.
> 
> Fixes: fbbcc3bb139e ("powerpc/8xx: Remove SoftwareEmulation()")
> Signed-off-by: Christophe Leroy 
> ---
> I'm wondering whether it wouldn't be better to set REASON_ILLEGAL
> in the exception prolog and still call program_check_exception.
> And do the same in book3s/64 to avoid the nightmare of an
> INTERRUPT_HANDLER calling another INTERRUPT_HANDLER.

Hmm, I missed this. We just change program_check_exception to
a common function which is called by both.
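
Something along these lines is what I mean (sketch only, using the
interrupt wrapper macros for illustration; details to be sorted out in
the real patch):

static void __do_program_check(struct pt_regs *regs)
{
	/* the existing program_check_exception() body / REASON_* handling */
}

DEFINE_INTERRUPT_HANDLER(program_check_exception)
{
	__do_program_check(regs);
}

DEFINE_INTERRUPT_HANDLER(emulation_assist_interrupt)
{
	regs->msr |= REASON_ILLEGAL;
	__do_program_check(regs);
}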

Thanks,
Nick



Re: [PATCH v2 1/1] powerpc/kvm: Save Timebase Offset to fix sched_clock() while running guest code.

2021-02-05 Thread Nicholas Piggin
Excerpts from Leonardo Bras's message of February 5, 2021 5:01 pm:
> Hey Nick, thanks for reviewing :)
> 
> On Fri, 2021-02-05 at 16:28 +1000, Nicholas Piggin wrote:
>> Excerpts from Leonardo Bras's message of February 5, 2021 4:06 pm:
>> > Before guest entry, TBU40 register is changed to reflect guest timebase.
>> > After exiting the guest, the register is reverted to its original value.
>> > 
>> > If one tries to get the timestamp from host between those changes, it
>> > will present an incorrect value.
>> > 
>> > An example would be trying to add a tracepoint in
>> > kvmppc_guest_entry_inject_int(), which depending on last tracepoint
>> > acquired could actually cause the host to crash.
>> > 
>> > Save the Timebase Offset to PACA and use it on sched_clock() to always
>> > get the correct timestamp.
>> 
>> Ouch. Not sure how reasonable it is to half switch into guest registers 
>> and expect to call into the wider kernel, fixing things up as we go. 
>> What if mftb is used in other places?
> 
> IIUC, the CPU is not supposed to call anything as host between guest
> entry and guest exit, except guest-related cases, like

When I say "call", I'm including tracing in that. If a function is not 
marked as no trace, then it will call into the tracing subsystem.

> kvmppc_guest_entry_inject_int(), but anyway, if something calls mftb it
> will still get the same value as before.

Right, so it'll be out of whack again.

> This is only supposed to change stuff that depends on sched_clock, like
> Tracepoints, that can happen in those exceptions.

If they depend on sched_clock that's one thing. Do they definitely have 
no dependencies on mftb from other calls?

>> Especially as it doesn't seem like there is a reason that function _has_
>> to be called after the timebase is switched to guest, that's just how 
>> the code is structured.
> 
> Correct, but if called, like in rb routines, used by tracepoints, the
> difference between last tb and current (lower) tb may cause the CPU to
> trap PROGRAM exception, crashing host. 

Yes, so I agree with Michael that any function that is involved when we
begin to switch into guest context (or have not completed switching back
to host going the other way) should be marked as no trace (noinstr even,
perhaps).

>> As a local hack to work out a bug okay. If you really need it upstream 
>> could you put it under a debug config option?
> 
> You mean something that is automatically selected whenever those
> configs are enabled? 
> 
> CONFIG_TRACEPOINT && CONFIG_KVM_BOOK3S_HANDLER && CONFIG_PPC_BOOK3S_64
> 
> Or something the user need to select himself in menuconfig?

Yeah I meant a default n thing under powerpc kernel debugging somewhere.

Thanks,
Nick


Re: [PATCH v2 1/1] powerpc/kvm: Save Timebase Offset to fix sched_clock() while running guest code.

2021-02-04 Thread Nicholas Piggin
Excerpts from Leonardo Bras's message of February 5, 2021 4:06 pm:
> Before guest entry, TBU40 register is changed to reflect guest timebase.
> After exiting the guest, the register is reverted to its original value.
> 
> If one tries to get the timestamp from host between those changes, it
> will present an incorrect value.
> 
> An example would be trying to add a tracepoint in
> kvmppc_guest_entry_inject_int(), which depending on last tracepoint
> acquired could actually cause the host to crash.
> 
> Save the Timebase Offset to PACA and use it on sched_clock() to always
> get the correct timestamp.

Ouch. Not sure how reasonable it is to half switch into guest registers 
and expect to call into the wider kernel, fixing things up as we go. 
What if mftb is used in other places?

Especially as it doesn't seem like there is a reason that function _has_
to be called after the timebase is switched to guest, that's just how 
the code is structured.

As a local hack to work out a bug okay. If you really need it upstream 
could you put it under a debug config option?

Thanks,
Nick

> Signed-off-by: Leonardo Bras 
> Suggested-by: Paul Mackerras 
> ---
> Changes since v1:
> - Subtracts offset only when CONFIG_KVM_BOOK3S_HANDLER and
>   CONFIG_PPC_BOOK3S_64 are defined.
> ---
>  arch/powerpc/include/asm/kvm_book3s_asm.h | 1 +
>  arch/powerpc/kernel/asm-offsets.c | 1 +
>  arch/powerpc/kernel/time.c| 8 +++-
>  arch/powerpc/kvm/book3s_hv.c  | 2 ++
>  arch/powerpc/kvm/book3s_hv_rmhandlers.S   | 2 ++
>  5 files changed, 13 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/include/asm/kvm_book3s_asm.h 
> b/arch/powerpc/include/asm/kvm_book3s_asm.h
> index 078f4648ea27..e2c12a10eed2 100644
> --- a/arch/powerpc/include/asm/kvm_book3s_asm.h
> +++ b/arch/powerpc/include/asm/kvm_book3s_asm.h
> @@ -131,6 +131,7 @@ struct kvmppc_host_state {
>   u64 cfar;
>   u64 ppr;
>   u64 host_fscr;
> + u64 tb_offset;  /* Timebase offset: keeps correct timebase 
> while on guest */
>  #endif
>  };
>  
> diff --git a/arch/powerpc/kernel/asm-offsets.c 
> b/arch/powerpc/kernel/asm-offsets.c
> index b12d7c049bfe..0beb8fdc6352 100644
> --- a/arch/powerpc/kernel/asm-offsets.c
> +++ b/arch/powerpc/kernel/asm-offsets.c
> @@ -706,6 +706,7 @@ int main(void)
>   HSTATE_FIELD(HSTATE_CFAR, cfar);
>   HSTATE_FIELD(HSTATE_PPR, ppr);
>   HSTATE_FIELD(HSTATE_HOST_FSCR, host_fscr);
> + HSTATE_FIELD(HSTATE_TB_OFFSET, tb_offset);
>  #endif /* CONFIG_PPC_BOOK3S_64 */
>  
>  #else /* CONFIG_PPC_BOOK3S */
> diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
> index 67feb3524460..f27f0163792b 100644
> --- a/arch/powerpc/kernel/time.c
> +++ b/arch/powerpc/kernel/time.c
> @@ -699,7 +699,13 @@ EXPORT_SYMBOL_GPL(tb_to_ns);
>   */
>  notrace unsigned long long sched_clock(void)
>  {
> - return mulhdu(get_tb() - boot_tb, tb_to_ns_scale) << tb_to_ns_shift;
> + u64 tb = get_tb() - boot_tb;
> +
> +#if defined(CONFIG_PPC_BOOK3S_64) && defined(CONFIG_KVM_BOOK3S_HANDLER)
> + tb -= local_paca->kvm_hstate.tb_offset;
> +#endif
> +
> + return mulhdu(tb, tb_to_ns_scale) << tb_to_ns_shift;
>  }
>  
>  
> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> index b3731572295e..c08593c63353 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -3491,6 +3491,7 @@ static int kvmhv_load_hv_regs_and_go(struct kvm_vcpu 
> *vcpu, u64 time_limit,
>   if ((tb & 0xff) < (new_tb & 0xff))
>   mtspr(SPRN_TBU40, new_tb + 0x100);
>   vc->tb_offset_applied = vc->tb_offset;
> + local_paca->kvm_hstate.tb_offset = vc->tb_offset;
>   }
>  
>   if (vc->pcr)
> @@ -3594,6 +3595,7 @@ static int kvmhv_load_hv_regs_and_go(struct kvm_vcpu 
> *vcpu, u64 time_limit,
>   if ((tb & 0xff) < (new_tb & 0xff))
>   mtspr(SPRN_TBU40, new_tb + 0x100);
>   vc->tb_offset_applied = 0;
> + local_paca->kvm_hstate.tb_offset = 0;
>   }
>  
>   mtspr(SPRN_HDEC, 0x7fff);
> diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S 
> b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> index b73140607875..8f7a9f7f4ee6 100644
> --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> @@ -632,6 +632,7 @@ END_FTR_SECTION_IFCLR(CPU_FTR_ARCH_300)
>   cmpdi   r8,0
>   beq 37f
>   std r8, VCORE_TB_OFFSET_APPL(r5)
> + std r8, HSTATE_TB_OFFSET(r13)
>   mftbr6  /* current host timebase */
>   add r8,r8,r6
>   mtspr   SPRN_TBU40,r8   /* update upper 40 bits */
> @@ -1907,6 +1908,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S)
>   beq 17f
>   li  r0, 0
>   std r0, VCORE_TB_OFFSET_APPL(r5)
> + std r0, HSTATE_TB_OFFSET(r13)
>   mftbr6  /* current guest timebase */
>   subf 

Re: [PATCH v2 0/5] shoot lazy tlbs

2021-02-04 Thread Nicholas Piggin
I'll ask Andrew to put this in -mm if no objections.

The series now doesn't touch other archs in non-trivial ways, and core code
is functionally not changed much / at all if the option is not selected so
it's actually pretty simple aside from the powerpc change.

Thanks,
Nick

Excerpts from Nicholas Piggin's message of December 14, 2020 4:53 pm:
> This is another rebase, on top of mainline now (don't need the
> asm-generic tree), and without any x86 or membarrier changes.
> This makes the series far smaller and more manageable and
> without the controversial bits.
> 
> Thanks,
> Nick
> 
> Nicholas Piggin (5):
>   lazy tlb: introduce lazy mm refcount helper functions
>   lazy tlb: allow lazy tlb mm switching to be configurable
>   lazy tlb: shoot lazies, a non-refcounting lazy tlb option
>   powerpc: use lazy mm refcount helper functions
>   powerpc/64s: enable MMU_LAZY_TLB_SHOOTDOWN
> 
>  arch/Kconfig | 30 ++
>  arch/arm/mach-rpc/ecard.c|  2 +-
>  arch/powerpc/Kconfig |  1 +
>  arch/powerpc/kernel/smp.c|  2 +-
>  arch/powerpc/mm/book3s64/radix_tlb.c |  4 +-
>  fs/exec.c|  4 +-
>  include/linux/sched/mm.h | 20 +++
>  kernel/cpu.c |  2 +-
>  kernel/exit.c|  2 +-
>  kernel/fork.c| 52 
>  kernel/kthread.c | 11 ++--
>  kernel/sched/core.c  | 88 
>  kernel/sched/sched.h |  4 +-
>  13 files changed, 184 insertions(+), 38 deletions(-)
> 
> -- 
> 2.23.0
> 
> 


Re: [PATCH] mm/memory.c: Remove pte_sw_mkyoung()

2021-02-03 Thread Nicholas Piggin
Excerpts from Andrew Morton's message of February 4, 2021 10:46 am:
> On Wed,  3 Feb 2021 10:19:44 + (UTC) Christophe Leroy 
>  wrote:
> 
>> Commit 83d116c53058 ("mm: fix double page fault on arm64 if PTE_AF
>> is cleared") introduced arch_faults_on_old_pte() helper to identify
>> platforms that don't set page access bit in HW and require a page
>> fault to set it.
>> 
>> Commit 44bf431b47b4 ("mm/memory.c: Add memory read privilege on page
>> fault handling") added pte_sw_mkyoung() which is yet another way to
>> manage platforms that don't set page access bit in HW and require a
>> page fault to set it.
>> 
>> Remove that pte_sw_mkyoung() helper and use the already existing
>> arch_faults_on_old_pte() helper together with pte_mkyoung() instead.
> 
> This conflicts with mm/memory.c changes in linux-next.  In
> do_set_pte().  Please check my efforts:

I wanted to just get rid of it completely --

https://marc.info/?l=linux-mm&m=160860750115163&w=2

Waiting for MIPs to get that patch mentioned merged or nacked but
as yet seems to be no response from maintainers.

https://lore.kernel.org/linux-arch/20201019081257.32127-1-huang...@loongson.cn/

Thanks,
Nick

> 
> --- a/arch/mips/include/asm/pgtable.h~mm-memoryc-remove-pte_sw_mkyoung
> +++ a/arch/mips/include/asm/pgtable.h
> @@ -406,8 +406,6 @@ static inline pte_t pte_mkyoung(pte_t pt
>   return pte;
>  }
>  
> -#define pte_sw_mkyoung   pte_mkyoung
> -
>  #ifdef CONFIG_MIPS_HUGE_TLB_SUPPORT
>  static inline int pte_huge(pte_t pte){ return pte_val(pte) & 
> _PAGE_HUGE; }
>  
> --- a/include/linux/pgtable.h~mm-memoryc-remove-pte_sw_mkyoung
> +++ a/include/linux/pgtable.h
> @@ -424,22 +424,6 @@ static inline void ptep_set_wrprotect(st
>  }
>  #endif
>  
> -/*
> - * On some architectures hardware does not set page access bit when accessing
> - * memory page, it is responsibilty of software setting this bit. It brings
> - * out extra page fault penalty to track page access bit. For optimization 
> page
> - * access bit can be set during all page fault flow on these arches.
> - * To be differentiate with macro pte_mkyoung, this macro is used on 
> platforms
> - * where software maintains page access bit.
> - */
> -#ifndef pte_sw_mkyoung
> -static inline pte_t pte_sw_mkyoung(pte_t pte)
> -{
> - return pte;
> -}
> -#define pte_sw_mkyoung   pte_sw_mkyoung
> -#endif
> -
>  #ifndef pte_savedwrite
>  #define pte_savedwrite pte_write
>  #endif
> --- a/mm/memory.c~mm-memoryc-remove-pte_sw_mkyoung
> +++ a/mm/memory.c
> @@ -2902,7 +2902,8 @@ static vm_fault_t wp_page_copy(struct vm
>   }
>   flush_cache_page(vma, vmf->address, pte_pfn(vmf->orig_pte));
>   entry = mk_pte(new_page, vma->vm_page_prot);
> - entry = pte_sw_mkyoung(entry);
> + if (arch_faults_on_old_pte())
> + entry = pte_mkyoung(entry);
>   entry = maybe_mkwrite(pte_mkdirty(entry), vma);
>  
>   /*
> @@ -3560,7 +3561,8 @@ static vm_fault_t do_anonymous_page(stru
>   __SetPageUptodate(page);
>  
>   entry = mk_pte(page, vma->vm_page_prot);
> - entry = pte_sw_mkyoung(entry);
> + if (arch_faults_on_old_pte())
> + entry = pte_mkyoung(entry);
>   if (vma->vm_flags & VM_WRITE)
>   entry = pte_mkwrite(pte_mkdirty(entry));
>  
> @@ -3745,8 +3747,8 @@ void do_set_pte(struct vm_fault *vmf, st
>  
>   if (prefault && arch_wants_old_prefaulted_pte())
>   entry = pte_mkold(entry);
> - else
> - entry = pte_sw_mkyoung(entry);
> + else if (arch_faults_on_old_pte())
> + entry = pte_mkyoung(entry);
>  
>   if (write)
>   entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> _
> 
> 
> 


Re: [PATCH v12 01/14] ARM: mm: add missing pud_page define to 2-level page tables

2021-02-02 Thread Nicholas Piggin
Excerpts from Russell King - ARM Linux admin's message of February 2, 2021 9:13 
pm:
> On Tue, Feb 02, 2021 at 09:05:02PM +1000, Nicholas Piggin wrote:
>> diff --git a/arch/arm/include/asm/pgtable.h b/arch/arm/include/asm/pgtable.h
>> index c02f24400369..d63a5bb6bd0c 100644
>> --- a/arch/arm/include/asm/pgtable.h
>> +++ b/arch/arm/include/asm/pgtable.h
>> @@ -166,6 +166,9 @@ extern struct page *empty_zero_page;
>>  
>>  extern pgd_t swapper_pg_dir[PTRS_PER_PGD];
>>  
>> +#define pud_page(pud)   pmd_page(__pmd(pud_val(pud)))
>> +#define pud_write(pud)  pmd_write(__pmd(pud_val(pud)))
> 
> As there is no PUD, does it really make sense to return a valid
> struct page (which will be the PTE page) for pud_page(), which is
> several tables above?

There is no PUD on 3-level either, and the pgtable-nopud.h which it uses 
also passes down p4d_page to pud_page, so by convention...

Although in this case at least for my next patch it won't actually use 
pud_page unless it's a leaf entry so maybe it shouldn't get called
anyway.

Thanks,
Nick


[PATCH v12 14/14] powerpc/64s/radix: Enable huge vmalloc mappings

2021-02-02 Thread Nicholas Piggin
This reduces TLB misses by nearly 30x on a `git diff` workload on a
2-node POWER9 (59,800 -> 2,100) and reduces CPU cycles by 0.54%, due
to vfs hashes being allocated with 2MB pages.

Cc: linuxppc-...@lists.ozlabs.org
Acked-by: Michael Ellerman 
Signed-off-by: Nicholas Piggin 
---
 .../admin-guide/kernel-parameters.txt |  2 ++
 arch/powerpc/Kconfig  |  1 +
 arch/powerpc/kernel/module.c  | 21 +++
 3 files changed, 20 insertions(+), 4 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index a10b545c2070..d62df53e5200 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3225,6 +3225,8 @@
 
nohugeiomap [KNL,X86,PPC,ARM64] Disable kernel huge I/O mappings.
 
+   nohugevmalloc   [PPC] Disable kernel huge vmalloc mappings.
+
nosmt   [KNL,S390] Disable symmetric multithreading (SMT).
Equivalent to smt=1.
 
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 107bb4319e0e..781da6829ab7 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -181,6 +181,7 @@ config PPC
select GENERIC_GETTIMEOFDAY
select HAVE_ARCH_AUDITSYSCALL
select HAVE_ARCH_HUGE_VMAP  if PPC_BOOK3S_64 && 
PPC_RADIX_MMU
+   select HAVE_ARCH_HUGE_VMALLOC   if HAVE_ARCH_HUGE_VMAP
select HAVE_ARCH_JUMP_LABEL
select HAVE_ARCH_KASAN  if PPC32 && PPC_PAGE_SHIFT <= 14
select HAVE_ARCH_KASAN_VMALLOC  if PPC32 && PPC_PAGE_SHIFT <= 14
diff --git a/arch/powerpc/kernel/module.c b/arch/powerpc/kernel/module.c
index a211b0253cdb..07026335d24d 100644
--- a/arch/powerpc/kernel/module.c
+++ b/arch/powerpc/kernel/module.c
@@ -87,13 +87,26 @@ int module_finalize(const Elf_Ehdr *hdr,
return 0;
 }
 
-#ifdef MODULES_VADDR
 void *module_alloc(unsigned long size)
 {
+   unsigned long start = VMALLOC_START;
+   unsigned long end = VMALLOC_END;
+
+#ifdef MODULES_VADDR
BUILD_BUG_ON(TASK_SIZE > MODULES_VADDR);
+   start = MODULES_VADDR;
+   end = MODULES_END;
+#endif
+
+   /*
+* Don't do huge page allocations for modules yet until more testing
+* is done. STRICT_MODULE_RWX may require extra work to support this
+* too.
+*/
 
-   return __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END, 
GFP_KERNEL,
-   PAGE_KERNEL_EXEC, VM_FLUSH_RESET_PERMS, 
NUMA_NO_NODE,
+   return __vmalloc_node_range(size, 1, start, end, GFP_KERNEL,
+   PAGE_KERNEL_EXEC,
+   VM_NO_HUGE_VMAP | VM_FLUSH_RESET_PERMS,
+   NUMA_NO_NODE,
__builtin_return_address(0));
 }
-#endif
-- 
2.23.0



[PATCH v12 13/14] mm/vmalloc: Hugepage vmalloc mappings

2021-02-02 Thread Nicholas Piggin
Support huge page vmalloc mappings. Config option HAVE_ARCH_HUGE_VMALLOC
enables support on architectures that define HAVE_ARCH_HUGE_VMAP and
supports PMD sized vmap mappings.

vmalloc will attempt to allocate PMD-sized pages if allocating PMD size
or larger, and fall back to small pages if that was unsuccessful.

Architectures must ensure that any arch specific vmalloc allocations
that require PAGE_SIZE mappings (e.g., module allocations vs strict
module rwx) use the VM_NO_HUGE_VMAP flag to inhibit larger mappings.

This can result in more internal fragmentation and memory overhead for a
given allocation; an option, nohugevmalloc, is added to disable it at boot.
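
For illustration (not part of this patch), an arch-specific caller that
must keep PAGE_SIZE mappings would opt out the same way the powerpc
module_alloc() change later in this series does:

/* sketch: force small-page mappings for an executable allocation */
static void *alloc_small_page_mapped(unsigned long size)
{
	return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END,
				    GFP_KERNEL, PAGE_KERNEL_EXEC,
				    VM_NO_HUGE_VMAP | VM_FLUSH_RESET_PERMS,
				    NUMA_NO_NODE,
				    __builtin_return_address(0));
}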

Signed-off-by: Nicholas Piggin 
---
 arch/Kconfig|  11 ++
 include/linux/vmalloc.h |  21 
 mm/page_alloc.c |   5 +-
 mm/vmalloc.c| 215 +++-
 4 files changed, 205 insertions(+), 47 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 24862d15f3a3..eef170e0c9b8 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -724,6 +724,17 @@ config HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
 config HAVE_ARCH_HUGE_VMAP
bool
 
+#
+#  Archs that select this would be capable of PMD-sized vmaps (i.e.,
+#  arch_vmap_pmd_supported() returns true), and they must make no assumptions
+#  that vmalloc memory is mapped with PAGE_SIZE ptes. The VM_NO_HUGE_VMAP flag
+#  can be used to prohibit arch-specific allocations from using hugepages to
+#  help with this (e.g., modules may require it).
+#
+config HAVE_ARCH_HUGE_VMALLOC
+   depends on HAVE_ARCH_HUGE_VMAP
+   bool
+
 config ARCH_WANT_HUGE_PMD_SHARE
bool
 
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 99ea72d547dc..93270adf5db5 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -25,6 +25,7 @@ struct notifier_block;/* in notifier.h */
 #define VM_NO_GUARD0x0040  /* don't add guard page */
 #define VM_KASAN   0x0080  /* has allocated kasan shadow 
memory */
 #define VM_MAP_PUT_PAGES   0x0100  /* put pages and free array in 
vfree */
+#define VM_NO_HUGE_VMAP0x0200  /* force PAGE_SIZE pte 
mapping */
 
 /*
  * VM_KASAN is used slighly differently depending on CONFIG_KASAN_VMALLOC.
@@ -59,6 +60,9 @@ struct vm_struct {
unsigned long   size;
unsigned long   flags;
struct page **pages;
+#ifdef CONFIG_HAVE_ARCH_HUGE_VMALLOC
+   unsigned intpage_order;
+#endif
unsigned intnr_pages;
phys_addr_t phys_addr;
const void  *caller;
@@ -193,6 +197,22 @@ void free_vm_area(struct vm_struct *area);
 extern struct vm_struct *remove_vm_area(const void *addr);
 extern struct vm_struct *find_vm_area(const void *addr);
 
+static inline bool is_vm_area_hugepages(const void *addr)
+{
+   /*
+* This may not 100% tell if the area is mapped with > PAGE_SIZE
+* page table entries, if for some reason the architecture indicates
+* larger sizes are available but decides not to use them, nothing
+* prevents that. This only indicates the size of the physical page
+* allocated in the vmalloc layer.
+*/
+#ifdef CONFIG_HAVE_ARCH_HUGE_VMALLOC
+   return find_vm_area(addr)->page_order > 0;
+#else
+   return false;
+#endif
+}
+
 #ifdef CONFIG_MMU
 int vmap_range(unsigned long addr, unsigned long end,
phys_addr_t phys_addr, pgprot_t prot,
@@ -210,6 +230,7 @@ static inline void set_vm_flush_reset_perms(void *addr)
if (vm)
vm->flags |= VM_FLUSH_RESET_PERMS;
 }
+
 #else
 static inline int
 map_kernel_range_noflush(unsigned long start, unsigned long size,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 519a60d5b6f7..1116ce45744b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -72,6 +72,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -8240,6 +8241,7 @@ void *__init alloc_large_system_hash(const char 
*tablename,
void *table = NULL;
gfp_t gfp_flags;
bool virt;
+   bool huge;
 
/* allow the kernel cmdline to have a say */
if (!numentries) {
@@ -8307,6 +8309,7 @@ void *__init alloc_large_system_hash(const char 
*tablename,
} else if (get_order(size) >= MAX_ORDER || hashdist) {
table = __vmalloc(size, gfp_flags);
virt = true;
+   huge = is_vm_area_hugepages(table);
} else {
/*
 * If bucketsize is not a power-of-two, we may free
@@ -8323,7 +8326,7 @@ void *__init alloc_large_system_hash(const char 
*tablename,
 
pr_info("%s hash table entries: %ld (order: %d, %lu bytes, %s)\n",
tablename, 1UL << log2qty, ilog2(size) - PAGE_SHIFT, size,
-  

[PATCH v12 12/14] mm/vmalloc: add vmap_range_noflush variant

2021-02-02 Thread Nicholas Piggin
As a side-effect, the order of the flush_cache_vmap() and
arch_sync_kernel_mappings() calls is switched, but that now matches
the other callers in this file.

Reviewed-by: Christoph Hellwig 
Signed-off-by: Nicholas Piggin 
---
 mm/vmalloc.c | 16 +---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index f043386bb51d..47ab4338cfff 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -240,7 +240,7 @@ static int vmap_p4d_range(pgd_t *pgd, unsigned long addr, 
unsigned long end,
return 0;
 }
 
-int vmap_range(unsigned long addr, unsigned long end,
+static int vmap_range_noflush(unsigned long addr, unsigned long end,
phys_addr_t phys_addr, pgprot_t prot,
unsigned int max_page_shift)
 {
@@ -263,14 +263,24 @@ int vmap_range(unsigned long addr, unsigned long end,
break;
} while (pgd++, phys_addr += (next - addr), addr = next, addr != end);
 
-   flush_cache_vmap(start, end);
-
if (mask & ARCH_PAGE_TABLE_SYNC_MASK)
arch_sync_kernel_mappings(start, end);
 
return err;
 }
 
+int vmap_range(unsigned long addr, unsigned long end,
+   phys_addr_t phys_addr, pgprot_t prot,
+   unsigned int max_page_shift)
+{
+   int err;
+
+   err = vmap_range_noflush(addr, end, phys_addr, prot, max_page_shift);
+   flush_cache_vmap(addr, end);
+
+   return err;
+}
+
 static void vunmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 pgtbl_mod_mask *mask)
 {
-- 
2.23.0



[PATCH v12 11/14] mm: Move vmap_range from mm/ioremap.c to mm/vmalloc.c

2021-02-02 Thread Nicholas Piggin
This is a generic kernel virtual memory mapper, not specific to ioremap.

Code is unchanged other than making vmap_range non-static.

Reviewed-by: Christoph Hellwig 
Signed-off-by: Nicholas Piggin 
---
 include/linux/vmalloc.h |   3 +
 mm/ioremap.c| 203 
 mm/vmalloc.c| 202 +++
 3 files changed, 205 insertions(+), 203 deletions(-)

diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 9f7b8b00101b..99ea72d547dc 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -194,6 +194,9 @@ extern struct vm_struct *remove_vm_area(const void *addr);
 extern struct vm_struct *find_vm_area(const void *addr);
 
 #ifdef CONFIG_MMU
+int vmap_range(unsigned long addr, unsigned long end,
+   phys_addr_t phys_addr, pgprot_t prot,
+   unsigned int max_page_shift);
 extern int map_kernel_range_noflush(unsigned long start, unsigned long size,
pgprot_t prot, struct page **pages);
 int map_kernel_range(unsigned long start, unsigned long size, pgprot_t prot,
diff --git a/mm/ioremap.c b/mm/ioremap.c
index 3264d0203785..d1dcc7e744ac 100644
--- a/mm/ioremap.c
+++ b/mm/ioremap.c
@@ -28,209 +28,6 @@ early_param("nohugeiomap", set_nohugeiomap);
 static const bool iomap_max_page_shift = PAGE_SHIFT;
 #endif /* CONFIG_HAVE_ARCH_HUGE_VMAP */
 
-static int vmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
-   phys_addr_t phys_addr, pgprot_t prot,
-   pgtbl_mod_mask *mask)
-{
-   pte_t *pte;
-   u64 pfn;
-
-   pfn = phys_addr >> PAGE_SHIFT;
-   pte = pte_alloc_kernel_track(pmd, addr, mask);
-   if (!pte)
-   return -ENOMEM;
-   do {
-   BUG_ON(!pte_none(*pte));
-   set_pte_at(&init_mm, addr, pte, pfn_pte(pfn, prot));
-   pfn++;
-   } while (pte++, addr += PAGE_SIZE, addr != end);
-   *mask |= PGTBL_PTE_MODIFIED;
-   return 0;
-}
-
-static int vmap_try_huge_pmd(pmd_t *pmd, unsigned long addr, unsigned long end,
-   phys_addr_t phys_addr, pgprot_t prot,
-   unsigned int max_page_shift)
-{
-   if (max_page_shift < PMD_SHIFT)
-   return 0;
-
-   if (!arch_vmap_pmd_supported(prot))
-   return 0;
-
-   if ((end - addr) != PMD_SIZE)
-   return 0;
-
-   if (!IS_ALIGNED(addr, PMD_SIZE))
-   return 0;
-
-   if (!IS_ALIGNED(phys_addr, PMD_SIZE))
-   return 0;
-
-   if (pmd_present(*pmd) && !pmd_free_pte_page(pmd, addr))
-   return 0;
-
-   return pmd_set_huge(pmd, phys_addr, prot);
-}
-
-static int vmap_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
-   phys_addr_t phys_addr, pgprot_t prot,
-   unsigned int max_page_shift, pgtbl_mod_mask *mask)
-{
-   pmd_t *pmd;
-   unsigned long next;
-
-   pmd = pmd_alloc_track(&init_mm, pud, addr, mask);
-   if (!pmd)
-   return -ENOMEM;
-   do {
-   next = pmd_addr_end(addr, end);
-
-   if (vmap_try_huge_pmd(pmd, addr, next, phys_addr, prot,
-   max_page_shift)) {
-   *mask |= PGTBL_PMD_MODIFIED;
-   continue;
-   }
-
-   if (vmap_pte_range(pmd, addr, next, phys_addr, prot, mask))
-   return -ENOMEM;
-   } while (pmd++, phys_addr += (next - addr), addr = next, addr != end);
-   return 0;
-}
-
-static int vmap_try_huge_pud(pud_t *pud, unsigned long addr, unsigned long end,
-   phys_addr_t phys_addr, pgprot_t prot,
-   unsigned int max_page_shift)
-{
-   if (max_page_shift < PUD_SHIFT)
-   return 0;
-
-   if (!arch_vmap_pud_supported(prot))
-   return 0;
-
-   if ((end - addr) != PUD_SIZE)
-   return 0;
-
-   if (!IS_ALIGNED(addr, PUD_SIZE))
-   return 0;
-
-   if (!IS_ALIGNED(phys_addr, PUD_SIZE))
-   return 0;
-
-   if (pud_present(*pud) && !pud_free_pmd_page(pud, addr))
-   return 0;
-
-   return pud_set_huge(pud, phys_addr, prot);
-}
-
-static int vmap_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
-   phys_addr_t phys_addr, pgprot_t prot,
-   unsigned int max_page_shift, pgtbl_mod_mask *mask)
-{
-   pud_t *pud;
-   unsigned long next;
-
-   pud = pud_alloc_track(&init_mm, p4d, addr, mask);
-   if (!pud)
-   return -ENOMEM;
-   do {
-   next = pud_addr_end(addr, end);
-
-   if (vmap_try_huge_pud(pud, addr, next, phys_addr, prot,
-   max_page_shift)) {
-   *mask |= PGTBL_PUD_MODI

[PATCH v12 10/14] mm/vmalloc: provide fallback arch huge vmap support functions

2021-02-02 Thread Nicholas Piggin
If an architecture doesn't support a particular page table level as
a huge vmap page size then allow it to skip defining the support
query function.
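
For example (illustrative, not from this patch), an arch that only
supports PMD-level huge vmaps overrides just that query and leaves the
pud/p4d ones to the generic "return false" fallbacks:

#define arch_vmap_pmd_supported arch_vmap_pmd_supported
static inline bool arch_vmap_pmd_supported(pgprot_t prot)
{
	return true;	/* or whatever condition the arch really needs */
}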

Suggested-by: Christoph Hellwig 
Signed-off-by: Nicholas Piggin 
---
 arch/arm64/include/asm/vmalloc.h   |  7 +++
 arch/powerpc/include/asm/vmalloc.h |  7 +++
 arch/x86/include/asm/vmalloc.h | 13 +
 include/linux/vmalloc.h| 24 
 4 files changed, 31 insertions(+), 20 deletions(-)

diff --git a/arch/arm64/include/asm/vmalloc.h b/arch/arm64/include/asm/vmalloc.h
index fc9a12d6cc1a..7a22aeea9bb5 100644
--- a/arch/arm64/include/asm/vmalloc.h
+++ b/arch/arm64/include/asm/vmalloc.h
@@ -4,11 +4,8 @@
 #include 
 
 #ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
-static inline bool arch_vmap_p4d_supported(pgprot_t prot)
-{
-   return false;
-}
 
+#define arch_vmap_pud_supported arch_vmap_pud_supported
 static inline bool arch_vmap_pud_supported(pgprot_t prot)
 {
/*
@@ -19,11 +16,13 @@ static inline bool arch_vmap_pud_supported(pgprot_t prot)
   !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS);
 }
 
+#define arch_vmap_pmd_supported arch_vmap_pmd_supported
 static inline bool arch_vmap_pmd_supported(pgprot_t prot)
 {
/* See arch_vmap_pud_supported() */
return !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS);
 }
+
 #endif
 
 #endif /* _ASM_ARM64_VMALLOC_H */
diff --git a/arch/powerpc/include/asm/vmalloc.h 
b/arch/powerpc/include/asm/vmalloc.h
index 3f0c153befb0..4c69ece52a31 100644
--- a/arch/powerpc/include/asm/vmalloc.h
+++ b/arch/powerpc/include/asm/vmalloc.h
@@ -5,21 +5,20 @@
 #include 
 
 #ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
-static inline bool arch_vmap_p4d_supported(pgprot_t prot)
-{
-   return false;
-}
 
+#define arch_vmap_pud_supported arch_vmap_pud_supported
 static inline bool arch_vmap_pud_supported(pgprot_t prot)
 {
/* HPT does not cope with large pages in the vmalloc area */
return radix_enabled();
 }
 
+#define arch_vmap_pmd_supported arch_vmap_pmd_supported
 static inline bool arch_vmap_pmd_supported(pgprot_t prot)
 {
return radix_enabled();
 }
+
 #endif
 
 #endif /* _ASM_POWERPC_VMALLOC_H */
diff --git a/arch/x86/include/asm/vmalloc.h b/arch/x86/include/asm/vmalloc.h
index e714b00fc0ca..49ce331f3ac6 100644
--- a/arch/x86/include/asm/vmalloc.h
+++ b/arch/x86/include/asm/vmalloc.h
@@ -6,24 +6,21 @@
 #include 
 
 #ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
-static inline bool arch_vmap_p4d_supported(pgprot_t prot)
-{
-   return false;
-}
 
+#ifdef CONFIG_X86_64
+#define arch_vmap_pud_supported arch_vmap_pud_supported
 static inline bool arch_vmap_pud_supported(pgprot_t prot)
 {
-#ifdef CONFIG_X86_64
return boot_cpu_has(X86_FEATURE_GBPAGES);
-#else
-   return false;
-#endif
 }
+#endif
 
+#define arch_vmap_pmd_supported arch_vmap_pmd_supported
 static inline bool arch_vmap_pmd_supported(pgprot_t prot)
 {
return boot_cpu_has(X86_FEATURE_PSE);
 }
+
 #endif
 
 #endif /* _ASM_X86_VMALLOC_H */
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 00bd62bd701e..9f7b8b00101b 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -83,10 +83,26 @@ struct vmap_area {
};
 };
 
-#ifndef CONFIG_HAVE_ARCH_HUGE_VMAP
-static inline bool arch_vmap_p4d_supported(pgprot_t prot) { return false; }
-static inline bool arch_vmap_pud_supported(pgprot_t prot) { return false; }
-static inline bool arch_vmap_pmd_supported(pgprot_t prot) { return false; }
+/* archs that select HAVE_ARCH_HUGE_VMAP should override one or more of these 
*/
+#ifndef arch_vmap_p4d_supported
+static inline bool arch_vmap_p4d_supported(pgprot_t prot)
+{
+   return false;
+}
+#endif
+
+#ifndef arch_vmap_pud_supported
+static inline bool arch_vmap_pud_supported(pgprot_t prot)
+{
+   return false;
+}
+#endif
+
+#ifndef arch_vmap_pmd_supported
+static inline bool arch_vmap_pmd_supported(pgprot_t prot)
+{
+   return false;
+}
 #endif
 
 /*
-- 
2.23.0



[PATCH v12 09/14] x86: inline huge vmap supported functions

2021-02-02 Thread Nicholas Piggin
This allows unsupported levels to be constant folded away, and so
p4d_free_pud_page can be removed because it's no longer linked to.
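
The folding works the usual way (standalone illustration, not kernel
code): with the query a static inline returning a constant, the caller's
test is dead code and no out-of-line symbol is referenced.

#include <stdbool.h>

static inline bool p4d_huge_supported(void)
{
	return false;	/* constant on this configuration */
}

int try_huge_p4d(void)
{
	if (!p4d_huge_supported())
		return 0;	/* the whole function folds down to this */
	return 1;
}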

Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: x...@kernel.org
Cc: "H. Peter Anvin" 
Signed-off-by: Nicholas Piggin 
---
 arch/x86/include/asm/vmalloc.h | 22 +++---
 arch/x86/mm/ioremap.c  | 21 -
 arch/x86/mm/pgtable.c  | 13 -
 3 files changed, 19 insertions(+), 37 deletions(-)

diff --git a/arch/x86/include/asm/vmalloc.h b/arch/x86/include/asm/vmalloc.h
index 094ea2b565f3..e714b00fc0ca 100644
--- a/arch/x86/include/asm/vmalloc.h
+++ b/arch/x86/include/asm/vmalloc.h
@@ -1,13 +1,29 @@
 #ifndef _ASM_X86_VMALLOC_H
 #define _ASM_X86_VMALLOC_H
 
+#include 
 #include 
 #include 
 
 #ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
-bool arch_vmap_p4d_supported(pgprot_t prot);
-bool arch_vmap_pud_supported(pgprot_t prot);
-bool arch_vmap_pmd_supported(pgprot_t prot);
+static inline bool arch_vmap_p4d_supported(pgprot_t prot)
+{
+   return false;
+}
+
+static inline bool arch_vmap_pud_supported(pgprot_t prot)
+{
+#ifdef CONFIG_X86_64
+   return boot_cpu_has(X86_FEATURE_GBPAGES);
+#else
+   return false;
+#endif
+}
+
+static inline bool arch_vmap_pmd_supported(pgprot_t prot)
+{
+   return boot_cpu_has(X86_FEATURE_PSE);
+}
 #endif
 
 #endif /* _ASM_X86_VMALLOC_H */
diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
index fbaf0c447986..12c686c65ea9 100644
--- a/arch/x86/mm/ioremap.c
+++ b/arch/x86/mm/ioremap.c
@@ -481,27 +481,6 @@ void iounmap(volatile void __iomem *addr)
 }
 EXPORT_SYMBOL(iounmap);
 
-#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
-bool arch_vmap_p4d_supported(pgprot_t prot)
-{
-   return false;
-}
-
-bool arch_vmap_pud_supported(pgprot_t prot)
-{
-#ifdef CONFIG_X86_64
-   return boot_cpu_has(X86_FEATURE_GBPAGES);
-#else
-   return false;
-#endif
-}
-
-bool arch_vmap_pmd_supported(pgprot_t prot)
-{
-   return boot_cpu_has(X86_FEATURE_PSE);
-}
-#endif
-
 /*
  * Convert a physical pointer to a virtual kernel pointer for /dev/mem
  * access
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index f6a9e2e36642..d27cf69e811d 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -780,14 +780,6 @@ int pmd_clear_huge(pmd_t *pmd)
return 0;
 }
 
-/*
- * Until we support 512GB pages, skip them in the vmap area.
- */
-int p4d_free_pud_page(p4d_t *p4d, unsigned long addr)
-{
-   return 0;
-}
-
 #ifdef CONFIG_X86_64
 /**
  * pud_free_pmd_page - Clear pud entry and free pmd page.
@@ -861,11 +853,6 @@ int pmd_free_pte_page(pmd_t *pmd, unsigned long addr)
 
 #else /* !CONFIG_X86_64 */
 
-int pud_free_pmd_page(pud_t *pud, unsigned long addr)
-{
-   return pud_none(*pud);
-}
-
 /*
  * Disable free page handling on x86-PAE. This assures that ioremap()
  * does not update sync'd pmd entries. See vmalloc_sync_one().
-- 
2.23.0



[PATCH v12 07/14] powerpc: inline huge vmap supported functions

2021-02-02 Thread Nicholas Piggin
This allows unsupported levels to be constant folded away, and so
p4d_free_pud_page can be removed because it's no longer linked to.

Cc: linuxppc-...@lists.ozlabs.org
Acked-by: Michael Ellerman 
Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/include/asm/vmalloc.h   | 19 ---
 arch/powerpc/mm/book3s64/radix_pgtable.c | 21 -
 2 files changed, 16 insertions(+), 24 deletions(-)

diff --git a/arch/powerpc/include/asm/vmalloc.h 
b/arch/powerpc/include/asm/vmalloc.h
index 105abb73f075..3f0c153befb0 100644
--- a/arch/powerpc/include/asm/vmalloc.h
+++ b/arch/powerpc/include/asm/vmalloc.h
@@ -1,12 +1,25 @@
 #ifndef _ASM_POWERPC_VMALLOC_H
 #define _ASM_POWERPC_VMALLOC_H
 
+#include 
 #include 
 
 #ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
-bool arch_vmap_p4d_supported(pgprot_t prot);
-bool arch_vmap_pud_supported(pgprot_t prot);
-bool arch_vmap_pmd_supported(pgprot_t prot);
+static inline bool arch_vmap_p4d_supported(pgprot_t prot)
+{
+   return false;
+}
+
+static inline bool arch_vmap_pud_supported(pgprot_t prot)
+{
+   /* HPT does not cope with large pages in the vmalloc area */
+   return radix_enabled();
+}
+
+static inline bool arch_vmap_pmd_supported(pgprot_t prot)
+{
+   return radix_enabled();
+}
 #endif
 
 #endif /* _ASM_POWERPC_VMALLOC_H */
diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c 
b/arch/powerpc/mm/book3s64/radix_pgtable.c
index 743807fc210f..8da62afccee5 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -1082,22 +1082,6 @@ void radix__ptep_modify_prot_commit(struct 
vm_area_struct *vma,
set_pte_at(mm, addr, ptep, pte);
 }
 
-bool arch_vmap_pud_supported(pgprot_t prot)
-{
-   /* HPT does not cope with large pages in the vmalloc area */
-   return radix_enabled();
-}
-
-bool arch_vmap_pmd_supported(pgprot_t prot)
-{
-   return radix_enabled();
-}
-
-int p4d_free_pud_page(p4d_t *p4d, unsigned long addr)
-{
-   return 0;
-}
-
 int pud_set_huge(pud_t *pud, phys_addr_t addr, pgprot_t prot)
 {
pte_t *ptep = (pte_t *)pud;
@@ -1181,8 +1165,3 @@ int pmd_free_pte_page(pmd_t *pmd, unsigned long addr)
 
return 1;
 }
-
-bool arch_vmap_p4d_supported(pgprot_t prot)
-{
-   return false;
-}
-- 
2.23.0



[PATCH v12 08/14] arm64: inline huge vmap supported functions

2021-02-02 Thread Nicholas Piggin
This allows unsupported levels to be constant folded away, and so
p4d_free_pud_page can be removed because it's no longer linked to.

Cc: Catalin Marinas 
Cc: Will Deacon 
Cc: linux-arm-ker...@lists.infradead.org
Acked-by: Catalin Marinas 
Signed-off-by: Nicholas Piggin 
---
 arch/arm64/include/asm/vmalloc.h | 23 ---
 arch/arm64/mm/mmu.c  | 26 --
 2 files changed, 20 insertions(+), 29 deletions(-)

diff --git a/arch/arm64/include/asm/vmalloc.h b/arch/arm64/include/asm/vmalloc.h
index 597b40405319..fc9a12d6cc1a 100644
--- a/arch/arm64/include/asm/vmalloc.h
+++ b/arch/arm64/include/asm/vmalloc.h
@@ -4,9 +4,26 @@
 #include 
 
 #ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
-bool arch_vmap_p4d_supported(pgprot_t prot);
-bool arch_vmap_pud_supported(pgprot_t prot);
-bool arch_vmap_pmd_supported(pgprot_t prot);
+static inline bool arch_vmap_p4d_supported(pgprot_t prot)
+{
+   return false;
+}
+
+static inline bool arch_vmap_pud_supported(pgprot_t prot)
+{
+   /*
+* Only 4k granule supports level 1 block mappings.
+* SW table walks can't handle removal of intermediate entries.
+*/
+   return IS_ENABLED(CONFIG_ARM64_4K_PAGES) &&
+  !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS);
+}
+
+static inline bool arch_vmap_pmd_supported(pgprot_t prot)
+{
+   /* See arch_vmap_pud_supported() */
+   return !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS);
+}
 #endif
 
 #endif /* _ASM_ARM64_VMALLOC_H */
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 1613d290cbd1..ab9ba7c36dae 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -1313,27 +1313,6 @@ void *__init fixmap_remap_fdt(phys_addr_t dt_phys, int 
*size, pgprot_t prot)
return dt_virt;
 }
 
-bool arch_vmap_p4d_supported(pgprot_t prot)
-{
-   return false;
-}
-
-bool arch_vmap_pud_supported(pgprot_t prot)
-{
-   /*
-* Only 4k granule supports level 1 block mappings.
-* SW table walks can't handle removal of intermediate entries.
-*/
-   return IS_ENABLED(CONFIG_ARM64_4K_PAGES) &&
-  !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS);
-}
-
-bool arch_vmap_pmd_supported(pgprot_t prot)
-{
-   /* See arch_vmap_pud_supported() */
-   return !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS);
-}
-
 int pud_set_huge(pud_t *pudp, phys_addr_t phys, pgprot_t prot)
 {
pud_t new_pud = pfn_pud(__phys_to_pfn(phys), mk_pud_sect_prot(prot));
@@ -1425,11 +1404,6 @@ int pud_free_pmd_page(pud_t *pudp, unsigned long addr)
return 1;
 }
 
-int p4d_free_pud_page(p4d_t *p4d, unsigned long addr)
-{
-   return 0;   /* Don't attempt a block mapping */
-}
-
 #ifdef CONFIG_MEMORY_HOTPLUG
 static void __remove_pgd_mapping(pgd_t *pgdir, unsigned long start, u64 size)
 {
-- 
2.23.0



[PATCH v12 05/14] mm/ioremap: rename ioremap_*_range to vmap_*_range

2021-02-02 Thread Nicholas Piggin
This will be used as a generic kernel virtual mapping function, so
re-name it in preparation.

Reviewed-by: Miaohe Lin 
Reviewed-by: Christoph Hellwig 
Signed-off-by: Nicholas Piggin 
---
 mm/ioremap.c | 64 +++-
 1 file changed, 33 insertions(+), 31 deletions(-)

diff --git a/mm/ioremap.c b/mm/ioremap.c
index 5fa1ab41d152..3f4d36f9745a 100644
--- a/mm/ioremap.c
+++ b/mm/ioremap.c
@@ -61,9 +61,9 @@ static inline int ioremap_pud_enabled(void) { return 0; }
 static inline int ioremap_pmd_enabled(void) { return 0; }
 #endif /* CONFIG_HAVE_ARCH_HUGE_VMAP */
 
-static int ioremap_pte_range(pmd_t *pmd, unsigned long addr,
-   unsigned long end, phys_addr_t phys_addr, pgprot_t prot,
-   pgtbl_mod_mask *mask)
+static int vmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
+   phys_addr_t phys_addr, pgprot_t prot,
+   pgtbl_mod_mask *mask)
 {
pte_t *pte;
u64 pfn;
@@ -81,9 +81,8 @@ static int ioremap_pte_range(pmd_t *pmd, unsigned long addr,
return 0;
 }
 
-static int ioremap_try_huge_pmd(pmd_t *pmd, unsigned long addr,
-   unsigned long end, phys_addr_t phys_addr,
-   pgprot_t prot)
+static int vmap_try_huge_pmd(pmd_t *pmd, unsigned long addr, unsigned long end,
+   phys_addr_t phys_addr, pgprot_t prot)
 {
if (!ioremap_pmd_enabled())
return 0;
@@ -103,9 +102,9 @@ static int ioremap_try_huge_pmd(pmd_t *pmd, unsigned long 
addr,
return pmd_set_huge(pmd, phys_addr, prot);
 }
 
-static inline int ioremap_pmd_range(pud_t *pud, unsigned long addr,
-   unsigned long end, phys_addr_t phys_addr, pgprot_t prot,
-   pgtbl_mod_mask *mask)
+static int vmap_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
+   phys_addr_t phys_addr, pgprot_t prot,
+   pgtbl_mod_mask *mask)
 {
pmd_t *pmd;
unsigned long next;
@@ -116,20 +115,19 @@ static inline int ioremap_pmd_range(pud_t *pud, unsigned 
long addr,
do {
next = pmd_addr_end(addr, end);
 
-   if (ioremap_try_huge_pmd(pmd, addr, next, phys_addr, prot)) {
+   if (vmap_try_huge_pmd(pmd, addr, next, phys_addr, prot)) {
*mask |= PGTBL_PMD_MODIFIED;
continue;
}
 
-   if (ioremap_pte_range(pmd, addr, next, phys_addr, prot, mask))
+   if (vmap_pte_range(pmd, addr, next, phys_addr, prot, mask))
return -ENOMEM;
} while (pmd++, phys_addr += (next - addr), addr = next, addr != end);
return 0;
 }
 
-static int ioremap_try_huge_pud(pud_t *pud, unsigned long addr,
-   unsigned long end, phys_addr_t phys_addr,
-   pgprot_t prot)
+static int vmap_try_huge_pud(pud_t *pud, unsigned long addr, unsigned long end,
+   phys_addr_t phys_addr, pgprot_t prot)
 {
if (!ioremap_pud_enabled())
return 0;
@@ -149,9 +147,9 @@ static int ioremap_try_huge_pud(pud_t *pud, unsigned long 
addr,
return pud_set_huge(pud, phys_addr, prot);
 }
 
-static inline int ioremap_pud_range(p4d_t *p4d, unsigned long addr,
-   unsigned long end, phys_addr_t phys_addr, pgprot_t prot,
-   pgtbl_mod_mask *mask)
+static int vmap_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
+   phys_addr_t phys_addr, pgprot_t prot,
+   pgtbl_mod_mask *mask)
 {
pud_t *pud;
unsigned long next;
@@ -162,20 +160,19 @@ static inline int ioremap_pud_range(p4d_t *p4d, unsigned 
long addr,
do {
next = pud_addr_end(addr, end);
 
-   if (ioremap_try_huge_pud(pud, addr, next, phys_addr, prot)) {
+   if (vmap_try_huge_pud(pud, addr, next, phys_addr, prot)) {
*mask |= PGTBL_PUD_MODIFIED;
continue;
}
 
-   if (ioremap_pmd_range(pud, addr, next, phys_addr, prot, mask))
+   if (vmap_pmd_range(pud, addr, next, phys_addr, prot, mask))
return -ENOMEM;
} while (pud++, phys_addr += (next - addr), addr = next, addr != end);
return 0;
 }
 
-static int ioremap_try_huge_p4d(p4d_t *p4d, unsigned long addr,
-   unsigned long end, phys_addr_t phys_addr,
-   pgprot_t prot)
+static int vmap_try_huge_p4d(p4d_t *p4d, unsigned long addr, unsigned long end,
+   phys_addr_t phys_addr, pgprot_t prot)
 {
if (!ioremap_p4d_enabled())
return 0;
@@ -195,9 +192,9 @@ static int ioremap_try_huge_p4d(p4d_t *p4d, unsigned long 
addr,
return p4d_set_huge(p4d, phys_addr, prot);
 }
 
-static inline int

[PATCH v12 06/14] mm: HUGE_VMAP arch support cleanup

2021-02-02 Thread Nicholas Piggin
This changes the awkward approach where architectures provide init
functions to determine which levels they can provide large mappings for,
to one where the arch is queried for each call.

This removes code and indirection, and allows constant-folding of dead
code for unsupported levels.

This also adds a prot argument to the arch query. This is unused
currently but could help with some architectures (e.g., some powerpc
processors can't map uncacheable memory with large pages).
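
For illustration only (hypothetical, not in this patch -- the flag name
below is made up), such an arch could then do something like:

static inline bool arch_vmap_pmd_supported(pgprot_t prot)
{
	/* refuse huge mappings for uncacheable memory (illustrative flag) */
	if (pgprot_val(prot) & _PAGE_NO_CACHE)
		return false;
	return radix_enabled();
}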

Cc: linuxppc-...@lists.ozlabs.org
Cc: Catalin Marinas 
Cc: Will Deacon 
Cc: linux-arm-ker...@lists.infradead.org
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: x...@kernel.org
Cc: "H. Peter Anvin" 
Reviewed-by: Ding Tianhong 
Acked-by: Catalin Marinas  [arm64]
Signed-off-by: Nicholas Piggin 
---
 arch/arm64/include/asm/vmalloc.h |  8 ++
 arch/arm64/mm/mmu.c  | 10 +--
 arch/powerpc/include/asm/vmalloc.h   |  8 ++
 arch/powerpc/mm/book3s64/radix_pgtable.c |  8 +-
 arch/x86/include/asm/vmalloc.h   |  7 ++
 arch/x86/mm/ioremap.c| 12 +--
 include/linux/io.h   |  9 ---
 include/linux/vmalloc.h  |  6 ++
 init/main.c  |  1 -
 mm/debug_vm_pgtable.c|  4 +-
 mm/ioremap.c | 94 ++--
 11 files changed, 87 insertions(+), 80 deletions(-)

diff --git a/arch/arm64/include/asm/vmalloc.h b/arch/arm64/include/asm/vmalloc.h
index 2ca708ab9b20..597b40405319 100644
--- a/arch/arm64/include/asm/vmalloc.h
+++ b/arch/arm64/include/asm/vmalloc.h
@@ -1,4 +1,12 @@
 #ifndef _ASM_ARM64_VMALLOC_H
 #define _ASM_ARM64_VMALLOC_H
 
+#include 
+
+#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
+bool arch_vmap_p4d_supported(pgprot_t prot);
+bool arch_vmap_pud_supported(pgprot_t prot);
+bool arch_vmap_pmd_supported(pgprot_t prot);
+#endif
+
 #endif /* _ASM_ARM64_VMALLOC_H */
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index ae0c3d023824..1613d290cbd1 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -1313,12 +1313,12 @@ void *__init fixmap_remap_fdt(phys_addr_t dt_phys, int 
*size, pgprot_t prot)
return dt_virt;
 }
 
-int __init arch_ioremap_p4d_supported(void)
+bool arch_vmap_p4d_supported(pgprot_t prot)
 {
-   return 0;
+   return false;
 }
 
-int __init arch_ioremap_pud_supported(void)
+bool arch_vmap_pud_supported(pgprot_t prot)
 {
/*
 * Only 4k granule supports level 1 block mappings.
@@ -1328,9 +1328,9 @@ int __init arch_ioremap_pud_supported(void)
   !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS);
 }
 
-int __init arch_ioremap_pmd_supported(void)
+bool arch_vmap_pmd_supported(pgprot_t prot)
 {
-   /* See arch_ioremap_pud_supported() */
+   /* See arch_vmap_pud_supported() */
return !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS);
 }
 
diff --git a/arch/powerpc/include/asm/vmalloc.h 
b/arch/powerpc/include/asm/vmalloc.h
index b992dfaaa161..105abb73f075 100644
--- a/arch/powerpc/include/asm/vmalloc.h
+++ b/arch/powerpc/include/asm/vmalloc.h
@@ -1,4 +1,12 @@
 #ifndef _ASM_POWERPC_VMALLOC_H
 #define _ASM_POWERPC_VMALLOC_H
 
+#include 
+
+#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
+bool arch_vmap_p4d_supported(pgprot_t prot);
+bool arch_vmap_pud_supported(pgprot_t prot);
+bool arch_vmap_pmd_supported(pgprot_t prot);
+#endif
+
 #endif /* _ASM_POWERPC_VMALLOC_H */
diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c 
b/arch/powerpc/mm/book3s64/radix_pgtable.c
index 98f0b243c1ab..743807fc210f 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -1082,13 +1082,13 @@ void radix__ptep_modify_prot_commit(struct 
vm_area_struct *vma,
set_pte_at(mm, addr, ptep, pte);
 }
 
-int __init arch_ioremap_pud_supported(void)
+bool arch_vmap_pud_supported(pgprot_t prot)
 {
/* HPT does not cope with large pages in the vmalloc area */
return radix_enabled();
 }
 
-int __init arch_ioremap_pmd_supported(void)
+bool arch_vmap_pmd_supported(pgprot_t prot)
 {
return radix_enabled();
 }
@@ -1182,7 +1182,7 @@ int pmd_free_pte_page(pmd_t *pmd, unsigned long addr)
return 1;
 }
 
-int __init arch_ioremap_p4d_supported(void)
+bool arch_vmap_p4d_supported(pgprot_t prot)
 {
-   return 0;
+   return false;
 }
diff --git a/arch/x86/include/asm/vmalloc.h b/arch/x86/include/asm/vmalloc.h
index 29837740b520..094ea2b565f3 100644
--- a/arch/x86/include/asm/vmalloc.h
+++ b/arch/x86/include/asm/vmalloc.h
@@ -1,6 +1,13 @@
 #ifndef _ASM_X86_VMALLOC_H
 #define _ASM_X86_VMALLOC_H
 
+#include 
 #include 
 
+#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
+bool arch_vmap_p4d_supported(pgprot_t prot);
+bool arch_vmap_pud_supported(pgprot_t prot);
+bool arch_vmap_pmd_supported(pgprot_t prot);
+#endif
+
 #endif /* _ASM_X86_VMALLOC_H */
diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
index 9e5ccc56f8e0..fbaf0c447986 100644
--- a/arch/x86/mm/ioremap.c
+++ b/arch/x86/mm

[PATCH v12 04/14] mm/vmalloc: rename vmap_*_range vmap_pages_*_range

2021-02-02 Thread Nicholas Piggin
The vmalloc mapper operates on a struct page * array rather than a
linear physical address, re-name it to make this distinction clear.

Reviewed-by: Miaohe Lin 
Reviewed-by: Christoph Hellwig 
Signed-off-by: Nicholas Piggin 
---
 mm/vmalloc.c | 16 
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 62372f9e0167..7f2f36116980 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -189,7 +189,7 @@ void unmap_kernel_range_noflush(unsigned long start, 
unsigned long size)
arch_sync_kernel_mappings(start, end);
 }
 
-static int vmap_pte_range(pmd_t *pmd, unsigned long addr,
+static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr,
unsigned long end, pgprot_t prot, struct page **pages, int *nr,
pgtbl_mod_mask *mask)
 {
@@ -217,7 +217,7 @@ static int vmap_pte_range(pmd_t *pmd, unsigned long addr,
return 0;
 }
 
-static int vmap_pmd_range(pud_t *pud, unsigned long addr,
+static int vmap_pages_pmd_range(pud_t *pud, unsigned long addr,
unsigned long end, pgprot_t prot, struct page **pages, int *nr,
pgtbl_mod_mask *mask)
 {
@@ -229,13 +229,13 @@ static int vmap_pmd_range(pud_t *pud, unsigned long addr,
return -ENOMEM;
do {
next = pmd_addr_end(addr, end);
-   if (vmap_pte_range(pmd, addr, next, prot, pages, nr, mask))
+   if (vmap_pages_pte_range(pmd, addr, next, prot, pages, nr, 
mask))
return -ENOMEM;
} while (pmd++, addr = next, addr != end);
return 0;
 }
 
-static int vmap_pud_range(p4d_t *p4d, unsigned long addr,
+static int vmap_pages_pud_range(p4d_t *p4d, unsigned long addr,
unsigned long end, pgprot_t prot, struct page **pages, int *nr,
pgtbl_mod_mask *mask)
 {
@@ -247,13 +247,13 @@ static int vmap_pud_range(p4d_t *p4d, unsigned long addr,
return -ENOMEM;
do {
next = pud_addr_end(addr, end);
-   if (vmap_pmd_range(pud, addr, next, prot, pages, nr, mask))
+   if (vmap_pages_pmd_range(pud, addr, next, prot, pages, nr, 
mask))
return -ENOMEM;
} while (pud++, addr = next, addr != end);
return 0;
 }
 
-static int vmap_p4d_range(pgd_t *pgd, unsigned long addr,
+static int vmap_pages_p4d_range(pgd_t *pgd, unsigned long addr,
unsigned long end, pgprot_t prot, struct page **pages, int *nr,
pgtbl_mod_mask *mask)
 {
@@ -265,7 +265,7 @@ static int vmap_p4d_range(pgd_t *pgd, unsigned long addr,
return -ENOMEM;
do {
next = p4d_addr_end(addr, end);
-   if (vmap_pud_range(p4d, addr, next, prot, pages, nr, mask))
+   if (vmap_pages_pud_range(p4d, addr, next, prot, pages, nr, mask))
return -ENOMEM;
} while (p4d++, addr = next, addr != end);
return 0;
@@ -306,7 +306,7 @@ int map_kernel_range_noflush(unsigned long addr, unsigned long size,
next = pgd_addr_end(addr, end);
if (pgd_bad(*pgd))
mask |= PGTBL_PGD_MODIFIED;
-   err = vmap_p4d_range(pgd, addr, next, prot, pages, &nr, &mask);
+   err = vmap_pages_p4d_range(pgd, addr, next, prot, pages, &nr, &mask);
if (err)
return err;
} while (pgd++, addr = next, addr != end);
-- 
2.23.0



[PATCH v12 03/14] mm: apply_to_pte_range warn and fail if a large pte is encountered

2021-02-02 Thread Nicholas Piggin
apply_to_pte_range might mistake a large pte for bad, or treat it as a
page table, resulting in a crash or corruption. Add a test to warn and
return error if large entries are found.

Reviewed-by: Miaohe Lin 
Reviewed-by: Christoph Hellwig 
Signed-off-by: Nicholas Piggin 
---
 mm/memory.c | 66 +++--
 1 file changed, 49 insertions(+), 17 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index feff48e1465a..672e39a72788 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2440,13 +2440,21 @@ static int apply_to_pmd_range(struct mm_struct *mm, pud_t *pud,
}
do {
next = pmd_addr_end(addr, end);
-   if (create || !pmd_none_or_clear_bad(pmd)) {
-   err = apply_to_pte_range(mm, pmd, addr, next, fn, data,
-create, mask);
-   if (err)
-   break;
+   if (pmd_none(*pmd) && !create)
+   continue;
+   if (WARN_ON_ONCE(pmd_leaf(*pmd)))
+   return -EINVAL;
+   if (!pmd_none(*pmd) && WARN_ON_ONCE(pmd_bad(*pmd))) {
+   if (!create)
+   continue;
+   pmd_clear_bad(pmd);
}
+   err = apply_to_pte_range(mm, pmd, addr, next,
+fn, data, create, mask);
+   if (err)
+   break;
} while (pmd++, addr = next, addr != end);
+
return err;
 }
 
@@ -2468,13 +2476,21 @@ static int apply_to_pud_range(struct mm_struct *mm, p4d_t *p4d,
}
do {
next = pud_addr_end(addr, end);
-   if (create || !pud_none_or_clear_bad(pud)) {
-   err = apply_to_pmd_range(mm, pud, addr, next, fn, data,
-create, mask);
-   if (err)
-   break;
+   if (pud_none(*pud) && !create)
+   continue;
+   if (WARN_ON_ONCE(pud_leaf(*pud)))
+   return -EINVAL;
+   if (!pud_none(*pud) && WARN_ON_ONCE(pud_bad(*pud))) {
+   if (!create)
+   continue;
+   pud_clear_bad(pud);
}
+   err = apply_to_pmd_range(mm, pud, addr, next,
+fn, data, create, mask);
+   if (err)
+   break;
} while (pud++, addr = next, addr != end);
+
return err;
 }
 
@@ -2496,13 +2512,21 @@ static int apply_to_p4d_range(struct mm_struct *mm, pgd_t *pgd,
}
do {
next = p4d_addr_end(addr, end);
-   if (create || !p4d_none_or_clear_bad(p4d)) {
-   err = apply_to_pud_range(mm, p4d, addr, next, fn, data,
-create, mask);
-   if (err)
-   break;
+   if (p4d_none(*p4d) && !create)
+   continue;
+   if (WARN_ON_ONCE(p4d_leaf(*p4d)))
+   return -EINVAL;
+   if (!p4d_none(*p4d) && WARN_ON_ONCE(p4d_bad(*p4d))) {
+   if (!create)
+   continue;
+   p4d_clear_bad(p4d);
}
+   err = apply_to_pud_range(mm, p4d, addr, next,
+fn, data, create, mask);
+   if (err)
+   break;
} while (p4d++, addr = next, addr != end);
+
return err;
 }
 
@@ -2522,9 +2546,17 @@ static int __apply_to_page_range(struct mm_struct *mm, unsigned long addr,
pgd = pgd_offset(mm, addr);
do {
next = pgd_addr_end(addr, end);
-   if (!create && pgd_none_or_clear_bad(pgd))
+   if (pgd_none(*pgd) && !create)
continue;
-   err = apply_to_p4d_range(mm, pgd, addr, next, fn, data, create, &mask);
+   if (WARN_ON_ONCE(pgd_leaf(*pgd)))
+   return -EINVAL;
+   if (!pgd_none(*pgd) && WARN_ON_ONCE(pgd_bad(*pgd))) {
+   if (!create)
+   continue;
+   pgd_clear_bad(pgd);
+   }
+   err = apply_to_p4d_range(mm, pgd, addr, next,
+fn, data, create, &mask);
if (err)
break;
} while (pgd++, addr = next, addr != end);
-- 
2.23.0



[PATCH v12 02/14] mm/vmalloc: fix HUGE_VMAP regression by enabling huge pages in vmalloc_to_page

2021-02-02 Thread Nicholas Piggin
vmalloc_to_page returns NULL for addresses mapped by larger pages[*].
Whether or not a vmap is huge depends on the architecture details,
alignments, boot options, etc., which the caller can not be expected
to know. Therefore HUGE_VMAP is a regression for vmalloc_to_page.

This change teaches vmalloc_to_page about larger pages, and returns
the struct page that corresponds to the offset within the large page.
This makes the API agnostic to mapping implementation details.

[*] As explained by commit 029c54b095995 ("mm/vmalloc.c: huge-vmap:
fail gracefully on unexpected huge vmap mappings")
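
For illustration (not part of the change itself), a caller stepping
through a huge vmap in PAGE_SIZE units now gets back exactly what a
small-page mapping would give it:

	/* assume addr lies inside an area mapped with one PMD-size page */
	struct page *p0 = vmalloc_to_page(addr);
	struct page *p1 = vmalloc_to_page(addr + PAGE_SIZE);

	/* p1 == p0 + 1: the tail page for the second base page of the
	 * huge mapping, the same as a small-page mapping would return */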

Reviewed-by: Miaohe Lin 
Reviewed-by: Christoph Hellwig 
Signed-off-by: Nicholas Piggin 
---
 mm/vmalloc.c | 41 ++---
 1 file changed, 26 insertions(+), 15 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index e6f352bf0498..62372f9e0167 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -34,7 +34,7 @@
 #include 
 #include 
 #include 
-
+#include 
 #include 
 #include 
 #include 
@@ -343,7 +343,9 @@ int is_vmalloc_or_module_addr(const void *x)
 }
 
 /*
- * Walk a vmap address to the struct page it maps.
+ * Walk a vmap address to the struct page it maps. Huge vmap mappings will
+ * return the tail page that corresponds to the base page address, which
+ * matches small vmap mappings.
  */
 struct page *vmalloc_to_page(const void *vmalloc_addr)
 {
@@ -363,25 +365,33 @@ struct page *vmalloc_to_page(const void *vmalloc_addr)
 
if (pgd_none(*pgd))
return NULL;
+   if (WARN_ON_ONCE(pgd_leaf(*pgd)))
+   return NULL; /* XXX: no allowance for huge pgd */
+   if (WARN_ON_ONCE(pgd_bad(*pgd)))
+   return NULL;
+
p4d = p4d_offset(pgd, addr);
if (p4d_none(*p4d))
return NULL;
-   pud = pud_offset(p4d, addr);
+   if (p4d_leaf(*p4d))
+   return p4d_page(*p4d) + ((addr & ~P4D_MASK) >> PAGE_SHIFT);
+   if (WARN_ON_ONCE(p4d_bad(*p4d)))
+   return NULL;
 
-   /*
-* Don't dereference bad PUD or PMD (below) entries. This will also
-* identify huge mappings, which we may encounter on architectures
-* that define CONFIG_HAVE_ARCH_HUGE_VMAP=y. Such regions will be
-* identified as vmalloc addresses by is_vmalloc_addr(), but are
-* not [unambiguously] associated with a struct page, so there is
-* no correct value to return for them.
-*/
-   WARN_ON_ONCE(pud_bad(*pud));
-   if (pud_none(*pud) || pud_bad(*pud))
+   pud = pud_offset(p4d, addr);
+   if (pud_none(*pud))
+   return NULL;
+   if (pud_leaf(*pud))
+   return pud_page(*pud) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
+   if (WARN_ON_ONCE(pud_bad(*pud)))
return NULL;
+
pmd = pmd_offset(pud, addr);
-   WARN_ON_ONCE(pmd_bad(*pmd));
-   if (pmd_none(*pmd) || pmd_bad(*pmd))
+   if (pmd_none(*pmd))
+   return NULL;
+   if (pmd_leaf(*pmd))
+   return pmd_page(*pmd) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
+   if (WARN_ON_ONCE(pmd_bad(*pmd)))
return NULL;
 
ptep = pte_offset_map(pmd, addr);
@@ -389,6 +399,7 @@ struct page *vmalloc_to_page(const void *vmalloc_addr)
if (pte_present(pte))
page = pte_page(pte);
pte_unmap(ptep);
+
return page;
 }
 EXPORT_SYMBOL(vmalloc_to_page);
-- 
2.23.0



[PATCH v12 00/14] huge vmalloc mappings

2021-02-02 Thread Nicholas Piggin
Should be getting close now. No doubt there will be a few more
things but I can do incremental fixes for small things if this gets
into -mm.

Thanks,
Nick

Since v11:
- ARM compile fix (patch 1)
- debug_vm_pgtable compile fix

Since v10:
- Fixed code style, most > 80 columns, tweak patch titles, etc [thanks Christoph]
- Made huge vmalloc code and data structure compile away if unselected
  [Christoph]
- Archs only have to provide arch_vmap_p?d_supported for levels they
  implement [Christoph]

Since v9:
- Fixed intermediate build breakage on x86-32 !PAE [thanks Ding]
- Fixed small page fallback case vm_struct double-free [thanks Ding]

Since v8:
- Fixed nommu compile.
- Added Kconfig option help text
- Added VM_NOHUGE which should help archs implement it [suggested by Rick]

Since v7:
- Rebase, added some acks, compile fix
- Removed "order=" from vmallocinfo, it's a bit confusing (nr_pages
  is in small page size for compatibility).
- Added arch_vmap_pmd_supported() test before starting to allocate
  the large page, rather than only testing it when doing the map, to
  avoid unsupported configs trying to allocate huge pages for no
  reason.

Since v6:
- Fixed a false positive warning introduced in patch 2, found by
  kbuild test robot.

Since v5:
- Split arch changes out better and make the constant folding work
- Avoid most of the 80 column wrap, fix a reference to lib/ioremap.c
- Fix compile error on some archs

Since v4:
- Fixed an off-by-page-order bug in v4
- Several minor cleanups.
- Added page order to /proc/vmallocinfo
- Added hugepage to alloc_large_system_hash output.
- Made an architecture config option, powerpc only for now.

Since v3:
- Fixed an off-by-one bug in a loop
- Fix !CONFIG_HAVE_ARCH_HUGE_VMAP build fail

Nicholas Piggin (14):
  ARM: mm: add missing pud_page define to 2-level page tables
  mm/vmalloc: fix HUGE_VMAP regression by enabling huge pages in
vmalloc_to_page
  mm: apply_to_pte_range warn and fail if a large pte is encountered
  mm/vmalloc: rename vmap_*_range vmap_pages_*_range
  mm/ioremap: rename ioremap_*_range to vmap_*_range
  mm: HUGE_VMAP arch support cleanup
  powerpc: inline huge vmap supported functions
  arm64: inline huge vmap supported functions
  x86: inline huge vmap supported functions
  mm/vmalloc: provide fallback arch huge vmap support functions
  mm: Move vmap_range from mm/ioremap.c to mm/vmalloc.c
  mm/vmalloc: add vmap_range_noflush variant
  mm/vmalloc: Hugepage vmalloc mappings
  powerpc/64s/radix: Enable huge vmalloc mappings

 .../admin-guide/kernel-parameters.txt |   2 +
 arch/Kconfig  |  11 +
 arch/arm/include/asm/pgtable-3level.h |   2 -
 arch/arm/include/asm/pgtable.h|   3 +
 arch/arm64/include/asm/vmalloc.h  |  24 +
 arch/arm64/mm/mmu.c   |  26 -
 arch/powerpc/Kconfig  |   1 +
 arch/powerpc/include/asm/vmalloc.h|  20 +
 arch/powerpc/kernel/module.c  |  21 +-
 arch/powerpc/mm/book3s64/radix_pgtable.c  |  21 -
 arch/x86/include/asm/vmalloc.h|  20 +
 arch/x86/mm/ioremap.c |  19 -
 arch/x86/mm/pgtable.c |  13 -
 include/linux/io.h|   9 -
 include/linux/vmalloc.h   |  46 ++
 init/main.c   |   1 -
 mm/debug_vm_pgtable.c |   4 +-
 mm/ioremap.c  | 225 +---
 mm/memory.c   |  66 ++-
 mm/page_alloc.c   |   5 +-
 mm/vmalloc.c  | 484 +++---
 21 files changed, 619 insertions(+), 404 deletions(-)

-- 
2.23.0



[PATCH v12 01/14] ARM: mm: add missing pud_page define to 2-level page tables

2021-02-02 Thread Nicholas Piggin
ARM uses its own PMD folding scheme which is missing pud_page which
should just pass through to pmd_page. Move this from the 3-level
page table to common header.

Cc: Russell King 
Cc: Ding Tianhong 
Cc: linux-arm-ker...@lists.infradead.org
Signed-off-by: Nicholas Piggin 
---
 arch/arm/include/asm/pgtable-3level.h | 2 --
 arch/arm/include/asm/pgtable.h| 3 +++
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/arm/include/asm/pgtable-3level.h b/arch/arm/include/asm/pgtable-3level.h
index 2b85d175e999..d4edab51a77c 100644
--- a/arch/arm/include/asm/pgtable-3level.h
+++ b/arch/arm/include/asm/pgtable-3level.h
@@ -186,8 +186,6 @@ static inline pte_t pte_mkspecial(pte_t pte)
 
 #define pmd_write(pmd) (pmd_isclear((pmd), L_PMD_SECT_RDONLY))
 #define pmd_dirty(pmd) (pmd_isset((pmd), L_PMD_SECT_DIRTY))
-#define pud_page(pud)  pmd_page(__pmd(pud_val(pud)))
-#define pud_write(pud) pmd_write(__pmd(pud_val(pud)))
 
 #define pmd_hugewillfault(pmd) (!pmd_young(pmd) || !pmd_write(pmd))
 #define pmd_thp_or_huge(pmd)   (pmd_huge(pmd) || pmd_trans_huge(pmd))
diff --git a/arch/arm/include/asm/pgtable.h b/arch/arm/include/asm/pgtable.h
index c02f24400369..d63a5bb6bd0c 100644
--- a/arch/arm/include/asm/pgtable.h
+++ b/arch/arm/include/asm/pgtable.h
@@ -166,6 +166,9 @@ extern struct page *empty_zero_page;
 
 extern pgd_t swapper_pg_dir[PTRS_PER_PGD];
 
+#define pud_page(pud)  pmd_page(__pmd(pud_val(pud)))
+#define pud_write(pud) pmd_write(__pmd(pud_val(pud)))
+
 #define pmd_none(pmd)  (!pmd_val(pmd))
 
 static inline pte_t *pmd_page_vaddr(pmd_t pmd)
-- 
2.23.0



Re: [PATCH v11 01/13] mm/vmalloc: fix HUGE_VMAP regression by enabling huge pages in vmalloc_to_page

2021-02-02 Thread Nicholas Piggin
Excerpts from Ding Tianhong's message of January 28, 2021 1:13 pm:
> On 2021/1/26 12:44, Nicholas Piggin wrote:
>> vmalloc_to_page returns NULL for addresses mapped by larger pages[*].
>> Whether or not a vmap is huge depends on the architecture details,
>> alignments, boot options, etc., which the caller can not be expected
>> to know. Therefore HUGE_VMAP is a regression for vmalloc_to_page.
>> 
>> This change teaches vmalloc_to_page about larger pages, and returns
>> the struct page that corresponds to the offset within the large page.
>> This makes the API agnostic to mapping implementation details.
>> 
>> [*] As explained by commit 029c54b095995 ("mm/vmalloc.c: huge-vmap:
>> fail gracefully on unexpected huge vmap mappings")
>> 
>> Reviewed-by: Christoph Hellwig 
>> Signed-off-by: Nicholas Piggin 
>> ---
>>  mm/vmalloc.c | 41 ++---
>>  1 file changed, 26 insertions(+), 15 deletions(-)
>> 
>> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
>> index e6f352bf0498..62372f9e0167 100644
>> --- a/mm/vmalloc.c
>> +++ b/mm/vmalloc.c
>> @@ -34,7 +34,7 @@
>>  #include 
>>  #include 
>>  #include 
>> -
>> +#include 
>>  #include 
>>  #include 
>>  #include 
>> @@ -343,7 +343,9 @@ int is_vmalloc_or_module_addr(const void *x)
>>  }
>>  
>>  /*
>> - * Walk a vmap address to the struct page it maps.
>> + * Walk a vmap address to the struct page it maps. Huge vmap mappings will
>> + * return the tail page that corresponds to the base page address, which
>> + * matches small vmap mappings.
>>   */
>>  struct page *vmalloc_to_page(const void *vmalloc_addr)
>>  {
>> @@ -363,25 +365,33 @@ struct page *vmalloc_to_page(const void *vmalloc_addr)
>>  
>>  if (pgd_none(*pgd))
>>  return NULL;
>> +if (WARN_ON_ONCE(pgd_leaf(*pgd)))
>> +return NULL; /* XXX: no allowance for huge pgd */
>> +if (WARN_ON_ONCE(pgd_bad(*pgd)))
>> +return NULL;
>> +
>>  p4d = p4d_offset(pgd, addr);
>>  if (p4d_none(*p4d))
>>  return NULL;
>> -pud = pud_offset(p4d, addr);
>> +if (p4d_leaf(*p4d))
>> +return p4d_page(*p4d) + ((addr & ~P4D_MASK) >> PAGE_SHIFT);
>> +if (WARN_ON_ONCE(p4d_bad(*p4d)))
>> +return NULL;
>>  
>> -/*
>> - * Don't dereference bad PUD or PMD (below) entries. This will also
>> - * identify huge mappings, which we may encounter on architectures
>> - * that define CONFIG_HAVE_ARCH_HUGE_VMAP=y. Such regions will be
>> - * identified as vmalloc addresses by is_vmalloc_addr(), but are
>> - * not [unambiguously] associated with a struct page, so there is
>> - * no correct value to return for them.
>> - */
>> -WARN_ON_ONCE(pud_bad(*pud));
>> -if (pud_none(*pud) || pud_bad(*pud))
>> +pud = pud_offset(p4d, addr);
>> +if (pud_none(*pud))
>> +return NULL;
>> +if (pud_leaf(*pud))
>> +return pud_page(*pud) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
> 
> Hi Nicho:
> 
> /builds/1mzfdQzleCy69KZFb5qHNSEgabZ/mm/vmalloc.c: In function 
> 'vmalloc_to_page':
> /builds/1mzfdQzleCy69KZFb5qHNSEgabZ/include/asm-generic/pgtable-nop4d-hack.h:48:27:
>  error: implicit declaration of function 'pud_page'; did you mean 'put_page'? 
> [-Werror=implicit-function-declaration]
>48 | #define pgd_page(pgd)(pud_page((pud_t){ pgd }))
>   |   ^~~~
> 
> the pud_page is not defined for aarch32 when enabling 2-level page config; it
> breaks the system build.

Hey thanks for finding that, not sure why that didn't trigger any CI.

Anyway, newer kernels don't have the pgtable-*-hack.h headers, but even so
it still breaks upstream. arm is using some hand-rolled 2-level folding
of its own (which is fair enough because most 32-bit archs were 2-level
at the time I added the pgtable-nopud.h header).

This patch seems to at least make it build.

Thanks,
Nick

---
 arch/arm/include/asm/pgtable-3level.h | 2 --
 arch/arm/include/asm/pgtable.h| 3 +++
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/arm/include/asm/pgtable-3level.h b/arch/arm/include/asm/pgtable-3level.h
index 2b85d175e999..d4edab51a77c 100644
--- a/arch/arm/include/asm/pgtable-3level.h
+++ b/arch/arm/include/asm/pgtable-3level.h
@@ -186,8 +186,6 @@ static inline pte_t pte_mkspecial(pte_t pte)
 
 #define pmd_write(pmd) (pmd_isclear((pmd), L_PMD_SECT_RDONLY))
 #define pmd_dirty(pmd) (pmd_isset((pmd), L_PMD_SECT_DIRTY))
-#

Re: [RFC 00/20] TLB batching consolidation and enhancements

2021-02-01 Thread Nicholas Piggin
Excerpts from Peter Zijlstra's message of February 1, 2021 10:44 pm:
> On Sun, Jan 31, 2021 at 07:57:01AM +, Nadav Amit wrote:
>> > On Jan 30, 2021, at 7:30 PM, Nicholas Piggin  wrote:
> 
>> > I'll go through the patches a bit more closely when they all come 
>> > through. Sparc and powerpc of course need the arch lazy mode to get 
>> > per-page/pte information for operations that are not freeing pages, 
>> > which is what mmu gather is designed for.
>> 
>> IIUC you mean any PTE change requires a TLB flush. Even setting up a new PTE
>> where no previous PTE was set, right?

In cases of increasing permissiveness of access, yes it may want to 
update the "TLB" (read hash table) to avoid taking hash table faults.

But whatever the reason for the flush, there may have to be more
data carried than just the virtual address range and/or physical
pages.

If you clear out the PTE then you have no guarantee of actually being
able to go back and address the in-memory or in-hardware translation
structures to update them, depending on what exact scheme is used
(powerpc probably could if all page sizes were the same, but THP or 
64k/4k sub pages would throw a spanner in those works).
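
To make "more data" concrete, the hash flush batch has to remember what
was mapped, not just where. Very roughly (simplified from memory, not
the real powerpc structure, cf. ppc64_tlb_batch):

	struct hash_flush_batch {
		struct mm_struct *mm;
		unsigned long	index;
		unsigned long	vpn[192];	/* which virtual pages */
		real_pte_t	pte[192];	/* old PTE contents: page size,
						   hash slot hints, etc. */
	};

The flush walks those saved PTEs to find the hash slots to invalidate,
which is exactly the information that is gone once the PTE has been
cleared and forgotten.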

> These are the HASH architectures. Their hardware doesn't walk the
> page-tables, but it consults a hash-table to resolve page translations.

Yeah, it's very cool in a masochistic way.

I actually don't know if it's worth doing a big rework of it, as much
as I'd like to, rather than just keeping it in place and eventually
dismantling some of the go-fast hooks from core code if we can one day
deprecate it in favour of the much easier radix mode.

The whole thing is like a big steam train, years ago Paul and Ben and 
Anton and co got the boiler stoked up and set all the valves just right 
so it runs unbelievably well for what it's actually doing but look at it
the wrong way and the whole thing could blow up. (at least that's what 
it feels like to me probably because I don't know the code that well).

Sparc could probably do the same, not sure about Xen. I don't suppose
vmware is intending to add any kind of paravirt mode related to this stuff?

Thanks,
Nick


Re: [RFC 11/20] mm/tlb: remove arch-specific tlb_start/end_vma()

2021-02-01 Thread Nicholas Piggin
Excerpts from Peter Zijlstra's message of February 1, 2021 10:09 pm:
> On Sat, Jan 30, 2021 at 04:11:23PM -0800, Nadav Amit wrote:
> 
>> diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
>> index 427bfcc6cdec..b97136b7010b 100644
>> --- a/include/asm-generic/tlb.h
>> +++ b/include/asm-generic/tlb.h
>> @@ -334,8 +334,8 @@ static inline void __tlb_reset_range(struct mmu_gather 
>> *tlb)
>>  
>>  #ifdef CONFIG_MMU_GATHER_NO_RANGE
>>  
>> -#if defined(tlb_flush) || defined(tlb_start_vma) || defined(tlb_end_vma)
>> -#error MMU_GATHER_NO_RANGE relies on default tlb_flush(), tlb_start_vma() 
>> and tlb_end_vma()
>> +#if defined(tlb_flush)
>> +#error MMU_GATHER_NO_RANGE relies on default tlb_flush()
>>  #endif
>>  
>>  /*
>> @@ -362,10 +362,6 @@ static inline void tlb_end_vma(struct mmu_gather *tlb, 
>> struct vm_area_struct *vm
>>  
>>  #ifndef tlb_flush
>>  
>> -#if defined(tlb_start_vma) || defined(tlb_end_vma)
>> -#error Default tlb_flush() relies on default tlb_start_vma() and 
>> tlb_end_vma()
>> -#endif
> 
> #ifdef CONFIG_ARCH_WANT_AGGRESSIVE_TLB_FLUSH_BATCHING
> #error 
> #endif
> 
> goes here...
> 
> 
>>  static inline void tlb_end_vma(struct mmu_gather *tlb, struct 
>> vm_area_struct *vma)
>>  {
>>  if (tlb->fullmm)
>>  return;
>>  
>> +if (IS_ENABLED(CONFIG_ARCH_WANT_AGGRESSIVE_TLB_FLUSH_BATCHING))
>> +return;
> 
> Also, can you please stick to the CONFIG_MMU_GATHER_* namespace?
> 
> I also don't think AGRESSIVE_FLUSH_BATCHING quite captures what it does.
> How about:
> 
>   CONFIG_MMU_GATHER_NO_PER_VMA_FLUSH

Yes please, have to have descriptive names.

I didn't quite see why this was much of an improvement though. Maybe 
follow up patches take advantage of it? I didn't see how they all fit 
together.

Thanks,
Nick


Re: [PATCH v4 11/23] powerpc/syscall: Rename syscall_64.c into syscall.c

2021-02-01 Thread Nicholas Piggin
Excerpts from Christophe Leroy's message of February 2, 2021 4:15 pm:
> 
> 
> Le 28/01/2021 à 00:50, Nicholas Piggin a écrit :
>> Excerpts from David Laight's message of January 26, 2021 8:28 pm:
>>> From: Nicholas Piggin
>>>> Sent: 26 January 2021 10:21
>>>>
>>>> Excerpts from Christophe Leroy's message of January 26, 2021 12:48 am:
>>>>> syscall_64.c will be reused almost as is for PPC32.
>>>>>
>>>>> Rename it syscall.c
>>>>
>>>> Could you rename it to interrupt.c instead? A system call is an
>>>> interrupt, and the file now also has code to return from other
>>>> interrupts as well, and it matches the new asm/interrupt.h from
>>>> the interrupts series.
>>>
>>> Hmmm
>>>
>>> That might make it harder for someone looking for the system call
>>> entry code to find it.
>> 
>> It's very grep'able.
>> 
>>> In some sense interrupts are the simpler case.
>>>
>>> Especially when comparing with other architectures which have
>>> special instructions for syscall entry.
>> 
>> powerpc does have a special instruction for syscall, and it causes a
>> system call interrupt.
>> 
>> I'm not sure about other architectures, but for powerpc it's more
>> sensible to call it interrupt.c than syscall.c.
> 
> Many other architectures have a syscall.c but for a different purpose: it 
> contains arch specific 
> system calls. We have that in powerpc as well, it is called syscalls.c
> 
> So to avoid confusion, I'll rename it. But I think "interrupt" is maybe not 
> the right name. An 
> interrupt most of the time refers to IRQ.

That depends what you mean by interrupt and IRQ.

Linux kind of considers any asynchronous maskable interrupt an irq 
(local_irq_disable()). But if you say irq it's more likely to mean
a device interrupt, and "interrupt" usually refers to the asynch
ones.

But Linux doesn't really assign names to synchronous interrupts in
core code. It doesn't say they aren't interrupts, it just doesn't
really have a convention for them at all.

Other architectures e.g., x86 also have things like interrupt
descriptor table for synchronous interrupts as well. That's where
I got the interrupt wrappers code from actually.

So it's really fine to use the proper arch-specific names for things
in arch code. I'm trying to slowly change names from exception to
interrupt.

> For me system call is not an interrupt in the way it 
> doesn't unexpectedly interrupt a program flow. In powerpc manuals it is 
> generally called exceptions, 
> so I'm more inclined to call it exception.c

Actually that's backwards. Powerpc manuals (at least the one I look at) 
call them all interrupts including system calls, and also the system
call interrupt is actually the only one that doesn't appear to be
associated with an exception.

Also there is no distinction about expected/unexpected -- a data storage
interrupt is expected if you access a location without the right access 
permissions for example, but it is still an interrupt.

These handlers very specifically deal with the change to execution flow
(i.e., the interrupt); they do *not* deal with the exception which may
be associated with it (that is the job of the handler).

And on the other hand you can deal with exceptions in some cases without
taking an interrupt at all. For example if you had MSR[EE]=0 you could
change the decrementer or execute msgclr or change HMER SPR etc to clear
various exceptions without ever taking the interrupt.

Thanks,
Nick


Re: [RFC 00/20] TLB batching consolidation and enhancements

2021-01-30 Thread Nicholas Piggin
Excerpts from Nadav Amit's message of January 31, 2021 10:11 am:
> From: Nadav Amit 
> 
> There are currently (at least?) 5 different TLB batching schemes in the
> kernel:
> 
> 1. Using mmu_gather (e.g., zap_page_range()).
> 
> 2. Using {inc|dec}_tlb_flush_pending() to inform other threads on the
>ongoing deferred TLB flush and flushing the entire range eventually
>(e.g., change_protection_range()).
> 
> 3. arch_{enter|leave}_lazy_mmu_mode() for sparc and powerpc (and Xen?).
> 
> 4. Batching per-table flushes (move_ptes()).
> 
> 5. By setting a flag on that a deferred TLB flush operation takes place,
>flushing when (try_to_unmap_one() on x86).
> 
> It seems that (1)-(4) can be consolidated. In addition, it seems that
> (5) is racy. It also seems there can be many redundant TLB flushes, and
> potentially TLB-shootdown storms, for instance during batched
> reclamation (using try_to_unmap_one()) if at the same time mmu_gather
> defers TLB flushes.
> 
> More aggressive TLB batching may be possible, but this patch-set does
> not add such batching. The proposed changes would enable such batching
> in a later time.
> 
> Admittedly, I do not understand how things are not broken today, which
> frightens me to make further batching before getting things in order.
> For instance, why is it ok for zap_pte_range() to batch dirty-PTE flushes
> for each page-table (but not in greater granularity). Can't
> ClearPageDirty() be called before the flush, causing writes after
> ClearPageDirty() and before the flush to be lost?

Because it's holding the page table lock which stops page_mkclean from 
cleaning the page. Or am I misunderstanding the question?
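
To spell out what I mean (paraphrasing the code from memory, not
quoting it), both sides serialize on the same pte lock:

	/* zap_pte_range() side, simplified */
	pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
	ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
	if (pte_dirty(ptent))
		set_page_dirty(page);	/* dirty bit moved to the page... */
	pte_unmap_unlock(pte, ptl);	/* ...before the lock is dropped */

	/*
	 * page_mkclean() side, simplified: it has to take that same ptl
	 * (via the rmap walk) before it can wrprotect and clean the pte,
	 * so it can't slip in between the clear and set_page_dirty above.
	 */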

I'll go through the patches a bit more closely when they all come 
through. Sparc and powerpc of course need the arch lazy mode to get 
per-page/pte information for operations that are not freeing pages, 
which is what mmu gather is designed for.

I wouldn't mind using a similar API so it's less of a black box when 
reading generic code, but it might not quite fit the mmu gather API
exactly (most of these paths don't want a full mmu_gather on stack).
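
For reference, the pattern I mean is roughly (as the API stands today,
simplified):

	struct mmu_gather tlb;

	tlb_gather_mmu(&tlb, mm, start, end);
	unmap_page_range(&tlb, vma, start, end, NULL);
	tlb_finish_mmu(&tlb, start, end);

which is fine for unmap paths that batch page freeing, but heavier than
what something like mprotect wants to carry on its stack.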

> 
> This patch-set therefore performs the following changes:
> 
> 1. Change mprotect, task_mmu and mapping_dirty_helpers to use mmu_gather
>instead of {inc|dec}_tlb_flush_pending().
> 
> 2. Avoid TLB flushes if PTE permission is not demoted.
> 
> 3. Cleans up mmu_gather to be less arch-dependant.
> 
> 4. Uses mm's generations to track in finer granularity, either per-VMA
>or per page-table, whether a pending mmu_gather operation is
>outstanding. This should allow to avoid some TLB flushes when KSM or
>memory reclamation takes place while another operation such as
>munmap() or mprotect() is running.
> 
> 5. Changes try_to_unmap_one() flushing scheme, as the current seems
>broken to track in a bitmap which CPUs have outstanding TLB flushes
>instead of having a flag.

Putting fixes first, and cleanups and independent patches (like #2) next
would help with getting stuff merged and backported.

Thanks,
Nick


RE: [PATCH v4 11/23] powerpc/syscall: Rename syscall_64.c into syscall.c

2021-01-27 Thread Nicholas Piggin
Excerpts from David Laight's message of January 26, 2021 8:28 pm:
> From: Nicholas Piggin
>> Sent: 26 January 2021 10:21
>> 
>> Excerpts from Christophe Leroy's message of January 26, 2021 12:48 am:
>> > syscall_64.c will be reused almost as is for PPC32.
>> >
>> > Rename it syscall.c
>> 
>> Could you rename it to interrupt.c instead? A system call is an
>> interrupt, and the file now also has code to return from other
>> interrupts as well, and it matches the new asm/interrupt.h from
>> the interrupts series.
> 
> Hmmm
> 
> That might make it harder for someone looking for the system call
> entry code to find it.

It's very grep'able.

> In some sense interrupts are the simpler case.
> 
> Especially when comparing with other architectures which have
> special instructions for syscall entry.

powerpc does have a special instruction for syscall, and it causes a
system call interrupt.

I'm not sure about other architectures, but for powerpc it's more
sensible to call it interrupt.c than syscall.c.

Thanks,
Nick


Re: [PATCH 2/5] kernel/dma: remove unnecessary unmap_kernel_range

2021-01-27 Thread Nicholas Piggin
Excerpts from Christoph Hellwig's message of January 27, 2021 5:10 pm:
> On Tue, Jan 26, 2021 at 05:08:46PM -0500, Konrad Rzeszutek Wilk wrote:
>> On Tue, Jan 26, 2021 at 02:54:01PM +1000, Nicholas Piggin wrote:
>> > vunmap will remove ptes.
>> 
>> Should there be some ASSERT after the vunmap to make sure that is the
>> case? 
> 
> Not really.  removing the PTEs is the whole point of vunmap.  Everything
> else is just house keeping.

Agree. I did double check this and wrote a quick test to check ptes were 
there before the vunmap and cleared after, just to make sure I didn't 
make a silly mistake with the patch. But in general drivers should be 
able to trust that the code behind the API call will do the right thing. Such
assertions should go in the vunmap() implementation as appropriate.
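
Roughly what that quick check looked like (reconstructed from memory,
not the exact test code):

	struct page *pg = alloc_page(GFP_KERNEL);
	void *va = vmap(&pg, 1, VM_MAP, PAGE_KERNEL);

	WARN_ON(vmalloc_to_page(va) != pg);	/* pte installed by vmap */
	vunmap(va);
	WARN_ON(vmalloc_to_page(va) != NULL);	/* pte cleared again; fine for
						   a single-threaded check,
						   nothing reuses the range
						   that fast */
	__free_page(pg);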

Thanks,
Nick


[PATCH v11 00/13] huge vmalloc mappings

2021-01-27 Thread Nicholas Piggin
I think I ended up implementing all Christoph's comments because
they turned out better in the end. Cleanups coming in another
series though.

Thanks,
Nick

Since v10:
- Fixed code style, most > 80 columns, tweak patch titles, etc [thanks Christoph]
- Made huge vmalloc code and data structure compile away if unselected
  [Christoph]
- Archs only have to provide arch_vmap_p?d_supported for levels they
  implement [Christoph]

Since v9:
- Fixed intermediate build breakage on x86-32 !PAE [thanks Ding]
- Fixed small page fallback case vm_struct double-free [thanks Ding]

Since v8:
- Fixed nommu compile.
- Added Kconfig option help text
- Added VM_NOHUGE which should help archs implement it [suggested by Rick]

Since v7:
- Rebase, added some acks, compile fix
- Removed "order=" from vmallocinfo, it's a bit confusing (nr_pages
  is in small page size for compatibility).
- Added arch_vmap_pmd_supported() test before starting to allocate
  the large page, rather than only testing it when doing the map, to
  avoid unsupported configs trying to allocate huge pages for no
  reason.

Since v6:
- Fixed a false positive warning introduced in patch 2, found by
  kbuild test robot.

Since v5:
- Split arch changes out better and make the constant folding work
- Avoid most of the 80 column wrap, fix a reference to lib/ioremap.c
- Fix compile error on some archs

Since v4:
- Fixed an off-by-page-order bug in v4
- Several minor cleanups.
- Added page order to /proc/vmallocinfo
- Added hugepage to alloc_large_system_hash output.
- Made an architecture config option, powerpc only for now.

Since v3:
- Fixed an off-by-one bug in a loop
- Fix !CONFIG_HAVE_ARCH_HUGE_VMAP build fail

Nicholas Piggin (13):
  mm/vmalloc: fix HUGE_VMAP regression by enabling huge pages in
vmalloc_to_page
  mm: apply_to_pte_range warn and fail if a large pte is encountered
  mm/vmalloc: rename vmap_*_range vmap_pages_*_range
  mm/ioremap: rename ioremap_*_range to vmap_*_range
  mm: HUGE_VMAP arch support cleanup
  powerpc: inline huge vmap supported functions
  arm64: inline huge vmap supported functions
  x86: inline huge vmap supported functions
  mm/vmalloc: provide fallback arch huge vmap support functions
  mm: Move vmap_range from mm/ioremap.c to mm/vmalloc.c
  mm/vmalloc: add vmap_range_noflush variant
  mm/vmalloc: Hugepage vmalloc mappings
  powerpc/64s/radix: Enable huge vmalloc mappings

 .../admin-guide/kernel-parameters.txt |   2 +
 arch/Kconfig  |  11 +
 arch/arm64/include/asm/vmalloc.h  |  24 +
 arch/arm64/mm/mmu.c   |  26 -
 arch/powerpc/Kconfig  |   1 +
 arch/powerpc/include/asm/vmalloc.h|  20 +
 arch/powerpc/kernel/module.c  |  21 +-
 arch/powerpc/mm/book3s64/radix_pgtable.c  |  21 -
 arch/x86/include/asm/vmalloc.h|  20 +
 arch/x86/mm/ioremap.c |  19 -
 arch/x86/mm/pgtable.c |  13 -
 include/linux/io.h|   9 -
 include/linux/vmalloc.h   |  46 ++
 init/main.c   |   1 -
 mm/ioremap.c  | 225 +---
 mm/memory.c   |  66 ++-
 mm/page_alloc.c   |   5 +-
 mm/vmalloc.c  | 484 +++---
 18 files changed, 614 insertions(+), 400 deletions(-)

-- 
2.23.0



[PATCH v11 01/13] mm/vmalloc: fix HUGE_VMAP regression by enabling huge pages in vmalloc_to_page

2021-01-27 Thread Nicholas Piggin
vmalloc_to_page returns NULL for addresses mapped by larger pages[*].
Whether or not a vmap is huge depends on the architecture details,
alignments, boot options, etc., which the caller can not be expected
to know. Therefore HUGE_VMAP is a regression for vmalloc_to_page.

This change teaches vmalloc_to_page about larger pages, and returns
the struct page that corresponds to the offset within the large page.
This makes the API agnostic to mapping implementation details.

[*] As explained by commit 029c54b095995 ("mm/vmalloc.c: huge-vmap:
fail gracefully on unexpected huge vmap mappings")

Reviewed-by: Christoph Hellwig 
Signed-off-by: Nicholas Piggin 
---
 mm/vmalloc.c | 41 ++---
 1 file changed, 26 insertions(+), 15 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index e6f352bf0498..62372f9e0167 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -34,7 +34,7 @@
 #include 
 #include 
 #include 
-
+#include 
 #include 
 #include 
 #include 
@@ -343,7 +343,9 @@ int is_vmalloc_or_module_addr(const void *x)
 }
 
 /*
- * Walk a vmap address to the struct page it maps.
+ * Walk a vmap address to the struct page it maps. Huge vmap mappings will
+ * return the tail page that corresponds to the base page address, which
+ * matches small vmap mappings.
  */
 struct page *vmalloc_to_page(const void *vmalloc_addr)
 {
@@ -363,25 +365,33 @@ struct page *vmalloc_to_page(const void *vmalloc_addr)
 
if (pgd_none(*pgd))
return NULL;
+   if (WARN_ON_ONCE(pgd_leaf(*pgd)))
+   return NULL; /* XXX: no allowance for huge pgd */
+   if (WARN_ON_ONCE(pgd_bad(*pgd)))
+   return NULL;
+
p4d = p4d_offset(pgd, addr);
if (p4d_none(*p4d))
return NULL;
-   pud = pud_offset(p4d, addr);
+   if (p4d_leaf(*p4d))
+   return p4d_page(*p4d) + ((addr & ~P4D_MASK) >> PAGE_SHIFT);
+   if (WARN_ON_ONCE(p4d_bad(*p4d)))
+   return NULL;
 
-   /*
-* Don't dereference bad PUD or PMD (below) entries. This will also
-* identify huge mappings, which we may encounter on architectures
-* that define CONFIG_HAVE_ARCH_HUGE_VMAP=y. Such regions will be
-* identified as vmalloc addresses by is_vmalloc_addr(), but are
-* not [unambiguously] associated with a struct page, so there is
-* no correct value to return for them.
-*/
-   WARN_ON_ONCE(pud_bad(*pud));
-   if (pud_none(*pud) || pud_bad(*pud))
+   pud = pud_offset(p4d, addr);
+   if (pud_none(*pud))
+   return NULL;
+   if (pud_leaf(*pud))
+   return pud_page(*pud) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
+   if (WARN_ON_ONCE(pud_bad(*pud)))
return NULL;
+
pmd = pmd_offset(pud, addr);
-   WARN_ON_ONCE(pmd_bad(*pmd));
-   if (pmd_none(*pmd) || pmd_bad(*pmd))
+   if (pmd_none(*pmd))
+   return NULL;
+   if (pmd_leaf(*pmd))
+   return pmd_page(*pmd) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
+   if (WARN_ON_ONCE(pmd_bad(*pmd)))
return NULL;
 
ptep = pte_offset_map(pmd, addr);
@@ -389,6 +399,7 @@ struct page *vmalloc_to_page(const void *vmalloc_addr)
if (pte_present(pte))
page = pte_page(pte);
pte_unmap(ptep);
+
return page;
 }
 EXPORT_SYMBOL(vmalloc_to_page);
-- 
2.23.0



[PATCH v11 08/13] x86: inline huge vmap supported functions

2021-01-27 Thread Nicholas Piggin
This allows unsupported levels to be constant folded away, and so
p4d_free_pud_page can be removed because it's no longer linked to.
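
The call site that now folds away has roughly this shape (paraphrased,
not the exact hunk from the series):

static int vmap_try_huge_p4d(p4d_t *p4d, unsigned long addr, unsigned long end,
			     phys_addr_t phys_addr, pgprot_t prot)
{
	if (!arch_vmap_p4d_supported(prot))	/* inline 'return false' on x86 */
		return 0;
	/* alignment and size checks elided */
	if (p4d_present(*p4d) && !p4d_free_pud_page(p4d, addr))
		return 0;
	return p4d_set_huge(p4d, phys_addr, prot);
}

With arch_vmap_p4d_supported() a compile-time false, everything below
the early return is dead code, so p4d_free_pud_page() loses its only
reference.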

Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: x...@kernel.org
Cc: "H. Peter Anvin" 
Signed-off-by: Nicholas Piggin 
---
 arch/x86/include/asm/vmalloc.h | 22 +++---
 arch/x86/mm/ioremap.c  | 21 -
 arch/x86/mm/pgtable.c  | 13 -
 3 files changed, 19 insertions(+), 37 deletions(-)

diff --git a/arch/x86/include/asm/vmalloc.h b/arch/x86/include/asm/vmalloc.h
index 094ea2b565f3..e714b00fc0ca 100644
--- a/arch/x86/include/asm/vmalloc.h
+++ b/arch/x86/include/asm/vmalloc.h
@@ -1,13 +1,29 @@
 #ifndef _ASM_X86_VMALLOC_H
 #define _ASM_X86_VMALLOC_H
 
+#include 
 #include 
 #include 
 
 #ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
-bool arch_vmap_p4d_supported(pgprot_t prot);
-bool arch_vmap_pud_supported(pgprot_t prot);
-bool arch_vmap_pmd_supported(pgprot_t prot);
+static inline bool arch_vmap_p4d_supported(pgprot_t prot)
+{
+   return false;
+}
+
+static inline bool arch_vmap_pud_supported(pgprot_t prot)
+{
+#ifdef CONFIG_X86_64
+   return boot_cpu_has(X86_FEATURE_GBPAGES);
+#else
+   return false;
+#endif
+}
+
+static inline bool arch_vmap_pmd_supported(pgprot_t prot)
+{
+   return boot_cpu_has(X86_FEATURE_PSE);
+}
 #endif
 
 #endif /* _ASM_X86_VMALLOC_H */
diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
index fbaf0c447986..12c686c65ea9 100644
--- a/arch/x86/mm/ioremap.c
+++ b/arch/x86/mm/ioremap.c
@@ -481,27 +481,6 @@ void iounmap(volatile void __iomem *addr)
 }
 EXPORT_SYMBOL(iounmap);
 
-#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
-bool arch_vmap_p4d_supported(pgprot_t prot)
-{
-   return false;
-}
-
-bool arch_vmap_pud_supported(pgprot_t prot)
-{
-#ifdef CONFIG_X86_64
-   return boot_cpu_has(X86_FEATURE_GBPAGES);
-#else
-   return false;
-#endif
-}
-
-bool arch_vmap_pmd_supported(pgprot_t prot)
-{
-   return boot_cpu_has(X86_FEATURE_PSE);
-}
-#endif
-
 /*
  * Convert a physical pointer to a virtual kernel pointer for /dev/mem
  * access
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index f6a9e2e36642..d27cf69e811d 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -780,14 +780,6 @@ int pmd_clear_huge(pmd_t *pmd)
return 0;
 }
 
-/*
- * Until we support 512GB pages, skip them in the vmap area.
- */
-int p4d_free_pud_page(p4d_t *p4d, unsigned long addr)
-{
-   return 0;
-}
-
 #ifdef CONFIG_X86_64
 /**
  * pud_free_pmd_page - Clear pud entry and free pmd page.
@@ -861,11 +853,6 @@ int pmd_free_pte_page(pmd_t *pmd, unsigned long addr)
 
 #else /* !CONFIG_X86_64 */
 
-int pud_free_pmd_page(pud_t *pud, unsigned long addr)
-{
-   return pud_none(*pud);
-}
-
 /*
  * Disable free page handling on x86-PAE. This assures that ioremap()
  * does not update sync'd pmd entries. See vmalloc_sync_one().
-- 
2.23.0



[PATCH 4/5] mm/vmalloc: remove unmap_kernel_range

2021-01-27 Thread Nicholas Piggin
This is a shim around vunmap_range, get rid of it.

Move the main API comment from the _noflush variant to the normal
variant, and make _noflush internal to mm/.

Signed-off-by: Nicholas Piggin 
---
 Documentation/core-api/cachetlb.rst |  2 +-
 arch/arm64/mm/init.c|  2 +-
 arch/powerpc/kernel/isa-bridge.c|  4 +-
 arch/powerpc/kernel/pci_64.c|  2 +-
 arch/powerpc/mm/ioremap.c   |  2 +-
 drivers/pci/pci.c   |  2 +-
 include/linux/vmalloc.h |  8 +---
 mm/internal.h   |  1 +
 mm/percpu-vm.c  |  2 +-
 mm/vmalloc.c| 59 ++---
 10 files changed, 38 insertions(+), 46 deletions(-)

diff --git a/Documentation/core-api/cachetlb.rst b/Documentation/core-api/cachetlb.rst
index 756f7bcf8191..fe4290e26729 100644
--- a/Documentation/core-api/cachetlb.rst
+++ b/Documentation/core-api/cachetlb.rst
@@ -215,7 +215,7 @@ Here are the routines, one by one:
 
The first of these two routines is invoked after vmap_range()
has installed the page table entries.  The second is invoked
-   before unmap_kernel_range() deletes the page table entries.
+   before vunmap_range() deletes the page table entries.
 
 There exists another whole class of cpu cache issues which currently
 require a whole different set of interfaces to handle properly.
diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 709d98fea90c..7fe0a5074205 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -498,7 +498,7 @@ void free_initmem(void)
 * prevents the region from being reused for kernel modules, which
 * is not supported by kallsyms.
 */
-   unmap_kernel_range((u64)__init_begin, (u64)(__init_end - __init_begin));
+   vunmap_range((u64)__init_begin, (u64)__init_end);
 }
 
 void dump_mem_limit(void)
diff --git a/arch/powerpc/kernel/isa-bridge.c b/arch/powerpc/kernel/isa-bridge.c
index 2257d24e6a26..39c625737c09 100644
--- a/arch/powerpc/kernel/isa-bridge.c
+++ b/arch/powerpc/kernel/isa-bridge.c
@@ -48,7 +48,7 @@ static void remap_isa_base(phys_addr_t pa, unsigned long size)
if (slab_is_available()) {
if (ioremap_page_range(ISA_IO_BASE, ISA_IO_BASE + size, pa,
pgprot_noncached(PAGE_KERNEL)))
-   unmap_kernel_range(ISA_IO_BASE, size);
+   vunmap_range(ISA_IO_BASE, ISA_IO_BASE + size);
} else {
early_ioremap_range(ISA_IO_BASE, pa, size,
pgprot_noncached(PAGE_KERNEL));
@@ -311,7 +311,7 @@ static void isa_bridge_remove(void)
isa_bridge_pcidev = NULL;
 
/* Unmap the ISA area */
-   unmap_kernel_range(ISA_IO_BASE, 0x1);
+   vunmap_range(ISA_IO_BASE, ISA_IO_BASE + 0x1);
 }
 
 /**
diff --git a/arch/powerpc/kernel/pci_64.c b/arch/powerpc/kernel/pci_64.c
index 9312e6eda7ff..3fb7e572abed 100644
--- a/arch/powerpc/kernel/pci_64.c
+++ b/arch/powerpc/kernel/pci_64.c
@@ -140,7 +140,7 @@ void __iomem *ioremap_phb(phys_addr_t paddr, unsigned long size)
addr = (unsigned long)area->addr;
if (ioremap_page_range(addr, addr + size, paddr,
pgprot_noncached(PAGE_KERNEL))) {
-   unmap_kernel_range(addr, size);
+   vunmap_range(addr, addr + size);
return NULL;
}
 
diff --git a/arch/powerpc/mm/ioremap.c b/arch/powerpc/mm/ioremap.c
index b1a0aebe8c48..57342154d2b0 100644
--- a/arch/powerpc/mm/ioremap.c
+++ b/arch/powerpc/mm/ioremap.c
@@ -93,7 +93,7 @@ void __iomem *do_ioremap(phys_addr_t pa, phys_addr_t offset, unsigned long size,
if (!ret)
return (void __iomem *)area->addr + offset;
 
-   unmap_kernel_range(va, size);
+   vunmap_range(va, va + size);
free_vm_area(area);
 
return NULL;
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index b9fecc25d213..d1e5ee09b381 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -4107,7 +4107,7 @@ void pci_unmap_iospace(struct resource *res)
 #if defined(PCI_IOBASE) && defined(CONFIG_MMU)
unsigned long vaddr = (unsigned long)PCI_IOBASE + res->start;
 
-   unmap_kernel_range(vaddr, resource_size(res));
+   vunmap_range(vaddr, vaddr + resource_size(res));
 #endif
 }
 EXPORT_SYMBOL(pci_unmap_iospace);
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 913c9d4f5e03..b569a13c9960 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -217,8 +217,7 @@ static inline bool is_vm_area_hugepages(const void *addr)
 int vmap_range(unsigned long addr, unsigned long end,
phys_addr_t phys_addr, pgprot_t prot,
unsigned int max_page_shift);
-extern void unmap_kernel_range_noflush(unsigned long addr, unsigned long size);
-extern void unmap_kernel_range(unsigned long addr, unsigned long size);
+void vunmap_range(unsigned long addr, unsigned long end);

[PATCH 5/5] mm/vmalloc: improve allocation failure error messages

2021-01-26 Thread Nicholas Piggin
There are several reasons why a vmalloc can fail, virtual space
exhausted, page array allocation failure, page allocation failure,
and kernel page table allocation failure.

Add distinct warning messages for the main causes of failure, with
some added information like page order or allocation size where
applicable.

Signed-off-by: Nicholas Piggin 
---
 mm/vmalloc.c | 40 
 1 file changed, 28 insertions(+), 12 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 5ff190590fe4..4facf582a3be 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2790,6 +2790,10 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
 
if (!pages) {
free_vm_area(area);
+   warn_alloc(gfp_mask, NULL,
+  "vmalloc size %lu allocation failure: "
+  "page array size %lu allocation failed",
+  area->nr_pages * PAGE_SIZE, array_size);
return NULL;
}
 
@@ -2813,6 +2817,10 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
/* Successfully allocated i pages, free them in __vfree() */
area->nr_pages = i;
atomic_long_add(area->nr_pages, &nr_vmalloc_pages);
+   warn_alloc(gfp_mask, NULL,
+  "vmalloc size %lu allocation failure: "
+  "page order %u allocation failed",
+  area->nr_pages * PAGE_SIZE, page_order);
goto fail;
}
 
@@ -2824,15 +2832,17 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
}
atomic_long_add(area->nr_pages, &nr_vmalloc_pages);
 
-   if (vmap_pages_range(addr, addr + size, prot, pages, page_shift) < 0)
+   if (vmap_pages_range(addr, addr + size, prot, pages, page_shift) < 0) {
+   warn_alloc(gfp_mask, NULL,
+  "vmalloc size %lu allocation failure: "
+  "failed to map pages",
+  area->nr_pages * PAGE_SIZE);
goto fail;
+   }
 
return area->addr;
 
 fail:
-   warn_alloc(gfp_mask, NULL,
- "vmalloc: allocation failure, allocated %ld of %ld 
bytes",
- (area->nr_pages*PAGE_SIZE), size);
__vfree(area->addr);
return NULL;
 }
@@ -2866,8 +2876,15 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
unsigned long real_align = align;
unsigned int shift = PAGE_SHIFT;
 
-   if (!size || (size >> PAGE_SHIFT) > totalram_pages())
-   goto fail;
+   if (WARN_ON_ONCE(!size))
+   return NULL;
+
+   if ((size >> PAGE_SHIFT) > totalram_pages()) {
+   warn_alloc(gfp_mask, NULL,
+  "vmalloc size %lu allocation failure: "
+  "exceeds total pages", real_size);
+   return NULL;
+   }
 
if (vmap_allow_huge && !(vm_flags & VM_NO_HUGE_VMAP) &&
arch_vmap_pmd_supported(prot)) {
@@ -2894,8 +2911,12 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
size = PAGE_ALIGN(size);
area = __get_vm_area_node(size, align, VM_ALLOC | VM_UNINITIALIZED |
vm_flags, start, end, node, gfp_mask, caller);
-   if (!area)
+   if (!area) {
+   warn_alloc(gfp_mask, NULL,
+  "vmalloc size %lu allocation failure: "
+  "vm_struct allocation failed", real_size);
goto fail;
+   }
 
addr = __vmalloc_area_node(area, gfp_mask, prot, shift, node);
if (!addr)
@@ -2920,11 +2941,6 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
goto again;
}
 
-   if (!area) {
-   /* Warn for area allocation, page allocations already warn */
-   warn_alloc(gfp_mask, NULL,
- "vmalloc: allocation failure: %lu bytes", real_size);
-   }
return NULL;
 }
 
-- 
2.23.0



[PATCH 3/5] powerpc/xive: remove unnecessary unmap_kernel_range

2021-01-26 Thread Nicholas Piggin
iounmap will remove ptes.

Cc: "Cédric Le Goater" 
Cc: linuxppc-...@lists.ozlabs.org
Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/sysdev/xive/common.c | 4 
 1 file changed, 4 deletions(-)

diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c
index 595310e056f4..d6c2069cc828 100644
--- a/arch/powerpc/sysdev/xive/common.c
+++ b/arch/powerpc/sysdev/xive/common.c
@@ -959,16 +959,12 @@ EXPORT_SYMBOL_GPL(is_xive_irq);
 void xive_cleanup_irq_data(struct xive_irq_data *xd)
 {
if (xd->eoi_mmio) {
-   unmap_kernel_range((unsigned long)xd->eoi_mmio,
-  1u << xd->esb_shift);
iounmap(xd->eoi_mmio);
if (xd->eoi_mmio == xd->trig_mmio)
xd->trig_mmio = NULL;
xd->eoi_mmio = NULL;
}
if (xd->trig_mmio) {
-   unmap_kernel_range((unsigned long)xd->trig_mmio,
-  1u << xd->esb_shift);
iounmap(xd->trig_mmio);
xd->trig_mmio = NULL;
}
-- 
2.23.0


