Re: [PATCH 7/7] powerpc/kprobes: Allow probing on any ftrace address

2019-06-21 Thread Joe Perches
On Fri, 2019-06-21 at 23:50 +0900, Masami Hiramatsu wrote:
> On Tue, 18 Jun 2019 20:17:06 +0530
> "Naveen N. Rao"  wrote:

trivia:

> > diff --git a/arch/powerpc/kernel/kprobes-ftrace.c 
> > b/arch/powerpc/kernel/kprobes-ftrace.c
[]
> > @@ -57,6 +82,11 @@ NOKPROBE_SYMBOL(kprobe_ftrace_handler);
> >  
> >  int arch_prepare_kprobe_ftrace(struct kprobe *p)
> >  {
> > +   if ((unsigned long)p->addr & 0x03) {
> > +   printk("Attempt to register kprobe at an unaligned address\n");

Please use the appropriate KERN_ or pr_




Re: [PATCH 16/16] mm: pass get_user_pages_fast iterator arguments in a structure

2019-06-21 Thread Nicholas Piggin
Christoph Hellwig's on June 21, 2019 6:15 pm:
> On Thu, Jun 20, 2019 at 10:21:46AM -0700, Linus Torvalds wrote:
>> Hmm. Honestly, I've never seen anything like that in any kernel profiles.
>> 
>> Compared to the problems I _do_ see (which is usually the obvious
>> cache misses, and locking), it must either be in the noise or it's
>> some problem specific to whatever CPU you are doing performance work
>> on?
>> 
>> I've occasionally seen pipeline hiccups in profiles, but it's usually
>> been either some serious glass jaw of the core, or it's been something
>> really stupid we did (or occasionally that the compiler did: one in
>> particular I remember was how there was a time when gcc would narrow
>> stores when it could, so if you set a bit in a word, it would do it
>> with a byte store, and then when you read the whole word afterwards
>> you'd get a major pipeline stall and it happened to show up in some
>> really hot paths).
> 
> I've not seen any difference in the GUP bench output here ar all.
> 
> But I'm fine with skipping this patch for now, I have a potential
> series I'm looking into that would benefit a lot from it, but we
> can discusss it in that context and make sure all the other works gets in
> in time.
> 

If you can, that would be good. I don't like to object based on
handwaving so I'll see if I can find any benchmarks that will give
better confidence. Those old TPC-C tests were good, and there was
some DB2 workload that was the reason I added gup fast in the first
place. I'll do some digging.

Thanks,
Nick


Re: [PATCH v3 0/6] mm: Further memory block device cleanups

2019-06-21 Thread Andrew Morton
On Fri, 21 Jun 2019 20:24:59 +0200 David Hildenbrand  wrote:

> @Qian Cai, unfortunately I can't reproduce.
> 
> If you get the chance, it would be great if you could retry with
> 
> diff --git a/drivers/base/memory.c b/drivers/base/memory.c
> index 972c5336bebf..742f99ddd148 100644
> --- a/drivers/base/memory.c
> +++ b/drivers/base/memory.c
> @@ -868,6 +868,9 @@ int walk_memory_blocks(unsigned long start, unsigned
> long size,
> unsigned long block_id;
> int ret = 0;
> 
> +   if (!size)
> +   return;
> +
> for (block_id = start_block_id; block_id <= end_block_id;
> block_id++) {
> mem = find_memory_block_by_id(block_id);
> if (!mem)
> 
> 
> 
> If both, start and size are 0, we would get a vry long loop. This
> would mean that we have an online node that does not span any pages at
> all (pgdat->node_start_pfn = 0, start_pfn + pgdat->node_spanned_pages = 0).

I think I'll make that a `return 0' and I won't drop patches 4-6 for
now, as we appear to have this fixed.



From: David Hildenbrand 
Subject: drivers-base-memoryc-get-rid-of-find_memory_block_hinted-v3-fix

handle zero-length walks

Link: http://lkml.kernel.org/r/1c2edc22-afd7-2211-c4c7-40e54e500...@redhat.com
Reported-by: Qian Cai 
Tested-by: Qian Cai 
Signed-off-by: Andrew Morton 
---

 drivers/base/memory.c |3 +++
 1 file changed, 3 insertions(+)

--- 
a/drivers/base/memory.c~drivers-base-memoryc-get-rid-of-find_memory_block_hinted-v3-fix
+++ a/drivers/base/memory.c
@@ -866,6 +866,9 @@ int walk_memory_blocks(unsigned long sta
unsigned long block_id;
int ret = 0;
 
+   if (!size)
+   return 0;
+
for (block_id = start_block_id; block_id <= end_block_id; block_id++) {
mem = find_memory_block_by_id(block_id);
if (!mem)




Re: [PATCH 11/13] powerpc/64s: Save r13 in machine_check_common_early

2019-06-21 Thread Nicholas Piggin
Mahesh J Salgaonkar's on June 21, 2019 9:47 pm:
> On 2019-06-21 06:27:15 Fri, Santosh Sivaraj wrote:
>> From: Reza Arbab 
>> 
>> Testing my memcpy_mcsafe() work in progress with an injected UE, I get
>> an error like this immediately after the function returns:
>> 
>> BUG: Unable to handle kernel data access at 0x7fff84dec8f8
>> Faulting instruction address: 0xc008009c00b0
>> Oops: Kernel access of bad area, sig: 11 [#1]
>> LE PAGE_SIZE=64K MMU=Radix MMU=Hash SMP NR_CPUS=2048 NUMA PowerNV
>> Modules linked in: mce(O+) vmx_crypto crc32c_vpmsum
>> CPU: 0 PID: 1375 Comm: modprobe Tainted: G   O  5.1.0-rc6 #267
>> NIP:  c008009c00b0 LR: c008009c00a8 CTR: c0095f90
>> REGS: c000ee197790 TRAP: 0300   Tainted: G   O   (5.1.0-rc6)
>> MSR:  9280b033   CR: 88002826  
>> XER: 0004
>> CFAR: c0095f8c DAR: 7fff84dec8f8 DSISR: 4000 IRQMASK: 0
>> GPR00: 6c6c6568 c000ee197a20 c008009c8400 fff2
>> GPR04: c008009c02e0 0006  c3c834c8
>> GPR08: 0080 776a6681b7fb5100  c008009c01c8
>> GPR12: c0095f90 7fff84debc00 4d071440 
>> GPR16: 00010601 c008009e c0c98dd8 c0c98d98
>> GPR20: c3bba970 c008009c04d0 c008009c0618 c01e5820
>> GPR24:  0100 0001 c3bba958
>> GPR28: c008009c02e8 c008009c0318 c008009c02e0 
>> NIP [c008009c00b0] cause_ue+0xa8/0xe8 [mce]
>> LR [c008009c00a8] cause_ue+0xa0/0xe8 [mce]
>> 
>> To fix, ensure that r13 is properly restored after an MCE.
>> 
>> Signed-off-by: Reza Arbab 
>> ---
>>  arch/powerpc/kernel/exceptions-64s.S | 1 +
>>  1 file changed, 1 insertion(+)
>> 
>> diff --git a/arch/powerpc/kernel/exceptions-64s.S 
>> b/arch/powerpc/kernel/exceptions-64s.S
>> index 311f1392a2ec..932d8d05892c 100644
>> --- a/arch/powerpc/kernel/exceptions-64s.S
>> +++ b/arch/powerpc/kernel/exceptions-64s.S
>> @@ -265,6 +265,7 @@ ALT_FTR_SECTION_END_IFSET(CPU_FTR_HVMODE)
>>  EXC_REAL_END(machine_check, 0x200, 0x100)
>>  EXC_VIRT_NONE(0x4200, 0x100)
>>  TRAMP_REAL_BEGIN(machine_check_common_early)
>> +SET_SCRATCH0(r13)   /* save r13 */
>>  EXCEPTION_PROLOG_1(PACA_EXMC, NOTEST, 0x200)
>>  /*
>>   * Register contents:
> 
> We do save r13 before we call machine_check_common_early(). I don't
> think I understood clearly how this change fixes the issue you are
> seeing. What am I missing here ?
> 
> Above change will push the paca pointer to scratch0 overwriting the
> original saved r13.
> 
> EXC_REAL_BEGIN(machine_check, 0x200, 0x100)
>   /* This is moved out of line as it can be patched by FW, but
>* some code path might still want to branch into the original
>* vector
>*/
>   SET_SCRATCH0(r13)   /* save r13 */
>   EXCEPTION_PROLOG_0(PACA_EXMC)
> BEGIN_FTR_SECTION
>   b   machine_check_common_early

Yep, from the stack trace, r13 is corrupted. So r13 must have got
corrupted before the machine check and this just happens to have
corrected it.

How does cause_ue work? It or memcpy_mcsafe must be corrupting
r13.

Thanks,
Nick



[PATCH] powerpc/64s/exception: Fix machine check early corrupting AMR

2019-06-21 Thread Nicholas Piggin
The early machine check runs in real mode, so locking is unnecessary.
Worse, the windup does not restore AMR, so this can result in a false
KUAP fault after a recoverable machine check hits inside a user copy
operation.

Fix this similarly to HMI by just avoiding the kuap lock in the
early machine check handler (it will be set by the late handler that
runs in virtual mode if that runs). If the virtual mode handler is
reached, it will lock and restore the AMR.

Fixes: 890274c2dc4c0 ("powerpc/64s: Implement KUAP for Radix MMU")
Cc: Russell Currey 
Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/exceptions-64s.S | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/exceptions-64s.S 
b/arch/powerpc/kernel/exceptions-64s.S
index 6b86055e5251..73ba246ca11d 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -315,7 +315,7 @@ TRAMP_REAL_BEGIN(machine_check_common_early)
mfspr   r11,SPRN_DSISR  /* Save DSISR */
std r11,_DSISR(r1)
std r9,_CCR(r1) /* Save CR in stackframe */
-   kuap_save_amr_and_lock r9, r10, cr1
+   /* We don't touch AMR here, we never go to virtual mode */
/* Save r9 through r13 from EXMC save area to stack frame. */
EXCEPTION_PROLOG_COMMON_2(PACA_EXMC)
mfmsr   r11 /* get MSR value */
-- 
2.20.1



Re: [PATCH v3 0/6] mm: Further memory block device cleanups

2019-06-21 Thread Qian Cai
On Fri, 2019-06-21 at 20:24 +0200, David Hildenbrand wrote:
> On 21.06.19 17:15, Qian Cai wrote:
> > On Thu, 2019-06-20 at 20:31 +0200, David Hildenbrand wrote:
> > > @Andrew: Only patch 1, 4 and 6 changed compared to v1.
> > > 
> > > Some further cleanups around memory block devices. Especially, clean up
> > > and simplify walk_memory_range(). Including some other minor cleanups.
> > > 
> > > Compiled + tested on x86 with DIMMs under QEMU. Compile-tested on ppc64.
> > > 
> > > v2 -> v3:
> > > - "mm/memory_hotplug: Rename walk_memory_range() and pass start+size .."
> > > -- Avoid warning on ppc.
> > > - "drivers/base/memory.c: Get rid of find_memory_block_hinted()"
> > > -- Fixup a comment regarding hinted devices.
> > > 
> > > v1 -> v2:
> > > - "mm: Section numbers use the type "unsigned long""
> > > -- "unsigned long i" -> "unsigned long nr", in one case -> "int i"
> > > - "drivers/base/memory.c: Get rid of find_memory_block_hinted("
> > > -- Fix compilation error
> > > -- Get rid of the "hint" parameter completely
> > > 
> > > David Hildenbrand (6):
> > >   mm: Section numbers use the type "unsigned long"
> > >   drivers/base/memory: Use "unsigned long" for block ids
> > >   mm: Make register_mem_sect_under_node() static
> > >   mm/memory_hotplug: Rename walk_memory_range() and pass start+size
> > > instead of pfns
> > >   mm/memory_hotplug: Move and simplify walk_memory_blocks()
> > >   drivers/base/memory.c: Get rid of find_memory_block_hinted()
> > > 
> > >  arch/powerpc/platforms/powernv/memtrace.c |  23 ++---
> > >  drivers/acpi/acpi_memhotplug.c|  19 +---
> > >  drivers/base/memory.c | 120 +-
> > >  drivers/base/node.c   |   8 +-
> > >  include/linux/memory.h|   5 +-
> > >  include/linux/memory_hotplug.h|   2 -
> > >  include/linux/mmzone.h|   4 +-
> > >  include/linux/node.h  |   7 --
> > >  mm/memory_hotplug.c   |  57 +-
> > >  mm/sparse.c   |  12 +--
> > >  10 files changed, 106 insertions(+), 151 deletions(-)
> > > 
> > 
> > This series causes a few machines are unable to boot triggering endless soft
> > lockups. Reverted those commits fixed the issue.
> > 
> > 97f4217d1da0 Revert "mm/memory_hotplug: rename walk_memory_range() and pass
> > start+size instead of pfns"
> > c608eebf33c6 Revert "mm-memory_hotplug-rename-walk_memory_range-and-pass-
> > startsize-instead-of-pfns-fix"
> > 34b5e4ab7558 Revert "mm/memory_hotplug: move and simplify
> > walk_memory_blocks()"
> > 59a9f3eec5d1 Revert "drivers/base/memory.c: Get rid of
> > find_memory_block_hinted()"
> > 5cfcd52288b6 Revert "drivers-base-memoryc-get-rid-of-
> > find_memory_block_hinted-
> > v3"
> > 
> > [4.582081][T1] ACPI FADT declares the system doesn't support PCIe
> > ASPM,
> > so disable it
> > [4.590405][T1] ACPI: bus type PCI registered
> > [4.592908][T1] PCI: MMCONFIG for domain  [bus 00-ff] at [mem
> > 0x8000-0x8fff] (base 0x8000)
> > [4.601860][T1] PCI: MMCONFIG at [mem 0x8000-0x8fff] reserved
> > in
> > E820
> > [4.601860][T1] PCI: Using configuration type 1 for base access
> > [   28.661336][   C16] watchdog: BUG: soft lockup - CPU#16 stuck for 22s!
> > [swapper/0:1]
> > [   28.671351][   C16] Modules linked in:
> > [   28.671354][   C16] CPU: 16 PID: 1 Comm: swapper/0 Not tainted 5.2.0-rc5-
> > next-20190621+ #1
> > [   28.681366][   C16] Hardware name: HPE ProLiant DL385 Gen10/ProLiant
> > DL385
> > Gen10, BIOS A40 03/09/2018
> > [   28.691334][   C16] RIP: 0010:_raw_spin_unlock_irqrestore+0x2f/0x40
> > [   28.701334][   C16] Code: 55 48 89 e5 41 54 49 89 f4 be 01 00 00 00 53 48
> > 8b
> > 55 08 48 89 fb 48 8d 7f 18 e8 4c 89 7d ff 48 89 df e8 94 f9 7d ff 41 54 9d
> > <65>
> > ff 0d c2 44 8d 48 5b 41 5c 5d c3 0f 1f 44 00 00 0f 1f 44 00 00
> > [   28.711354][   C16] RSP: 0018:888205b27bf8 EFLAGS: 0246 ORIG_RAX:
> > ff13
> > [   28.721372][   C16] RAX:  RBX: 8882053d6138 RCX:
> > b6f2a3b8
> > [   28.7313

Re: [PATCH v3 0/6] mm: Further memory block device cleanups

2019-06-21 Thread David Hildenbrand
On 21.06.19 21:07, Qian Cai wrote:
> On Fri, 2019-06-21 at 20:56 +0200, David Hildenbrand wrote:
>> On 21.06.19 20:24, David Hildenbrand wrote:
>>> On 21.06.19 17:15, Qian Cai wrote:
>>>> On Thu, 2019-06-20 at 20:31 +0200, David Hildenbrand wrote:
>>>>> @Andrew: Only patch 1, 4 and 6 changed compared to v1.
>>>>>
>>>>> Some further cleanups around memory block devices. Especially, clean up
>>>>> and simplify walk_memory_range(). Including some other minor cleanups.
>>>>>
>>>>> Compiled + tested on x86 with DIMMs under QEMU. Compile-tested on ppc64.
>>>>>
>>>>> v2 -> v3:
>>>>> - "mm/memory_hotplug: Rename walk_memory_range() and pass start+size .."
>>>>> -- Avoid warning on ppc.
>>>>> - "drivers/base/memory.c: Get rid of find_memory_block_hinted()"
>>>>> -- Fixup a comment regarding hinted devices.
>>>>>
>>>>> v1 -> v2:
>>>>> - "mm: Section numbers use the type "unsigned long""
>>>>> -- "unsigned long i" -> "unsigned long nr", in one case -> "int i"
>>>>> - "drivers/base/memory.c: Get rid of find_memory_block_hinted("
>>>>> -- Fix compilation error
>>>>> -- Get rid of the "hint" parameter completely
>>>>>
>>>>> David Hildenbrand (6):
>>>>>   mm: Section numbers use the type "unsigned long"
>>>>>   drivers/base/memory: Use "unsigned long" for block ids
>>>>>   mm: Make register_mem_sect_under_node() static
>>>>>   mm/memory_hotplug: Rename walk_memory_range() and pass start+size
>>>>> instead of pfns
>>>>>   mm/memory_hotplug: Move and simplify walk_memory_blocks()
>>>>>   drivers/base/memory.c: Get rid of find_memory_block_hinted()
>>>>>
>>>>>  arch/powerpc/platforms/powernv/memtrace.c |  23 ++---
>>>>>  drivers/acpi/acpi_memhotplug.c|  19 +---
>>>>>  drivers/base/memory.c | 120 +-
>>>>>  drivers/base/node.c   |   8 +-
>>>>>  include/linux/memory.h|   5 +-
>>>>>  include/linux/memory_hotplug.h|   2 -
>>>>>  include/linux/mmzone.h|   4 +-
>>>>>  include/linux/node.h  |   7 --
>>>>>  mm/memory_hotplug.c   |  57 +-
>>>>>  mm/sparse.c   |  12 +--
>>>>>  10 files changed, 106 insertions(+), 151 deletions(-)
>>>>>
>>>>
>>>> This series causes a few machines are unable to boot triggering endless
>>>> soft
>>>> lockups. Reverted those commits fixed the issue.
>>>>
>>>> 97f4217d1da0 Revert "mm/memory_hotplug: rename walk_memory_range() and
>>>> pass
>>>> start+size instead of pfns"
>>>> c608eebf33c6 Revert "mm-memory_hotplug-rename-walk_memory_range-and-pass-
>>>> startsize-instead-of-pfns-fix"
>>>> 34b5e4ab7558 Revert "mm/memory_hotplug: move and simplify
>>>> walk_memory_blocks()"
>>>> 59a9f3eec5d1 Revert "drivers/base/memory.c: Get rid of
>>>> find_memory_block_hinted()"
>>>> 5cfcd52288b6 Revert "drivers-base-memoryc-get-rid-of-
>>>> find_memory_block_hinted-
>>>> v3"
>>>>
>>>> [4.582081][T1] ACPI FADT declares the system doesn't support PCIe
>>>> ASPM,
>>>> so disable it
>>>> [4.590405][T1] ACPI: bus type PCI registered
>>>> [4.592908][T1] PCI: MMCONFIG for domain  [bus 00-ff] at [mem
>>>> 0x8000-0x8fff] (base 0x8000)
>>>> [4.601860][T1] PCI: MMCONFIG at [mem 0x8000-0x8fff]
>>>> reserved in
>>>> E820
>>>> [4.601860][T1] PCI: Using configuration type 1 for base access
>>>> [   28.661336][   C16] watchdog: BUG: soft lockup - CPU#16 stuck for 22s!
>>>> [swapper/0:1]
>>>> [   28.671351][   C16] Modules linked in:
>>>> [   28.671354][   C16] CPU: 16 PID: 1 Comm: swapper/0 Not tainted 5.2.0-
>>>> rc5-
>>>> next-20190621+ #1
>>>> [   28.681366][   C16] Hardware name: HPE ProLiant DL385 Gen10/ProLiant
>>>> DL385
&g

Re: [PATCH v3 0/6] mm: Further memory block device cleanups

2019-06-21 Thread Qian Cai
On Fri, 2019-06-21 at 20:56 +0200, David Hildenbrand wrote:
> On 21.06.19 20:24, David Hildenbrand wrote:
> > On 21.06.19 17:15, Qian Cai wrote:
> > > On Thu, 2019-06-20 at 20:31 +0200, David Hildenbrand wrote:
> > > > @Andrew: Only patch 1, 4 and 6 changed compared to v1.
> > > > 
> > > > Some further cleanups around memory block devices. Especially, clean up
> > > > and simplify walk_memory_range(). Including some other minor cleanups.
> > > > 
> > > > Compiled + tested on x86 with DIMMs under QEMU. Compile-tested on ppc64.
> > > > 
> > > > v2 -> v3:
> > > > - "mm/memory_hotplug: Rename walk_memory_range() and pass start+size .."
> > > > -- Avoid warning on ppc.
> > > > - "drivers/base/memory.c: Get rid of find_memory_block_hinted()"
> > > > -- Fixup a comment regarding hinted devices.
> > > > 
> > > > v1 -> v2:
> > > > - "mm: Section numbers use the type "unsigned long""
> > > > -- "unsigned long i" -> "unsigned long nr", in one case -> "int i"
> > > > - "drivers/base/memory.c: Get rid of find_memory_block_hinted("
> > > > -- Fix compilation error
> > > > -- Get rid of the "hint" parameter completely
> > > > 
> > > > David Hildenbrand (6):
> > > >   mm: Section numbers use the type "unsigned long"
> > > >   drivers/base/memory: Use "unsigned long" for block ids
> > > >   mm: Make register_mem_sect_under_node() static
> > > >   mm/memory_hotplug: Rename walk_memory_range() and pass start+size
> > > > instead of pfns
> > > >   mm/memory_hotplug: Move and simplify walk_memory_blocks()
> > > >   drivers/base/memory.c: Get rid of find_memory_block_hinted()
> > > > 
> > > >  arch/powerpc/platforms/powernv/memtrace.c |  23 ++---
> > > >  drivers/acpi/acpi_memhotplug.c|  19 +---
> > > >  drivers/base/memory.c | 120 +-
> > > >  drivers/base/node.c   |   8 +-
> > > >  include/linux/memory.h|   5 +-
> > > >  include/linux/memory_hotplug.h|   2 -
> > > >  include/linux/mmzone.h|   4 +-
> > > >  include/linux/node.h  |   7 --
> > > >  mm/memory_hotplug.c   |  57 +-
> > > >  mm/sparse.c   |  12 +--
> > > >  10 files changed, 106 insertions(+), 151 deletions(-)
> > > > 
> > > 
> > > This series causes a few machines are unable to boot triggering endless
> > > soft
> > > lockups. Reverted those commits fixed the issue.
> > > 
> > > 97f4217d1da0 Revert "mm/memory_hotplug: rename walk_memory_range() and
> > > pass
> > > start+size instead of pfns"
> > > c608eebf33c6 Revert "mm-memory_hotplug-rename-walk_memory_range-and-pass-
> > > startsize-instead-of-pfns-fix"
> > > 34b5e4ab7558 Revert "mm/memory_hotplug: move and simplify
> > > walk_memory_blocks()"
> > > 59a9f3eec5d1 Revert "drivers/base/memory.c: Get rid of
> > > find_memory_block_hinted()"
> > > 5cfcd52288b6 Revert "drivers-base-memoryc-get-rid-of-
> > > find_memory_block_hinted-
> > > v3"
> > > 
> > > [4.582081][T1] ACPI FADT declares the system doesn't support PCIe
> > > ASPM,
> > > so disable it
> > > [4.590405][T1] ACPI: bus type PCI registered
> > > [4.592908][T1] PCI: MMCONFIG for domain  [bus 00-ff] at [mem
> > > 0x8000-0x8fff] (base 0x8000)
> > > [4.601860][T1] PCI: MMCONFIG at [mem 0x8000-0x8fff]
> > > reserved in
> > > E820
> > > [4.601860][T1] PCI: Using configuration type 1 for base access
> > > [   28.661336][   C16] watchdog: BUG: soft lockup - CPU#16 stuck for 22s!
> > > [swapper/0:1]
> > > [   28.671351][   C16] Modules linked in:
> > > [   28.671354][   C16] CPU: 16 PID: 1 Comm: swapper/0 Not tainted 5.2.0-
> > > rc5-
> > > next-20190621+ #1
> > > [   28.681366][   C16] Hardware name: HPE ProLiant DL385 Gen10/ProLiant
> > > DL385
> > > Gen10, BIOS A40 03/09/2018
> > > [   28.691334][   C16] RIP: 0010:_raw_spin_unlock_irqrestore+0x2f/0x40
> > > [   28.701334][   C16] Code

Re: [PATCH v3 0/6] mm: Further memory block device cleanups

2019-06-21 Thread David Hildenbrand
On 21.06.19 20:24, David Hildenbrand wrote:
> On 21.06.19 17:15, Qian Cai wrote:
>> On Thu, 2019-06-20 at 20:31 +0200, David Hildenbrand wrote:
>>> @Andrew: Only patch 1, 4 and 6 changed compared to v1.
>>>
>>> Some further cleanups around memory block devices. Especially, clean up
>>> and simplify walk_memory_range(). Including some other minor cleanups.
>>>
>>> Compiled + tested on x86 with DIMMs under QEMU. Compile-tested on ppc64.
>>>
>>> v2 -> v3:
>>> - "mm/memory_hotplug: Rename walk_memory_range() and pass start+size .."
>>> -- Avoid warning on ppc.
>>> - "drivers/base/memory.c: Get rid of find_memory_block_hinted()"
>>> -- Fixup a comment regarding hinted devices.
>>>
>>> v1 -> v2:
>>> - "mm: Section numbers use the type "unsigned long""
>>> -- "unsigned long i" -> "unsigned long nr", in one case -> "int i"
>>> - "drivers/base/memory.c: Get rid of find_memory_block_hinted("
>>> -- Fix compilation error
>>> -- Get rid of the "hint" parameter completely
>>>
>>> David Hildenbrand (6):
>>>   mm: Section numbers use the type "unsigned long"
>>>   drivers/base/memory: Use "unsigned long" for block ids
>>>   mm: Make register_mem_sect_under_node() static
>>>   mm/memory_hotplug: Rename walk_memory_range() and pass start+size
>>> instead of pfns
>>>   mm/memory_hotplug: Move and simplify walk_memory_blocks()
>>>   drivers/base/memory.c: Get rid of find_memory_block_hinted()
>>>
>>>  arch/powerpc/platforms/powernv/memtrace.c |  23 ++---
>>>  drivers/acpi/acpi_memhotplug.c|  19 +---
>>>  drivers/base/memory.c | 120 +-
>>>  drivers/base/node.c   |   8 +-
>>>  include/linux/memory.h|   5 +-
>>>  include/linux/memory_hotplug.h|   2 -
>>>  include/linux/mmzone.h|   4 +-
>>>  include/linux/node.h  |   7 --
>>>  mm/memory_hotplug.c   |  57 +-
>>>  mm/sparse.c   |  12 +--
>>>  10 files changed, 106 insertions(+), 151 deletions(-)
>>>
>>
>> This series causes a few machines are unable to boot triggering endless soft
>> lockups. Reverted those commits fixed the issue.
>>
>> 97f4217d1da0 Revert "mm/memory_hotplug: rename walk_memory_range() and pass
>> start+size instead of pfns"
>> c608eebf33c6 Revert "mm-memory_hotplug-rename-walk_memory_range-and-pass-
>> startsize-instead-of-pfns-fix"
>> 34b5e4ab7558 Revert "mm/memory_hotplug: move and simplify 
>> walk_memory_blocks()"
>> 59a9f3eec5d1 Revert "drivers/base/memory.c: Get rid of
>> find_memory_block_hinted()"
>> 5cfcd52288b6 Revert 
>> "drivers-base-memoryc-get-rid-of-find_memory_block_hinted-
>> v3"
>>
>> [4.582081][T1] ACPI FADT declares the system doesn't support PCIe 
>> ASPM,
>> so disable it
>> [4.590405][T1] ACPI: bus type PCI registered
>> [4.592908][T1] PCI: MMCONFIG for domain  [bus 00-ff] at [mem
>> 0x8000-0x8fff] (base 0x8000)
>> [4.601860][T1] PCI: MMCONFIG at [mem 0x8000-0x8fff] reserved 
>> in
>> E820
>> [4.601860][T1] PCI: Using configuration type 1 for base access
>> [   28.661336][   C16] watchdog: BUG: soft lockup - CPU#16 stuck for 22s!
>> [swapper/0:1]
>> [   28.671351][   C16] Modules linked in:
>> [   28.671354][   C16] CPU: 16 PID: 1 Comm: swapper/0 Not tainted 5.2.0-rc5-
>> next-20190621+ #1
>> [   28.681366][   C16] Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385
>> Gen10, BIOS A40 03/09/2018
>> [   28.691334][   C16] RIP: 0010:_raw_spin_unlock_irqrestore+0x2f/0x40
>> [   28.701334][   C16] Code: 55 48 89 e5 41 54 49 89 f4 be 01 00 00 00 53 48 
>> 8b
>> 55 08 48 89 fb 48 8d 7f 18 e8 4c 89 7d ff 48 89 df e8 94 f9 7d ff 41 54 9d 
>> <65>
>> ff 0d c2 44 8d 48 5b 41 5c 5d c3 0f 1f 44 00 00 0f 1f 44 00 00
>> [   28.711354][   C16] RSP: 0018:888205b27bf8 EFLAGS: 0246 ORIG_RAX:
>> ff13
>> [   28.721372][   C16] RAX:  RBX: 8882053d6138 RCX:
>> b6f2a3b8
>> [   28.731371][   C16] RDX: 111040a7ac27 RSI: dc00 RDI:
>> 8882053d6138
>> [   28.741371][   C16] RBP: 888205b27c08 R08: ed1040a7ac2

Re: [PATCH v3 0/6] mm: Further memory block device cleanups

2019-06-21 Thread David Hildenbrand
On 21.06.19 17:15, Qian Cai wrote:
> On Thu, 2019-06-20 at 20:31 +0200, David Hildenbrand wrote:
>> @Andrew: Only patch 1, 4 and 6 changed compared to v1.
>>
>> Some further cleanups around memory block devices. Especially, clean up
>> and simplify walk_memory_range(). Including some other minor cleanups.
>>
>> Compiled + tested on x86 with DIMMs under QEMU. Compile-tested on ppc64.
>>
>> v2 -> v3:
>> - "mm/memory_hotplug: Rename walk_memory_range() and pass start+size .."
>> -- Avoid warning on ppc.
>> - "drivers/base/memory.c: Get rid of find_memory_block_hinted()"
>> -- Fixup a comment regarding hinted devices.
>>
>> v1 -> v2:
>> - "mm: Section numbers use the type "unsigned long""
>> -- "unsigned long i" -> "unsigned long nr", in one case -> "int i"
>> - "drivers/base/memory.c: Get rid of find_memory_block_hinted("
>> -- Fix compilation error
>> -- Get rid of the "hint" parameter completely
>>
>> David Hildenbrand (6):
>>   mm: Section numbers use the type "unsigned long"
>>   drivers/base/memory: Use "unsigned long" for block ids
>>   mm: Make register_mem_sect_under_node() static
>>   mm/memory_hotplug: Rename walk_memory_range() and pass start+size
>> instead of pfns
>>   mm/memory_hotplug: Move and simplify walk_memory_blocks()
>>   drivers/base/memory.c: Get rid of find_memory_block_hinted()
>>
>>  arch/powerpc/platforms/powernv/memtrace.c |  23 ++---
>>  drivers/acpi/acpi_memhotplug.c|  19 +---
>>  drivers/base/memory.c | 120 +-
>>  drivers/base/node.c   |   8 +-
>>  include/linux/memory.h|   5 +-
>>  include/linux/memory_hotplug.h|   2 -
>>  include/linux/mmzone.h|   4 +-
>>  include/linux/node.h  |   7 --
>>  mm/memory_hotplug.c   |  57 +-
>>  mm/sparse.c   |  12 +--
>>  10 files changed, 106 insertions(+), 151 deletions(-)
>>
> 
> This series causes a few machines are unable to boot triggering endless soft
> lockups. Reverted those commits fixed the issue.
> 
> 97f4217d1da0 Revert "mm/memory_hotplug: rename walk_memory_range() and pass
> start+size instead of pfns"
> c608eebf33c6 Revert "mm-memory_hotplug-rename-walk_memory_range-and-pass-
> startsize-instead-of-pfns-fix"
> 34b5e4ab7558 Revert "mm/memory_hotplug: move and simplify 
> walk_memory_blocks()"
> 59a9f3eec5d1 Revert "drivers/base/memory.c: Get rid of
> find_memory_block_hinted()"
> 5cfcd52288b6 Revert "drivers-base-memoryc-get-rid-of-find_memory_block_hinted-
> v3"
> 
> [4.582081][T1] ACPI FADT declares the system doesn't support PCIe 
> ASPM,
> so disable it
> [4.590405][T1] ACPI: bus type PCI registered
> [4.592908][T1] PCI: MMCONFIG for domain  [bus 00-ff] at [mem
> 0x8000-0x8fff] (base 0x8000)
> [4.601860][T1] PCI: MMCONFIG at [mem 0x8000-0x8fff] reserved 
> in
> E820
> [4.601860][T1] PCI: Using configuration type 1 for base access
> [   28.661336][   C16] watchdog: BUG: soft lockup - CPU#16 stuck for 22s!
> [swapper/0:1]
> [   28.671351][   C16] Modules linked in:
> [   28.671354][   C16] CPU: 16 PID: 1 Comm: swapper/0 Not tainted 5.2.0-rc5-
> next-20190621+ #1
> [   28.681366][   C16] Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385
> Gen10, BIOS A40 03/09/2018
> [   28.691334][   C16] RIP: 0010:_raw_spin_unlock_irqrestore+0x2f/0x40
> [   28.701334][   C16] Code: 55 48 89 e5 41 54 49 89 f4 be 01 00 00 00 53 48 
> 8b
> 55 08 48 89 fb 48 8d 7f 18 e8 4c 89 7d ff 48 89 df e8 94 f9 7d ff 41 54 9d 
> <65>
> ff 0d c2 44 8d 48 5b 41 5c 5d c3 0f 1f 44 00 00 0f 1f 44 00 00
> [   28.711354][   C16] RSP: 0018:888205b27bf8 EFLAGS: 0246 ORIG_RAX:
> ff13
> [   28.721372][   C16] RAX:  RBX: 8882053d6138 RCX:
> b6f2a3b8
> [   28.731371][   C16] RDX: 111040a7ac27 RSI: dc00 RDI:
> 8882053d6138
> [   28.741371][   C16] RBP: 888205b27c08 R08: ed1040a7ac28 R09:
> ed1040a7ac27
> [   28.751334][   C16] R10: ed1040a7ac27 R11: 8882053d613b R12:
> 0246
> [   28.751370][   C16] R13: 888205b27c98 R14: 8884504d0a20 R15:
> 
> [   28.761368][   C16] FS:  () GS:88845450()
> knlGS:
> [   28.771373][   C16] CS:  0

Re: [PATCH 01/16] mm: use untagged_addr() for get_user_pages_fast addresses

2019-06-21 Thread Jason Gunthorpe
On Fri, Jun 21, 2019 at 09:35:11AM -0600, Khalid Aziz wrote:
> On 6/21/19 7:39 AM, Jason Gunthorpe wrote:
> > On Tue, Jun 11, 2019 at 04:40:47PM +0200, Christoph Hellwig wrote:
> >> This will allow sparc64 to override its ADI tags for
> >> get_user_pages and get_user_pages_fast.
> >>
> >> Signed-off-by: Christoph Hellwig 
> >>  mm/gup.c | 4 ++--
> >>  1 file changed, 2 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/mm/gup.c b/mm/gup.c
> >> index ddde097cf9e4..6bb521db67ec 100644
> >> +++ b/mm/gup.c
> >> @@ -2146,7 +2146,7 @@ int __get_user_pages_fast(unsigned long start, int 
> >> nr_pages, int write,
> >>unsigned long flags;
> >>int nr = 0;
> >>  
> >> -  start &= PAGE_MASK;
> >> +  start = untagged_addr(start) & PAGE_MASK;
> >>len = (unsigned long) nr_pages << PAGE_SHIFT;
> >>end = start + len;
> > 
> > Hmm, this function, and the other, goes on to do:
> > 
> > if (unlikely(!access_ok((void __user *)start, len)))
> > return 0;
> > 
> > and I thought that access_ok takes in the tagged pointer?
> > 
> > How about re-order it a bit?
> 
> access_ok() can handle tagged or untagged pointers. It just strips the
> tag bits from the top bits. Current order doesn't really matter from
> functionality point of view. There might be minor gain in delaying
> untagging in __get_user_pages_fast() but I could go either way.

I understand the current ARM and SPARC implementations don't do much
with the tags, but it feels like a really big assumption for the core
code that all future uses of tags will be fine to have them stripped
out of 'void __user *' pointers. IMHO that is something we should not
be doing in the core kernel..

Jason


Re: [PATCH 01/16] mm: use untagged_addr() for get_user_pages_fast addresses

2019-06-21 Thread Khalid Aziz
On 6/21/19 7:39 AM, Jason Gunthorpe wrote:
> On Tue, Jun 11, 2019 at 04:40:47PM +0200, Christoph Hellwig wrote:
>> This will allow sparc64 to override its ADI tags for
>> get_user_pages and get_user_pages_fast.
>>
>> Signed-off-by: Christoph Hellwig 
>>  mm/gup.c | 4 ++--
>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/mm/gup.c b/mm/gup.c
>> index ddde097cf9e4..6bb521db67ec 100644
>> +++ b/mm/gup.c
>> @@ -2146,7 +2146,7 @@ int __get_user_pages_fast(unsigned long start, int 
>> nr_pages, int write,
>>  unsigned long flags;
>>  int nr = 0;
>>  
>> -start &= PAGE_MASK;
>> +start = untagged_addr(start) & PAGE_MASK;
>>  len = (unsigned long) nr_pages << PAGE_SHIFT;
>>  end = start + len;
> 
> Hmm, this function, and the other, goes on to do:
> 
> if (unlikely(!access_ok((void __user *)start, len)))
> return 0;
> 
> and I thought that access_ok takes in the tagged pointer?
> 
> How about re-order it a bit?

access_ok() can handle tagged or untagged pointers. It just strips the
tag bits from the top bits. Current order doesn't really matter from
functionality point of view. There might be minor gain in delaying
untagging in __get_user_pages_fast() but I could go either way.

--
Khalid

> 
> diff --git a/mm/gup.c b/mm/gup.c
> index ddde097cf9e410..f48747ced4723b 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -2148,11 +2148,12 @@ int __get_user_pages_fast(unsigned long start, int 
> nr_pages, int write,
>  
>   start &= PAGE_MASK;
>   len = (unsigned long) nr_pages << PAGE_SHIFT;
> - end = start + len;
> -
>   if (unlikely(!access_ok((void __user *)start, len)))
>   return 0;
>  
> + start = untagged_ptr(start);
> + end = start + len;
> +
>   /*
>* Disable interrupts.  We use the nested form as we can already have
>* interrupts disabled by get_futex_key.
> 




Re: [PATCH v3 5/6] mm/memory_hotplug: Move and simplify walk_memory_blocks()

2019-06-21 Thread David Hildenbrand
On 20.06.19 20:31, David Hildenbrand wrote:
> Let's move walk_memory_blocks() to the place where memory block logic
> resides and simplify it. While at it, add a type for the callback function.
> 
> Cc: Greg Kroah-Hartman 
> Cc: "Rafael J. Wysocki" 
> Cc: David Hildenbrand 
> Cc: Andrew Morton 
> Cc: Stephen Rothwell 
> Cc: Pavel Tatashin 
> Cc: Andrew Banman 
> Cc: "mike.tra...@hpe.com" 
> Cc: Oscar Salvador 
> Cc: Michal Hocko 
> Cc: Wei Yang 
> Cc: Arun KS 
> Cc: Qian Cai 
> Signed-off-by: David Hildenbrand 
> ---
>  drivers/base/memory.c  | 42 ++
>  include/linux/memory.h |  3 ++
>  include/linux/memory_hotplug.h |  2 --
>  mm/memory_hotplug.c| 55 --
>  4 files changed, 45 insertions(+), 57 deletions(-)
> 
> diff --git a/drivers/base/memory.c b/drivers/base/memory.c
> index c54e80fd25a8..0204384b4d1d 100644
> --- a/drivers/base/memory.c
> +++ b/drivers/base/memory.c
> @@ -44,6 +44,11 @@ static inline unsigned long pfn_to_block_id(unsigned long 
> pfn)
>   return base_memory_block_id(pfn_to_section_nr(pfn));
>  }
>  
> +static inline unsigned long phys_to_block_id(unsigned long phys)
> +{
> + return pfn_to_block_id(PFN_DOWN(phys));
> +}
> +
>  static int memory_subsys_online(struct device *dev);
>  static int memory_subsys_offline(struct device *dev);
>  
> @@ -851,3 +856,40 @@ int __init memory_dev_init(void)
>   printk(KERN_ERR "%s() failed: %d\n", __func__, ret);
>   return ret;
>  }
> +
> +/**
> + * walk_memory_blocks - walk through all present memory blocks overlapped
> + *   by the range [start, start + size)
> + *
> + * @start: start address of the memory range
> + * @size: size of the memory range
> + * @arg: argument passed to func
> + * @func: callback for each memory section walked
> + *
> + * This function walks through all present memory blocks overlapped by the
> + * range [start, start + size), calling func on each memory block.
> + *
> + * In case func() returns an error, walking is aborted and the error is
> + * returned.
> + */
> +int walk_memory_blocks(unsigned long start, unsigned long size,
> +void *arg, walk_memory_blocks_func_t func)
> +{
> + const unsigned long start_block_id = phys_to_block_id(start);
> + const unsigned long end_block_id = phys_to_block_id(start + size - 1);
> + struct memory_block *mem;
> + unsigned long block_id;
> + int ret = 0;

I *guess* the stall we are seeing is when size = 0.

(via ACPI, if info->length is 0)

if (!size)
return 0;

... but that is just a wild guess. Will have a look after my vacation.

-- 

Thanks,

David / dhildenb


Re: [PATCH v3 0/6] mm: Further memory block device cleanups

2019-06-21 Thread David Hildenbrand
On 21.06.19 17:15, Qian Cai wrote:
> On Thu, 2019-06-20 at 20:31 +0200, David Hildenbrand wrote:
>> @Andrew: Only patch 1, 4 and 6 changed compared to v1.
>>
>> Some further cleanups around memory block devices. Especially, clean up
>> and simplify walk_memory_range(). Including some other minor cleanups.
>>
>> Compiled + tested on x86 with DIMMs under QEMU. Compile-tested on ppc64.
>>
>> v2 -> v3:
>> - "mm/memory_hotplug: Rename walk_memory_range() and pass start+size .."
>> -- Avoid warning on ppc.
>> - "drivers/base/memory.c: Get rid of find_memory_block_hinted()"
>> -- Fixup a comment regarding hinted devices.
>>
>> v1 -> v2:
>> - "mm: Section numbers use the type "unsigned long""
>> -- "unsigned long i" -> "unsigned long nr", in one case -> "int i"
>> - "drivers/base/memory.c: Get rid of find_memory_block_hinted("
>> -- Fix compilation error
>> -- Get rid of the "hint" parameter completely
>>
>> David Hildenbrand (6):
>>   mm: Section numbers use the type "unsigned long"
>>   drivers/base/memory: Use "unsigned long" for block ids
>>   mm: Make register_mem_sect_under_node() static
>>   mm/memory_hotplug: Rename walk_memory_range() and pass start+size
>> instead of pfns
>>   mm/memory_hotplug: Move and simplify walk_memory_blocks()
>>   drivers/base/memory.c: Get rid of find_memory_block_hinted()
>>
>>  arch/powerpc/platforms/powernv/memtrace.c |  23 ++---
>>  drivers/acpi/acpi_memhotplug.c|  19 +---
>>  drivers/base/memory.c | 120 +-
>>  drivers/base/node.c   |   8 +-
>>  include/linux/memory.h|   5 +-
>>  include/linux/memory_hotplug.h|   2 -
>>  include/linux/mmzone.h|   4 +-
>>  include/linux/node.h  |   7 --
>>  mm/memory_hotplug.c   |  57 +-
>>  mm/sparse.c   |  12 +--
>>  10 files changed, 106 insertions(+), 151 deletions(-)
>>
> 
> This series causes a few machines are unable to boot triggering endless soft
> lockups. Reverted those commits fixed the issue.
> 
> 97f4217d1da0 Revert "mm/memory_hotplug: rename walk_memory_range() and pass
> start+size instead of pfns"
> c608eebf33c6 Revert "mm-memory_hotplug-rename-walk_memory_range-and-pass-
> startsize-instead-of-pfns-fix"
> 34b5e4ab7558 Revert "mm/memory_hotplug: move and simplify 
> walk_memory_blocks()"
> 59a9f3eec5d1 Revert "drivers/base/memory.c: Get rid of
> find_memory_block_hinted()"
> 5cfcd52288b6 Revert "drivers-base-memoryc-get-rid-of-find_memory_block_hinted-
> v3"
> 
> [4.582081][T1] ACPI FADT declares the system doesn't support PCIe 
> ASPM,
> so disable it
> [4.590405][T1] ACPI: bus type PCI registered
> [4.592908][T1] PCI: MMCONFIG for domain  [bus 00-ff] at [mem
> 0x8000-0x8fff] (base 0x8000)
> [4.601860][T1] PCI: MMCONFIG at [mem 0x8000-0x8fff] reserved 
> in
> E820
> [4.601860][T1] PCI: Using configuration type 1 for base access
> [   28.661336][   C16] watchdog: BUG: soft lockup - CPU#16 stuck for 22s!
> [swapper/0:1]
> [   28.671351][   C16] Modules linked in:
> [   28.671354][   C16] CPU: 16 PID: 1 Comm: swapper/0 Not tainted 5.2.0-rc5-
> next-20190621+ #1
> [   28.681366][   C16] Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385
> Gen10, BIOS A40 03/09/2018
> [   28.691334][   C16] RIP: 0010:_raw_spin_unlock_irqrestore+0x2f/0x40
> [   28.701334][   C16] Code: 55 48 89 e5 41 54 49 89 f4 be 01 00 00 00 53 48 
> 8b
> 55 08 48 89 fb 48 8d 7f 18 e8 4c 89 7d ff 48 89 df e8 94 f9 7d ff 41 54 9d 
> <65>
> ff 0d c2 44 8d 48 5b 41 5c 5d c3 0f 1f 44 00 00 0f 1f 44 00 00
> [   28.711354][   C16] RSP: 0018:888205b27bf8 EFLAGS: 0246 ORIG_RAX:
> ff13
> [   28.721372][   C16] RAX:  RBX: 8882053d6138 RCX:
> b6f2a3b8
> [   28.731371][   C16] RDX: 111040a7ac27 RSI: dc00 RDI:
> 8882053d6138
> [   28.741371][   C16] RBP: 888205b27c08 R08: ed1040a7ac28 R09:
> ed1040a7ac27
> [   28.751334][   C16] R10: ed1040a7ac27 R11: 8882053d613b R12:
> 0246
> [   28.751370][   C16] R13: 888205b27c98 R14: 8884504d0a20 R15:
> 
> [   28.761368][   C16] FS:  () GS:88845450()
> knlGS:
> [   28.771373][   C16] CS:  0

Re: [PATCH v3 0/6] mm: Further memory block device cleanups

2019-06-21 Thread Qian Cai
On Thu, 2019-06-20 at 20:31 +0200, David Hildenbrand wrote:
> @Andrew: Only patch 1, 4 and 6 changed compared to v1.
> 
> Some further cleanups around memory block devices. Especially, clean up
> and simplify walk_memory_range(). Including some other minor cleanups.
> 
> Compiled + tested on x86 with DIMMs under QEMU. Compile-tested on ppc64.
> 
> v2 -> v3:
> - "mm/memory_hotplug: Rename walk_memory_range() and pass start+size .."
> -- Avoid warning on ppc.
> - "drivers/base/memory.c: Get rid of find_memory_block_hinted()"
> -- Fixup a comment regarding hinted devices.
> 
> v1 -> v2:
> - "mm: Section numbers use the type "unsigned long""
> -- "unsigned long i" -> "unsigned long nr", in one case -> "int i"
> - "drivers/base/memory.c: Get rid of find_memory_block_hinted("
> -- Fix compilation error
> -- Get rid of the "hint" parameter completely
> 
> David Hildenbrand (6):
>   mm: Section numbers use the type "unsigned long"
>   drivers/base/memory: Use "unsigned long" for block ids
>   mm: Make register_mem_sect_under_node() static
>   mm/memory_hotplug: Rename walk_memory_range() and pass start+size
> instead of pfns
>   mm/memory_hotplug: Move and simplify walk_memory_blocks()
>   drivers/base/memory.c: Get rid of find_memory_block_hinted()
> 
>  arch/powerpc/platforms/powernv/memtrace.c |  23 ++---
>  drivers/acpi/acpi_memhotplug.c|  19 +---
>  drivers/base/memory.c | 120 +-
>  drivers/base/node.c   |   8 +-
>  include/linux/memory.h|   5 +-
>  include/linux/memory_hotplug.h|   2 -
>  include/linux/mmzone.h|   4 +-
>  include/linux/node.h  |   7 --
>  mm/memory_hotplug.c   |  57 +-
>  mm/sparse.c   |  12 +--
>  10 files changed, 106 insertions(+), 151 deletions(-)
> 

This series causes a few machines are unable to boot triggering endless soft
lockups. Reverted those commits fixed the issue.

97f4217d1da0 Revert "mm/memory_hotplug: rename walk_memory_range() and pass
start+size instead of pfns"
c608eebf33c6 Revert "mm-memory_hotplug-rename-walk_memory_range-and-pass-
startsize-instead-of-pfns-fix"
34b5e4ab7558 Revert "mm/memory_hotplug: move and simplify walk_memory_blocks()"
59a9f3eec5d1 Revert "drivers/base/memory.c: Get rid of
find_memory_block_hinted()"
5cfcd52288b6 Revert "drivers-base-memoryc-get-rid-of-find_memory_block_hinted-
v3"

[4.582081][T1] ACPI FADT declares the system doesn't support PCIe ASPM,
so disable it
[4.590405][T1] ACPI: bus type PCI registered
[4.592908][T1] PCI: MMCONFIG for domain  [bus 00-ff] at [mem
0x8000-0x8fff] (base 0x8000)
[4.601860][T1] PCI: MMCONFIG at [mem 0x8000-0x8fff] reserved in
E820
[4.601860][T1] PCI: Using configuration type 1 for base access
[   28.661336][   C16] watchdog: BUG: soft lockup - CPU#16 stuck for 22s!
[swapper/0:1]
[   28.671351][   C16] Modules linked in:
[   28.671354][   C16] CPU: 16 PID: 1 Comm: swapper/0 Not tainted 5.2.0-rc5-
next-20190621+ #1
[   28.681366][   C16] Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385
Gen10, BIOS A40 03/09/2018
[   28.691334][   C16] RIP: 0010:_raw_spin_unlock_irqrestore+0x2f/0x40
[   28.701334][   C16] Code: 55 48 89 e5 41 54 49 89 f4 be 01 00 00 00 53 48 8b
55 08 48 89 fb 48 8d 7f 18 e8 4c 89 7d ff 48 89 df e8 94 f9 7d ff 41 54 9d <65>
ff 0d c2 44 8d 48 5b 41 5c 5d c3 0f 1f 44 00 00 0f 1f 44 00 00
[   28.711354][   C16] RSP: 0018:888205b27bf8 EFLAGS: 0246 ORIG_RAX:
ff13
[   28.721372][   C16] RAX:  RBX: 8882053d6138 RCX:
b6f2a3b8
[   28.731371][   C16] RDX: 111040a7ac27 RSI: dc00 RDI:
8882053d6138
[   28.741371][   C16] RBP: 888205b27c08 R08: ed1040a7ac28 R09:
ed1040a7ac27
[   28.751334][   C16] R10: ed1040a7ac27 R11: 8882053d613b R12:
0246
[   28.751370][   C16] R13: 888205b27c98 R14: 8884504d0a20 R15:

[   28.761368][   C16] FS:  () GS:88845450()
knlGS:
[   28.771373][   C16] CS:  0010 DS:  ES:  CR0: 80050033
[   28.781334][   C16] CR2:  CR3: 0007c9012000 CR4:
001406a0
[   28.791333][   C16] Call Trace:
[   28.791374][   C16]  klist_next+0xd8/0x1c0
[   28.791374][   C16]  subsys_find_device_by_id+0x13b/0x1f0
[   28.801334][   C16]  ? bus_find_device_by_name+0x20/0x20
[   28.801370][   C16]  ? kobject_put+0x23/0x250
[   28.811333][   C16]  walk_memory_blocks+0x6c/0xb8
[   28.811353][   C16]  ? write_policy_show+0x40/0x40
[   28.82

Re: [PATCH 7/7] powerpc/kprobes: Allow probing on any ftrace address

2019-06-21 Thread Masami Hiramatsu
On Tue, 18 Jun 2019 20:17:06 +0530
"Naveen N. Rao"  wrote:

> With KPROBES_ON_FTRACE, kprobe is allowed to be inserted on instructions
> that branch to _mcount (referred to as ftrace location). With
> -mprofile-kernel, we now include the preceding 'mflr r0' as being part
> of the ftrace location.
> 
> However, by default, probing on an instruction that is not actually the
> branch to _mcount() is prohibited, as that is considered to not be at an
> instruction boundary. This is not the case on powerpc, so allow the same
> by overriding arch_check_ftrace_location()
> 
> In addition, we update kprobe_ftrace_handler() to detect this scenarios
> and to pass the proper nip to the pre and post probe handlers.
> 
> Signed-off-by: Naveen N. Rao 
> ---
>  arch/powerpc/kernel/kprobes-ftrace.c | 30 
>  1 file changed, 30 insertions(+)
> 
> diff --git a/arch/powerpc/kernel/kprobes-ftrace.c 
> b/arch/powerpc/kernel/kprobes-ftrace.c
> index 972cb28174b2..6a0bd3c16cb6 100644
> --- a/arch/powerpc/kernel/kprobes-ftrace.c
> +++ b/arch/powerpc/kernel/kprobes-ftrace.c
> @@ -12,14 +12,34 @@
>  #include 
>  #include 
>  
> +/*
> + * With -mprofile-kernel, we patch two instructions -- the branch to _mcount
> + * as well as the preceding 'mflr r0'. Both these instructions are claimed
> + * by ftrace and we should allow probing on either instruction.
> + */
> +int arch_check_ftrace_location(struct kprobe *p)
> +{
> + if (ftrace_location((unsigned long)p->addr))
> + p->flags |= KPROBE_FLAG_FTRACE;
> + return 0;
> +}
> +
>  /* Ftrace callback handler for kprobes */
>  void kprobe_ftrace_handler(unsigned long nip, unsigned long parent_nip,
>  struct ftrace_ops *ops, struct pt_regs *regs)
>  {
>   struct kprobe *p;
> + int mflr_kprobe = 0;
>   struct kprobe_ctlblk *kcb;
>  
>   p = get_kprobe((kprobe_opcode_t *)nip);
> + if (unlikely(!p)) {

Hmm, is this really unlikely? If we put a kprobe on the second instruction 
address,
we will see p == NULL always.

> + p = get_kprobe((kprobe_opcode_t *)(nip - MCOUNT_INSN_SIZE));
> + if (!p)

Here will be unlikely, because we can not find kprobe at both of nip and
nip - MCOUNT_INSN_SIZE.

> + return;
> + mflr_kprobe = 1;
> + }
> +
>   if (unlikely(!p) || kprobe_disabled(p))

"unlikely(!p)" is not needed here.

Thank you,

>   return;
>  
> @@ -33,6 +53,9 @@ void kprobe_ftrace_handler(unsigned long nip, unsigned long 
> parent_nip,
>*/
>   regs->nip -= MCOUNT_INSN_SIZE;
>  
> + if (mflr_kprobe)
> + regs->nip -= MCOUNT_INSN_SIZE;
> +
>   __this_cpu_write(current_kprobe, p);
>   kcb->kprobe_status = KPROBE_HIT_ACTIVE;
>   if (!p->pre_handler || !p->pre_handler(p, regs)) {
> @@ -45,6 +68,8 @@ void kprobe_ftrace_handler(unsigned long nip, unsigned long 
> parent_nip,
>   kcb->kprobe_status = KPROBE_HIT_SSDONE;
>   p->post_handler(p, regs, 0);
>   }
> + if (mflr_kprobe)
> + regs->nip += MCOUNT_INSN_SIZE;
>   }
>   /*
>* If pre_handler returns !0, it changes regs->nip. We have to
> @@ -57,6 +82,11 @@ NOKPROBE_SYMBOL(kprobe_ftrace_handler);
>  
>  int arch_prepare_kprobe_ftrace(struct kprobe *p)
>  {
> + if ((unsigned long)p->addr & 0x03) {
> + printk("Attempt to register kprobe at an unaligned address\n");
> + return -EILSEQ;
> + }
> +
>   p->ainsn.insn = NULL;
>   p->ainsn.boostable = -1;
>   return 0;
> -- 
> 2.22.0
> 


-- 
Masami Hiramatsu 


Re: switch the remaining architectures to use generic GUP v3

2019-06-21 Thread Jason Gunthorpe
On Tue, Jun 11, 2019 at 04:40:46PM +0200, Christoph Hellwig wrote:
> Hi Linus and maintainers,
> 
> below is a series to switch mips, sh and sparc64 to use the generic
> GUP code so that we only have one codebase to touch for further
> improvements to this code.  I don't have hardware for any of these
> architectures, and generally no clue about their page table
> management, so handle with care.

I like this series, ther are alot of people talking about work for GUP
and this will make any of that so much easier to do.

Jason


Re: [PATCH 6/7] kprobes/ftrace: Use ftrace_location() when [dis]arming probes

2019-06-21 Thread Masami Hiramatsu
On Tue, 18 Jun 2019 20:17:05 +0530
"Naveen N. Rao"  wrote:

> Ftrace location could include more than a single instruction in case of
> some architectures (powerpc64, for now). In this case, kprobe is
> permitted on any of those instructions, and uses ftrace infrastructure
> for functioning.
> 
> However, [dis]arm_kprobe_ftrace() uses the kprobe address when setting
> up ftrace filter IP. This won't work if the address points to any
> instruction apart from the one that has a branch to _mcount(). To
> resolve this, have [dis]arm_kprobe_ftrace() use ftrace_function() to
> identify the filter IP.

This looks good to me.

Acked-by: Masami Hiramatsu 

Thank you!

> 
> Signed-off-by: Naveen N. Rao 
> ---
>  kernel/kprobes.c | 10 +-
>  1 file changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/kernel/kprobes.c b/kernel/kprobes.c
> index 445337c107e0..282ee704e2d8 100644
> --- a/kernel/kprobes.c
> +++ b/kernel/kprobes.c
> @@ -978,10 +978,10 @@ static int prepare_kprobe(struct kprobe *p)
>  /* Caller must lock kprobe_mutex */
>  static int arm_kprobe_ftrace(struct kprobe *p)
>  {
> + unsigned long ftrace_ip = ftrace_location((unsigned long)p->addr);
>   int ret = 0;
>  
> - ret = ftrace_set_filter_ip(_ftrace_ops,
> -(unsigned long)p->addr, 0, 0);
> + ret = ftrace_set_filter_ip(_ftrace_ops, ftrace_ip, 0, 0);
>   if (ret) {
>   pr_debug("Failed to arm kprobe-ftrace at %pS (%d)\n",
>p->addr, ret);
> @@ -1005,13 +1005,14 @@ static int arm_kprobe_ftrace(struct kprobe *p)
>* non-empty filter_hash for IPMODIFY ops, we're safe from an accidental
>* empty filter_hash which would undesirably trace all functions.
>*/
> - ftrace_set_filter_ip(_ftrace_ops, (unsigned long)p->addr, 1, 0);
> + ftrace_set_filter_ip(_ftrace_ops, ftrace_ip, 1, 0);
>   return ret;
>  }
>  
>  /* Caller must lock kprobe_mutex */
>  static int disarm_kprobe_ftrace(struct kprobe *p)
>  {
> + unsigned long ftrace_ip = ftrace_location((unsigned long)p->addr);
>   int ret = 0;
>  
>   if (kprobe_ftrace_enabled == 1) {
> @@ -1022,8 +1023,7 @@ static int disarm_kprobe_ftrace(struct kprobe *p)
>  
>   kprobe_ftrace_enabled--;
>  
> - ret = ftrace_set_filter_ip(_ftrace_ops,
> -(unsigned long)p->addr, 1, 0);
> + ret = ftrace_set_filter_ip(_ftrace_ops, ftrace_ip, 1, 0);
>   WARN_ONCE(ret < 0, "Failed to disarm kprobe-ftrace at %pS (%d)\n",
> p->addr, ret);
>   return ret;
> -- 
> 2.22.0
> 


-- 
Masami Hiramatsu 


Re: [PATCH 11/16] mm: consolidate the get_user_pages* implementations

2019-06-21 Thread Jason Gunthorpe
On Tue, Jun 11, 2019 at 04:40:57PM +0200, Christoph Hellwig wrote:
> @@ -2168,7 +2221,7 @@ static void gup_pgd_range(unsigned long addr, unsigned 
> long end,
>   */
>  static bool gup_fast_permitted(unsigned long start, unsigned long end)
>  {
> - return true;
> + return IS_ENABLED(CONFIG_HAVE_FAST_GUP) ? true : false;

The ?: is needed with IS_ENABLED?

>  }
>  #endif

Oh, you fixed the util.c this way instead of the headerfile
#ifdef..

I'd suggest to revise this block a tiny bit:

-#ifndef gup_fast_permitted
+#if !IS_ENABLED(CONFIG_HAVE_FAST_GUP) || !defined(gup_fast_permitted)
 /*
  * Check if it's allowed to use __get_user_pages_fast() for the range, or
  * we need to fall back to the slow version:
  */
-bool gup_fast_permitted(unsigned long start, int nr_pages)
+static bool gup_fast_permitted(unsigned long start, int nr_pages)
 {

Just in case some future arch code mismatches the header and kconfig..

Regards,
Jason


Re: [PATCH 10/16] mm: rename CONFIG_HAVE_GENERIC_GUP to CONFIG_HAVE_FAST_GUP

2019-06-21 Thread Jason Gunthorpe
On Tue, Jun 11, 2019 at 04:40:56PM +0200, Christoph Hellwig wrote:
> We only support the generic GUP now, so rename the config option to
> be more clear, and always use the mm/Kconfig definition of the
> symbol and select it from the arch Kconfigs.

Looks OK to me

Reviewed-by: Jason Gunthorpe 

But could you also roll something like this in to the series? There is
no longer any reason for the special __weak stuff that I can see -
just follow the normal pattern for stubbing config controlled
functions through the header file.

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0e8834ac32b76c..13b1cb573383d5 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1561,8 +1561,17 @@ long get_user_pages_locked(unsigned long start, unsigned 
long nr_pages,
 long get_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
struct page **pages, unsigned int gup_flags);
 
+#ifdef CONFIG_HAVE_FAST_GUP
 int get_user_pages_fast(unsigned long start, int nr_pages,
unsigned int gup_flags, struct page **pages);
+#else
+static inline int get_user_pages_fast(unsigned long start, int nr_pages,
+ unsigned int gup_flags,
+ struct page **pages)
+{
+   return get_user_pages_unlocked(start, nr_pages, pages, gup_flags);
+}
+#endif
 
 /* Container for pinned pfns / pages */
 struct frame_vector {
@@ -1668,8 +1677,17 @@ extern int mprotect_fixup(struct vm_area_struct *vma,
 /*
  * doesn't attempt to fault and will return short.
  */
+#ifdef CONFIG_HAVE_FAST_GUP
 int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
  struct page **pages);
+#else
+static inline int __get_user_pages_fast(unsigned long start, int nr_pages,
+   int write, struct page **pages)
+{
+   return 0;
+}
+#endif
+
 /*
  * per-process(per-mm_struct) statistics.
  */
diff --git a/mm/util.c b/mm/util.c
index 9834c4ab7d8e86..68575a315dc5ad 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -300,53 +300,6 @@ void arch_pick_mmap_layout(struct mm_struct *mm, struct 
rlimit *rlim_stack)
 }
 #endif
 
-/*
- * Like get_user_pages_fast() except its IRQ-safe in that it won't fall
- * back to the regular GUP.
- * Note a difference with get_user_pages_fast: this always returns the
- * number of pages pinned, 0 if no pages were pinned.
- * If the architecture does not support this function, simply return with no
- * pages pinned.
- */
-int __weak __get_user_pages_fast(unsigned long start,
-int nr_pages, int write, struct page **pages)
-{
-   return 0;
-}
-EXPORT_SYMBOL_GPL(__get_user_pages_fast);
-
-/**
- * get_user_pages_fast() - pin user pages in memory
- * @start: starting user address
- * @nr_pages:  number of pages from start to pin
- * @gup_flags: flags modifying pin behaviour
- * @pages: array that receives pointers to the pages pinned.
- * Should be at least nr_pages long.
- *
- * get_user_pages_fast provides equivalent functionality to get_user_pages,
- * operating on current and current->mm, with force=0 and vma=NULL. However
- * unlike get_user_pages, it must be called without mmap_sem held.
- *
- * get_user_pages_fast may take mmap_sem and page table locks, so no
- * assumptions can be made about lack of locking. get_user_pages_fast is to be
- * implemented in a way that is advantageous (vs get_user_pages()) when the
- * user memory area is already faulted in and present in ptes. However if the
- * pages have to be faulted in, it may turn out to be slightly slower so
- * callers need to carefully consider what to use. On many architectures,
- * get_user_pages_fast simply falls back to get_user_pages.
- *
- * Return: number of pages pinned. This may be fewer than the number
- * requested. If nr_pages is 0 or negative, returns 0. If no pages
- * were pinned, returns -errno.
- */
-int __weak get_user_pages_fast(unsigned long start,
-   int nr_pages, unsigned int gup_flags,
-   struct page **pages)
-{
-   return get_user_pages_unlocked(start, nr_pages, pages, gup_flags);
-}
-EXPORT_SYMBOL_GPL(get_user_pages_fast);
-
 unsigned long vm_mmap_pgoff(struct file *file, unsigned long addr,
unsigned long len, unsigned long prot,
unsigned long flag, unsigned long pgoff)


Re: [PATCH 04/16] MIPS: use the generic get_user_pages_fast code

2019-06-21 Thread Jason Gunthorpe
On Tue, Jun 11, 2019 at 04:40:50PM +0200, Christoph Hellwig wrote:
> diff --git a/arch/mips/include/asm/pgtable.h b/arch/mips/include/asm/pgtable.h
> index 4ccb465ef3f2..7d27194e3b45 100644
> +++ b/arch/mips/include/asm/pgtable.h
> @@ -20,6 +20,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  struct mm_struct;
>  struct vm_area_struct;
> @@ -626,6 +627,8 @@ static inline pmd_t pmdp_huge_get_and_clear(struct 
> mm_struct *mm,
>  
>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>  
> +#define gup_fast_permitted(start, end)   (!cpu_has_dc_aliases)
> +

Today this check is only being done on the get_user_pages_fast() -
after this patch it is also done for __get_user_pages_fast().

Which means __get_user_pages_fast is now non-functional on a range of
MIPS CPUs, but that seems OK as far as I can tell, so:

Reviewed-by: Jason Gunthorpe 

However, looks to me like this patch is also a bug fix for this:

commit 5b167c123b3c3582f62cf1896465019bc40fe526
Author: Kamal Dasu 
Date:   Fri Jun 14 17:10:03 2013 +

MIPS: Fix get_user_page_fast() for mips with cache alias

get_user_pages_fast() is missing cache flushes for MIPS platforms with
cache aliases.  Filesystem failures observed with DirectIO operations due
to missing flush_anon_page() that use page coloring logic to work with
cache aliases. This fix falls through to take slow_irqon path that calls
get_user_pages() that has required logic for platforms where
cpu_has_dc_aliases is true.

> - pgdp = pgd_offset(mm, addr);
> - do {
> - pgd_t pgd = *pgdp;
> -
> - next = pgd_addr_end(addr, end);
> - if (pgd_none(pgd))
> - goto slow;
> - if (!gup_pud_range(pgd, addr, next, gup_flags & FOLL_WRITE,
> -pages, ))

This is different too, the core code has a p4d layer, but I see that
whole thing gets NOP'd by the compiler as mips uses pgtable-nop4d.h -
right?

Jason


Re: [PATCH 03/16] mm: lift the x86_32 PAE version of gup_get_pte to common code

2019-06-21 Thread Jason Gunthorpe
On Tue, Jun 11, 2019 at 04:40:49PM +0200, Christoph Hellwig wrote:
> The split low/high access is the only non-READ_ONCE version of
> gup_get_pte that did show up in the various arch implemenations.
> Lift it to common code and drop the ifdef based arch override.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  arch/x86/Kconfig  |  1 +
>  arch/x86/include/asm/pgtable-3level.h | 47 
>  arch/x86/kvm/mmu.c|  2 +-
>  mm/Kconfig|  3 ++
>  mm/gup.c  | 51 ---
>  5 files changed, 52 insertions(+), 52 deletions(-)

Yep, the sh and mips conversions look right too.

Reviewed-by: Jason Gunthorpe 
 
> diff --git a/mm/Kconfig b/mm/Kconfig
> index f0c76ba47695..fe51f104a9e0 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -762,6 +762,9 @@ config GUP_BENCHMARK
>  
> See tools/testing/selftests/vm/gup_benchmark.c
>
> +config GUP_GET_PTE_LOW_HIGH
> + bool
> +

The config name seems a bit out of place though, should it be prefixed
with GENERIC_ or ARCH_?

Jason


Re: [PATCH 02/16] mm: simplify gup_fast_permitted

2019-06-21 Thread Jason Gunthorpe
On Tue, Jun 11, 2019 at 04:40:48PM +0200, Christoph Hellwig wrote:
> Pass in the already calculated end value instead of recomputing it, and
> leave the end > start check in the callers instead of duplicating them
> in the arch code.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  arch/s390/include/asm/pgtable.h   |  8 +---
>  arch/x86/include/asm/pgtable_64.h |  8 +---
>  mm/gup.c  | 17 +++--
>  3 files changed, 9 insertions(+), 24 deletions(-)

Much cleaner

Reviewed-by: Jason Gunthorpe 

Jason


Re: [PATCH 01/16] mm: use untagged_addr() for get_user_pages_fast addresses

2019-06-21 Thread Jason Gunthorpe
On Tue, Jun 11, 2019 at 04:40:47PM +0200, Christoph Hellwig wrote:
> This will allow sparc64 to override its ADI tags for
> get_user_pages and get_user_pages_fast.
> 
> Signed-off-by: Christoph Hellwig 
>  mm/gup.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/gup.c b/mm/gup.c
> index ddde097cf9e4..6bb521db67ec 100644
> +++ b/mm/gup.c
> @@ -2146,7 +2146,7 @@ int __get_user_pages_fast(unsigned long start, int 
> nr_pages, int write,
>   unsigned long flags;
>   int nr = 0;
>  
> - start &= PAGE_MASK;
> + start = untagged_addr(start) & PAGE_MASK;
>   len = (unsigned long) nr_pages << PAGE_SHIFT;
>   end = start + len;

Hmm, this function, and the other, goes on to do:

if (unlikely(!access_ok((void __user *)start, len)))
return 0;

and I thought that access_ok takes in the tagged pointer?

How about re-order it a bit?

diff --git a/mm/gup.c b/mm/gup.c
index ddde097cf9e410..f48747ced4723b 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2148,11 +2148,12 @@ int __get_user_pages_fast(unsigned long start, int 
nr_pages, int write,
 
start &= PAGE_MASK;
len = (unsigned long) nr_pages << PAGE_SHIFT;
-   end = start + len;
-
if (unlikely(!access_ok((void __user *)start, len)))
return 0;
 
+   start = untagged_ptr(start);
+   end = start + len;
+
/*
 * Disable interrupts.  We use the nested form as we can already have
 * interrupts disabled by get_futex_key.


Re: [PATCH 01/16] mm: use untagged_addr() for get_user_pages_fast addresses

2019-06-21 Thread Jason Gunthorpe
On Tue, Jun 11, 2019 at 04:40:47PM +0200, Christoph Hellwig wrote:
> This will allow sparc64 to override its ADI tags for
> get_user_pages and get_user_pages_fast.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  mm/gup.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)

Reviewed-by: Jason Gunthorpe 


Re: [PATCH 11/13] powerpc/64s: Save r13 in machine_check_common_early

2019-06-21 Thread Mahesh J Salgaonkar
On 2019-06-21 06:27:15 Fri, Santosh Sivaraj wrote:
> From: Reza Arbab 
> 
> Testing my memcpy_mcsafe() work in progress with an injected UE, I get
> an error like this immediately after the function returns:
> 
> BUG: Unable to handle kernel data access at 0x7fff84dec8f8
> Faulting instruction address: 0xc008009c00b0
> Oops: Kernel access of bad area, sig: 11 [#1]
> LE PAGE_SIZE=64K MMU=Radix MMU=Hash SMP NR_CPUS=2048 NUMA PowerNV
> Modules linked in: mce(O+) vmx_crypto crc32c_vpmsum
> CPU: 0 PID: 1375 Comm: modprobe Tainted: G   O  5.1.0-rc6 #267
> NIP:  c008009c00b0 LR: c008009c00a8 CTR: c0095f90
> REGS: c000ee197790 TRAP: 0300   Tainted: G   O   (5.1.0-rc6)
> MSR:  9280b033   CR: 88002826  
> XER: 0004
> CFAR: c0095f8c DAR: 7fff84dec8f8 DSISR: 4000 IRQMASK: 0
> GPR00: 6c6c6568 c000ee197a20 c008009c8400 fff2
> GPR04: c008009c02e0 0006  c3c834c8
> GPR08: 0080 776a6681b7fb5100  c008009c01c8
> GPR12: c0095f90 7fff84debc00 4d071440 
> GPR16: 00010601 c008009e c0c98dd8 c0c98d98
> GPR20: c3bba970 c008009c04d0 c008009c0618 c01e5820
> GPR24:  0100 0001 c3bba958
> GPR28: c008009c02e8 c008009c0318 c008009c02e0 
> NIP [c008009c00b0] cause_ue+0xa8/0xe8 [mce]
> LR [c008009c00a8] cause_ue+0xa0/0xe8 [mce]
> 
> To fix, ensure that r13 is properly restored after an MCE.
> 
> Signed-off-by: Reza Arbab 
> ---
>  arch/powerpc/kernel/exceptions-64s.S | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/arch/powerpc/kernel/exceptions-64s.S 
> b/arch/powerpc/kernel/exceptions-64s.S
> index 311f1392a2ec..932d8d05892c 100644
> --- a/arch/powerpc/kernel/exceptions-64s.S
> +++ b/arch/powerpc/kernel/exceptions-64s.S
> @@ -265,6 +265,7 @@ ALT_FTR_SECTION_END_IFSET(CPU_FTR_HVMODE)
>  EXC_REAL_END(machine_check, 0x200, 0x100)
>  EXC_VIRT_NONE(0x4200, 0x100)
>  TRAMP_REAL_BEGIN(machine_check_common_early)
> + SET_SCRATCH0(r13)   /* save r13 */
>   EXCEPTION_PROLOG_1(PACA_EXMC, NOTEST, 0x200)
>   /*
>* Register contents:

We do save r13 before we call machine_check_common_early(). I don't
think I understood clearly how this change fixes the issue you are
seeing. What am I missing here ?

Above change will push the paca pointer to scratch0 overwriting the
original saved r13.

EXC_REAL_BEGIN(machine_check, 0x200, 0x100)
/* This is moved out of line as it can be patched by FW, but
 * some code path might still want to branch into the original
 * vector
 */
SET_SCRATCH0(r13)   /* save r13 */
EXCEPTION_PROLOG_0(PACA_EXMC)
BEGIN_FTR_SECTION
b   machine_check_common_early

Thanks,
-Mahesh.



Re: [PATCH 2/4] powerpc/powernv: remove the unused tunneling exports

2019-06-21 Thread Christoph Hellwig
On Fri, Jun 21, 2019 at 11:21:38AM +0200, Frederic Barrat wrote:
> The as-notify can be used in both CAPI mode and PCI mode. In capi mode, 
> it's integrated in the capi protocol, so the cxl driver doesn't need to do 
> extra setup, compared to what's already done to activate capi.
> As mentioned in a previous iteration of that patchset, those APIs are to be 
> used by the Mellanox CX5 driver. The in-tree driver is always a step behind 
> their latest, but word is they are working on upstreaming those 
> interactions.

We can review them together with the driver.  Especially as we need to
consider if we even want to support it if there is no generic platform
independent inferface.


Re: [PATCH 2/4] powerpc/powernv: remove the unused tunneling exports

2019-06-21 Thread Frederic Barrat




Le 21/06/2019 à 03:47, Oliver O'Halloran a écrit :

On Thu, May 23, 2019 at 5:51 PM Christoph Hellwig  wrote:


These have been unused ever since they've been added to the kernel.

Signed-off-by: Christoph Hellwig 
---
  arch/powerpc/include/asm/pnv-pci.h|  4 --
  arch/powerpc/platforms/powernv/pci-ioda.c |  4 +-
  arch/powerpc/platforms/powernv/pci.c  | 71 ---
  arch/powerpc/platforms/powernv/pci.h  |  1 -
  4 files changed, 3 insertions(+), 77 deletions(-)

diff --git a/arch/powerpc/include/asm/pnv-pci.h 
b/arch/powerpc/include/asm/pnv-pci.h
index 9fcb0bc462c6..1ab4b0111abc 100644
--- a/arch/powerpc/include/asm/pnv-pci.h
+++ b/arch/powerpc/include/asm/pnv-pci.h
@@ -27,12 +27,8 @@ extern int pnv_pci_get_power_state(uint64_t id, uint8_t 
*state);
  extern int pnv_pci_set_power_state(uint64_t id, uint8_t state,
struct opal_msg *msg);

-extern int pnv_pci_enable_tunnel(struct pci_dev *dev, uint64_t *asnind);
-extern int pnv_pci_disable_tunnel(struct pci_dev *dev);
  extern int pnv_pci_set_tunnel_bar(struct pci_dev *dev, uint64_t addr,
   int enable);
-extern int pnv_pci_get_as_notify_info(struct task_struct *task, u32 *lpid,
- u32 *pid, u32 *tid);


IIRC as-notify was for CAPI which has an in-tree driver (cxl). Fred or
Andrew (+cc), what's going on with this? Will it ever see the light of
day?



The as-notify can be used in both CAPI mode and PCI mode. In capi mode, 
it's integrated in the capi protocol, so the cxl driver doesn't need to 
do extra setup, compared to what's already done to activate capi.
As mentioned in a previous iteration of that patchset, those APIs are to 
be used by the Mellanox CX5 driver. The in-tree driver is always a step 
behind their latest, but word is they are working on upstreaming those 
interactions.


  Fred


  int pnv_phb_to_cxl_mode(struct pci_dev *dev, uint64_t mode);
  int pnv_cxl_ioda_msi_setup(struct pci_dev *dev, unsigned int hwirq,
unsigned int virq);
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index 126602b4e399..6b0caa2d0425 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -54,6 +54,8 @@
  static const char * const pnv_phb_names[] = { "IODA1", "IODA2", "NPU_NVLINK",
   "NPU_OCAPI" };

+static void pnv_pci_ioda2_set_bypass(struct pnv_ioda_pe *pe, bool enable);
+
  void pe_level_printk(const struct pnv_ioda_pe *pe, const char *level,
 const char *fmt, ...)
  {
@@ -2360,7 +2362,7 @@ static long pnv_pci_ioda2_set_window(struct 
iommu_table_group *table_group,
 return 0;
  }

-void pnv_pci_ioda2_set_bypass(struct pnv_ioda_pe *pe, bool enable)
+static void pnv_pci_ioda2_set_bypass(struct pnv_ioda_pe *pe, bool enable)
  {
 uint16_t window_id = (pe->pe_number << 1 ) + 1;
 int64_t rc;
diff --git a/arch/powerpc/platforms/powernv/pci.c 
b/arch/powerpc/platforms/powernv/pci.c
index 8d28f2932c3b..fc69f5611020 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -868,54 +868,6 @@ struct device_node *pnv_pci_get_phb_node(struct pci_dev 
*dev)
  }
  EXPORT_SYMBOL(pnv_pci_get_phb_node);

-int pnv_pci_enable_tunnel(struct pci_dev *dev, u64 *asnind)
-{
-   struct device_node *np;
-   const __be32 *prop;
-   struct pnv_ioda_pe *pe;
-   uint16_t window_id;
-   int rc;
-
-   if (!radix_enabled())
-   return -ENXIO;
-
-   if (!(np = pnv_pci_get_phb_node(dev)))
-   return -ENXIO;
-
-   prop = of_get_property(np, "ibm,phb-indications", NULL);
-   of_node_put(np);
-
-   if (!prop || !prop[1])
-   return -ENXIO;
-
-   *asnind = (u64)be32_to_cpu(prop[1]);
-   pe = pnv_ioda_get_pe(dev);
-   if (!pe)
-   return -ENODEV;
-
-   /* Increase real window size to accept as_notify messages. */
-   window_id = (pe->pe_number << 1 ) + 1;
-   rc = opal_pci_map_pe_dma_window_real(pe->phb->opal_id, pe->pe_number,
-window_id, pe->tce_bypass_base,
-(uint64_t)1 << 48);
-   return opal_error_code(rc);
-}
-EXPORT_SYMBOL_GPL(pnv_pci_enable_tunnel);
-
-int pnv_pci_disable_tunnel(struct pci_dev *dev)
-{
-   struct pnv_ioda_pe *pe;
-
-   pe = pnv_ioda_get_pe(dev);
-   if (!pe)
-   return -ENODEV;
-
-   /* Restore default real window size. */
-   pnv_pci_ioda2_set_bypass(pe, true);
-   return 0;
-}
-EXPORT_SYMBOL_GPL(pnv_pci_disable_tunnel);
-
  int pnv_pci_set_tunnel_bar(struct pci_dev *dev, u64 addr, int enable)
  {
 __be64 val;
@@ -970,29 +922,6 @@ int pnv_pci_set_tunnel_bar(struct pci_dev *dev, u64 addr, 
int enable)
  }
  

[PATCH] powerpc/lib/xor_vmx: Relax frame size for clang

2019-06-21 Thread Mathieu Malaterre
When building with clang-8 the frame size limit is hit:

  ../arch/powerpc/lib/xor_vmx.c:119:6: error: stack frame size of 1200 bytes in 
function '__xor_altivec_5' [-Werror,-Wframe-larger-than=]

Follow the same approach as commit 9c87156cce5a ("powerpc/xmon: Relax
frame size for clang") until a proper fix is implemented upstream in
clang and relax requirement for clang.

Link: https://github.com/ClangBuiltLinux/linux/issues/563
Cc: Joel Stanley 
Signed-off-by: Mathieu Malaterre 
---
 arch/powerpc/lib/Makefile | 4 
 1 file changed, 4 insertions(+)

diff --git a/arch/powerpc/lib/Makefile b/arch/powerpc/lib/Makefile
index c55f9c27bf79..b3f7d64caaf0 100644
--- a/arch/powerpc/lib/Makefile
+++ b/arch/powerpc/lib/Makefile
@@ -58,5 +58,9 @@ obj-$(CONFIG_FTR_FIXUP_SELFTEST) += feature-fixups-test.o
 
 obj-$(CONFIG_ALTIVEC)  += xor_vmx.o xor_vmx_glue.o
 CFLAGS_xor_vmx.o += -maltivec $(call cc-option,-mabi=altivec)
+ifdef CONFIG_CC_IS_CLANG
+# See https://github.com/ClangBuiltLinux/linux/issues/563
+CFLAGS_xor_vmx.o += -Wframe-larger-than=4096
+endif
 
 obj-$(CONFIG_PPC64) += $(obj64-y)
-- 
2.20.1



Re: [PATCH 16/16] mm: pass get_user_pages_fast iterator arguments in a structure

2019-06-21 Thread Nicholas Piggin
Linus Torvalds's on June 21, 2019 3:21 am:
> On Thu, Jun 20, 2019 at 5:19 AM Nicholas Piggin  wrote:
>>
>> The processor aliasing problem happens because the struct will
>> be initialised with stores using one base register (e.g., stack
>> register), and then same memory is loaded using a different
>> register (e.g., parameter register).
> 
> Hmm. Honestly, I've never seen anything like that in any kernel profiles.
> 
> Compared to the problems I _do_ see (which is usually the obvious
> cache misses, and locking), it must either be in the noise or it's
> some problem specific to whatever CPU you are doing performance work
> on?

No you're right, the performance hit from these flushes is not a
big hit that stands out in cycle counts. I just look at kernel code
for various flushes. Branches not surprisingly are usually the main
culprit, but they're normally not so interesting.

Static alias prediction seems to work well outside this case. It's
interesting, you need both a store ; load sequence that does not
predict well (e.g., using a different base register), and you also
need that load to be executed ahead of the store.

The small stack structure for arguments is the perfect case. Bad
pattern, and load executed right after store. Even then you also need
a reason to delay the store (e.g., source not ready or store queue
full), but those hazards do show up.

Now, even when all that goes wrong, there are dynamic heuristics that
can take over. So if you run a repetitive microbenchmark you won't
see it.

Some CPUs seem to be quite aggressive about giving up and turning off
the alias prediction globally if you take misses (Intel x86 used to do
that IIRC, not sure if they still do). So in that case you wouldn't
even see it show up in one place, everything will just run slightly
slower.

What I worry about is high rate direct IO workloads that see single
flushes in these paths as significant. Or if this thing creeps in to
the kernel too much and just slightly raises global misses enough,
then it will cause disambiguation to be significantly shut down.

Thanks,
Nick



Re: [PATCH 16/16] mm: pass get_user_pages_fast iterator arguments in a structure

2019-06-21 Thread Christoph Hellwig
On Thu, Jun 20, 2019 at 10:21:46AM -0700, Linus Torvalds wrote:
> Hmm. Honestly, I've never seen anything like that in any kernel profiles.
> 
> Compared to the problems I _do_ see (which is usually the obvious
> cache misses, and locking), it must either be in the noise or it's
> some problem specific to whatever CPU you are doing performance work
> on?
> 
> I've occasionally seen pipeline hiccups in profiles, but it's usually
> been either some serious glass jaw of the core, or it's been something
> really stupid we did (or occasionally that the compiler did: one in
> particular I remember was how there was a time when gcc would narrow
> stores when it could, so if you set a bit in a word, it would do it
> with a byte store, and then when you read the whole word afterwards
> you'd get a major pipeline stall and it happened to show up in some
> really hot paths).

I've not seen any difference in the GUP bench output here ar all.

But I'm fine with skipping this patch for now, I have a potential
series I'm looking into that would benefit a lot from it, but we
can discusss it in that context and make sure all the other works gets in
in time.


Re: [PATCH 05/13] powerpc/mce: Allow notifier callback to handle MCE

2019-06-21 Thread Mahesh Jagannath Salgaonkar
On 6/21/19 6:27 AM, Santosh Sivaraj wrote:
> From: Reza Arbab 
> 
> If a notifier returns NOTIFY_STOP, consider the MCE handled, just as we
> do when machine_check_early() returns 1.
> 
> Signed-off-by: Reza Arbab 
> ---
>  arch/powerpc/include/asm/asm-prototypes.h |  2 +-
>  arch/powerpc/kernel/exceptions-64s.S  |  3 +++
>  arch/powerpc/kernel/mce.c | 28 ---
>  3 files changed, 24 insertions(+), 9 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/asm-prototypes.h 
> b/arch/powerpc/include/asm/asm-prototypes.h
> index f66f26ef3ce0..49ee8f08de2a 100644
> --- a/arch/powerpc/include/asm/asm-prototypes.h
> +++ b/arch/powerpc/include/asm/asm-prototypes.h
> @@ -72,7 +72,7 @@ void machine_check_exception(struct pt_regs *regs);
>  void emulation_assist_interrupt(struct pt_regs *regs);
>  long do_slb_fault(struct pt_regs *regs, unsigned long ea);
>  void do_bad_slb_fault(struct pt_regs *regs, unsigned long ea, long err);
> -void machine_check_notify(struct pt_regs *regs);
> +long machine_check_notify(struct pt_regs *regs);
>  
>  /* signals, syscalls and interrupts */
>  long sys_swapcontext(struct ucontext __user *old_ctx,
> diff --git a/arch/powerpc/kernel/exceptions-64s.S 
> b/arch/powerpc/kernel/exceptions-64s.S
> index 2e56014fca21..c83e38a403fd 100644
> --- a/arch/powerpc/kernel/exceptions-64s.S
> +++ b/arch/powerpc/kernel/exceptions-64s.S
> @@ -460,6 +460,9 @@ EXC_COMMON_BEGIN(machine_check_handle_early)
>  
>   addir3,r1,STACK_FRAME_OVERHEAD
>   bl  machine_check_notify
> + ld  r11,RESULT(r1)
> + or  r3,r3,r11
> + std r3,RESULT(r1)
>  
>   ld  r12,_MSR(r1)
>  BEGIN_FTR_SECTION
> diff --git a/arch/powerpc/kernel/mce.c b/arch/powerpc/kernel/mce.c
> index 0ab171b41ede..912efe58e0b1 100644
> --- a/arch/powerpc/kernel/mce.c
> +++ b/arch/powerpc/kernel/mce.c
> @@ -647,16 +647,28 @@ long hmi_exception_realmode(struct pt_regs *regs)
>   return 1;
>  }
>  
> -void machine_check_notify(struct pt_regs *regs)
> +long machine_check_notify(struct pt_regs *regs)
>  {
> - struct machine_check_event evt;
> + int index = __this_cpu_read(mce_nest_count) - 1;
> + struct machine_check_event *evt;
> + int rc;
>  
> - if (!get_mce_event(, MCE_EVENT_DONTRELEASE))
> - return;
> + if (index < 0 || index >= MAX_MC_EVT)
> + return 0;
> +
> + evt = this_cpu_ptr(_event[index]);
>  
> - blocking_notifier_call_chain(_notifier_list, 0, );
> + rc = blocking_notifier_call_chain(_notifier_list, 0, evt);
> + if (rc & NOTIFY_STOP_MASK) {
> + evt->disposition = MCE_DISPOSITION_RECOVERED;
> + regs->msr |= MSR_RI;

What is the reason for setting MSR_RI ? I don't think this is a good
idea. MSR_RI = 0 means system got MCE interrupt when SRR0 and SRR1
contents were live and was overwritten by MCE interrupt. Hence this
interrupt is unrecoverable irrespective of whether machine check handler
recovers from it or not.

Thanks,
-Mahesh.



[PATCH] powerpc/rtas: retry when cpu offline races with suspend/migration

2019-06-21 Thread Nathan Lynch
The protocol for suspending or migrating an LPAR requires all present
processor threads to enter H_JOIN. So if we have threads offline, we
have to temporarily bring them up. This can race with administrator
actions such as SMT state changes. As of dfd718a2ed1f ("powerpc/rtas:
Fix a potential race between CPU-Offline & Migration"),
rtas_ibm_suspend_me() accounts for this, but errors out with -EBUSY
for what almost certainly is a transient condition in any reasonable
scenario.

Callers of rtas_ibm_suspend_me() already retry when -EAGAIN is
returned, and it is typical during a migration for that to happen
repeatedly for several minutes polling the H_VASI_STATE hcall result
before proceeding to the next stage.

So return -EAGAIN instead of -EBUSY when this race is
encountered. Additionally: logging this event is still appropriate but
use pr_info instead of pr_err; and remove use of unlikely() while here
as this is not a hot path at all.

Fixes: dfd718a2ed1f ("powerpc/rtas: Fix a potential race between CPU-Offline & 
Migration")
Signed-off-by: Nathan Lynch 
---
 arch/powerpc/kernel/rtas.c | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c
index fbc676160adf..9b4d2a2ffb4f 100644
--- a/arch/powerpc/kernel/rtas.c
+++ b/arch/powerpc/kernel/rtas.c
@@ -984,10 +984,9 @@ int rtas_ibm_suspend_me(u64 handle)
cpu_hotplug_disable();
 
/* Check if we raced with a CPU-Offline Operation */
-   if (unlikely(!cpumask_equal(cpu_present_mask, cpu_online_mask))) {
-   pr_err("%s: Raced against a concurrent CPU-Offline\n",
-  __func__);
-   atomic_set(, -EBUSY);
+   if (!cpumask_equal(cpu_present_mask, cpu_online_mask)) {
+   pr_info("%s: Raced against a concurrent CPU-Offline\n", 
__func__);
+   atomic_set(, -EAGAIN);
goto out_hotplug_enable;
}
 
-- 
2.20.1