[PATCH kernel] KVM: PPC: Fix clearing never mapped TCEs in realmode

2021-08-26 Thread Alexey Kardashevskiy
Since e1a1ef84cd07, pages for TCE tables for KVM guests are allocated
only when needed. This allows skipping any update when clearing TCEs.
This works mostly fine as TCE updates are handled when the MMU is
enabled. The realmode handlers fail with H_TOO_HARD when pages are not
yet allocated, except when clearing a TCE, in which case KVM prints a
warning but proceeds to dereference a NULL pointer, which crashes the
host OS.

This has not been caught so far as the change is reasonably new and
POWER9 mostly runs radix, which does not use the realmode handlers. With
hash, the default TCE table is memset() by QEMU at machine reset, which
triggers page faults that the KVM TCE device's kvm_spapr_tce_fault()
handles with the MMU on. And the huge DMA windows are not cleared by
VMs, which instead successfully create a DMA window big enough to map
the VM memory 1:1 and then map everything without clearing.

This started crashing now because the upcoming sriov-under-powervm
support added a mode in which a dynamic DMA window is not big enough to
map the VM memory 1:1 but is used anyway, and the VM is now the first
(i.e. not QEMU) to clear a just-created table. Note that upstream QEMU
needs to be modified for the VM to trigger the host OS crash.

This replaces WARN_ON_ONCE_RM() with a check and return, and adds
another warning for the case where the TCE is not being cleared.

Cc: Leonardo Bras 
Fixes: e1a1ef84cd07 ("KVM: PPC: Book3S: Allocate guest TCEs on demand too")
Signed-off-by: Alexey Kardashevskiy 
---

With recent changes in the printk() department, calling pr_err() with
the MMU off causes lockdep lockups which I did not dig into any further,
so we should start getting rid of the realmode WARN_ON_ONCE_RM().
---
 arch/powerpc/kvm/book3s_64_vio_hv.c | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index 083a4e037718..e5ba96c41f3f 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -173,10 +173,13 @@ static void kvmppc_rm_tce_put(struct kvmppc_spapr_tce_table *stt,
idx -= stt->offset;
page = stt->pages[idx / TCES_PER_PAGE];
/*
-* page must not be NULL in real mode,
-* kvmppc_rm_ioba_validate() must have taken care of this.
+* kvmppc_rm_ioba_validate() allows pages not be allocated if TCE is
+* being cleared, otherwise it returns H_TOO_HARD and we skip this.
 */
-   WARN_ON_ONCE_RM(!page);
+   if (!page) {
+   WARN_ON_ONCE_RM(tce != 0);
+   return;
+   }
tbl = kvmppc_page_address(page);
 
tbl[idx % TCES_PER_PAGE] = tce;
-- 
2.30.2



Re: [PATCH v15 10/12] swiotlb: Add restricted DMA pool initialization

2021-08-26 Thread Claire Chang
On Tue, Aug 24, 2021 at 10:26 PM Guenter Roeck  wrote:
>
> Hi Claire,
>
> On Thu, Jun 24, 2021 at 11:55:24PM +0800, Claire Chang wrote:
> > Add the initialization function to create restricted DMA pools from
> > matching reserved-memory nodes.
> >
> > Regardless of swiotlb setting, the restricted DMA pool is preferred if
> > available.
> >
> > The restricted DMA pools provide a basic level of protection against the
> > DMA overwriting buffer contents at unexpected times. However, to protect
> > against general data leakage and system memory corruption, the system
> > needs to provide a way to lock down the memory access, e.g., MPU.
> >
> > Signed-off-by: Claire Chang 
> > Reviewed-by: Christoph Hellwig 
> > Tested-by: Stefano Stabellini 
> > Tested-by: Will Deacon 
> > ---
> >  include/linux/swiotlb.h |  3 +-
> >  kernel/dma/Kconfig  | 14 
> >  kernel/dma/swiotlb.c| 76 +
> >  3 files changed, 92 insertions(+), 1 deletion(-)
> >
> > diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
> > index 3b9454d1e498..39284ff2a6cd 100644
> > --- a/include/linux/swiotlb.h
> > +++ b/include/linux/swiotlb.h
> > @@ -73,7 +73,8 @@ extern enum swiotlb_force swiotlb_force;
> >   *   range check to see if the memory was in fact allocated by this
> >   *   API.
> >   * @nslabs:  The number of IO TLB blocks (in groups of 64) between @start 
> > and
> > - *   @end. This is command line adjustable via setup_io_tlb_npages.
> > + *   @end. For default swiotlb, this is command line adjustable via
> > + *   setup_io_tlb_npages.
> >   * @used:The number of used IO TLB block.
> >   * @list:The free list describing the number of free entries available
> >   *   from each index.
> > diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig
> > index 77b405508743..3e961dc39634 100644
> > --- a/kernel/dma/Kconfig
> > +++ b/kernel/dma/Kconfig
> > @@ -80,6 +80,20 @@ config SWIOTLB
> >   bool
> >   select NEED_DMA_MAP_STATE
> >
> > +config DMA_RESTRICTED_POOL
> > + bool "DMA Restricted Pool"
> > + depends on OF && OF_RESERVED_MEM
> > + select SWIOTLB
>
> This makes SWIOTLB user configurable, which in turn results in
>
> mips64-linux-ld: arch/mips/kernel/setup.o: in function `arch_mem_init':
> setup.c:(.init.text+0x19c8): undefined reference to `plat_swiotlb_setup'
> make[1]: *** [Makefile:1280: vmlinux] Error 1
>
> when building mips:allmodconfig.
>
> Should this possibly be "depends on SWIOTLB" ?

Patch is sent here: https://lkml.org/lkml/2021/8/26/932
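For reference, the shape of that fix (a sketch for context here; the linked patch is authoritative) is to make the restricted pool depend on SWIOTLB rather than select it, so a user-visible option can no longer enable SWIOTLB on architectures that do not provide plat_swiotlb_setup():

```kconfig
config DMA_RESTRICTED_POOL
	bool "DMA Restricted Pool"
	depends on OF && OF_RESERVED_MEM && SWIOTLB
```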

>
> Thanks,
> Guenter

Thanks,
Claire


Re: [PATCH v2 4/5] KVM: selftests: Add a test for KVM_RUN+rseq to detect task migration bugs

2021-08-26 Thread Sean Christopherson
On Thu, Aug 26, 2021, Mathieu Desnoyers wrote:
> - On Aug 25, 2021, at 8:51 PM, Sean Christopherson sea...@google.com 
> wrote:
> >> >> +   r = sched_setaffinity(0, sizeof(allowed_mask), 
> >> >> _mask);
> >> >> +   TEST_ASSERT(!r, "sched_setaffinity failed, errno = %d 
> >> >> (%s)",
> >> >> +   errno, strerror(errno));
> >> >> +   atomic_inc(_cnt);
> >> >> +
> >> >> +   CPU_CLR(cpu, _mask);
> >> >> +
> >> >> +   /*
> >> >> +* Let the read-side get back into KVM_RUN to improve 
> >> >> the odds
> >> >> +* of task migration coinciding with KVM's run loop.
> >> > 
> >> > This comment should be about increasing the odds of letting the seqlock
> >> > read-side complete. Otherwise, the delay between the two back-to-back
> >> > atomic_inc is so small that the seqlock read-side may never have time to
> >> > complete the reading the rseq cpu id and the sched_getcpu() call, and can
> >> > retry forever.
> > 
> > Hmm, but that's not why there's a delay.  I'm not arguing that a livelock 
> > isn't
> > possible (though that syscall would have to be screaming fast),
> 
> I don't think we have the same understanding of the livelock scenario. AFAIU 
> the livelock
> would be caused by a too-small delay between the two consecutive atomic_inc() 
> from one
> loop iteration to the next compared to the time it takes to perform a 
> read-side critical
> section of the seqlock. Back-to-back atomic_inc can be performed very 
> quickly, so I
> doubt that the sched_getcpu implementation have good odds to be fast enough 
> to complete
> in that narrow window, leading to lots of read seqlock retry.

Ooooh, yeah, brain fart on my side.  I was thinking of the two
atomic_inc() in the same loop iteration and completely ignoring the next
iteration.  Yes, I 100% agree that a delay to ensure forward progress is
needed.  An assertion in main() that the reader completes at least some
reasonable number of KVM_RUNs is also probably a good idea, e.g. to rule
out a false pass due to the reader never making forward progress.

FWIW, the do-while loop does make forward progress without a delay, but at ~50% 
throughput, give or take.

> > but the primary motivation is very much to allow the read-side enough time
> > to get back into KVM proper.
> 
> I'm puzzled by your statement. AFAIU, let's say we don't have the delay, then 
> we
> have:
> 
> migration thread KVM_RUN/read-side thread
> ---
>  - ioctl(KVM_RUN)
> - atomic_inc_seq_cst()
> - sched_setaffinity
> - atomic_inc_seq_cst()
>  - a = atomic_load() & ~1
>  - smp_rmb()
>  - b = 
> LOAD_ONCE(__rseq_abi->cpu_id);
>  - sched_getcpu()
>  - smp_rmb()
>  - re-load seqcnt/compare 
> (succeeds)
>- Can only succeed if entire 
> read-side happens while the seqcnt
>  is in an even numbered state.
>  - if (a != b) abort()
>   /* no delay. Even counter state is very
>  short. */
> - atomic_inc_seq_cst()
>   /* Let's suppose the lack of delay causes the
>  setaffinity to complete too early compared
>  with KVM_RUN ioctl */
> - sched_setaffinity
> - atomic_inc_seq_cst()
> 
>   /* no delay. Even counter state is very
>  short. */
> - atomic_inc_seq_cst()
>   /* Then a setaffinity from a following
>  migration thread loop will run
>  concurrently with KVM_RUN */
>  - ioctl(KVM_RUN)
> - sched_setaffinity
> - atomic_inc_seq_cst()
> 
> As pointed out here, if the first setaffinity runs too early compared with 
> KVM_RUN,
> a following setaffinity will run concurrently with it. However, the fact that 
> the even counter state is very short may very well hurt progress of the read 
> seqlock.

*sigh*

Several hours later, I think I finally have my head wrapped around everything.

Due to the way the test is written and because of how KVM's run loop
works, TIF_NOTIFY_RESUME or TIF_NEED_RESCHED effectively has to be set
before KVM actually enters the guest, otherwise KVM will exit to
userspace without touching the flag, i.e. it will be handled by the
normal exit_to_user_mode_loop().

Where I got lost was trying to figure out why I couldn't make the bug
reproduce by causing the guest to exit to KVM, but not userspace, in
which case KVM should easily trigger the problematic flow as the window
for sched_getcpu() to collide with KVM would be enormous.  The reason I
didn't go down this 

Re: [PATCH v2 3/4] powerpc/64s/interrupt: Don't enable MSR[EE] in irq handlers unless perf is in use

2021-08-26 Thread Nicholas Piggin
Excerpts from kernel test robot's message of August 27, 2021 1:04 am:
> Hi Nicholas,
> 
> I love your patch! Yet something to improve:
> 
> [auto build test ERROR on powerpc/next]
> [also build test ERROR on v5.14-rc7 next-20210826]
> [cannot apply to scottwood/next]
> [If your patch is applied to the wrong git tree, kindly drop us a note.
> And when submitting patch, we suggest to use '--base' as documented in
> https://git-scm.com/docs/git-format-patch]
> 
> url:
> https://github.com/0day-ci/linux/commits/Nicholas-Piggin/powerpc-64s-interrupt-speedups/20210825-204209
> base:   https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git next
> config: powerpc-allnoconfig (attached as .config)

Fix for 32-bit:

diff --git a/arch/powerpc/include/asm/hw_irq.h b/arch/powerpc/include/asm/hw_irq.h
index 8c78c40c006e..55e3fa44f280 100644
--- a/arch/powerpc/include/asm/hw_irq.h
+++ b/arch/powerpc/include/asm/hw_irq.h
@@ -437,7 +437,7 @@ static inline bool arch_irq_disabled_regs(struct pt_regs *regs)
return !(regs->msr & MSR_EE);
 }
 
-static inline bool may_hard_irq_enable(void)
+static inline bool should_hard_irq_enable(void)
 {
return false;
 }




Re: [PATCH v2 1/2] powerpc/bug: Remove specific powerpc BUG_ON() and WARN_ON() on PPC32

2021-08-26 Thread Nicholas Piggin
Excerpts from Segher Boessenkool's message of August 27, 2021 1:30 am:
> On Fri, Aug 27, 2021 at 01:04:36AM +1000, Nicholas Piggin wrote:
>> Excerpts from Segher Boessenkool's message of August 27, 2021 12:37 am:
>> >> No, they are all dispatched and issue to the BRU for execution. It's 
>> >> trivial to construct a test of a lot of not taken branches in a row
>> >> and time a loop of it to see it executes at 1 cycle per branch.
>> > 
>> > (s/dispatched/issued/)
>> 
>> ?
> 
> Dispatch is from decode to the issue queues.  Issue is from there to
> execution units.  Dispatch is in-order, issue is not.

I know what those mean, I wonder what your s/dispatched/issued means.
I was saying they are dispatched in response to you saying they never
hit the issue queue.

> 
>> >> How could it validate prediction without issuing? It wouldn't know when
>> >> sources are ready.
>> > 
>> > In the backend.  But that is just how it worked on older cores :-/
>> 
>> Okay. I don't know about older cores than POWER9. Backend would normally 
>> include execution though.
>> Only other place you could do it if you don't
>> issue/exec would be after it goes back in order, like completion.
> 
> You do not have to do the verification in-order: the insn cannot finish
> until it is no longer speculative, that takes care of all ordering
> needed.

Branches *can* finish out of order and speculatively, as they do in P9
and P10. Are you talking about these CPUs, or something else that can
verify branches without issuing them?

> 
>> But that would be horrible for mispredict penalty.
> 
> See the previous point.  Also, any insn known to be mispredicted can be
> flushed immediately anyway.

The point is it has to know sources (CR) to verify (aka execute) the 
branch prediction was right, and if it needs sources then it needs to 
either issue and execute in the out of order part, or it needs to wait
until completion which would seem to be prohibitively expensive. I am
interested to know how it works.

> 
>> >> >> The first problem seems like the show stopper though. AFAIKS it would 
>> >> >> need a special builtin support that does something to create the table
>> >> >> entry, or a guarantee that we could put an inline asm right after the
>> >> >> builtin as a recognized pattern and that would give us the instruction
>> >> >> following the trap.
>> >> > 
>> >> > I'm not quite sure what this means.  Can't you always just put a
>> >> > 
>> >> > bla:asm("");
>> >> > 
>> >> > in there, and use the address of "bla"?
>> >> 
>> >> Not AFAIKS. Put it where?
>> > 
>> > After wherever you want to know the address after.  You will have to
>> > make sure they stay together somehow.
>> 
>> I still don't follow.
> 
> some_thing_you_want_to_know_the_address_after_let_us_call_it_A;
> empty_asm_that_we_can_take_the_address_of_known_as_B;
> 
> You have to make sure the compiler keeps A and B together, does not
> insert anything between them, does put them in the assembler output in
> the same fragment, etc.

How does all this help our problem of putting the address of the trap 
into the table?

> 
>> If you could give a built in that put a label at the address of the trap 
>> instruction that could be used later by inline asm then that could work
>> too:
>> 
>> __builtin_labeled_trap("1:");
>> asm (".section __bug_table,\"aw\"  \n\t"
>>  "2:  .4byte 1b - 2b   \n\t"
>>  ".previous");
> 
> How could a compiler do anything like that?!

How could it add a label at the trap instruction it generates? It didn't 
seem like an outlandish thing to do, but I'm not a compiler writer. It was 
just a handwaving idea to show what we want to be able to do.

Thanks,
Nick


Re: [PATCH v2 RESEND] powerpc/audit: Convert powerpc to AUDIT_ARCH_COMPAT_GENERIC

2021-08-26 Thread Paul Moore
On Thu, Aug 26, 2021 at 10:37 AM Michael Ellerman  wrote:
> Paul Moore  writes:
> > On Tue, Aug 24, 2021 at 1:11 PM Christophe Leroy
> >  wrote:
> >> Le 24/08/2021 à 16:47, Paul Moore a écrit :
> >> > On Tue, Aug 24, 2021 at 9:36 AM Christophe Leroy
> >> >  wrote:
> >> >>
> >> >> Commit e65e1fc2d24b ("[PATCH] syscall class hookup for all normal
> >> >> targets") added generic support for AUDIT but that didn't include
> >> >> support for bi-arch like powerpc.
> >> >>
> >> >> Commit 4b58841149dc ("audit: Add generic compat syscall support")
> >> >> added generic support for bi-arch.
> >> >>
> >> >> Convert powerpc to that bi-arch generic audit support.
> >> >>
> >> >> Cc: Paul Moore 
> >> >> Cc: Eric Paris 
> >> >> Signed-off-by: Christophe Leroy 
> >> >> ---
> >> >> Resending v2 with Audit people in Cc
> >> >>
> >> >> v2:
> >> >> - Missing 'git add' for arch/powerpc/include/asm/unistd32.h
> >> >> - Finalised commit description
> >> >> ---
> >> >>   arch/powerpc/Kconfig|  5 +-
> >> >>   arch/powerpc/include/asm/unistd32.h |  7 +++
> >> >>   arch/powerpc/kernel/Makefile|  3 --
> >> >>   arch/powerpc/kernel/audit.c | 84 -
> >> >>   arch/powerpc/kernel/compat_audit.c  | 44 ---
> >> >>   5 files changed, 8 insertions(+), 135 deletions(-)
> >> >>   create mode 100644 arch/powerpc/include/asm/unistd32.h
> >> >>   delete mode 100644 arch/powerpc/kernel/audit.c
> >> >>   delete mode 100644 arch/powerpc/kernel/compat_audit.c
> >> >
> >> > Can you explain, in detail please, the testing you have done to verify
> >> > this patch?
> >> >
> >>
> >> I built ppc64_defconfig and checked that the generated code is 
> >> functionnaly equivalent.
> >>
> >> ppc32_classify_syscall() is exactly the same as 
> >> audit_classify_compat_syscall() except that the
> >> later takes the syscall as second argument (ie in r4) whereas the former 
> >> takes it as first argument
> >> (ie in r3).
> >>
> >> audit_classify_arch() and powerpc audit_classify_syscall() are slightly 
> >> different between the
> >> powerpc version and the generic version because the powerpc version checks 
> >> whether it is
> >> AUDIT_ARCH_PPC or not (ie value 20), while the generic one checks whether 
> >> it has bit
> >> __AUDIT_ARCH_64BIT set or not (__AUDIT_ARCH_64BIT is the sign bit of a 
> >> word), but taking into
> >> account that the abi is either AUDIT_ARCH_PPC, AUDIT_ARCH_PPC64 or 
> >> AUDIT_ARCH_PPC64LE, the result is
> >> the same.
> >>
> >> If you are asking I guess you saw something wrong ?
> >
> > I was asking because I didn't see any mention of testing, and when you
> > are enabling something significant like this it is nice to see that it
> > has been verified to work :)
> >
> > While binary dumps and comparisons are nice, it is always good to see
> > verification from a test suite.  I don't have access to the necessary
> > hardware to test this, but could you verify that the audit-testsuite
> > passes on your test system with your patches applied?
> >
> >  * https://github.com/linux-audit/audit-testsuite
>
> I tested on ppc64le. Both before and after the patch I get the result
> below.
>
> So I guess the patch is OK, but maybe we have some existing issue.
>
> I had a bit of a look at the test code, but my perl is limited. I think
> it was running the command below, and it returned "", but
> not really sure what that means.

If it makes you feel any better, my perl is *very* limited; thankfully
this isn't my first time looking at that test :)

It's a little odd, but after some basic sanity tests at the top, the
test sets a watch on a file, /tmp/, and tells the kernel
to generate audit records for any syscall that operates on that file.
It then creates, and removes, a series of exclude audit filters to
check if the exclude filtering is working as expected, e.g. when
syscall filtering is excluded there should be no syscall records in
the audit log.

In the case you describe, it looks like the audit exclude filter is
removed (that's what line 147 does), the /tmp/ file is removed (line
152), and then we check to see if any syscall records exist (line 164,
and yes, there should be *something* there for the unlink/rm).

It may be of little consolation, but this test works just fine on
recent kernels running on both x86_64 and aarch64.  I don't have
access to a powerpc system of any vintage, but I added Richard to the
To line above in case he has easier access to a test system (I suspect
the RH/IBM linkage should help in this regard).  Otherwise I would
suggest starting to debug this by simply doing some basic tests using
auditctl to create rules and exclude rules to see what is working, and
what isn't; that might provide some clues.

Sorry I'm not much more help at this point :/

>   $ sudo ausearch -i -m SYSCALL -p 216440 -ui 0 -gi 0 -ul 0 -su unconfined 
> _u:unconfined_r:unconfined_t:s0-s0:c0.c1023 -ts recent
>   
>
> cheers
>
>
>
> Running as   userroot
>

Re: [PATCH v2 2/3] selftests/powerpc: Add test for real address error handling

2021-08-26 Thread Ganesh


On 8/26/21 8:57 AM, Michael Ellerman wrote:

Ganesh  writes:

On 8/24/21 6:18 PM, Michael Ellerman wrote:


Ganesh Goudar  writes:

Add test for real address or control memory address access
error handling, using NX-GZIP engine.

The error is injected by accessing the control memory address
using illegal instruction, on successful handling the process
attempting to access control memory address using illegal
instruction receives SIGBUS.

...


diff --git a/tools/testing/selftests/powerpc/mce/inject-ra-err.sh b/tools/testing/selftests/powerpc/mce/inject-ra-err.sh
new file mode 100755
index ..3633cdc651a1
--- /dev/null
+++ b/tools/testing/selftests/powerpc/mce/inject-ra-err.sh
@@ -0,0 +1,18 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later
+
+if [[ ! -w /dev/crypto/nx-gzip ]]; then
+   echo "WARN: Can't access /dev/crypto/nx-gzip, skipping"
+   exit 0
+fi
+
+timeout 5 ./inject-ra-err
+
+# 128 + 7 (SIGBUS) = 135, 128 is a exit code with special meaning.
+if [ $? -ne 135 ]; then
+   echo "FAILED: Real address or Control memory access error not handled"
+   exit $?
+fi
+
+echo "OK: Real address or Control memory access error is handled"
+exit 0

I don't think we really need the shell script, we should be able to do
all that in the C code.

Can you try this?

It works! We need to set a timeout; with a 120 sec timeout we may flood dmesg.

Hmm. Does it keep faulting? The regs->nip += 4 is meant to avoid that.


Yes, it keeps faulting if we fail to handle it and do not send SIGBUS to the process.



cheers


Re: [PATCH v2 2/2] powerpc/bug: Provide better flexibility to WARN_ON/__WARN_FLAGS() with asm goto

2021-08-26 Thread Nathan Chancellor
On Thu, Aug 26, 2021 at 11:54:17AM -0700, Nathan Chancellor wrote:
> On Thu, Aug 26, 2021 at 01:21:39PM +1000, Michael Ellerman wrote:
> > Nathan Chancellor  writes:
> > > On Tue, Apr 13, 2021 at 04:38:10PM +, Christophe Leroy wrote:
> > >> Using asm goto in __WARN_FLAGS() and WARN_ON() allows more
> > >> flexibility to GCC.
> > ...
> > >
> > > This patch as commit 1e688dd2a3d6 ("powerpc/bug: Provide better
> > > flexibility to WARN_ON/__WARN_FLAGS() with asm goto") cause a WARN_ON in
> > > klist_add_tail to trigger over and over on boot when compiling with
> > > clang:
> > >
> > > [2.177416][T1] WARNING: CPU: 0 PID: 1 at lib/klist.c:62 
> > > .klist_add_tail+0x3c/0x110
> > > [2.177456][T1] Modules linked in:
> > > [2.177481][T1] CPU: 0 PID: 1 Comm: swapper/0 Tainted: GW  
> > >5.14.0-rc7-next-20210825 #1
> > > [2.177520][T1] NIP:  c07ff81c LR: c090a038 CTR: 
> > > 
> > > [2.177557][T1] REGS: c73c32a0 TRAP: 0700   Tainted: G 
> > >W  (5.14.0-rc7-next-20210825)
> > > [2.177593][T1] MSR:  82029032   
> > > CR: 22000a40  XER: 
> > > [2.177667][T1] CFAR: c090a034 IRQMASK: 0
> > > [2.177667][T1] GPR00: c090a038 c73c3540 
> > > c1be3200 0001
> > > [2.177667][T1] GPR04: c72d65c0  
> > > c91ba798 c91bb0a0
> > > [2.177667][T1] GPR08: 0001  
> > > c8581918 fc00
> > > [2.177667][T1] GPR12: 44000240 c1dd 
> > > c0012300 
> > > [2.177667][T1] GPR16:   
> > >  
> > > [2.177667][T1] GPR20:   
> > >  
> > > [2.177667][T1] GPR24:  c17e3200 
> > >  c1a0e778
> > > [2.177667][T1] GPR28: c72d65b0 c72d65a8 
> > > c7de72c8 c73c35d0
> > > [2.178019][T1] NIP [c07ff81c] .klist_add_tail+0x3c/0x110
> > > [2.178058][T1] LR [c090a038] .bus_add_driver+0x148/0x290
> > > [2.178088][T1] Call Trace:
> > > [2.178105][T1] [c73c3540] [c73c35d0] 
> > > 0xc73c35d0 (unreliable)
> > > [2.178150][T1] [c73c35d0] [c090a038] 
> > > .bus_add_driver+0x148/0x290
> > > [2.178190][T1] [c73c3670] [c090fae8] 
> > > .driver_register+0xb8/0x190
> > > [2.178234][T1] [c73c3700] [c0be55c0] 
> > > .__hid_register_driver+0x70/0xd0
> > > [2.178275][T1] [c73c37a0] [c116955c] 
> > > .redragon_driver_init+0x34/0x58
> > > [2.178314][T1] [c73c3820] [c0011ae0] 
> > > .do_one_initcall+0x130/0x3b0
> > > [2.178357][T1] [c73c3bb0] [c11065e0] 
> > > .do_initcall_level+0xd8/0x188
> > > [2.178403][T1] [c73c3c50] [c11064a8] 
> > > .do_initcalls+0x7c/0xdc
> > > [2.178445][T1] [c73c3ce0] [c1106238] 
> > > .kernel_init_freeable+0x178/0x21c
> > > [2.178491][T1] [c73c3d90] [c0012334] 
> > > .kernel_init+0x34/0x220
> > > [2.178530][T1] [c73c3e10] [c000cf50] 
> > > .ret_from_kernel_thread+0x58/0x60
> > > [2.178569][T1] Instruction dump:
> > > [2.178592][T1] fba10078 7c7d1b78 3861 fb810070 3b9d0008 
> > > fbc10080 7c9e2378 389d0018
> > > [2.178662][T1] fb9d0008 fb9d0010 9064 fbdd <0b1e> 
> > > e87e0018 2823 41820024
> > > [2.178728][T1] ---[ end trace 52ed3431f58f1847 ]---
> > >
> > > Is this a bug with clang or is there something wrong with the patch? The
> > > vmlinux image is available at [1] if you want to inspect it and our QEMU
> > > command and the warning at boot can be viewed at [2]. If there is any
> > > other information I can provide, please let me know.
> > >
> > > [1] https://builds.tuxbuild.com/1xDcmp3Tvno0TTGxDVPedRKIKM2/
> > > [2] 
> > > https://github.com/ClangBuiltLinux/continuous-integration2/commit/cee159b66a58eb57fa2359e7888074b9da24126c/checks/3422232736/logs
> > 
> > Thanks.
> > 
> > This is the generated assembly:
> > 
> > c07ff600 <.klist_add_tail>:
> > c07ff600:   7c 08 02 a6 mflrr0
> > c07ff604:   f8 01 00 10 std r0,16(r1)
> > c07ff608:   f8 21 ff 71 stdur1,-144(r1) ^ prolog
> > c07ff60c:   fb a1 00 78 std r29,120(r1) save r29 to 
> > stack
> > c07ff610:   7c 7d 1b 78 mr  r29,r3  r29 = 
> > struct klist_node *n
> > c07ff614:   38 60 00 01 li  r3,1r3 = 1
> > c07ff618:   fb 81 00 70 std r28,112(r1) save r28 to 
> > stack
> > c07ff61c:  

Re: [PATCH v2 2/2] powerpc/bug: Provide better flexibility to WARN_ON/__WARN_FLAGS() with asm goto

2021-08-26 Thread Nathan Chancellor
On Thu, Aug 26, 2021 at 01:21:39PM +1000, Michael Ellerman wrote:
> Nathan Chancellor  writes:
> > On Tue, Apr 13, 2021 at 04:38:10PM +, Christophe Leroy wrote:
> >> Using asm goto in __WARN_FLAGS() and WARN_ON() allows more
> >> flexibility to GCC.
> ...
> >
> > This patch as commit 1e688dd2a3d6 ("powerpc/bug: Provide better
> > flexibility to WARN_ON/__WARN_FLAGS() with asm goto") cause a WARN_ON in
> > klist_add_tail to trigger over and over on boot when compiling with
> > clang:
> >
> > [2.177416][T1] WARNING: CPU: 0 PID: 1 at lib/klist.c:62 
> > .klist_add_tail+0x3c/0x110
> > [2.177456][T1] Modules linked in:
> > [2.177481][T1] CPU: 0 PID: 1 Comm: swapper/0 Tainted: GW
> >  5.14.0-rc7-next-20210825 #1
> > [2.177520][T1] NIP:  c07ff81c LR: c090a038 CTR: 
> > 
> > [2.177557][T1] REGS: c73c32a0 TRAP: 0700   Tainted: G   
> >  W  (5.14.0-rc7-next-20210825)
> > [2.177593][T1] MSR:  82029032   CR: 
> > 22000a40  XER: 
> > [2.177667][T1] CFAR: c090a034 IRQMASK: 0
> > [2.177667][T1] GPR00: c090a038 c73c3540 
> > c1be3200 0001
> > [2.177667][T1] GPR04: c72d65c0  
> > c91ba798 c91bb0a0
> > [2.177667][T1] GPR08: 0001  
> > c8581918 fc00
> > [2.177667][T1] GPR12: 44000240 c1dd 
> > c0012300 
> > [2.177667][T1] GPR16:   
> >  
> > [2.177667][T1] GPR20:   
> >  
> > [2.177667][T1] GPR24:  c17e3200 
> >  c1a0e778
> > [2.177667][T1] GPR28: c72d65b0 c72d65a8 
> > c7de72c8 c73c35d0
> > [2.178019][T1] NIP [c07ff81c] .klist_add_tail+0x3c/0x110
> > [2.178058][T1] LR [c090a038] .bus_add_driver+0x148/0x290
> > [2.178088][T1] Call Trace:
> > [2.178105][T1] [c73c3540] [c73c35d0] 
> > 0xc73c35d0 (unreliable)
> > [2.178150][T1] [c73c35d0] [c090a038] 
> > .bus_add_driver+0x148/0x290
> > [2.178190][T1] [c73c3670] [c090fae8] 
> > .driver_register+0xb8/0x190
> > [2.178234][T1] [c73c3700] [c0be55c0] 
> > .__hid_register_driver+0x70/0xd0
> > [2.178275][T1] [c73c37a0] [c116955c] 
> > .redragon_driver_init+0x34/0x58
> > [2.178314][T1] [c73c3820] [c0011ae0] 
> > .do_one_initcall+0x130/0x3b0
> > [2.178357][T1] [c73c3bb0] [c11065e0] 
> > .do_initcall_level+0xd8/0x188
> > [2.178403][T1] [c73c3c50] [c11064a8] 
> > .do_initcalls+0x7c/0xdc
> > [2.178445][T1] [c73c3ce0] [c1106238] 
> > .kernel_init_freeable+0x178/0x21c
> > [2.178491][T1] [c73c3d90] [c0012334] 
> > .kernel_init+0x34/0x220
> > [2.178530][T1] [c73c3e10] [c000cf50] 
> > .ret_from_kernel_thread+0x58/0x60
> > [2.178569][T1] Instruction dump:
> > [2.178592][T1] fba10078 7c7d1b78 3861 fb810070 3b9d0008 
> > fbc10080 7c9e2378 389d0018
> > [2.178662][T1] fb9d0008 fb9d0010 9064 fbdd <0b1e> 
> > e87e0018 2823 41820024
> > [2.178728][T1] ---[ end trace 52ed3431f58f1847 ]---
> >
> > Is this a bug with clang or is there something wrong with the patch? The
> > vmlinux image is available at [1] if you want to inspect it and our QEMU
> > command and the warning at boot can be viewed at [2]. If there is any
> > other information I can provide, please let me know.
> >
> > [1] https://builds.tuxbuild.com/1xDcmp3Tvno0TTGxDVPedRKIKM2/
> > [2] 
> > https://github.com/ClangBuiltLinux/continuous-integration2/commit/cee159b66a58eb57fa2359e7888074b9da24126c/checks/3422232736/logs
> 
> Thanks.
> 
> This is the generated assembly:
> 
> c07ff600 <.klist_add_tail>:
> c07ff600:   7c 08 02 a6 mflrr0
> c07ff604:   f8 01 00 10 std r0,16(r1)
> c07ff608:   f8 21 ff 71 stdur1,-144(r1)   ^ prolog
> c07ff60c:   fb a1 00 78 std r29,120(r1)   save r29 to 
> stack
> c07ff610:   7c 7d 1b 78 mr  r29,r3r29 = 
> struct klist_node *n
> c07ff614:   38 60 00 01 li  r3,1  r3 = 1
> c07ff618:   fb 81 00 70 std r28,112(r1)   save r28 to 
> stack
> c07ff61c:   3b 9d 00 08 addir28,r29,8 r28 = >n_node
> c07ff620:   fb c1 00 80 std r30,128(r1)   save r30 to 
> stack
> c07ff624:   7c 9e 23 78 mr  r30,r4r30 = 
> struct klist *k
> c07ff628: 

Re: [PATCH v2 4/5] KVM: selftests: Add a test for KVM_RUN+rseq to detect task migration bugs

2021-08-26 Thread Mathieu Desnoyers
- On Aug 25, 2021, at 8:51 PM, Sean Christopherson sea...@google.com wrote:

> On Mon, Aug 23, 2021, Mathieu Desnoyers wrote:
>> [ re-send to Darren Hart ]
>> 
>> - On Aug 23, 2021, at 11:18 AM, Mathieu Desnoyers
>> mathieu.desnoy...@efficios.com wrote:
>> 
>> > - On Aug 20, 2021, at 6:50 PM, Sean Christopherson sea...@google.com 
>> > wrote:
>> > 
>> >> Add a test to verify an rseq's CPU ID is updated correctly if the task is
>> >> migrated while the kernel is handling KVM_RUN.  This is a regression test
>> >> for a bug introduced by commit 72c3c0fe54a3 ("x86/kvm: Use generic xfer
>> >> to guest work function"), where TIF_NOTIFY_RESUME would be cleared by KVM
>> >> without updating rseq, leading to a stale CPU ID and other badness.
>> >> 
>> > 
>> > [...]
>> > 
>> > +#define RSEQ_SIG 0xdeadbeef
>> > 
>> > Is there any reason for defining a custom signature rather than including
>> > tools/testing/selftests/rseq/rseq.h ? This should take care of including
>> > the proper architecture header which will define the appropriate signature.
>> > 
>> > Arguably you don't define rseq critical sections in this test per se, but
>> > I'm wondering why the custom signature here.
> 
> Partly to avoid taking a dependency on rseq.h, and partly to try to call out
> that
> the test doesn't actually do any rseq critical sections.

It might be good to add a comment near this define stating this then, so nobody
attempts to copy this as an example.

> 
>> > [...]
>> > 
>> >> +
>> >> +static void *migration_worker(void *ign)
>> >> +{
>> >> + cpu_set_t allowed_mask;
>> >> + int r, i, nr_cpus, cpu;
>> >> +
>> >> + CPU_ZERO(_mask);
>> >> +
>> >> + nr_cpus = CPU_COUNT(_mask);
>> >> +
>> >> + for (i = 0; i < 2; i++) {
>> >> + cpu = i % nr_cpus;
>> >> + if (!CPU_ISSET(cpu, _mask))
>> >> + continue;
>> >> +
>> >> + CPU_SET(cpu, _mask);
>> >> +
>> >> + /*
>> >> +  * Bump the sequence count twice to allow the reader to detect
>> >> +  * that a migration may have occurred in between rseq and sched
>> >> +  * CPU ID reads.  An odd sequence count indicates a migration
>> >> +  * is in-progress, while a completely different count indicates
>> >> +  * a migration occurred since the count was last read.
>> >> +  */
>> >> + atomic_inc(_cnt);
>> > 
>> > So technically this atomic_inc contains the required barriers because the
>> > selftests implementation uses "__sync_add_and_fetch(>val, 1)". But
>> > it's rather odd that the semantic differs from the kernel implementation in
>> > terms of memory barriers: the kernel implementation of atomic_inc
>> > guarantees no memory barriers, but this one happens to provide full
>> > barriers pretty much by accident (selftests futex/include/atomic.h
>> > documents no such guarantee).
> 
> Yeah, I got quite lost trying to figure out what atomics the test would 
> actually
> end up with.

At the very least, until things are clarified in the selftests atomic header,
I would recommend adding a comment stating which memory barriers are required
alongside each use of atomic_inc here. I would even prefer that we add extra
(currently unneeded) write barriers to make extra sure that this stays
documented. Performance of the write-side does not matter much here.

> 
>> > If this full barrier guarantee is indeed provided by the selftests atomic.h
>> > header, I would really like a comment stating that in the atomic.h header
>> > so the carpet is not pulled from under our feet by a future optimization.
>> > 
>> > 
>> >> + r = sched_setaffinity(0, sizeof(allowed_mask), &allowed_mask);
>> >> + TEST_ASSERT(!r, "sched_setaffinity failed, errno = %d (%s)",
>> >> + errno, strerror(errno));
>> >> + atomic_inc(&seq_cnt);
>> >> +
>> >> + CPU_CLR(cpu, &allowed_mask);
>> >> +
>> >> + /*
>> >> +  * Let the read-side get back into KVM_RUN to improve the odds
>> >> +  * of task migration coinciding with KVM's run loop.
>> > 
>> > This comment should be about increasing the odds of letting the seqlock
>> > read-side complete. Otherwise, the delay between the two back-to-back
>> > atomic_inc is so small that the seqlock read-side may never have time to
>> > complete the reading the rseq cpu id and the sched_getcpu() call, and can
>> > retry forever.
> 
> Hmm, but that's not why there's a delay.  I'm not arguing that a livelock
> isn't possible (though that syscall would have to be screaming fast),

I don't think we have the same understanding of the livelock scenario. AFAIU
the livelock would be caused by a too-small delay between the two consecutive
atomic_inc() from one loop iteration to the next compared to the time it takes
to perform a read-side critical section of the seqlock. Back-to-back atomic_inc
can be performed very quickly, so I doubt that the sched_getcpu implementation
has good odds to be fast enough to complete in that narrow window.

Re: [next-20210820][ppc][nvme/raid] pci unbind WARNS at fs/kernfs/dir.c:1524 kernfs_remove_by_name_ns+

2021-08-26 Thread Christoph Hellwig
Are you sure this is the very latest linux-next?  This should have been
fixed by:

"block: add back the bd_holder_dir reference in bd_link_disk_holder"


Re: [PATCH v2 1/2] powerpc/bug: Remove specific powerpc BUG_ON() and WARN_ON() on PPC32

2021-08-26 Thread Segher Boessenkool
On Fri, Aug 27, 2021 at 01:04:36AM +1000, Nicholas Piggin wrote:
> Excerpts from Segher Boessenkool's message of August 27, 2021 12:37 am:
> >> No, they are all dispatched and issue to the BRU for execution. It's 
> >> trivial to construct a test of a lot of not taken branches in a row
> >> and time a loop of it to see it executes at 1 cycle per branch.
> > 
> > (s/dispatched/issued/)
> 
> ?

Dispatch is from decode to the issue queues.  Issue is from there to
execution units.  Dispatch is in-order, issue is not.

> >> How could it validate prediction without issuing? It wouldn't know when
> >> sources are ready.
> > 
> > In the backend.  But that is just how it worked on older cores :-/
> 
> Okay. I don't know about older cores than POWER9. Backend would normally 
> include execution though.
> Only other place you could do it if you don't
> issue/exec would be after it goes back in order, like completion.

You do not have to do the verification in-order: the insn cannot finish
until it is no longer speculative, that takes care of all ordering
needed.

> But that would be horrible for mispredict penalty.

See the previous point.  Also, any insn known to be mispredicted can be
flushed immediately anyway.

> >> >> The first problem seems like the show stopper though. AFAIKS it would 
> >> >> need a special builtin support that does something to create the table
> >> >> entry, or a guarantee that we could put an inline asm right after the
> >> >> builtin as a recognized pattern and that would give us the instruction
> >> >> following the trap.
> >> > 
> >> > I'm not quite sure what this means.  Can't you always just put a
> >> > 
> >> > bla: asm("");
> >> > 
> >> > in there, and use the address of "bla"?
> >> 
> >> Not AFAIKS. Put it where?
> > 
> > After wherever you want to know the address after.  You will have to
> > make sure they stay together somehow.
> 
> I still don't follow.

some_thing_you_want_to_know_the_address_after_let_us_call_it_A;
empty_asm_that_we_can_take_the_address_of_known_as_B;

You have to make sure the compiler keeps A and B together, does not
insert anything between them, does put them in the assembler output in
the same fragment, etc.

> If you could give a built in that put a label at the address of the trap 
> instruction that could be used later by inline asm then that could work
> too:
> 
> __builtin_labeled_trap("1:");
> asm (".section __bug_table,\"aw\"  \n\t"
>  "2:  .4byte 1b - 2b   \n\t"
>  ".previous");

How could a compiler do anything like that?!


Segher


Re: [PATCH v2 3/4] powerpc/64s/interrupt: Don't enable MSR[EE] in irq handlers unless perf is in use

2021-08-26 Thread kernel test robot
Hi Nicholas,

I love your patch! Yet something to improve:

[auto build test ERROR on powerpc/next]
[also build test ERROR on v5.14-rc7 next-20210826]
[cannot apply to scottwood/next]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:
https://github.com/0day-ci/linux/commits/Nicholas-Piggin/powerpc-64s-interrupt-speedups/20210825-204209
base:   https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git next
config: powerpc-allnoconfig (attached as .config)
compiler: powerpc-linux-gcc (GCC) 11.2.0
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# https://github.com/0day-ci/linux/commit/a0eb195f49a01ed45b3f97815470f9c8acaa4991
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review Nicholas-Piggin/powerpc-64s-interrupt-speedups/20210825-204209
git checkout a0eb195f49a01ed45b3f97815470f9c8acaa4991
# save the attached .config to linux build tree
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-11.2.0 make.cross ARCH=powerpc

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot 

All errors (new ones prefixed by >>):

   arch/powerpc/kernel/irq.c: In function '__do_irq':
>> arch/powerpc/kernel/irq.c:742:13: error: implicit declaration of function 
>> 'should_hard_irq_enable'; did you mean 'do_hard_irq_enable'? 
>> [-Werror=implicit-function-declaration]
 742 | if (should_hard_irq_enable())
 | ^~
 | do_hard_irq_enable
   In file included from :
   In function 'do_hard_irq_enable',
   inlined from '__do_irq' at arch/powerpc/kernel/irq.c:743:3:
>> include/linux/compiler_types.h:328:45: error: call to 
>> '__compiletime_assert_34' declared with attribute error: BUILD_BUG failed
 328 | _compiletime_assert(condition, msg, __compiletime_assert_, 
__COUNTER__)
 | ^
   include/linux/compiler_types.h:309:25: note: in definition of macro 
'__compiletime_assert'
 309 | prefix ## suffix();  
   \
 | ^~
   include/linux/compiler_types.h:328:9: note: in expansion of macro 
'_compiletime_assert'
 328 | _compiletime_assert(condition, msg, __compiletime_assert_, 
__COUNTER__)
 | ^~~
   include/linux/build_bug.h:39:37: note: in expansion of macro 
'compiletime_assert'
  39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
 | ^~
   include/linux/build_bug.h:59:21: note: in expansion of macro 
'BUILD_BUG_ON_MSG'
  59 | #define BUILD_BUG() BUILD_BUG_ON_MSG(1, "BUILD_BUG failed")
 | ^~~~
   arch/powerpc/include/asm/hw_irq.h:447:9: note: in expansion of macro 
'BUILD_BUG'
 447 | BUILD_BUG();
 | ^
   cc1: all warnings being treated as errors
--
   arch/powerpc/kernel/time.c: In function 'timer_interrupt':
>> arch/powerpc/kernel/time.c:570:13: error: implicit declaration of function 
>> 'should_hard_irq_enable'; did you mean 'do_hard_irq_enable'? 
>> [-Werror=implicit-function-declaration]
 570 | if (should_hard_irq_enable()) {
 | ^~
 | do_hard_irq_enable
   In file included from :
   In function 'do_hard_irq_enable',
   inlined from 'timer_interrupt' at arch/powerpc/kernel/time.c:584:3,
   inlined from 'timer_interrupt' at arch/powerpc/kernel/time.c:553:1:
>> include/linux/compiler_types.h:328:45: error: call to 
>> '__compiletime_assert_34' declared with attribute error: BUILD_BUG failed
 328 | _compiletime_assert(condition, msg, __compiletime_assert_, 
__COUNTER__)
 | ^
   include/linux/compiler_types.h:309:25: note: in definition of macro 
'__compiletime_assert'
 309 | prefix ## suffix();  
   \
 | ^~
   include/linux/compiler_types.h:328:9: note: in expansion of macro 
'_compiletime_assert'
 328 | _compiletime_assert(condition, msg, __compiletime_assert_, 
__COUNTER__)
 | ^~~
   include/linux/build_bug.h:39:37: note: in expansion of macro 
'compiletime_assert'
  39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
 | ^~
   include/linux/build_bug.h:59:21: 

Re: [GIT PULL] retire legacy WR sbc8548 and sbc8641 platforms

2021-08-26 Thread Michael Ellerman
Paul Gortmaker  writes:
> This is unchanged from the original wr_sbc-delete branch sent in January,
> other than to add the Acks from Scott in July, and update the baseline.
>
> Built with ppc64 defconfig and mpc85xx_cds_defconfig and mpc86xx_defconfig
> just to make sure I didn't fat finger anything in the baseline update.

Thanks for following up on this.

I ended up cherry-picking the patches into my branch. I like to keep my
next based on rc2, and merging this would have pulled in everything up
to rc7 into my branch.

I don't think you were planning to merge this branch anywhere else, so
it shouldn't make any difference, but let me know if it's a problem.

It should appear here soon:

  https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git/log/?h=next


cheers
 

> Original v1 text follows below, from:
>
> https://lore.kernel.org/lkml/20210111082823.99562-1-paul.gortma...@windriver.com
>
> It would be nice to get this in and off our collective to-do list.
>
> Thanks,
> Paul.
>
>   ---
>
> In v2.6.27 (2008, 917f0af9e5a9) the sbc8260 support was implicitly
> retired by not being carried forward through the ppc --> powerpc
> device tree transition.
>
> Then, in v3.6 (2012, b048b4e17cbb) we retired the support for the
> sbc8560 boards.
>
> Next, in v4.18 (2017, 3bc6cf5a86e5) we retired the support for the
> 2006 vintage sbc834x boards.
>
> The sbc8548 and sbc8641d boards were maybe 1-2 years newer than the
> sbc834x boards, but it is also 3+ years later, so it makes sense to
> now retire them as well - which is what is done here.
>
> These two remaining WR boards were based on the Freescale MPC8548-CDS
> and the MPC8641D-HPCN reference board implementations.  Having had the
> chance to use these and many other Fsl ref boards, I know this:  The
> Freescale reference boards were typically produced in limited quantity
> and primarily available to BSP developers and hardware designers, and
> not likely to have found a 2nd life with hobbyists and/or collectors.
>
> It was good to have that BSP code subjected to mainline review and
> hence also widely available back in the day. But given the above, we
> should probably also be giving serious consideration to retiring
> additional similar age/type reference board platforms as well.
>
> I've always felt it is important for us to be proactive in retiring
> old code, since it has a genuine non-zero carrying cost, as described
> in the 930d52c012b8 merge log.  But for the here and now, we just
> clean up the remaining BSP code that I had added for SBC platforms.
>
> --- 
>
> The following changes since commit e22ce8eb631bdc47a4a4ea7ecf4e4ba499db4f93:
>
>   Linux 5.14-rc7 (2021-08-22 14:24:56 -0700)
>
> are available in the Git repository at:
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux.git 
> wr_sbc-delete-v2
>
> for you to fetch changes up to d44e2dc12ea2112e74cdd25090eeda2727ed09cc:
>
>   MAINTAINERS: update for Paul Gortmaker (2021-08-24 08:19:01 -0400)
>
> 
> Paul Gortmaker (3):
>   powerpc: retire sbc8548 board support
>   powerpc: retire sbc8641d board support
>   MAINTAINERS: update for Paul Gortmaker
>
>  MAINTAINERS |   1 -
>  arch/powerpc/boot/Makefile  |   1 -
>  arch/powerpc/boot/dts/fsl/sbc8641d.dts  | 176 -
>  arch/powerpc/boot/dts/sbc8548-altflash.dts  | 111 ---
>  arch/powerpc/boot/dts/sbc8548-post.dtsi | 289 
> 
>  arch/powerpc/boot/dts/sbc8548-pre.dtsi  |  48 -
>  arch/powerpc/boot/dts/sbc8548.dts   | 106 --
>  arch/powerpc/boot/wrapper   |   2 +-
>  arch/powerpc/configs/85xx/sbc8548_defconfig |  50 -
>  arch/powerpc/configs/mpc85xx_base.config|   1 -
>  arch/powerpc/configs/mpc86xx_base.config|   1 -
>  arch/powerpc/configs/ppc6xx_defconfig   |   1 -
>  arch/powerpc/platforms/85xx/Kconfig |   6 -
>  arch/powerpc/platforms/85xx/Makefile|   1 -
>  arch/powerpc/platforms/85xx/sbc8548.c   | 134 -
>  arch/powerpc/platforms/86xx/Kconfig |   8 +-
>  arch/powerpc/platforms/86xx/Makefile|   1 -
>  arch/powerpc/platforms/86xx/sbc8641d.c  |  87 -
>  18 files changed, 2 insertions(+), 1022 deletions(-)
>  delete mode 100644 arch/powerpc/boot/dts/fsl/sbc8641d.dts
>  delete mode 100644 arch/powerpc/boot/dts/sbc8548-altflash.dts
>  delete mode 100644 arch/powerpc/boot/dts/sbc8548-post.dtsi
>  delete mode 100644 arch/powerpc/boot/dts/sbc8548-pre.dtsi
>  delete mode 100644 arch/powerpc/boot/dts/sbc8548.dts
>  delete mode 100644 arch/powerpc/configs/85xx/sbc8548_defconfig
>  delete mode 100644 arch/powerpc/platforms/85xx/sbc8548.c
>  delete mode 100644 arch/powerpc/platforms/86xx/sbc8641d.c


Re: [PATCH v2 1/2] powerpc/bug: Remove specific powerpc BUG_ON() and WARN_ON() on PPC32

2021-08-26 Thread Nicholas Piggin
Excerpts from Segher Boessenkool's message of August 27, 2021 12:37 am:
> On Thu, Aug 26, 2021 at 11:57:52PM +1000, Nicholas Piggin wrote:
>> Excerpts from Segher Boessenkool's message of August 26, 2021 10:49 pm:
>> > On Thu, Aug 26, 2021 at 01:26:14PM +1000, Nicholas Piggin wrote:
>> >> Excerpts from Segher Boessenkool's message of August 19, 2021 1:06 am:
>> >> > On Fri, Aug 13, 2021 at 04:08:13PM +1000, Nicholas Piggin wrote:
>> >> >> This one possibly the branches end up in predictors, whereas 
>> >> >> conditional 
>> >> >> trap is always just speculated not to hit. Branches may also have a
>> >> >> throughput limit on execution whereas trap could be more (1 per cycle
>> >> >> vs 4 per cycle on POWER9).
>> >> > 
>> >> > I thought only *taken* branches are just one per cycle?
>> >> 
>> >> Taken branches are fetched by the front end at one per cycle (assuming 
>> >> they hit the BTAC), but all branches have to be executed by BR at one 
>> >> per cycle
>> > 
>> > This is not true.  (Simple) predicted not-taken conditional branches are
>> > just folded out, never hit the issue queues.  And they are fetched as
>> > many together as fit in a fetch group, can complete without limits as
>> > well.
>> 
>> No, they are all dispatched and issue to the BRU for execution. It's 
>> trivial to construct a test of a lot of not taken branches in a row
>> and time a loop of it to see it executes at 1 cycle per branch.
> 
> (s/dispatched/issued/)

?

> 
> Huh, this was true on p8 already.  Sorry for my confusion.  In my
> defence, this doesn't matter for performance on "real code".
> 
>> > Correctly predicted simple conditional branches just get their prediction
>> > validated (and that is not done in the execution units).  Incorrectly
>> > predicted branches the same, but those cause a redirect and refetch.
>> 
>> How could it validate prediction without issuing? It wouldn't know when
>> sources are ready.
> 
> In the backend.  But that is just how it worked on older cores :-/

Okay. I don't know about older cores than POWER9. Backend would normally 
include execution though. Only other place you could do it if you don't
issue/exec would be after it goes back in order, like completion. But
that would be horrible for mispredict penalty.

>> >> The first problem seems like the show stopper though. AFAIKS it would 
>> >> need a special builtin support that does something to create the table
>> >> entry, or a guarantee that we could put an inline asm right after the
>> >> builtin as a recognized pattern and that would give us the instruction
>> >> following the trap.
>> > 
>> > I'm not quite sure what this means.  Can't you always just put a
>> > 
>> > bla:   asm("");
>> > 
>> > in there, and use the address of "bla"?
>> 
>> Not AFAIKS. Put it where?
> 
> After wherever you want to know the address after.  You will have to
> make sure they stay together somehow.

I still don't follow.

> It is much easier to get the address of something, not the address after
> it.  If you know it is just one insn anyway, that isn't hard to handle
> either (even if prefixed insns exist).
> 
>> > If not, you need to say a lot
>> > more about what you actually want to do :-/
>> 
>> We need to put the address (or relative offset) of the trap instruction
>> into an entry in a different section. Basically this:
>> 
>>__builtin_trap();
>>asm ("1:   \n\t"
>> ".section __bug_table,\"aw\"  \n\t"
>> "2:  .4byte 1b - 2b - 4   \n\t"
>> ".previous");
>> 
>> Where the 1: label must follow immediately after the trap instruction, 
>> and where the asm must be emitted even for the case of no-return traps 
>> (but anything following the asm could be omitted).
> 
> The address *after* the trap insn?  That is much much harder than the
> address *of* the trap insn.

No the address of the trap instruction, hence the -4. I have the label
after because that is the semantics I have from inline asm.

If you could give a built in that put a label at the address of the trap 
instruction that could be used later by inline asm then that could work
too:

__builtin_labeled_trap("1:");
asm (".section __bug_table,\"aw\"  \n\t"
 "2:  .4byte 1b - 2b   \n\t"
 ".previous");


Re: [PATCH v2 2/2] powerpc/bug: Provide better flexibility to WARN_ON/__WARN_FLAGS() with asm goto

2021-08-26 Thread Christophe Leroy




Le 26/08/2021 à 16:45, Michael Ellerman a écrit :
> Christophe Leroy  writes:
>> Le 26/08/2021 à 05:21, Michael Ellerman a écrit :
>>> Nathan Chancellor  writes:
>>>> On Tue, Apr 13, 2021 at 04:38:10PM +, Christophe Leroy wrote:
>>>>> Using asm goto in __WARN_FLAGS() and WARN_ON() allows more
>>>>> flexibility to GCC.
>>> ...
>>>> This patch as commit 1e688dd2a3d6 ("powerpc/bug: Provide better
>>>> flexibility to WARN_ON/__WARN_FLAGS() with asm goto") cause a WARN_ON in
>>>> klist_add_tail to trigger over and over on boot when compiling with
>>>> clang:
>> ...
>>> This patch seems to fix it. Not sure if that's just papering over it though.
>>>
>>> diff --git a/arch/powerpc/include/asm/bug.h b/arch/powerpc/include/asm/bug.h
>>> index 1ee0f22313ee..75fcb4370d96 100644
>>> --- a/arch/powerpc/include/asm/bug.h
>>> +++ b/arch/powerpc/include/asm/bug.h
>>> @@ -119,7 +119,7 @@ __label_warn_on:				\
>>> 								\
>>> 		WARN_ENTRY(PPC_TLNEI " %4, 0",			\
>>> 			   BUGFLAG_WARNING | BUGFLAG_TAINT(TAINT_WARN),	\
>>> -			   __label_warn_on, "r" (x));		\
>>> +			   __label_warn_on, "r" (!!(x)));	\
>>> 		break;						\
>>> __label_warn_on:						\
>>> 		__ret_warn_on = true;				\
>>
>> But for a simple WARN_ON() call:
>>
>> void test(unsigned long b)
>> {
>> 	WARN_ON(b);
>> }
>>
>> Without your change with GCC you get:
>>
>> 12d0 <.test>:
>>   12d0:	0b 03 00 00	tdnei   r3,0
>>   12d4:	4e 80 00 20	blr
>>
>> With the !! change you get:
>>
>> 12d0 <.test>:
>>   12d0:	31 23 ff ff	addic   r9,r3,-1
>>   12d4:	7d 29 19 10	subfe   r9,r9,r3
>>   12d8:	0b 09 00 00	tdnei   r9,0
>>   12dc:	4e 80 00 20	blr
>
> Yeah that's a pity.
>
> We could do something like below, which is ugly, but would be better
> than having to revert the whole thing.

Yes looks nice, we already had that kind of stuff in the past, even more ugly.

> Although this doesn't fix the strange warning in drivers/net/ethernet/sfc.

What's the warning ?

> So possibly we need a CLANG ifdef around the whole thing, and use the
> old style warn for clang.


Christophe


Re: [PATCH v2 2/2] powerpc/bug: Provide better flexibility to WARN_ON/__WARN_FLAGS() with asm goto

2021-08-26 Thread Michael Ellerman
Christophe Leroy  writes:
> Le 26/08/2021 à 05:21, Michael Ellerman a écrit :
>> Nathan Chancellor  writes:
>>> On Tue, Apr 13, 2021 at 04:38:10PM +, Christophe Leroy wrote:
 Using asm goto in __WARN_FLAGS() and WARN_ON() allows more
 flexibility to GCC.
>> ...
>>>
>>> This patch as commit 1e688dd2a3d6 ("powerpc/bug: Provide better
>>> flexibility to WARN_ON/__WARN_FLAGS() with asm goto") cause a WARN_ON in
>>> klist_add_tail to trigger over and over on boot when compiling with
>>> clang:
>
> ...
>
>> 
>> This patch seems to fix it. Not sure if that's just papering over it though.
>> 
>> diff --git a/arch/powerpc/include/asm/bug.h b/arch/powerpc/include/asm/bug.h
>> index 1ee0f22313ee..75fcb4370d96 100644
>> --- a/arch/powerpc/include/asm/bug.h
>> +++ b/arch/powerpc/include/asm/bug.h
>> @@ -119,7 +119,7 @@ __label_warn_on:				\
>> 								\
>> 		WARN_ENTRY(PPC_TLNEI " %4, 0",			\
>> 			   BUGFLAG_WARNING | BUGFLAG_TAINT(TAINT_WARN),	\
>> -			   __label_warn_on, "r" (x));		\
>> +			   __label_warn_on, "r" (!!(x)));	\
>> 		break;						\
>> __label_warn_on:						\
>> 		__ret_warn_on = true;				\
>
> But for a simple WARN_ON() call:
>
> void test(unsigned long b)
> {
>   WARN_ON(b);
> }
>
> Without your change with GCC you get:
>
> 12d0 <.test>:
>  12d0:0b 03 00 00 tdnei   r3,0
>  12d4:4e 80 00 20 blr
>
>
> With the !! change you get:
>
> 12d0 <.test>:
>  12d0:31 23 ff ff addic   r9,r3,-1
>  12d4:7d 29 19 10 subfe   r9,r9,r3
>  12d8:0b 09 00 00 tdnei   r9,0
>  12dc:4e 80 00 20 blr

Yeah that's a pity.

We could do something like below, which is ugly, but would be better
than having to revert the whole thing.

Although this doesn't fix the strange warning in drivers/net/ethernet/sfc.

So possibly we need a CLANG ifdef around the whole thing, and use the
old style warn for clang.

cheers


diff --git a/arch/powerpc/include/asm/bug.h b/arch/powerpc/include/asm/bug.h
index 1ee0f22313ee..d978d9004d0d 100644
--- a/arch/powerpc/include/asm/bug.h
+++ b/arch/powerpc/include/asm/bug.h
@@ -106,6 +106,12 @@ __label_warn_on:					\
 	}								\
 } while (0)
 
+#ifdef CONFIG_CC_IS_CLANG
+#define __clang_warn_hack(x)   (!!(x))
+#else
+#define __clang_warn_hack(x)   (x)
+#endif
+
 #define WARN_ON(x) ({  \
bool __ret_warn_on = false; \
do {\
@@ -119,7 +125,8 @@ __label_warn_on:					\
 									\
 		WARN_ENTRY(PPC_TLNEI " %4, 0",				\
 			   BUGFLAG_WARNING | BUGFLAG_TAINT(TAINT_WARN),	\
-			   __label_warn_on, "r" (x));			\
+			   __label_warn_on,				\
+			   "r" __clang_warn_hack(x));			\
 		break;							\
 __label_warn_on:							\
 		__ret_warn_on = true;					\




[next-20210820][ppc][nvme/raid] pci unbind WARNS at fs/kernfs/dir.c:1524 kernfs_remove_by_name_ns+

2021-08-26 Thread Abdul Haleem

Greeting's

Today's linux-next kernel WARN's while unbind of pci ssd flash device on 
my powerpc box


# lspci -nn
001d:80:00.0 Non-Volatile memory controller [0108]: Seagate Technology 
PLC Nytro Flash Storage [1bb1:0100]
001e:90:00.0 Non-Volatile memory controller [0108]: Seagate Technology 
PLC Nytro Flash Storage [1bb1:0100]


$ echo -n 001d:80:00.0 > /sys/bus/pci/drivers/nvme/unbind

md: md127: raid0 array has a missing/failed member
Buffer I/O error on dev md127, logical block 1498959, async page read
Buffer I/O error on dev md127, logical block 1498959, async page read
md127: detected capacity change from 191866880 to 0
md: md127 stopped.
[ cut here ]
kernfs: can not remove 'md127', no directory
WARNING: CPU: 21 PID: 11006 at fs/kernfs/dir.c:1524 kernfs_remove_by_name_ns+0xc0/0x110


Modules linked in: dm_mod rpadlpar_io rpaphp kvm_pr kvm nf_tables
libcrc32c nfnetlink tcp_diag udp_diag inet_diag unix_diag af_packet_diag
netlink_diag bridge stp llc rfkill sg pseries_rng raid0 xts vmx_crypto
uio_pdrv_genirq gf128mul uio nfsd auth_rpcgss nfs_acl lockd grace sunrpc
binfmt_misc sch_fq_codel ip_tables ext4 mbcache jbd2 sd_mod ibmvscsi
ibmvnic ibmveth scsi_transport_srp nvme nvme_core t10_pi
CPU: 21 PID: 11006 Comm: mdadm Not tainted 5.14.0-rc6-next-20210820-autotest #1


NIP:  c056dda0 LR: c056dd9c CTR: 007088ec
REGS: c4593680 TRAP: 0700   Not tainted  (5.14.0-rc6-next-20210820-autotest)
MSR:  8282b033   CR: 48044224  XER: 000d

CFAR: c0145f00 IRQMASK: 0
GPR00: c056dd9c c4593920 c19b3200 002c
GPR04: 7fff c45935f0 0027 c0077cd07e08
GPR08: 0023 0001 0027 c18694b0
GPR12: 4000 c0001ec40680 7fff84875040 
GPR16:  0001  01001e990280
GPR20:   0066 00030d40
GPR24: c0002bc591e8  c0002bc591c8 c00025765c00
GPR28: c00025765c70 c00025765c00  c00025790640
NIP [c056dda0] kernfs_remove_by_name_ns+0xc0/0x110
LR [c056dd9c] kernfs_remove_by_name_ns+0xbc/0x110
Call Trace:
[c4593920] [c056dd9c] kernfs_remove_by_name_ns+0xbc/0x110 (unreliable)

[c45939b0] [c0572368] sysfs_remove_link+0x28/0x70
[c45939d0] [c0678bd4] bd_unlink_disk_holder+0xc4/0x130
[c4593a10] [c098e4e0] unbind_rdev_from_array+0x40/0x1b0
[c4593ac0] [c09921e0] do_md_stop+0x410/0x5a0
[c4593bc0] [c09966ac] md_ioctl+0xc6c/0x1c60
[c4593cd0] [c0654ebc] blkdev_ioctl+0x2fc/0x740
[c4593d40] [c04ce284] block_ioctl+0x74/0x90
[c4593d60] [c047b6e8] sys_ioctl+0xf8/0x150
[c4593db0] [c002ff28] system_call_exception+0x158/0x2c0
[c4593e10] [c000c764] system_call_common+0xf4/0x258
--- interrupt: c00 at 0x7fff84790290
NIP:  7fff84790290 LR: 00013d5c2ef0 CTR: 
REGS: c4593e80 TRAP: 0c00   Not tainted  (5.14.0-rc6-next-20210820-autotest)
MSR:  8280f033   CR: 28044288  XER: 

IRQMASK: 0
GPR00: 0036 7fffd0148020 7fff84877300 0003
GPR04: 2932  00030d40 7fffd0148028
GPR08: 0003   
GPR12:  7fff8494a780 7fff84875040 
GPR16:  0001  01001e990280
GPR20:   0066 00030d40
GPR24: 2932 01001e992ac0  01001e990a40
GPR28:  0018 7fffd0148100 0003
NIP [7fff84790290] 0x7fff84790290
LR [00013d5c2ef0] 0x13d5c2ef0
--- interrupt: c00
Instruction dump:
ebe10088 38210090 e8010010 ebc1fff0 7c0803a6 4e800020 6000 6000
3c62ff5c 3863d168 4bbd8109 6000 <0fe0> fba10078 fbe10088 6000
---[ end trace 634fa04d6dac7dfd ]---

--
Regard's

Abdul Haleem
IBM Linux Technology Center



Re: [PATCH v2 1/2] powerpc/bug: Remove specific powerpc BUG_ON() and WARN_ON() on PPC32

2021-08-26 Thread Segher Boessenkool
On Thu, Aug 26, 2021 at 11:57:52PM +1000, Nicholas Piggin wrote:
> Excerpts from Segher Boessenkool's message of August 26, 2021 10:49 pm:
> > On Thu, Aug 26, 2021 at 01:26:14PM +1000, Nicholas Piggin wrote:
> >> Excerpts from Segher Boessenkool's message of August 19, 2021 1:06 am:
> >> > On Fri, Aug 13, 2021 at 04:08:13PM +1000, Nicholas Piggin wrote:
> >> >> This one possibly the branches end up in predictors, whereas 
> >> >> conditional 
> >> >> trap is always just speculated not to hit. Branches may also have a
> >> >> throughput limit on execution whereas trap could be more (1 per cycle
> >> >> vs 4 per cycle on POWER9).
> >> > 
> >> > I thought only *taken* branches are just one per cycle?
> >> 
> >> Taken branches are fetched by the front end at one per cycle (assuming 
> >> they hit the BTAC), but all branches have to be executed by BR at one 
> >> per cycle
> > 
> > This is not true.  (Simple) predicted not-taken conditional branches are
> > just folded out, never hit the issue queues.  And they are fetched as
> > many together as fit in a fetch group, can complete without limits as
> > well.
> 
> No, they are all dispatched and issue to the BRU for execution. It's 
> trivial to construct a test of a lot of not taken branches in a row
> and time a loop of it to see it executes at 1 cycle per branch.

(s/dispatched/issued/)

Huh, this was true on p8 already.  Sorry for my confusion.  In my
defence, this doesn't matter for performance on "real code".

> > Correctly predicted simple conditional branches just get their prediction
> > validated (and that is not done in the execution units).  Incorrectly
> > predicted branches the same, but those cause a redirect and refetch.
> 
> How could it validate prediction without issuing? It wouldn't know when
> sources are ready.

In the backend.  But that is just how it worked on older cores :-/

> >> The first problem seems like the show stopper though. AFAIKS it would 
> >> need a special builtin support that does something to create the table
> >> entry, or a guarantee that we could put an inline asm right after the
> >> builtin as a recognized pattern and that would give us the instruction
> >> following the trap.
> > 
> > I'm not quite sure what this means.  Can't you always just put a
> > 
> > bla:asm("");
> > 
> > in there, and use the address of "bla"?
> 
> Not AFAIKS. Put it where?

After wherever you want to know the address after.  You will have to
make sure they stay together somehow.

It is much easier to get the address of something, not the address after
it.  If you know it is just one insn anyway, that isn't hard to handle
either (even if prefixed insns exist).

> > If not, you need to say a lot
> > more about what you actually want to do :-/
> 
> We need to put the address (or relative offset) of the trap instruction
> into an entry in a different section. Basically this:
> 
>__builtin_trap();
>asm ("1:   \n\t"
> ".section __bug_table,\"aw\"  \n\t"
> "2:  .4byte 1b - 2b - 4   \n\t"
> ".previous");
> 
> Where the 1: label must follow immediately after the trap instruction, 
> and where the asm must be emitted even for the case of no-return traps 
> (but anything following the asm could be omitted).

The address *after* the trap insn?  That is much much harder than the
address *of* the trap insn.


Segher


Re: [PATCH v2 RESEND] powerpc/audit: Convert powerpc to AUDIT_ARCH_COMPAT_GENERIC

2021-08-26 Thread Michael Ellerman
Paul Moore  writes:
> On Tue, Aug 24, 2021 at 1:11 PM Christophe Leroy
>  wrote:
>> Le 24/08/2021 à 16:47, Paul Moore a écrit :
>> > On Tue, Aug 24, 2021 at 9:36 AM Christophe Leroy
>> >  wrote:
>> >>
>> >> Commit e65e1fc2d24b ("[PATCH] syscall class hookup for all normal
>> >> targets") added generic support for AUDIT but that didn't include
>> >> support for bi-arch like powerpc.
>> >>
>> >> Commit 4b58841149dc ("audit: Add generic compat syscall support")
>> >> added generic support for bi-arch.
>> >>
>> >> Convert powerpc to that bi-arch generic audit support.
>> >>
>> >> Cc: Paul Moore 
>> >> Cc: Eric Paris 
>> >> Signed-off-by: Christophe Leroy 
>> >> ---
>> >> Resending v2 with Audit people in Cc
>> >>
>> >> v2:
>> >> - Missing 'git add' for arch/powerpc/include/asm/unistd32.h
>> >> - Finalised commit description
>> >> ---
>> >>   arch/powerpc/Kconfig|  5 +-
>> >>   arch/powerpc/include/asm/unistd32.h |  7 +++
>> >>   arch/powerpc/kernel/Makefile|  3 --
>> >>   arch/powerpc/kernel/audit.c | 84 -
>> >>   arch/powerpc/kernel/compat_audit.c  | 44 ---
>> >>   5 files changed, 8 insertions(+), 135 deletions(-)
>> >>   create mode 100644 arch/powerpc/include/asm/unistd32.h
>> >>   delete mode 100644 arch/powerpc/kernel/audit.c
>> >>   delete mode 100644 arch/powerpc/kernel/compat_audit.c
>> >
>> > Can you explain, in detail please, the testing you have done to verify
>> > this patch?
>> >
>>
>> I built ppc64_defconfig and checked that the generated code is functionnaly 
>> equivalent.
>>
>> ppc32_classify_syscall() is exactly the same as 
>> audit_classify_compat_syscall() except that the
>> later takes the syscall as second argument (ie in r4) whereas the former 
>> takes it as first argument
>> (ie in r3).
>>
>> audit_classify_arch() and powerpc's audit_classify_syscall() are slightly
>> different from the generic versions: the powerpc version checks whether
>> the arch is AUDIT_ARCH_PPC or not (ie value 20), while the generic one
>> checks whether bit __AUDIT_ARCH_64BIT is set or not (__AUDIT_ARCH_64BIT
>> is the sign bit of a word). But taking into account that the abi is
>> either AUDIT_ARCH_PPC, AUDIT_ARCH_PPC64 or AUDIT_ARCH_PPC64LE, the
>> result is the same.
>>
>> If you are asking, I guess you saw something wrong?
>
> I was asking because I didn't see any mention of testing, and when you
> are enabling something significant like this it is nice to see that it
> has been verified to work :)
>
> While binary dumps and comparisons are nice, it is always good to see
> verification from a test suite.  I don't have access to the necessary
> hardware to test this, but could you verify that the audit-testsuite
> passes on your test system with your patches applied?
>
>  * https://github.com/linux-audit/audit-testsuite

I tested on ppc64le. Both before and after the patch I get the result
below.

So I guess the patch is OK, but maybe we have some existing issue.

I had a bit of a look at the test code, but my perl is limited. I think
it was running the command below, and it returned "", but
I'm not really sure what that means.

  $ sudo ausearch -i -m SYSCALL -p 216440 -ui 0 -gi 0 -ul 0 -su unconfined 
_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 -ts recent
  

cheers



Running as   userroot
with context unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
on   system  Fedora

backlog_wait_time_actual_reset/test .. ok
exec_execve/test . ok
exec_name/test ... ok
file_create/test . ok
file_delete/test . ok
file_rename/test . ok
filter_exclude/test .. 1/21
# Test 20 got: "256" (filter_exclude/test at line 167)
#Expected: "0"
#  filter_exclude/test line 167 is: ok( $result, 0 );
# Test 21 got: "0" (filter_exclude/test at line 179)
#Expected: "1"
#  filter_exclude/test line 179 is: ok( $found_msg, 1 );
filter_exclude/test .. Failed 2/21 subtests
filter_saddr_fam/test  ok
filter_sessionid/test  ok
login_tty/test ... ok
lost_reset/test .. ok
netfilter_pkt/test ... ok
syscalls_file/test ... ok
syscall_module/test .. ok
time_change/test . ok
user_msg/test  ok
fanotify/test  ok
bpf/test . ok

Test Summary Report
---
filter_exclude/test(Wstat: 0 Tests: 21 Failed: 2)
  Failed tests:  20-21
Files=18, Tests=202, 45 wallclock secs ( 0.18 usr  0.03 sys + 20.15 cusr  0.92 
csys = 21.28 CPU)
Result: FAIL
Failed 1/18 test programs. 2/202 subtests failed.


Re: [PATCH linux-next] powerpc/tm: remove duplicate include in tm-poison.c

2021-08-26 Thread Michael Ellerman
Christophe Leroy  writes:
> Le 24/08/2021 à 16:40, Shuah Khan a écrit :
>> On 8/5/21 12:52 AM, cgel@gmail.com wrote:
>>> From: yong yiran 
>>>
>>> 'inttypes.h' included in 'tm-poison.c' is duplicated.
>>> Remove all but the first include of inttypes.h from tm-poison.c.
>>>
>>> Reported-by: Zeal Robot 
>>> Signed-off-by: yong yiran 
>>> ---
>>>   tools/testing/selftests/powerpc/tm/tm-poison.c | 1 -
>>>   1 file changed, 1 deletion(-)
>>>
>>> diff --git a/tools/testing/selftests/powerpc/tm/tm-poison.c 
>>> b/tools/testing/selftests/powerpc/tm/tm-poison.c
>>> index 29e5f26af7b9..27c083a03d1f 100644
>>> --- a/tools/testing/selftests/powerpc/tm/tm-poison.c
>>> +++ b/tools/testing/selftests/powerpc/tm/tm-poison.c
>>> @@ -20,7 +20,6 @@
>>>   #include 
>>>   #include 
>>>   #include 
>>> -#include 
>>>   #include "tm.h"
>>>
>> 
>> We can't accept this patch. The from and Signed-off-by don't match.
>> 
>
> As far as I can see they match. You have:
>
> From: yong yiran 
> Signed-off-by: yong yiran 

Regardless I already have a patch queued to fix this, from a different
CI bot.

cheers


Re: [PATCH] powerpc: Make set_endian() return EINVAL when not supporting little endian

2021-08-26 Thread Michael Ellerman
Christophe Leroy  writes:
> Le 26/08/2021 à 05:41, Michael Ellerman a écrit :
>> Christophe Leroy  writes:
>>> There is no point in modifying MSR_LE bit on CPUs not supporting
>>> little endian.
>> 
>> Isn't that an ABI break?
>
> Or an ABI fix ? I don't know.

It could break existing applications, even if the new semantics make
more sense. So that's a break IMHO :)

> My first thought was that all other 32-bit architectures were returning
> -EINVAL, but looking at the man page of prctl, it is explicit that this
> is powerpc only.

It could be generic, but yeah seems we're the only arch that implements
it.

>> set_endian(PR_ENDIAN_BIG) should work on a big endian CPU, even if it
>> does nothing useful.
>
> Fair enough. But shouldn't in that case get_endian() return PR_ENDIAN_BIG 
> instead of returning EINVAL ?

> We can do one or the other, but I think it should at least be consistent
> between them, shouldn't it?

It should be consistent, but it isn't, and if we change it now we
potentially break existing userspace, which is bad.

I don't think it's widely used, and the risk of breakage would be
minimal, but it's not zero.

So I'm not sure it's worth changing it just for the sake of consistency.

cheers


Re: [PATCH v2 2/2] powerpc/bug: Provide better flexibility to WARN_ON/__WARN_FLAGS() with asm goto

2021-08-26 Thread Segher Boessenkool
On Thu, Aug 26, 2021 at 01:21:39PM +1000, Michael Ellerman wrote:
> So there seems to be some misunderstanding with clang, it doesn't like
> us passing an expression to the inline asm.
> 
> AFAIK it is legal to pass expressions as inputs to inline asm, ie. it
> doesn't have to just be a variable name.

It certainly is.  That is the whole point of inline asm!  This way, all
of the C code "around" the asm can be optimised.

> This patch seems to fix it. Not sure if that's just papering over it though.

It is, and it makes less optimised code (also on GCC), as Christophe
points out.


Segher


Re: [PATCH v2 1/2] powerpc/bug: Remove specific powerpc BUG_ON() and WARN_ON() on PPC32

2021-08-26 Thread Nicholas Piggin
Excerpts from Segher Boessenkool's message of August 26, 2021 10:49 pm:
> Hi!
> 
> On Thu, Aug 26, 2021 at 01:26:14PM +1000, Nicholas Piggin wrote:
>> Excerpts from Segher Boessenkool's message of August 19, 2021 1:06 am:
>> > On Fri, Aug 13, 2021 at 04:08:13PM +1000, Nicholas Piggin wrote:
>> >> With this one, possibly the branches end up in the predictors, whereas a
>> >> conditional trap is always just speculated not to hit. Branches may also
>> >> have a throughput limit on execution whereas traps could do more (1 per
>> >> cycle vs 4 per cycle on POWER9).
>> > 
>> > I thought only *taken* branches are just one per cycle?
>> 
>> Taken branches are fetched by the front end at one per cycle (assuming 
>> they hit the BTAC), but all branches have to be executed by BR at one 
>> per cycle
> 
> This is not true.  (Simple) predicted not-taken conditional branches are
> just folded out, never hit the issue queues.  And they are fetched as
> many together as fit in a fetch group, can complete without limits as
> well.

No, they are all dispatched and issued to the BRU for execution. It's
trivial to construct a test of a lot of not-taken branches in a row
and time a loop of it to see it executes at 1 cycle per branch.

> The BTAC is a frontend thing, used for target address prediction.  It
> does not limit execution.

I didn't say it did.

> Correctly predicted simple conditional branches just get their prediction
> validated (and that is not done in the execution units).  Incorrectly
> predicted branches the same, but those cause a redirect and refetch.

How could it validate prediction without issuing? It wouldn't know when
sources are ready.

> 
>> > Internally *all* traps are conditional, in GCC.  It also can optimise
>> > them quite well.  There must be something in the kernel macros that
>> > prevents good optimisation.
>> 
>> I did take a look at it at one point.
>> 
>> One problem is that the kernel needs the address of the trap instruction 
>> to create the entry for it. The other problem is that __builtin_trap 
>> does not return so it can't be used for WARN. LLVM at least seems to 
>> have a __builtin_debugtrap which does return.
> 
> This is .

Aha.

>> The first problem seems like the show stopper though. AFAIKS it would 
>> need a special builtin support that does something to create the table
>> entry, or a guarantee that we could put an inline asm right after the
>> builtin as a recognized pattern and that would give us the instruction
>> following the trap.
> 
> I'm not quite sure what this means.  Can't you always just put a
> 
> bla:  asm("");
> 
> in there, and use the address of "bla"?

Not AFAIKS. Put it where?

> If not, you need to say a lot
> more about what you actually want to do :-/

We need to put the address (or relative offset) of the trap instruction
into an entry in a different section. Basically this:

   __builtin_trap();
   asm ("1:   \n\t"
".section __bug_table,\"aw\"  \n\t"
"2:  .4byte 1b - 2b - 4   \n\t"
".previous");

Where the 1: label must follow immediately after the trap instruction, 
and where the asm must be emitted even for the case of no-return traps 
(but anything following the asm could be omitted).

Thanks,
Nick


Re: [PATCH v2 2/2] powerpc/bug: Provide better flexibility to WARN_ON/__WARN_FLAGS() with asm goto

2021-08-26 Thread Segher Boessenkool
On Thu, Aug 26, 2021 at 08:37:09AM +0200, Christophe Leroy wrote:
> Le 26/08/2021 à 05:21, Michael Ellerman a écrit :
> >This patch seems to fix it. Not sure if that's just papering over it 
> >though.
> >
> >diff --git a/arch/powerpc/include/asm/bug.h 
> >b/arch/powerpc/include/asm/bug.h
> >index 1ee0f22313ee..75fcb4370d96 100644
> >--- a/arch/powerpc/include/asm/bug.h
> >+++ b/arch/powerpc/include/asm/bug.h
> >@@ -119,7 +119,7 @@ __label_warn_on:  \
> > \
> > WARN_ENTRY(PPC_TLNEI " %4, 0",  \
> >BUGFLAG_WARNING | 
> >BUGFLAG_TAINT(TAINT_WARN),   \
> >-   __label_warn_on, "r" (x));   \
> >+   __label_warn_on, "r" (!!(x))); \
> > break;  \
> >  __label_warn_on:   \
> > __ret_warn_on = true;   \
> 
> But for a simple WARN_ON() call:
> 
> void test(unsigned long b)
> {
>   WARN_ON(b);
> }
> 
> Without your change with GCC you get:
> 
> 12d0 <.test>:
> 12d0: 0b 03 00 00 tdnei   r3,0
> 12d4: 4e 80 00 20 blr
> 
> 
> With the !! change you get:
> 
> 12d0 <.test>:
> 12d0: 31 23 ff ff addic   r9,r3,-1
> 12d4: 7d 29 19 10 subfe   r9,r9,r3
> 12d8: 0b 09 00 00 tdnei   r9,0
> 12dc: 4e 80 00 20 blr

That is because the asm (unlike the builtin) cannot be optimised by the
compiler.


Segher


Re: [PATCH v2 3/3] powerpc/numa: Fill distance_lookup_table for offline nodes

2021-08-26 Thread Michael Ellerman
Srikar Dronamraju  writes:
> Scheduler expects unique number of node distances to be available at
> boot.

I think it needs "the number of unique node distances" ?

> It uses node distance to calculate this unique node distances.

It iterates over all pairs of nodes and records node_distance() for that
pair, and then calculates the set of unique distances.

> On POWER, node distances for offline nodes is not available. However,
> POWER already knows unique possible node distances.

I think it would be more accurate to say PAPR rather than POWER there.
It's PAPR that defines the way we determine distances and imposes that
limitation.

> Fake the offline node's distance_lookup_table entries so that all
> possible node distances are updated.

Does this work if we have a single node offline at boot?

Say we start with:

node distances:
node   0   1
  0:  10  20
  1:  20  10

And node 2 is offline at boot. We can only initialise that node's entries
in the distance_lookup_table:

while (i--)
distance_lookup_table[node][i] = node;

By filling them all with 2 that causes node_distance(2, X) to return the
maximum distance for all other nodes X, because we won't break out of
the loop in __node_distance():

for (i = 0; i < distance_ref_points_depth; i++) {
if (distance_lookup_table[a][i] == distance_lookup_table[b][i])
break;

/* Double the distance for each NUMA level */
distance *= 2;
}

If distance_ref_points_depth was 4 we'd return 160.

That'd leave us with 3 unique distances at boot, 10, 20, 160.

But when node 2 comes online it might introduce more than 1 new distance
value, eg. it could be that the actual distances are:

node distances:
node   0   1   2
  0:  10  20  40
  1:  20  10  80
  2:  40  80  10

ie. we now have 4 distances, 10, 20, 40, 80.

What am I missing?

> However this only needs to be done if the number of unique node
> distances that can be computed for online nodes is less than the
> number of possible unique node distances as represented by
> distance_ref_points_depth.

Looking at a few machines they all have distance_ref_points_depth = 2.

So maybe that explains it, in practice we only see 10, 20, 40.


> When the node is actually onlined, distance_lookup_table will be
> updated with actual entries.

> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: Nathan Lynch 
> Cc: Michael Ellerman 
> Cc: Ingo Molnar 
> Cc: Peter Zijlstra 
> Cc: Valentin Schneider 
> Cc: Gautham R Shenoy 
> Cc: Vincent Guittot 
> Cc: Geetika Moolchandani 
> Cc: Laurent Dufour 
> Cc: kernel test robot 
> Signed-off-by: Srikar Dronamraju 
> ---
>  arch/powerpc/mm/numa.c | 70 ++
>  1 file changed, 70 insertions(+)
>
> Changelog:
> v1: 
> https://lore.kernel.org/linuxppc-dev/20210701041552.112072-3-sri...@linux.vnet.ibm.com/t/#u
> [ Fixed a missing prototype warning Reported-by: kernel test robot 
> ]
>
> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index 3c124928a16d..0ee79a08c9e1 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -856,6 +856,75 @@ void __init dump_numa_cpu_topology(void)
>   }
>  }
>  
> +/*
> + * Scheduler expects unique number of node distances to be available at
> + * boot. It uses node distance to calculate this unique node distances. On
> + * POWER, node distances for offline nodes is not available. However, POWER
> + * already knows unique possible node distances. Fake the offline node's
> + * distance_lookup_table entries so that all possible node distances are
> + * updated.
> + */

> +static void __init fake_update_distance_lookup_table(void)
> +{
> + unsigned long distance_map;
> + int i, nr_levels, nr_depth, node;

Are they distances, depths, or levels? :)

Bit more consistency in the variable names might make the code easier to
follow.

> +
> + if (!numa_enabled)
> + return;
> +
> + if (!form1_affinity)
> + return;

That doesn't exist since Aneesh's FORM2 series, so that will need a
rebase, and possibly some more rework to interact with that series.

> + /*
> +  * distance_ref_points_depth lists the unique numa domains
> +  * available. However it ignore LOCAL_DISTANCE. So add +1
> +  * to get the actual number of unique distances.
> +  */
> + nr_depth = distance_ref_points_depth + 1;

num_depths would be a better name IMHO.

> +
> + WARN_ON(nr_depth > sizeof(distance_map));

Warn but then continue, and corrupt something on the stack? Seems like a
bad idea :)

I guess it's too early to use bitmap_alloc(). But can we at least return
if nr_depth is too big.

> +
> + bitmap_zero(&distance_map, nr_depth);
> + bitmap_set(&distance_map, 0, 1);
> +
> + for_each_online_node(node) {
> + int nd, distance = LOCAL_DISTANCE;
> +
> + if (node == first_online_node)
> + continue;
> +
> + nd 

Re: [PATCH v2 1/2] powerpc/bug: Remove specific powerpc BUG_ON() and WARN_ON() on PPC32

2021-08-26 Thread Segher Boessenkool
Hi!

On Thu, Aug 26, 2021 at 01:26:14PM +1000, Nicholas Piggin wrote:
> Excerpts from Segher Boessenkool's message of August 19, 2021 1:06 am:
> > On Fri, Aug 13, 2021 at 04:08:13PM +1000, Nicholas Piggin wrote:
> >> With this one, possibly the branches end up in the predictors, whereas a
> >> conditional trap is always just speculated not to hit. Branches may also
> >> have a throughput limit on execution whereas traps could do more (1 per
> >> cycle vs 4 per cycle on POWER9).
> > 
> > I thought only *taken* branches are just one per cycle?
> 
> Taken branches are fetched by the front end at one per cycle (assuming 
> they hit the BTAC), but all branches have to be executed by BR at one 
> per cycle

This is not true.  (Simple) predicted not-taken conditional branches are
just folded out, never hit the issue queues.  And they are fetched as
many together as fit in a fetch group, can complete without limits as
well.

The BTAC is a frontend thing, used for target address prediction.  It
does not limit execution.

Correctly predicted simple conditional branches just get their prediction
validated (and that is not done in the execution units).  Incorrectly
predicted branches the same, but those cause a redirect and refetch.

> > Internally *all* traps are conditional, in GCC.  It also can optimise
> > them quite well.  There must be something in the kernel macros that
> > prevents good optimisation.
> 
> I did take a look at it at one point.
> 
> One problem is that the kernel needs the address of the trap instruction 
> to create the entry for it. The other problem is that __builtin_trap 
> does not return so it can't be used for WARN. LLVM at least seems to 
> have a __builtin_debugtrap which does return.

This is .

> The first problem seems like the show stopper though. AFAIKS it would 
> need a special builtin support that does something to create the table
> entry, or a guarantee that we could put an inline asm right after the
> builtin as a recognized pattern and that would give us the instruction
> following the trap.

I'm not quite sure what this means.  Can't you always just put a

bla:asm("");

in there, and use the address of "bla"?  If not, you need to say a lot
more about what you actually want to do :-/


Segher


Re: [PATCH v3] PCI: Move pci_dev_is/assign_added() to pci.h

2021-08-26 Thread Niklas Schnelle
On Wed, 2021-08-25 at 14:04 -0500, Bjorn Helgaas wrote:
> On Mon, Aug 23, 2021 at 12:53:39PM +0200, Niklas Schnelle wrote:
> > On Fri, 2021-08-20 at 17:37 -0500, Bjorn Helgaas wrote:
> > > On Tue, Jul 20, 2021 at 05:01:45PM +0200, Niklas Schnelle wrote:
> > > > The helper function pci_dev_is_added() from drivers/pci/pci.h is used in
> > > > PCI arch code of both s390 and powerpc leading to awkward relative
> > > > includes. Move it to the global include/linux/pci.h and get rid of these
> > > > includes just for that one function.
> > > 
> > > I agree the includes are awkward.
> > > 
> > > But the arch code *using* pci_dev_is_added() seems awkward, too.
> > 
> > See below for my interpretation why s390 has some driver like
> > functionality in its arch code which isn't necessarily awkward.
> > 
> > Independent from that, I have found pci_dev_is_added() to be the only
> > way to deal with the case that one might be looking at a struct pci_dev
> > reference that has been removed via pci_stop_and_remove_bus_device() or
> > has never been fully scanned. This is quite useful when handling error
> > events which on s390 are part of the adapter event mechanism shared
> > with channel I/O devices.
> > 
> > > AFAICS, in powerpc, pci_dev_is_added() is only used by
> > > pnv_pci_ioda_fixup_iov() and pseries_pci_fixup_iov_resources().  Those
> > > are only called from pcibios_add_device(), which is only called from
> > > pci_device_add().
> > > 
> > > Is it even possible for pci_dev_is_added() to be true in that path?
> 
> If the pci_dev_is_added() in powerpc is unreachable, we can remove it
> and at least reduce this to an s390-only problem.

Ok. I might be missing something, but I agree it does look like these are
called from within pcibios_add_device() only, so pci_dev_is_added() can
never be true. This looks pretty clear as pci_dev_assign_added() is
called after pcibios_add_device() in pci_bus_add_device().

> 
> > > s390 uses pci_dev_is_added() in recover_store()
> > 
> > I'm actually looking into this as I'm working on an s390 implementation
> > of the PCI recovery flow described in Documentation/PCI/pci-error-
> > recovery.rst that would also call pci_dev_is_added() because when we
> > get a platform notification of a PCI reset done by firmware it may be
> > that the struct pci_dev is going away i.e. we still have a ref count
> > but it is not added to the PCI bus anymore. And pci_dev_is_added() is
> > the only way I've found to check for this state.
> > 
> > > , but I don't know what
> > > that is (looks like a sysfs file, but it's not documented) or why s390
> > > is the only arch that does this.
> > 
> > Good point about this not being documented, I'll look into adding docs.
> > 
> > This is a sysfs attribute that basically removes the pci_dev and re-
> > adds it. This has the complication that since the attribute sits at
> > /sys/bus/pci/devices//recover it deletes its own parent directory
> > which requires extra caution and means concurrent accesses block on
> > pci_lock_rescan_remove() instead of a kernfs lock.
> > Long story short when concurrently triggering the attribute one thread
> > proceeds into the pci_lock_rescan_remove() section and does the
> > removal, while others would block on pci_lock_rescan_remove(). Now when
> > the threads unblock the removal is done. In this case there is a new
> > struct pci_dev found in the rescan but the previously blocked threads
> > still have references to the old struct pci_dev which was removed and
> > as far as I could tell can only be distinguished by checking
> > pci_dev_is_added().
> 
> Is this locking issue different from concurrently writing to
> /sys/.../remove on other architectures?

In principle it is very similar except that we re-scan and thus the
removed pdev may co-exist with a new pdev for the same actual device if
there are other references to the pdev.

There is however also a significant difference in locking that fixes a
possible deadlock that I just confirmed also affects /sys/../remove
where it is hidden by a lockdep ignore, see the lockdep splat below.

For /sys/../recover this was fixed by my commit dd712f0a53ca
("s390/pci: Fix possible deadlock in  recover_store()") which also
introduced the need for pci_dev_is_added().

The change in the above commit is very similar to what is done in
drivers/scsi/scsi_sysfs.c:sdev_store_delete() which also has a self
deletion and in commit 0ee223b2e1f6 ("scsi: core: Avoid that SCSI
device removal through sysfs triggers a deadlock") fixed a very very
similar lock order problem.

Like the SCSI code I use sysfs_break_active_protection() which allows a
concurrent access to /sys/../recover to advance into recover_store().
It may then block on pci_lock_rescan_remove() instead of the kernfs
lock it would block on with just delete_remove_file_self(). Thus we
have to handle unblocking from pci_lock_rescan_remove() while holding
the stale struct pci_dev of the removed device. Together with the re-
scan this means I have to be 

[PATCH 3/3] powerpc/configs/microwatt: Enable options for systemd

2021-08-26 Thread Joel Stanley
When booting with systemd these options are required.

This increases the image by about 50KB, or 2%.

Signed-off-by: Joel Stanley 
---
 arch/powerpc/configs/microwatt_defconfig | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/configs/microwatt_defconfig 
b/arch/powerpc/configs/microwatt_defconfig
index 2f8b1fec9a16..4a4924cd056e 100644
--- a/arch/powerpc/configs/microwatt_defconfig
+++ b/arch/powerpc/configs/microwatt_defconfig
@@ -5,6 +5,7 @@ CONFIG_PREEMPT_VOLUNTARY=y
 CONFIG_TICK_CPU_ACCOUNTING=y
 CONFIG_LOG_BUF_SHIFT=16
 CONFIG_PRINTK_SAFE_LOG_BUF_SHIFT=12
+CONFIG_CGROUPS=y
 CONFIG_BLK_DEV_INITRD=y
 CONFIG_CC_OPTIMIZE_FOR_SIZE=y
 CONFIG_KALLSYMS_ALL=y
@@ -77,8 +78,10 @@ CONFIG_SPI_SPIDEV=y
 CONFIG_EXT4_FS=y
 # CONFIG_FILE_LOCKING is not set
 # CONFIG_DNOTIFY is not set
-# CONFIG_INOTIFY_USER is not set
+CONFIG_AUTOFS_FS=y
+CONFIG_TMPFS=y
 # CONFIG_MISC_FILESYSTEMS is not set
+CONFIG_CRYPTO_SHA256=y
 # CONFIG_CRYPTO_HW is not set
 # CONFIG_XZ_DEC_X86 is not set
 # CONFIG_XZ_DEC_IA64 is not set
-- 
2.33.0



[PATCH 2/3] powerpc/configs/microwattt: Enable Liteeth

2021-08-26 Thread Joel Stanley
Liteeth is the network device used by Microwatt.

Signed-off-by: Joel Stanley 
---
 arch/powerpc/configs/microwatt_defconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/configs/microwatt_defconfig 
b/arch/powerpc/configs/microwatt_defconfig
index a08b739123da..2f8b1fec9a16 100644
--- a/arch/powerpc/configs/microwatt_defconfig
+++ b/arch/powerpc/configs/microwatt_defconfig
@@ -53,6 +53,7 @@ CONFIG_MTD_SPI_NOR=y
 CONFIG_BLK_DEV_LOOP=y
 CONFIG_BLK_DEV_RAM=y
 CONFIG_NETDEVICES=y
+CONFIG_LITEX_LITEETH=y
 # CONFIG_WLAN is not set
 # CONFIG_INPUT is not set
 # CONFIG_SERIO is not set
-- 
2.33.0



[PATCH 1/3] powerpc/microwatt: Add Ethernet to device tree

2021-08-26 Thread Joel Stanley
The liteeth network device is used in the Microwatt soc.

Signed-off-by: Joel Stanley 
---
 arch/powerpc/boot/dts/microwatt.dts | 12 
 1 file changed, 12 insertions(+)

diff --git a/arch/powerpc/boot/dts/microwatt.dts 
b/arch/powerpc/boot/dts/microwatt.dts
index 974abbdda249..65b270a90f94 100644
--- a/arch/powerpc/boot/dts/microwatt.dts
+++ b/arch/powerpc/boot/dts/microwatt.dts
@@ -127,6 +127,18 @@ UART0: serial@2000 {
fifo-size = <16>;
interrupts = <0x10 0x1>;
};
+
+   ethernet@802 {
+   compatible = "litex,liteeth";
+   reg = <0x8021000 0x100
+   0x8020800 0x100
+   0x803 0x2000>;
+   reg-names = "mac", "mido", "buffer";
+   litex,rx-slots = <2>;
+   litex,tx-slots = <2>;
+   litex,slot-size = <0x800>;
+   interrupts = <0x11 0x1>;
+   };
};
 
chosen {
-- 
2.33.0



[PATCH 0/3] powerpc/microwatt: Device tree and config updates

2021-08-26 Thread Joel Stanley
This enables the liteeth network device for microwatt which will be
merged in v5.15.

It also turns on some options so the microwatt defconfig can be used to
boot a userspace with systemd.

Joel Stanley (3):
  powerpc/microwatt: Add Ethernet to device tree
  powerpc/configs/microwattt: Enable Liteeth
  powerpc/configs/microwatt: Enable options for systemd

 arch/powerpc/boot/dts/microwatt.dts  | 12 
 arch/powerpc/configs/microwatt_defconfig |  6 +-
 2 files changed, 17 insertions(+), 1 deletion(-)

-- 
2.33.0



[PATCH v5 1/5] powerpc/numa: Drop dbg in favour of pr_debug

2021-08-26 Thread Srikar Dronamraju
PowerPC supports numa=debug, which is not documented. This option was
used to print early debug output. However, something more flexible can
be achieved by using CONFIG_DYNAMIC_DEBUG.

Hence drop dbg (and numa=debug) in favour of pr_debug.

Cc: linuxppc-dev@lists.ozlabs.org
Cc: Nathan Lynch 
Cc: Michael Ellerman 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: Valentin Schneider 
Cc: Gautham R Shenoy 
Cc: Vincent Guittot 
Cc: Geetika Moolchandani 
Cc: Laurent Dufour 
Suggested-by: Michael Ellerman 
Signed-off-by: Srikar Dronamraju 
---
 arch/powerpc/mm/numa.c | 22 --
 1 file changed, 8 insertions(+), 14 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index f2bf98bdcea2..5e9b777a1151 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -40,9 +40,6 @@ static int numa_enabled = 1;
 
 static char *cmdline __initdata;
 
-static int numa_debug;
-#define dbg(args...) if (numa_debug) { printk(KERN_INFO args); }
-
 int numa_cpu_lookup_table[NR_CPUS];
 cpumask_var_t node_to_cpumask_map[MAX_NUMNODES];
 struct pglist_data *node_data[MAX_NUMNODES];
@@ -79,7 +76,7 @@ static void __init setup_node_to_cpumask_map(void)
alloc_bootmem_cpumask_var(&node_to_cpumask_map[node]);
 
/* cpumask_of_node() will now work */
-   dbg("Node to cpumask map for %u nodes\n", nr_node_ids);
+   pr_debug("Node to cpumask map for %u nodes\n", nr_node_ids);
 }
 
 static int __init fake_numa_create_new_node(unsigned long end_pfn,
@@ -123,7 +120,7 @@ static int __init fake_numa_create_new_node(unsigned long 
end_pfn,
cmdline = p;
fake_nid++;
*nid = fake_nid;
-   dbg("created new fake_node with id %d\n", fake_nid);
+   pr_debug("created new fake_node with id %d\n", fake_nid);
return 1;
}
return 0;
@@ -141,7 +138,7 @@ static void map_cpu_to_node(int cpu, int node)
 {
update_numa_cpu_lookup_table(cpu, node);
 
-   dbg("adding cpu %d to node %d\n", cpu, node);
+   pr_debug("adding cpu %d to node %d\n", cpu, node);
 
if (!(cpumask_test_cpu(cpu, node_to_cpumask_map[node])))
cpumask_set_cpu(cpu, node_to_cpumask_map[node]);
@@ -152,7 +149,7 @@ static void unmap_cpu_from_node(unsigned long cpu)
 {
int node = numa_cpu_lookup_table[cpu];
 
-   dbg("removing cpu %lu from node %d\n", cpu, node);
+   pr_debug("removing cpu %lu from node %d\n", cpu, node);
 
if (cpumask_test_cpu(cpu, node_to_cpumask_map[node])) {
cpumask_clear_cpu(cpu, node_to_cpumask_map[node]);
@@ -313,7 +310,7 @@ static int __init find_min_common_depth(void)
 &distance_ref_points_depth);
 
if (!distance_ref_points) {
-   dbg("NUMA: ibm,associativity-reference-points not found.\n");
+   pr_debug("NUMA: ibm,associativity-reference-points not 
found.\n");
goto err;
}
 
@@ -321,7 +318,7 @@ static int __init find_min_common_depth(void)
 
if (firmware_has_feature(FW_FEATURE_OPAL) ||
firmware_has_feature(FW_FEATURE_TYPE1_AFFINITY)) {
-   dbg("Using form 1 affinity\n");
+   pr_debug("Using form 1 affinity\n");
form1_affinity = 1;
}
 
@@ -719,7 +716,7 @@ static int __init parse_numa_properties(void)
return min_common_depth;
}
 
-   dbg("NUMA associativity depth for CPU/Memory: %d\n", min_common_depth);
+   pr_debug("NUMA associativity depth for CPU/Memory: %d\n", 
min_common_depth);
 
/*
 * Even though we connect cpus to numa domains later in SMP
@@ -1014,9 +1011,6 @@ static int __init early_numa(char *p)
if (strstr(p, "off"))
numa_enabled = 0;
 
-   if (strstr(p, "debug"))
-   numa_debug = 1;
-
p = strstr(p, "fake=");
if (p)
cmdline = p + strlen("fake=");
@@ -1179,7 +1173,7 @@ static long vphn_get_associativity(unsigned long cpu,
 
switch (rc) {
case H_SUCCESS:
-   dbg("VPHN hcall succeeded. Reset polling...\n");
+   pr_debug("VPHN hcall succeeded. Reset polling...\n");
goto out;
 
case H_FUNCTION:
-- 
2.18.2



[PATCH v3 5/5] powerpc/numa: Fill distance_lookup_table for offline nodes

2021-08-26 Thread Srikar Dronamraju
Scheduler expects unique number of node distances to be available at
boot. It uses node distance to calculate this unique node distances.
On POWER, node distances for offline nodes is not available. However,
POWER already knows unique possible node distances. Fake the offline
node's distance_lookup_table entries so that all possible node
distances are updated.

However this only needs to be done if the number of unique node
distances that can be computed for online nodes is less than the
number of possible unique node distances as represented by
distance_ref_points_depth. When the node is actually onlined,
distance_lookup_table will be updated with actual entries.

Cc: linuxppc-dev@lists.ozlabs.org
Cc: Nathan Lynch 
Cc: Michael Ellerman 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: Valentin Schneider 
Cc: Gautham R Shenoy 
Cc: Vincent Guittot 
Cc: Geetika Moolchandani 
Cc: Laurent Dufour 
Cc: kernel test robot 
Signed-off-by: Srikar Dronamraju 
---
 arch/powerpc/mm/numa.c | 70 ++
 1 file changed, 70 insertions(+)

Changelog:
v1: 
https://lore.kernel.org/linuxppc-dev/20210701041552.112072-3-sri...@linux.vnet.ibm.com/t/#u
[ Fixed a missing prototype warning Reported-by: kernel test robot 
]

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 87ade2f56f45..afa2ede4ac53 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -849,6 +849,75 @@ void __init dump_numa_cpu_topology(void)
}
 }
 
+/*
+ * Scheduler expects unique number of node distances to be available at
+ * boot. It uses node distance to calculate this unique node distances. On
+ * POWER, node distances for offline nodes is not available. However, POWER
+ * already knows unique possible node distances. Fake the offline node's
+ * distance_lookup_table entries so that all possible node distances are
+ * updated.
+ */
+static void __init fake_update_distance_lookup_table(void)
+{
+   unsigned long distance_map;
+   int i, nr_levels, nr_depth, node;
+
+   if (!numa_enabled)
+   return;
+
+   if (!form1_affinity)
+   return;
+
+   /*
+* distance_ref_points_depth lists the unique numa domains
+* available. However it ignore LOCAL_DISTANCE. So add +1
+* to get the actual number of unique distances.
+*/
+   nr_depth = distance_ref_points_depth + 1;
+
+   WARN_ON(nr_depth > sizeof(distance_map));
+
+   bitmap_zero(&distance_map, nr_depth);
+   bitmap_set(&distance_map, 0, 1);
+
+   for_each_online_node(node) {
+   int nd, distance = LOCAL_DISTANCE;
+
+   if (node == first_online_node)
+   continue;
+
+   nd = __node_distance(node, first_online_node);
+   for (i = 0; i < nr_depth; i++, distance *= 2) {
+   if (distance == nd) {
+   bitmap_set(&distance_map, i, 1);
+   break;
+   }
+   }
+   nr_levels = bitmap_weight(&distance_map, nr_depth);
+   if (nr_levels == nr_depth)
+   return;
+   }
+
+   for_each_node(node) {
+   if (node_online(node))
+   continue;
+
+   i = find_first_zero_bit(&distance_map, nr_depth);
+   if (i >= nr_depth || i == 0) {
+   pr_warn("Levels(%d) not matching levels(%d)",
+   nr_levels, nr_depth);
+   return;
+   }
+
+   bitmap_set(&distance_map, i, 1);
+   while (i--)
+   distance_lookup_table[node][i] = node;
+
+   nr_levels = bitmap_weight(&distance_map, nr_depth);
+   if (nr_levels == nr_depth)
+   return;
+   }
+}
+
 /* Initialize NODE_DATA for a node on the local memory */
 static void __init setup_node_data(int nid, u64 start_pfn, u64 end_pfn)
 {
@@ -964,6 +1033,7 @@ void __init mem_topology_setup(void)
 */
numa_setup_cpu(cpu);
}
+   fake_update_distance_lookup_table();
 }
 
 void __init initmem_init(void)
-- 
2.18.2



[PATCH v3 4/5] powerpc/numa: Update cpu_cpu_map on CPU online/offline

2021-08-26 Thread Srikar Dronamraju
cpu_cpu_map holds all the CPUs in the DIE. However, on PowerPC this
mask does not get updated when CPUs are onlined or offlined; it is only
updated when CPUs are added or removed. So when CPU online/offline and
CPU add/remove operations are done simultaneously, the cpumaps end up
broken.

WARNING: CPU: 13 PID: 1142 at kernel/sched/topology.c:898
build_sched_domains+0xd48/0x1720
Modules linked in: rpadlpar_io rpaphp mptcp_diag xsk_diag tcp_diag
udp_diag raw_diag inet_diag unix_diag af_packet_diag netlink_diag
bonding tls nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib
nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct
nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set
rfkill nf_tables nfnetlink pseries_rng xts vmx_crypto uio_pdrv_genirq
uio binfmt_misc ip_tables xfs libcrc32c dm_service_time sd_mod t10_pi sg
ibmvfc scsi_transport_fc ibmveth dm_multipath dm_mirror dm_region_hash
dm_log dm_mod fuse
CPU: 13 PID: 1142 Comm: kworker/13:2 Not tainted 5.13.0-rc6+ #28
Workqueue: events cpuset_hotplug_workfn
NIP:  c01caac8 LR: c01caac4 CTR: 007088ec
REGS: c0005596f220 TRAP: 0700   Not tainted  (5.13.0-rc6+)
MSR:  80029033   CR: 48828222  XER:
0009
CFAR: c01ea698 IRQMASK: 0
GPR00: c01caac4 c0005596f4c0 c1c4a400 0036
GPR04: fffd c0005596f1d0 0027 c018cfd07f90
GPR08: 0023 0001 0027 c018fe68ffe8
GPR12: 8000 c0001e9d1880 c0013a047200 0800
GPR16: c1d3c7d0 0240 0048 c00010aacd18
GPR20: 0001 c00010aacc18 c0013a047c00 c00139ec2400
GPR24: 0280 c00139ec2520 c00136c1b400 c1c93060
GPR28: c0013a047c20 c1d3c6c0 c1c978a0 000d
NIP [c01caac8] build_sched_domains+0xd48/0x1720
LR [c01caac4] build_sched_domains+0xd44/0x1720
Call Trace:
[c0005596f4c0] [c01caac4] build_sched_domains+0xd44/0x1720 
(unreliable)
[c0005596f670] [c01cc5ec] partition_sched_domains_locked+0x3ac/0x4b0
[c0005596f710] [c02804e4] rebuild_sched_domains_locked+0x404/0x9e0
[c0005596f810] [c0283e60] rebuild_sched_domains+0x40/0x70
[c0005596f840] [c0284124] cpuset_hotplug_workfn+0x294/0xf10
[c0005596fc60] [c0175040] process_one_work+0x290/0x590
[c0005596fd00] [c01753c8] worker_thread+0x88/0x620
[c0005596fda0] [c0181704] kthread+0x194/0x1a0
[c0005596fe10] [c000ccec] ret_from_kernel_thread+0x5c/0x70
Instruction dump:
485af049 6000 2fa30800 409e0028 80fe e89a00f8 e86100e8 38da0120
7f88e378 7ce53b78 4801fb91 6000 <0fe0> 3900 38e0 38c0

Fix this by updating cpu_cpu_map aka cpumask_of_node() on every CPU
online/offline.

Cc: linuxppc-dev@lists.ozlabs.org
Cc: Nathan Lynch 
Cc: Michael Ellerman 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: Valentin Schneider 
Cc: Gautham R Shenoy 
Cc: Vincent Guittot 
Cc: Geetika Moolchandani 
Cc: Laurent Dufour 
Signed-off-by: Srikar Dronamraju 
---
 arch/powerpc/include/asm/topology.h | 12 
 arch/powerpc/kernel/smp.c   |  3 +++
 arch/powerpc/mm/numa.c  |  7 ++-
 3 files changed, 17 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/include/asm/topology.h 
b/arch/powerpc/include/asm/topology.h
index e4db64c0e184..2f0a4d7b95f6 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -65,6 +65,11 @@ static inline int early_cpu_to_node(int cpu)
 
 int of_drconf_to_nid_single(struct drmem_lmb *lmb);
 
+extern void map_cpu_to_node(int cpu, int node);
+#ifdef CONFIG_HOTPLUG_CPU
+extern void unmap_cpu_from_node(unsigned long cpu);
+#endif /* CONFIG_HOTPLUG_CPU */
+
 #else
 
 static inline int early_cpu_to_node(int cpu) { return 0; }
@@ -93,6 +98,13 @@ static inline int of_drconf_to_nid_single(struct drmem_lmb 
*lmb)
return first_online_node;
 }
 
+#ifdef CONFIG_SMP
+static inline void map_cpu_to_node(int cpu, int node) {}
+#ifdef CONFIG_HOTPLUG_CPU
+static inline void unmap_cpu_from_node(unsigned long cpu) {}
+#endif /* CONFIG_HOTPLUG_CPU */
+#endif /* CONFIG_SMP */
+
 #endif /* CONFIG_NUMA */
 
 #if defined(CONFIG_NUMA) && defined(CONFIG_PPC_SPLPAR)
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index b31b8ca3ae2e..d947a4fd753c 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1407,6 +1407,8 @@ static void remove_cpu_from_masks(int cpu)
struct cpumask *(*mask_fn)(int) = cpu_sibling_mask;
int i;
 
+   unmap_cpu_from_node(cpu);
+
if (shared_caches)
mask_fn = cpu_l2_cache_mask;
 
@@ -1491,6 +1493,7 @@ static void add_cpu_to_masks(int cpu)
 * This CPU will not be in the online mask yet so we need to manually
 * add it to it's own thread sibling mask.
 

[PATCH v3 3/5] powerpc/numa: Print debug statements only when required

2021-08-26 Thread Srikar Dronamraju
Currently, a debug message gets printed on every attempt to add (remove)
a CPU to (from) a node. However, this is redundant if the CPU is already
added to (removed from) the node.

Cc: linuxppc-dev@lists.ozlabs.org
Cc: Nathan Lynch 
Cc: Michael Ellerman 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: Valentin Schneider 
Cc: Gautham R Shenoy 
Cc: Vincent Guittot 
Cc: Geetika Moolchandani 
Cc: Laurent Dufour 
Signed-off-by: Srikar Dronamraju 
---
 arch/powerpc/mm/numa.c | 9 -
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 9af38b1c618b..6655ecdeddef 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -138,10 +138,10 @@ static void map_cpu_to_node(int cpu, int node)
 {
update_numa_cpu_lookup_table(cpu, node);
 
-   pr_debug("adding cpu %d to node %d\n", cpu, node);
-
-   if (!(cpumask_test_cpu(cpu, node_to_cpumask_map[node])))
+   if (!(cpumask_test_cpu(cpu, node_to_cpumask_map[node]))) {
+   pr_debug("adding cpu %d to node %d\n", cpu, node);
cpumask_set_cpu(cpu, node_to_cpumask_map[node]);
+   }
 }
 
 #if defined(CONFIG_HOTPLUG_CPU) || defined(CONFIG_PPC_SPLPAR)
@@ -149,10 +149,9 @@ static void unmap_cpu_from_node(unsigned long cpu)
 {
int node = numa_cpu_lookup_table[cpu];
 
-   pr_debug("removing cpu %lu from node %d\n", cpu, node);
-
if (cpumask_test_cpu(cpu, node_to_cpumask_map[node])) {
cpumask_clear_cpu(cpu, node_to_cpumask_map[node]);
+   pr_debug("removing cpu %lu from node %d\n", cpu, node);
} else {
pr_warn("WARNING: cpu %lu not found in node %d\n", cpu, node);
}
-- 
2.18.2



[PATCH v5 2/5] powerpc/numa: convert printk to pr_xxx

2021-08-26 Thread Srikar Dronamraju
Convert the remaining printk calls to pr_xxx.
One advantage is that all prints will now carry the "numa:" prefix from
pr_fmt().

Cc: linuxppc-dev@lists.ozlabs.org
Cc: Nathan Lynch 
Cc: Michael Ellerman 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: Valentin Schneider 
Cc: Gautham R Shenoy 
Cc: Vincent Guittot 
Cc: Geetika Moolchandani 
Cc: Laurent Dufour 
[ convert printk(KERN_ERR) to pr_warn : Suggested by Laurent Dufour ]
Suggested-by: Michael Ellerman 
Signed-off-by: Srikar Dronamraju 
---
 arch/powerpc/mm/numa.c | 18 +++---
 1 file changed, 7 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 5e9b777a1151..9af38b1c618b 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -154,8 +154,7 @@ static void unmap_cpu_from_node(unsigned long cpu)
if (cpumask_test_cpu(cpu, node_to_cpumask_map[node])) {
cpumask_clear_cpu(cpu, node_to_cpumask_map[node]);
} else {
-   printk(KERN_ERR "WARNING: cpu %lu not found in node %d\n",
-  cpu, node);
+   pr_warn("WARNING: cpu %lu not found in node %d\n", cpu, node);
}
 }
 #endif /* CONFIG_HOTPLUG_CPU || CONFIG_PPC_SPLPAR */
@@ -326,8 +325,7 @@ static int __init find_min_common_depth(void)
depth = of_read_number(distance_ref_points, 1);
} else {
if (distance_ref_points_depth < 2) {
-   printk(KERN_WARNING "NUMA: "
-   "short ibm,associativity-reference-points\n");
+   pr_warn("short ibm,associativity-reference-points\n");
goto err;
}
 
@@ -339,7 +337,7 @@ static int __init find_min_common_depth(void)
 * MAX_DISTANCE_REF_POINTS domains.
 */
if (distance_ref_points_depth > MAX_DISTANCE_REF_POINTS) {
-   printk(KERN_WARNING "NUMA: distance array capped at "
+   pr_warn("distance array capped at "
"%d entries\n", MAX_DISTANCE_REF_POINTS);
distance_ref_points_depth = MAX_DISTANCE_REF_POINTS;
}
@@ -701,7 +699,7 @@ static int __init parse_numa_properties(void)
unsigned long i;
 
if (numa_enabled == 0) {
-   printk(KERN_WARNING "NUMA disabled by user\n");
+   pr_warn("disabled by user\n");
return -1;
}
 
@@ -716,7 +714,7 @@ static int __init parse_numa_properties(void)
return min_common_depth;
}
 
-   pr_debug("NUMA associativity depth for CPU/Memory: %d\n", 
min_common_depth);
+   pr_debug("associativity depth for CPU/Memory: %d\n", min_common_depth);
 
/*
 * Even though we connect cpus to numa domains later in SMP
@@ -808,10 +806,8 @@ static void __init setup_nonnuma(void)
unsigned int nid = 0;
int i;
 
-   printk(KERN_DEBUG "Top of RAM: 0x%lx, Total RAM: 0x%lx\n",
-  top_of_ram, total_ram);
-   printk(KERN_DEBUG "Memory hole size: %ldMB\n",
-  (top_of_ram - total_ram) >> 20);
+   pr_debug("Top of RAM: 0x%lx, Total RAM: 0x%lx\n", top_of_ram, total_ram);
+   pr_debug("Memory hole size: %ldMB\n", (top_of_ram - total_ram) >> 20);
 
	for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, NULL) {
		fake_numa_create_new_node(end_pfn, &nid);
-- 
2.18.2



[PATCH v3 0/5] Updates to powerpc for robust CPU online/offline

2021-08-26 Thread Srikar Dronamraju
Changelog v2 -> v3:
v2: 
https://lore.kernel.org/linuxppc-dev/20210821102535.169643-1-sri...@linux.vnet.ibm.com/t/#u
Add patch 1: to drop dbg and numa=debug (Suggested by Michael Ellerman)
Add patch 2: to convert printk to pr_xxx (Suggested by Michael Ellerman)
Use pr_warn instead of pr_debug(WARNING) (Suggested by Laurent)

Changelog v1 -> v2:
Moved patch to this series: powerpc/numa: Fill distance_lookup_table for 
offline nodes
fixed a missing prototype warning

The scheduler expects the number of unique node distances to be
available at boot. It uses node distances to compute this number. On
Power servers, node distances for offline nodes are not available.
However, Power servers already know all possible unique node distances.
Fake the offline nodes' distance_lookup_table entries so that all
possible node distances are updated.

For example distance info from numactl from a fully populated 8 node
system at boot may look like this.

node distances:
node   0   1   2   3   4   5   6   7
  0:  10  20  40  40  40  40  40  40
  1:  20  10  40  40  40  40  40  40
  2:  40  40  10  20  40  40  40  40
  3:  40  40  20  10  40  40  40  40
  4:  40  40  40  40  10  20  40  40
  5:  40  40  40  40  20  10  40  40
  6:  40  40  40  40  40  40  10  20
  7:  40  40  40  40  40  40  20  10

However, when the same system boots with only two nodes online, the
distance info from numactl will look like:
node distances:
node   0   1
  0:  10  20
  1:  20  10

With the faked numa distance at boot, the node distance table will look
like
node   0   1   2
  0:  10  20  40
  1:  20  10  40
  2:  40  40  10

The actual distance will be populated once the nodes are onlined.

Also, when simultaneously running CPU online/offline with CPU
add/remove in a loop, we see WARNING messages.

WARNING: CPU: 13 PID: 1142 at kernel/sched/topology.c:898 
build_sched_domains+0xd48/0x1720
Modules linked in: rpadlpar_io rpaphp mptcp_diag xsk_diag tcp_diag udp_diag
raw_diag inet_diag unix_diag af_packet_diag netlink_diag bonding tls
nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet
nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat
nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set rfkill nf_tables nfnetlink
pseries_rng xts vmx_crypto uio_pdrv_genirq uio binfmt_misc ip_tables xfs
libcrc32c dm_service_time sd_mod t10_pi sg ibmvfc scsi_transport_fc ibmveth
dm_multipath dm_mirror dm_region_hash dm_log dm_mod fuse
CPU: 13 PID: 1142 Comm: kworker/13:2 Not tainted 5.13.0-rc6+ #28
Workqueue: events cpuset_hotplug_workfn
NIP:  c01caac8 LR: c01caac4 CTR: 007088ec
REGS: c0005596f220 TRAP: 0700   Not tainted  (5.13.0-rc6+)
MSR:  80029033   CR: 48828222  XER: 0009
CFAR: c01ea698 IRQMASK: 0
GPR00: c01caac4 c0005596f4c0 c1c4a400 0036
GPR04: fffd c0005596f1d0 0027 c018cfd07f90
GPR08: 0023 0001 0027 c018fe68ffe8
GPR12: 8000 c0001e9d1880 c0013a047200 0800
GPR16: c1d3c7d0 0240 0048 c00010aacd18
GPR20: 0001 c00010aacc18 c0013a047c00 c00139ec2400
GPR24: 0280 c00139ec2520 c00136c1b400 c1c93060
GPR28: c0013a047c20 c1d3c6c0 c1c978a0 000d
NIP [c01caac8] build_sched_domains+0xd48/0x1720
LR [c01caac4] build_sched_domains+0xd44/0x1720
Call Trace:
[c0005596f4c0] [c01caac4] build_sched_domains+0xd44/0x1720 
(unreliable)
[c0005596f670] [c01cc5ec] partition_sched_domains_locked+0x3ac/0x4b0
[c0005596f710] [c02804e4] rebuild_sched_domains_locked+0x404/0x9e0
[c0005596f810] [c0283e60] rebuild_sched_domains+0x40/0x70
[c0005596f840] [c0284124] cpuset_hotplug_workfn+0x294/0xf10
[c0005596fc60] [c0175040] process_one_work+0x290/0x590
[c0005596fd00] [c01753c8] worker_thread+0x88/0x620
[c0005596fda0] [c0181704] kthread+0x194/0x1a0
[c0005596fe10] [c000ccec] ret_from_kernel_thread+0x5c/0x70
Instruction dump:
485af049 6000 2fa30800 409e0028 80fe e89a00f8 e86100e8 38da0120
7f88e378 7ce53b78 4801fb91 6000 <0fe0> 3900 38e0 38c0

This was because cpu_cpu_mask() was not being updated on CPU
online/offline; it was only updated on CPU add/remove. Other cpumasks
get updated both on CPU online/offline and on add/remove.
Update cpu_cpu_mask() on CPU online/offline too.

Cc: linuxppc-dev@lists.ozlabs.org
Cc: Nathan Lynch 
Cc: Michael Ellerman 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: Valentin Schneider 
Cc: Gautham R Shenoy 
Cc: Vincent Guittot 
Cc: Geetika Moolchandani 
Cc: Laurent Dufour 

Srikar Dronamraju (5):
  powerpc/numa: Drop dbg in favour of pr_debug
  powerpc/numa: convert printk to pr_xxx
  powerpc/numa: Print debug statements only when required
  powerpc/numa: Update cpu_cpu_map on CPU 

[PATCH v2 3/3] powerpc/smp: Enable CACHE domain for shared processor

2021-08-26 Thread Srikar Dronamraju
Currently the CACHE domain is not enabled in shared processor mode on
PowerVM LPARs. On PowerVM systems, property 2 of the 'ibm,thread-group'
device-tree property under the cpu device node indicates which CPUs
share the L2 cache. However, property 2 of 'ibm,thread-group' is
relatively new. In its absence, the 'l2-cache' property under the cpu
device node could help the system identify CPUs sharing the L2 cache.
However, this property is not exposed by PHYP in shared processor mode
configurations.

In the absence of properties that tell the OS which CPUs share the
L2 cache, fall back to the core boundary.

Here are some stats from Power9 shared LPAR with the changes.

$ lscpu
Architecture:ppc64le
Byte Order:  Little Endian
CPU(s):  32
On-line CPU(s) list: 0-31
Thread(s) per core:  8
Core(s) per socket:  1
Socket(s):   3
NUMA node(s):2
Model:   2.2 (pvr 004e 0202)
Model name:  POWER9 (architected), altivec supported
Hypervisor vendor:   pHyp
Virtualization type: para
L1d cache:   32K
L1i cache:   32K
NUMA node0 CPU(s):   16-23
NUMA node1 CPU(s):   0-15,24-31
Physical sockets:2
Physical chips:  1
Physical cores/chip: 10

$ grep -r . /sys/kernel/debug/sched/domains/cpu0/domain*/name
Before
/sys/kernel/debug/sched/domains/cpu0/domain0/name:SMT
/sys/kernel/debug/sched/domains/cpu0/domain1/name:DIE
/sys/kernel/debug/sched/domains/cpu0/domain2/name:NUMA

After
/sys/kernel/debug/sched/domains/cpu0/domain0/name:SMT
/sys/kernel/debug/sched/domains/cpu0/domain1/name:CACHE
/sys/kernel/debug/sched/domains/cpu0/domain2/name:DIE
/sys/kernel/debug/sched/domains/cpu0/domain3/name:NUMA

$ awk '/domain/{print $1, $2}' /proc/schedstat | sort -u | sed -e 's/,//g'
Before
domain0 0055
domain0 00aa
domain0 5500
domain0 aa00
domain0 0055
domain0 00aa
domain0 5500
domain0 aa00
domain1 00ff
domain1 ff00
domain2 

After
domain0 0055
domain0 00aa
domain0 5500
domain0 aa00
domain0 0055
domain0 00aa
domain0 5500
domain0 aa00
domain1 00ff
domain1 ff00
domain1 00ff
domain1 ff00
domain2 ff00
domain2 
domain3 

(Lower is better)
perf stat -a -r 5 -n perf bench sched pipe  | tail -n 2
Before
   153.798 +- 0.142 seconds time elapsed  ( +-  0.09% )

After
   111.545 +- 0.652 seconds time elapsed  ( +-  0.58% )

which is an improvement of 27.47%

Cc: linuxppc-dev@lists.ozlabs.org
Cc: Nathan Lynch 
Cc: Michael Ellerman 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: Valentin Schneider 
Cc: Gautham R Shenoy 
Cc: Vincent Guittot 
Reviewed-by: Gautham R. Shenoy 
Signed-off-by: Srikar Dronamraju 
---
 arch/powerpc/kernel/smp.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index b5bd5a4708d0..b31b8ca3ae2e 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1365,7 +1365,7 @@ static bool update_mask_by_l2(int cpu, cpumask_var_t 
*mask)
l2_cache = cpu_to_l2cache(cpu);
if (!l2_cache || !*mask) {
/* Assume only core siblings share cache with this CPU */
-   for_each_cpu(i, submask_fn(cpu))
+   for_each_cpu(i, cpu_sibling_mask(cpu))
set_cpus_related(cpu, i, cpu_l2_cache_mask);
 
return false;
-- 
2.18.2



[PATCH v2 2/3] powerpc/smp: Update cpu_core_map on all PowerPc systems

2021-08-26 Thread Srikar Dronamraju
lscpu uses core_siblings to list the number of sockets in the
system. core_siblings is set using topology_core_cpumask.

While optimizing the powerpc bootup path in commit 4ca234a9cbd7
("powerpc/smp: Stop updating cpu_core_mask"), it was found that
updating cpu_core_mask() took a lot of time. It was thought that on
PowerPC, cpu_core_mask() would always be the same as cpu_cpu_mask(),
i.e. the number of sockets would always equal the number of nodes. As
an optimization, cpu_core_mask() was made a snapshot of
cpu_cpu_mask().

However, that was found to be false for PowerPC KVM guests, where each
node can have more than one socket. So with commit c47f892d7aa6
("powerpc/smp: Reintroduce cpu_core_mask"), cpu_core_mask was updated
based on chip_id, but in an optimized way using some mask manipulations
and chip_id caching.

However, non-PowerNV and non-pseries KVM guests (i.e. those not
implementing cpu_to_chip_id()) continued to use a copy of
cpu_cpu_mask().

There are two issues that were noticed on such systems
1. lscpu would report one extra socket.
On a IBM,9009-42A (aka zz system) which has only 2 chips/ sockets/
nodes, lscpu would report
Architecture:ppc64le
Byte Order:  Little Endian
CPU(s):  160
On-line CPU(s) list: 0-159
Thread(s) per core:  8
Core(s) per socket:  6
Socket(s):   3<--
NUMA node(s):2
Model:   2.2 (pvr 004e 0202)
Model name:  POWER9 (architected), altivec supported
Hypervisor vendor:   pHyp
Virtualization type: para
L1d cache:   32K
L1i cache:   32K
L2 cache:512K
L3 cache:10240K
NUMA node0 CPU(s):   0-79
NUMA node1 CPU(s):   80-159

2. Currently cpu_cpu_mask is updated when a core is added/removed.
However, it is not updated on SMT mode switching or when CPUs are
explicitly offlined, whereas all other percpu masks are updated to
ensure only active/online CPUs are in the masks.
This results in build_sched_domain traces, since there will be CPUs in
cpu_cpu_mask() that are not present in the SMT / CACHE / MC / NUMA
domains. A loop of threads running SMT mode switching and core
add/remove will soon show this trace.
Hence cpu_cpu_mask has to be updated at SMT mode switch.

This will have an impact on cpu_core_mask(). cpu_core_mask() is a
snapshot of cpu_cpu_mask, so different CPUs within the same socket
will end up with different cpu_core_masks, since they are snapshots
taken at different points in time. This means lscpu will start
reporting many more sockets than the actual number of sockets / nodes
/ chips.

Different ways to handle this problem:
A. Update the snapshot aka cpu_core_mask for all CPUs whenever
   cpu_cpu_mask is updated. This would be a non-optimal solution.
B. Instead of a cpumask_var_t, make cpu_core_map a cpumask pointer
   pointing to cpu_cpu_mask. However a percpu cpumask pointer is
   frowned upon, and we need a clean way to handle the PowerPC KVM
   guest case, which is not a snapshot.
C. Update cpu_core_masks on all PowerPC systems, as is already done
   for PowerPC KVM guests, using mask manipulations. This approach is
   relatively simple and unifies with the existing code.
D. On top of C, we could also resurrect get_physical_package_id, which
   could return a nid for the said CPU. However this is not needed at
   this time.

Option C is the preferred approach for now.

This is somewhat a revert of commit 4ca234a9cbd7 ("powerpc/smp:
Stop updating cpu_core_mask"). However:

1. A plain revert has some conflicts.
2. For chip_id == -1, the cpu_core_mask is made identical to
cpu_cpu_mask, unlike previously where cpu_core_mask was set to a core
if the chip_id doesn't exist.

This goes by the principle that if chip_id is not exposed, then
sockets / chip / node share the same set of CPUs.

With the fix, lscpu o/p would be
Architecture:ppc64le
Byte Order:  Little Endian
CPU(s):  160
On-line CPU(s) list: 0-159
Thread(s) per core:  8
Core(s) per socket:  6
Socket(s):   2 <--
NUMA node(s):2
Model:   2.2 (pvr 004e 0202)
Model name:  POWER9 (architected), altivec supported
Hypervisor vendor:   pHyp
Virtualization type: para
L1d cache:   32K
L1i cache:   32K
L2 cache:512K
L3 cache:10240K
NUMA node0 CPU(s):   0-79
NUMA node1 CPU(s):   80-159

Cc: linuxppc-dev@lists.ozlabs.org
Cc: Aneesh Kumar K.V 
Cc: Nathan Lynch 
Cc: Michael Ellerman 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: Valentin Schneider 
Cc: Gautham R Shenoy 
Cc: Vincent Guittot 
Fixes: 4ca234a9cbd7 ("powerpc/smp: Stop updating cpu_core_mask")
Signed-off-by: Srikar Dronamraju 
---
Changelog : v1 -> v2:
v1:https://lore.kernel.org/linuxppc-dev/20210821092419.167454-3-sri...@linux.vnet.ibm.com/t/#u
Handled comments from Michael Ellerman
[ updated changelog to make it more generic powerpc issue ]

 arch/powerpc/kernel/smp.c | 11 +--
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git 

[PATCH v2 1/3] powerpc/smp: Fix a crash while booting kvm guest with nr_cpus=2

2021-08-26 Thread Srikar Dronamraju
Aneesh reported a crash with a fairly recent upstream kernel when
booting a kernel whose command line was appended with nr_cpus=2.

1:mon> e
cpu 0x1: Vector: 300 (Data Access) at [c8a67bd0]
pc: c002557c: cpu_to_chip_id+0x3c/0x100
lr: c0058380: start_secondary+0x460/0xb00
sp: c8a67e70
   msr: 80001033
   dar: 10
 dsisr: 8
  current = 0xc891bb00
  paca= 0xc018ff981f80   irqmask: 0x03   irq_happened: 0x01
pid   = 0, comm = swapper/1
Linux version 5.13.0-rc3-15704-ga050a6d2b7e8 (kvaneesh@ltc-boston8) (gcc 
(Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0, GNU ld (GNU Binutils for Ubuntu) 2.34) 
#433 SMP Tue May 25 02:38:49 CDT 2021
1:mon> t
[link register   ] c0058380 start_secondary+0x460/0xb00
[c8a67e70] c8a67eb0 (unreliable)
[c8a67eb0] c00589d4 start_secondary+0xab4/0xb00
[c8a67f90] c000c654 start_secondary_prolog+0x10/0x14

The current code assumes that num_possible_cpus() is always greater
than threads_per_core. However, this may not be true when using
nr_cpus=2 or similar options. Handle the case where num_possible_cpus()
is not an exact multiple of threads_per_core.

Cc: linuxppc-dev@lists.ozlabs.org
Cc: Aneesh Kumar K.V 
Cc: Nathan Lynch 
Cc: Michael Ellerman 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: Valentin Schneider 
Cc: Gautham R Shenoy 
Cc: Vincent Guittot 
Fixes: c1e53367dab1 ("powerpc/smp: Cache CPU to chip lookup")
Reported-by: Aneesh Kumar K.V 
Debugged-by: Michael Ellerman 
Signed-off-by: Srikar Dronamraju 
---
Changelog: v1 -> v2:
v1: - 
https://lore.kernel.org/linuxppc-dev/20210821092419.167454-2-sri...@linux.vnet.ibm.com/t/#u
Handled comment from Gautham Shenoy
[ Updated to use DIV_ROUND_UP instead of max to handle more situations ]

 arch/powerpc/kernel/smp.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 6c6e4d934d86..bf11b3c4eb28 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1074,7 +1074,7 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
}
 
if (cpu_to_chip_id(boot_cpuid) != -1) {
-   int idx = num_possible_cpus() / threads_per_core;
+   int idx = DIV_ROUND_UP(num_possible_cpus(), threads_per_core);
 
/*
 * All threads of a core will all belong to the same core,
-- 
2.18.2



[PATCH v2 0/3] powerpc/smp: Misc fixes

2021-08-26 Thread Srikar Dronamraju
Changelog : v1 -> v2:
v1: 
https://lore.kernel.org/linuxppc-dev/20210821092419.167454-1-sri...@linux.vnet.ibm.com/t/#u
[ patch 1: Updated to use DIV_ROUND_UP instead of max to handle more situations 
]
[ patch 2: updated changelog to make it more generic powerpc issue ]

The 1st patch fixes a regression which causes a crash when booted with
nr_cpus=2.

The 2nd patch fixes a regression where lscpu on PowerVM reports more
sockets than are available.

The 3rd patch updates the fallback when L2-cache properties are not
explicitly exposed to be the cpu_sibling_mask.

Cc: linuxppc-dev@lists.ozlabs.org
Cc: Nathan Lynch 
Cc: Michael Ellerman 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: Valentin Schneider 
Cc: Gautham R Shenoy 
Cc: Vincent Guittot 

Srikar Dronamraju (3):
  powerpc/smp: Fix a crash while booting kvm guest with nr_cpus=2
  powerpc/smp: Update cpu_core_map on all PowerPc systems
  powerpc/smp: Enable CACHE domain for shared processor

 arch/powerpc/kernel/smp.c | 15 +++
 1 file changed, 7 insertions(+), 8 deletions(-)

-- 
2.18.2



Re: [PATCH v2 2/2] powerpc/bug: Provide better flexibility to WARN_ON/__WARN_FLAGS() with asm goto

2021-08-26 Thread Christophe Leroy




On 26/08/2021 at 05:21, Michael Ellerman wrote:

Nathan Chancellor  writes:

On Tue, Apr 13, 2021 at 04:38:10PM +, Christophe Leroy wrote:

Using asm goto in __WARN_FLAGS() and WARN_ON() allows more
flexibility to GCC.

...


This patch as commit 1e688dd2a3d6 ("powerpc/bug: Provide better
flexibility to WARN_ON/__WARN_FLAGS() with asm goto") cause a WARN_ON in
klist_add_tail to trigger over and over on boot when compiling with
clang:


...



This patch seems to fix it. Not sure if that's just papering over it though.

diff --git a/arch/powerpc/include/asm/bug.h b/arch/powerpc/include/asm/bug.h
index 1ee0f22313ee..75fcb4370d96 100644
--- a/arch/powerpc/include/asm/bug.h
+++ b/arch/powerpc/include/asm/bug.h
@@ -119,7 +119,7 @@ __label_warn_on:
\
\
WARN_ENTRY(PPC_TLNEI " %4, 0",\
   BUGFLAG_WARNING | BUGFLAG_TAINT(TAINT_WARN), 
\
-  __label_warn_on, "r" (x)); \
+  __label_warn_on, "r" (!!(x))); \
break;  \
  __label_warn_on:  \
__ret_warn_on = true;   \




But for a simple WARN_ON() call:

void test(unsigned long b)
{
WARN_ON(b);
}

Without your change with GCC you get:

12d0 <.test>:
12d0:   0b 03 00 00 tdnei   r3,0
12d4:   4e 80 00 20 blr


With the !! change you get:

12d0 <.test>:
12d0:   31 23 ff ff addic   r9,r3,-1
12d4:   7d 29 19 10 subfe   r9,r9,r3
12d8:   0b 09 00 00 tdnei   r9,0
12dc:   4e 80 00 20 blr

Christophe