Re: Probing nvme disks fails on Upstream kernels on powerpc Maxconfig

2023-04-04 Thread Michael Ellerman
"Linux regression tracking (Thorsten Leemhuis)"  
writes:
> [CCing the regression list, as it should be in the loop for regressions:
> https://docs.kernel.org/admin-guide/reporting-regressions.html]
>
> On 23.03.23 10:53, Srikar Dronamraju wrote:
>> 
>> I am unable to boot upstream kernels from v5.16 to the latest upstream
>> kernel on a maxconfig system. (Machine config details given below)
>> 
>> At boot, we see a series of messages like the below.
>> 
>> dracut-initqueue[13917]: Warning: dracut-initqueue: timeout, still waiting 
>> for following initqueue hooks:
>> dracut-initqueue[13917]: Warning: 
>> /lib/dracut/hooks/initqueue/finished/devexists-\x2fdev\x2fdisk\x2fby-uuid\x2f93dc0767-18aa-467f-afa7-5b4e9c13108a.sh:
>>  "if ! grep -q After=remote-fs-pre.target 
>> /run/systemd/generator/systemd-cryptsetup@*.service 2>/dev/null; then
>> dracut-initqueue[13917]: [ -e 
>> "/dev/disk/by-uuid/93dc0767-18aa-467f-afa7-5b4e9c13108a" ]
>> dracut-initqueue[13917]: fi"
>
> Alexey, did you look into this? This is apparently caused by a commit of
> yours (see quoted part below) that Michael applied. Looks like it fell
> through the cracks from here, but maybe I'm missing something.

Unfortunately Alexey is not working at IBM any more, so he won't have
access to any hardware to debug/test this.

Srikar, are you debugging this? If not we'll have to find someone else
to look at it.

cheers


Re: [PATCH] powerpc/64: Always build with 128-bit long double

2023-04-04 Thread Michael Ellerman
Segher Boessenkool  writes:
> On Tue, Apr 04, 2023 at 08:28:47PM +1000, Michael Ellerman wrote:
>> The amdgpu driver builds some of its code with hard-float enabled,
>> whereas the rest of the kernel is built with soft-float.
>> 
>> When building with 64-bit long double, if soft-float and hard-float
>> objects are linked together, the build fails due to incompatible ABI
>> tags.
>
>> Currently those build errors are avoided because the amdgpu driver is
>> gated on 128-bit long double being enabled. But that's not a detail the
>> amdgpu driver should need to be aware of, and if another driver starts
>> using hard-float the same problem would occur.
>
> Well.  The kernel driver either has no business using long double (or
> any other floating point even) at all, or it should know exactly what is
> used: double precision, double-double, or quadruple precision.  Both of
> the latter two are 128 bits.

In a perfect world ... :)

>> All versions of the 64-bit ABI specify that long-double is 128-bits.
>> However some compilers, notably the kernel.org ones, are built to use
>> 64-bit long double by default.
>
> Mea culpa, I suppose?  But buildall doesn't force 64 bit explicitly.
> I wonder how this happened?  Is it maybe a problem in the powerpc64le
> config in GCC itself?

Not blaming anyone, just one of those things that happens. The
toolchains the distros (Ubuntu/Fedora) build all seem to use 128, but
possibly that's because someone told them to configure them that way at
some point.

> I have a patch from summer last year (Arnd's
> toolchains are built without it) that does
> +   powerpc64le-*)  TARGET_GCC_CONF=--with-long-double-128
> Unfortunately I don't remember why I did that, and I never investigated
> what the deeper problem is :-/

Last summer (aka winter) is when we first discovered this issue with the
long double size being implicated.

See:
  https://git.kernel.org/torvalds/c/c653c591789b3acfa4bf6ae45d5af4f330e50a91

So I guess that's what prompted your patch?

> In either case, the kernel should always use specific types, not rely on
> the toolchain to pick a type that may or may not work.  The correct size
> floating point type alone is not enough, but it is a step in the right
> direction certainly.
>
> Reviewed-by: Segher Boessenkool 

Thanks.

cheers


[PATCH] powerpc: Remove duplicate SPRN_HSRR definitions

2023-04-04 Thread Joel Stanley
There are two copies of these defines. Keep the older ones as they have
associated bit definitions.

Signed-off-by: Joel Stanley 
---
Today I learnt that if you have two copies of the define, but they are
the same value, the compiler won't warn.

 arch/powerpc/include/asm/reg.h | 2 --
 1 file changed, 2 deletions(-)

diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index 1e8b2e04e626..0bf4c506a1eb 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -382,8 +382,6 @@
 #define SPRN_HIOR  0x137   /* 970 Hypervisor interrupt offset */
 #define SPRN_RMOR  0x138   /* Real mode offset register */
 #define SPRN_HRMOR 0x139   /* Real mode offset register */
-#define SPRN_HSRR0 0x13A   /* Hypervisor Save/Restore 0 */
-#define SPRN_HSRR1 0x13B   /* Hypervisor Save/Restore 1 */
 #define SPRN_ASDR  0x330   /* Access segment descriptor register */
 #define SPRN_IC0x350   /* Virtual Instruction Count */
 #define SPRN_VTB   0x351   /* Virtual Time Base */
-- 
2.39.2



[PATCH 3/3] mm/mmu_gather: send tlb_remove_table_smp_sync IPI only to CPUs in kernel mode

2023-04-04 Thread Yair Podemsky
The tlb_remove_table_smp_sync IPI is used to ensure the outdated tlb page
is not currently being accessed and can be cleared.
This occurs once all CPUs have left the lockless gup code section.
If they reenter the page table walk, the pointers will be to the new
pages.
Therefore the IPI is only needed for CPUs in kernel mode.
By not sending the IPI to CPUs that are in user mode, latencies are
reduced.

Race conditions considerations:
The context state check is vulnerable to race conditions between the
moment the context state is read to when the IPI is sent (or not).

Here are those scenarios.

case 1:
CPU-A                                    CPU-B

                                         state == CONTEXT_KERNEL
int state = atomic_read(&ct->state);
                                         Kernel-exit:
                                         state == CONTEXT_USER
if ((state & CT_STATE_MASK) == CONTEXT_KERNEL)

In this case, the IPI will be sent to CPU-B even though it is no longer
in the kernel. The consequence is an unnecessary IPI being handled by
CPU-B, adding latency rather than saving it.
Without this patch, this would have been the case every time.

case 2:
CPU-A                                    CPU-B

modify pagetables
tlb_flush (memory barrier)
                                         state == CONTEXT_USER
int state = atomic_read(&ct->state);
                                         Kernel-enter:
                                         state == CONTEXT_KERNEL
                                         READ(pagetable values)
if ((state & CT_STATE_MASK) == CONTEXT_USER)

In this case, the IPI will not be sent to CPU-B even though it returns
to the kernel and even reads the pagetables.
However, since CPU-B entered the page table walk after the
modification, it reads the new, safe values.

The only case when this IPI is truly necessary is when CPU-B has entered
the lockless gup code section before the pagetable modifications and
has yet to exit them, in which case it is still in the kernel.

Signed-off-by: Yair Podemsky 
---
 mm/mmu_gather.c | 19 +--
 1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index 5ea9be6fb87c..731d955e152d 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -9,6 +9,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -191,6 +192,20 @@ static void tlb_remove_table_smp_sync(void *arg)
/* Simply deliver the interrupt */
 }
 
+
+#ifdef CONFIG_CONTEXT_TRACKING
+static bool cpu_in_kernel(int cpu, void *info)
+{
+	struct context_tracking *ct = per_cpu_ptr(&context_tracking, cpu);
+	int state = atomic_read(&ct->state);
+	/* will return true only for cpus in kernel space */
+	return (state & CT_STATE_MASK) == CONTEXT_KERNEL;
+}
+#define CONTEXT_PREDICATE cpu_in_kernel
+#else
+#define CONTEXT_PREDICATE NULL
+#endif /* CONFIG_CONTEXT_TRACKING */
+
 #ifdef CONFIG_ARCH_HAS_CPUMASK_BITS
 #define REMOVE_TABLE_IPI_MASK mm_cpumask(mm)
 #else
@@ -206,8 +221,8 @@ void tlb_remove_table_sync_one(struct mm_struct *mm)
 * It is however sufficient for software page-table walkers that rely on
 * IRQ disabling.
 */
-   on_each_cpu_mask(REMOVE_TABLE_IPI_MASK, tlb_remove_table_smp_sync,
-   NULL, true);
+   on_each_cpu_cond_mask(CONTEXT_PREDICATE, tlb_remove_table_smp_sync,
+   NULL, true, REMOVE_TABLE_IPI_MASK);
 }
 
 static void tlb_remove_table_rcu(struct rcu_head *head)
-- 
2.31.1



[PATCH 2/3] mm/mmu_gather: send tlb_remove_table_smp_sync IPI only to MM CPUs

2023-04-04 Thread Yair Podemsky
Currently the tlb_remove_table_smp_sync IPI is sent to all CPUs
indiscriminately, which causes unnecessary work and delays that are
notable in real-time use-cases and on isolated cpus.
This patch limits the IPI on systems with ARCH_HAS_CPUMASK_BITS,
where the IPI is only sent to cpus referencing the affected mm.

Signed-off-by: Yair Podemsky 
Suggested-by: David Hildenbrand 
---
 include/asm-generic/tlb.h |  4 ++--
 mm/khugepaged.c   |  4 ++--
 mm/mmu_gather.c   | 17 -
 3 files changed, 16 insertions(+), 9 deletions(-)

diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index b46617207c93..0b6ba17cc8d3 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -222,7 +222,7 @@ extern void tlb_remove_table(struct mmu_gather *tlb, void *table);
 #define tlb_needs_table_invalidate() (true)
 #endif
 
-void tlb_remove_table_sync_one(void);
+void tlb_remove_table_sync_one(struct mm_struct *mm);
 
 #else
 
@@ -230,7 +230,7 @@ void tlb_remove_table_sync_one(void);
 #error tlb_needs_table_invalidate() requires MMU_GATHER_RCU_TABLE_FREE
 #endif
 
-static inline void tlb_remove_table_sync_one(void) { }
+static inline void tlb_remove_table_sync_one(struct mm_struct *mm) { }
 
 #endif /* CONFIG_MMU_GATHER_RCU_TABLE_FREE */
 
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 92e6f56a932d..2b4e6ca1f38e 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1070,7 +1070,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
_pmd = pmdp_collapse_flush(vma, address, pmd);
spin_unlock(pmd_ptl);
mmu_notifier_invalidate_range_end();
-   tlb_remove_table_sync_one();
+   tlb_remove_table_sync_one(mm);
 
spin_lock(pte_ptl);
result =  __collapse_huge_page_isolate(vma, address, pte, cc,
@@ -1427,7 +1427,7 @@ static void collapse_and_free_pmd(struct mm_struct *mm, struct vm_area_struct *v
addr + HPAGE_PMD_SIZE);
mmu_notifier_invalidate_range_start();
pmd = pmdp_collapse_flush(vma, addr, pmdp);
-   tlb_remove_table_sync_one();
+   tlb_remove_table_sync_one(mm);
mmu_notifier_invalidate_range_end();
mm_dec_nr_ptes(mm);
page_table_check_pte_clear_range(mm, addr, pmd);
diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index 2b93cf6ac9ae..5ea9be6fb87c 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -191,7 +191,13 @@ static void tlb_remove_table_smp_sync(void *arg)
/* Simply deliver the interrupt */
 }
 
-void tlb_remove_table_sync_one(void)
+#ifdef CONFIG_ARCH_HAS_CPUMASK_BITS
+#define REMOVE_TABLE_IPI_MASK mm_cpumask(mm)
+#else
+#define REMOVE_TABLE_IPI_MASK NULL
+#endif /* CONFIG_ARCH_HAS_CPUMASK_BITS */
+
+void tlb_remove_table_sync_one(struct mm_struct *mm)
 {
/*
 * This isn't an RCU grace period and hence the page-tables cannot be
@@ -200,7 +206,8 @@ void tlb_remove_table_sync_one(void)
 * It is however sufficient for software page-table walkers that rely on
 * IRQ disabling.
 */
-   smp_call_function(tlb_remove_table_smp_sync, NULL, 1);
+   on_each_cpu_mask(REMOVE_TABLE_IPI_MASK, tlb_remove_table_smp_sync,
+   NULL, true);
 }
 
 static void tlb_remove_table_rcu(struct rcu_head *head)
@@ -237,9 +244,9 @@ static inline void tlb_table_invalidate(struct mmu_gather *tlb)
}
 }
 
-static void tlb_remove_table_one(void *table)
+static void tlb_remove_table_one(struct mm_struct *mm, void *table)
 {
-   tlb_remove_table_sync_one();
+   tlb_remove_table_sync_one(mm);
__tlb_remove_table(table);
 }
 
@@ -262,7 +269,7 @@ void tlb_remove_table(struct mmu_gather *tlb, void *table)
	*batch = (struct mmu_table_batch *)__get_free_page(GFP_NOWAIT | __GFP_NOWARN);
if (*batch == NULL) {
tlb_table_invalidate(tlb);
-   tlb_remove_table_one(table);
+   tlb_remove_table_one(tlb->mm, table);
return;
}
(*batch)->nr = 0;
-- 
2.31.1



[PATCH 1/3] arch: Introduce ARCH_HAS_CPUMASK_BITS

2023-04-04 Thread Yair Podemsky
Some architectures set and maintain the mm_cpumask bits when loading
a process onto a cpu or removing it.
This Kconfig option marks those architectures, to allow different
behavior between kernels that maintain the mm_cpumask and those that
do not.

Signed-off-by: Yair Podemsky 
---
 arch/Kconfig | 8 
 arch/arm/Kconfig | 1 +
 arch/powerpc/Kconfig | 1 +
 arch/s390/Kconfig| 1 +
 arch/sparc/Kconfig   | 1 +
 arch/x86/Kconfig | 1 +
 6 files changed, 13 insertions(+)

diff --git a/arch/Kconfig b/arch/Kconfig
index e3511afbb7f2..ec5559779e9f 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -1434,6 +1434,14 @@ config ARCH_HAS_NONLEAF_PMD_YOUNG
  address translations. Page table walkers that clear the accessed bit
  may use this capability to reduce their search space.
 
+config ARCH_HAS_CPUMASK_BITS
+   bool
+   help
+	  Architectures that select this option set bits in the mm_cpumask
+	  to mark which cpus have loaded the mm. The mask can then be used
+	  to control mm specific actions such as tlb_flush.
+
+
 source "kernel/gcov/Kconfig"
 
 source "scripts/gcc-plugins/Kconfig"
diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index e24a9820e12f..6111059a68a3 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -70,6 +70,7 @@ config ARM
select GENERIC_SCHED_CLOCK
select GENERIC_SMP_IDLE_THREAD
select HARDIRQS_SW_RESEND
+   select ARCH_HAS_CPUMASK_BITS
select HAVE_ARCH_AUDITSYSCALL if AEABI && !OABI_COMPAT
select HAVE_ARCH_BITREVERSE if (CPU_32v7M || CPU_32v7) && !CPU_32v6
select HAVE_ARCH_JUMP_LABEL if !XIP_KERNEL && !CPU_ENDIAN_BE32 && MMU
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index a6c4407d3ec8..2fd0160f4f8e 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -144,6 +144,7 @@ config PPC
select ARCH_HAS_TICK_BROADCAST  if GENERIC_CLOCKEVENTS_BROADCAST
select ARCH_HAS_UACCESS_FLUSHCACHE
select ARCH_HAS_UBSAN_SANITIZE_ALL
+   select ARCH_HAS_CPUMASK_BITS
select ARCH_HAVE_NMI_SAFE_CMPXCHG
select ARCH_KEEP_MEMBLOCK
select ARCH_MIGHT_HAVE_PC_PARPORT
diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
index 9809c74e1240..b2de5ee07faf 100644
--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -86,6 +86,7 @@ config S390
select ARCH_HAS_SYSCALL_WRAPPER
select ARCH_HAS_UBSAN_SANITIZE_ALL
select ARCH_HAS_VDSO_DATA
+   select ARCH_HAS_CPUMASK_BITS
select ARCH_HAVE_NMI_SAFE_CMPXCHG
select ARCH_INLINE_READ_LOCK
select ARCH_INLINE_READ_LOCK_BH
diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig
index 84437a4c6545..f9e0cf26d447 100644
--- a/arch/sparc/Kconfig
+++ b/arch/sparc/Kconfig
@@ -98,6 +98,7 @@ config SPARC64
select ARCH_HAS_PTE_SPECIAL
select PCI_DOMAINS if PCI
select ARCH_HAS_GIGANTIC_PAGE
+   select ARCH_HAS_CPUMASK_BITS
select HAVE_SOFTIRQ_ON_OWN_STACK
select HAVE_SETUP_PER_CPU_AREA
select NEED_PER_CPU_EMBED_FIRST_CHUNK
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index a825bf031f49..d98dfdf9c6b4 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -183,6 +183,7 @@ config X86
select HAVE_ARCH_THREAD_STRUCT_WHITELIST
select HAVE_ARCH_STACKLEAK
select HAVE_ARCH_TRACEHOOK
+   select ARCH_HAS_CPUMASK_BITS
select HAVE_ARCH_TRANSPARENT_HUGEPAGE
select HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD if X86_64
select HAVE_ARCH_USERFAULTFD_WP if X86_64 && USERFAULTFD
-- 
2.31.1



[PATCH 0/3] send tlb_remove_table_smp_sync IPI only to necessary CPUs

2023-04-04 Thread Yair Podemsky
Currently the tlb_remove_table_smp_sync IPI is sent to all CPUs
indiscriminately, which causes unnecessary work and delays that are
notable in real-time use-cases and on isolated cpus.
By limiting the IPI to cpus that reference the affected mm and are in
kernel mode, latency is improved.
A config option to differentiate architectures that support mm_cpumask
from those that don't allows safe usage of this feature.

Yair Podemsky (3):
  arch: Introduce ARCH_HAS_CPUMASK_BITS
  mm/mmu_gather: send tlb_remove_table_smp_sync IPI only to MM CPUs
  mm/mmu_gather: send tlb_remove_table_smp_sync IPI only to CPUs in
kernel mode

-- 
2.31.1



[powerpc:topic/ppc-kvm] BUILD SUCCESS a3800ef9c48c4497dafe5ede1b65d91d9ef9cf1e

2023-04-04 Thread kernel test robot
tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git 
topic/ppc-kvm
branch HEAD: a3800ef9c48c4497dafe5ede1b65d91d9ef9cf1e  KVM: PPC: Enable 
prefixed instructions for HV KVM and disable for PR KVM

elapsed time: 726m

configs tested: 182
configs skipped: 139

The following configs have been built successfully.
More configs may be tested in the coming days.

tested configs:
alphaallyesconfig   gcc  
alphabuildonly-randconfig-r001-20230403   gcc  
alpha   defconfig   gcc  
alpharandconfig-r002-20230403   gcc  
alpharandconfig-r015-20230403   gcc  
alpharandconfig-r021-20230403   gcc  
alpharandconfig-r024-20230403   gcc  
alpharandconfig-r025-20230403   gcc  
alpharandconfig-r026-20230403   gcc  
alpharandconfig-r036-20230403   gcc  
arc  allyesconfig   gcc  
arc  buildonly-randconfig-r006-20230403   gcc  
arc defconfig   gcc  
arc  randconfig-r006-20230403   gcc  
arc  randconfig-r011-20230403   gcc  
arc  randconfig-r013-20230403   gcc  
arc  randconfig-r032-20230403   gcc  
arc  randconfig-r033-20230403   gcc  
arc  randconfig-r034-20230403   gcc  
arc  randconfig-r043-20230403   gcc  
arm  allmodconfig   gcc  
arm  allyesconfig   gcc  
arm defconfig   gcc  
arm  randconfig-r034-20230403   gcc  
armrealview_defconfig   gcc  
arm   sama5_defconfig   gcc  
armshmobile_defconfig   gcc  
arm   spear13xx_defconfig   clang
arm64allyesconfig   gcc  
arm64   defconfig   gcc  
arm64randconfig-r004-20230403   clang
arm64randconfig-r024-20230403   gcc  
cskydefconfig   gcc  
csky randconfig-r021-20230403   gcc  
hexagon  buildonly-randconfig-r001-20230403   clang
hexagon  buildonly-randconfig-r006-20230404   clang
hexagon  randconfig-r005-20230403   clang
hexagon  randconfig-r024-20230403   clang
i386 allyesconfig   gcc  
i386 buildonly-randconfig-r002-20230403   clang
i386  debian-10.3   gcc  
i386defconfig   gcc  
i386 randconfig-a001-20230403   clang
i386 randconfig-a002-20230403   clang
i386 randconfig-a003-20230403   clang
i386 randconfig-a004-20230403   clang
i386 randconfig-a005-20230403   clang
i386 randconfig-a006-20230403   clang
i386 randconfig-a011-20230403   gcc  
i386 randconfig-a012-20230403   gcc  
i386 randconfig-a013-20230403   gcc  
i386 randconfig-a014-20230403   gcc  
i386 randconfig-a015-20230403   gcc  
i386 randconfig-a016-20230403   gcc  
i386 randconfig-r014-20230403   gcc  
i386 randconfig-r015-20230403   gcc  
i386 randconfig-r022-20230403   gcc  
i386 randconfig-r026-20230403   gcc  
ia64 allmodconfig   gcc  
ia64 buildonly-randconfig-r002-20230403   gcc  
ia64 buildonly-randconfig-r004-20230403   gcc  
ia64defconfig   gcc  
ia64 randconfig-r012-20230403   gcc  
loongarchallmodconfig   gcc  
loongarch allnoconfig   gcc  
loongarch   defconfig   gcc  
loongarch loongson3_defconfig   gcc  
loongarchrandconfig-r013-20230403   gcc  
m68k allmodconfig   gcc  
m68kdefconfig   gcc  
m68km5307c3_defconfig   gcc  
m68k randconfig-r012-20230403   gcc  
m68k randconfig-r016-20230403   gcc  
m68k randconfig-r023-20230403   gcc  
m68k randconfig-r033-20230403   gcc  
m68kstmark2_defconfig   gcc  
microblaze   buildonly-randconfig-r005-20230403   gcc  
microblaze   randconfig-r002-20230403   gcc  
microblaze   randconfig-r004-20230403   gcc  
microblaze   randconfig-r011-20230403   gcc  
microblaze   randconfig-r023-20230403   gcc  
microblaze   randconfig-r031-20230403   gcc  
mips allmodconfig   gcc  
mips allyesconfig   gcc  
mips buildonly-randconfig-r001

[powerpc:next-test] BUILD SUCCESS 9709343cb567c7ff0dd93e583810ec29a9b497d5

2023-04-04 Thread kernel test robot
tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git 
next-test
branch HEAD: 9709343cb567c7ff0dd93e583810ec29a9b497d5  powerpc/64: Always build 
with 128-bit long double

elapsed time: 726m

configs tested: 189
configs skipped: 11

The following configs have been built successfully.
More configs may be tested in the coming days.

tested configs:
alphaallyesconfig   gcc  
alphabuildonly-randconfig-r001-20230403   gcc  
alpha   defconfig   gcc  
alpharandconfig-r015-20230403   gcc  
alpharandconfig-r016-20230403   gcc  
alpharandconfig-r024-20230403   gcc  
alpharandconfig-r025-20230403   gcc  
alpharandconfig-r036-20230403   gcc  
arc  allyesconfig   gcc  
arc  buildonly-randconfig-r006-20230403   gcc  
arc defconfig   gcc  
arc  randconfig-r011-20230403   gcc  
arc  randconfig-r013-20230403   gcc  
arc  randconfig-r022-20230403   gcc  
arc  randconfig-r032-20230403   gcc  
arc  randconfig-r033-20230403   gcc  
arc  randconfig-r034-20230403   gcc  
arc  randconfig-r043-20230403   gcc  
arm  allmodconfig   gcc  
arm  allyesconfig   gcc  
arm defconfig   gcc  
arm  randconfig-r025-20230403   clang
arm  randconfig-r034-20230403   gcc  
arm  randconfig-r046-20230403   clang
armrealview_defconfig   gcc  
arm   sama5_defconfig   gcc  
armshmobile_defconfig   gcc  
arm   spear13xx_defconfig   clang
arm64allyesconfig   gcc  
arm64   defconfig   gcc  
arm64randconfig-r004-20230403   clang
arm64randconfig-r024-20230403   gcc  
csky buildonly-randconfig-r002-20230403   gcc  
cskydefconfig   gcc  
csky randconfig-r021-20230403   gcc  
csky randconfig-r031-20230403   gcc  
hexagon  buildonly-randconfig-r006-20230404   clang
hexagon  randconfig-r005-20230403   clang
hexagon  randconfig-r041-20230403   clang
hexagon  randconfig-r045-20230403   clang
i386 allyesconfig   gcc  
i386 buildonly-randconfig-r002-20230403   clang
i386  debian-10.3   gcc  
i386defconfig   gcc  
i386 randconfig-a001-20230403   clang
i386 randconfig-a002-20230403   clang
i386 randconfig-a003-20230403   clang
i386 randconfig-a004-20230403   clang
i386 randconfig-a005-20230403   clang
i386 randconfig-a006-20230403   clang
i386 randconfig-a011-20230403   gcc  
i386 randconfig-a012-20230403   gcc  
i386 randconfig-a013-20230403   gcc  
i386 randconfig-a014-20230403   gcc  
i386 randconfig-a015-20230403   gcc  
i386 randconfig-a016-20230403   gcc  
i386 randconfig-r014-20230403   gcc  
i386 randconfig-r015-20230403   gcc  
i386 randconfig-r021-20230403   gcc  
i386 randconfig-r023-20230403   gcc  
i386 randconfig-r026-20230403   gcc  
ia64 allmodconfig   gcc  
ia64 buildonly-randconfig-r004-20230403   gcc  
ia64 buildonly-randconfig-r004-20230404   gcc  
ia64 buildonly-randconfig-r006-20230403   gcc  
ia64defconfig   gcc  
ia64 randconfig-r012-20230403   gcc  
loongarchallmodconfig   gcc  
loongarch allnoconfig   gcc  
loongarch   defconfig   gcc  
loongarch loongson3_defconfig   gcc  
loongarchrandconfig-r013-20230403   gcc  
m68k allmodconfig   gcc  
m68k buildonly-randconfig-r005-20230403   gcc  
m68kdefconfig   gcc  
m68km5307c3_defconfig   gcc  
m68k randconfig-r012-20230403   gcc  
m68k randconfig-r016-20230403   gcc  
m68k randconfig-r023-20230403   gcc  
m68k randconfig-r024-20230403   gcc  
m68k randconfig-r032-20230403   gcc  
m68k randconfig-r033-20230403   gcc  
m68kstmark2_defconfig   gcc  
microblaze   buildonly-randconfig-r005-20230403   gcc  
microblaze   randconfig-r011-20230403   gcc  
microblaze

[powerpc:fixes-test] BUILD SUCCESS b277fc793daf258877b4c0744b52f69d6e6ba22e

2023-04-04 Thread kernel test robot
tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git 
fixes-test
branch HEAD: b277fc793daf258877b4c0744b52f69d6e6ba22e  powerpc/papr_scm: Update 
the NUMA distance table for the target node

elapsed time: 728m

configs tested: 182
configs skipped: 139

The following configs have been built successfully.
More configs may be tested in the coming days.

tested configs:
alphaallyesconfig   gcc  
alphabuildonly-randconfig-r001-20230403   gcc  
alpha   defconfig   gcc  
alpharandconfig-r002-20230403   gcc  
alpharandconfig-r015-20230403   gcc  
alpharandconfig-r021-20230403   gcc  
alpharandconfig-r024-20230403   gcc  
alpharandconfig-r025-20230403   gcc  
alpharandconfig-r026-20230403   gcc  
alpharandconfig-r036-20230403   gcc  
arc  allyesconfig   gcc  
arc  buildonly-randconfig-r006-20230403   gcc  
arc defconfig   gcc  
arc  randconfig-r006-20230403   gcc  
arc  randconfig-r011-20230403   gcc  
arc  randconfig-r013-20230403   gcc  
arc  randconfig-r032-20230403   gcc  
arc  randconfig-r033-20230403   gcc  
arc  randconfig-r034-20230403   gcc  
arc  randconfig-r043-20230403   gcc  
arm  allmodconfig   gcc  
arm  allyesconfig   gcc  
arm defconfig   gcc  
arm  randconfig-r034-20230403   gcc  
armrealview_defconfig   gcc  
arm   sama5_defconfig   gcc  
armshmobile_defconfig   gcc  
arm   spear13xx_defconfig   clang
arm64allyesconfig   gcc  
arm64   defconfig   gcc  
arm64randconfig-r004-20230403   clang
arm64randconfig-r024-20230403   gcc  
cskydefconfig   gcc  
csky randconfig-r021-20230403   gcc  
hexagon  buildonly-randconfig-r001-20230403   clang
hexagon  buildonly-randconfig-r006-20230404   clang
hexagon  randconfig-r005-20230403   clang
hexagon  randconfig-r024-20230403   clang
i386 allyesconfig   gcc  
i386 buildonly-randconfig-r002-20230403   clang
i386  debian-10.3   gcc  
i386defconfig   gcc  
i386 randconfig-a001-20230403   clang
i386 randconfig-a002-20230403   clang
i386 randconfig-a003-20230403   clang
i386 randconfig-a004-20230403   clang
i386 randconfig-a005-20230403   clang
i386 randconfig-a006-20230403   clang
i386 randconfig-a011-20230403   gcc  
i386 randconfig-a012-20230403   gcc  
i386 randconfig-a013-20230403   gcc  
i386 randconfig-a014-20230403   gcc  
i386 randconfig-a015-20230403   gcc  
i386 randconfig-a016-20230403   gcc  
i386 randconfig-r014-20230403   gcc  
i386 randconfig-r015-20230403   gcc  
i386 randconfig-r022-20230403   gcc  
i386 randconfig-r026-20230403   gcc  
ia64 allmodconfig   gcc  
ia64 buildonly-randconfig-r002-20230403   gcc  
ia64 buildonly-randconfig-r004-20230403   gcc  
ia64defconfig   gcc  
ia64 randconfig-r012-20230403   gcc  
loongarchallmodconfig   gcc  
loongarch allnoconfig   gcc  
loongarch   defconfig   gcc  
loongarch loongson3_defconfig   gcc  
loongarchrandconfig-r013-20230403   gcc  
m68k allmodconfig   gcc  
m68kdefconfig   gcc  
m68km5307c3_defconfig   gcc  
m68k randconfig-r012-20230403   gcc  
m68k randconfig-r016-20230403   gcc  
m68k randconfig-r023-20230403   gcc  
m68k randconfig-r033-20230403   gcc  
m68kstmark2_defconfig   gcc  
microblaze   buildonly-randconfig-r005-20230403   gcc  
microblaze   randconfig-r002-20230403   gcc  
microblaze   randconfig-r004-20230403   gcc  
microblaze   randconfig-r011-20230403   gcc  
microblaze   randconfig-r023-20230403   gcc  
microblaze   randconfig-r031-20230403   gcc  
mips allmodconfig   gcc  
mips allyesconfig   gcc  
mips buildonly-randconfig-r001-20230403

[powerpc:merge] BUILD SUCCESS WITH WARNING 639e8992872c632f27b130b403e263eae966231e

2023-04-04 Thread kernel test robot
tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git 
merge
branch HEAD: 639e8992872c632f27b130b403e263eae966231e  powerpc/ci: Add smart 
sparse diffing

Warning reports:

https://lore.kernel.org/oe-kbuild-all/202304042327.blhf5ncp-...@intel.com

Warning: (recently discovered and may have been fixed)

.github/problem-matchers/sparse.json: warning: ignored by one of the .gitignore 
files

Warning ids grouped by kconfigs:

gcc_recent_errors
|-- alpha-allyesconfig
|   `-- 
github-problem-matchers-sparse.json:warning:ignored-by-one-of-the-.gitignore-files
|-- alpha-defconfig
|   `-- 
github-problem-matchers-sparse.json:warning:ignored-by-one-of-the-.gitignore-files
|-- alpha-randconfig-r021-20230403
|   `-- 
github-problem-matchers-sparse.json:warning:ignored-by-one-of-the-.gitignore-files
|-- alpha-randconfig-r024-20230403
|   `-- 
github-problem-matchers-sparse.json:warning:ignored-by-one-of-the-.gitignore-files
|-- alpha-randconfig-r026-20230403
|   `-- 
github-problem-matchers-sparse.json:warning:ignored-by-one-of-the-.gitignore-files
|-- alpha-randconfig-r036-20230403
|   `-- 
github-problem-matchers-sparse.json:warning:ignored-by-one-of-the-.gitignore-files
|-- arc-allyesconfig
|   `-- 
github-problem-matchers-sparse.json:warning:ignored-by-one-of-the-.gitignore-files
|-- arc-buildonly-randconfig-r003-20230403
|   `-- 
github-problem-matchers-sparse.json:warning:ignored-by-one-of-the-.gitignore-files
|-- arc-defconfig
|   `-- 
github-problem-matchers-sparse.json:warning:ignored-by-one-of-the-.gitignore-files
|-- arc-randconfig-r022-20230403
|   `-- 
github-problem-matchers-sparse.json:warning:ignored-by-one-of-the-.gitignore-files
|-- arc-randconfig-r033-20230403
|   `-- 
github-problem-matchers-sparse.json:warning:ignored-by-one-of-the-.gitignore-files
|-- arc-randconfig-r043-20230403
|   `-- 
github-problem-matchers-sparse.json:warning:ignored-by-one-of-the-.gitignore-files
|-- arm-allmodconfig
|   `-- 
github-problem-matchers-sparse.json:warning:ignored-by-one-of-the-.gitignore-files
|-- arm-allyesconfig
|   `-- 
github-problem-matchers-sparse.json:warning:ignored-by-one-of-the-.gitignore-files
|-- arm-defconfig
|   `-- 
github-problem-matchers-sparse.json:warning:ignored-by-one-of-the-.gitignore-files
|-- arm-randconfig-r034-20230403
|   `-- 
github-problem-matchers-sparse.json:warning:ignored-by-one-of-the-.gitignore-files
|-- arm64-allyesconfig
|   `-- 
github-problem-matchers-sparse.json:warning:ignored-by-one-of-the-.gitignore-files
|-- arm64-defconfig
|   `-- 
github-problem-matchers-sparse.json:warning:ignored-by-one-of-the-.gitignore-files
|-- csky-defconfig
|   `-- 
github-problem-matchers-sparse.json:warning:ignored-by-one-of-the-.gitignore-files
|-- i386-allyesconfig
|   `-- 
github-problem-matchers-sparse.json:warning:ignored-by-one-of-the-.gitignore-files
|-- i386-debian-10.3
|   `-- 
github-problem-matchers-sparse.json:warning:ignored-by-one-of-the-.gitignore-files
|-- i386-defconfig
|   `-- 
github-problem-matchers-sparse.json:warning:ignored-by-one-of-the-.gitignore-files
|-- i386-randconfig-a011-20230403
|   `-- 
github-problem-matchers-sparse.json:warning:ignored-by-one-of-the-.gitignore-files
|-- i386-randconfig-a012-20230403
|   `-- 
github-problem-matchers-sparse.json:warning:ignored-by-one-of-the-.gitignore-files
|-- i386-randconfig-a013-20230403
|   `-- 
github-problem-matchers-sparse.json:warning:ignored-by-one-of-the-.gitignore-files

elapsed time: 727m

configs tested: 139
configs skipped: 8

tested configs:
alphaallyesconfig   gcc  
alpha   defconfig   gcc  
alpharandconfig-r015-20230403   gcc  
alpharandconfig-r021-20230403   gcc  
alpharandconfig-r024-20230403   gcc  
alpharandconfig-r026-20230403   gcc  
alpharandconfig-r036-20230403   gcc  
arc  allyesconfig   gcc  
arc  buildonly-randconfig-r003-20230403   gcc  
arc defconfig   gcc  
arc  randconfig-r011-20230403   gcc  
arc  randconfig-r013-20230403   gcc  
arc  randconfig-r022-20230403   gcc  
arc  randconfig-r033-20230403   gcc  
arc  randconfig-r043-20230403   gcc  
arm  allmodconfig   gcc  
arm  allyesconfig   gcc  
arm defconfig   gcc  
arm  randconfig-r034-20230403   gcc  
arm  randconfig-r046-20230403   clang
arm64allyesconfig   gcc  
arm64   defconfig   gcc  
cskydefconfig   gcc  
csky randconfig-r031-20230403   gcc  
hexagon  randconfig-r041-20230403   clang
hexagon  randconfig-r045-20230403   clang
i386 allyesconfig   gcc  
i386  

[PATCH v7 2/7] PCI: Execute `quirk_enable_clear_retrain_link' earlier

2023-04-04 Thread Maciej W. Rozycki
Make `quirk_enable_clear_retrain_link' `pci_fixup_early' so that any later 
fixups can rely on `clear_retrain_link' to have been already initialised.

Signed-off-by: Maciej W. Rozycki 
---
No change from v6.

No change from v5.

New change in v5.
---
 drivers/pci/quirks.c |6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

linux-pcie-clear-retrain-link-early.diff
Index: linux-macro/drivers/pci/quirks.c
===
--- linux-macro.orig/drivers/pci/quirks.c
+++ linux-macro/drivers/pci/quirks.c
@@ -2407,9 +2407,9 @@ static void quirk_enable_clear_retrain_l
dev->clear_retrain_link = 1;
pci_info(dev, "Enable PCIe Retrain Link quirk\n");
 }
-DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_PERICOM, 0xe110, 
quirk_enable_clear_retrain_link);
-DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_PERICOM, 0xe111, 
quirk_enable_clear_retrain_link);
-DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_PERICOM, 0xe130, 
quirk_enable_clear_retrain_link);
+DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_PERICOM, 0xe110, 
quirk_enable_clear_retrain_link);
+DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_PERICOM, 0xe111, 
quirk_enable_clear_retrain_link);
+DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_PERICOM, 0xe130, 
quirk_enable_clear_retrain_link);
 
 static void fixup_rev1_53c810(struct pci_dev *dev)
 {


[PATCH v7 3/7] PCI: Initialize `link_active_reporting' earlier

2023-04-04 Thread Maciej W. Rozycki
Determine whether Data Link Layer Link Active Reporting is available 
ahead of calling any fixups so that the cached value can be used there 
and later on.

Signed-off-by: Maciej W. Rozycki 
---
Changes from v6:

- Regenerate against 6.3-rc5.

New change in v6.
---
 drivers/pci/probe.c |6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

linux-pcie-link-active-reporting-early.diff
Index: linux-macro/drivers/pci/probe.c
===
--- linux-macro.orig/drivers/pci/probe.c
+++ linux-macro/drivers/pci/probe.c
@@ -820,7 +820,6 @@ static void pci_set_bus_speed(struct pci
 
pcie_capability_read_dword(bridge, PCI_EXP_LNKCAP, &linkcap);
bus->max_bus_speed = pcie_link_speed[linkcap & 
PCI_EXP_LNKCAP_SLS];
-   bridge->link_active_reporting = !!(linkcap & 
PCI_EXP_LNKCAP_DLLLARC);
 
pcie_capability_read_word(bridge, PCI_EXP_LNKSTA, &linksta);
pcie_update_link_speed(bus, linksta);
@@ -1829,6 +1828,7 @@ int pci_setup_device(struct pci_dev *dev
int pos = 0;
struct pci_bus_region region;
struct resource *res;
+   u32 linkcap;
 
hdr_type = pci_hdr_type(dev);
 
@@ -1876,6 +1876,10 @@ int pci_setup_device(struct pci_dev *dev
/* "Unknown power state" */
dev->current_state = PCI_UNKNOWN;
 
+   /* Set it early to make it available to fixups, etc.  */
+   pcie_capability_read_dword(dev, PCI_EXP_LNKCAP, &linkcap);
+   dev->link_active_reporting = !!(linkcap & PCI_EXP_LNKCAP_DLLLARC);
+
/* Early fixups, before probing the BARs */
pci_fixup_device(pci_fixup_early, dev);
 


[PATCH v7 4/7] powerpc/eeh: Rely on `link_active_reporting'

2023-04-04 Thread Maciej W. Rozycki
Use `link_active_reporting' to determine whether Data Link Layer Link 
Active Reporting is available rather than re-retrieving the capability.

Signed-off-by: Maciej W. Rozycki 
---
NB this has been compile-tested only with a PPC64LE configuration.

No change from v6.

New change in v6.
---
 arch/powerpc/kernel/eeh_pe.c |5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

linux-pcie-link-active-reporting-eeh.diff
Index: linux-macro/arch/powerpc/kernel/eeh_pe.c
===
--- linux-macro.orig/arch/powerpc/kernel/eeh_pe.c
+++ linux-macro/arch/powerpc/kernel/eeh_pe.c
@@ -671,9 +671,8 @@ static void eeh_bridge_check_link(struct
eeh_ops->write_config(edev, cap + PCI_EXP_LNKCTL, 2, val);
 
/* Check link */
-   eeh_ops->read_config(edev, cap + PCI_EXP_LNKCAP, 4, &val);
-   if (!(val & PCI_EXP_LNKCAP_DLLLARC)) {
-   eeh_edev_dbg(edev, "No link reporting capability (0x%08x) \n", 
val);
+   if (!edev->pdev->link_active_reporting) {
+   eeh_edev_dbg(edev, "No link reporting capability\n");
msleep(1000);
return;
}


[PATCH v7 6/7] PCI: pciehp: Rely on `link_active_reporting'

2023-04-04 Thread Maciej W. Rozycki
Use `link_active_reporting' to determine whether Data Link Layer Link 
Active Reporting is available rather than re-retrieving the capability.

Signed-off-by: Maciej W. Rozycki 
---
NB this has been compile-tested only with PPC64LE and x86-64
configurations.

No change from v6.

New change in v6.
---
 drivers/pci/hotplug/pciehp_hpc.c |7 ++-
 1 file changed, 2 insertions(+), 5 deletions(-)

linux-pcie-link-active-reporting-hpc.diff
Index: linux-macro/drivers/pci/hotplug/pciehp_hpc.c
===
--- linux-macro.orig/drivers/pci/hotplug/pciehp_hpc.c
+++ linux-macro/drivers/pci/hotplug/pciehp_hpc.c
@@ -984,7 +984,7 @@ static inline int pcie_hotplug_depth(str
 struct controller *pcie_init(struct pcie_device *dev)
 {
struct controller *ctrl;
-   u32 slot_cap, slot_cap2, link_cap;
+   u32 slot_cap, slot_cap2;
u8 poweron;
struct pci_dev *pdev = dev->port;
struct pci_bus *subordinate = pdev->subordinate;
@@ -1030,9 +1030,6 @@ struct controller *pcie_init(struct pcie
if (dmi_first_match(inband_presence_disabled_dmi_table))
ctrl->inband_presence_disabled = 1;
 
-   /* Check if Data Link Layer Link Active Reporting is implemented */
-   pcie_capability_read_dword(pdev, PCI_EXP_LNKCAP, &link_cap);
-
/* Clear all remaining event bits in Slot Status register. */
pcie_capability_write_word(pdev, PCI_EXP_SLTSTA,
PCI_EXP_SLTSTA_ABP | PCI_EXP_SLTSTA_PFD |
@@ -1051,7 +1048,7 @@ struct controller *pcie_init(struct pcie
FLAG(slot_cap, PCI_EXP_SLTCAP_EIP),
FLAG(slot_cap, PCI_EXP_SLTCAP_NCCS),
FLAG(slot_cap2, PCI_EXP_SLTCAP2_IBPD),
-   FLAG(link_cap, PCI_EXP_LNKCAP_DLLLARC),
+   FLAG(pdev->link_active_reporting, true),
pdev->broken_cmd_compl ? " (with Cmd Compl erratum)" : "");
 
/*


[PATCH v7 5/7] net/mlx5: Rely on `link_active_reporting'

2023-04-04 Thread Maciej W. Rozycki
Use `link_active_reporting' to determine whether Data Link Layer Link 
Active Reporting is available rather than re-retrieving the capability.

Signed-off-by: Maciej W. Rozycki 
---
NB this has been compile-tested only with PPC64LE and x86-64 
configurations.

Changes from v6:

- Regenerate against 6.3-rc5.

New change in v6.
---
 drivers/net/ethernet/mellanox/mlx5/core/fw_reset.c |8 ++--
 1 file changed, 2 insertions(+), 6 deletions(-)

linux-pcie-link-active-reporting-mlx5.diff
Index: linux-macro/drivers/net/ethernet/mellanox/mlx5/core/fw_reset.c
===
--- linux-macro.orig/drivers/net/ethernet/mellanox/mlx5/core/fw_reset.c
+++ linux-macro/drivers/net/ethernet/mellanox/mlx5/core/fw_reset.c
@@ -307,7 +307,6 @@ static int mlx5_pci_link_toggle(struct m
unsigned long timeout;
struct pci_dev *sdev;
int cap, err;
-   u32 reg32;
 
/* Check that all functions under the pci bridge are PFs of
 * this device otherwise fail this function.
@@ -346,11 +345,8 @@ static int mlx5_pci_link_toggle(struct m
return err;
 
/* Check link */
-   err = pci_read_config_dword(bridge, cap + PCI_EXP_LNKCAP, &reg32);
-   if (err)
-   return err;
-   if (!(reg32 & PCI_EXP_LNKCAP_DLLLARC)) {
-   mlx5_core_warn(dev, "No PCI link reporting capability 
(0x%08x)\n", reg32);
+   if (!bridge->link_active_reporting) {
+   mlx5_core_warn(dev, "No PCI link reporting capability\n");
msleep(1000);
goto restore;
}


[PATCH v7 7/7] PCI: Work around PCIe link training failures

2023-04-04 Thread Maciej W. Rozycki
Attempt to handle cases such as with a downstream port of the ASMedia 
ASM2824 PCIe switch where link training never completes and the link 
continues switching between speeds indefinitely with the data link layer 
never reaching the active state.

It has been observed with a downstream port of the ASMedia ASM2824 Gen 3 
switch wired to the upstream port of the Pericom PI7C9X2G304 Gen 2 
switch, using a Delock Riser Card PCI Express x1 > 2 x PCIe x1 device, 
P/N 41433, wired to a SiFive HiFive Unmatched board.  In this setup the 
switches are supposed to negotiate the link speed of preferably 5.0GT/s, 
falling back to 2.5GT/s.

Instead the link continues oscillating between the two speeds, at the 
rate of 34-35 times per second, with link training reported repeatedly 
active ~84% of the time.  Forcibly limiting the target link speed to 
2.5GT/s with the upstream ASM2824 device however makes the two switches 
communicate correctly.  Removing the speed restriction afterwards makes 
the two devices switch to 5.0GT/s then.

Make use of these observations then and detect the inability to train 
the link, by checking for the Data Link Layer Link Active status bit 
being off while the Link Bandwidth Management Status indicating that 
hardware has changed the link speed or width in an attempt to correct 
unreliable link operation.

Restrict the speed to 2.5GT/s then with the Target Link Speed field, 
request a retrain and wait 200ms for the data link to go up.  If this 
turns out successful, then lift the restriction, letting the devices 
negotiate a higher speed.

Also check for a 2.5GT/s speed restriction the firmware may have already 
arranged and lift it too with ports of devices known to continue working 
afterwards, currently the ASM2824 only, that already report their data 
link being up.

Signed-off-by: Maciej W. Rozycki 
Link: 
https://lore.kernel.org/r/alpine.deb.2.21.2203022037020.56...@angie.orcam.me.uk/
Link: https://source.denx.de/u-boot/u-boot/-/commit/a398a51ccc68
---
Changes from v6:

- Regenerate against 6.3-rc5.

- Shorten the lore.kernel.org archive link in the change description.

Changes from v5:

- Move from a quirk into PCI core and call at device probing, hot-plug,
  reset and resume.  Keep the ASMedia part under CONFIG_PCI_QUIRKS.

- Rely on `dev->link_active_reporting' rather than re-retrieving the 
  capability.

Changes from v4:

- Remove  inclusion no longer needed.

- Make the quirk generic based on probing device features rather than 
  specific to the ASM2824 part only; take the Retrain Link bit erratum 
  into account.

- Still lift the 2.5GT/s speed restriction with the ASM2824 only.

- Increase retrain timeout from 200ms to 1s (PCIE_LINK_RETRAIN_TIMEOUT).

- Remove retrain success notification.

- Use PCIe helpers rather than generic PCI functions throughout.

- Trim down and update the wording of the change description for the 
  switch from an ASM2824-specific to a generic fixup.

Changes from v3:

- Remove the  entry for the ASM2824.

Changes from v2:

- Regenerate for 5.17-rc2 for a merge conflict.

- Replace BUG_ON for a missing PCI Express capability with WARN_ON and an
  early return.

Changes from v1:

- Regenerate for a merge conflict.
---
 drivers/pci/pci.c   |  154 ++--
 drivers/pci/pci.h   |1 
 drivers/pci/probe.c |2 
 3 files changed, 152 insertions(+), 5 deletions(-)

linux-pcie-asm2824-manual-retrain.diff
Index: linux-macro/drivers/pci/pci.c
===
--- linux-macro.orig/drivers/pci/pci.c
+++ linux-macro/drivers/pci/pci.c
@@ -859,6 +859,132 @@ int pci_wait_for_pending(struct pci_dev
return 0;
 }
 
+/*
+ * Retrain the link of a downstream PCIe port by hand if necessary.
+ *
+ * This is needed at least where a downstream port of the ASMedia ASM2824
+ * Gen 3 switch is wired to the upstream port of the Pericom PI7C9X2G304
+ * Gen 2 switch, and observed with the Delock Riser Card PCI Express x1 >
+ * 2 x PCIe x1 device, P/N 41433, plugged into the SiFive HiFive Unmatched
+ * board.
+ *
+ * In such a configuration the switches are supposed to negotiate the link
+ * speed of preferably 5.0GT/s, falling back to 2.5GT/s.  However the link
+ * continues switching between the two speeds indefinitely and the data
+ * link layer never reaches the active state, with link training reported
+ * repeatedly active ~84% of the time.  Forcing the target link speed to
+ * 2.5GT/s with the upstream ASM2824 device makes the two switches talk to
+ * each other correctly however.  And more interestingly retraining with a
+ * higher target link speed afterwards lets the two successfully negotiate
+ * 5.0GT/s.
+ *
+ * With the ASM2824 we can rely on the otherwise optional Data Link Layer
+ * Link Active status bit and in the failed link training scenario it will
+ * be off along with the Link Bandwidth Management Status indicating that
+ * hardware has changed the 

[PATCH v7 1/7] PCI: Export PCI link retrain timeout

2023-04-04 Thread Maciej W. Rozycki
Rename LINK_RETRAIN_TIMEOUT to PCIE_LINK_RETRAIN_TIMEOUT and make it
available via "pci.h" for PCI drivers to use.

Signed-off-by: Maciej W. Rozycki 
---
No change from v6.

No change from v5.

New change in v5.
---
 drivers/pci/pci.h   |2 ++
 drivers/pci/pcie/aspm.c |4 +---
 2 files changed, 3 insertions(+), 3 deletions(-)

linux-pcie-link-retrain-timeout.diff
Index: linux-macro/drivers/pci/pci.h
===
--- linux-macro.orig/drivers/pci/pci.h
+++ linux-macro/drivers/pci/pci.h
@@ -11,6 +11,8 @@
 
 #define PCI_VSEC_ID_INTEL_TBT  0x1234  /* Thunderbolt */
 
+#define PCIE_LINK_RETRAIN_TIMEOUT HZ
+
 extern const unsigned char pcie_link_speed[];
 extern bool pci_early_dump;
 
Index: linux-macro/drivers/pci/pcie/aspm.c
===
--- linux-macro.orig/drivers/pci/pcie/aspm.c
+++ linux-macro/drivers/pci/pcie/aspm.c
@@ -90,8 +90,6 @@ static const char *policy_str[] = {
[POLICY_POWER_SUPERSAVE] = "powersupersave"
 };
 
-#define LINK_RETRAIN_TIMEOUT HZ
-
 /*
  * The L1 PM substate capability is only implemented in function 0 in a
  * multi function device.
@@ -213,7 +211,7 @@ static bool pcie_retrain_link(struct pci
}
 
/* Wait for link training end. Break out after waiting for timeout */
-   end_jiffies = jiffies + LINK_RETRAIN_TIMEOUT;
+   end_jiffies = jiffies + PCIE_LINK_RETRAIN_TIMEOUT;
do {
pcie_capability_read_word(parent, PCI_EXP_LNKSTA, &reg16);
if (!(reg16 & PCI_EXP_LNKSTA_LT))


[PATCH v7 0/7] pci: Work around ASMedia ASM2824 PCIe link training failures

2023-04-04 Thread Maciej W. Rozycki
Hi,

 This is v7 of the change to work around a PCIe link training phenomenon 
where a pair of devices both capable of operating at a link speed above 
2.5GT/s seems unable to negotiate the link speed and continues training 
indefinitely with the Link Training bit switching on and off repeatedly 
and the data link layer never reaching the active state.

 This version has been trivially rebased on top of 6.3-rc5 and verified at 
run time.  

 Previous iteration: 
.

  Maciej


[linux-next:master] BUILD REGRESSION 6a53bda3aaf3de5edeea27d0b1d8781d067640b6

2023-04-04 Thread kernel test robot
tree/branch: 
https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git master
branch HEAD: 6a53bda3aaf3de5edeea27d0b1d8781d067640b6  Add linux-next specific 
files for 20230404

Error/Warning reports:

https://lore.kernel.org/oe-kbuild-all/202303082135.njdx1bij-...@intel.com
https://lore.kernel.org/oe-kbuild-all/202303161521.jbgbafjj-...@intel.com
https://lore.kernel.org/oe-kbuild-all/202304041708.siwlxmyd-...@intel.com
https://lore.kernel.org/oe-kbuild-all/202304041748.0sqc4k4l-...@intel.com
https://lore.kernel.org/oe-kbuild-all/202304042104.ufiuevbp-...@intel.com
https://lore.kernel.org/oe-kbuild-all/202304050029.38ndbqpf-...@intel.com

Error/Warning: (recently discovered and may have been fixed)

Documentation/virt/kvm/api.rst:8303: WARNING: Field list ends without a blank 
line; unexpected unindent.
ERROR: modpost: "bpf_fentry_test1" 
[tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.ko] undefined!
Error: failed to load BTF from vmlinux: No data available
Makefile:77: *** Cannot find a vmlinux for VMLINUX_BTF at any of "vmlinux 
vmlinux ../../../../vmlinux /sys/kernel/btf/vmlinux 
/boot/vmlinux-5.9.0-0.bpo.2-amd64".  Stop.
arch/m68k/include/asm/irq.h:78:11: error: expected ';' before 'void'
arch/m68k/include/asm/irq.h:78:40: warning: 'struct pt_regs' declared inside 
parameter list will not be visible outside of this definition or declaration
diff: tools/arch/s390/include/uapi/asm/ptrace.h: No such file or directory
drivers/gpu/drm/amd/amdgpu/../display/dc/link/link_validation.c:351:13: 
warning: variable 'bw_needed' set but not used [-Wunused-but-set-variable]
drivers/gpu/drm/amd/amdgpu/../display/dc/link/link_validation.c:352:25: 
warning: variable 'link' set but not used [-Wunused-but-set-variable]
drivers/gpu/drm/amd/amdgpu/../pm/swsmu/smu13/smu_v13_0_6_ppt.c:309:17: sparse:  
  int
drivers/gpu/drm/amd/amdgpu/../pm/swsmu/smu13/smu_v13_0_6_ppt.c:309:17: sparse:  
  void
drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c:148:31: error: implicit 
declaration of function 'pci_msix_can_alloc_dyn' 
[-Werror=implicit-function-declaration]
drivers/net/wireless/legacy/ray_cs.c:628:17: warning: 'strncpy' specified bound 
32 equals destination size [-Wstringop-truncation]
kernel/bpf/verifier.c:18503: undefined reference to `find_kallsyms_symbol_value'
ld.lld: error: .btf.vmlinux.bin.o: unknown file type
ld.lld: error: undefined symbol: find_kallsyms_symbol_value
tcp_mmap.c:211:61: warning: 'lu' may be used uninitialized in this function 
[-Wmaybe-uninitialized]
thermal_nl.h:6:10: fatal error: netlink/netlink.h: No such file or directory
thermometer.c:21:10: fatal error: libconfig.h: No such file or directory

Unverified Error/Warning (likely false positive, please contact us if 
interested):

drivers/acpi/property.c:985 acpi_data_prop_read_single() error: potentially 
dereferencing uninitialized 'obj'.
drivers/pinctrl/pinctrl-mlxbf3.c:162:20: sparse: sparse: symbol 
'mlxbf3_pmx_funcs' was not declared. Should it be static?
drivers/soc/fsl/qe/tsa.c:140:26: sparse: sparse: incorrect type in argument 2 
(different address spaces)
drivers/soc/fsl/qe/tsa.c:150:27: sparse: sparse: incorrect type in argument 1 
(different address spaces)
drivers/soc/fsl/qe/tsa.c:189:26: sparse: sparse: dereference of noderef 
expression
drivers/soc/fsl/qe/tsa.c:663:22: sparse: sparse: incorrect type in assignment 
(different address spaces)
drivers/soc/fsl/qe/tsa.c:673:21: sparse: sparse: incorrect type in assignment 
(different address spaces)
include/linux/gpio/consumer.h: linux/err.h is included more than once.
include/linux/gpio/driver.h: asm/bug.h is included more than once.
io_uring/io_uring.c:432 io_prep_async_work() error: we previously assumed 
'req->file' could be null (see line 425)
io_uring/kbuf.c:221 __io_remove_buffers() warn: variable dereferenced before 
check 'bl->buf_ring' (see line 219)

Error/Warning ids grouped by kconfigs:

gcc_recent_errors
|-- alpha-allyesconfig
|   |-- 
drivers-gpu-drm-amd-amdgpu-..-display-dc-link-link_validation.c:warning:variable-bw_needed-set-but-not-used
|   |-- 
drivers-gpu-drm-amd-amdgpu-..-display-dc-link-link_validation.c:warning:variable-link-set-but-not-used
|   `-- 
drivers-net-wireless-legacy-ray_cs.c:warning:strncpy-specified-bound-equals-destination-size
|-- alpha-buildonly-randconfig-r005-20230403
|   |-- 
drivers-gpu-drm-amd-amdgpu-..-display-dc-link-link_validation.c:warning:variable-bw_needed-set-but-not-used
|   `-- 
drivers-gpu-drm-amd-amdgpu-..-display-dc-link-link_validation.c:warning:variable-link-set-but-not-used
|-- alpha-randconfig-s051-20230403
|   |-- 
drivers-gpu-drm-amd-amdgpu-..-pm-swsmu-smu13-smu_v13_0_6_ppt.c:sparse:int
|   |-- 
drivers-gpu-drm-amd-amdgpu-..-pm-swsmu-smu13-smu_v13_0_6_ppt.c:sparse:sparse:incompatible-types-in-conditional-expression-(different-base-types):
|   |-- 
drivers-gpu-drm-amd-amdgpu-..-pm-swsmu-smu13-smu_v13_0_6_ppt.c:sparse:void
|   `-- 
drivers-pinctrl-pinctrl-mlxbf3.c:sparse:spar

Re: [PATCH v8 0/7] Add pci_dev_for_each_resource() helper and update users

2023-04-04 Thread Bjorn Helgaas
On Thu, Mar 30, 2023 at 07:24:27PM +0300, Andy Shevchenko wrote:
> Provide two new helper macros to iterate over PCI device resources and
> convert users.
> 
> Looking at it, refactor existing pci_bus_for_each_resource() and convert
> users accordingly.
> 
> Note, the amount of lines grew due to the documentation update.
> 
> Changelog v8:
> - fixed issue with pci_bus_for_each_resource() macro (LKP)
> - due to above added a new patch to document how it works
> - moved the last patch to be #2 (Philippe)
> - added tags (Philippe)
> 
> Changelog v7:
> - made both macros to share same name (Bjorn)

I didn't actually request the same name for both; I would have had no
idea how to even do that :)

v6 had:

  pci_dev_for_each_resource_p(dev, res)
  pci_dev_for_each_resource(dev, res, i)

and I suggested:

  pci_dev_for_each_resource(dev, res)
  pci_dev_for_each_resource_idx(dev, res, i)

because that pattern is used elsewhere.  But you figured out how to do
it, and having one name is even better, so thanks for that extra work!

> - split out the pci_resource_n() conversion (Bjorn)
> 
> Changelog v6:
> - dropped unused variable in PPC code (LKP)
> 
> Changelog v5:
> - renamed loop variable to minimize the clash (Keith)
> - addressed smatch warning (Dan)
> - addressed 0-day bot findings (LKP)
> 
> Changelog v4:
> - rebased on top of v6.3-rc1
> - added tag (Krzysztof)
> 
> Changelog v3:
> - rebased on top of v2 by Mika, see above
> - added tag to pcmcia patch (Dominik)
> 
> Changelog v2:
> - refactor to have two macros
> - refactor existing pci_bus_for_each_resource() in the same way and
>   convert users
> 
> Andy Shevchenko (6):
>   kernel.h: Split out COUNT_ARGS() and CONCATENATE()
>   PCI: Introduce pci_resource_n()
>   PCI: Document pci_bus_for_each_resource() to avoid confusion
>   PCI: Allow pci_bus_for_each_resource() to take less arguments
>   EISA: Convert to use less arguments in pci_bus_for_each_resource()
>   pcmcia: Convert to use less arguments in pci_bus_for_each_resource()
> 
> Mika Westerberg (1):
>   PCI: Introduce pci_dev_for_each_resource()
> 
>  .clang-format |  1 +
>  arch/alpha/kernel/pci.c   |  5 +-
>  arch/arm/kernel/bios32.c  | 16 +++--
>  arch/arm/mach-dove/pcie.c | 10 ++--
>  arch/arm/mach-mv78xx0/pcie.c  | 10 ++--
>  arch/arm/mach-orion5x/pci.c   | 10 ++--
>  arch/mips/pci/ops-bcm63xx.c   |  8 +--
>  arch/mips/pci/pci-legacy.c|  3 +-
>  arch/powerpc/kernel/pci-common.c  | 21 +++
>  arch/powerpc/platforms/4xx/pci.c  |  8 +--
>  arch/powerpc/platforms/52xx/mpc52xx_pci.c |  5 +-
>  arch/powerpc/platforms/pseries/pci.c  | 16 ++---
>  arch/sh/drivers/pci/pcie-sh7786.c | 10 ++--
>  arch/sparc/kernel/leon_pci.c  |  5 +-
>  arch/sparc/kernel/pci.c   | 10 ++--
>  arch/sparc/kernel/pcic.c  |  5 +-
>  drivers/eisa/pci_eisa.c   |  4 +-
>  drivers/pci/bus.c |  7 +--
>  drivers/pci/hotplug/shpchp_sysfs.c|  8 +--
>  drivers/pci/pci.c |  3 +-
>  drivers/pci/probe.c   |  2 +-
>  drivers/pci/remove.c  |  5 +-
>  drivers/pci/setup-bus.c   | 37 +---
>  drivers/pci/setup-res.c   |  4 +-
>  drivers/pci/vgaarb.c  | 17 ++
>  drivers/pci/xen-pcifront.c|  4 +-
>  drivers/pcmcia/rsrc_nonstatic.c   |  9 +--
>  drivers/pcmcia/yenta_socket.c |  3 +-
>  drivers/pnp/quirks.c  | 29 -
>  include/linux/args.h  | 13 
>  include/linux/kernel.h|  8 +--
>  include/linux/pci.h   | 72 +++
>  32 files changed, 190 insertions(+), 178 deletions(-)
>  create mode 100644 include/linux/args.h

Applied 2-7 to pci/resource for v6.4, thanks, I really like this!

I omitted

  [1/7] kernel.h: Split out COUNT_ARGS() and CONCATENATE()"

only because it's not essential to this series and has only a trivial
one-line impact on include/linux/pci.h.

Bjorn


Re: [PATCH v3 2/2] soc: fsl: qbman: Use raw spinlock for cgr_lock

2023-04-04 Thread Sean Anderson
On 4/4/23 11:33, Crystal Wood wrote:
> On Tue, 2023-04-04 at 10:55 -0400, Sean Anderson wrote:
> 
>> @@ -1456,11 +1456,11 @@ static void qm_congestion_task(struct work_struct
>> *work)
>> union qm_mc_result *mcr;
>> struct qman_cgr *cgr;
>>  
>> -   spin_lock_irq(&p->cgr_lock);
>> +   raw_spin_lock_irq(&p->cgr_lock);
>> qm_mc_start(&p->p);
>> qm_mc_commit(&p->p, QM_MCC_VERB_QUERYCONGESTION);
>> if (!qm_mc_result_timeout(&p->p, &mcr)) {
>> -   spin_unlock_irq(&p->cgr_lock);
>> +   raw_spin_unlock_irq(&p->cgr_lock);
> 
> qm_mc_result_timeout() spins with a timeout of 10 ms which is very
> inappropriate for a raw lock.  What is the actual expected upper bound?

Hm, maybe we can move this qm_mc stuff outside cgr_lock? In most other
places they're called without cgr_lock, which implies that its usage
here is meant to synchronize against some other function.

>> dev_crit(p->config->dev, "QUERYCONGESTION timeout\n");
>> qman_p_irqsource_add(p, QM_PIRQ_CSCI);
>> return;
>> @@ -1476,7 +1476,7 @@ static void qm_congestion_task(struct work_struct
>> *work)
>> list_for_each_entry(cgr, &p->cgr_cbs, node)
>> if (cgr->cb && qman_cgrs_get(&c, cgr->cgrid))
>> cgr->cb(p, cgr, qman_cgrs_get(&c, cgr->cgrid));
>> -   spin_unlock_irq(&p->cgr_lock);
>> +   raw_spin_unlock_irq(&p->cgr_lock);
>> qman_p_irqsource_add(p, QM_PIRQ_CSCI);
>>  }
> 
> The callback loop is also a bit concerning...

The callbacks (in .../dpaa/dpaa_eth.c and .../caam/qi.c) look OK. The
only thing which might take a bit is dpaa_eth_refill_bpools, which
allocates memory (from the atomic pool).

--Sean


Re: [PATCH 3/3] mm/mmu_gather: send tlb_remove_table_smp_sync IPI only to CPUs in kernel mode

2023-04-04 Thread Peter Zijlstra
On Tue, Apr 04, 2023 at 05:12:17PM +0200, Peter Zijlstra wrote:
> > case 2:
> > CPU-A CPU-B
> > 
> > modify pagetables
> > tlb_flush (memory barrier)
> >   state == CONTEXT_USER
> > int state = atomic_read(&ct->state);
> >   Kernel-enter:
> >   state == CONTEXT_KERNEL
> >   READ(pagetable values)
> > if (state & CT_STATE_MASK == CONTEXT_USER)
> > 


Hmm, hold up; what about memory ordering, we need a store-load ordering
between the page-table write and the context tracking load, and a
store-load order on the context tracking update and software page-table
walker loads.

Now, iirc page-table modification is done under pte_lock (or
page_table_lock) and that only provides a RELEASE barrier on this end,
which is insufficient to order against a later load.

Is there anything else?

On the state tracking side, we have ct_state_inc() which is
atomic_add_return() which should provide full barrier and is sufficient.


[powerpc:merge 5/7] .github/problem-matchers/sparse.json: warning: ignored by one of the .gitignore files

2023-04-04 Thread kernel test robot
tree:   https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git merge
head:   639e8992872c632f27b130b403e263eae966231e
commit: ff94f02dbdf0d6077497f1ffb63080c6937c3ed9 [5/7] powerpc/ci: Add sparse 
problem matcher
config: arc-buildonly-randconfig-r003-20230403 
(https://download.01.org/0day-ci/archive/20230404/202304042327.blhf5ncp-...@intel.com/config)
compiler: arceb-elf-gcc (GCC) 12.1.0
reproduce (this is a W=1 build):
wget 
https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
~/bin/make.cross
chmod +x ~/bin/make.cross
# 
https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git/commit/?id=ff94f02dbdf0d6077497f1ffb63080c6937c3ed9
git remote add powerpc 
https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git
git fetch --no-tags powerpc merge
git checkout ff94f02dbdf0d6077497f1ffb63080c6937c3ed9
# save the config file
mkdir build_dir && cp config build_dir/.config
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 
O=build_dir ARCH=arc olddefconfig
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 
O=build_dir ARCH=arc SHELL=/bin/bash

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot 
| Link: 
https://lore.kernel.org/oe-kbuild-all/202304042327.blhf5ncp-...@intel.com/

All warnings (new ones prefixed by >>):

   .github/problem-matchers/compiler-non-source.json: warning: ignored by one 
of the .gitignore files
   .github/problem-matchers/compiler-source.json: warning: ignored by one of 
the .gitignore files
>> .github/problem-matchers/sparse.json: warning: ignored by one of the 
>> .gitignore files
   .github/workflows/powerpc-allconfig.yml: warning: ignored by one of the 
.gitignore files
   .github/workflows/powerpc-clang.yml: warning: ignored by one of the 
.gitignore files
   .github/workflows/powerpc-extrawarn.yml: warning: ignored by one of the 
.gitignore files
   .github/workflows/powerpc-kernel+qemu.yml: warning: ignored by one of the 
.gitignore files
   .github/workflows/powerpc-perf.yml: warning: ignored by one of the 
.gitignore files
   .github/workflows/powerpc-ppctests.yml: warning: ignored by one of the 
.gitignore files
   .github/workflows/powerpc-selftests.yml: warning: ignored by one of the 
.gitignore files
   .github/workflows/powerpc-sparse.yml: warning: ignored by one of the 
.gitignore files
   drivers/clk/.kunitconfig: warning: ignored by one of the .gitignore files
   drivers/gpu/drm/tests/.kunitconfig: warning: ignored by one of the 
.gitignore files
   drivers/gpu/drm/vc4/tests/.kunitconfig: warning: ignored by one of the 
.gitignore files
   drivers/hid/.kunitconfig: warning: ignored by one of the .gitignore files
   fs/ext4/.kunitconfig: warning: ignored by one of the .gitignore files
   fs/fat/.kunitconfig: warning: ignored by one of the .gitignore files
   kernel/kcsan/.kunitconfig: warning: ignored by one of the .gitignore files
   lib/kunit/.kunitconfig: warning: ignored by one of the .gitignore files
   mm/kfence/.kunitconfig: warning: ignored by one of the .gitignore files
   net/sunrpc/.kunitconfig: warning: ignored by one of the .gitignore files
   tools/testing/selftests/arm64/tags/.gitignore: warning: ignored by one of 
the .gitignore files
   tools/testing/selftests/arm64/tags/Makefile: warning: ignored by one of the 
.gitignore files
   tools/testing/selftests/arm64/tags/run_tags_test.sh: warning: ignored by one 
of the .gitignore files
   tools/testing/selftests/arm64/tags/tags_test.c: warning: ignored by one of 
the .gitignore files
   tools/testing/selftests/kvm/.gitignore: warning: ignored by one of the 
.gitignore files
   tools/testing/selftests/kvm/Makefile: warning: ignored by one of the 
.gitignore files
   tools/testing/selftests/kvm/config: warning: ignored by one of the 
.gitignore files
   tools/testing/selftests/kvm/settings: warning: ignored by one of the 
.gitignore files

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests


Re: [PATCH v3 2/2] soc: fsl: qbman: Use raw spinlock for cgr_lock

2023-04-04 Thread Crystal Wood
On Tue, 2023-04-04 at 10:55 -0400, Sean Anderson wrote:

> @@ -1456,11 +1456,11 @@ static void qm_congestion_task(struct work_struct
> *work)
> union qm_mc_result *mcr;
> struct qman_cgr *cgr;
>  
> -   spin_lock_irq(&p->cgr_lock);
> +   raw_spin_lock_irq(&p->cgr_lock);
> qm_mc_start(&p->p);
> qm_mc_commit(&p->p, QM_MCC_VERB_QUERYCONGESTION);
> if (!qm_mc_result_timeout(&p->p, &mcr)) {
> -   spin_unlock_irq(&p->cgr_lock);
> +   raw_spin_unlock_irq(&p->cgr_lock);

qm_mc_result_timeout() spins with a timeout of 10 ms which is very
inappropriate for a raw lock.  What is the actual expected upper bound?

> dev_crit(p->config->dev, "QUERYCONGESTION timeout\n");
> qman_p_irqsource_add(p, QM_PIRQ_CSCI);
> return;
> @@ -1476,7 +1476,7 @@ static void qm_congestion_task(struct work_struct
> *work)
> list_for_each_entry(cgr, &p->cgr_cbs, node)
> if (cgr->cb && qman_cgrs_get(&c, cgr->cgrid))
> cgr->cb(p, cgr, qman_cgrs_get(&c, cgr->cgrid));
> -   spin_unlock_irq(&p->cgr_lock);
> +   raw_spin_unlock_irq(&p->cgr_lock);
> qman_p_irqsource_add(p, QM_PIRQ_CSCI);
>  }

The callback loop is also a bit concerning...

-Crystal



Re: [PATCH] powerpc/64: Always build with 128-bit long double

2023-04-04 Thread Segher Boessenkool
Hi!

On Tue, Apr 04, 2023 at 08:28:47PM +1000, Michael Ellerman wrote:
> The amdgpu driver builds some of its code with hard-float enabled,
> whereas the rest of the kernel is built with soft-float.
> 
> When building with 64-bit long double, if soft-float and hard-float
> objects are linked together, the build fails due to incompatible ABI
> tags.

> Currently those build errors are avoided because the amdgpu driver is
> gated on 128-bit long double being enabled. But that's not a detail the
> amdgpu driver should need to be aware of, and if another driver starts
> using hard-float the same problem would occur.

Well.  The kernel driver either has no business using long double (or
any other floating point even) at all, or it should know exactly what is
used: double precision, double-double, or quadruple precision.  Both of
the latter two are 128 bits.

> All versions of the 64-bit ABI specify that long-double is 128-bits.
> However some compilers, notably the kernel.org ones, are built to use
> 64-bit long double by default.

Mea culpa, I suppose?  But builddall doesn't force 64 bit explicitly.
I wonder how this happened?  Is it maybe a problem in the powerpc64le
config in GCC itself?  I have a patch from summer last year (Arnd's
toolchains are built without it) that does
+   powerpc64le-*)  TARGET_GCC_CONF=--with-long-double-128
Unfortunately I don't remember why I did that, and I never investigated
what the deeper problem is :-/

In either case, the kernel should always use specific types, not rely on
the toolchain to pick a type that may or may not work.  The correct size
floating point type alone is not enough, but it is a step in the right
direction certainly.

Reviewed-by: Segher Boessenkool 


Segher


Re: [PATCH 3/3] mm/mmu_gather: send tlb_remove_table_smp_sync IPI only to CPUs in kernel mode

2023-04-04 Thread Peter Zijlstra
On Tue, Apr 04, 2023 at 04:42:24PM +0300, Yair Podemsky wrote:
> The tlb_remove_table_smp_sync IPI is used to ensure the outdated tlb page
> is not currently being accessed and can be cleared.
> This occurs once all CPUs have left the lockless gup code section.
> If they reenter the page table walk, the pointers will be to the new
> pages.
> Therefore the IPI is only needed for CPUs in kernel mode.
> By preventing the IPI from being sent to CPUs not in kernel mode,
> Latencies are reduced.
> 
> Race conditions considerations:
> The context state check is vulnerable to race conditions between the
> moment the context state is read to when the IPI is sent (or not).
> 
> Here are these scenarios.
> case 1:
> CPU-A CPU-B
> 
>   state == CONTEXT_KERNEL
> int state = atomic_read(&ct->state);
>   Kernel-exit:
>   state == CONTEXT_USER
> if (state & CT_STATE_MASK == CONTEXT_KERNEL)
> 
> In this case, the IPI will be sent to CPU-B despite it no longer being in
> the kernel. The consequence would be an unnecessary IPI being
> handled by CPU-B, adding latency.
> This would have been the case every time without this patch.
> 
> case 2:
> CPU-A CPU-B
> 
> modify pagetables
> tlb_flush (memory barrier)
>   state == CONTEXT_USER
> int state = atomic_read(&ct->state);
>   Kernel-enter:
>   state == CONTEXT_KERNEL
>   READ(pagetable values)
> if (state & CT_STATE_MASK == CONTEXT_USER)
> 
> In this case, the IPI will not be sent to CPU-B despite it returning to
> the kernel and even reading the pagetable.
> However, since CPU-B entered the page table walk after the
> modification, it is reading the new, safe values.
> 
> The only case when this IPI is truly necessary is when CPU-B has entered
> the lockless gup code section before the pagetable modifications and
> has yet to exit them, in which case it is still in the kernel.
> 
> Signed-off-by: Yair Podemsky 
> ---
>  mm/mmu_gather.c | 19 +--
>  1 file changed, 17 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
> index 5ea9be6fb87c..731d955e152d 100644
> --- a/mm/mmu_gather.c
> +++ b/mm/mmu_gather.c
> @@ -9,6 +9,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include 
>  #include 
> @@ -191,6 +192,20 @@ static void tlb_remove_table_smp_sync(void *arg)
>   /* Simply deliver the interrupt */
>  }
>  
> +
> +#ifdef CONFIG_CONTEXT_TRACKING
> +static bool cpu_in_kernel(int cpu, void *info)
> +{
> + struct context_tracking *ct = per_cpu_ptr(&context_tracking, cpu);
> + int state = atomic_read(&ct->state);
> + /* will return true only for cpus in kernel space */
> + return state & CT_STATE_MASK == CONTEXT_KERNEL;
> +}
> +#define CONTEXT_PREDICATE cpu_in_kernel
> +#else
> +#define CONTEXT_PREDICATE NULL
> +#endif /* CONFIG_CONTEXT_TRACKING */
> +
>  #ifdef CONFIG_ARCH_HAS_CPUMASK_BITS
>  #define REMOVE_TABLE_IPI_MASK mm_cpumask(mm)
>  #else
> @@ -206,8 +221,8 @@ void tlb_remove_table_sync_one(struct mm_struct *mm)
>* It is however sufficient for software page-table walkers that rely on
>* IRQ disabling.
>*/
> - on_each_cpu_mask(REMOVE_TABLE_IPI_MASK, tlb_remove_table_smp_sync,
> - NULL, true);
> + on_each_cpu_cond_mask(CONTEXT_PREDICATE, tlb_remove_table_smp_sync,
> + NULL, true, REMOVE_TABLE_IPI_MASK);
>  }

I think this is correct; but... I would like much of the changelog
included in a comment above cpu_in_kernel(). I'm sure someone will try
and read this code and wonder about those race conditions.

Of crucial importance is the fact that the page-table modification comes
before the tlbi.

Also, do we really not already have this helper function somewhere, it
seems like something obvious to already have, Frederic?
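
The comment Peter asks for might, for instance, condense the changelog's race analysis to something like (a sketch only, not the author's wording):

```
/*
 * cpu_in_kernel() can race with kernel entry/exit on the target CPU:
 *
 *  - If the CPU leaves the kernel after we sample its state, it gets a
 *    spurious IPI.  Harmless: that was the unconditional behaviour
 *    before filtering was added.
 *
 *  - If the CPU enters the kernel after we sample its state, it is
 *    skipped.  Also safe: the page-table modification and the TLB
 *    flush (a memory barrier) happened before it re-entered, so any
 *    page-table walk it starts sees only the new, valid entries.
 *
 * Only CPUs already inside the lockless GUP window when the tables
 * were modified truly need the IPI, and those are in kernel mode.
 */
```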




Re: [PATCH 2/3] mm/mmu_gather: send tlb_remove_table_smp_sync IPI only to MM CPUs

2023-04-04 Thread Peter Zijlstra
On Tue, Apr 04, 2023 at 04:42:23PM +0300, Yair Podemsky wrote:
> diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
> index 2b93cf6ac9ae..5ea9be6fb87c 100644
> --- a/mm/mmu_gather.c
> +++ b/mm/mmu_gather.c
> @@ -191,7 +191,13 @@ static void tlb_remove_table_smp_sync(void *arg)
>   /* Simply deliver the interrupt */
>  }
>  
> -void tlb_remove_table_sync_one(void)
> +#ifdef CONFIG_ARCH_HAS_CPUMASK_BITS
> +#define REMOVE_TABLE_IPI_MASK mm_cpumask(mm)
> +#else
> +#define REMOVE_TABLE_IPI_MASK NULL
> +#endif /* CONFIG_ARCH_HAS_CPUMASK_BITS */
> +
> +void tlb_remove_table_sync_one(struct mm_struct *mm)
>  {
>   /*
>* This isn't an RCU grace period and hence the page-tables cannot be
> @@ -200,7 +206,8 @@ void tlb_remove_table_sync_one(void)
>* It is however sufficient for software page-table walkers that rely on
>* IRQ disabling.
>*/
> - smp_call_function(tlb_remove_table_smp_sync, NULL, 1);
> + on_each_cpu_mask(REMOVE_TABLE_IPI_MASK, tlb_remove_table_smp_sync,
> + NULL, true);
>  }

Uhh, I don't think NULL is a valid @mask argument. Should that not be
something like:

#ifdef CONFIG_ARCH_HAS_CPUMASK
#define REMOVE_TABLE_IPI_MASK mm_cpumask(mm)
#else
#define REMOVE_TABLE_IPI_MASK cpu_online_mask
#endif

preempt_disable();
on_each_cpu_mask(REMOVE_TABLE_IPI_MASK, tlb_remove_table_smp_sync, NULL,
true);
preempt_enable();


?


[PATCH v3 1/2] soc: fsl: qbman: Always disable interrupts when taking cgr_lock

2023-04-04 Thread Sean Anderson
smp_call_function_single disables IRQs when executing the callback. To
prevent deadlocks, we must disable IRQs when taking cgr_lock elsewhere.
This is already done by qman_update_cgr and qman_delete_cgr; fix the
other lockers.

Fixes: 96f413f47677 ("soc/fsl/qbman: fix issue in qman_delete_cgr_safe()")
Signed-off-by: Sean Anderson 
Reviewed-by: Camelia Groza 
Tested-by: Vladimir Oltean 
---

Changes in v3:
- Change blamed commit to something more appropriate

Changes in v2:
- Fix one additional call to spin_unlock

 drivers/soc/fsl/qbman/qman.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/soc/fsl/qbman/qman.c b/drivers/soc/fsl/qbman/qman.c
index 739e4eee6b75..1bf1f1ea67f0 100644
--- a/drivers/soc/fsl/qbman/qman.c
+++ b/drivers/soc/fsl/qbman/qman.c
@@ -1456,11 +1456,11 @@ static void qm_congestion_task(struct work_struct *work)
union qm_mc_result *mcr;
struct qman_cgr *cgr;
 
-   spin_lock(&p->cgr_lock);
+   spin_lock_irq(&p->cgr_lock);
qm_mc_start(&p->p);
qm_mc_commit(&p->p, QM_MCC_VERB_QUERYCONGESTION);
if (!qm_mc_result_timeout(&p->p, &mcr)) {
-   spin_unlock(&p->cgr_lock);
+   spin_unlock_irq(&p->cgr_lock);
dev_crit(p->config->dev, "QUERYCONGESTION timeout\n");
qman_p_irqsource_add(p, QM_PIRQ_CSCI);
return;
@@ -1476,7 +1476,7 @@ static void qm_congestion_task(struct work_struct *work)
list_for_each_entry(cgr, &p->cgr_cbs, node)
if (cgr->cb && qman_cgrs_get(&c, cgr->cgrid))
cgr->cb(p, cgr, qman_cgrs_get(&c, cgr->cgrid));
-   spin_unlock(&p->cgr_lock);
+   spin_unlock_irq(&p->cgr_lock);
qman_p_irqsource_add(p, QM_PIRQ_CSCI);
 }
 
@@ -2440,7 +2440,7 @@ int qman_create_cgr(struct qman_cgr *cgr, u32 flags,
preempt_enable();
 
cgr->chan = p->config->channel;
-   spin_lock(&p->cgr_lock);
+   spin_lock_irq(&p->cgr_lock);
 
if (opts) {
struct qm_mcc_initcgr local_opts = *opts;
@@ -2477,7 +2477,7 @@ int qman_create_cgr(struct qman_cgr *cgr, u32 flags,
qman_cgrs_get(&p->cgrs[1], cgr->cgrid))
cgr->cb(p, cgr, 1);
 out:
-   spin_unlock(&p->cgr_lock);
+   spin_unlock_irq(&p->cgr_lock);
put_affine_portal();
return ret;
 }
-- 
2.35.1.1320.gc452695387.dirty
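
The deadlock being closed can be sketched as follows (pseudocode; function names follow the driver, the interleaving is illustrative):

```
/* CPU0, process context                CPU0, IPI (hardirq) context
 *
 * qman_create_cgr()
 *   spin_lock(&p->cgr_lock)     <- IRQs left enabled
 *   ...
 *        <- IPI from smp_call_function_single() interrupts CPU0 ->
 *                                      qman_update_cgr() callback
 *                                        spin_lock_irqsave(&p->cgr_lock)
 *                                        -> spins forever: the holder is
 *                                           the task this IPI interrupted
 *
 * Taking cgr_lock with spin_lock_irq() everywhere keeps the IPI from
 * being delivered while the lock is held on that CPU.
 */
```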



[PATCH v3 2/2] soc: fsl: qbman: Use raw spinlock for cgr_lock

2023-04-04 Thread Sean Anderson
cgr_lock may be locked with interrupts already disabled by
smp_call_function_single. As such, we must use a raw spinlock to avoid
problems on PREEMPT_RT kernels. Although this bug has existed for a
while, it was not apparent until commit ef2a8d5478b9 ("net: dpaa: Adjust
queue depth on rate change") which invokes smp_call_function_single via
qman_update_cgr_safe every time a link goes up or down.

Fixes: 96f413f47677 ("soc/fsl/qbman: fix issue in qman_delete_cgr_safe()")
Reported-by: Vladimir Oltean 
Link: https://lore.kernel.org/all/20230323153935.nofnjucqjqnz34ej@skbuf/
Signed-off-by: Sean Anderson 
Reviewed-by: Camelia Groza 
Tested-by: Vladimir Oltean 
---

Changes in v3:
- Change blamed commit to something more appropriate

 drivers/soc/fsl/qbman/qman.c | 22 +++---
 1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/drivers/soc/fsl/qbman/qman.c b/drivers/soc/fsl/qbman/qman.c
index 1bf1f1ea67f0..7a1558aba523 100644
--- a/drivers/soc/fsl/qbman/qman.c
+++ b/drivers/soc/fsl/qbman/qman.c
@@ -991,7 +991,7 @@ struct qman_portal {
/* linked-list of CSCN handlers. */
struct list_head cgr_cbs;
/* list lock */
-   spinlock_t cgr_lock;
+   raw_spinlock_t cgr_lock;
struct work_struct congestion_work;
struct work_struct mr_work;
char irqname[MAX_IRQNAME];
@@ -1281,7 +1281,7 @@ static int qman_create_portal(struct qman_portal *portal,
/* if the given mask is NULL, assume all CGRs can be seen */
qman_cgrs_fill(&portal->cgrs[0]);
INIT_LIST_HEAD(&portal->cgr_cbs);
-   spin_lock_init(&portal->cgr_lock);
+   raw_spin_lock_init(&portal->cgr_lock);
INIT_WORK(&portal->congestion_work, qm_congestion_task);
INIT_WORK(&portal->mr_work, qm_mr_process_task);
portal->bits = 0;
@@ -1456,11 +1456,11 @@ static void qm_congestion_task(struct work_struct *work)
union qm_mc_result *mcr;
struct qman_cgr *cgr;
 
-   spin_lock_irq(&p->cgr_lock);
+   raw_spin_lock_irq(&p->cgr_lock);
qm_mc_start(&p->p);
qm_mc_commit(&p->p, QM_MCC_VERB_QUERYCONGESTION);
if (!qm_mc_result_timeout(&p->p, &mcr)) {
-   spin_unlock_irq(&p->cgr_lock);
+   raw_spin_unlock_irq(&p->cgr_lock);
dev_crit(p->config->dev, "QUERYCONGESTION timeout\n");
qman_p_irqsource_add(p, QM_PIRQ_CSCI);
return;
@@ -1476,7 +1476,7 @@ static void qm_congestion_task(struct work_struct *work)
list_for_each_entry(cgr, &p->cgr_cbs, node)
if (cgr->cb && qman_cgrs_get(&c, cgr->cgrid))
cgr->cb(p, cgr, qman_cgrs_get(&c, cgr->cgrid));
-   spin_unlock_irq(&p->cgr_lock);
+   raw_spin_unlock_irq(&p->cgr_lock);
qman_p_irqsource_add(p, QM_PIRQ_CSCI);
 }
 
@@ -2440,7 +2440,7 @@ int qman_create_cgr(struct qman_cgr *cgr, u32 flags,
preempt_enable();
 
cgr->chan = p->config->channel;
-   spin_lock_irq(&p->cgr_lock);
+   raw_spin_lock_irq(&p->cgr_lock);
 
if (opts) {
struct qm_mcc_initcgr local_opts = *opts;
@@ -2477,7 +2477,7 @@ int qman_create_cgr(struct qman_cgr *cgr, u32 flags,
qman_cgrs_get(&p->cgrs[1], cgr->cgrid))
cgr->cb(p, cgr, 1);
 out:
-   spin_unlock_irq(&p->cgr_lock);
+   raw_spin_unlock_irq(&p->cgr_lock);
put_affine_portal();
return ret;
 }
@@ -2512,7 +2512,7 @@ int qman_delete_cgr(struct qman_cgr *cgr)
return -EINVAL;
 
memset(&local_opts, 0, sizeof(struct qm_mcc_initcgr));
-   spin_lock_irqsave(&p->cgr_lock, irqflags);
+   raw_spin_lock_irqsave(&p->cgr_lock, irqflags);
list_del(&cgr->node);
/*
 * If there are no other CGR objects for this CGRID in the list,
@@ -2537,7 +2537,7 @@ int qman_delete_cgr(struct qman_cgr *cgr)
/* add back to the list */
list_add(&cgr->node, &p->cgr_cbs);
 release_lock:
-   spin_unlock_irqrestore(&p->cgr_lock, irqflags);
+   raw_spin_unlock_irqrestore(&p->cgr_lock, irqflags);
put_affine_portal();
return ret;
 }
@@ -2577,9 +2577,9 @@ static int qman_update_cgr(struct qman_cgr *cgr, struct 
qm_mcc_initcgr *opts)
if (!p)
return -EINVAL;
 
-   spin_lock_irqsave(&p->cgr_lock, irqflags);
+   raw_spin_lock_irqsave(&p->cgr_lock, irqflags);
ret = qm_modify_cgr(cgr, 0, opts);
-   spin_unlock_irqrestore(&p->cgr_lock, irqflags);
+   raw_spin_unlock_irqrestore(&p->cgr_lock, irqflags);
put_affine_portal();
return ret;
 }
-- 
2.35.1.1320.gc452695387.dirty
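
Why the raw variant matters: on PREEMPT_RT a spinlock_t is a sleeping rtmutex, so the pre-patch code could, in effect, attempt to sleep in a context that cannot (a sketch of the problematic path, not literal driver code):

```
/* PREEMPT_RT, before this patch:
 *
 * smp_call_function_single()           -> callback runs with IRQs off
 *   qman_update_cgr() [callback]
 *     spin_lock_irqsave(&p->cgr_lock)  -> spinlock_t == rtmutex on RT,
 *                                         so this may *sleep* ...
 *                                         ... with IRQs disabled: bug.
 *
 * raw_spinlock_t stays a true spinning lock on RT, so the callback
 * never sleeps in this context.
 */
```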



Re: Probing nvme disks fails on Upstream kernels on powerpc Maxconfig

2023-04-04 Thread Linux regression tracking (Thorsten Leemhuis)
[CCing the regression list, as it should be in the loop for regressions:
https://docs.kernel.org/admin-guide/reporting-regressions.html]

On 23.03.23 10:53, Srikar Dronamraju wrote:
> 
> I am unable to boot upstream kernels from v5.16 to the latest upstream
> kernel on a maxconfig system. (Machine config details given below)
> 
> At boot, we see a series of messages like the below.
> 
> dracut-initqueue[13917]: Warning: dracut-initqueue: timeout, still waiting 
> for following initqueue hooks:
> dracut-initqueue[13917]: Warning: 
> /lib/dracut/hooks/initqueue/finished/devexists-\x2fdev\x2fdisk\x2fby-uuid\x2f93dc0767-18aa-467f-afa7-5b4e9c13108a.sh:
>  "if ! grep -q After=remote-fs-pre.target 
> /run/systemd/generator/systemd-cryptsetup@*.service 2>/dev/null; then
> dracut-initqueue[13917]: [ -e 
> "/dev/disk/by-uuid/93dc0767-18aa-467f-afa7-5b4e9c13108a" ]
> dracut-initqueue[13917]: fi"

Alexey, did you look into this? This is apparently caused by a commit of
yours (see quoted part below) that Michael applied. Looks like it fell
through the cracks from here, but maybe I'm missing something.

Anyway, for the rest of this mail:

[TLDR: I'm adding this report to the list of tracked Linux kernel
regressions; the text you find below is based on a few templates
paragraphs you might have encountered already in similar form.
See link in footer if these mails annoy you.]

Thanks for the report. To be sure the issue doesn't fall through the
cracks unnoticed, I'm adding it to regzbot, the Linux kernel regression
tracking bot:

#regzbot ^introduced 387273118714
#regzbot title powerpc/pseries/dma: Probing nvme disks fails on powerpc
Maxconfig
#regzbot ignore-activity

This isn't a regression? This issue or a fix for it are already
discussed somewhere else? It was fixed already? You want to clarify when
the regression started to happen? Or point out I got the title or
something else totally wrong? Then just reply and tell me -- ideally
while also telling regzbot about it, as explained by the page listed in
the footer of this mail.

Developers: When fixing the issue, remember to add 'Link:' tags pointing
to the report (the parent of this mail). See page linked in footer for
details.

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
That page also explains what to do if mails like this annoy you.

> journalctl shows the below warning.
> 
>  WARNING: CPU: 242 PID: 1219 at 
> /home/srikar/work/linux.git/arch/powerpc/kernel/iommu.c:227 
> iommu_range_alloc+0x3d4/0x450
>  Modules linked in: lpfc(E+) nvmet_fc(E) nvmet(E) configfs(E) qla2xxx(E+) 
> nvme_fc(E) nvme_fabrics(E) vmx_crypto(E) gf128mul(E) xhci_pci(E) 
> xhci_pci_renesas(E) xhci_hcd(E) ipr(E+) nvme(E) usbcore(E) libata(E) 
> nvme_core(E) t10_pi(E) scsi_transport_fc(E) usb_common(E) btrfs(E) 
> blake2b_generic(E) libcrc32c(E) crc32c_vpmsum(E) xor(E) raid6_pq(E) sg(E) 
> dm_multipath(E) dm_mod(E) scsi_dh_rdac(E) scsi_dh_emc(E) scsi_dh_alua(E) 
> scsi_mod(E) scsi_common(E)
>  CPU: 242 PID: 1219 Comm: kworker/u3843:0 Tainted: GW   EL
> 5.15.0-sp4+ #33 91e1c36ffe385108bbe4a3834506a047dc78552d
>  Workqueue: nvme-reset-wq nvme_reset_work [nvme]
>  NIP:  c005a134 LR: c005a128 CTR: 
>  REGS: c7fd4c7eb580 TRAP: 0700   Tainted: GW   EL 
> (5.15.0-sp4+)
>  MSR:  80029033   CR: 24002424  XER: 
>  CFAR: c020972c IRQMASK: 0
>  GPR00: c005a128 c7fd4c7eb820 c2aa4b00 0001
>  GPR04: c273d648 0003 0bfbcb21 c2d88390
>  GPR08:   00f2 c2b05240
>  GPR12: 2000 cbfbdfffcb00  c7fd4c9d1c40
>  GPR16:    
>  GPR20:   c2bab580 
>  GPR24: c73b30c8   
>  GPR28: c7fd7133  0001 0001
>  NIP [c005a134] iommu_range_alloc+0x3d4/0x450
>  LR [c005a128] iommu_range_alloc+0x3c8/0x450
>  Call Trace:
>  [c7fd4c7eb820] [c005a128] iommu_range_alloc+0x3c8/0x450 
> (unreliable)
>  [c7fd4c7eb8e0] [c005a580] iommu_alloc+0x60/0x170
>  [c7fd4c7eb930] [c005bd4c] iommu_alloc_coherent+0x11c/0x1d0
>  [c7fd4c7eb9d0] [c00597e8] dma_iommu_alloc_coherent+0x38/0x50
>  [c7fd4c7eb9f0] [c0249ce8] dma_alloc_attrs+0x128/0x180
>  [c7fd4c7eba60] [c0080001093210d8] nvme_alloc_queue+0x90/0x2b0 [nvme]
>  [c7fd4c7ebac0] [c008000109326034] nvme_reset_work+0x44c/0x1870 [nvme]
>  [c7fd4c7ebc30] [c01870b8] process_one_work+0x388/0x730
>  [c7fd4c7ebd10] [c01874d8] worker_thread+0x78/0x5b0
>  [c7fd4c7ebda0] [c01945cc] 

Re: [PATCH 3/3] mm/mmu_gather: send tlb_remove_table_smp_sync IPI only to CPUs in kernel mode

2023-04-04 Thread David Hildenbrand

On 04.04.23 15:42, Yair Podemsky wrote:

The tlb_remove_table_smp_sync IPI is used to ensure the outdated tlb page
is not currently being accessed and can be cleared.
This occurs once all CPUs have left the lockless gup code section.
If they reenter the page table walk, the pointers will be to the new
pages.
Therefore the IPI is only needed for CPUs in kernel mode.
By preventing the IPI from being sent to CPUs not in kernel mode,
Latencies are reduced.

Race conditions considerations:
The context state check is vulnerable to race conditions between the
moment the context state is read to when the IPI is sent (or not).

Here are these scenarios.
case 1:
CPU-A CPU-B

   state == CONTEXT_KERNEL
int state = atomic_read(&ct->state);
   Kernel-exit:
   state == CONTEXT_USER
if (state & CT_STATE_MASK == CONTEXT_KERNEL)

In this case, the IPI will be sent to CPU-B despite it no longer being in
the kernel. The consequence would be an unnecessary IPI being
handled by CPU-B, adding latency.
This would have been the case every time without this patch.

case 2:
CPU-A CPU-B

modify pagetables
tlb_flush (memory barrier)
   state == CONTEXT_USER
int state = atomic_read(&ct->state);
   Kernel-enter:
   state == CONTEXT_KERNEL
   READ(pagetable values)
if (state & CT_STATE_MASK == CONTEXT_USER)

In this case, the IPI will not be sent to CPU-B despite it returning to
the kernel and even reading the pagetable.
However, since CPU-B entered the page table walk after the
modification, it is reading the new, safe values.

The only case when this IPI is truly necessary is when CPU-B has entered
the lockless gup code section before the pagetable modifications and
has yet to exit them, in which case it is still in the kernel.

Signed-off-by: Yair Podemsky 
---
  mm/mmu_gather.c | 19 +--
  1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index 5ea9be6fb87c..731d955e152d 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -9,6 +9,7 @@
  #include 
  #include 
  #include 
+#include 
  
  #include 

  #include 
@@ -191,6 +192,20 @@ static void tlb_remove_table_smp_sync(void *arg)
/* Simply deliver the interrupt */
  }
  
+

+#ifdef CONFIG_CONTEXT_TRACKING
+static bool cpu_in_kernel(int cpu, void *info)
+{
+   struct context_tracking *ct = per_cpu_ptr(&context_tracking, cpu);
+   int state = atomic_read(&ct->state);
+   /* will return true only for cpus in kernel space */
+   return state & CT_STATE_MASK == CONTEXT_KERNEL;
+}
+#define CONTEXT_PREDICATE cpu_in_kernel
+#else
+#define CONTEXT_PREDICATE NULL
+#endif /* CONFIG_CONTEXT_TRACKING */
+
  #ifdef CONFIG_ARCH_HAS_CPUMASK_BITS
  #define REMOVE_TABLE_IPI_MASK mm_cpumask(mm)
  #else
@@ -206,8 +221,8 @@ void tlb_remove_table_sync_one(struct mm_struct *mm)
 * It is however sufficient for software page-table walkers that rely on
 * IRQ disabling.
 */
-   on_each_cpu_mask(REMOVE_TABLE_IPI_MASK, tlb_remove_table_smp_sync,
-   NULL, true);
+   on_each_cpu_cond_mask(CONTEXT_PREDICATE, tlb_remove_table_smp_sync,
+   NULL, true, REMOVE_TABLE_IPI_MASK);
  }
  
  static void tlb_remove_table_rcu(struct rcu_head *head)



Maybe a bit cleaner by avoiding CONTEXT_PREDICATE, still not completely nice
(an empty dummy function "cpu_maybe_in_kernel" might be cleanest but would
be slightly slower for !CONFIG_CONTEXT_TRACKING):

#ifdef CONFIG_CONTEXT_TRACKING
static bool cpu_in_kernel(int cpu, void *info)
{
struct context_tracking *ct = per_cpu_ptr(&context_tracking, cpu);
int state = atomic_read(&ct->state);
/* will return true only for cpus in kernel space */
return state & CT_STATE_MASK == CONTEXT_KERNEL;
}
#endif /* CONFIG_CONTEXT_TRACKING */


...
#ifdef CONFIG_CONTEXT_TRACKING
on_each_cpu_cond_mask(cpu_in_kernel, tlb_remove_table_smp_sync,
  NULL, true, REMOVE_TABLE_IPI_MASK);
#else /* CONFIG_CONTEXT_TRACKING */
on_each_cpu_mask(REMOVE_TABLE_IPI_MASK, tlb_remove_table_smp_sync,
 NULL, true);
#endif /* CONFIG_CONTEXT_TRACKING */


--
Thanks,

David / dhildenb



Re: [PATCH 1/3] arch: Introduce ARCH_HAS_CPUMASK_BITS

2023-04-04 Thread David Hildenbrand

On 04.04.23 15:42, Yair Podemsky wrote:

Some architectures set and maintain the mm_cpumask bits when loading
or removing process from cpu.
This Kconfig will mark those to allow different behavior between
kernels that maintain the mm_cpumask and those that do not.



I was wondering if we should do something along the lines of:

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 0722859c3647..1f5c15d8e8ed 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -767,11 +767,13 @@ struct mm_struct {
 #endif /* CONFIG_LRU_GEN */
} __randomize_layout;

+#ifdef CONFIG_MM_CPUMASK
/*
 * The mm_cpumask needs to be at the end of mm_struct, because it
 * is dynamically sized based on nr_cpu_ids.
 */
unsigned long cpu_bitmap[];
+#endif
 };

But that would, of course, require additional changes to make it 
compile. What concerns me a bit is that we have in mm/rmap.c a 
mm_cpumask() usage. But it's glued to 
CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH ... shaky.


At least if we would properly fence it, there would be no
accidental abuse anymore.



Signed-off-by: Yair Podemsky 
---
  arch/Kconfig | 8 
  arch/arm/Kconfig | 1 +
  arch/powerpc/Kconfig | 1 +
  arch/s390/Kconfig| 1 +
  arch/sparc/Kconfig   | 1 +
  arch/x86/Kconfig | 1 +


As Valentin says, there are other architectures that do the same.

--
Thanks,

David / dhildenb



Re: [PATCH 01/10] locking/atomic: Add missing cast to try_cmpxchg() fallbacks

2023-04-04 Thread Uros Bizjak
On Tue, Apr 4, 2023 at 3:19 PM Mark Rutland  wrote:
>
> On Tue, Apr 04, 2023 at 02:24:38PM +0200, Uros Bizjak wrote:
> > On Mon, Apr 3, 2023 at 12:19 PM Mark Rutland  wrote:
> > >
> > > On Sun, Mar 26, 2023 at 09:28:38PM +0200, Uros Bizjak wrote:
> > > > On Fri, Mar 24, 2023 at 5:33 PM Mark Rutland  
> > > > wrote:
> > > > >
> > > > > On Fri, Mar 24, 2023 at 04:14:22PM +, Mark Rutland wrote:
> > > > > > On Fri, Mar 24, 2023 at 04:43:32PM +0100, Uros Bizjak wrote:
> > > > > > > On Fri, Mar 24, 2023 at 3:13 PM Mark Rutland 
> > > > > > >  wrote:
> > > > > > > >
> > > > > > > > On Sun, Mar 05, 2023 at 09:56:19PM +0100, Uros Bizjak wrote:
> > > > > > > > > Cast _oldp to the type of _ptr to avoid 
> > > > > > > > > incompatible-pointer-types warning.
> > > > > > > >
> > > > > > > > Can you give an example of where we are passing an incompatible 
> > > > > > > > pointer?
> > > > > > >
> > > > > > > An example is patch 10/10 from the series, which will fail without
> > > > > > > this fix when fallback code is used. We have:
> > > > > > >
> > > > > > > -   } while (local_cmpxchg(&rb->head, offset, head) != 
> > > > > > > offset);
> > > > > > > +   } while (!local_try_cmpxchg(&rb->head, &offset, head));
> > > > > > >
> > > > > > > where rb->head is defined as:
> > > > > > >
> > > > > > > typedef struct {
> > > > > > >atomic_long_t a;
> > > > > > > } local_t;
> > > > > > >
> > > > > > > while offset is defined as 'unsigned long'.
> > > > > >
> > > > > > Ok, but that's because we're doing the wrong thing to start with.
> > > > > >
> > > > > > Since local_t is defined in terms of atomic_long_t, we should 
> > > > > > define the
> > > > > > generic local_try_cmpxchg() in terms of atomic_long_try_cmpxchg(). 
> > > > > > We'll still
> > > > > > have a mismatch between 'long *' and 'unsigned long *', but then we 
> > > > > > can fix
> > > > > > that in the callsite:
> > > > > >
> > > > > >   while (!local_try_cmpxchg(&rb->head, &(long *)offset, head))
> > > > >
> > > > > Sorry, that should be:
> > > > >
> > > > > while (!local_try_cmpxchg(&rb->head, (long *)&offset, head))
> > > >
> > > > The fallbacks are a bit more complicated than above, and are different
> > > > from atomic_try_cmpxchg.
> > > >
> > > > Please note in patch 2/10, the fallbacks when arch_try_cmpxchg_local
> > > > are not defined call arch_cmpxchg_local. Also in patch 2/10,
> > > > try_cmpxchg_local is introduced, where it calls
> > > > arch_try_cmpxchg_local. Targets (and generic code) simply define (e.g.
> > > > :
> > > >
> > > > #define local_cmpxchg(l, o, n) \
> > > >(cmpxchg_local(&((l)->a.counter), (o), (n)))
> > > > +#define local_try_cmpxchg(l, po, n) \
> > > > +   (try_cmpxchg_local(&((l)->a.counter), (po), (n)))
> > > >
> > > > which is part of the local_t API. Targets should either define all
> > > > these #defines, or none. There are no partial fallbacks as is the case
> > > > with atomic_t.
> > >
> > > Whether or not there are fallbacks is immaterial.
> > >
> > > In those cases, architectures can just as easily write C wrappers, e.g.
> > >
> > > long local_cmpxchg(local_t *l, long old, long new)
> > > {
> > > return cmpxchg_local(&l->a.counter, old, new);
> > > }
> > >
> > > long local_try_cmpxchg(local_t *l, long *old, long new)
> > > {
> > > return try_cmpxchg_local(&l->a.counter, old, new);
> > > }
> >
> > Please find attached the complete prototype patch that implements the
> > above suggestion.
> >
> > The patch includes:
> > - implementation of instrumented try_cmpxchg{,64}_local definitions
> > - corresponding arch_try_cmpxchg{,64}_local fallback definitions
> > - generic local{,64}_try_cmpxchg (and local{,64}_cmpxchg) C wrappers
> >
> > - x86 specific local_try_cmpxchg (and local_cmpxchg) C wrappers
> > - x86 specific arch_try_cmpxchg_local definition
> >
> > - kernel/events/ring_buffer.c change to test local_try_cmpxchg
> > implementation and illustrate the transition
> > - arch/x86/events/core.c change to test local64_try_cmpxchg
> > implementation and illustrate the transition
> >
> > The definition of atomic_long_t is different for 64-bit and 32-bit
> > targets (s64 vs int), so target specific C wrappers have to use
> > different casts to account for this difference.
> >
> > Uros.
>
> Thanks for this!
>
> FWIW, the patch (inline below) looks good to me.

Thanks, I will prepare a patch series for submission later today.

Uros.


Re: [PATCH 01/10] locking/atomic: Add missing cast to try_cmpxchg() fallbacks

2023-04-04 Thread Mark Rutland
On Tue, Apr 04, 2023 at 02:24:38PM +0200, Uros Bizjak wrote:
> On Mon, Apr 3, 2023 at 12:19 PM Mark Rutland  wrote:
> >
> > On Sun, Mar 26, 2023 at 09:28:38PM +0200, Uros Bizjak wrote:
> > > On Fri, Mar 24, 2023 at 5:33 PM Mark Rutland  wrote:
> > > >
> > > > On Fri, Mar 24, 2023 at 04:14:22PM +, Mark Rutland wrote:
> > > > > On Fri, Mar 24, 2023 at 04:43:32PM +0100, Uros Bizjak wrote:
> > > > > > On Fri, Mar 24, 2023 at 3:13 PM Mark Rutland  
> > > > > > wrote:
> > > > > > >
> > > > > > > On Sun, Mar 05, 2023 at 09:56:19PM +0100, Uros Bizjak wrote:
> > > > > > > > Cast _oldp to the type of _ptr to avoid 
> > > > > > > > incompatible-pointer-types warning.
> > > > > > >
> > > > > > > Can you give an example of where we are passing an incompatible 
> > > > > > > pointer?
> > > > > >
> > > > > > An example is patch 10/10 from the series, which will fail without
> > > > > > this fix when fallback code is used. We have:
> > > > > >
> > > > > > -   } while (local_cmpxchg(&rb->head, offset, head) != offset);
> > > > > > +   } while (!local_try_cmpxchg(&rb->head, &offset, head));
> > > > > >
> > > > > > where rb->head is defined as:
> > > > > >
> > > > > > typedef struct {
> > > > > >atomic_long_t a;
> > > > > > } local_t;
> > > > > >
> > > > > > while offset is defined as 'unsigned long'.
> > > > >
> > > > > Ok, but that's because we're doing the wrong thing to start with.
> > > > >
> > > > > Since local_t is defined in terms of atomic_long_t, we should define 
> > > > > the
> > > > > generic local_try_cmpxchg() in terms of atomic_long_try_cmpxchg(). 
> > > > > We'll still
> > > > > have a mismatch between 'long *' and 'unsigned long *', but then we 
> > > > > can fix
> > > > > that in the callsite:
> > > > >
> > > > >   while (!local_try_cmpxchg(&rb->head, &(long *)offset, head))
> > > >
> > > > Sorry, that should be:
> > > >
> > > > while (!local_try_cmpxchg(&rb->head, (long *)&offset, head))
> > >
> > > The fallbacks are a bit more complicated than above, and are different
> > > from atomic_try_cmpxchg.
> > >
> > > Please note in patch 2/10, the fallbacks when arch_try_cmpxchg_local
> > > are not defined call arch_cmpxchg_local. Also in patch 2/10,
> > > try_cmpxchg_local is introduced, where it calls
> > > arch_try_cmpxchg_local. Targets (and generic code) simply define (e.g.
> > > :
> > >
> > > #define local_cmpxchg(l, o, n) \
> > >(cmpxchg_local(&((l)->a.counter), (o), (n)))
> > > +#define local_try_cmpxchg(l, po, n) \
> > > +   (try_cmpxchg_local(&((l)->a.counter), (po), (n)))
> > >
> > > which is part of the local_t API. Targets should either define all
> > > these #defines, or none. There are no partial fallbacks as is the case
> > > with atomic_t.
> >
> > Whether or not there are fallbacks is immaterial.
> >
> > In those cases, architectures can just as easily write C wrappers, e.g.
> >
> > long local_cmpxchg(local_t *l, long old, long new)
> > {
> > return cmpxchg_local(&l->a.counter, old, new);
> > }
> >
> > long local_try_cmpxchg(local_t *l, long *old, long new)
> > {
> > return try_cmpxchg_local(&l->a.counter, old, new);
> > }
> 
> Please find attached the complete prototype patch that implements the
> above suggestion.
> 
> The patch includes:
> - implementation of instrumented try_cmpxchg{,64}_local definitions
> - corresponding arch_try_cmpxchg{,64}_local fallback definitions
> - generic local{,64}_try_cmpxchg (and local{,64}_cmpxchg) C wrappers
> 
> - x86 specific local_try_cmpxchg (and local_cmpxchg) C wrappers
> - x86 specific arch_try_cmpxchg_local definition
> 
> - kernel/events/ring_buffer.c change to test local_try_cmpxchg
> implementation and illustrate the transition
> - arch/x86/events/core.c change to test local64_try_cmpxchg
> implementation and illustrate the transition
> 
> The definition of atomic_long_t is different for 64-bit and 32-bit
> targets (s64 vs int), so target specific C wrappers have to use
> different casts to account for this difference.
> 
> Uros.

Thanks for this!

FWIW, the patch (inline below) looks good to me.

Mark.

> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
> index d096b04bf80e..d9310e9363f1 100644
> --- a/arch/x86/events/core.c
> +++ b/arch/x86/events/core.c
> @@ -129,13 +129,12 @@ u64 x86_perf_event_update(struct perf_event *event)
>* exchange a new raw count - then add that new-prev delta
>* count to the generic event atomically:
>*/
> -again:
>   prev_raw_count = local64_read(&hwc->prev_count);
> - rdpmcl(hwc->event_base_rdpmc, new_raw_count);
>  
> - if (local64_cmpxchg(&hwc->prev_count, prev_raw_count,
> - new_raw_count) != prev_raw_count)
> - goto again;
> + do {
> + rdpmcl(hwc->event_base_rdpmc, new_raw_count);
> + } while (!local64_try_cmpxchg(&hwc->prev_count, &prev_raw_count,
> +   new_raw_count));
>  
>   /*
>* Now we have the new raw value and have updated the prev
> diff --git 

Re: [PATCH 01/10] locking/atomic: Add missing cast to try_cmpxchg() fallbacks

2023-04-04 Thread Uros Bizjak
On Mon, Apr 3, 2023 at 12:19 PM Mark Rutland  wrote:
>
> On Sun, Mar 26, 2023 at 09:28:38PM +0200, Uros Bizjak wrote:
> > On Fri, Mar 24, 2023 at 5:33 PM Mark Rutland  wrote:
> > >
> > > On Fri, Mar 24, 2023 at 04:14:22PM +, Mark Rutland wrote:
> > > > On Fri, Mar 24, 2023 at 04:43:32PM +0100, Uros Bizjak wrote:
> > > > > On Fri, Mar 24, 2023 at 3:13 PM Mark Rutland  
> > > > > wrote:
> > > > > >
> > > > > > On Sun, Mar 05, 2023 at 09:56:19PM +0100, Uros Bizjak wrote:
> > > > > > > Cast _oldp to the type of _ptr to avoid 
> > > > > > > incompatible-pointer-types warning.
> > > > > >
> > > > > > Can you give an example of where we are passing an incompatible 
> > > > > > pointer?
> > > > >
> > > > > An example is patch 10/10 from the series, which will fail without
> > > > > this fix when fallback code is used. We have:
> > > > >
> > > > > -   } while (local_cmpxchg(&rb->head, offset, head) != offset);
> > > > > +   } while (!local_try_cmpxchg(&rb->head, &offset, head));
> > > > >
> > > > > where rb->head is defined as:
> > > > >
> > > > > typedef struct {
> > > > >atomic_long_t a;
> > > > > } local_t;
> > > > >
> > > > > while offset is defined as 'unsigned long'.
> > > >
> > > > Ok, but that's because we're doing the wrong thing to start with.
> > > >
> > > > Since local_t is defined in terms of atomic_long_t, we should define the
> > > > generic local_try_cmpxchg() in terms of atomic_long_try_cmpxchg(). 
> > > > We'll still
> > > > have a mismatch between 'long *' and 'unsigned long *', but then we can 
> > > > fix
> > > > that in the callsite:
> > > >
> > > >   while (!local_try_cmpxchg(&rb->head, &(long *)offset, head))
> > >
> > > Sorry, that should be:
> > >
> > > while (!local_try_cmpxchg(&rb->head, (long *)&offset, head))
> >
> > The fallbacks are a bit more complicated than above, and are different
> > from atomic_try_cmpxchg.
> >
> > Please note in patch 2/10, the falbacks when arch_try_cmpxchg_local
> > are not defined call arch_cmpxchg_local. Also in patch 2/10,
> > try_cmpxchg_local is introduced, where it calls
> > arch_try_cmpxchg_local. Targets (and generic code) simply define, e.g.:
> >
> > #define local_cmpxchg(l, o, n) \
> >(cmpxchg_local(&((l)->a.counter), (o), (n)))
> > +#define local_try_cmpxchg(l, po, n) \
> > +   (try_cmpxchg_local(&((l)->a.counter), (po), (n)))
> >
> > which is part of the local_t API. Targets should either define all
> > these #defines, or none. There are no partial fallbacks as is the case
> > with atomic_t.
>
> Whether or not there are fallbacks is immaterial.
>
> In those cases, architectures can just as easily write C wrappers, e.g.
>
> long local_cmpxchg(local_t *l, long old, long new)
> {
> return cmpxchg_local(&l->a.counter, old, new);
> }
>
> long local_try_cmpxchg(local_t *l, long *old, long new)
> {
> return try_cmpxchg_local(&l->a.counter, old, new);
> }

Please find attached the complete prototype patch that implements the
above suggestion.

The patch includes:
- implementation of instrumented try_cmpxchg{,64}_local definitions
- corresponding arch_try_cmpxchg{,64}_local fallback definitions
- generic local{,64}_try_cmpxchg (and local{,64}_cmpxchg) C wrappers

- x86 specific local_try_cmpxchg (and local_cmpxchg) C wrappers
- x86 specific arch_try_cmpxchg_local definition

- kernel/events/ring_buffer.c change to test local_try_cmpxchg
implementation and illustrate the transition
- arch/x86/events/core.c change to test local64_try_cmpxchg
implementation and illustrate the transition

The definition of atomic_long_t is different for 64-bit and 32-bit
targets (s64 vs int), so target specific C wrappers have to use
different casts to account for this difference.

Uros.
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index d096b04bf80e..d9310e9363f1 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -129,13 +129,12 @@ u64 x86_perf_event_update(struct perf_event *event)
 * exchange a new raw count - then add that new-prev delta
 * count to the generic event atomically:
 */
-again:
prev_raw_count = local64_read(&hwc->prev_count);
-   rdpmcl(hwc->event_base_rdpmc, new_raw_count);
 
-   if (local64_cmpxchg(&hwc->prev_count, prev_raw_count,
-   new_raw_count) != prev_raw_count)
-   goto again;
+   do {
+   rdpmcl(hwc->event_base_rdpmc, new_raw_count);
+   } while (!local64_try_cmpxchg(&hwc->prev_count, &prev_raw_count,
+ new_raw_count));
 
/*
 * Now we have the new raw value and have updated the prev
diff --git a/arch/x86/include/asm/cmpxchg.h b/arch/x86/include/asm/cmpxchg.h
index 94fbe6ae7431..540573f515b7 100644
--- a/arch/x86/include/asm/cmpxchg.h
+++ b/arch/x86/include/asm/cmpxchg.h
@@ -221,9 +221,15 @@ extern void __add_wrong_size(void)
 #define __try_cmpxchg(ptr, pold, new, size)\
__raw_try_cmpxchg((ptr), (pold), (new), (size), 

Re: [PATCH v3 02/14] arm64: drop ranges in definition of ARCH_FORCE_MAX_ORDER

2023-04-04 Thread Justin Forbes
On Tue, Apr 4, 2023 at 2:22 AM Mike Rapoport  wrote:
>
> On Wed, Mar 29, 2023 at 10:55:37AM -0500, Justin Forbes wrote:
> > On Sat, Mar 25, 2023 at 1:09 AM Mike Rapoport  wrote:
> > >
> > > From: "Mike Rapoport (IBM)" 
> > >
> > > It is not a good idea to change fundamental parameters of core memory
> > > management. Having predefined ranges suggests that the values within
> > > those ranges are sensible, but one has to *really* understand
> > > implications of changing MAX_ORDER before actually amending it and
> > > ranges don't help here.
> > >
> > > Drop ranges in definition of ARCH_FORCE_MAX_ORDER and make its prompt
> > > visible only if EXPERT=y
> >
> > I do not like suddenly hiding this behind EXPERT for a couple of
> > reasons.  Most importantly, it will silently change the config for
> > users building with an old kernel config.  If a user has for instance
> > "13" set and building with 4K pages, as is the current configuration
> > for Fedora and RHEL aarch64 builds, an oldconfig build will now set it
> > to 10 with no indication that it is doing so.  And while I think that
> > 10 is a fine default for many aarch64 users, there are valid reasons
> > for choosing other values. Putting this behind expert makes it much
> > less obvious that this is an option.
>
> That's the idea of EXPERT, no?
>
> This option was intended to allow allocation of huge pages for
> architectures that had PMD_ORDER > MAX_ORDER and not to allow user to
> select size of maximal physically contiguous allocation.
>
> Changes to MAX_ORDER fundamentally change the behaviour of core mm and
> unless users *really* know what they are doing there is no reason to choose
> non-default values so hiding this option behind EXPERT seems totally
> appropriate to me.

It sounds nice in theory. In practice, EXPERT hides too much. When you
flip EXPERT, you expose some 175 new config options which are hidden
behind it. You then have to know what you are doing not just with
MAX_ORDER, but with a whole bunch more as well. If everyone were
already running 10, this might be less of a problem. At least Fedora
and RHEL are running 13 for 4K pages on aarch64. This was not some
accidental choice; we had to carry a patch to even allow it for a
while. If this does go in as is, we will likely just carry a patch to
remove the "if EXPERT", but that is a bit of a disservice to users who
might be trying to debug something else upstream, bisecting upstream
kernels or testing a patch.  In those cases, people tend to use
pristine upstream sources without distro patches to verify, and they
tend to use their existing configs. With this change, their MAX_ORDER
will drop from 13 to 10 silently. That can look enough like a different
issue to ruin a bisect, or have them give bad feedback on a patch
because it introduces a "regression" which is not a regression at all,
but a config change they couldn't see.

>
> > Justin
> >
> > > Acked-by: Kirill A. Shutemov 
> > > Reviewed-by: Zi Yan 
> > > Signed-off-by: Mike Rapoport (IBM) 
> > > ---
> > >  arch/arm64/Kconfig | 4 +---
> > >  1 file changed, 1 insertion(+), 3 deletions(-)
> > >
> > > diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> > > index e60baf7859d1..7324032af859 100644
> > > --- a/arch/arm64/Kconfig
> > > +++ b/arch/arm64/Kconfig
> > > @@ -1487,11 +1487,9 @@ config XEN
> > >  # 16K |   27  |  14  |   13| 11  |
> > >  # 64K |   29  |  16  |   13| 13  |
> > >  config ARCH_FORCE_MAX_ORDER
> > > -   int "Maximum zone order" if ARM64_4K_PAGES || ARM64_16K_PAGES
> > > +   int "Maximum zone order" if EXPERT && (ARM64_4K_PAGES || ARM64_16K_PAGES)
> > > default "13" if ARM64_64K_PAGES
> > > -   range 11 13 if ARM64_16K_PAGES
> > > default "11" if ARM64_16K_PAGES
> > > -   range 10 15 if ARM64_4K_PAGES
> > > default "10"
> > > help
> > >   The kernel memory allocator divides physically contiguous memory
> > > --
> > > 2.35.1
> > >
> > >
>
> --
> Sincerely yours,
> Mike.
>
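For reference, the Kconfig semantics at issue in this thread: `if EXPERT` on a prompt hides only the question, not the symbol. A minimal sketch (illustrative, not the actual arm64 entry):

```kconfig
config ARCH_FORCE_MAX_ORDER
	int "Maximum zone order" if EXPERT
	default "10"
```

With EXPERT disabled, the symbol is no longer user-settable, so `make olddefconfig` discards any previously stored value and applies the default — which is the silent 13-to-10 change Justin describes.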


[PATCH] powerpc/64: Always build with 128-bit long double

2023-04-04 Thread Michael Ellerman
The amdgpu driver builds some of its code with hard-float enabled,
whereas the rest of the kernel is built with soft-float.

When building with 64-bit long double, if soft-float and hard-float
objects are linked together, the build fails due to incompatible ABI
tags.

In the past there have been build errors in the amdgpu driver caused by
this, some of those were due to bad intermingling of soft & hard-float
code, but those issues have now all been fixed since commit c92b7fe0d92a
("drm/amd/display: move remaining FPU code to dml folder").

However it's still possible for soft & hard-float objects to end up
linked together, if the amdgpu driver is built-in to the kernel along
with the test_emulate_step.c code, which uses soft-float. That happens
in an allyesconfig build.

Currently those build errors are avoided because the amdgpu driver is
gated on 128-bit long double being enabled. But that's not a detail the
amdgpu driver should need to be aware of, and if another driver starts
using hard-float the same problem would occur.

All versions of the 64-bit ABI specify that long-double is 128-bits.
However some compilers, notably the kernel.org ones, are built to use
64-bit long double by default.

Apart from this issue of soft vs hard-float, the kernel doesn't care
what size long double is. In particular the kernel using 128-bit long
double doesn't impact userspace's ability to use 64-bit long double, as
musl does.

So always build the 64-bit kernel with 128-bit long double. That should
avoid any build errors due to the incompatible ABI tags. Excluding the
code that uses soft/hard-float, the vmlinux is identical with/without
the flag.

It does mean any code which is incorrectly intermingling soft &
hard-float code will build without error, so those bugs will need to be
caught by testing rather than at build time.

For more background see:
  - commit d11219ad53dc ("amdgpu: disable powerpc support for the newer display engine")
  - commit c653c591789b ("drm/amdgpu: Re-enable DCN for 64-bit powerpc")
  - https://lore.kernel.org/r/dab9cbd8-2626-4b99-8098-31fe76397...@app.fastmail.com

Signed-off-by: Michael Ellerman 
---
 arch/powerpc/Kconfig| 4 
 arch/powerpc/Makefile   | 1 +
 drivers/gpu/drm/amd/display/Kconfig | 2 +-
 3 files changed, 2 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index fc4e81dafca7..3fb2c2766139 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -291,10 +291,6 @@ config PPC
# Please keep this list sorted alphabetically.
#
 
-config PPC_LONG_DOUBLE_128
-   depends on PPC64 && ALTIVEC
-   def_bool $(success,test "$(shell,echo __LONG_DOUBLE_128__ | $(CC) -E -P -)" = 1)
-
 config PPC_BARRIER_NOSPEC
bool
default y
diff --git a/arch/powerpc/Makefile b/arch/powerpc/Makefile
index 12447b2361e4..4343cca57cb3 100644
--- a/arch/powerpc/Makefile
+++ b/arch/powerpc/Makefile
@@ -133,6 +133,7 @@ endif
 endif
CFLAGS-$(CONFIG_PPC64) += $(call cc-option,-mcmodel=medium,$(call cc-option,-mminimal-toc))
 CFLAGS-$(CONFIG_PPC64) += $(call cc-option,-mno-pointers-to-nested-functions)
+CFLAGS-$(CONFIG_PPC64) += $(call cc-option,-mlong-double-128)
 
 # Clang unconditionally reserves r2 on ppc32 and does not support the flag
 # https://bugs.llvm.org/show_bug.cgi?id=39555
diff --git a/drivers/gpu/drm/amd/display/Kconfig b/drivers/gpu/drm/amd/display/Kconfig
index 0c9bd0a53e60..e36261d546af 100644
--- a/drivers/gpu/drm/amd/display/Kconfig
+++ b/drivers/gpu/drm/amd/display/Kconfig
@@ -8,7 +8,7 @@ config DRM_AMD_DC
depends on BROKEN || !CC_IS_CLANG || X86_64 || SPARC64 || ARM64
select SND_HDA_COMPONENT if SND_HDA_CORE
# !CC_IS_CLANG: https://github.com/ClangBuiltLinux/linux/issues/1752
-   select DRM_AMD_DC_DCN if (X86 || PPC_LONG_DOUBLE_128 || (ARM64 && KERNEL_MODE_NEON && !CC_IS_CLANG))
+   select DRM_AMD_DC_DCN if (X86 || (PPC64 && ALTIVEC) || (ARM64 && KERNEL_MODE_NEON && !CC_IS_CLANG))
help
  Choose this option if you want to use the new display engine
  support for AMDGPU. This adds required support for Vega and
-- 
2.39.2



RE: [PATCH v2 2/2] soc: fsl: qbman: Use raw spinlock for cgr_lock

2023-04-04 Thread Camelia Alexandra Groza
> -Original Message-
> From: Sean Anderson 
> Sent: Friday, March 31, 2023 18:14
> To: Leo Li ; linuxppc-dev@lists.ozlabs.org; linux-arm-
> ker...@lists.infradead.org
> Cc: Scott Wood ; Camelia Alexandra Groza
> ; linux-ker...@vger.kernel.org; Roy Pledge
> ; David S . Miller ; Claudiu
> Manoil ; Vladimir Oltean
> ; Sean Anderson 
> Subject: [PATCH v2 2/2] soc: fsl: qbman: Use raw spinlock for cgr_lock
> 
> cgr_lock may be locked with interrupts already disabled by
> smp_call_function_single. As such, we must use a raw spinlock to avoid
> problems on PREEMPT_RT kernels. Although this bug has existed for a
> while, it was not apparent until commit ef2a8d5478b9 ("net: dpaa: Adjust
> queue depth on rate change") which invokes smp_call_function_single via
> qman_update_cgr_safe every time a link goes up or down.
> 
> Fixes: c535e923bb97 ("soc/fsl: Introduce DPAA 1.x QMan device driver")
> Reported-by: Vladimir Oltean 
> Link: https://lore.kernel.org/all/20230323153935.nofnjucqjqnz34ej@skbuf/
> Signed-off-by: Sean Anderson 

Reviewed-by: Camelia Groza 


RE: [PATCH v2 1/2] soc: fsl: qbman: Always disable interrupts when taking cgr_lock

2023-04-04 Thread Camelia Alexandra Groza
> -Original Message-
> From: Sean Anderson 
> Sent: Monday, April 3, 2023 18:22
> To: Vladimir Oltean 
> Cc: Leo Li ; linuxppc-dev@lists.ozlabs.org; linux-arm-
> ker...@lists.infradead.org; Scott Wood ; Camelia
> Alexandra Groza ; linux-ker...@vger.kernel.org;
> Roy Pledge ; David S . Miller
> ; Claudiu Manoil 
> Subject: Re: [PATCH v2 1/2] soc: fsl: qbman: Always disable interrupts when
> taking cgr_lock
> 
> On 4/3/23 10:02, Vladimir Oltean wrote:
> > On Fri, Mar 31, 2023 at 11:14:12AM -0400, Sean Anderson wrote:
> >> smp_call_function_single disables IRQs when executing the callback. To
> >> prevent deadlocks, we must disable IRQs when taking cgr_lock elsewhere.
> >> This is already done by qman_update_cgr and qman_delete_cgr; fix the
> >> other lockers.
> >>
> >> Fixes: c535e923bb97 ("soc/fsl: Introduce DPAA 1.x QMan device driver")
> >
> > If you've identified smp_call_function_single() as the problem, then the
> > true issue seems to lie in commit 96f413f47677 ("soc/fsl/qbman: fix
> > issue in qman_delete_cgr_safe()") and not in the initial commit, no?
> 
> Yes, that seems better. I did a blame and saw that qman_delete_cgr_safe
> had been around since the initial driver, but I didn't realize it worked
> in a different way back then.
> 
> --Sean
> 
> > Anyway,
> >
> > Tested-by: Vladimir Oltean 

Apart from Vladimir's comment:
Reviewed-by: Camelia Groza 


Re: [PATCH 13/21] arc: dma-mapping: skip invalidating before bidirectional DMA

2023-04-04 Thread Shahab Vahedi
On 4/2/23 08:52, Vineet Gupta wrote:
> CC Shahab
> 
> On 3/27/23 17:43, Arnd Bergmann wrote:
>> From: Arnd Bergmann
>>
>> Some architectures that need to invalidate buffers after bidirectional
>> DMA because of speculative prefetching only do a simpler writeback
>> before that DMA, while architectures that don't need to do the second
>> invalidate tend to have a combined writeback+invalidate before the
>> DMA.
>>
>> arc is one of the architectures that does both, which seems unnecessary.
>>
>> Change it to behave like arm/arm64/xtensa instead, and use just a
>> writeback before the DMA when we do the invalidate afterwards.
>>
>> Signed-off-by: Arnd Bergmann
> 
> Reviewed-by: Vineet Gupta 
> 
> Shahab can you give this a spin on hsdk - run glibc testsuite over ssh and 
> make sure nothing strange happens.
> 
> Thx,
> -Vineet

On it.
-- 
Shahab



Re: [kvm-unit-tests v3 11/13] powerpc: Discover runtime load address dynamically

2023-04-04 Thread Thomas Huth

On 27/03/2023 14.45, Nicholas Piggin wrote:

The next change will load the kernels at different addresses depending
on test options, so this needs to be reverted back to dynamic
discovery.

Signed-off-by: Nicholas Piggin 
---
  powerpc/cstart64.S | 19 ++-
  1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/powerpc/cstart64.S b/powerpc/cstart64.S
index 1bd0437..0592e03 100644
--- a/powerpc/cstart64.S
+++ b/powerpc/cstart64.S
@@ -33,9 +33,14 @@ start:
 * We were loaded at QEMU's kernel load address, but we're not
 * allowed to link there due to how QEMU deals with linker VMAs,
 * so we just linked at zero. This means the first thing to do is
-* to find our stack and toc, and then do a relocate.
+* to find our stack and toc, and then do a relocate. powernv and
+* pseries load addreses are not the same, so find the address


With s/addreses/addresses/ :

Acked-by: Thomas Huth 



Re: [kvm-unit-tests v3 10/13] powerpc: Add support for more interrupts including HV interrupts

2023-04-04 Thread Thomas Huth

On 27/03/2023 14.45, Nicholas Piggin wrote:

Interrupt vectors were not being populated for all architected
interrupt types, which could lead to crashes rather than a message for
unhandled interrupts.

0x20 sized vectors require some reworking of the code to fit. This
also adds support for HV / HSRR type interrupts which will be used in
a later change.

Signed-off-by: Nicholas Piggin 
---
  powerpc/cstart64.S | 79 ++
  1 file changed, 65 insertions(+), 14 deletions(-)


Acked-by: Thomas Huth 



Re: [PATCH v3 02/14] arm64: drop ranges in definition of ARCH_FORCE_MAX_ORDER

2023-04-04 Thread Mike Rapoport
On Wed, Mar 29, 2023 at 10:55:37AM -0500, Justin Forbes wrote:
> On Sat, Mar 25, 2023 at 1:09 AM Mike Rapoport  wrote:
> >
> > From: "Mike Rapoport (IBM)" 
> >
> > It is not a good idea to change fundamental parameters of core memory
> > management. Having predefined ranges suggests that the values within
> > those ranges are sensible, but one has to *really* understand
> > implications of changing MAX_ORDER before actually amending it and
> > ranges don't help here.
> >
> > Drop ranges in definition of ARCH_FORCE_MAX_ORDER and make its prompt
> > visible only if EXPERT=y
> 
> I do not like suddenly hiding this behind EXPERT for a couple of
> reasons.  Most importantly, it will silently change the config for
> users building with an old kernel config.  If a user has for instance
> "13" set and building with 4K pages, as is the current configuration
> for Fedora and RHEL aarch64 builds, an oldconfig build will now set it
> to 10 with no indication that it is doing so.  And while I think that
> 10 is a fine default for many aarch64 users, there are valid reasons
> for choosing other values. Putting this behind expert makes it much
> less obvious that this is an option.

That's the idea of EXPERT, no?

This option was intended to allow allocation of huge pages for
architectures that had PMD_ORDER > MAX_ORDER and not to allow user to
select size of maximal physically contiguous allocation.

Changes to MAX_ORDER fundamentally change the behaviour of core mm and
unless users *really* know what they are doing there is no reason to choose
non-default values so hiding this option behind EXPERT seems totally
appropriate to me.
 
> Justin
> 
> > Acked-by: Kirill A. Shutemov 
> > Reviewed-by: Zi Yan 
> > Signed-off-by: Mike Rapoport (IBM) 
> > ---
> >  arch/arm64/Kconfig | 4 +---
> >  1 file changed, 1 insertion(+), 3 deletions(-)
> >
> > diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> > index e60baf7859d1..7324032af859 100644
> > --- a/arch/arm64/Kconfig
> > +++ b/arch/arm64/Kconfig
> > @@ -1487,11 +1487,9 @@ config XEN
> >  # 16K |   27  |  14  |   13| 11  |
> >  # 64K |   29  |  16  |   13| 13  |
> >  config ARCH_FORCE_MAX_ORDER
> > -   int "Maximum zone order" if ARM64_4K_PAGES || ARM64_16K_PAGES
> > +   int "Maximum zone order" if EXPERT && (ARM64_4K_PAGES || ARM64_16K_PAGES)
> > default "13" if ARM64_64K_PAGES
> > -   range 11 13 if ARM64_16K_PAGES
> > default "11" if ARM64_16K_PAGES
> > -   range 10 15 if ARM64_4K_PAGES
> > default "10"
> > help
> >   The kernel memory allocator divides physically contiguous memory
> > --
> > 2.35.1
> >
> >

-- 
Sincerely yours,
Mike.


Re: [kvm-unit-tests v3 09/13] powerpc: Expand exception handler vector granularity

2023-04-04 Thread Thomas Huth

On 27/03/2023 14.45, Nicholas Piggin wrote:

Exception handlers are currently indexed in units of 0x100, but
powerpc can have vectors that are aligned to as little as 0x20
bytes. Increase granularity of the handler functions before
adding support for thse vectors.


s/thse/those/

 Thomas



Re: [kvm-unit-tests v3 08/13] powerpc/spapr_vpa: Add basic VPA tests

2023-04-04 Thread Thomas Huth

On 27/03/2023 14.45, Nicholas Piggin wrote:

The VPA is an optional memory structure shared between the hypervisor
and operating system, defined by PAPR. This test defines the structure
and adds registration, deregistration, and a few simple sanity tests.

[Thanks to Thomas Huth for suggesting many of the test cases.]
Signed-off-by: Nicholas Piggin 
---

...

diff --git a/powerpc/Makefile.ppc64 b/powerpc/Makefile.ppc64
index ea68447..b0ed2b1 100644
--- a/powerpc/Makefile.ppc64
+++ b/powerpc/Makefile.ppc64
@@ -19,7 +19,7 @@ reloc.o  = $(TEST_DIR)/reloc64.o
  OBJDIRS += lib/ppc64
  
  # ppc64 specific tests

-tests =
+tests = $(TEST_DIR)/spapr_vpa.elf
  
  include $(SRCDIR)/$(TEST_DIR)/Makefile.common


That reminds me: We added all other tests to Makefile.common ... without 
ever checking them on 32-bit. Since we removed the early 32-bit code long 
ago already (see commit 2a814baab80af990eaf), it just might not make sense 
anymore to keep the separation for 64-bit and 32-bit Makefiles around here 
--> could be a future cleanup to merge the Makefiles in the powerpc 
folder.


Anyway, that's not a problem of your patch here which looks fine, so:

Reviewed-by: Thomas Huth 



Re: [kvm-unit-tests v3 07/13] powerpc/sprs: Specify SPRs with data rather than code

2023-04-04 Thread Thomas Huth

On 27/03/2023 14.45, Nicholas Piggin wrote:

A significant rework that builds an array of 'struct spr', where each
element describes an SPR. This makes various metadata about the SPR
like name and access type easier to carry and use.

Hypervisor privileged registers are described despite not being used
at the moment for completeness, but also the code might one day be
reused for a hypervisor-privileged test.

Signed-off-by: Nicholas Piggin 

This ended up a little over-engineered perhaps, but there are lots of
SPRs, lots of access types, lots of changes between processor and ISA
versions, and lots of places they are implemented and used, so lots of
room for mistakes. There is not a good system in place to easily
see that userspace, supervisor, etc., switches perform all the right
SPR context switching so this is a nice test case to have. The sprs test
quickly caught a few QEMU TCG SPR bugs which really motivated me to
improve the SPR coverage.
---
Since v2:
- Merged with "Indirect SPR accessor functions" patch.

  powerpc/sprs.c | 643 ++---
  1 file changed, 450 insertions(+), 193 deletions(-)


Acked-by: Thomas Huth