Re: [PATCH] KVM: PPC: Book3S HV: Fix L2 guest reboot failure due to empty 'arch_compat'

2024-01-22 Thread Gautam Menghani
On Thu, Jan 18, 2024 at 03:26:53PM +0530, Amit Machhiwal wrote:
> Currently, rebooting a pseries nested qemu-kvm guest (L2) results in the
> error below, as L1 qemu sends the PVR value 'arch_compat' == 0 via the
> ppc_set_compat ioctl. This triggers a condition failure in
> kvmppc_set_arch_compat(), resulting in an EINVAL.
> 
> qemu-system-ppc64: Unable to set CPU compatibility mode in KVM: Invalid
> 
> This patch updates kvmppc_set_arch_compat() to use the host PVR value if
> 'compat_pvr' == 0 indicating that qemu doesn't want to enforce any
> specific PVR compat mode.
> 
> Signed-off-by: Amit Machhiwal 
> ---
>  arch/powerpc/kvm/book3s_hv.c  |  2 +-
>  arch/powerpc/kvm/book3s_hv_nestedv2.c | 12 ++--
>  2 files changed, 11 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> index 1ed6ec140701..9573d7f4764a 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -439,7 +439,7 @@ static int kvmppc_set_arch_compat(struct kvm_vcpu *vcpu, 
> u32 arch_compat)
>   if (guest_pcr_bit > host_pcr_bit)
>   return -EINVAL;
>  
> - if (kvmhv_on_pseries() && kvmhv_is_nestedv2()) {
> + if (kvmhv_on_pseries() && kvmhv_is_nestedv2() && arch_compat) {
>   if (!(cap & nested_capabilities))
>   return -EINVAL;
>   }
> diff --git a/arch/powerpc/kvm/book3s_hv_nestedv2.c 
> b/arch/powerpc/kvm/book3s_hv_nestedv2.c
> index fd3c4f2d9480..069a1fcfd782 100644
> --- a/arch/powerpc/kvm/book3s_hv_nestedv2.c
> +++ b/arch/powerpc/kvm/book3s_hv_nestedv2.c
> @@ -138,6 +138,7 @@ static int gs_msg_ops_vcpu_fill_info(struct 
> kvmppc_gs_buff *gsb,
>   vector128 v;
>   int rc, i;
>   u16 iden;
> + u32 arch_compat = 0;
>  
>   vcpu = gsm->data;
>  
> @@ -347,8 +348,15 @@ static int gs_msg_ops_vcpu_fill_info(struct 
> kvmppc_gs_buff *gsb,
>   break;
>   }
>   case KVMPPC_GSID_LOGICAL_PVR:
> - rc = kvmppc_gse_put_u32(gsb, iden,
> - vcpu->arch.vcore->arch_compat);
> + if (!vcpu->arch.vcore->arch_compat) {
> + if (cpu_has_feature(CPU_FTR_ARCH_31))
> + arch_compat = PVR_ARCH_31;
> + else if (cpu_has_feature(CPU_FTR_ARCH_300))
> + arch_compat = PVR_ARCH_300;
> + } else {
> + arch_compat = vcpu->arch.vcore->arch_compat;
> + }
> + rc = kvmppc_gse_put_u32(gsb, iden, arch_compat);
>   break;
>   }
>  
> -- 
> 2.43.0
> 

I tested this patch on a pseries Power10 machine with KVM support:
without this patch, with the latest mainline as host, the KVM guest on
pseries/PowerVM fails to reboot; with this patch, reboot works fine.

Tested-by: Gautam Menghani 


Re: [PING PATCH] powerpc/kasan: Fix addr error caused by page alignment

2024-01-22 Thread Christophe Leroy


On 23/01/2024 at 02:45, Jiangfeng Xiao wrote:
> 
> In kasan_init_region(), when k_start is not page aligned,
> k_cur = k_start & PAGE_MASK at the start of the for loop is less than
> k_start, so va = block + k_cur - k_start points below block. That va is
> invalid: the address range from va up to block is not allocated by
> memblock_alloc() and will not be reserved by memblock_reserve() later,
> so it can be handed out to other users.
> 
> As a result, memory overwriting occurs.
> 
> for example:
> int __init __weak kasan_init_region(void *start, size_t size)
> {
> [...]
>  /* if say block(dcd97000) k_start(feef7400) k_end(feeff3fe) */
>  block = memblock_alloc(k_end - k_start, PAGE_SIZE);
>  [...]
>  for (k_cur = k_start & PAGE_MASK; k_cur < k_end; k_cur += PAGE_SIZE) 
> {
>  /* at the begin of for loop
>   * block(dcd97000) va(dcd96c00) k_cur(feef7000) 
> k_start(feef7400)
>   * va(dcd96c00) is less than block(dcd97000), va is invalid
>   */
>  void *va = block + k_cur - k_start;
>  [...]
>  }
> [...]
> }
> 
> Therefore, page alignment is performed on k_start before
> memblock_alloc to ensure the validity of the VA address.
> 
> Fixes: 663c0c9496a6 ("powerpc/kasan: Fix shadow area set up for modules.")
> 
> Signed-off-by: Jiangfeng Xiao 

Be patient, your patch is not lost. Now we have it twice, see:

https://patchwork.ozlabs.org/project/linuxppc-dev/list/?submitter=76392

> ---
>   arch/powerpc/mm/kasan/init_32.c | 1 +
>   1 file changed, 1 insertion(+)
> 
> diff --git a/arch/powerpc/mm/kasan/init_32.c b/arch/powerpc/mm/kasan/init_32.c
> index a70828a..aa9aa11 100644
> --- a/arch/powerpc/mm/kasan/init_32.c
> +++ b/arch/powerpc/mm/kasan/init_32.c
> @@ -64,6 +64,7 @@ int __init __weak kasan_init_region(void *start, size_t 
> size)
>  if (ret)
>  return ret;
> 
> +   k_start = k_start & PAGE_MASK;
>  block = memblock_alloc(k_end - k_start, PAGE_SIZE);
>  if (!block)
>  return -ENOMEM;
> --
> 1.8.5.6
> 


Re: [PATCH 1/1] arch/mm/fault: fix major fault accounting when retrying under per-VMA lock

2024-01-22 Thread Suren Baghdasaryan
On Sun, Jan 21, 2024 at 11:38 PM Suren Baghdasaryan  wrote:
>
> On Sat, Jan 20, 2024 at 1:15 PM Russell King (Oracle)
>  wrote:
> >
> > On Sat, Jan 20, 2024 at 09:09:47PM +, 
> > patchwork-bot+linux-ri...@kernel.org wrote:
> > > Hello:
> > >
> > > This patch was applied to riscv/linux.git (fixes)
> > > by Andrew Morton :
> > >
> > > On Tue, 26 Dec 2023 13:46:10 -0800 you wrote:
> > > > A test [1] in Android test suite started failing after [2] was merged.
> > > > It turns out that after handling a major fault under per-VMA lock, the
> > > > process major fault counter does not register that fault as major.
> > > > Before [2] read faults would be done under mmap_lock, in which case
> > > > FAULT_FLAG_TRIED flag is set before retrying. That in turn causes
> > > > mm_account_fault() to account the fault as major once retry completes.
> > > > With per-VMA locks we often retry because a fault can't be handled
> > > > without locking the whole mm using mmap_lock. Therefore such retries
> > > > do not set FAULT_FLAG_TRIED flag. This logic does not work after [2]
> > > > because we can now handle read major faults under per-VMA lock and
> > > > upon retry the fact there was a major fault gets lost. Fix this by
> > > > setting FAULT_FLAG_TRIED after retrying under per-VMA lock if
> > > > VM_FAULT_MAJOR was returned. Ideally we would use an additional
> > > > VM_FAULT bit to indicate the reason for the retry (could not handle
> > > > under per-VMA lock vs other reason) but this simpler solution seems
> > > > to work, so keeping it simple.
> > > >
> > > > [...]
> > >
> > > Here is the summary with links:
> > >   - [1/1] arch/mm/fault: fix major fault accounting when retrying under 
> > > per-VMA lock
> > > https://git.kernel.org/riscv/c/46e714c729c8
> > >
> > > You are awesome, thank you!
> >
> > Now that 32-bit ARM has support for the per-VMA lock, does that also
> > need to be patched?
>
> Yes, I think so. I missed the ARM32 change that added support for
> per-VMA locks. Will post a similar patch for it tomorrow.

Fix for ARM posted at
https://lore.kernel.org/all/20240123064305.2829244-1-sur...@google.com/

> Thanks,
> Suren.
>
> >
> > --
> > RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
> > FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!


[PATCH 1/1] arch/arm/mm: fix major fault accounting when retrying under per-VMA lock

2024-01-22 Thread Suren Baghdasaryan
The change [1] missed the ARM architecture when fixing major fault
accounting for page fault retries under the per-VMA lock. Add the missing
code to fix fault accounting on ARM.

[1] 46e714c729c8 ("arch/mm/fault: fix major fault accounting when retrying 
under per-VMA lock")

Fixes: 12214eba1992 ("mm: handle read faults under the VMA lock")

Reported-by: Russell King (Oracle) 
Signed-off-by: Suren Baghdasaryan 
---
 arch/arm/mm/fault.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c
index e96fb40b9cc3..07565b593ed6 100644
--- a/arch/arm/mm/fault.c
+++ b/arch/arm/mm/fault.c
@@ -298,6 +298,8 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct 
pt_regs *regs)
goto done;
}
count_vm_vma_lock_event(VMA_LOCK_RETRY);
+   if (fault & VM_FAULT_MAJOR)
+   flags |= FAULT_FLAG_TRIED;
 
/* Quick path to respond to signals */
if (fault_signal_pending(fault, regs)) {
-- 
2.43.0.429.g432eaa2c6b-goog



[PATCH v2] NUMA: Early use of cpu_to_node() returns 0 instead of the correct node id

2024-01-22 Thread Huang Shijie
During kernel boot, the generic cpu_to_node() is called too early on
arm64, powerpc and riscv when CONFIG_NUMA is enabled.

For arm64/powerpc/riscv, there are at least four places in the common code
where the generic cpu_to_node() is called before it is initialized:
   1.) early_trace_init()      in kernel/trace/trace.c
   2.) sched_init()            in kernel/sched/core.c
   3.) init_sched_fair_class() in kernel/sched/fair.c
   4.) workqueue_init_early()  in kernel/workqueue.c

To fix the bug, this patch changes the generic cpu_to_node() into a
function pointer and exports it for kernel modules.
Introduce smp_prepare_boot_cpu_start() to wrap the original
smp_prepare_boot_cpu() and point cpu_to_node at early_cpu_to_node().
Introduce smp_prepare_cpus_done() to wrap the original smp_prepare_cpus()
and switch cpu_to_node to the formal _cpu_to_node().
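
The resulting boot-time sequence is roughly the following (a sketch
distilled from the diff below; hook placement is as in this patch):

	start_kernel()
		setup_per_cpu_areas();
		smp_prepare_boot_cpu_start();	/* cpu_to_node = early_cpu_to_node */
		...
		/* early_trace_init(), sched_init(), workqueue_init_early(), ...
		 * now resolve the correct node id instead of 0.
		 */
	kernel_init_freeable()
		smp_prepare_cpus_done(setup_max_cpus);	/* cpu_to_node = _cpu_to_node */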

Signed-off-by: Huang Shijie 
---
v1 --> v2:
In order to fix the x86 compile error, move cpu_to_node()
from drivers/base/arch_numa.c to drivers/base/node.c.

v1: 
http://lists.infradead.org/pipermail/linux-arm-kernel/2024-January/896160.html

An old different title patch:

http://lists.infradead.org/pipermail/linux-arm-kernel/2024-January/895963.html

---
 drivers/base/node.c  | 11 +++
 include/linux/topology.h |  6 ++
 init/main.c  | 29 +++--
 3 files changed, 40 insertions(+), 6 deletions(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index 1c05640461dd..477d58c12886 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -976,3 +976,14 @@ void __init node_dev_init(void)
panic("%s() failed to add node: %d\n", __func__, ret);
}
 }
+
+#ifdef CONFIG_USE_PERCPU_NUMA_NODE_ID
+#ifndef cpu_to_node
+int _cpu_to_node(int cpu)
+{
+   return per_cpu(numa_node, cpu);
+}
+int (*cpu_to_node)(int cpu);
+EXPORT_SYMBOL(cpu_to_node);
+#endif
+#endif
diff --git a/include/linux/topology.h b/include/linux/topology.h
index 52f5850730b3..e7ce2bae11dd 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -91,10 +91,8 @@ static inline int numa_node_id(void)
 #endif
 
 #ifndef cpu_to_node
-static inline int cpu_to_node(int cpu)
-{
-   return per_cpu(numa_node, cpu);
-}
+extern int (*cpu_to_node)(int cpu);
+extern int _cpu_to_node(int cpu);
 #endif
 
 #ifndef set_numa_node
diff --git a/init/main.c b/init/main.c
index e24b0780fdff..b142e9c51161 100644
--- a/init/main.c
+++ b/init/main.c
@@ -870,6 +870,18 @@ static void __init print_unknown_bootoptions(void)
memblock_free(unknown_options, len);
 }
 
+static void __init smp_prepare_boot_cpu_start(void)
+{
+   smp_prepare_boot_cpu(); /* arch-specific boot-cpu hooks */
+
+#ifdef CONFIG_USE_PERCPU_NUMA_NODE_ID
+#ifndef cpu_to_node
+   /* The early_cpu_to_node should be ready now. */
+   cpu_to_node = early_cpu_to_node;
+#endif
+#endif
+}
+
 asmlinkage __visible __init __no_sanitize_address __noreturn 
__no_stack_protector
 void start_kernel(void)
 {
@@ -899,7 +911,7 @@ void start_kernel(void)
setup_command_line(command_line);
setup_nr_cpu_ids();
setup_per_cpu_areas();
-   smp_prepare_boot_cpu(); /* arch-specific boot-cpu hooks */
+   smp_prepare_boot_cpu_start();
boot_cpu_hotplug_init();
 
pr_notice("Kernel command line: %s\n", saved_command_line);
@@ -1519,6 +1531,19 @@ void __init console_on_rootfs(void)
fput(file);
 }
 
+static void __init smp_prepare_cpus_done(unsigned int setup_max_cpus)
+{
+   /* Different ARCHs may override smp_prepare_cpus() */
+   smp_prepare_cpus(setup_max_cpus);
+
+#ifdef CONFIG_USE_PERCPU_NUMA_NODE_ID
+#ifndef cpu_to_node
+   /* Change to the formal function. */
+   cpu_to_node = _cpu_to_node;
+#endif
+#endif
+}
+
 static noinline void __init kernel_init_freeable(void)
 {
/* Now the scheduler is fully set up and can do blocking allocations */
@@ -1531,7 +1556,7 @@ static noinline void __init kernel_init_freeable(void)
 
cad_pid = get_pid(task_pid(current));
 
-   smp_prepare_cpus(setup_max_cpus);
+   smp_prepare_cpus_done(setup_max_cpus);
 
workqueue_init();
 
-- 
2.40.1



[PING PATCH] powerpc/kasan: Fix addr error caused by page alignment

2024-01-22 Thread Jiangfeng Xiao
In kasan_init_region(), when k_start is not page aligned,
k_cur = k_start & PAGE_MASK at the start of the for loop is less than
k_start, so va = block + k_cur - k_start points below block. That va is
invalid: the address range from va up to block is not allocated by
memblock_alloc() and will not be reserved by memblock_reserve() later,
so it can be handed out to other users.

As a result, memory overwriting occurs.

for example:
int __init __weak kasan_init_region(void *start, size_t size)
{
[...]
/* if say block(dcd97000) k_start(feef7400) k_end(feeff3fe) */
block = memblock_alloc(k_end - k_start, PAGE_SIZE);
[...]
for (k_cur = k_start & PAGE_MASK; k_cur < k_end; k_cur += PAGE_SIZE) {
/* at the begin of for loop
 * block(dcd97000) va(dcd96c00) k_cur(feef7000) 
k_start(feef7400)
 * va(dcd96c00) is less than block(dcd97000), va is invalid
 */
void *va = block + k_cur - k_start;
[...]
}
[...]
}

Therefore, page alignment is performed on k_start before
memblock_alloc to ensure the validity of the VA address.
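
Worked through with the numbers from the example above (illustrative
values only):

	/* Before the fix, with PAGE_SIZE = 4K:
	 *   k_start                      = 0xfeef7400
	 *   k_cur = k_start & PAGE_MASK  = 0xfeef7000
	 *   block                        = 0xdcd97000 (sized k_end - k_start)
	 *   va = block + k_cur - k_start = 0xdcd96c00, i.e. 0x400 bytes
	 *   before the start of the allocation.
	 *
	 * After aligning k_start down before memblock_alloc(), the block
	 * covers the full aligned range and va == block on the first
	 * iteration.
	 */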

Fixes: 663c0c9496a6 ("powerpc/kasan: Fix shadow area set up for modules.")

Signed-off-by: Jiangfeng Xiao 
---
 arch/powerpc/mm/kasan/init_32.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/mm/kasan/init_32.c b/arch/powerpc/mm/kasan/init_32.c
index a70828a..aa9aa11 100644
--- a/arch/powerpc/mm/kasan/init_32.c
+++ b/arch/powerpc/mm/kasan/init_32.c
@@ -64,6 +64,7 @@ int __init __weak kasan_init_region(void *start, size_t size)
if (ret)
return ret;
 
+   k_start = k_start & PAGE_MASK;
block = memblock_alloc(k_end - k_start, PAGE_SIZE);
if (!block)
return -ENOMEM;
-- 
1.8.5.6



[PATCH 60/82] powerpc: Refactor intentional wrap-around test

2024-01-22 Thread Kees Cook
In an effort to separate intentional arithmetic wrap-around from
unexpected wrap-around, we need to refactor places that depend on this
kind of math. One of the most common code patterns of this is:

VAR + value < VAR

Notably, this is considered "undefined behavior" for signed and pointer
types, which the kernel works around by using the -fno-strict-overflow
option in the build[1] (which used to just be -fwrapv). Regardless, we
want to get the kernel source to the position where we can meaningfully
instrument arithmetic wrap-around conditions and catch them when they
are unexpected, regardless of whether they are signed[2], unsigned[3],
or pointer[4] types.

Refactor the open-coded wrap-around addition test to use
add_would_overflow(). This paves the way for enabling the wrap-around
sanitizers in the future.
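
For reference, one plausible spelling of the helper in terms of the
existing check_add_overflow() from <linux/overflow.h> (a sketch only; the
series may define add_would_overflow() differently):

	#define add_would_overflow(a, b) ({			\
		typeof(a) __sum;				\
		/* true if a + b does not fit in typeof(a) */	\
		check_add_overflow(a, b, &__sum);		\
	})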

Link: https://git.kernel.org/linus/68df3755e383e6fecf2354a67b08f92f18536594 [1]
Link: https://github.com/KSPP/linux/issues/26 [2]
Link: https://github.com/KSPP/linux/issues/27 [3]
Link: https://github.com/KSPP/linux/issues/344 [4]
Cc: Michael Ellerman 
Cc: Nicholas Piggin 
Cc: Christophe Leroy 
Cc: "Aneesh Kumar K.V" 
Cc: "Naveen N. Rao" 
Cc: Mahesh Salgaonkar 
Cc: Vasant Hegde 
Cc: dingsenjie 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Aneesh Kumar K.V 
Cc: Naveen N. Rao 
Signed-off-by: Kees Cook 
---
 arch/powerpc/platforms/powernv/opal-prd.c | 2 +-
 arch/powerpc/xmon/xmon.c  | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/opal-prd.c 
b/arch/powerpc/platforms/powernv/opal-prd.c
index b66b06efcef1..eaf95dc82925 100644
--- a/arch/powerpc/platforms/powernv/opal-prd.c
+++ b/arch/powerpc/platforms/powernv/opal-prd.c
@@ -51,7 +51,7 @@ static bool opal_prd_range_is_valid(uint64_t addr, uint64_t 
size)
struct device_node *parent, *node;
bool found;
 
-   if (addr + size < addr)
+   if (add_would_overflow(addr, size))
return false;
 
parent = of_find_node_by_path("/reserved-memory");
diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c
index b3b94cd37713..b91fdda49434 100644
--- a/arch/powerpc/xmon/xmon.c
+++ b/arch/powerpc/xmon/xmon.c
@@ -3252,7 +3252,7 @@ memzcan(void)
} else if (!ok && ook)
printf("%.8lx\n", a - mskip);
ook = ok;
-   if (a + mskip < a)
+   if (add_would_overflow(a, mskip))
break;
}
if (ook)
-- 
2.34.1



[PATCH 6.7 517/641] perf vendor events powerpc: Update datasource event name to fix duplicate events

2024-01-22 Thread Greg Kroah-Hartman
6.7-stable review patch.  If anyone has any objections, please let me know.

--

From: Athira Rajeev 

[ Upstream commit 9eef41014fe01287dae79fe208b9b433b13040bb ]

Running "perf list" on powerpc fails with segfault as below:

   $ ./perf list
   Segmentation fault (core dumped)
   $

This happens because of duplicate events in the JSON list.  The powerpc
JSON event list contains some events with the same event name but
different event codes. They are:

- PM_INST_FROM_L3MISS (Present in datasource and frontend)
- PM_MRK_DATA_FROM_L2MISS (Present in datasource and marked)
- PM_MRK_INST_FROM_L3MISS (Present in datasource and marked)
- PM_MRK_DATA_FROM_L3MISS (Present in datasource and marked)

pmu_events_table__num_events() uses the value from table_pmu->num_entries,
which includes the duplicate events as well. This causes an issue during
"perf list" and results in a segmentation fault.

Since both event codes are valid, append _DSRC to the Data Source events
(datasource.json) so that they have unique names.

Also add PM_DATA_FROM_L2MISS_DSRC and PM_DATA_FROM_L3MISS_DSRC events.

With the fix, 'perf list' works as expected.

Fixes: fc143580753348c6 ("perf vendor events power10: Update JSON/events")
Signed-off-by: Athira Rajeev 
Tested-by: Disha Goel 
Cc: Adrian Hunter 
Cc: Disha Goel 
Cc: Ian Rogers 
Cc: James Clark 
Cc: Jiri Olsa 
Cc: Kajol Jain 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Madhavan Srinivasan 
Cc: Namhyung Kim 
Link: 
https://lore.kernel.org/r/20231123160110.94090-1-atraj...@linux.vnet.ibm.com
Signed-off-by: Arnaldo Carvalho de Melo 
Signed-off-by: Sasha Levin 
---
 .../arch/powerpc/power10/datasource.json   | 18 ++
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/tools/perf/pmu-events/arch/powerpc/power10/datasource.json 
b/tools/perf/pmu-events/arch/powerpc/power10/datasource.json
index 6b0356f2d301..0eeaaf1a95b8 100644
--- a/tools/perf/pmu-events/arch/powerpc/power10/datasource.json
+++ b/tools/perf/pmu-events/arch/powerpc/power10/datasource.json
@@ -99,6 +99,11 @@
 "EventName": "PM_INST_FROM_L2MISS",
 "BriefDescription": "The processor's instruction cache was reloaded from a 
source beyond the local core's L2 due to a demand miss."
   },
+  {
+"EventCode": "0x0003C000C040",
+"EventName": "PM_DATA_FROM_L2MISS_DSRC",
+"BriefDescription": "The processor's L1 data cache was reloaded from a 
source beyond the local core's L2 due to a demand miss."
+  },
   {
 "EventCode": "0x00038010C040",
 "EventName": "PM_INST_FROM_L2MISS_ALL",
@@ -161,9 +166,14 @@
   },
   {
 "EventCode": "0x00078000C040",
-"EventName": "PM_INST_FROM_L3MISS",
+"EventName": "PM_INST_FROM_L3MISS_DSRC",
 "BriefDescription": "The processor's instruction cache was reloaded from 
beyond the local core's L3 due to a demand miss."
   },
+  {
+"EventCode": "0x0007C000C040",
+"EventName": "PM_DATA_FROM_L3MISS_DSRC",
+"BriefDescription": "The processor's L1 data cache was reloaded from 
beyond the local core's L3 due to a demand miss."
+  },
   {
 "EventCode": "0x00078010C040",
 "EventName": "PM_INST_FROM_L3MISS_ALL",
@@ -981,7 +991,7 @@
   },
   {
 "EventCode": "0x0003C000C142",
-"EventName": "PM_MRK_DATA_FROM_L2MISS",
+"EventName": "PM_MRK_DATA_FROM_L2MISS_DSRC",
 "BriefDescription": "The processor's L1 data cache was reloaded from a 
source beyond the local core's L2 due to a demand miss for a marked 
instruction."
   },
   {
@@ -1046,12 +1056,12 @@
   },
   {
 "EventCode": "0x00078000C142",
-"EventName": "PM_MRK_INST_FROM_L3MISS",
+"EventName": "PM_MRK_INST_FROM_L3MISS_DSRC",
 "BriefDescription": "The processor's instruction cache was reloaded from 
beyond the local core's L3 due to a demand miss for a marked instruction."
   },
   {
 "EventCode": "0x0007C000C142",
-"EventName": "PM_MRK_DATA_FROM_L3MISS",
+"EventName": "PM_MRK_DATA_FROM_L3MISS_DSRC",
 "BriefDescription": "The processor's L1 data cache was reloaded from 
beyond the local core's L3 due to a demand miss for a marked instruction."
   },
   {
-- 
2.43.0





[PATCH v2] powerpc/pseries/iommu: DLPAR ADD of pci device doesn't completely initialize pci_controller structure

2024-01-22 Thread Gaurav Batra
When a PCI device is dynamically added, the LPAR oopses with a NULL
pointer exception.

The complete stack trace is below:

[  211.239206] BUG: Kernel NULL pointer dereference on read at 0x0030
[  211.239210] Faulting instruction address: 0xc06bbe5c
[  211.239214] Oops: Kernel access of bad area, sig: 11 [#1]
[  211.239218] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
[  211.239223] Modules linked in: rpadlpar_io rpaphp rpcsec_gss_krb5 
auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache netfs xsk_diag bonding 
nft_compat nf_tables nfnetlink rfkill binfmt_misc dm_multipath rpcrdma sunrpc 
rdma_ucm ib_srpt ib_isert iscsi_target_mod target_core_mod ib_umad ib_iser 
libiscsi scsi_transport_iscsi ib_ipoib rdma_cm iw_cm ib_cm mlx5_ib ib_uverbs 
ib_core pseries_rng drm drm_panel_orientation_quirks xfs libcrc32c mlx5_core 
mlxfw sd_mod t10_pi sg tls ibmvscsi ibmveth scsi_transport_srp vmx_crypto 
pseries_wdt psample dm_mirror dm_region_hash dm_log dm_mod fuse
[  211.239280] CPU: 17 PID: 2685 Comm: drmgr Not tainted 6.7.0-203405+ #66
[  211.239284] Hardware name: IBM,9080-HEX POWER10 (raw) 0x800200 0xf06 
of:IBM,FW1060.00 (NH1060_008) hv:phyp pSeries
[  211.239289] NIP:  c06bbe5c LR: c0a13e68 CTR: c00579f8
[  211.239293] REGS: c0009924f240 TRAP: 0300   Not tainted  (6.7.0-203405+)
[  211.239298] MSR:  80009033   CR: 24002220  
XER: 20040006
[  211.239306] CFAR: c0a13e64 DAR: 0030 DSISR: 4000 
IRQMASK: 0
[  211.239306] GPR00: c0a13e68 c0009924f4e0 c15a2b00 

[  211.239306] GPR04: c13c5590  c6d07970 
c000d8f8f180
[  211.239306] GPR08: 06ec c000d8f8f180 c2c35d58 
24002228
[  211.239306] GPR12: c00579f8 c003ffeb3880  

[  211.239306] GPR16:    

[  211.239306] GPR20:    

[  211.239306] GPR24: c000919460c0  f000 
c10088e8
[  211.239306] GPR28: c13c5590 c6d07970 c000919460c0 
c000919460c0
[  211.239354] NIP [c06bbe5c] sysfs_add_link_to_group+0x34/0x94
[  211.239361] LR [c0a13e68] iommu_device_link+0x5c/0x118
[  211.239367] Call Trace:
[  211.239369] [c0009924f4e0] [c0a109b8] 
iommu_init_device+0x26c/0x318 (unreliable)
[  211.239376] [c0009924f520] [c0a13e68] 
iommu_device_link+0x5c/0x118
[  211.239382] [c0009924f560] [c0a107f4] 
iommu_init_device+0xa8/0x318
[  211.239387] [c0009924f5c0] [c0a11a08] 
iommu_probe_device+0xc0/0x134
[  211.239393] [c0009924f600] [c0a11ac0] 
iommu_bus_notifier+0x44/0x104
[  211.239398] [c0009924f640] [c018dcc0] 
notifier_call_chain+0xb8/0x19c
[  211.239405] [c0009924f6a0] [c018df88] 
blocking_notifier_call_chain+0x64/0x98
[  211.239411] [c0009924f6e0] [c0a250fc] bus_notify+0x50/0x7c
[  211.239416] [c0009924f720] [c0a20838] device_add+0x640/0x918
[  211.239421] [c0009924f7f0] [c08f1a34] pci_device_add+0x23c/0x298
[  211.239427] [c0009924f840] [c0077460] 
of_create_pci_dev+0x400/0x884
[  211.239432] [c0009924f8e0] [c0077a08] of_scan_pci_dev+0x124/0x1b0
[  211.239437] [c0009924f980] [c0077b0c] __of_scan_bus+0x78/0x18c
[  211.239442] [c0009924fa10] [c0073f90] 
pcibios_scan_phb+0x2a4/0x3b0
[  211.239447] [c0009924fad0] [c01007a8] init_phb_dynamic+0xb8/0x110
[  211.239453] [c0009924fb40] [c00806920620] dlpar_add_slot+0x170/0x3b8 
[rpadlpar_io]
[  211.239461] [c0009924fbe0] [c00806920d64] 
add_slot_store.part.0+0xb4/0x130 [rpadlpar_io]
[  211.239468] [c0009924fc70] [c0fb4144] kobj_attr_store+0x2c/0x48
[  211.239473] [c0009924fc90] [c06b90e4] sysfs_kf_write+0x64/0x78
[  211.239479] [c0009924fcb0] [c06b7b78] 
kernfs_fop_write_iter+0x1b0/0x290
[  211.239485] [c0009924fd00] [c05b6fdc] vfs_write+0x350/0x4a0
[  211.239491] [c0009924fdc0] [c05b7450] ksys_write+0x84/0x140
[  211.239496] [c0009924fe10] [c0030a04] 
system_call_exception+0x124/0x330
[  211.239502] [c0009924fe50] [c000cedc] 
system_call_vectored_common+0x15c/0x2ec

Commit a940904443e4 ("powerpc/iommu: Add iommu_ops to report capabilities
and allow blocking domains") broke DLPAR ADD of PCI devices.

That commit added an iommu_device structure to pci_controller. During
system boot, PCI devices are discovered and this newly added iommu_device
structure is initialized by a call to iommu_device_register().

During DLPAR ADD of a PCI device, a new pci_controller structure is
allocated, but no call is made to the iommu_device_register() interface.

The fix is to register the iommu device during DLPAR ADD as well.
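
Conceptually (a sketch only; the actual patch body is cut off below, and
the hook chosen here as well as the field and ops names are assumptions
based on the description above):

	/* Mirror what boot-time PHB setup does for the new pci_controller. */
	rc = iommu_device_register(&phb->iommu, &spapr_tce_iommu_ops, parent);
	if (rc)
		pr_err("Failed to register iommu device for dynamically added PHB\n");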

Fixes: a940904443e4 ("powerpc/iommu: Add 

Re: [RFC PATCH 2/3] fs: remove duplicate ifdefs

2024-01-22 Thread Chandan Babu R
On Thu, Jan 18, 2024 at 01:33:25 PM +0530, Shrikanth Hegde wrote:
> When an ifdef is used in the manner below, the second one can be
> considered a duplicate.
>
> ifdef DEFINE_A
> ...code block...
> ifdef DEFINE_A
> ...code block...
> endif
> ...code block...
> endif
>
> There are a few places in fs code where the above pattern was seen.
> No functional change is intended here. It only aims to improve code
> readability.
>

Can you please post the xfs changes as a separate patch along with Darrick's
RVB tag? This will make it easy for me to apply the resulting patch to the XFS
tree.

-- 
Chandan


Re: [RFC PATCH] mm: z3fold: rename CONFIG_Z3FOLD to CONFIG_Z3FOLD_DEPRECATED

2024-01-22 Thread Yosry Ahmed
On Sun, Jan 21, 2024 at 11:42 PM Christoph Hellwig  wrote:
>
> On Tue, Jan 16, 2024 at 12:19:39PM -0800, Yosry Ahmed wrote:
> > Well, better compression ratios for one :)
> >
> > I think a long time ago there were complaints that zsmalloc had higher
> > latency than zbud/z3fold, but since then a lot of things have changed
> > (including nice compaction optimization from Sergey, and compaction
> > was one of the main factors AFAICT). Also, recent experiments that
> > Chris Li conducted showed that (at least in our setup), the
> > decompression is only a small part of the fault latency with zswap
> > (i.e. not the main factor) -- so I am not sure if it actually matters
> > in practice.
> >
> > That said, I have not conducted any experiments personally with z3fold
> > or zbud, which is why I proposed the conservative approach of marking
> > as deprecated first. However, if others believe this is unnecessary I
> > am fine with removal as well. Whatever we agree on is fine by me.
>
> In general deprecated is for code that has active (intentional) users
> and/or would break setups.  It does sound to me like that is not the
> case here, but others might understand this better.

I generally agree. So far we have no knowledge of active users, and if
there are some, I expect most of them to be able to switch to zsmalloc
with no problems. That being said, I was trying to take the
conservative approach. If others agree I can send a removal patch
instead.


Re: [PATCH v2 0/3] ASoC: Support SAI and MICFIL on i.MX95 platform

2024-01-22 Thread Mark Brown
On Fri, 12 Jan 2024 14:43:28 +0900, Chancel Liu wrote:
> Support SAI and MICFIL on i.MX95 platform
> 
> changes in v2
> - Remove unnecessary "item" in fsl,micfil.yaml
> - Don't change alphabetical order in fsl,sai.yaml
> 
> Chancel Liu (3):
>   ASoC: dt-bindings: fsl,sai: Add compatible string for i.MX95 platform
>   ASoC: fsl_sai: Add support for i.MX95 platform
>   ASoC: dt-bindings: fsl,micfil: Add compatible string for i.MX95
> platform
> 
> [...]

Applied to

   https://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound.git for-next

Thanks!

[1/3] ASoC: dt-bindings: fsl,sai: Add compatible string for i.MX95 platform
  commit: 52523f70fdf9b2cb0bfd1999eba4aa3a30b04fa6
[2/3] ASoC: fsl_sai: Add support for i.MX95 platform
  commit: 2f2d78e2c29347a96268f6f34092538b307ed056
[3/3] ASoC: dt-bindings: fsl,micfil: Add compatible string for i.MX95 platform
  commit: 20d2719937cf439602566a8f041d3208274abc01

All being well this means that it will be integrated into the linux-next
tree (usually sometime in the next 24 hours) and sent to Linus during
the next merge window (or sooner if it is a bug fix), however if
problems are discovered then the patch may be dropped or reverted.

You may get further e-mails resulting from automated or manual testing
and review of the tree, please engage with people reporting problems and
send followup patches addressing any issues that are reported if needed.

If any updates are required or you are submitting further changes they
should be sent as incremental updates against current git, existing
patches will not be replaced.

Please add any relevant lists and maintainers to the CCs when replying
to this mail.

Thanks,
Mark



Re: [PATCH v1 04/11] risc: pgtable: define PFN_PTE_SHIFT

2024-01-22 Thread David Hildenbrand

On 22.01.24 21:03, Alexandre Ghiti wrote:

Hi David,

On 22/01/2024 20:41, David Hildenbrand wrote:

We want to make use of pte_next_pfn() outside of set_ptes(). Let's
simply define PFN_PTE_SHIFT, required by pte_next_pfn().

Signed-off-by: David Hildenbrand 
---
   arch/riscv/include/asm/pgtable.h | 2 ++
   1 file changed, 2 insertions(+)

diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index 0c94260b5d0c1..add5cd30ab34d 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -523,6 +523,8 @@ static inline void __set_pte_at(pte_t *ptep, pte_t pteval)
set_pte(ptep, pteval);
   }
   
+#define PFN_PTE_SHIFT		_PAGE_PFN_SHIFT

+
   static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
pte_t *ptep, pte_t pteval, unsigned int nr)
   {



There is a typo in the commit title: risc -> riscv. Otherwise, this is
right so:


Whops :)



Reviewed-by: Alexandre Ghiti 


Thanks!

--
Cheers,

David / dhildenb



Re: [PATCH v1 04/11] risc: pgtable: define PFN_PTE_SHIFT

2024-01-22 Thread Alexandre Ghiti

Hi David,

On 22/01/2024 20:41, David Hildenbrand wrote:

We want to make use of pte_next_pfn() outside of set_ptes(). Let's
simply define PFN_PTE_SHIFT, required by pte_next_pfn().

Signed-off-by: David Hildenbrand 
---
  arch/riscv/include/asm/pgtable.h | 2 ++
  1 file changed, 2 insertions(+)

diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index 0c94260b5d0c1..add5cd30ab34d 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -523,6 +523,8 @@ static inline void __set_pte_at(pte_t *ptep, pte_t pteval)
set_pte(ptep, pteval);
  }
  
+#define PFN_PTE_SHIFT		_PAGE_PFN_SHIFT

+
  static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
pte_t *ptep, pte_t pteval, unsigned int nr)
  {



There is a typo in the commit title: risc -> riscv. Otherwise, this is 
right so:


Reviewed-by: Alexandre Ghiti 

Thanks,

Alex



[PATCH v1 11/11] mm/memory: ignore writable bit in folio_pte_batch()

2024-01-22 Thread David Hildenbrand
... and conditionally return to the caller if any pte except the first one
is writable. fork() has to make sure to properly write-protect in case any
PTE is writable. Other users (e.g., page unmaping) won't care.

Signed-off-by: David Hildenbrand 
---
 mm/memory.c | 26 +-
 1 file changed, 21 insertions(+), 5 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 341b2be845b6e..a26fd0669016b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -955,7 +955,7 @@ static __always_inline void __copy_present_ptes(struct 
vm_area_struct *dst_vma,
 
 static inline pte_t __pte_batch_clear_ignored(pte_t pte)
 {
-   return pte_clear_soft_dirty(pte_mkclean(pte_mkold(pte)));
+   return pte_wrprotect(pte_clear_soft_dirty(pte_mkclean(pte_mkold(pte))));
 }
 
 /*
@@ -963,20 +963,29 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte)
  * pages of the same folio.
  *
  * All PTEs inside a PTE batch have the same PTE bits set, excluding the PFN.
- * the accessed bit, dirty bit and soft-dirty bit.
+ * the accessed bit, dirty bit, soft-dirty bit and writable bit.
+ . If "any_writable" is set, it will indicate if any other PTE besides the
+ * first (given) PTE is writable.
  */
 static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
-   pte_t *start_ptep, pte_t pte, int max_nr)
+   pte_t *start_ptep, pte_t pte, int max_nr, bool *any_writable)
 {
unsigned long folio_end_pfn = folio_pfn(folio) + folio_nr_pages(folio);
const pte_t *end_ptep = start_ptep + max_nr;
pte_t expected_pte = __pte_batch_clear_ignored(pte_next_pfn(pte));
pte_t *ptep = start_ptep + 1;
+   bool writable;
+
+   if (any_writable)
+   *any_writable = false;
 
VM_WARN_ON_FOLIO(!pte_present(pte), folio);
 
while (ptep != end_ptep) {
-   pte = __pte_batch_clear_ignored(ptep_get(ptep));
+   pte = ptep_get(ptep);
+   if (any_writable)
+   writable = !!pte_write(pte);
+   pte = __pte_batch_clear_ignored(pte);
 
if (!pte_same(pte, expected_pte))
break;
@@ -989,6 +998,9 @@ static inline int folio_pte_batch(struct folio *folio, 
unsigned long addr,
if (pte_pfn(pte) == folio_end_pfn)
break;
 
+   if (any_writable)
+   *any_writable |= writable;
+
expected_pte = pte_next_pfn(expected_pte);
ptep++;
}
@@ -1010,6 +1022,7 @@ copy_present_ptes(struct vm_area_struct *dst_vma, struct 
vm_area_struct *src_vma
 {
struct page *page;
struct folio *folio;
+   bool any_writable;
int err, nr;
 
page = vm_normal_page(src_vma, addr, pte);
@@ -1024,7 +1037,8 @@ copy_present_ptes(struct vm_area_struct *dst_vma, struct 
vm_area_struct *src_vma
 * by keeping the batching logic separate.
 */
if (unlikely(!*prealloc && folio_test_large(folio) && max_nr != 1)) {
-   nr = folio_pte_batch(folio, addr, src_pte, pte, max_nr);
+   nr = folio_pte_batch(folio, addr, src_pte, pte, max_nr,
+    &any_writable);
if (folio_test_anon(folio)) {
folio_ref_add(folio, nr);
if (unlikely(folio_try_dup_anon_rmap_ptes(folio, page,
@@ -1039,6 +1053,8 @@ copy_present_ptes(struct vm_area_struct *dst_vma, struct 
vm_area_struct *src_vma
folio_dup_file_rmap_ptes(folio, page, nr);
rss[mm_counter_file(page)] += nr;
}
+   if (any_writable)
+   pte = pte_mkwrite(pte, src_vma);
__copy_present_ptes(dst_vma, src_vma, dst_pte, src_pte, pte,
addr, nr);
return nr;
-- 
2.43.0



[PATCH v1 10/11] mm/memory: ignore dirty/accessed/soft-dirty bits in folio_pte_batch()

2024-01-22 Thread David Hildenbrand
Let's ignore these bits: they are irrelevant for fork, and will likely
be irrelevant for upcoming users such as page unmapping.

Signed-off-by: David Hildenbrand 
---
 mm/memory.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index f563aec85b2a8..341b2be845b6e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -953,24 +953,30 @@ static __always_inline void __copy_present_ptes(struct 
vm_area_struct *dst_vma,
set_ptes(dst_vma->vm_mm, addr, dst_pte, pte, nr);
 }
 
+static inline pte_t __pte_batch_clear_ignored(pte_t pte)
+{
+   return pte_clear_soft_dirty(pte_mkclean(pte_mkold(pte)));
+}
+
 /*
  * Detect a PTE batch: consecutive (present) PTEs that map consecutive
  * pages of the same folio.
  *
  * All PTEs inside a PTE batch have the same PTE bits set, excluding the PFN.
+ * the accessed bit, dirty bit and soft-dirty bit.
  */
 static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
pte_t *start_ptep, pte_t pte, int max_nr)
 {
unsigned long folio_end_pfn = folio_pfn(folio) + folio_nr_pages(folio);
const pte_t *end_ptep = start_ptep + max_nr;
-   pte_t expected_pte = pte_next_pfn(pte);
+   pte_t expected_pte = __pte_batch_clear_ignored(pte_next_pfn(pte));
pte_t *ptep = start_ptep + 1;
 
VM_WARN_ON_FOLIO(!pte_present(pte), folio);
 
while (ptep != end_ptep) {
-   pte = ptep_get(ptep);
+   pte = __pte_batch_clear_ignored(ptep_get(ptep));
 
if (!pte_same(pte, expected_pte))
break;
-- 
2.43.0



[PATCH v1 09/11] mm/memory: optimize fork() with PTE-mapped THP

2024-01-22 Thread David Hildenbrand
Let's implement PTE batching when consecutive (present) PTEs map
consecutive pages of the same large folio, and all other PTE bits besides
the PFNs are equal.

We will optimize folio_pte_batch() separately, to ignore some other
PTE bits. This patch is based on work by Ryan Roberts.

Use __always_inline for __copy_present_ptes() and keep the handling for
single PTEs completely separate from the multi-PTE case: we really want
the compiler to optimize for the single-PTE case with small folios, to
not degrade performance.

Note that PTE batching will never exceed a single page table and will
always stay within VMA boundaries.
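
The caller-side bound looks roughly like this (a sketch; the
copy_pte_range() hunk is cut off at the end of this mail, so the exact
call is an assumption):

	/* max_nr covers only the PTEs left in this page table walk, so
	 * folio_pte_batch() can never run past the current page table or
	 * the VMA range being copied.
	 */
	int max_nr = (end - addr) / PAGE_SIZE;

	nr = copy_present_ptes(dst_vma, src_vma, dst_pte, src_pte,
			       ptent, addr, max_nr, rss, &prealloc);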

Signed-off-by: David Hildenbrand 
---
 include/linux/pgtable.h |  17 +-
 mm/memory.c | 113 +---
 2 files changed, 109 insertions(+), 21 deletions(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index f6d0e3513948a..d32cedf6936ba 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -212,8 +212,6 @@ static inline int pmd_dirty(pmd_t pmd)
 #define arch_flush_lazy_mmu_mode() do {} while (0)
 #endif
 
-#ifndef set_ptes
-
 #ifndef pte_next_pfn
 static inline pte_t pte_next_pfn(pte_t pte)
 {
@@ -221,6 +219,7 @@ static inline pte_t pte_next_pfn(pte_t pte)
 }
 #endif
 
+#ifndef set_ptes
 /**
  * set_ptes - Map consecutive pages to a contiguous range of addresses.
  * @mm: Address space to map the pages into.
@@ -650,6 +649,20 @@ static inline void ptep_set_wrprotect(struct mm_struct 
*mm, unsigned long addres
 }
 #endif
 
+#ifndef wrprotect_ptes
+static inline void wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
+   pte_t *ptep, unsigned int nr)
+{
+   for (;;) {
+   ptep_set_wrprotect(mm, addr, ptep);
+   if (--nr == 0)
+   break;
+   ptep++;
+   addr += PAGE_SIZE;
+   }
+}
+#endif
+
 /*
  * On some architectures hardware does not set page access bit when accessing
  * memory page, it is responsibility of software setting this bit. It brings
diff --git a/mm/memory.c b/mm/memory.c
index 185b4aff13d62..f563aec85b2a8 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -930,15 +930,15 @@ copy_present_page(struct vm_area_struct *dst_vma, struct 
vm_area_struct *src_vma
return 0;
 }
 
-static inline void __copy_present_pte(struct vm_area_struct *dst_vma,
+static __always_inline void __copy_present_ptes(struct vm_area_struct *dst_vma,
struct vm_area_struct *src_vma, pte_t *dst_pte, pte_t *src_pte,
-   pte_t pte, unsigned long addr)
+   pte_t pte, unsigned long addr, int nr)
 {
struct mm_struct *src_mm = src_vma->vm_mm;
 
/* If it's a COW mapping, write protect it both processes. */
if (is_cow_mapping(src_vma->vm_flags) && pte_write(pte)) {
-   ptep_set_wrprotect(src_mm, addr, src_pte);
+   wrprotect_ptes(src_mm, addr, src_pte, nr);
pte = pte_wrprotect(pte);
}
 
@@ -950,26 +950,94 @@ static inline void __copy_present_pte(struct 
vm_area_struct *dst_vma,
if (!userfaultfd_wp(dst_vma))
pte = pte_clear_uffd_wp(pte);
 
-   set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte);
+   set_ptes(dst_vma->vm_mm, addr, dst_pte, pte, nr);
+}
+
+/*
+ * Detect a PTE batch: consecutive (present) PTEs that map consecutive
+ * pages of the same folio.
+ *
+ * All PTEs inside a PTE batch have the same PTE bits set, excluding the PFN.
+ */
+static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
+   pte_t *start_ptep, pte_t pte, int max_nr)
+{
+   unsigned long folio_end_pfn = folio_pfn(folio) + folio_nr_pages(folio);
+   const pte_t *end_ptep = start_ptep + max_nr;
+   pte_t expected_pte = pte_next_pfn(pte);
+   pte_t *ptep = start_ptep + 1;
+
+   VM_WARN_ON_FOLIO(!pte_present(pte), folio);
+
+   while (ptep != end_ptep) {
+   pte = ptep_get(ptep);
+
+   if (!pte_same(pte, expected_pte))
+   break;
+
+   /*
+* Stop immediately once we reached the end of the folio. In
+* corner cases the next PFN might fall into a different
+* folio.
+*/
+   if (pte_pfn(pte) == folio_end_pfn)
+   break;
+
+   expected_pte = pte_next_pfn(expected_pte);
+   ptep++;
+   }
+
+   return ptep - start_ptep;
 }
 
 /*
- * Copy one pte.  Returns 0 if succeeded, or -EAGAIN if one preallocated page
- * is required to copy this pte.
+ * Copy one present PTE, trying to batch-process subsequent PTEs that map
+ * consecutive pages of the same folio by copying them as well.
+ *
+ * Returns -EAGAIN if one preallocated page is required to copy the next PTE.
+ * Otherwise, returns the number of copied PTEs (at least 1).
  */
 static inline int
-copy_present_pte(struct vm_area_struct *dst_vma, struct 

[PATCH v1 08/11] mm/memory: pass PTE to copy_present_pte()

2024-01-22 Thread David Hildenbrand
We already read it, let's just forward it.

This patch is based on work by Ryan Roberts.

Signed-off-by: David Hildenbrand 
---
 mm/memory.c | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 2aa2051ee51d3..185b4aff13d62 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -959,10 +959,9 @@ static inline void __copy_present_pte(struct 
vm_area_struct *dst_vma,
  */
 static inline int
 copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct 
*src_vma,
-pte_t *dst_pte, pte_t *src_pte, unsigned long addr, int *rss,
-struct folio **prealloc)
+pte_t *dst_pte, pte_t *src_pte, pte_t pte, unsigned long addr,
+int *rss, struct folio **prealloc)
 {
-   pte_t pte = ptep_get(src_pte);
struct page *page;
struct folio *folio;
 
@@ -1104,7 +1103,7 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct 
vm_area_struct *src_vma,
}
/* copy_present_pte() will clear `*prealloc' if consumed */
ret = copy_present_pte(dst_vma, src_vma, dst_pte, src_pte,
-  addr, rss, &prealloc);
+  ptent, addr, rss, &prealloc);
/*
 * If we need a pre-allocated page for this pte, drop the
 * locks, allocate, and try again.
-- 
2.43.0



[PATCH v1 07/11] mm/memory: factor out copying the actual PTE in copy_present_pte()

2024-01-22 Thread David Hildenbrand
Let's prepare for further changes.

Signed-off-by: David Hildenbrand 
---
 mm/memory.c | 60 -
 1 file changed, 32 insertions(+), 28 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 7e1f4849463aa..2aa2051ee51d3 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -930,6 +930,29 @@ copy_present_page(struct vm_area_struct *dst_vma, struct 
vm_area_struct *src_vma
return 0;
 }
 
+static inline void __copy_present_pte(struct vm_area_struct *dst_vma,
+   struct vm_area_struct *src_vma, pte_t *dst_pte, pte_t *src_pte,
+   pte_t pte, unsigned long addr)
+{
+   struct mm_struct *src_mm = src_vma->vm_mm;
+
+   /* If it's a COW mapping, write protect it both processes. */
+   if (is_cow_mapping(src_vma->vm_flags) && pte_write(pte)) {
+   ptep_set_wrprotect(src_mm, addr, src_pte);
+   pte = pte_wrprotect(pte);
+   }
+
+   /* If it's a shared mapping, mark it clean in the child. */
+   if (src_vma->vm_flags & VM_SHARED)
+   pte = pte_mkclean(pte);
+   pte = pte_mkold(pte);
+
+   if (!userfaultfd_wp(dst_vma))
+   pte = pte_clear_uffd_wp(pte);
+
+   set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte);
+}
+
 /*
  * Copy one pte.  Returns 0 if succeeded, or -EAGAIN if one preallocated page
  * is required to copy this pte.
@@ -939,16 +962,16 @@ copy_present_pte(struct vm_area_struct *dst_vma, struct 
vm_area_struct *src_vma,
 pte_t *dst_pte, pte_t *src_pte, unsigned long addr, int *rss,
 struct folio **prealloc)
 {
-   struct mm_struct *src_mm = src_vma->vm_mm;
-   unsigned long vm_flags = src_vma->vm_flags;
pte_t pte = ptep_get(src_pte);
struct page *page;
struct folio *folio;
 
page = vm_normal_page(src_vma, addr, pte);
-   if (page)
-   folio = page_folio(page);
-   if (page && folio_test_anon(folio)) {
+   if (unlikely(!page))
+   goto copy_pte;
+
+   folio = page_folio(page);
+   if (folio_test_anon(folio)) {
/*
 * If this page may have been pinned by the parent process,
 * copy the page immediately for the child so that we'll always
@@ -963,34 +986,15 @@ copy_present_pte(struct vm_area_struct *dst_vma, struct 
vm_area_struct *src_vma,
 addr, rss, prealloc, page);
}
rss[MM_ANONPAGES]++;
-   } else if (page) {
+   VM_WARN_ON_FOLIO(PageAnonExclusive(page), folio);
+   } else {
folio_get(folio);
folio_dup_file_rmap_pte(folio, page);
rss[mm_counter_file(page)]++;
}
 
-   /*
-* If it's a COW mapping, write protect it both
-* in the parent and the child
-*/
-   if (is_cow_mapping(vm_flags) && pte_write(pte)) {
-   ptep_set_wrprotect(src_mm, addr, src_pte);
-   pte = pte_wrprotect(pte);
-   }
-   VM_BUG_ON(page && folio_test_anon(folio) && PageAnonExclusive(page));
-
-   /*
-* If it's a shared mapping, mark it clean in
-* the child
-*/
-   if (vm_flags & VM_SHARED)
-   pte = pte_mkclean(pte);
-   pte = pte_mkold(pte);
-
-   if (!userfaultfd_wp(dst_vma))
-   pte = pte_clear_uffd_wp(pte);
-
-   set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte);
+copy_pte:
+   __copy_present_pte(dst_vma, src_vma, dst_pte, src_pte, pte, addr);
return 0;
 }
 
-- 
2.43.0



[PATCH v1 06/11] sparc/pgtable: define PFN_PTE_SHIFT

2024-01-22 Thread David Hildenbrand
We want to make use of pte_next_pfn() outside of set_ptes(). Let's
simply define PFN_PTE_SHIFT, required by pte_next_pfn().

Signed-off-by: David Hildenbrand 
---
 arch/sparc/include/asm/pgtable_64.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/sparc/include/asm/pgtable_64.h 
b/arch/sparc/include/asm/pgtable_64.h
index a8c871b7d7860..652af9d63fa29 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -929,6 +929,8 @@ static inline void __set_pte_at(struct mm_struct *mm, 
unsigned long addr,
maybe_tlb_batch_add(mm, addr, ptep, orig, fullmm, PAGE_SHIFT);
 }
 
+#define PFN_PTE_SHIFT  PAGE_SHIFT
+
 static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
pte_t *ptep, pte_t pte, unsigned int nr)
 {
-- 
2.43.0



[PATCH v1 05/11] s390/pgtable: define PFN_PTE_SHIFT

2024-01-22 Thread David Hildenbrand
We want to make use of pte_next_pfn() outside of set_ptes(). Let's
simply define PFN_PTE_SHIFT, required by pte_next_pfn().

Signed-off-by: David Hildenbrand 
---
 arch/s390/include/asm/pgtable.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 1299b56e43f6f..4b91e65c85d97 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -1316,6 +1316,8 @@ pgprot_t pgprot_writecombine(pgprot_t prot);
 #define pgprot_writethroughpgprot_writethrough
 pgprot_t pgprot_writethrough(pgprot_t prot);
 
+#define PFN_PTE_SHIFT  PAGE_SHIFT
+
 /*
  * Set multiple PTEs to consecutive pages with a single call.  All PTEs
  * are within the same folio, PMD and VMA.
-- 
2.43.0



[PATCH v1 04/11] risc: pgtable: define PFN_PTE_SHIFT

2024-01-22 Thread David Hildenbrand
We want to make use of pte_next_pfn() outside of set_ptes(). Let's
simply define PFN_PTE_SHIFT, required by pte_next_pfn().

Signed-off-by: David Hildenbrand 
---
 arch/riscv/include/asm/pgtable.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index 0c94260b5d0c1..add5cd30ab34d 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -523,6 +523,8 @@ static inline void __set_pte_at(pte_t *ptep, pte_t pteval)
set_pte(ptep, pteval);
 }
 
+#define PFN_PTE_SHIFT  _PAGE_PFN_SHIFT
+
 static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
pte_t *ptep, pte_t pteval, unsigned int nr)
 {
-- 
2.43.0



[PATCH v1 03/11] powerpc/pgtable: define PFN_PTE_SHIFT

2024-01-22 Thread David Hildenbrand
We want to make use of pte_next_pfn() outside of set_ptes(). Let's
simply define PFN_PTE_SHIFT, required by pte_next_pfn().
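
For reference, the generic fallback that these per-arch definitions feed
into looks roughly like this (based on the include/linux/pgtable.h hunk
in patch 09 of this series; the exact body shown here is an assumption):

	#ifndef pte_next_pfn
	static inline pte_t pte_next_pfn(pte_t pte)
	{
		/* Advance the PFN encoded in the PTE by one page. */
		return __pte(pte_val(pte) + (1UL << PFN_PTE_SHIFT));
	}
	#endif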

Signed-off-by: David Hildenbrand 
---
 arch/powerpc/include/asm/pgtable.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/include/asm/pgtable.h 
b/arch/powerpc/include/asm/pgtable.h
index 9224f23065fff..7a1ba8889aeae 100644
--- a/arch/powerpc/include/asm/pgtable.h
+++ b/arch/powerpc/include/asm/pgtable.h
@@ -41,6 +41,8 @@ struct mm_struct;
 
 #ifndef __ASSEMBLY__
 
+#define PFN_PTE_SHIFT  PTE_RPN_SHIFT
+
 void set_ptes(struct mm_struct *mm, unsigned long addr, pte_t *ptep,
pte_t pte, unsigned int nr);
 #define set_ptes set_ptes
-- 
2.43.0



[PATCH v1 02/11] nios2/pgtable: define PFN_PTE_SHIFT

2024-01-22 Thread David Hildenbrand
We want to make use of pte_next_pfn() outside of set_ptes(). Let's
simply define PFN_PTE_SHIFT, required by pte_next_pfn().

Signed-off-by: David Hildenbrand 
---
 arch/nios2/include/asm/pgtable.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/nios2/include/asm/pgtable.h b/arch/nios2/include/asm/pgtable.h
index 5144506dfa693..d052dfcbe8d3a 100644
--- a/arch/nios2/include/asm/pgtable.h
+++ b/arch/nios2/include/asm/pgtable.h
@@ -178,6 +178,8 @@ static inline void set_pte(pte_t *ptep, pte_t pteval)
*ptep = pteval;
 }
 
+#define PFN_PTE_SHIFT  0
+
 static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
pte_t *ptep, pte_t pte, unsigned int nr)
 {
-- 
2.43.0



[PATCH v1 01/11] arm/pgtable: define PFN_PTE_SHIFT on arm and arm64

2024-01-22 Thread David Hildenbrand
We want to make use of pte_next_pfn() outside of set_ptes(). Let's
simply define PFN_PTE_SHIFT, required by pte_next_pfn().

Signed-off-by: David Hildenbrand 
---
 arch/arm/include/asm/pgtable.h   | 2 ++
 arch/arm64/include/asm/pgtable.h | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/arch/arm/include/asm/pgtable.h b/arch/arm/include/asm/pgtable.h
index d657b84b6bf70..be91e376df79e 100644
--- a/arch/arm/include/asm/pgtable.h
+++ b/arch/arm/include/asm/pgtable.h
@@ -209,6 +209,8 @@ static inline void __sync_icache_dcache(pte_t pteval)
 extern void __sync_icache_dcache(pte_t pteval);
 #endif
 
+#define PFN_PTE_SHIFT  PAGE_SHIFT
+
 void set_ptes(struct mm_struct *mm, unsigned long addr,
  pte_t *ptep, pte_t pteval, unsigned int nr);
 #define set_ptes set_ptes
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 79ce70fbb751c..d4b3bd96e3304 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -341,6 +341,8 @@ static inline void __sync_cache_and_tags(pte_t pte, 
unsigned int nr_pages)
mte_sync_tags(pte, nr_pages);
 }
 
+#define PFN_PTE_SHIFT  PAGE_SHIFT
+
 static inline void set_ptes(struct mm_struct *mm,
unsigned long __always_unused addr,
pte_t *ptep, pte_t pte, unsigned int nr)
-- 
2.43.0



[PATCH v1 00/11] mm/memory: optimize fork() with PTE-mapped THP

2024-01-22 Thread David Hildenbrand
Now that the rmap overhaul[1], which provides a clean interface for rmap
batching, is upstream, let's implement PTE batching during fork when
processing PTE-mapped THPs.

This series is partially based on Ryan's previous work[2] to implement
cont-pte support on arm64, but it's a complete rewrite based on [1] to
optimize all architectures independent of any such PTE bits, and to
use the new rmap batching functions that simplify the code and prepare
for further rmap accounting changes.

We collect consecutive PTEs that map consecutive pages of the same large
folio, making sure that the other PTE bits are compatible, and (a) adjust
the refcount only once per batch, (b) call rmap handling functions only
once per batch and (c) perform batch PTE setting/updates.

While this series should be beneficial for adding cont-pte support on
ARM64[2], it's one of the requirements for maintaining a total mapcount[3]
for large folios with minimal added overhead and further changes[4] that
build up on top of the total mapcount.

Independent of all that, this series results in a speedup during fork with
PTE-mapped THP, which is the default with THPs that are smaller than a PMD
(for example, 16KiB to 1024KiB mTHPs for anonymous memory[5]).

On an Intel Xeon Silver 4210R CPU, fork'ing with 1GiB of PTE-mapped folios
of the same size (stddev < 1%) results in the following runtimes
for fork() (shorter is better):

Folio Size | v6.8-rc1 |      New | Change
-----------------------------------------
      4KiB | 0.014328 | 0.014265 |     0%
     16KiB | 0.014263 | 0.013293 |   - 7%
     32KiB | 0.014334 | 0.012355 |   -14%
     64KiB | 0.014046 | 0.011837 |   -16%
    128KiB | 0.014011 | 0.011536 |   -18%
    256KiB | 0.013993 | 0.01134  |   -19%
    512KiB | 0.013983 | 0.011311 |   -19%
   1024KiB | 0.013986 | 0.011282 |   -19%
   2048KiB | 0.014305 | 0.011496 |   -20%

Next up is PTE batching when unmapping, that I'll probably send out
based on this series this/next week.

Only tested on x86-64. Compile-tested on most other architectures. Will
do more testing and double-check the arch changes while this is getting
some review.

[1] https://lkml.kernel.org/r/20231220224504.646757-1-da...@redhat.com
[2] https://lkml.kernel.org/r/20231218105100.172635-1-ryan.robe...@arm.com
[3] https://lkml.kernel.org/r/20230809083256.699513-1-da...@redhat.com
[4] https://lkml.kernel.org/r/20231124132626.235350-1-da...@redhat.com
[5] https://lkml.kernel.org/r/20231207161211.2374093-1-ryan.robe...@arm.com

Cc: Andrew Morton 
Cc: Matthew Wilcox (Oracle) 
Cc: Ryan Roberts 
Cc: Russell King 
Cc: Catalin Marinas 
Cc: Will Deacon 
Cc: Dinh Nguyen 
Cc: Michael Ellerman 
Cc: Nicholas Piggin 
Cc: Christophe Leroy 
Cc: "Aneesh Kumar K.V" 
Cc: "Naveen N. Rao" 
Cc: Paul Walmsley 
Cc: Palmer Dabbelt 
Cc: Albert Ou 
Cc: Alexander Gordeev 
Cc: Gerald Schaefer 
Cc: Heiko Carstens 
Cc: Vasily Gorbik 
Cc: Christian Borntraeger 
Cc: Sven Schnelle 
Cc: "David S. Miller" 
Cc: linux-arm-ker...@lists.infradead.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-ri...@lists.infradead.org
Cc: linux-s...@vger.kernel.org
Cc: sparcli...@vger.kernel.org

David Hildenbrand (11):
  arm/pgtable: define PFN_PTE_SHIFT on arm and arm64
  nios2/pgtable: define PFN_PTE_SHIFT
  powerpc/pgtable: define PFN_PTE_SHIFT
  risc: pgtable: define PFN_PTE_SHIFT
  s390/pgtable: define PFN_PTE_SHIFT
  sparc/pgtable: define PFN_PTE_SHIFT
  mm/memory: factor out copying the actual PTE in copy_present_pte()
  mm/memory: pass PTE to copy_present_pte()
  mm/memory: optimize fork() with PTE-mapped THP
  mm/memory: ignore dirty/accessed/soft-dirty bits in folio_pte_batch()
  mm/memory: ignore writable bit in folio_pte_batch()

 arch/arm/include/asm/pgtable.h  |   2 +
 arch/arm64/include/asm/pgtable.h|   2 +
 arch/nios2/include/asm/pgtable.h|   2 +
 arch/powerpc/include/asm/pgtable.h  |   2 +
 arch/riscv/include/asm/pgtable.h|   2 +
 arch/s390/include/asm/pgtable.h |   2 +
 arch/sparc/include/asm/pgtable_64.h |   2 +
 include/linux/pgtable.h |  17 ++-
 mm/memory.c | 188 +---
 9 files changed, 173 insertions(+), 46 deletions(-)


base-commit: 6613476e225e090cc9aad49be7fa504e290dd33d
-- 
2.43.0



Re: [PATCH 1/1] PCI/DPC: Fix TLP Prefix register reading offset

2024-01-22 Thread Bjorn Helgaas
On Thu, Jan 18, 2024 at 01:08:15PM +0200, Ilpo Järvinen wrote:
> The TLP Prefix Log Register consists of multiple DWORDs (PCIe r6.1 sec
> 7.9.14.13) but the loop in dpc_process_rp_pio_error() keeps reading
> from the first DWORD. Add the iteration count based offset calculation
> into the config read.
> 
> Fixes: f20c4ea49ec4 ("PCI/DPC: Add eDPC support")
> Signed-off-by: Ilpo Järvinen 

Applied to pci/dpc for v6.9 with commit log below, thanks!

PCI/DPC: Print all TLP Prefixes, not just the first

The TLP Prefix Log Register consists of multiple DWORDs (PCIe r6.1 sec
7.9.14.13) but the loop in dpc_process_rp_pio_error() keeps reading from
the first DWORD, so we print only the first PIO TLP Prefix (duplicated
several times), and we never print the second, third, etc., Prefixes.

Add the iteration count based offset calculation into the config read.
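
In other words, each TLP Prefix Log DWORD has its own offset
(illustrative):

	/* i = 0 -> cap + PCI_EXP_DPC_RP_PIO_TLPPREFIX_LOG + 0
	 * i = 1 -> cap + PCI_EXP_DPC_RP_PIO_TLPPREFIX_LOG + 4
	 * i = 2 -> cap + PCI_EXP_DPC_RP_PIO_TLPPREFIX_LOG + 8
	 * ...
	 */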

Fixes: f20c4ea49ec4 ("PCI/DPC: Add eDPC support")
Link: 
https://lore.kernel.org/r/20240118110815.3867-1-ilpo.jarvi...@linux.intel.com
Signed-off-by: Ilpo Järvinen 
[bhelgaas: add user-visible details to commit log]
Signed-off-by: Bjorn Helgaas 

> ---
>  drivers/pci/pcie/dpc.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
> index 94111e438241..e5d7c12854fa 100644
> --- a/drivers/pci/pcie/dpc.c
> +++ b/drivers/pci/pcie/dpc.c
> @@ -234,7 +234,7 @@ static void dpc_process_rp_pio_error(struct pci_dev *pdev)
>  
>   for (i = 0; i < pdev->dpc_rp_log_size - 5; i++) {
>   pci_read_config_dword(pdev,
> - cap + PCI_EXP_DPC_RP_PIO_TLPPREFIX_LOG, &prefix);
> + cap + PCI_EXP_DPC_RP_PIO_TLPPREFIX_LOG + i * 4, &prefix);
>   pci_err(pdev, "TLP Prefix Header: dw%d, %#010x\n", i, prefix);
>   }
>   clear_status:
> -- 
> 2.39.2
> 


[RFC PATCH v2 4/4] arch/powerpc: remove duplicate ifdefs

2024-01-22 Thread Shrikanth Hegde
When an ifdef is used in the manner below, the second one can be
considered a duplicate.

ifdef DEFINE_A
...code block...
ifdef DEFINE_A
...code block...
endif
...code block...
endif

There are a few places in arch/powerpc where this pattern was seen. In
addition, in paca.h, CONFIG_PPC_BOOK3S_64 was used in two back-to-back
ifdef blocks; those have been merged into one.

No functional change is intended here. It only aims to improve code
readability.

Signed-off-by: Shrikanth Hegde 
---
 arch/powerpc/include/asm/paca.h   | 4 
 arch/powerpc/kernel/asm-offsets.c | 2 --
 arch/powerpc/platforms/powermac/feature.c | 2 --
 arch/powerpc/xmon/xmon.c  | 2 --
 4 files changed, 10 deletions(-)

diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h
index e667d455ecb4..1d58da946739 100644
--- a/arch/powerpc/include/asm/paca.h
+++ b/arch/powerpc/include/asm/paca.h
@@ -163,9 +163,7 @@ struct paca_struct {
u64 kstack; /* Saved Kernel stack addr */
	u64 saved_r1;   /* r1 save for RTAS calls or PM or EE=0 */
u64 saved_msr;  /* MSR saved here by enter_rtas */
-#ifdef CONFIG_PPC64
u64 exit_save_r1;   /* Syscall/interrupt R1 save */
-#endif
 #ifdef CONFIG_PPC_BOOK3E_64
u16 trap_save;  /* Used when bad stack is encountered */
 #endif
@@ -214,8 +212,6 @@ struct paca_struct {
/* Non-maskable exceptions that are not performance critical */
u64 exnmi[EX_SIZE]; /* used for system reset (nmi) */
u64 exmc[EX_SIZE];  /* used for machine checks */
-#endif
-#ifdef CONFIG_PPC_BOOK3S_64
/* Exclusive stacks for system reset and machine check exception. */
void *nmi_emergency_sp;
void *mc_emergency_sp;
diff --git a/arch/powerpc/kernel/asm-offsets.c b/arch/powerpc/kernel/asm-offsets.c
index 9f14d95b8b32..f029755f9e69 100644
--- a/arch/powerpc/kernel/asm-offsets.c
+++ b/arch/powerpc/kernel/asm-offsets.c
@@ -246,9 +246,7 @@ int main(void)
OFFSET(PACAHWCPUID, paca_struct, hw_cpu_id);
OFFSET(PACAKEXECSTATE, paca_struct, kexec_state);
OFFSET(PACA_DSCR_DEFAULT, paca_struct, dscr_default);
-#ifdef CONFIG_PPC64
OFFSET(PACA_EXIT_SAVE_R1, paca_struct, exit_save_r1);
-#endif
 #ifdef CONFIG_PPC_BOOK3E_64
OFFSET(PACA_TRAP_SAVE, paca_struct, trap_save);
 #endif
diff --git a/arch/powerpc/platforms/powermac/feature.c b/arch/powerpc/platforms/powermac/feature.c
index 81c9fbae88b1..2cc257f75c50 100644
--- a/arch/powerpc/platforms/powermac/feature.c
+++ b/arch/powerpc/platforms/powermac/feature.c
@@ -2333,7 +2333,6 @@ static struct pmac_mb_def pmac_mb_defs[] = {
PMAC_TYPE_POWERMAC_G5,  g5_features,
0,
},
-#ifdef CONFIG_PPC64
{   "PowerMac7,3",  "PowerMac G5",
PMAC_TYPE_POWERMAC_G5,  g5_features,
0,
@@ -2359,7 +2358,6 @@ static struct pmac_mb_def pmac_mb_defs[] = {
0,
},
 #endif /* CONFIG_PPC64 */
-#endif /* CONFIG_PPC64 */
 };

 /*
diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c
index b3b94cd37713..f413c220165c 100644
--- a/arch/powerpc/xmon/xmon.c
+++ b/arch/powerpc/xmon/xmon.c
@@ -643,10 +643,8 @@ static int xmon_core(struct pt_regs *regs, volatile int fromipi)
touch_nmi_watchdog();
} else {
cmd = 1;
-#ifdef CONFIG_SMP
if (xmon_batch)
cmd = batch_cmds(regs);
-#endif
if (!locked_down && cmd)
cmd = cmds(regs);
if (locked_down || cmd != 0) {
--
2.39.3



[RFC PATCH v2 3/4] ntfs: remove duplicate ifdefs

2024-01-22 Thread Shrikanth Hegde
When an ifdef is used in the manner below, the second one can be
considered a duplicate.

ifdef DEFINE_A
...code block...
ifdef DEFINE_A
...code block...
endif
...code block...
endif

In the ntfs code, one such pattern was seen. Hence, remove that duplicate
ifdef.
No functional change is intended here. It only aims to improve code
readability.

Signed-off-by: Shrikanth Hegde 
---
 fs/ntfs/inode.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/fs/ntfs/inode.c b/fs/ntfs/inode.c
index aba1e22db4e9..d2c8622d53d1 100644
--- a/fs/ntfs/inode.c
+++ b/fs/ntfs/inode.c
@@ -2859,11 +2859,9 @@ int ntfs_truncate(struct inode *vi)
  *
  * See ntfs_truncate() description above for details.
  */
-#ifdef NTFS_RW
 void ntfs_truncate_vfs(struct inode *vi) {
ntfs_truncate(vi);
 }
-#endif

 /**
  * ntfs_setattr - called from notify_change() when an attribute is being changed
--
2.39.3



[RFC PATCH v2 2/4] xfs: remove duplicate ifdefs

2024-01-22 Thread Shrikanth Hegde
When an ifdef is used in the manner below, the second one can be
considered a duplicate.

ifdef DEFINE_A
...code block...
ifdef DEFINE_A
...code block...
endif
...code block...
endif

In the xfs code, two such patterns were seen. Hence, remove these ifdefs.
No functional change is intended here. It only aims to improve code
readability.

Reviewed-by: Darrick J. Wong 
Signed-off-by: Shrikanth Hegde 
---
 fs/xfs/xfs_sysfs.c | 4 
 1 file changed, 4 deletions(-)

diff --git a/fs/xfs/xfs_sysfs.c b/fs/xfs/xfs_sysfs.c
index 17485666b672..d2391eec37fe 100644
--- a/fs/xfs/xfs_sysfs.c
+++ b/fs/xfs/xfs_sysfs.c
@@ -193,7 +193,6 @@ always_cow_show(
 }
 XFS_SYSFS_ATTR_RW(always_cow);

-#ifdef DEBUG
 /*
  * Override how many threads the parallel work queue is allowed to create.
  * This has to be a debug-only global (instead of an errortag) because one of
@@ -260,7 +259,6 @@ larp_show(
return snprintf(buf, PAGE_SIZE, "%d\n", xfs_globals.larp);
 }
 XFS_SYSFS_ATTR_RW(larp);
-#endif /* DEBUG */

 STATIC ssize_t
 bload_leaf_slack_store(
@@ -319,10 +317,8 @@ static struct attribute *xfs_dbg_attrs[] = {
ATTR_LIST(log_recovery_delay),
ATTR_LIST(mount_delay),
ATTR_LIST(always_cow),
-#ifdef DEBUG
ATTR_LIST(pwork_threads),
ATTR_LIST(larp),
-#endif
ATTR_LIST(bload_leaf_slack),
ATTR_LIST(bload_node_slack),
NULL,
--
2.39.3



[RFC PATCH v2 0/4] remove duplicate ifdefs

2024-01-22 Thread Shrikanth Hegde
While going through the code, I observed a case in the scheduler where
an #ifdef CONFIG_SMP was used inside another #ifdef CONFIG_SMP. That
doesn't make sense, since the first one is sufficient and the second
one is a duplicate.

Removing such duplicates could improve code readability. No functional change is intended.

Since this pattern might be present in other code areas, I wrote a very
basic Python script to help find these cases. It doesn't handle any
complicated #defines or space-separated "# if" directives. In some
places the collected log had to be corrected manually because of
space-separated ifdefs; that's why this is not a treewide change.
There might be an opportunity in other files as well.

The logic is very simple: on every #ifdef, #if, or #ifndef, add that
symbol to a list; on every subsequent #ifdef, #if, or #ifndef, check
whether the same symbol is already in the list, and if so, flag an
error. Verification was then done manually, checking for any #undef or
any error caused by the script. The patches in this series are the
cases that were flagged and made sense after going through the code.
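
A minimal sketch of that detection logic, for reference. This is an
assumed reconstruction in Python, not the exact script used for this
series; names such as check_file and the symbol-extraction regex are
illustrative. It keeps a stack of the symbols guarding the current
nesting and flags a nested directive that tests a symbol which is
already open, so only nested duplicates are reported. Space-separated
"# if" forms and complicated #defines are not handled, as noted above.

#!/usr/bin/env python3
import re
import sys

# Matches "#ifdef X", "#ifndef X" and "#if defined(X) ..." openers.
OPEN_RE = re.compile(r'^\s*#\s*(ifdef|ifndef|if)\b(.*)')

def check_file(path):
    stack = []  # symbols guarding the current nesting level
    with open(path, errors='replace') as f:
        for lineno, line in enumerate(f, start=1):
            stripped = line.strip()
            m = OPEN_RE.match(stripped)
            if m:
                # Crude symbol extraction: take the first identifier
                # that is not the word "defined".
                words = [w for w in re.findall(r'[A-Za-z_]\w*', m.group(2))
                         if w != 'defined']
                sym = words[0] if words else None
                if sym and sym in stack:
                    print(f'{path}:{lineno}: duplicate guard on {sym}')
                stack.append(sym)
            elif stripped.startswith('#endif'):
                if stack:
                    stack.pop()

if __name__ == '__main__':
    for p in sys.argv[1:]:
        check_file(p)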

More details about how the logs were collected and the script used for
processing the logs are mentioned in v1 cover letter.

v2->v1:
split the fs change into two patches as suggested by Chandan Babu R.
v1: https://lore.kernel.org/all/20240118080326.13137-1-sshe...@linux.ibm.com/

Shrikanth Hegde (4):
  sched: remove duplicate ifdefs
  xfs: remove duplicate ifdefs
  ntfs: remove duplicate ifdefs
  arch/powerpc: remove duplicate ifdefs

 arch/powerpc/include/asm/paca.h   | 4 
 arch/powerpc/kernel/asm-offsets.c | 2 --
 arch/powerpc/platforms/powermac/feature.c | 2 --
 arch/powerpc/xmon/xmon.c  | 2 --
 fs/ntfs/inode.c   | 2 --
 fs/xfs/xfs_sysfs.c| 4 
 kernel/sched/core.c   | 4 +---
 kernel/sched/fair.c   | 2 --
 8 files changed, 1 insertion(+), 21 deletions(-)

--
2.39.3



[RFC PATCH v2 1/4] sched: remove duplicate ifdefs

2024-01-22 Thread Shrikanth Hegde
When an ifdef is used in the manner below, the second one can be
considered a duplicate.

ifdef DEFINE_A
...code block...
ifdef DEFINE_A
...code block...
endif
...code block...
endif

In the scheduler code, there are two places where the above pattern can
be observed; the second ifdef is a duplicate and not needed.
There is also a minor comment update to reflect the else case.

No functional change is intended here. It only aims to improve code
readability.

Signed-off-by: Shrikanth Hegde 
---
 kernel/sched/core.c | 4 +---
 kernel/sched/fair.c | 2 --
 2 files changed, 1 insertion(+), 5 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9116bcc90346..a76c7095f736 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1792,7 +1792,6 @@ static void cpu_util_update_eff(struct cgroup_subsys_state *css);
 #endif

 #ifdef CONFIG_SYSCTL
-#ifdef CONFIG_UCLAMP_TASK
 #ifdef CONFIG_UCLAMP_TASK_GROUP
 static void uclamp_update_root_tg(void)
 {
@@ -1898,7 +1897,6 @@ static int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
return result;
 }
 #endif
-#endif

 static int uclamp_validate(struct task_struct *p,
   const struct sched_attr *attr)
@@ -2065,7 +2063,7 @@ static void __init init_uclamp(void)
}
 }

-#else /* CONFIG_UCLAMP_TASK */
+#else /* !CONFIG_UCLAMP_TASK */
 static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p) { }
 static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p) { }
 static inline int uclamp_validate(struct task_struct *p,
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 533547e3c90a..8e30e2bb77a0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10182,10 +10182,8 @@ static int idle_cpu_without(int cpu, struct task_struct *p)
 * be computed and tested before calling idle_cpu_without().
 */

-#ifdef CONFIG_SMP
if (rq->ttwu_pending)
return 0;
-#endif

return 1;
 }
--
2.39.3



Re: [RFC PATCH 2/3] fs: remove duplicate ifdefs

2024-01-22 Thread Shrikanth Hegde



On 1/22/24 6:20 PM, Chandan Babu R wrote:
> On Thu, Jan 18, 2024 at 01:33:25 PM +0530, Shrikanth Hegde wrote:
>> when a ifdef is used in the below manner, second one could be considered as
>> duplicate.
>>
>> ifdef DEFINE_A
>> ...code block...
>> ifdef DEFINE_A
>> ...code block...
>> endif
>> ...code block...
>> endif
>>
>> There are few places in fs code where above pattern was seen.
>> No functional change is intended here. It only aims to improve code
>> readability.
>>
> 
> Can you please post the xfs changes as a separate patch along with Darrick's
> RVB tag? This will make it easy for me to apply the resulting patch to the XFS
> tree.

OK, will split the fs patches into two and send v2 soon.

Thanks.

> 


Re: [PATCH 1/1] PCI/DPC: Fix TLP Prefix register reading offset

2024-01-22 Thread Ilpo Järvinen
On Fri, 19 Jan 2024, Bjorn Helgaas wrote:

> On Thu, Jan 18, 2024 at 01:08:15PM +0200, Ilpo Järvinen wrote:
> > The TLP Prefix Log Register consists of multiple DWORDs (PCIe r6.1 sec
> > 7.9.14.13) but the loop in dpc_process_rp_pio_error() keeps reading
> > from the first DWORD. Add the iteration count based offset calculation
> > into the config read.
> 
> So IIUC the user-visible bug is that we print only the first PIO TLP
> Prefix (duplicated several times), and we never print the second,
> third, etc Prefixes, right?

Yes.

> I wish we could print them all in a single pci_err(), as we do for the
> TLP Header Log, instead of dribbling them out one by one.

I've also done some work towards consolidating AER and DPC TLP
Header/Prefix Log handling, which is when I found this bug (the reading
side is already done, but printing is still pending).

> > Fixes: f20c4ea49ec4 ("PCI/DPC: Add eDPC support")
> > Signed-off-by: Ilpo Järvinen 

-- 
 i.

Re: [PATCH] NUMA: Early use of cpu_to_node() returns 0 instead of the correct node id

2024-01-22 Thread Shijie Huang



On 2024/1/22 15:41, Mike Rapoport wrote:

On Fri, Jan 19, 2024 at 04:50:53PM +0800, Shijie Huang wrote:

On 2024/1/19 16:42, Mike Rapoport wrote:

Is there a fundamental reason to have early_cpu_to_node() at all?

early_cpu_to_node() does not work on some ARCHs that support NUMA,
such as SPARC, MIPS and S390.

My question was why we need early_cpu_to_node() at all and why can't we use
cpu_to_node() early on arches that do have it


As you see, some ARCHs use cpu_to_node() all the time, such as
SPARC, MIPS and S390.


They do not use early_cpu_to_node() at all.


In some ARCHs (arm64, powerpc, riscv), cpu_to_node() only becomes ready at:

    start_kernel --> arch_call_rest_init() --> rest_init()
      --> kernel_init() --> kernel_init_freeable()
      --> smp_prepare_cpus()

That is, cpu_to_node() is initialized too late.


I am not sure if we can move the cpu_to_node() initialization to an earlier place.

Moving the cpu_to_node() initialization earlier is more complicated,
I guess.



Thanks

Huang Shijie






Re: [PATCH v2 01/13] mm/Kconfig: CONFIG_PGTABLE_HAS_HUGE_LEAVES

2024-01-22 Thread Peter Xu
On Mon, Jan 15, 2024 at 01:37:37PM -0400, Jason Gunthorpe wrote:
> On Wed, Jan 03, 2024 at 05:14:11PM +0800, pet...@redhat.com wrote:
> > From: Peter Xu 
> > 
> > Introduce a config option that will be selected as long as huge leaves are
> > involved in pgtable (thp or hugetlbfs).  It would be useful to mark any
> > code with this new config that can process either hugetlb or thp pages in
> > any level that is higher than pte level.
> > 
> > Signed-off-by: Peter Xu 
> > ---
> >  mm/Kconfig | 3 +++
> >  1 file changed, 3 insertions(+)
> 
> So you mean anything that supports page table entries > PAGE_SIZE?

Yes.

> 
> Makes sense to me, though maybe add a comment in the kconfig?

Sure I'll add some.

> 
> Reviewed-by: Jason Gunthorpe 

Thanks for your reviews and also positive comments in previous versions,
Jason.  I appreciate that.

I'm just pretty occupied with other tasks recently, so I don't yet have time
to revisit this series or the other comments.  I'll do so and
reply to the comments / discussions together afterwards.

-- 
Peter Xu