Re: [PATCH 1/1] powerpc/smp: Set numa node before updating mask
On Thu, 1 Apr 2021 21:12:00 +0530, Srikar Dronamraju wrote: > Geethika reported a trace when doing a dlpar CPU add. > > [ cut here ] > WARNING: CPU: 152 PID: 1134 at kernel/sched/topology.c:2057 > CPU: 152 PID: 1134 Comm: kworker/152:1 Not tainted 5.12.0-rc5-master #5 > Workqueue: events cpuset_hotplug_workfn > NIP: c01cfc14 LR: c01cfc10 CTR: c07e3420 > REGS: c034a08eb260 TRAP: 0700 Not tainted (5.12.0-rc5-master+) > MSR: 80029033 CR: 28828422 XER: 0020 > CFAR: c01fd888 IRQMASK: 0 #012GPR00: c01cfc10 > c034a08eb500 c1f35400 0027 #012GPR04: > c035abaa8010 c035abb30a00 0027 c035abaa8018 > #012GPR08: 0023 c035abaaef48 0035aa54 > c035a49dffe8 #012GPR12: 28828424 c035bf1a1c80 > 0497 0004 #012GPR16: c347a258 > 0140 c203d468 c1a1a490 #012GPR20: > c1f9c160 c034adf70920 c034aec9fd20 000100087bd3 > #012GPR24: 000100087bd3 c035b3de09f8 0030 > c035b3de09f8 #012GPR28: 0028 c347a280 > c034aefe0b00 c10a2a68 > NIP [c01cfc14] build_sched_domains+0x6a4/0x1500 > LR [c01cfc10] build_sched_domains+0x6a0/0x1500 > Call Trace: > [c034a08eb500] [c01cfc10] build_sched_domains+0x6a0/0x1500 > (unreliable) > [c034a08eb640] [c01d1e6c] > partition_sched_domains_locked+0x3ec/0x530 > [c034a08eb6e0] [c02936d4] rebuild_sched_domains_locked+0x524/0xbf0 > [c034a08eb7e0] [c0296bb0] rebuild_sched_domains+0x40/0x70 > [c034a08eb810] [c0296e74] cpuset_hotplug_workfn+0x294/0xe20 > [c034a08ebc30] [c0178dd0] process_one_work+0x300/0x670 > [c034a08ebd10] [c01791b8] worker_thread+0x78/0x520 > [c034a08ebda0] [c0185090] kthread+0x1a0/0x1b0 > [c034a08ebe10] [c000ccec] ret_from_kernel_thread+0x5c/0x70 > Instruction dump: > 7d2903a6 4e800421 e8410018 7f67db78 7fe6fb78 7f45d378 7f84e378 7c681b78 > 3c62ff1a 3863c6f8 4802dc35 6000 <0fe0> 3920fff4 f9210070 e86100a0 > ---[ end trace 532d9066d3d4d7ec ]--- > > [...] Applied to powerpc/next. [1/1] powerpc/smp: Set numa node before updating mask https://git.kernel.org/powerpc/c/6980d13f0dd189846887bbbfa43793d9a41768d3 cheers
Re: [PATCH 1/1] powerpc/smp: Set numa node before updating mask
Srikar Dronamraju writes: > > Some of the per-CPU masks use cpu_cpu_mask as a filter to limit the search > for related CPUs. On a dlpar add of a CPU, update cpu_cpu_mask before > updating the per-CPU masks. This will ensure the cpu_cpu_mask is updated > correctly before its used in setting the masks. Setting the numa_node will > ensure that when cpu_cpu_mask() gets called, the correct node number is > used. This code movement helped fix the above call trace. > > Reported-by: Geetika Moolchandani > Signed-off-by: Srikar Dronamraju Reviewed-by: Nathan Lynch Thanks.
Re: [PATCH 1/1] powerpc/smp: Set numa node before updating mask
Srikar Dronamraju writes: > That leaves us with just 2 options for now. > 1. Update numa_mem later and only update numa_node here. > - Over a longer period of time, this would be more confusing since we > may lose track of why we are splitting the set of numa_node and numa_mem. > > or > 2. Use my earlier patch. > > My choice would be to go with my earlier patch. > Please do let me know your thoughts on the same. OK, agreed. Thanks.
Re: [PATCH 1/1] powerpc/smp: Set numa node before updating mask
Hey Nathan, > > Regardless of your change: at boot time, this set of calls to > set_numa_node() and set_numa_mem() is redundant, right? Because > smp_prepare_cpus() has: > > for_each_possible_cpu(cpu) { > ... > if (cpu_present(cpu)) { > set_cpu_numa_node(cpu, numa_cpu_lookup_table[cpu]); > set_cpu_numa_mem(cpu, > local_memory_node(numa_cpu_lookup_table[cpu])); > } > > I would rather that, when onlining a CPU that happens to have been > dynamically added after boot, we enter start_secondary() with conditions > equivalent to those at boot time. Or as close to that as is practical. > > So I'd suggest that pseries_add_processor() be made to update > these things when the CPUs are marked present, before onlining them. I did try updating at pseries_add_processor when things were being marked as present. But looks like the zonelists may not be updated at that time. I saw couple of crashes in local_memory_node() when dplar adding CPUs. (Appending the patch that causes these crash to this mail for your reference) [ 293.669901] Kernel attempted to read user page (1508) - exploit attempt? (uid: 0) [1309/17174] [ 293.669925] BUG: Kernel NULL pointer dereference on read at 0x1508 [ 293.669935] Faulting instruction address: 0xc0484ae0 [ 293.669947] Oops: Kernel access of bad area, sig: 11 [#1] [ 293.669956] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries [ 293.669969] Modules linked in: nft_counter nft_compat rpadlpar_io rpaphp mptcp_diag xsk_diag tcp_diag udp_diag raw_diag inet_diag unix_diag af_packet_diag netlink_diag bonding tls nft_fib_inet nft_fib_ipv4 nft_ fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 rfkill ip_set nf_tables nfnetlink dm_multipath pseries_rng xts vmx_c rypto uio_pdrv_genirq uio binfmt_misc ip_tables xfs libcrc32c sd_mod t10_pi sg ibmvscsi ibmveth scsi_transport_srp dm_mirror dm_region_hash dm_log dm_mod fuse [ 293.670101] CPU: 17 PID: 3183 Comm: drmgr Not tainted 5.12.0-rc5-master+ #2 [ 293.670113] NIP: c0484ae0 LR: c010a548 CTR: 006dd130 [ 293.670121] REGS: c025a9997160 TRAP: 0300 Not tainted (5.12.0-rc5-master+) [ 293.670131] MSR: 80009033 CR: 48008422 XER: 0008 [ 293.670155] CFAR: c010a544 DAR: 1508 DSISR: 4000 IRQMASK: 0 [ 293.670155] GPR00: c010a548 c025a9997400 c1f55500 [ 293.670155] GPR04: c1a3c598 0006 0027 c035873b8018 [ 293.670155] GPR08: 0023 c20464f8 0035894d c03584c8ffe8 [ 293.670155] GPR12: 28008424 c035bf22ce80 [ 293.670155] GPR16: [ 293.670155] GPR20: c1fbc160 [ 293.670155] GPR24: 0002 0200 c1fc04c8 0001 [ 293.670155] GPR28: c10c6fc8 c1fc08a0 c02f8fb99e60 0040 [ 293.670263] NIP [c0484ae0] local_memory_node+0x20/0x80 [ 293.670281] LR [c010a548] pseries_add_processor+0x368/0x3b0 [ 293.670292] Call Trace: [ 293.670297] [c025a9997400] [c010a524] pseries_add_processor+0x344/0x3b0 (unreliable) [ 293.670311] [c025a99976c0] [c010a790] pseries_smp_notifier+0x200/0x2a0 [ 293.670322] [c025a9997780] [c0197288] blocking_notifier_call_chain+0xa8/0x110 [ 293.670335] [c025a99977d0] [c0b27010] of_attach_node+0xc0/0x110 [ 293.670347] [c025a9997830] [c01032a0] dlpar_attach_node+0x30/0x70 [ 293.670358] [c025a99978a0] [c0109ac0] dlpar_cpu_add+0x1d0/0x510 [ 293.670368] [c025a9997980] [c010a898] dlpar_cpu+0x68/0x6e0 [ 293.670377] [c025a9997a80] [c01036b8] handle_dlpar_errorlog+0xf8/0x190 [ 293.670388] [c025a9997af0] [c0103928] dlpar_store+0x178/0x4a0 [ 293.670396] [c025a9997bb0] [c07df050] kobj_attr_store+0x30/0x50 [ 293.670408] [c025a9997bd0] [c062f0b0] sysfs_kf_write+0x70/0xb0 [ 293.670421] [c025a9997c10] [c062d4e0] kernfs_fop_write_iter+0x1d0/0x280 [ 293.670432] [c025a9997c60] [c051673c] new_sync_write+0x14c/0x1d0 [ 293.670445] [c025a9997d00] [c0519df4] vfs_write+0x264/0x380 [ 293.670455] [c025a9997d60] [c051a0ec] ksys_write+0x7c/0x140 [ 293.670464] [c025a9997db0] [c0032af0] system_call_exception+0x150/0x290 [ 293.670475] [c025a9997e10] [c000d45c] system_call_common+0xec/0x278 [ 293.670486] --- interrupt: c00 at 0x2025bd74 That leaves us with just 2
Re: [PATCH 1/1] powerpc/smp: Set numa node before updating mask
Srikar Dronamraju writes: > * Nathan Lynch [2021-04-07 14:46:24]: >> I don't know. I guess this question just makes me wonder whether powerpc >> needs to have the additional lookup table. How is it different from the >> generic per_cpu numa_node? > > lookup table is for early cpu to node i.e when per_cpu variables may not be > available. This would mean that calling set_numa_node/set_cpu_numa_node from > map_cpu_to_node() may not always be an option, since map_cpu_to_node() does > end up getting called very early in the system. Ah that's right, thanks.
Re: [PATCH 1/1] powerpc/smp: Set numa node before updating mask
* Nathan Lynch [2021-04-07 14:46:24]: > Srikar Dronamraju writes: > > > * Nathan Lynch [2021-04-07 07:19:10]: > > > >> Sorry for the delay in following up here. > >> > > > > No issues. > > > >> >> So I'd suggest that pseries_add_processor() be made to update > >> >> these things when the CPUs are marked present, before onlining them. > >> > > >> > In pseries_add_processor, we are only marking the cpu as present. i.e > >> > I believe numa_setup_cpu() would not have been called. So we may not > >> > have a > >> > way to associate the CPU to the node. Otherwise we will have to call > >> > numa_setup_cpu() or the hcall_vphn. > >> > > >> > We could try calling numa_setup_cpu() immediately after we set the > >> > CPU to be present, but that would be one more extra hcall + I dont know > >> > if > >> > there are any more steps needed before CPU being made present and > >> > associating the CPU to the node. > >> > >> An additional hcall in this path doesn't seem too expensive. > >> > >> > Are we sure the node is already online? > >> > >> I see that dlpar_online_cpu() calls find_and_online_cpu_nid(), so yes I > >> think that's covered. > > > > Okay, > > > > Can we just call set_cpu_numa_node() at the end of map_cpu_to_node(). > > The advantage would be the update to numa_cpu_lookup_table and cpu_to_node > > would happen at the same time and would be in sync. > > I don't know. I guess this question just makes me wonder whether powerpc > needs to have the additional lookup table. How is it different from the > generic per_cpu numa_node? lookup table is for early cpu to node i.e when per_cpu variables may not be available. This would mean that calling set_numa_node/set_cpu_numa_node from map_cpu_to_node() may not always be an option, since map_cpu_to_node() does end up getting called very early in the system. -- Thanks and Regards Srikar Dronamraju
Re: [PATCH 1/1] powerpc/smp: Set numa node before updating mask
Srikar Dronamraju writes: > * Nathan Lynch [2021-04-07 07:19:10]: > >> Sorry for the delay in following up here. >> > > No issues. > >> >> So I'd suggest that pseries_add_processor() be made to update >> >> these things when the CPUs are marked present, before onlining them. >> > >> > In pseries_add_processor, we are only marking the cpu as present. i.e >> > I believe numa_setup_cpu() would not have been called. So we may not have a >> > way to associate the CPU to the node. Otherwise we will have to call >> > numa_setup_cpu() or the hcall_vphn. >> > >> > We could try calling numa_setup_cpu() immediately after we set the >> > CPU to be present, but that would be one more extra hcall + I dont know if >> > there are any more steps needed before CPU being made present and >> > associating the CPU to the node. >> >> An additional hcall in this path doesn't seem too expensive. >> >> > Are we sure the node is already online? >> >> I see that dlpar_online_cpu() calls find_and_online_cpu_nid(), so yes I >> think that's covered. > > Okay, > > Can we just call set_cpu_numa_node() at the end of map_cpu_to_node(). > The advantage would be the update to numa_cpu_lookup_table and cpu_to_node > would happen at the same time and would be in sync. I don't know. I guess this question just makes me wonder whether powerpc needs to have the additional lookup table. How is it different from the generic per_cpu numa_node?
Re: [PATCH 1/1] powerpc/smp: Set numa node before updating mask
* Nathan Lynch [2021-04-07 07:19:10]: > Sorry for the delay in following up here. > No issues. > >> So I'd suggest that pseries_add_processor() be made to update > >> these things when the CPUs are marked present, before onlining them. > > > > In pseries_add_processor, we are only marking the cpu as present. i.e > > I believe numa_setup_cpu() would not have been called. So we may not have a > > way to associate the CPU to the node. Otherwise we will have to call > > numa_setup_cpu() or the hcall_vphn. > > > > We could try calling numa_setup_cpu() immediately after we set the > > CPU to be present, but that would be one more extra hcall + I dont know if > > there are any more steps needed before CPU being made present and > > associating the CPU to the node. > > An additional hcall in this path doesn't seem too expensive. > > > Are we sure the node is already online? > > I see that dlpar_online_cpu() calls find_and_online_cpu_nid(), so yes I > think that's covered. Okay, Can we just call set_cpu_numa_node() at the end of map_cpu_to_node(). The advantage would be the update to numa_cpu_lookup_table and cpu_to_node would happen at the same time and would be in sync. -- Thanks and Regards Srikar Dronamraju
Re: [PATCH 1/1] powerpc/smp: Set numa node before updating mask
Sorry for the delay in following up here. Srikar Dronamraju writes: >> > - set_numa_node(numa_cpu_lookup_table[cpu]); >> > - set_numa_mem(local_memory_node(numa_cpu_lookup_table[cpu])); >> > - >> >> Regardless of your change: at boot time, this set of calls to >> set_numa_node() and set_numa_mem() is redundant, right? Because >> smp_prepare_cpus() has: >> >> for_each_possible_cpu(cpu) { >> ... >> if (cpu_present(cpu)) { >> set_cpu_numa_node(cpu, numa_cpu_lookup_table[cpu]); >> set_cpu_numa_mem(cpu, >> local_memory_node(numa_cpu_lookup_table[cpu])); >> } >> >> I would rather that, when onlining a CPU that happens to have been >> dynamically added after boot, we enter start_secondary() with conditions >> equivalent to those at boot time. Or as close to that as is practical. >> >> So I'd suggest that pseries_add_processor() be made to update >> these things when the CPUs are marked present, before onlining them. > > In pseries_add_processor, we are only marking the cpu as present. i.e > I believe numa_setup_cpu() would not have been called. So we may not have a > way to associate the CPU to the node. Otherwise we will have to call > numa_setup_cpu() or the hcall_vphn. > > We could try calling numa_setup_cpu() immediately after we set the > CPU to be present, but that would be one more extra hcall + I dont know if > there are any more steps needed before CPU being made present and > associating the CPU to the node. An additional hcall in this path doesn't seem too expensive. > Are we sure the node is already online? I see that dlpar_online_cpu() calls find_and_online_cpu_nid(), so yes I think that's covered. > For the numa_mem, we are better of if the zonelists for the node are > built. > > or the other solution would be to call this in map_cpu_to_node(). > Here also we have to be sure the zonelists for the node are already > built.
Re: [PATCH 1/1] powerpc/smp: Set numa node before updating mask
* Nathan Lynch [2021-04-01 17:51:05]: Thanks Nathan for reviewing. > > - set_numa_node(numa_cpu_lookup_table[cpu]); > > - set_numa_mem(local_memory_node(numa_cpu_lookup_table[cpu])); > > - > > Regardless of your change: at boot time, this set of calls to > set_numa_node() and set_numa_mem() is redundant, right? Because > smp_prepare_cpus() has: > > for_each_possible_cpu(cpu) { > ... > if (cpu_present(cpu)) { > set_cpu_numa_node(cpu, numa_cpu_lookup_table[cpu]); > set_cpu_numa_mem(cpu, > local_memory_node(numa_cpu_lookup_table[cpu])); > } > > I would rather that, when onlining a CPU that happens to have been > dynamically added after boot, we enter start_secondary() with conditions > equivalent to those at boot time. Or as close to that as is practical. > > So I'd suggest that pseries_add_processor() be made to update > these things when the CPUs are marked present, before onlining them. In pseries_add_processor, we are only marking the cpu as present. i.e I believe numa_setup_cpu() would not have been called. So we may not have a way to associate the CPU to the node. Otherwise we will have to call numa_setup_cpu() or the hcall_vphn. We could try calling numa_setup_cpu() immediately after we set the CPU to be present, but that would be one more extra hcall + I dont know if there are any more steps needed before CPU being made present and associating the CPU to the node. Are we sure the node is already online? For the numa_mem, we are better of if the zonelists for the node are built. or the other solution would be to call this in map_cpu_to_node(). Here also we have to be sure the zonelists for the node are already built. -- Thanks and Regards Srikar Dronamraju
Re: [PATCH 1/1] powerpc/smp: Set numa node before updating mask
Hi Srikar, Thanks for figuring this out. Srikar Dronamraju writes: > > Some of the per-CPU masks use cpu_cpu_mask as a filter to limit the search > for related CPUs. On a dlpar add of a CPU, update cpu_cpu_mask before > updating the per-CPU masks. This will ensure the cpu_cpu_mask is updated > correctly before its used in setting the masks. Setting the numa_node will > ensure that when cpu_cpu_mask() gets called, the correct node number is > used. This code movement helped fix the above call trace. > > > diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c > index 5a4d59a1070d..1a99d75679a8 100644 > --- a/arch/powerpc/kernel/smp.c > +++ b/arch/powerpc/kernel/smp.c > @@ -1521,6 +1521,9 @@ void start_secondary(void *unused) > > vdso_getcpu_init(); > #endif > + set_numa_node(numa_cpu_lookup_table[cpu]); > + set_numa_mem(local_memory_node(numa_cpu_lookup_table[cpu])); > + > /* Update topology CPU masks */ > add_cpu_to_masks(cpu); > > @@ -1539,9 +1542,6 @@ void start_secondary(void *unused) > shared_caches = true; > } > > - set_numa_node(numa_cpu_lookup_table[cpu]); > - set_numa_mem(local_memory_node(numa_cpu_lookup_table[cpu])); > - Regardless of your change: at boot time, this set of calls to set_numa_node() and set_numa_mem() is redundant, right? Because smp_prepare_cpus() has: for_each_possible_cpu(cpu) { ... if (cpu_present(cpu)) { set_cpu_numa_node(cpu, numa_cpu_lookup_table[cpu]); set_cpu_numa_mem(cpu, local_memory_node(numa_cpu_lookup_table[cpu])); } I would rather that, when onlining a CPU that happens to have been dynamically added after boot, we enter start_secondary() with conditions equivalent to those at boot time. Or as close to that as is practical. So I'd suggest that pseries_add_processor() be made to update these things when the CPUs are marked present, before onlining them.
[PATCH 1/1] powerpc/smp: Set numa node before updating mask
Geethika reported a trace when doing a dlpar CPU add. [ cut here ] WARNING: CPU: 152 PID: 1134 at kernel/sched/topology.c:2057 CPU: 152 PID: 1134 Comm: kworker/152:1 Not tainted 5.12.0-rc5-master #5 Workqueue: events cpuset_hotplug_workfn NIP: c01cfc14 LR: c01cfc10 CTR: c07e3420 REGS: c034a08eb260 TRAP: 0700 Not tainted (5.12.0-rc5-master+) MSR: 80029033 CR: 28828422 XER: 0020 CFAR: c01fd888 IRQMASK: 0 #012GPR00: c01cfc10 c034a08eb500 c1f35400 0027 #012GPR04: c035abaa8010 c035abb30a00 0027 c035abaa8018 #012GPR08: 0023 c035abaaef48 0035aa54 c035a49dffe8 #012GPR12: 28828424 c035bf1a1c80 0497 0004 #012GPR16: c347a258 0140 c203d468 c1a1a490 #012GPR20: c1f9c160 c034adf70920 c034aec9fd20 000100087bd3 #012GPR24: 000100087bd3 c035b3de09f8 0030 c035b3de09f8 #012GPR28: 0028 c347a280 c034aefe0b00 c10a2a68 NIP [c01cfc14] build_sched_domains+0x6a4/0x1500 LR [c01cfc10] build_sched_domains+0x6a0/0x1500 Call Trace: [c034a08eb500] [c01cfc10] build_sched_domains+0x6a0/0x1500 (unreliable) [c034a08eb640] [c01d1e6c] partition_sched_domains_locked+0x3ec/0x530 [c034a08eb6e0] [c02936d4] rebuild_sched_domains_locked+0x524/0xbf0 [c034a08eb7e0] [c0296bb0] rebuild_sched_domains+0x40/0x70 [c034a08eb810] [c0296e74] cpuset_hotplug_workfn+0x294/0xe20 [c034a08ebc30] [c0178dd0] process_one_work+0x300/0x670 [c034a08ebd10] [c01791b8] worker_thread+0x78/0x520 [c034a08ebda0] [c0185090] kthread+0x1a0/0x1b0 [c034a08ebe10] [c000ccec] ret_from_kernel_thread+0x5c/0x70 Instruction dump: 7d2903a6 4e800421 e8410018 7f67db78 7fe6fb78 7f45d378 7f84e378 7c681b78 3c62ff1a 3863c6f8 4802dc35 6000 <0fe0> 3920fff4 f9210070 e86100a0 ---[ end trace 532d9066d3d4d7ec ]--- Some of the per-CPU masks use cpu_cpu_mask as a filter to limit the search for related CPUs. On a dlpar add of a CPU, update cpu_cpu_mask before updating the per-CPU masks. This will ensure the cpu_cpu_mask is updated correctly before its used in setting the masks. Setting the numa_node will ensure that when cpu_cpu_mask() gets called, the correct node number is used. This code movement helped fix the above call trace. Cc: linuxppc-dev@lists.ozlabs.org Cc: Nathan Lynch Cc: Michael Ellerman Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Valentin Schneider Cc: Scott Cheloha Cc: Gautham R Shenoy Cc: Geetika Moolchandani Reported-by: Geetika Moolchandani Signed-off-by: Srikar Dronamraju --- arch/powerpc/kernel/smp.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index 5a4d59a1070d..1a99d75679a8 100644 --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c @@ -1521,6 +1521,9 @@ void start_secondary(void *unused) vdso_getcpu_init(); #endif + set_numa_node(numa_cpu_lookup_table[cpu]); + set_numa_mem(local_memory_node(numa_cpu_lookup_table[cpu])); + /* Update topology CPU masks */ add_cpu_to_masks(cpu); @@ -1539,9 +1542,6 @@ void start_secondary(void *unused) shared_caches = true; } - set_numa_node(numa_cpu_lookup_table[cpu]); - set_numa_mem(local_memory_node(numa_cpu_lookup_table[cpu])); - smp_wmb(); notify_cpu_starting(cpu); set_cpu_online(cpu, true); -- 2.27.0