Re: [PATCH v3 0/7] Statsfs: a new ram-based file system for Linux kernel statistics
On 28/05/20 00:21, David Ahern wrote: > On 5/27/20 3:07 PM, Paolo Bonzini wrote: >> I see what you meant now. statsfs can also be used to enumerate objects >> if one is so inclined (with the prototype in patch 7, for example, each >> network interface becomes a directory). > > there are many use cases that have 100's to 1000's of network devices. > Having a sysfs entry per device already bloats memory usage for these > use cases; another filesystem with an entry per device makes that worse. > Really the wrong direction for large scale systems. Hi David, IMO the important part for now is having a flexible kernel API for exposing statistics across multiple subsystems, so that they can be harvested in an efficient way. The userspace API is secondary, and multiple APIs can be added to cater for different use cases. For example, as of the first five patches the memory usage is the same as what is now in the mainline kernel, since all the patchset does is take existing debugfs inodes and move them to statsfs. I agree that, if the concept is extended to the whole kernel, scalability and memory usage become an issue; and indeed, the long-term plan is to support a binary format that is actually _more_ efficient than the status quo for large scale systems. In the meantime, the new filesystem can be disabled (see the difference between "STATS_FS" and "STATS_FS_API") if it imposes undesirable overhead. Thanks, Paolo
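[The split Paolo describes between the in-kernel API and the filesystem could be pictured as a Kconfig fragment like the one below. The symbol names come from the thread; the dependency structure shown is an illustrative assumption, not the series' actual Kconfig.]

```kconfig
config STATS_FS_API
	bool "statsfs in-kernel API"
	help
	  In-kernel API for subsystems to register statistics sources
	  and values.

config STATS_FS
	bool "statsfs filesystem (/sys/kernel/stats)"
	depends on STATS_FS_API
	help
	  Expose registered statistics as files. Large-scale systems
	  that cannot afford per-object inodes can leave this disabled
	  while subsystems keep using the API.
```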
Re: [PATCH v4 09/22] powerpc/kvm/book3s: Add helper to walk partition scoped linux page table.
On Tue, May 05, 2020 at 12:47:16PM +0530, Aneesh Kumar K.V wrote: > The locking rules for walking partition scoped table is different from process > scoped table. Hence add a helper for secondary linux page table walk and also > add check whether we are holding the right locks. This patch is causing new warnings to appear when testing migration, like this: [ 142.090159] [ cut here ] [ 142.090160] find_kvm_secondary_pte called with kvm mmu_lock not held [ 142.090176] WARNING: CPU: 23 PID: 5341 at arch/powerpc/include/asm/kvm_book3s_64.h:644 kvmppc_hv_get_dirty_log_radix+0x2e4/0x340 [kvm_hv] [ 142.090177] Modules linked in: xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables bpfilter overlay binfmt_misc input_leds raid_class scsi_transport_sas sch_fq_codel sunrpc kvm_hv kvm [ 142.090188] CPU: 23 PID: 5341 Comm: qemu-system-ppc Tainted: GW 5.7.0-rc5-kvm-00211-g9ccf10d6d088 #432 [ 142.090189] NIP: c00800fe848c LR: c00800fe8488 CTR: [ 142.090190] REGS: c01e19f077e0 TRAP: 0700 Tainted: GW (5.7.0-rc5-kvm-00211-g9ccf10d6d088) [ 142.090191] MSR: 90029033 CR: 4422 XER: 2004 [ 142.090196] CFAR: c012f5ac IRQMASK: 0 GPR00: c00800fe8488 c01e19f07a70 c00800ffe200 0039 GPR04: 0001 c01ffc8b4900 00018840 0007 GPR08: 0003 0001 0007 0001 GPR12: 2000 c01fff6d9400 00011f884678 7fff70b7 GPR16: 7fff7137cb90 7fff7dcb4410 0001 GPR20: 0ffe 0001 GPR24: 8000 0001 c01e1f67e600 c01e1fd82410 GPR28: 1000 c01e2e41 0fff 0ffe [ 142.090217] NIP [c00800fe848c] kvmppc_hv_get_dirty_log_radix+0x2e4/0x340 [kvm_hv] [ 142.090223] LR [c00800fe8488] kvmppc_hv_get_dirty_log_radix+0x2e0/0x340 [kvm_hv] [ 142.090224] Call Trace: [ 142.090230] [c01e19f07a70] [c00800fe8488] kvmppc_hv_get_dirty_log_radix+0x2e0/0x340 [kvm_hv] (unreliable) [ 142.090236] [c01e19f07b50] [c00800fd42e4] kvm_vm_ioctl_get_dirty_log_hv+0x33c/0x3c0 [kvm_hv] [ 142.090292] [c01e19f07be0] [c00800eea878] kvm_vm_ioctl_get_dirty_log+0x30/0x50 [kvm] [ 142.090300] 
[c01e19f07c00] [c00800edc818] kvm_vm_ioctl+0x2b0/0xc00 [kvm] [ 142.090302] [c01e19f07d50] [c046e148] ksys_ioctl+0xf8/0x150 [ 142.090305] [c01e19f07da0] [c046e1c8] sys_ioctl+0x28/0x80 [ 142.090307] [c01e19f07dc0] [c003652c] system_call_exception+0x16c/0x240 [ 142.090309] [c01e19f07e20] [c000d070] system_call_common+0xf0/0x278 [ 142.090310] Instruction dump: [ 142.090312] 7d3a512a 4200ffd0 7ffefb78 4bfffdc4 6000 3c82 e8848468 3c62 [ 142.090317] e86384a8 38840010 4800673d e8410018 <0fe0> 4bfffdd4 6000 6000 [ 142.090322] ---[ end trace 619d45057b6919e0 ]--- Indeed, kvm_radix_test_clear_dirty() tests the PTE dirty bit locklessly, and only takes the kvm->mmu_lock once it finds a dirty PTE. I think that is important for performance, since on any given scan of the guest real address space we may only find a small proportion of the guest pages to be dirty. Are you now relying on the kvm->mmu_lock to protect the existence of the PTEs, or just their content? Paul.
Re: [PATCH] powerpc/bpf: Enable bpf_probe_read{, str}() on powerpc again
Petr Mladek writes: > The commit 0ebeea8ca8a4d1d453a ("bpf: Restrict bpf_probe_read{, str}() only > to archs where they work") meant that the bpf_probe_read{, str}() functions > were no longer available on architectures where the same logical address > might have different content in kernel and user memory mapping. These > architectures should use the probe_read_{user,kernel}_str helpers instead. > > For backward compatibility, the problematic functions are still available > on architectures where the user and kernel address spaces are not > overlapping. This is indicated by CONFIG_ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE. > > At the moment, these backward compatible functions are enabled only > on x86_64, arm, and arm64. Let's do it also on powerpc, which has > a non-overlapping address space as well. > > Signed-off-by: Petr Mladek This seems like it should have a Fixes: tag and go into v5.7? cheers > diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig > index d13b5328ca10..b29d7cb38368 100644 > --- a/arch/powerpc/Kconfig > +++ b/arch/powerpc/Kconfig > @@ -126,6 +126,7 @@ config PPC > select ARCH_HAS_MMIOWB if PPC64 > select ARCH_HAS_PHYS_TO_DMA > select ARCH_HAS_PMEM_API > + select ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE > select ARCH_HAS_PTE_DEVMAP if PPC_BOOK3S_64 > select ARCH_HAS_PTE_SPECIAL > select ARCH_HAS_MEMBARRIER_CALLBACKS > -- > 2.26.2
Re: [PATCH] powerpc/64: Remove unused generic_secondary_thread_init()
On Tue, May 26, 2020 at 4:36 PM Michael Ellerman wrote: > > The last caller was removed in 2014 in commit fb5a515704d7 ("powerpc: > Remove platforms/wsp and associated pieces"). > > Once generic_secondary_thread_init() is removed there are no longer > any uses of book3e_secondary_thread_init() or > generic_secondary_common_init so remove them too. > > Signed-off-by: Michael Ellerman > --- > arch/powerpc/include/asm/smp.h | 1 - > arch/powerpc/kernel/exceptions-64e.S | 4 > arch/powerpc/kernel/head_64.S| 18 -- > 3 files changed, 23 deletions(-) > > diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h > index 49a25e2400f2..81a49566ccd8 100644 > --- a/arch/powerpc/include/asm/smp.h > +++ b/arch/powerpc/include/asm/smp.h > @@ -243,7 +243,6 @@ extern void arch_send_call_function_ipi_mask(const struct > cpumask *mask); > * 64-bit but defining them all here doesn't harm > */ > extern void generic_secondary_smp_init(void); > -extern void generic_secondary_thread_init(void); > extern unsigned long __secondary_hold_spinloop; > extern unsigned long __secondary_hold_acknowledge; > extern char __secondary_hold; > diff --git a/arch/powerpc/kernel/exceptions-64e.S > b/arch/powerpc/kernel/exceptions-64e.S > index d9ed79415100..9f9e8686798b 100644 > --- a/arch/powerpc/kernel/exceptions-64e.S > +++ b/arch/powerpc/kernel/exceptions-64e.S > @@ -1814,10 +1814,6 @@ _GLOBAL(book3e_secondary_core_init) > 1: mtlrr28 > blr > > -_GLOBAL(book3e_secondary_thread_init) > - mflrr28 > - b 3b > - > .globl init_core_book3e > init_core_book3e: > /* Establish the interrupt vector base */ > diff --git a/arch/powerpc/kernel/head_64.S b/arch/powerpc/kernel/head_64.S > index 0e05a9a47a4b..4ae2c18c5fc6 100644 > --- a/arch/powerpc/kernel/head_64.S > +++ b/arch/powerpc/kernel/head_64.S > @@ -302,23 +302,6 @@ _GLOBAL(fsl_secondary_thread_init) > 1: > #endif Nothing directly calls generic_secondary_thread_init() but I think fsl_secondary_thread_init() which is directly above "falls 
through" into it. fsl_secondary_thread_init() still has callers. > > -_GLOBAL(generic_secondary_thread_init) > - mr r24,r3 > - > - /* turn on 64-bit mode */ > - bl enable_64b_mode > - > - /* get a valid TOC pointer, wherever we're mapped at */ > - bl relative_toc > - tovirt(r2,r2) > - > -#ifdef CONFIG_PPC_BOOK3E > - /* Book3E initialization */ > - mr r3,r24 > - bl book3e_secondary_thread_init > -#endif > - b generic_secondary_common_init > - > /* > * On pSeries and most other platforms, secondary processors spin > * in the following code. > @@ -385,7 +368,6 @@ _GLOBAL(generic_secondary_smp_init) > 20: > #endif > > -generic_secondary_common_init: > /* Set up a paca value for this processor. Since we have the > * physical cpu id in r24, we need to search the pacas to find > * which logical id maps to our physical one. > -- > 2.25.1 >
Re: [v3 2/2] dts: ppc: t1024rdb: remove interrupts property
On Tue, May 26, 2020 at 10:52 PM Biwen Li wrote: > > From: Biwen Li > > Since the interrupt pin for RTC DS1339 is not connected > to the CPU on T1024RDB, remove the interrupt property > from the device tree. > > This also fixes the following warning for hwclock.util-linux: > $ hwclock.util-linux > hwclock.util-linux: select() to /dev/rtc0 > to wait for clock tick timed out > > Signed-off-by: Biwen Li Acked-by: Li Yang > --- > arch/powerpc/boot/dts/fsl/t1024rdb.dts | 1 - > 1 file changed, 1 deletion(-) > > diff --git a/arch/powerpc/boot/dts/fsl/t1024rdb.dts > b/arch/powerpc/boot/dts/fsl/t1024rdb.dts > index 645caff98ed1..605ceec66af3 100644 > --- a/arch/powerpc/boot/dts/fsl/t1024rdb.dts > +++ b/arch/powerpc/boot/dts/fsl/t1024rdb.dts > @@ -161,7 +161,6 @@ > rtc@68 { > compatible = "dallas,ds1339"; > reg = <0x68>; > - interrupts = <0x1 0x1 0 0>; > }; > }; > > -- > 2.17.1 >
Re: [v3 1/2] dts: ppc: t4240rdb: remove interrupts property
On Tue, May 26, 2020 at 10:49 PM Biwen Li wrote: > > From: Biwen Li > > Since the interrupt pin for RTC DS1374 is not connected > to the CPU on T4240RDB, remove the interrupt property > from the device tree. > > This also fixes the following warning for hwclock.util-linux: > $ hwclock.util-linux > hwclock.util-linux: select() to /dev/rtc0 > to wait for clock tick timed out > > Signed-off-by: Biwen Li Acked-by: Li Yang > --- > arch/powerpc/boot/dts/fsl/t4240rdb.dts | 1 - > 1 file changed, 1 deletion(-) > > diff --git a/arch/powerpc/boot/dts/fsl/t4240rdb.dts > b/arch/powerpc/boot/dts/fsl/t4240rdb.dts > index a56a705d41f7..145896f2eef6 100644 > --- a/arch/powerpc/boot/dts/fsl/t4240rdb.dts > +++ b/arch/powerpc/boot/dts/fsl/t4240rdb.dts > @@ -144,7 +144,6 @@ > rtc@68 { > compatible = "dallas,ds1374"; > reg = <0x68>; > - interrupts = <0x1 0x1 0 0>; > }; > }; > > -- > 2.17.1 >
Re: [PATCH v3 0/7] Statsfs: a new ram-based file system for Linux kernel statistics
On 5/27/20 3:07 PM, Paolo Bonzini wrote: > I see what you meant now. statsfs can also be used to enumerate objects > if one is so inclined (with the prototype in patch 7, for example, each > network interface becomes a directory). There are many use cases that have 100's to 1000's of network devices. Having a sysfs entry per device already bloats memory usage for these use cases; another filesystem with an entry per device makes that worse. Really the wrong direction for large scale systems.
Re: [PATCH v3 0/7] Statsfs: a new ram-based file system for Linux kernel statistics
On 27/05/20 23:27, Jakub Kicinski wrote: > On Wed, 27 May 2020 23:07:53 +0200 Paolo Bonzini wrote: >>> Again, I have little KVM knowledge, but BPF also uses a fd-based API, >>> and carries stats over the same syscall interface. >> >> Can BPF stats (for BPF scripts created by whatever process is running in >> the system) be collected by an external daemon that does not have access >> to the file descriptor? For KVM it's of secondary importance to gather >> stats in the program; it can be nice to have and we are thinking of a >> way to export the stats over the fd-based API, but it's less useful than >> system-wide monitoring. Perhaps this is a difference between the two. > > Yes, check out bpftool prog list (bpftool code is under tools/bpf/ in > the kernel tree). BPF statistics are under a static key, so you may not > see any on your system. My system shows e.g.: > > 81: kprobe name abc tag cefaa9376bdaae75 gpl run_time_ns 80941 run_cnt 152 > loaded_at 2020-05-26T13:00:24-0700 uid 0 > xlated 512B jited 307B memlock 4096B map_ids 66,64 > btf_id 16 > > In this example run_time_ns and run_cnt are stats. > > The first number on the left is the program ID. BPF has an IDA, and > each object gets an integer id. So admin (or CAP_BPF, I think) can > iterate over the ids and open fds to objects of interest. Got it, thanks. But then "I'd hope that whatever daemon collects [BPF] stats doesn't run as root". :) >> Another case where stats and configuration are separate is CPUs, where >> CPU enumeration is done in sysfs but statistics are exposed in various >> procfs files such as /proc/interrupts and /proc/stats. > True, but I'm guessing everyone is just okay living with the legacy > procfs format there. Otherwise I'd guess the stats would have been added > to sysfs. I'd be curious to hear the full story there.
Yeah, it's a chicken-and-egg problem in that there's no good place in sysfs to put statistics right now, which is part of what this filesystem is trying to solve (the other part is the API). You can read more about Google's use case at http://lkml.iu.edu/hypermail/linux/kernel/2005.0/08056.html; it includes both network and interrupt stats, and it's something that they've been using in production for quite some time. We'd like the statsfs API to be the basis for including something akin to that in Linux. To be honest, it's unlikely that Emanuele (who has just finished his internship at Red Hat) and I will pursue the networking stats further than the demo patch at the end of this series. However, we're trying to make sure that the API is at least ready for that, and to probe whether any developers from other subsystems would be interested in using statsfs. So thanks for bringing your point of view! Thanks, Paolo
Re: [PATCH v3 0/7] Statsfs: a new ram-based file system for Linux kernel statistics
On Wed, 27 May 2020 23:07:53 +0200 Paolo Bonzini wrote: > > Again, I have little KVM knowledge, but BPF also uses a fd-based API, > > and carries stats over the same syscall interface. > > Can BPF stats (for BPF scripts created by whatever process is running in > the system) be collected by an external daemon that does not have access > to the file descriptor? For KVM it's of secondary importance to gather > stats in the program; it can be nice to have and we are thinking of a > way to export the stats over the fd-based API, but it's less useful than > system-wide monitoring. Perhaps this is a difference between the two. Yes, check out bpftool prog list (bpftool code is under tools/bpf/ in the kernel tree). BPF statistics are under a static key, so you may not see any on your system. My system shows e.g.: 81: kprobe name abc tag cefaa9376bdaae75 gpl run_time_ns 80941 run_cnt 152 loaded_at 2020-05-26T13:00:24-0700 uid 0 xlated 512B jited 307B memlock 4096B map_ids 66,64 btf_id 16 In this example run_time_ns and run_cnt are stats. The first number on the left is the program ID. BPF has an IDA, and each object gets an integer id. So admin (or CAP_BPF, I think) can iterate over the ids and open fds to objects of interest. > Another case where stats and configuration are separate is CPUs, where > CPU enumeration is done in sysfs but statistics are exposed in various > procfs files such as /proc/interrupts and /proc/stats. True, but I'm guessing everyone is just okay living with the legacy procfs format there. Otherwise I'd guess the stats would have been added to sysfs. I'd be curious to hear the full story there.
Re: [PATCH v3 6/8] mm: parallelize deferred_init_memmap()
On Wed, May 27, 2020 at 10:37 AM Daniel Jordan wrote: > > Deferred struct page init is a significant bottleneck in kernel boot. > Optimizing it maximizes availability for large-memory systems and allows > spinning up short-lived VMs as needed without having to leave them > running. It also benefits bare metal machines hosting VMs that are > sensitive to downtime. In projects such as VMM Fast Restart[1], where > guest state is preserved across kexec reboot, it helps prevent > application and network timeouts in the guests. > > Multithread to take full advantage of system memory bandwidth. > > The maximum number of threads is capped at the number of CPUs on the > node because speedups always improve with additional threads on every > system tested, and at this phase of boot, the system is otherwise idle > and waiting on page init to finish. > > Helper threads operate on section-aligned ranges to both avoid false > sharing when setting the pageblock's migrate type and to avoid accessing > uninitialized buddy pages, though max order alignment is enough for the > latter. > > The minimum chunk size is also a section. There was benefit to using > multiple threads even on relatively small memory (1G) systems, and this > is the smallest size that the alignment allows. > > The time (milliseconds) is the slowest node to initialize since boot > blocks until all nodes finish. intel_pstate is loaded in active mode > without hwp and with turbo enabled, and intel_idle is active as well. 
>
> Intel(R) Xeon(R) Platinum 8167M CPU @ 2.00GHz (Skylake, bare metal)
>   2 nodes * 26 cores * 2 threads = 104 CPUs
>   384G/node = 768G memory
>
>                  kernel boot                 deferred init
> node% (thr)   speedup  time_ms (stdev)   speedup  time_ms (stdev)
>      (  0)        --   4089.7 (   8.1)       --   1785.7 (   7.6)
>   2% (  1)      1.7%   4019.3 (   1.5)     3.8%   1717.7 (  11.8)
>  12% (  6)     34.9%   2662.7 (   2.9)    79.9%    359.3 (   0.6)
>  25% ( 13)     39.9%   2459.0 (   3.6)    91.2%    157.0 (   0.0)
>  37% ( 19)     39.2%   2485.0 (  29.7)    90.4%    172.0 (  28.6)
>  50% ( 26)     39.3%   2482.7 (  25.7)    90.3%    173.7 (  30.0)
>  75% ( 39)     39.0%   2495.7 (   5.5)    89.4%    190.0 (   1.0)
> 100% ( 52)     40.2%   2443.7 (   3.8)    92.3%    138.0 (   1.0)
>
> Intel(R) Xeon(R) CPU E5-2699C v4 @ 2.20GHz (Broadwell, kvm guest)
>   1 node * 16 cores * 2 threads = 32 CPUs
>   192G/node = 192G memory
>
>                  kernel boot                 deferred init
> node% (thr)   speedup  time_ms (stdev)   speedup  time_ms (stdev)
>      (  0)        --   1988.7 (   9.6)       --   1096.0 (  11.5)
>   3% (  1)      1.1%   1967.0 (  17.6)     0.3%   1092.7 (  11.0)
>  12% (  4)     41.1%   1170.3 (  14.2)    73.8%    287.0 (   3.6)
>  25% (  8)     47.1%   1052.7 (  21.9)    83.9%    177.0 (  13.5)
>  38% ( 12)     48.9%   1016.3 (  12.1)    86.8%    144.7 (   1.5)
>  50% ( 16)     48.9%   1015.7 (   8.1)    87.8%    134.0 (   4.4)
>  75% ( 24)     49.1%   1012.3 (   3.1)    88.1%    130.3 (   2.3)
> 100% ( 32)     49.5%   1004.0 (   5.3)    88.5%    125.7 (   2.1)
>
> Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz (Haswell, bare metal)
>   2 nodes * 18 cores * 2 threads = 72 CPUs
>   128G/node = 256G memory
>
>                  kernel boot                 deferred init
> node% (thr)   speedup  time_ms (stdev)   speedup  time_ms (stdev)
>      (  0)        --   1680.0 (   4.6)       --    627.0 (   4.0)
>   3% (  1)      0.3%   1675.7 (   4.5)    -0.2%    628.0 (   3.6)
>  11% (  4)     25.6%   1250.7 (   2.1)    67.9%    201.0 (   0.0)
>  25% (  9)     30.7%   1164.0 (  17.3)    81.8%    114.3 (  17.7)
>  36% ( 13)     31.4%   1152.7 (  10.8)    84.0%    100.3 (  17.9)
>  50% ( 18)     31.5%   1150.7 (   9.3)    83.9%    101.0 (  14.1)
>  75% ( 27)     31.7%   1148.0 (   5.6)    84.5%     97.3 (   6.4)
> 100% ( 36)     32.0%   1142.3 (   4.0)    85.6%     90.0 (   1.0)
>
> AMD EPYC 7551 32-Core Processor (Zen, kvm guest)
>   1 node * 8 cores * 2 threads = 16 CPUs
>   64G/node = 64G memory
>
>                  kernel boot                 deferred init
> node% (thr)   speedup  time_ms (stdev)   speedup  time_ms (stdev)
>      (  0)        --   1029.3 (  25.1)       --    240.7 (   1.5)
>   6% (  1)     -0.6%   1036.0 (   7.8)    -2.2%    246.0 (   0.0)
>  12% (  2)     11.8%    907.7 (   8.6)    44.7%    133.0 (   1.0)
>  25% (  4)     13.9%    886.0 (  10.6)
Re: [PATCH v3 5/8] mm: don't track number of pages during deferred initialization
On Wed, May 27, 2020 at 10:37 AM Daniel Jordan wrote: > > Deferred page init used to report the number of pages initialized: > > node 0 initialised, 32439114 pages in 97ms > > Tracking this makes the code more complicated when using multiple > threads. Given that the statistic probably has limited value, > especially since a zone grows on demand so that the page count can vary, > just remove it. > > The boot message now looks like > > node 0 deferred pages initialised in 97ms > > Signed-off-by: Daniel Jordan > Suggested-by: Alexander Duyck This looks good to me. Reviewed-by: Alexander Duyck > --- > mm/page_alloc.c | 8 > 1 file changed, 4 insertions(+), 4 deletions(-) > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index d0c0d9364aa6d..d64f3027fdfa6 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -1819,7 +1819,7 @@ static int __init deferred_init_memmap(void *data) > { > pg_data_t *pgdat = data; > const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id); > - unsigned long spfn = 0, epfn = 0, nr_pages = 0; > + unsigned long spfn = 0, epfn = 0; > unsigned long first_init_pfn, flags; > unsigned long start = jiffies; > struct zone *zone; > @@ -1868,15 +1868,15 @@ static int __init deferred_init_memmap(void *data) > * allocator. > */ > while (spfn < epfn) { > - nr_pages += deferred_init_maxorder(, zone, , ); > + deferred_init_maxorder(, zone, , ); > cond_resched(); > } > zone_empty: > /* Sanity check that the next zone really is unpopulated */ > WARN_ON(++zid < MAX_NR_ZONES && populated_zone(++zone)); > > - pr_info("node %d initialised, %lu pages in %ums\n", > - pgdat->node_id, nr_pages, jiffies_to_msecs(jiffies - start)); > + pr_info("node %d deferred pages initialised in %ums\n", > + pgdat->node_id, jiffies_to_msecs(jiffies - start)); > > pgdat_init_report_one_done(); > return 0; > -- > 2.26.2 > >
Re: [PATCH v3 0/7] Statsfs: a new ram-based file system for Linux kernel statistics
On Tue, May 26, 2020 at 01:03:10PM +0200, Emanuele Giuseppe Esposito wrote: > There is currently no common way for Linux kernel subsystems to expose > statistics to userspace shared throughout the Linux kernel; subsystems have > to take care of gathering and displaying statistics by themselves, for > example in the form of files in debugfs. For example KVM has its own code > section that takes care of this in virt/kvm/kvm_main.c, where it sets up > debugfs handlers for displaying values and aggregating them from various > subfolders to obtain information about the system state (i.e. displaying > the total number of exits, calculated by summing all exits of all cpus of > all running virtual machines). > > Allowing each section of the kernel to do so has two disadvantages. First, > it will introduce redundant code. Second, debugfs is anyway not the right > place for statistics (for example it is affected by lockdown) > > In this patch series I introduce statsfs, a synthetic ram-based virtual > filesystem that takes care of gathering and displaying statistics for the > Linux kernel subsystems. > > The file system is mounted on /sys/kernel/stats and would be already used > by kvm. Statsfs was initially introduced by Paolo Bonzini [1]. > > Statsfs offers a generic and stable API, allowing any kind of > directory/file organization and supporting multiple kind of aggregations > (not only sum, but also average, max, min and count_zero) and data types > (boolean, unsigned/signed and custom types). The implementation, which is > a generalization of KVM’s debugfs statistics code, takes care of gathering > and displaying information at run time; users only need to specify the > values to be included in each source. > > Statsfs would also be a different mountpoint from debugfs, and would not > suffer from limited access due to the security lock down patches. Its main > function is to display each statistics as a file in the desired folder > hierarchy defined through the API. 
Statsfs files can be read, and possibly > cleared if their file mode allows it. > > Statsfs has two main components: the public API defined by > include/linux/statsfs.h, and the virtual file system which should end up in > /sys/kernel/stats. > Hi Emanuele > The API has two main elements, values and sources. Kernel subsystems like > KVM can use the API to create a source, add child sources/values/aggregates > and register it to the root source (that on the virtual fs would be > /sys/kernel/statsfs). Another issue I see with networking is that statistic counters can be dynamic. They can come and go. One of the drivers I work on has extra statistics available when a fibre interface is used, compared to a copper interface. And this happens at run time. The netlink API has no problems with this. It is a snapshot of what counters are currently available. There is no state in the API. In my humble opinion, networking is unlikely to adopt your approach. You probably want to look around for other subsystems which have statistics, and see if you can cover their requirements, and get them on board. Andrew
Re: [PATCH v3 0/7] Statsfs: a new ram-based file system for Linux kernel statistics
On 27/05/20 22:23, Jakub Kicinski wrote: > On Wed, 27 May 2020 15:14:41 +0200 Emanuele Giuseppe Esposito wrote: >> Regarding the config, as I said the idea is to gather multiple >> subsystems' statistics, therefore there wouldn't be a single >> configuration method like in netlink. >> For example in kvm there are file descriptors for configuration, and >> creating them requires no privilege, contrary to the network interfaces. > > Enumerating networking interfaces, addresses, and almost all of the > configuration requires no extra privilege. In fact I'd hope that > whatever daemon collects network stats doesn't run as root :) > > I think enumerating objects is of primary importance, and statistics > of those objects are subordinate. I see what you meant now. statsfs can also be used to enumerate objects if one is so inclined (with the prototype in patch 7, for example, each network interface becomes a directory). > Again, I have little KVM knowledge, but BPF also uses a fd-based API, > and carries stats over the same syscall interface. Can BPF stats (for BPF scripts created by whatever process is running in the system) be collected by an external daemon that does not have access to the file descriptor? For KVM it's of secondary importance to gather stats in the program; it can be nice to have and we are thinking of a way to export the stats over the fd-based API, but it's less useful than system-wide monitoring. Perhaps this is a difference between the two. Another case where stats and configuration are separate is CPUs, where CPU enumeration is done in sysfs but statistics are exposed in various procfs files such as /proc/interrupts and /proc/stats. Thanks, Paolo
Re: [PATCH v3 0/7] Statsfs: a new ram-based file system for Linux kernel statistics
On Wed, 27 May 2020 15:14:41 +0200 Emanuele Giuseppe Esposito wrote: > Regarding the config, as I said the idea is to gather multiple > subsystems' statistics, therefore there wouldn't be a single > configuration method like in netlink. > For example in kvm there are file descriptors for configuration, and > creating them requires no privilege, contrary to the network interfaces. Enumerating networking interfaces, addresses, and almost all of the configuration requires no extra privilege. In fact I'd hope that whatever daemon collects network stats doesn't run as root :) I think enumerating objects is of primary importance, and statistics of those objects are subordinate. Again, I have little KVM knowledge, but BPF also uses a fd-based API, and carries stats over the same syscall interface.
Re: [PATCH v8 1/5] powerpc: Document details on H_SCM_HEALTH hcall
On Tue, May 26, 2020 at 9:13 PM Vaibhav Jain wrote: > > Add documentation to 'papr_hcalls.rst' describing the bitmap flags > that are returned from H_SCM_HEALTH hcall as per the PAPR-SCM > specification. > Please do a global s/SCM/PMEM/ or s/SCM/NVDIMM/. It's unfortunate that we already have 2 ways to describe persistent memory devices, let's not perpetuate a third so that "grep" has a chance to find interrelated code across architectures. Other than that this looks good to me. > Cc: "Aneesh Kumar K . V" > Cc: Dan Williams > Cc: Michael Ellerman > Cc: Ira Weiny > Signed-off-by: Vaibhav Jain > --- > Changelog: > v7..v8: > * Added a clarification on bit-ordering of Health Bitmap > > Resend: > * None > > v6..v7: > * None > > v5..v6: > * New patch in the series > --- > Documentation/powerpc/papr_hcalls.rst | 45 --- > 1 file changed, 41 insertions(+), 4 deletions(-) > > diff --git a/Documentation/powerpc/papr_hcalls.rst > b/Documentation/powerpc/papr_hcalls.rst > index 3493631a60f8..45063f305813 100644 > --- a/Documentation/powerpc/papr_hcalls.rst > +++ b/Documentation/powerpc/papr_hcalls.rst > @@ -220,13 +220,50 @@ from the LPAR memory. > **H_SCM_HEALTH** > > | Input: drcIndex > -| Out: *health-bitmap, health-bit-valid-bitmap* > +| Out: *health-bitmap (r4), health-bit-valid-bitmap (r5)* > | Return Value: *H_Success, H_Parameter, H_Hardware* > > Given a DRC Index return the info on predictive failure and overall health of > -the NVDIMM. The asserted bits in the health-bitmap indicate a single > predictive > -failure and health-bit-valid-bitmap indicate which bits in health-bitmap are > -valid. > +the NVDIMM. The asserted bits in the health-bitmap indicate one or more > states > +(described in table below) of the NVDIMM and health-bit-valid-bitmap indicate > +which bits in health-bitmap are valid. The bits are reported in > +reverse bit ordering for example a value of 0xC400 > +indicates bits 0, 1, and 5 are valid. 
> + > +Health Bitmap Flags: > + > ++--+---+ > +| Bit | Definition > | > ++==+===+ > +| 00 | SCM device is unable to persist memory contents. > | > +| | If the system is powered down, nothing will be saved. > | > ++--+---+ > +| 01 | SCM device failed to persist memory contents. Either contents were > not| > +| | saved successfully on power down or were not restored properly on > | > +| | power up. > | > ++--+---+ > +| 02 | SCM device contents are persisted from previous IPL. The data from > | > +| | the last boot were successfully restored. > | > ++--+---+ > +| 03 | SCM device contents are not persisted from previous IPL. There was > no | > +| | data to restore from the last boot. > | > ++--+---+ > +| 04 | SCM device memory life remaining is critically low > | > ++--+---+ > +| 05 | SCM device will be garded off next IPL due to failure > | > ++--+---+ > +| 06 | SCM contents cannot persist due to current platform health status. > A | > +| | hardware failure may prevent data from being saved or restored. > | > ++--+---+ > +| 07 | SCM device is unable to persist memory contents in certain > conditions | > ++--+---+ > +| 08 | SCM device is encrypted > | > ++--+---+ > +| 09 | SCM device has successfully completed a requested erase or secure > | > +| | erase procedure. > | > ++--+---+ > +|10:63 | Reserved / Unused > | > ++--+---+ > > **H_SCM_PERFORMANCE_STATS** > > -- > 2.26.2 >
[PATCH v3 7/8] mm: make deferred init's max threads arch-specific
Using padata during deferred init has only been tested on x86, so for
now limit it to this architecture.

If another arch wants this, it can find the max thread limit that's best
for it and override deferred_page_init_max_threads().

Signed-off-by: Daniel Jordan
Tested-by: Josh Triplett
---
 arch/x86/mm/init_64.c    | 12 ++++++++++++
 include/linux/memblock.h |  3 +++
 mm/page_alloc.c          | 13 ++++++++-----
 3 files changed, 23 insertions(+), 5 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 8b5f73f5e207c..2d749ec12ea8a 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1260,6 +1260,18 @@ void __init mem_init(void)
 	mem_init_print_info(NULL);
 }
 
+#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
+int __init deferred_page_init_max_threads(const struct cpumask *node_cpumask)
+{
+	/*
+	 * More CPUs always led to greater speedups on tested systems, up to
+	 * all the nodes' CPUs.  Use all since the system is otherwise idle
+	 * now.
+	 */
+	return max_t(int, cpumask_weight(node_cpumask), 1);
+}
+#endif
+
 int kernel_set_to_readonly;
 
 void mark_rodata_ro(void)
diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 6bc37a731d27b..2b289df44194f 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -275,6 +275,9 @@ void __next_mem_pfn_range_in_zone(u64 *idx, struct zone *zone,
 #define for_each_free_mem_pfn_range_in_zone_from(i, zone, p_start, p_end) \
 	for (; i != U64_MAX;					  \
 	     __next_mem_pfn_range_in_zone(&i, zone, p_start, p_end))
+
+int  __init deferred_page_init_max_threads(const struct cpumask *node_cpumask);
+
 #endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */
 
 /**
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1d47016849531..329fd1a809c59 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1835,6 +1835,13 @@ deferred_init_memmap_chunk(unsigned long start_pfn, unsigned long end_pfn,
 	}
 }
 
+/* An arch may override for more concurrency. */
+__weak int __init
+deferred_page_init_max_threads(const struct cpumask *node_cpumask)
+{
+	return 1;
+}
+
 /* Initialise remaining memory on a node */
 static int __init deferred_init_memmap(void *data)
 {
@@ -1883,11 +1890,7 @@ static int __init deferred_init_memmap(void *data)
 						 first_init_pfn))
 		goto zone_empty;
 
-	/*
-	 * More CPUs always led to greater speedups on tested systems, up to
-	 * all the nodes' CPUs.  Use all since the system is otherwise idle now.
-	 */
-	max_threads = max(cpumask_weight(cpumask), 1u);
+	max_threads = deferred_page_init_max_threads(cpumask);
 
 	while (spfn < epfn) {
 		unsigned long epfn_align = ALIGN(epfn, PAGES_PER_SECTION);
--
2.26.2
[PATCH v3 1/8] padata: remove exit routine
padata_driver_exit() is unnecessary because padata isn't built as a
module and doesn't exit.

padata's init routine will soon allocate memory, so getting rid of the
exit function now avoids pointless code to free it.

Signed-off-by: Daniel Jordan
Tested-by: Josh Triplett
---
 kernel/padata.c | 6 ------
 1 file changed, 6 deletions(-)

diff --git a/kernel/padata.c b/kernel/padata.c
index a6afa12fb75ee..835919c745266 100644
--- a/kernel/padata.c
+++ b/kernel/padata.c
@@ -1072,10 +1072,4 @@ static __init int padata_driver_init(void)
 }
 module_init(padata_driver_init);
 
-static __exit void padata_driver_exit(void)
-{
-	cpuhp_remove_multi_state(CPUHP_PADATA_DEAD);
-	cpuhp_remove_multi_state(hp_online);
-}
-module_exit(padata_driver_exit);
 #endif
--
2.26.2
[PATCH v3 8/8] padata: document multithreaded jobs
Add Documentation for multithreaded jobs.

Signed-off-by: Daniel Jordan
Tested-by: Josh Triplett
---
 Documentation/core-api/padata.rst | 41 +++++++++++++++++++++++++++++----------
 1 file changed, 31 insertions(+), 10 deletions(-)

diff --git a/Documentation/core-api/padata.rst b/Documentation/core-api/padata.rst
index 9a24c111781d9..0830e5b0e8211 100644
--- a/Documentation/core-api/padata.rst
+++ b/Documentation/core-api/padata.rst
@@ -4,23 +4,26 @@
 The padata parallel execution mechanism
 =======================================
 
-:Date: December 2019
+:Date: May 2020
 
 Padata is a mechanism by which the kernel can farm jobs out to be done in
-parallel on multiple CPUs while retaining their ordering.  It was developed for
-use with the IPsec code, which needs to be able to perform encryption and
-decryption on large numbers of packets without reordering those packets.  The
-crypto developers made a point of writing padata in a sufficiently general
-fashion that it could be put to other uses as well.
+parallel on multiple CPUs while optionally retaining their ordering.
 
-Usage
-=====
+It was originally developed for IPsec, which needs to perform encryption and
+decryption on large numbers of packets without reordering those packets.  This
+is currently the sole consumer of padata's serialized job support.
+
+Padata also supports multithreaded jobs, splitting up the job evenly while load
+balancing and coordinating between threads.
+
+Running Serialized Jobs
+=======================
 
 Initializing
 ------------
 
-The first step in using padata is to set up a padata_instance structure for
-overall control of how jobs are to be run::
+The first step in using padata to run serialized jobs is to set up a
+padata_instance structure for overall control of how jobs are to be run::
 
     #include
 
@@ -162,6 +165,24 @@ functions that correspond to the allocation in reverse::
 It is the user's responsibility to ensure all outstanding jobs are complete
 before any of the above are called.
 
+Running Multithreaded Jobs
+==========================
+
+A multithreaded job has a main thread and zero or more helper threads, with the
+main thread participating in the job and then waiting until all helpers have
+finished.  padata splits the job into units called chunks, where a chunk is a
+piece of the job that one thread completes in one call to the thread function.
+
+A user has to do three things to run a multithreaded job.  First, describe the
+job by defining a padata_mt_job structure, which is explained in the Interface
+section.  This includes a pointer to the thread function, which padata will
+call each time it assigns a job chunk to a thread.  Then, define the thread
+function, which accepts three arguments, ``start``, ``end``, and ``arg``, where
+the first two delimit the range that the thread operates on and the last is a
+pointer to the job's shared state, if any.  Prepare the shared state, which is
+typically allocated on the main thread's stack.  Last, call
+padata_do_multithreaded(), which will return once the job is finished.
+
 Interface
 =========
--
2.26.2
[PATCH v3 5/8] mm: don't track number of pages during deferred initialization
Deferred page init used to report the number of pages initialized:

  node 0 initialised, 32439114 pages in 97ms

Tracking this makes the code more complicated when using multiple
threads.  Given that the statistic probably has limited value,
especially since a zone grows on demand so that the page count can
vary, just remove it.

The boot message now looks like

  node 0 deferred pages initialised in 97ms

Signed-off-by: Daniel Jordan
Suggested-by: Alexander Duyck
---
 mm/page_alloc.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d0c0d9364aa6d..d64f3027fdfa6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1819,7 +1819,7 @@ static int __init deferred_init_memmap(void *data)
 {
 	pg_data_t *pgdat = data;
 	const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);
-	unsigned long spfn = 0, epfn = 0, nr_pages = 0;
+	unsigned long spfn = 0, epfn = 0;
 	unsigned long first_init_pfn, flags;
 	unsigned long start = jiffies;
 	struct zone *zone;
@@ -1868,15 +1868,15 @@ static int __init deferred_init_memmap(void *data)
 	 * allocator.
 	 */
 	while (spfn < epfn) {
-		nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
+		deferred_init_maxorder(&i, zone, &spfn, &epfn);
 		cond_resched();
 	}
 zone_empty:
 	/* Sanity check that the next zone really is unpopulated */
 	WARN_ON(++zid < MAX_NR_ZONES && populated_zone(++zone));
 
-	pr_info("node %d initialised, %lu pages in %ums\n",
-		pgdat->node_id, nr_pages, jiffies_to_msecs(jiffies - start));
+	pr_info("node %d deferred pages initialised in %ums\n",
+		pgdat->node_id, jiffies_to_msecs(jiffies - start));
 
 	pgdat_init_report_one_done();
 	return 0;
--
2.26.2
[PATCH v3 3/8] padata: allocate work structures for parallel jobs from a pool
padata allocates per-CPU, per-instance work structs for parallel jobs.
A do_parallel call assigns a job to a sequence number and hashes the
number to a CPU, where the job will eventually run using the
corresponding work.

This approach fit with how padata used to bind a job to each CPU
round-robin, but makes less sense after commit bfde23ce200e6 ("padata:
unbind parallel jobs from specific CPUs") because a work isn't bound to
a particular CPU anymore, and isn't needed at all for multithreaded
jobs because they don't have sequence numbers.

Replace the per-CPU works with a preallocated pool, which allows
sharing them between existing padata users and the upcoming
multithreaded user.  The pool will also facilitate setting NUMA-aware
concurrency limits with later users.

The pool is sized according to the number of possible CPUs.  With this
limit, MAX_OBJ_NUM no longer makes sense, so remove it.

If the global pool is exhausted, a parallel job is run in the current
task instead to throttle a system trying to do too much in parallel.

Signed-off-by: Daniel Jordan
Tested-by: Josh Triplett
---
 include/linux/padata.h |   8 +--
 kernel/padata.c        | 118 +++++++++++++++++++++++++----------------
 2 files changed, 78 insertions(+), 48 deletions(-)

diff --git a/include/linux/padata.h b/include/linux/padata.h
index 476ecfa41f363..3bfa503503ac5 100644
--- a/include/linux/padata.h
+++ b/include/linux/padata.h
@@ -24,7 +24,6 @@
  * @list: List entry, to attach to the padata lists.
  * @pd: Pointer to the internal control structure.
  * @cb_cpu: Callback cpu for serializatioon.
- * @cpu: Cpu for parallelization.
  * @seq_nr: Sequence number of the parallelized data object.
  * @info: Used to pass information from the parallel to the serial function.
  * @parallel: Parallel execution function.
@@ -34,7 +33,6 @@ struct padata_priv {
 	struct list_head	list;
 	struct parallel_data	*pd;
 	int			cb_cpu;
-	int			cpu;
 	unsigned int		seq_nr;
 	int			info;
 	void			(*parallel)(struct padata_priv *padata);
@@ -68,15 +66,11 @@ struct padata_serial_queue {
 /**
  * struct padata_parallel_queue - The percpu padata parallel queue
  *
- * @parallel: List to wait for parallelization.
  * @reorder: List to wait for reordering after parallel processing.
- * @work: work struct for parallelization.
  * @num_obj: Number of objects that are processed by this cpu.
  */
 struct padata_parallel_queue {
-	struct padata_list	parallel;
 	struct padata_list	reorder;
-	struct work_struct	work;
 	atomic_t		num_obj;
 };
 
@@ -111,7 +105,7 @@ struct parallel_data {
 	struct padata_parallel_queue	__percpu *pqueue;
 	struct padata_serial_queue	__percpu *squeue;
 	atomic_t			refcnt;
-	atomic_t			seq_nr;
+	unsigned int			seq_nr;
 	unsigned int			processed;
 	int				cpu;
 	struct padata_cpumask		cpumask;
diff --git a/kernel/padata.c b/kernel/padata.c
index 6f709bc0fc413..78ff9aa529204 100644
--- a/kernel/padata.c
+++ b/kernel/padata.c
@@ -32,7 +32,15 @@
 #include
 #include
 
-#define MAX_OBJ_NUM 1000
+struct padata_work {
+	struct work_struct	pw_work;
+	struct list_head	pw_list;  /* padata_free_works linkage */
+	void			*pw_data;
+};
+
+static DEFINE_SPINLOCK(padata_works_lock);
+static struct padata_work *padata_works;
+static LIST_HEAD(padata_free_works);
 
 static void padata_free_pd(struct parallel_data *pd);
 
@@ -58,30 +66,44 @@ static int padata_cpu_hash(struct parallel_data *pd, unsigned int seq_nr)
 	return padata_index_to_cpu(pd, cpu_index);
 }
 
-static void padata_parallel_worker(struct work_struct *parallel_work)
+static struct padata_work *padata_work_alloc(void)
 {
-	struct padata_parallel_queue *pqueue;
-	LIST_HEAD(local_list);
+	struct padata_work *pw;
 
-	local_bh_disable();
-	pqueue = container_of(parallel_work,
-			      struct padata_parallel_queue, work);
+	lockdep_assert_held(&padata_works_lock);
 
-	spin_lock(&pqueue->parallel.lock);
-	list_replace_init(&pqueue->parallel.list, &local_list);
-	spin_unlock(&pqueue->parallel.lock);
+	if (list_empty(&padata_free_works))
+		return NULL;	/* No more work items allowed to be queued. */
 
-	while (!list_empty(&local_list)) {
-		struct padata_priv *padata;
+	pw = list_first_entry(&padata_free_works, struct padata_work, pw_list);
+	list_del(&pw->pw_list);
+	return pw;
+}
 
-		padata = list_entry(local_list.next,
-				    struct padata_priv, list);
+static void padata_work_init(struct padata_work *pw, work_func_t work_fn,
+			     void *data)
+{
[PATCH v3 6/8] mm: parallelize deferred_init_memmap()
Deferred struct page init is a significant bottleneck in kernel boot.
Optimizing it maximizes availability for large-memory systems and
allows spinning up short-lived VMs as needed without having to leave
them running.  It also benefits bare metal machines hosting VMs that
are sensitive to downtime.  In projects such as VMM Fast Restart[1],
where guest state is preserved across kexec reboot, it helps prevent
application and network timeouts in the guests.

Multithread to take full advantage of system memory bandwidth.

The maximum number of threads is capped at the number of CPUs on the
node because speedups always improve with additional threads on every
system tested, and at this phase of boot, the system is otherwise idle
and waiting on page init to finish.

Helper threads operate on section-aligned ranges to both avoid false
sharing when setting the pageblock's migrate type and to avoid
accessing uninitialized buddy pages, though max order alignment is
enough for the latter.

The minimum chunk size is also a section.  There was benefit to using
multiple threads even on relatively small memory (1G) systems, and this
is the smallest size that the alignment allows.

The time (milliseconds) is the slowest node to initialize since boot
blocks until all nodes finish.  intel_pstate is loaded in active mode
without hwp and with turbo enabled, and intel_idle is active as well.
    Intel(R) Xeon(R) Platinum 8167M CPU @ 2.00GHz (Skylake, bare metal)
      2 nodes * 26 cores * 2 threads = 104 CPUs
      384G/node = 768G memory

                       kernel boot                deferred init
    node% (thr)    speedup  time_ms (stdev)   speedup  time_ms (stdev)
          (  0)         --   4089.7 (  8.1)        --   1785.7 (  7.6)
       2% (  1)       1.7%   4019.3 (  1.5)      3.8%   1717.7 ( 11.8)
      12% (  6)      34.9%   2662.7 (  2.9)     79.9%    359.3 (  0.6)
      25% ( 13)      39.9%   2459.0 (  3.6)     91.2%    157.0 (  0.0)
      37% ( 19)      39.2%   2485.0 ( 29.7)     90.4%    172.0 ( 28.6)
      50% ( 26)      39.3%   2482.7 ( 25.7)     90.3%    173.7 ( 30.0)
      75% ( 39)      39.0%   2495.7 (  5.5)     89.4%    190.0 (  1.0)
     100% ( 52)      40.2%   2443.7 (  3.8)     92.3%    138.0 (  1.0)

    Intel(R) Xeon(R) CPU E5-2699C v4 @ 2.20GHz (Broadwell, kvm guest)
      1 node * 16 cores * 2 threads = 32 CPUs
      192G/node = 192G memory

                       kernel boot                deferred init
    node% (thr)    speedup  time_ms (stdev)   speedup  time_ms (stdev)
          (  0)         --   1988.7 (  9.6)        --   1096.0 ( 11.5)
       3% (  1)       1.1%   1967.0 ( 17.6)      0.3%   1092.7 ( 11.0)
      12% (  4)      41.1%   1170.3 ( 14.2)     73.8%    287.0 (  3.6)
      25% (  8)      47.1%   1052.7 ( 21.9)     83.9%    177.0 ( 13.5)
      38% ( 12)      48.9%   1016.3 ( 12.1)     86.8%    144.7 (  1.5)
      50% ( 16)      48.9%   1015.7 (  8.1)     87.8%    134.0 (  4.4)
      75% ( 24)      49.1%   1012.3 (  3.1)     88.1%    130.3 (  2.3)
     100% ( 32)      49.5%   1004.0 (  5.3)     88.5%    125.7 (  2.1)

    Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz (Haswell, bare metal)
      2 nodes * 18 cores * 2 threads = 72 CPUs
      128G/node = 256G memory

                       kernel boot                deferred init
    node% (thr)    speedup  time_ms (stdev)   speedup  time_ms (stdev)
          (  0)         --   1680.0 (  4.6)        --    627.0 (  4.0)
       3% (  1)       0.3%   1675.7 (  4.5)     -0.2%    628.0 (  3.6)
      11% (  4)      25.6%   1250.7 (  2.1)     67.9%    201.0 (  0.0)
      25% (  9)      30.7%   1164.0 ( 17.3)     81.8%    114.3 ( 17.7)
      36% ( 13)      31.4%   1152.7 ( 10.8)     84.0%    100.3 ( 17.9)
      50% ( 18)      31.5%   1150.7 (  9.3)     83.9%    101.0 ( 14.1)
      75% ( 27)      31.7%   1148.0 (  5.6)     84.5%     97.3 (  6.4)
     100% ( 36)      32.0%   1142.3 (  4.0)     85.6%     90.0 (  1.0)

    AMD EPYC 7551 32-Core Processor (Zen, kvm guest)
      1 node * 8 cores * 2 threads = 16 CPUs
      64G/node = 64G memory

                       kernel boot                deferred init
    node% (thr)    speedup  time_ms (stdev)   speedup  time_ms (stdev)
          (  0)         --   1029.3 ( 25.1)        --    240.7 (  1.5)
       6% (  1)      -0.6%   1036.0 (  7.8)     -2.2%    246.0 (  0.0)
      12% (  2)      11.8%    907.7 (  8.6)     44.7%    133.0 (  1.0)
      25% (  4)      13.9%    886.0 ( 10.6)     62.6%     90.0 (  6.0)
      38% (  6)      17.8%    845.7 ( 14.2)     69.1%     74.3 (  3.8)
      50% (  8)      16.8%    856.0 ( 22.1)     72.9%     65.3 (  5.7)
      75% ( 12)      15.4%    871.0 ( 29.2)
[PATCH v3 0/8] padata: parallelize deferred page init
Thanks to Alex for his continued review and Josh for running v2!  Please
continue to review and test, and acks for the padata parts would be
appreciated.

Daniel

--

Deferred struct page init is a bottleneck in kernel boot--the biggest
for us and probably others.  Optimizing it maximizes availability for
large-memory systems and allows spinning up short-lived VMs as needed
without having to leave them running.  It also benefits bare metal
machines hosting VMs that are sensitive to downtime.  In projects such
as VMM Fast Restart[1], where guest state is preserved across kexec
reboot, it helps prevent application and network timeouts in the guests.

So, multithread deferred init to take full advantage of system memory
bandwidth.

Extend padata, a framework that handles many parallel singlethreaded
jobs, to handle multithreaded jobs as well by adding support for
splitting up the work evenly, specifying a minimum amount of work that's
appropriate for one helper thread to do, load balancing between helpers,
and coordinating them.  More documentation in patches 4 and 8.

This series is the first step in a project to address other memory
proportional bottlenecks in the kernel such as pmem struct page init,
vfio page pinning, hugetlb fallocate, and munmap.  Deferred page init
doesn't require concurrency limits, resource control, or priority
adjustments like these other users will because it happens during boot
when the system is otherwise idle and waiting for page init to finish.

This has been run on a variety of x86 systems and speeds up kernel boot
by 4% to 49%, saving up to 1.6 out of 4 seconds.  Patch 6 has more
numbers.

The powerpc and s390 lists are included in case they want to give this
a try, they had enabled this feature when it was configured per arch.
Series based on v5.7-rc7 plus these three from mmotm

  mm-call-touch_nmi_watchdog-on-max-order-boundaries-in-deferred-init.patch
  mm-initialize-deferred-pages-with-interrupts-enabled.patch
  mm-call-cond_resched-from-deferred_init_memmap.patch

and it's available here:

  git://oss.oracle.com/git/linux-dmjordan.git padata-mt-definit-v3
  https://oss.oracle.com/git/gitweb.cgi?p=linux-dmjordan.git;a=shortlog;h=refs/heads/padata-mt-definit-v3

and the future users and related features are available as
work-in-progress:

  git://oss.oracle.com/git/linux-dmjordan.git padata-mt-wip-v0.5
  https://oss.oracle.com/git/gitweb.cgi?p=linux-dmjordan.git;a=shortlog;h=refs/heads/padata-mt-wip-v0.5

v3:
 - Remove nr_pages accounting as suggested by Alex, adding a new patch
 - Align deferred init ranges up not down, simplify surrounding code (Alex)
 - Add Josh's T-b's from v2 (Josh's T-b's for v1 lost in rebase, apologies!)
 - Move padata.h include up in init/main.c to reduce patch collisions (Andrew)
 - Slightly reword Documentation patch
 - Rebase on v5.7-rc7 and retest

v2:
 - Improve the problem statement (Andrew, Josh, Pavel)
 - Add T-b's to unchanged patches (Josh)
 - Fully initialize max-order blocks to avoid buddy issues (Alex)
 - Parallelize on section-aligned boundaries to avoid potential false
   sharing (Alex)
 - Return the maximum thread count from a function that architectures can
   override, with the generic version returning 1 (current behavior).
   Override for x86 since that's the only arch this series has been tested
   on so far.  Other archs can test with more threads by dropping patch 6.
 - Rebase to v5.7-rc6, rerun tests

RFC v4 [2] -> v1:
 - merged with padata (Peter)
 - got rid of the 'task' nomenclature (Peter, Jon)

future work branch:
 - made lockdep-aware (Jason, Peter)
 - adjust workqueue worker priority with renice_or_cancel() (Tejun)
 - fixed undo problem in VFIO (Alex)

The remaining feedback, mainly resource control awareness (cgroup etc),
is TODO for later series.
[1] https://static.sched.com/hosted_files/kvmforum2019/66/VMM-fast-restart_kvmforum2019.pdf
    https://www.youtube.com/watch?v=pBsHnf93tcQ
    https://lore.kernel.org/linux-mm/1588812129-8596-1-git-send-email-anthony.yzn...@oracle.com/
[2] https://lore.kernel.org/linux-mm/20181105165558.11698-1-daniel.m.jor...@oracle.com/

Daniel Jordan (8):
  padata: remove exit routine
  padata: initialize earlier
  padata: allocate work structures for parallel jobs from a pool
  padata: add basic support for multithreaded jobs
  mm: don't track number of pages during deferred initialization
  mm: parallelize deferred_init_memmap()
  mm: make deferred init's max threads arch-specific
  padata: document multithreaded jobs

 Documentation/core-api/padata.rst |  41 +++--
 arch/x86/mm/init_64.c             |  12 ++
 include/linux/memblock.h          |   3 +
 include/linux/padata.h            |  43 ++++-
 init/main.c                       |   2 +
 kernel/padata.c                   | 277 ++++++++++++++++++++++++------
 mm/Kconfig                        |   6 +-
 mm/page_alloc.c                   |  59 +++++--
 8 files changed, 361 insertions(+), 82 deletions(-)

base-commit:
[PATCH v3 4/8] padata: add basic support for multithreaded jobs
Sometimes the kernel doesn't take full advantage of system memory
bandwidth, leading to a single CPU spending excessive time in
initialization paths where the data scales with memory size.

Multithreading naturally addresses this problem.

Extend padata, a framework that handles many parallel yet singlethreaded
jobs, to also handle multithreaded jobs by adding support for splitting
up the work evenly, specifying a minimum amount of work that's
appropriate for one helper thread to do, load balancing between helpers,
and coordinating them.

This is inspired by work from Pavel Tatashin and Steve Sistare.

Signed-off-by: Daniel Jordan
Tested-by: Josh Triplett
---
 include/linux/padata.h |  29 +++++++
 kernel/padata.c        | 152 ++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 178 insertions(+), 3 deletions(-)

diff --git a/include/linux/padata.h b/include/linux/padata.h
index 3bfa503503ac5..b0affa466a841 100644
--- a/include/linux/padata.h
+++ b/include/linux/padata.h
@@ -4,6 +4,9 @@
  *
  * Copyright (C) 2008, 2009 secunet Security Networks AG
  * Copyright (C) 2008, 2009 Steffen Klassert
+ *
+ * Copyright (c) 2020 Oracle and/or its affiliates.
+ * Author: Daniel Jordan
  */
 
 #ifndef PADATA_H
@@ -130,6 +133,31 @@ struct padata_shell {
 	struct list_head		list;
 };
 
+/**
+ * struct padata_mt_job - represents one multithreaded job
+ *
+ * @thread_fn: Called for each chunk of work that a padata thread does.
+ * @fn_arg: The thread function argument.
+ * @start: The start of the job (units are job-specific).
+ * @size: size of this node's work (units are job-specific).
+ * @align: Ranges passed to the thread function fall on this boundary, with the
+ *         possible exceptions of the beginning and end of the job.
+ * @min_chunk: The minimum chunk size in job-specific units.  This allows
+ *             the client to communicate the minimum amount of work that's
+ *             appropriate for one worker thread to do at once.
+ * @max_threads: Max threads to use for the job, actual number may be less
+ *               depending on task size and minimum chunk size.
+ */
+struct padata_mt_job {
+	void (*thread_fn)(unsigned long start, unsigned long end, void *arg);
+	void			*fn_arg;
+	unsigned long		start;
+	unsigned long		size;
+	unsigned long		align;
+	unsigned long		min_chunk;
+	int			max_threads;
+};
+
 /**
  * struct padata_instance - The overall control structure.
  *
@@ -171,6 +199,7 @@ extern void padata_free_shell(struct padata_shell *ps);
 extern int padata_do_parallel(struct padata_shell *ps,
 			      struct padata_priv *padata, int *cb_cpu);
 extern void padata_do_serial(struct padata_priv *padata);
+extern void __init padata_do_multithreaded(struct padata_mt_job *job);
 extern int padata_set_cpumask(struct padata_instance *pinst, int cpumask_type,
 			      cpumask_var_t cpumask);
 extern int padata_start(struct padata_instance *pinst);
diff --git a/kernel/padata.c b/kernel/padata.c
index 78ff9aa529204..e78f57d9aef90 100644
--- a/kernel/padata.c
+++ b/kernel/padata.c
@@ -7,6 +7,9 @@
  * Copyright (C) 2008, 2009 secunet Security Networks AG
  * Copyright (C) 2008, 2009 Steffen Klassert
  *
+ * Copyright (c) 2020 Oracle and/or its affiliates.
+ * Author: Daniel Jordan
+ *
  * This program is free software; you can redistribute it and/or modify it
  * under the terms and conditions of the GNU General Public License,
  * version 2, as published by the Free Software Foundation.
@@ -21,6 +24,7 @@
  * 51 Franklin St - Fifth Floor, Boston, MA 02110-1301 USA.
  */
 
+#include
 #include
 #include
 #include
@@ -32,6 +36,8 @@
 #include
 #include
 
+#define	PADATA_WORK_ONSTACK	1	/* Work's memory is on stack */
+
 struct padata_work {
 	struct work_struct	pw_work;
 	struct list_head	pw_list;  /* padata_free_works linkage */
 	void			*pw_data;
 };
 
@@ -42,7 +48,17 @@ static DEFINE_SPINLOCK(padata_works_lock);
 static struct padata_work *padata_works;
 static LIST_HEAD(padata_free_works);
 
+struct padata_mt_job_state {
+	spinlock_t		lock;
+	struct completion	completion;
+	struct padata_mt_job	*job;
+	int			nworks;
+	int			nworks_fini;
+	unsigned long		chunk_size;
+};
+
 static void padata_free_pd(struct parallel_data *pd);
+static void __init padata_mt_helper(struct work_struct *work);
 
 static int padata_index_to_cpu(struct parallel_data *pd, int cpu_index)
 {
@@ -81,18 +97,56 @@ static struct padata_work *padata_work_alloc(void)
 }
 
 static void padata_work_init(struct padata_work *pw, work_func_t work_fn,
-			     void *data)
+			     void *data, int flags)
 {
-	INIT_WORK(&pw->pw_work, work_fn);
+	if (flags & PADATA_WORK_ONSTACK)
[PATCH v3 2/8] padata: initialize earlier
padata will soon initialize the system's struct pages in parallel, so it
needs to be ready by page_alloc_init_late().

The error return from padata_driver_init() triggers an initcall warning,
so add a warning to padata_init() to avoid silent failure.

Signed-off-by: Daniel Jordan
Tested-by: Josh Triplett
---
 include/linux/padata.h |  6 ++++++
 init/main.c            |  2 ++
 kernel/padata.c        | 17 ++++++++---------
 3 files changed, 16 insertions(+), 9 deletions(-)

diff --git a/include/linux/padata.h b/include/linux/padata.h
index a0d8b41850b25..476ecfa41f363 100644
--- a/include/linux/padata.h
+++ b/include/linux/padata.h
@@ -164,6 +164,12 @@ struct padata_instance {
 #define	PADATA_INVALID	4
 };
 
+#ifdef CONFIG_PADATA
+extern void __init padata_init(void);
+#else
+static inline void __init padata_init(void) {}
+#endif
+
 extern struct padata_instance *padata_alloc_possible(const char *name);
 extern void padata_free(struct padata_instance *pinst);
 extern struct padata_shell *padata_alloc_shell(struct padata_instance *pinst);
diff --git a/init/main.c b/init/main.c
index 03371976d3872..df32f67214d23 100644
--- a/init/main.c
+++ b/init/main.c
@@ -63,6 +63,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -1482,6 +1483,7 @@ static noinline void __init kernel_init_freeable(void)
 
 	smp_init();
 	sched_init_smp();
 
+	padata_init();
 	page_alloc_init_late();
 	/* Initialize page ext after all struct pages are initialized. */
 	page_ext_init();
diff --git a/kernel/padata.c b/kernel/padata.c
index 835919c745266..6f709bc0fc413 100644
--- a/kernel/padata.c
+++ b/kernel/padata.c
@@ -31,7 +31,6 @@
 #include
 #include
 #include
-#include
 
 #define MAX_OBJ_NUM 1000
 
@@ -1050,26 +1049,26 @@ void padata_free_shell(struct padata_shell *ps)
 }
 EXPORT_SYMBOL(padata_free_shell);
 
-#ifdef CONFIG_HOTPLUG_CPU
-
-static __init int padata_driver_init(void)
+void __init padata_init(void)
 {
+#ifdef CONFIG_HOTPLUG_CPU
 	int ret;
 
 	ret = cpuhp_setup_state_multi(CPUHP_AP_ONLINE_DYN, "padata:online",
 				      padata_cpu_online, NULL);
 	if (ret < 0)
-		return ret;
+		goto err;
 	hp_online = ret;
 
 	ret = cpuhp_setup_state_multi(CPUHP_PADATA_DEAD, "padata:dead",
 				      NULL, padata_cpu_dead);
 	if (ret < 0) {
 		cpuhp_remove_multi_state(hp_online);
-		return ret;
+		goto err;
 	}
-	return 0;
-}
-module_init(padata_driver_init);
+
+	return;
+err:
+	pr_warn("padata: initialization failed\n");
 #endif
+}
--
2.26.2
Re: [PATCH v3 0/7] Statsfs: a new ram-based file system for Linux kernel statistics
On 27/05/20 15:33, Andrew Lunn wrote: >> I don't really know a lot about the networking subsystem, and as it was >> pointed out in another email on patch 7 by Andrew, networking needs to >> atomically gather and display statistics in order to make them consistent, >> and currently this is not supported by stats_fs but could be added in >> future. > > Do you have any idea how you will support atomic access? It does not > seem easy to implement in a filesystem based model. Hi Andrew, there are plans to support binary access. Emanuele and I don't really have a plan for how to implement it, but there are developers from Google that have ideas (because Google has a similar "metricfs" thing in-house). I think atomic access would use some kind of "source_ops" struct containing create_snapshot and release_snapshot function pointers. Paolo
[RFC PATCH 2/4] powerpc/64s: Don't let DT CPU features set FSCR_DSCR
The device tree CPU features binding includes FSCR bit numbers which
Linux is instructed to set by firmware.

Whether that's a good idea or not, in the case of the DSCR the Linux
implementation has a hard requirement that the FSCR_DSCR bit not be set
by default.  We use it to track when a process reads/writes to DSCR, so
it must be clear to begin with.

So if firmware tells us to set FSCR_DSCR we must ignore it.

Currently this does not cause a bug in our DSCR handling because the
value of FSCR that the device tree CPU features code establishes is
only used by swapper.  All other tasks use the value hard coded in
init_task.thread.fscr.

However we'd like to fix that in a future commit, at which point this
will become necessary.

Fixes: 5a61ef74f269 ("powerpc/64s: Support new device tree binding for discovering CPU features")
Cc: sta...@vger.kernel.org # v4.12+
Signed-off-by: Michael Ellerman
---
 arch/powerpc/kernel/dt_cpu_ftrs.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/arch/powerpc/kernel/dt_cpu_ftrs.c b/arch/powerpc/kernel/dt_cpu_ftrs.c
index 36bc0d5c4f3a..fca4d7ff22b9 100644
--- a/arch/powerpc/kernel/dt_cpu_ftrs.c
+++ b/arch/powerpc/kernel/dt_cpu_ftrs.c
@@ -346,6 +346,14 @@ static int __init feat_enable_dscr(struct dt_cpu_feature *f)
 {
 	u64 lpcr;
 
+	/*
+	 * Linux relies on FSCR[DSCR] being clear, so that we can take the
+	 * facility unavailable interrupt and track the task's usage of DSCR.
+	 * See facility_unavailable_exception().
+	 * Clear the bit here so that feat_enable() doesn't set it.
+	 */
+	f->fscr_bit_nr = -1;
+
 	feat_enable(f);
 
 	lpcr = mfspr(SPRN_LPCR);
--
2.25.1
[RFC PATCH 4/4] powerpc/64s: Don't set FSCR bits in INIT_THREAD
Since the previous commit that saves the value of FSCR configured at
boot into init_task.thread.fscr, the static initialisation in
INIT_THREAD now no longer has any effect.

So remove it.

For non DT CPU features, the end result is the same, because
__init_FSCR() is called on all CPUs that have an FSCR (Power8, Power9),
and it sets FSCR_TAR & FSCR_EBB.

Signed-off-by: Michael Ellerman
---
 arch/powerpc/include/asm/processor.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/powerpc/include/asm/processor.h b/arch/powerpc/include/asm/processor.h
index 4e53df163b92..52a67835057a 100644
--- a/arch/powerpc/include/asm/processor.h
+++ b/arch/powerpc/include/asm/processor.h
@@ -302,7 +302,6 @@ struct thread_struct {
 	.ksp = INIT_SP, \
 	.addr_limit = KERNEL_DS, \
 	.fpexc_mode = 0, \
-	.fscr = FSCR_TAR | FSCR_EBB \
 }
 #endif
--
2.25.1
[RFC PATCH 3/4] powerpc/64s: Save FSCR to init_task.thread.fscr after feature init
At boot the FSCR is initialised via one of two paths.  On most systems
it's set to a hard coded value in __init_FSCR().  On newer skiboot
systems we use the device tree CPU features binding, where firmware can
tell Linux what bits to set in FSCR (and HFSCR).

In both cases the value that's configured at boot is not propagated into
the init_task.thread.fscr value prior to the initial fork of init
(pid 1), which means the value is not used by any processes other than
swapper (the idle task).

For the __init_FSCR() case this is OK, because the value in
init_task.thread.fscr is initialised to something sensible.  However it
does mean that the value set in __init_FSCR() is not used other than for
swapper, which is odd and confusing.

The bigger problem is for the device tree CPU features case it prevents
firmware from setting (or clearing) FSCR bits for use by user space.
This means all existing kernels can not have features enabled/disabled
by firmware if those features require setting/clearing FSCR bits.

We can handle both cases by saving the FSCR value into
init_task.thread.fscr after we have initialised it at boot.  This fixes
the bug for device tree CPU features, and will allow us to simplify the
initialisation for the __init_FSCR() case in a future patch.

Fixes: 5a61ef74f269 ("powerpc/64s: Support new device tree binding for discovering CPU features")
Cc: sta...@vger.kernel.org # v4.12+
Signed-off-by: Michael Ellerman
---
 arch/powerpc/kernel/prom.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
index 1dcf0e214a22..a74bfd09cb38 100644
--- a/arch/powerpc/kernel/prom.c
+++ b/arch/powerpc/kernel/prom.c
@@ -685,6 +685,23 @@ static void __init tm_init(void)
 static void tm_init(void) { }
 #endif /* CONFIG_PPC_TRANSACTIONAL_MEM */
 
+#ifdef CONFIG_PPC64
+static void __init save_fscr_to_task(void)
+{
+	/*
+	 * Ensure the init_task (pid 0, aka swapper) uses the value of FSCR we
+	 * have configured via the device tree features or via __init_FSCR().
+	 * That value will then be propagated to pid 1 (init) and all future
+	 * processes.
+	 */
+	if (early_cpu_has_feature(CPU_FTR_ARCH_207S))
+		init_task.thread.fscr = mfspr(SPRN_FSCR);
+}
+#else
+static inline void save_fscr_to_task(void) {};
+#endif
+
+
 void __init early_init_devtree(void *params)
 {
 	phys_addr_t limit;
@@ -773,6 +790,8 @@ void __init early_init_devtree(void *params)
 		BUG();
 	}
 
+	save_fscr_to_task();
+
 #if defined(CONFIG_SMP) && defined(CONFIG_PPC64)
 	/* We'll later wait for secondaries to check in; there are
 	 * NCPUS-1 non-boot CPUs :-)
--
2.25.1
[RFC PATCH 1/4] powerpc/64s: Don't init FSCR_DSCR in __init_FSCR()
__init_FSCR() was added originally in commit 2468dcf641e4 ("powerpc:
Add support for context switching the TAR register") (Feb 2013), and
only set FSCR_TAR.

At that point FSCR (Facility Status and Control Register) was not
context switched, so the setting was permanent after boot.

Later we added initialisation of FSCR_DSCR to __init_FSCR(), in commit
54c9b2253d34 ("powerpc: Set DSCR bit in FSCR setup") (Mar 2013), again
that was permanent after boot.

Then commit 2517617e0de6 ("powerpc: Fix context switch DSCR on POWER8")
(Aug 2013) added a limited context switch of FSCR, just the FSCR_DSCR
bit was context switched based on thread.dscr_inherit.  That commit
said "This clears the H/FSCR DSCR bit initially", but it didn't, it
left the initialisation of FSCR_DSCR in __init_FSCR().

However the initial context switch from init_task to pid 1 would clear
FSCR_DSCR because thread.dscr_inherit was 0.

That commit also introduced the requirement that FSCR_DSCR be clear for
user processes, so that we can take the facility unavailable interrupt
in order to manage dscr_inherit.

Then in commit 152d523e6307 ("powerpc: Create context switch helpers
save_sprs() and restore_sprs()") (Dec 2015) FSCR was added to
thread_struct.  However it still wasn't fully context switched, we just
took the existing value and set FSCR_DSCR if the new thread had
dscr_inherit set.

FSCR was still initialised at boot to FSCR_DSCR | FSCR_TAR, but that
value was not propagated into the thread_struct, so the initial context
switch set FSCR_DSCR back to 0.

Finally commit b57bd2de8c6c ("powerpc: Improve FSCR init and context
switching") (Jun 2016) added a full context switch of the FSCR, and
added an initialisation of init_task.thread.fscr to
FSCR_TAR | FSCR_EBB, but omitted FSCR_DSCR.

The end result is that swapper runs with FSCR_DSCR set because of the
initialisation in __init_FSCR(), but no other processes do, they use
the value from init_task.thread.fscr.
Having FSCR_DSCR set for swapper allows it to access SPR 3 from userspace, but swapper never runs userspace, so it has no useful effect. It's also confusing to have the value initialised in two places to two different values.

So remove FSCR_DSCR from __init_FSCR(), this at least gets us to the point where there's a single value of FSCR, even if it's still set in two places.

Signed-off-by: Michael Ellerman
---
 arch/powerpc/kernel/cpu_setup_power.S | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/cpu_setup_power.S b/arch/powerpc/kernel/cpu_setup_power.S
index a460298c7ddb..f91ecb10d0ae 100644
--- a/arch/powerpc/kernel/cpu_setup_power.S
+++ b/arch/powerpc/kernel/cpu_setup_power.S
@@ -184,7 +184,7 @@ _GLOBAL(__restore_cpu_power9)
 __init_FSCR:
 	mfspr	r3,SPRN_FSCR
-	ori	r3,r3,FSCR_TAR|FSCR_DSCR|FSCR_EBB
+	ori	r3,r3,FSCR_TAR|FSCR_EBB
 	mtspr	SPRN_FSCR,r3
 	blr
-- 
2.25.1
Re: [PATCH] media: omap3isp: Shuffle cacheflush.h and include mm.h
Hi Nathan, (CC'ing Sakari Ailus and the linux-media mailing list) On Wed, May 27, 2020 at 01:13:37AM -0700, Nathan Chancellor wrote: > On Wed, May 27, 2020 at 09:02:51AM +0200, Geert Uytterhoeven wrote: > > On Wed, May 27, 2020 at 6:37 AM Nathan Chancellor wrote: > > > After mm.h was removed from the asm-generic version of cacheflush.h, > > > s390 allyesconfig shows several warnings of the following nature: > > > > > > In file included from ./arch/s390/include/generated/asm/cacheflush.h:1, > > > from drivers/media/platform/omap3isp/isp.c:42: > > > ./include/asm-generic/cacheflush.h:16:42: warning: 'struct mm_struct' > > > declared inside parameter list will not be visible outside of this > > > definition or declaration > > > > > > cacheflush.h does not include mm.h nor does it include any forward > > > declaration of these structures hence the warning. To avoid this, > > > include mm.h explicitly in this file and shuffle cacheflush.h below it. > > > > > > Fixes: 19c0054597a0 ("asm-generic: don't include in > > > cacheflush.h") > > > Signed-off-by: Nathan Chancellor > > > > Thanks for your patch! > > > > > I am aware the fixes tag is kind of irrelevant because that SHA will > > > change in the next linux-next revision and this will probably get folded > > > into the original patch anyways but still. > > > > > > The other solution would be to add forward declarations of these structs > > > to the top of cacheflush.h, I just chose to do what Christoph did in the > > > original patch. I am happy to do that instead if you all feel that is > > > better. > > > > That actually looks like a better solution to me, as it would address the > > problem for all users. Headers should be self-contained, so that would be the best fix in my opinion. This being said, as cacheflush.h isn't needed in isp.c, I think we should also drop it. It seems to have been included there since the first driver version, and was likely a left-over from the out-of-tree development. 
Manual cache handling was part of drivers/media/platform/omap3isp/ispqueue.c and has been removed in commit fbac1400bd1a ("[media] omap3isp: Move to videobuf2"). cacheflush.h can also be dropped from ispvideo.c which suffers from the same issue. > > > drivers/media/platform/omap3isp/isp.c | 5 +++-- > > > 1 file changed, 3 insertions(+), 2 deletions(-) > > > > > > diff --git a/drivers/media/platform/omap3isp/isp.c > > > b/drivers/media/platform/omap3isp/isp.c > > > index a4ee6b86663e..54106a768e54 100644 > > > --- a/drivers/media/platform/omap3isp/isp.c > > > +++ b/drivers/media/platform/omap3isp/isp.c > > > @@ -39,8 +39,6 @@ > > > * Troy Laramy > > > */ > > > > > > -#include > > > - > > > #include > > > #include > > > #include > > > @@ -49,6 +47,7 @@ > > > #include > > > #include > > > #include > > > +#include > > > #include > > > #include > > > #include > > > @@ -58,6 +57,8 @@ > > > #include > > > #include > > > > > > +#include > > > + > > > #ifdef CONFIG_ARM_DMA_USE_IOMMU > > > #include > > > #endif > > > > Why does this file need at all? > > It doesn't call any of the flush_*() functions, and seems to compile fine > > without (on arm32). > > > > Perhaps it was included at the top intentionally, to override the > > definitions > > of copy_{to,from}_user_page()? Fortunately that doesn't seem to be the > > case, from a quick look at the assembler output. > > > > So let's just remove the #include instead? > > Sounds good to me. I can send a patch if needed or I suppose Andrew can > just make a small fixup patch for it. Let me know what I should do. -- Regards, Laurent Pinchart
Re: [PATCH v3 0/7] Statsfs: a new ram-based file system for Linux kernel statistics
> I don't really know a lot about the networking subsystem, and as it was
> pointed out in another email on patch 7 by Andrew, networking needs to
> atomically gather and display statistics in order to make them consistent,
> and currently this is not supported by stats_fs but could be added in
> future.

Hi Emanuele

Do you have any idea how you will support atomic access? It does not seem easy to implement in a filesystem based model.

Andrew
Re: [PATCH v3 3/7] kunit: tests for stats_fs API
>> In order to run them, the kernel .config must set CONFIG_KUNIT=y
>> and a new .kunitconfig file must be created with CONFIG_STATS_FS=y
>> and CONFIG_STATS_FS_TEST=y
>
> It looks like CONFIG_STATS_FS is built-in, but it exports much of the
> functionality you are testing. However could the tests also be built as
> a module (i.e. make CONFIG_STATS_FS_TEST a tristate variable)? To test
> this you'd need to specify CONFIG_KUNIT=m and CONFIG_STATS_FS_TEST=m,
> and testing would simply be a case of "modprobe"ing the stats fs module
> and collecting results in /sys/kernel/debug/kunit/ (rather than running
> kunit.py). Are you relying on unexported internals in the tests that
> would prevent building them as a module?

I haven't checked it yet, but the tests should work as a separate module. I will look into it, thanks.

Emanuele
Re: [PATCH v3 0/7] Statsfs: a new ram-based file system for Linux kernel statistics
>> The file system is mounted on /sys/kernel/stats and would be already
>> used by kvm. Statsfs was initially introduced by Paolo Bonzini [1].
>
> What's the direct motivation for this work? Moving KVM stats out of debugfs?

There's many reasons: one of these is not using debugfs for statistics, but also (and mainly) to try and have a single tool that automatically takes care and displays them, instead of leaving each subsystem "on its own". Sure, everyone gathers and processes stats in different ways, and the aim of this tool is to hopefully be extensible enough to cover all needs.

> In my experience stats belong in the API used for creating/enumerating
> objects, statsfs sounds like going in the exact opposite direction -
> creating a parallel structure / hierarchy for exposing stats. I know
> nothing about KVM but are you sure all the info that has to be exposed
> will be stats?

I don't understand, what do you mean here?

> In case of networking we have the basic stats in sysfs, under the
> netdevice's kobject. But since we're not using sysfs much any more for
> config, new stats are added in netlink APIs. Again - same APIs used for
> enumeration and config.

I don't really know a lot about the networking subsystem, and as it was pointed out in another email on patch 7 by Andrew, networking needs to atomically gather and display statistics in order to make them consistent, and currently this is not supported by stats_fs but could be added in future.

In addition, right now it won't work properly if the networking namespaces are enabled. That is another issue to take into consideration. That's also why I marked patch 7 as "not for merge".

Regarding the config, as I said the idea is to gather multiple subsystems' statistics, therefore there wouldn't be a single configuration method like in netlink. For example in kvm there are file descriptors for configuration, and creating them requires no privilege, contrary to the network interfaces.

Thank you,
Emanuele
Re: [PATCH v3 3/7] kunit: tests for stats_fs API
On Tue, 26 May 2020, Emanuele Giuseppe Esposito wrote:

> Add kunit tests to extensively test the stats_fs API functionality.

I've added in the kunit-related folks.

> In order to run them, the kernel .config must set CONFIG_KUNIT=y
> and a new .kunitconfig file must be created with CONFIG_STATS_FS=y
> and CONFIG_STATS_FS_TEST=y

It looks like CONFIG_STATS_FS is built-in, but it exports much of the functionality you are testing. However could the tests also be built as a module (i.e. make CONFIG_STATS_FS_TEST a tristate variable)? To test this you'd need to specify CONFIG_KUNIT=m and CONFIG_STATS_FS_TEST=m, and testing would simply be a case of "modprobe"ing the stats fs module and collecting results in /sys/kernel/debug/kunit/ (rather than running kunit.py). Are you relying on unexported internals in the tests that would prevent building them as a module?

Thanks!
Alan
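For reference, making the test suite tristate as suggested would be a small Kconfig change. The fragment below is a hypothetical sketch (the symbol names follow the patch set, but the exact dependencies in the real Kconfig may differ):

```kconfig
config STATS_FS_TEST
	tristate "KUnit tests for the stats_fs API"
	depends on KUNIT
	depends on STATS_FS
	help
	  Build the stats_fs KUnit test suite. When built as a module
	  ("m"), the tests run at modprobe time and report their results
	  through /sys/kernel/debug/kunit/ instead of requiring kunit.py.
```

Changing "bool" to "tristate" is what allows the "=m" configuration Alan describes, provided the tests only call exported stats_fs symbols.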
[PATCH V4 2/2] tools/perf: Add perf tools support for extended register capability in powerpc
From: Anju T Sudhakar

Add extended regs to sample_reg_mask in the tool side to use with `-I?` option. Perf tools side uses extended mask to display the platform supported register names (with -I? option) to the user and also send this mask to the kernel to capture the extended registers in each sample. Hence decide the mask value based on the processor version.

Signed-off-by: Anju T Sudhakar
[Decide extended mask at run time based on platform]
Signed-off-by: Athira Rajeev
Reviewed-by: Madhavan Srinivasan
---
 tools/arch/powerpc/include/uapi/asm/perf_regs.h | 14 ++-
 tools/perf/arch/powerpc/include/perf_regs.h     |  5 ++-
 tools/perf/arch/powerpc/util/perf_regs.c        | 55 +
 3 files changed, 72 insertions(+), 2 deletions(-)

diff --git a/tools/arch/powerpc/include/uapi/asm/perf_regs.h b/tools/arch/powerpc/include/uapi/asm/perf_regs.h
index f599064..485b1d5 100644
--- a/tools/arch/powerpc/include/uapi/asm/perf_regs.h
+++ b/tools/arch/powerpc/include/uapi/asm/perf_regs.h
@@ -48,6 +48,18 @@ enum perf_event_powerpc_regs {
 	PERF_REG_POWERPC_DSISR,
 	PERF_REG_POWERPC_SIER,
 	PERF_REG_POWERPC_MMCRA,
-	PERF_REG_POWERPC_MAX,
+	/* Extended registers */
+	PERF_REG_POWERPC_MMCR0,
+	PERF_REG_POWERPC_MMCR1,
+	PERF_REG_POWERPC_MMCR2,
+	/* Max regs without the extended regs */
+	PERF_REG_POWERPC_MAX = PERF_REG_POWERPC_MMCRA + 1,
 };
+
+#define PERF_REG_PMU_MASK	((1ULL << PERF_REG_POWERPC_MAX) - 1)
+
+/* PERF_REG_EXTENDED_MASK value for CPU_FTR_ARCH_300 */
+#define PERF_REG_PMU_MASK_300	(((1ULL << (PERF_REG_POWERPC_MMCR2 + 1)) - 1) \
+				- PERF_REG_PMU_MASK)
+
 #endif /* _UAPI_ASM_POWERPC_PERF_REGS_H */
diff --git a/tools/perf/arch/powerpc/include/perf_regs.h b/tools/perf/arch/powerpc/include/perf_regs.h
index e18a355..46ed00d 100644
--- a/tools/perf/arch/powerpc/include/perf_regs.h
+++ b/tools/perf/arch/powerpc/include/perf_regs.h
@@ -64,7 +64,10 @@
 	[PERF_REG_POWERPC_DAR] = "dar",
 	[PERF_REG_POWERPC_DSISR] = "dsisr",
 	[PERF_REG_POWERPC_SIER] = "sier",
-	[PERF_REG_POWERPC_MMCRA] = "mmcra"
+	[PERF_REG_POWERPC_MMCRA] = "mmcra",
+	[PERF_REG_POWERPC_MMCR0] = "mmcr0",
+	[PERF_REG_POWERPC_MMCR1] = "mmcr1",
+	[PERF_REG_POWERPC_MMCR2] = "mmcr2",
 };

 static inline const char *perf_reg_name(int id)
diff --git a/tools/perf/arch/powerpc/util/perf_regs.c b/tools/perf/arch/powerpc/util/perf_regs.c
index 0a52429..9179230 100644
--- a/tools/perf/arch/powerpc/util/perf_regs.c
+++ b/tools/perf/arch/powerpc/util/perf_regs.c
@@ -6,9 +6,14 @@
 #include "../../../util/perf_regs.h"
 #include "../../../util/debug.h"
+#include "../../../util/event.h"
+#include "../../../util/header.h"
+#include "../../../perf-sys.h"

 #include

+#define PVR_POWER9	0x004E
+
 const struct sample_reg sample_reg_masks[] = {
 	SMPL_REG(r0, PERF_REG_POWERPC_R0),
 	SMPL_REG(r1, PERF_REG_POWERPC_R1),
@@ -55,6 +60,9 @@
 	SMPL_REG(dsisr, PERF_REG_POWERPC_DSISR),
 	SMPL_REG(sier, PERF_REG_POWERPC_SIER),
 	SMPL_REG(mmcra, PERF_REG_POWERPC_MMCRA),
+	SMPL_REG(mmcr0, PERF_REG_POWERPC_MMCR0),
+	SMPL_REG(mmcr1, PERF_REG_POWERPC_MMCR1),
+	SMPL_REG(mmcr2, PERF_REG_POWERPC_MMCR2),
 	SMPL_REG_END
 };

@@ -163,3 +171,50 @@ int arch_sdt_arg_parse_op(char *old_op, char **new_op)
 	return SDT_ARG_VALID;
 }
+
+uint64_t arch__intr_reg_mask(void)
+{
+	struct perf_event_attr attr = {
+		.type		= PERF_TYPE_HARDWARE,
+		.config		= PERF_COUNT_HW_CPU_CYCLES,
+		.sample_type	= PERF_SAMPLE_REGS_INTR,
+		.precise_ip	= 1,
+		.disabled	= 1,
+		.exclude_kernel	= 1,
+	};
+	int fd, ret;
+	char buffer[64];
+	u32 version;
+	u64 extended_mask = 0;
+
+	/*
+	 * Get the PVR value to set the extended
+	 * mask specific to platform
+	 */
+	get_cpuid(buffer, sizeof(buffer));
+	ret = sscanf(buffer, "%u,", &version);
+
+	if (ret != 1) {
+		pr_debug("Failed to get the processor version, unable to output extended registers\n");
+		return PERF_REGS_MASK;
+	}
+
+	if (version == PVR_POWER9)
+		extended_mask = PERF_REG_PMU_MASK_300;
+	else
+		return PERF_REGS_MASK;
+
+	attr.sample_regs_intr = extended_mask;
+	attr.sample_period = 1;
+	event_attr_init(&attr);
+
+	/*
+	 * check if the pmu supports perf extended regs, before
+	 * returning the register mask to sample.
+	 */
+	fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
+	if (fd != -1) {
+		close(fd);
+		return (extended_mask | PERF_REGS_MASK);
+	}
+	return PERF_REGS_MASK;
+}
-- 
1.8.3.1
[PATCH V4 1/2] powerpc/perf: Add support for outputting extended regs in perf intr_regs
From: Anju T Sudhakar

Add support for perf extended register capability in powerpc. The capability flag PERF_PMU_CAP_EXTENDED_REGS, is used to indicate the PMU which support extended registers. The generic code define the mask of extended registers as 0 for non supported architectures.

Patch adds extended regs support for power9 platform by exposing MMCR0, MMCR1 and MMCR2 registers.

REG_RESERVED mask needs update to include extended regs. `PERF_REG_EXTENDED_MASK`, contains mask value of the supported registers, is defined at runtime in the kernel based on platform since the supported registers may differ from one processor version to another and hence the MASK value.

with patch --

available registers: r0 r1 r2 r3 r4 r5 r6 r7 r8 r9 r10 r11 r12 r13 r14 r15 r16 r17 r18 r19 r20 r21 r22 r23 r24 r25 r26 r27 r28 r29 r30 r31 nip msr orig_r3 ctr link xer ccr softe trap dar dsisr sier mmcra mmcr0 mmcr1 mmcr2

PERF_RECORD_SAMPLE(IP, 0x1): 4784/4784: 0 period: 1 addr: 0
... intr regs: mask 0x ABI 64-bit
r0      0xc012b77c       r1      0xc03fe5e03930
r2      0xc1b0e000       r3      0xc03fdcddf800
r4      0xc03fc788       r5      0x9c422724be
r6      0xc03fe5e03908   r7      0xff63bddc8706
r8      0x9e4            r9      0x0
r10     0x1              r11     0x0
r12     0xc01299c0       r13     0xc03c4800
r14     0x0              r15     0x7fffdd8b8b00
r16     0x0              r17     0x7fffdd8be6b8
r18     0x7e7076607730   r19     0x2f
r20     0xc0001fc26c68   r21     0xc0002041e4227e00
r22     0xc0002018fb60   r23     0x1
r24     0xc03ffec4d900   r25     0x8000
r26     0x0              r27     0x1
r28     0x1              r29     0xc1be1260
r30     0x6008010        r31     0xc03ffebb7218
nip     0xc012b910       msr     0x90009033
orig_r3 0xc012b86c       ctr     0xc01299c0
link    0xc012b77c       xer     0x0
ccr     0x2800           softe   0x1
trap    0xf00            dar     0x0
dsisr   0x800            sier    0x0
mmcra   0x800            mmcr0   0x82008090
mmcr1   0x1e00           mmcr2   0x0
... thread: perf:4784

Signed-off-by: Anju T Sudhakar
[Defined PERF_REG_EXTENDED_MASK at run time to add support for different platforms ]
Signed-off-by: Athira Rajeev
Reviewed-by: Madhavan Srinivasan
---
 arch/powerpc/include/asm/perf_event_server.h |  8 +++
 arch/powerpc/include/uapi/asm/perf_regs.h    | 14 +++-
 arch/powerpc/perf/core-book3s.c              |  1 +
 arch/powerpc/perf/perf_regs.c                | 34 +---
 arch/powerpc/perf/power9-pmu.c               |  6 +
 5 files changed, 59 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/perf_event_server.h b/arch/powerpc/include/asm/perf_event_server.h
index 3e9703f..1458e1a 100644
--- a/arch/powerpc/include/asm/perf_event_server.h
+++ b/arch/powerpc/include/asm/perf_event_server.h
@@ -15,6 +15,9 @@
 #define MAX_EVENT_ALTERNATIVES	8
 #define MAX_LIMITED_HWCOUNTERS	2

+extern u64 mask_var;
+#define PERF_REG_EXTENDED_MASK	mask_var
+
 struct perf_event;

 /*
@@ -55,6 +58,11 @@ struct power_pmu {
 	int		*blacklist_ev;
 	/* BHRB entries in the PMU */
 	int		bhrb_nr;
+	/*
+	 * set this flag with `PERF_PMU_CAP_EXTENDED_REGS` if
+	 * the pmu supports extended perf regs capability
+	 */
+	int		capabilities;
 };

 /*
diff --git a/arch/powerpc/include/uapi/asm/perf_regs.h b/arch/powerpc/include/uapi/asm/perf_regs.h
index f599064..485b1d5 100644
--- a/arch/powerpc/include/uapi/asm/perf_regs.h
+++ b/arch/powerpc/include/uapi/asm/perf_regs.h
@@ -48,6 +48,18 @@ enum perf_event_powerpc_regs {
 	PERF_REG_POWERPC_DSISR,
 	PERF_REG_POWERPC_SIER,
 	PERF_REG_POWERPC_MMCRA,
-	PERF_REG_POWERPC_MAX,
+	/* Extended registers */
+	PERF_REG_POWERPC_MMCR0,
+	PERF_REG_POWERPC_MMCR1,
+	PERF_REG_POWERPC_MMCR2,
+	/* Max regs without the extended regs */
+	PERF_REG_POWERPC_MAX = PERF_REG_POWERPC_MMCRA + 1,
 };
+
+#define PERF_REG_PMU_MASK	((1ULL << PERF_REG_POWERPC_MAX) - 1)
+
+/* PERF_REG_EXTENDED_MASK value for CPU_FTR_ARCH_300 */
+#define PERF_REG_PMU_MASK_300	(((1ULL << (PERF_REG_POWERPC_MMCR2 + 1)) - 1) \
+				- PERF_REG_PMU_MASK)
+
 #endif /* _UAPI_ASM_POWERPC_PERF_REGS_H */
diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
index 3dcfecf..7f63edf 100644
--- a/arch/powerpc/perf/core-book3s.c
+++ b/arch/powerpc/perf/core-book3s.c
@@ -2275,6 +2275,7 @@ int register_power_pmu(struct power_pmu *pmu)
 		pmu->name);

 	power_pmu.attr_groups = ppmu->attr_groups;
+	power_pmu.capabilities |= (ppmu->capabilities & PERF_PMU_CAP_EXTENDED_REGS);
 #ifdef MSR_HV
 	/*
diff --git a/arch/powerpc/perf/perf_regs.c b/arch/powerpc/perf/perf_regs.c
index a213a0a..c8a7e8c
[PATCH V4 0/2] powerpc/perf: Add support for perf extended regs in powerpc
Patch set to add support for perf extended register capability in powerpc. The capability flag PERF_PMU_CAP_EXTENDED_REGS, is used to indicate the PMU which support extended registers. The generic code define the mask of extended registers as 0 for non supported architectures.

Patch 1/2 defines this PERF_PMU_CAP_EXTENDED_REGS mask to output the values of mmcr0,mmcr1,mmcr2 for POWER9. Defines `PERF_REG_EXTENDED_MASK` at runtime which contains mask value of the supported registers under extended regs.

Patch 2/2 adds extended regs to sample_reg_mask in the tool side to use with `-I?` option.

Anju T Sudhakar (2):
  powerpc/perf: Add support for outputting extended regs in perf intr_regs
  tools/perf: Add perf tools support for extended register capability in powerpc

---
Changes from v3 -> v4
- Addressed the comments for new line/tab issue and added "Reviewed-by" from Madhavan Srinivasan.

Changes from v2 -> v3
- Split kernel and tools side patches as suggested by Arnaldo
- Addressed review comment from Madhavan Srinivasan

Changes from v1 -> v2
- `PERF_REG_EXTENDED_MASK` is defined at runtime in the kernel based on platform. This will give flexibility in using extended regs for all processor versions where the supported registers may differ.
- Removed PERF_REG_EXTENDED_MASK from the perf tools side. Based on the processor version (from the PVR value), the tool side will return the appropriate extended mask.
- Since tool changes can handle without a "PERF_REG_EXTENDED_MASK" macro, dropped patch to set NO_AUXTRACE.
- Addressed review comments from Ravi Bangoria for V1
---

 arch/powerpc/include/asm/perf_event_server.h    |  8 ++++
 arch/powerpc/include/uapi/asm/perf_regs.h       | 14 ++++++-
 arch/powerpc/perf/core-book3s.c                 |  1 +
 arch/powerpc/perf/perf_regs.c                   | 34 ++++++++++++--
 arch/powerpc/perf/power9-pmu.c                  |  6 +++
 tools/arch/powerpc/include/uapi/asm/perf_regs.h | 14 ++++++-
 tools/perf/arch/powerpc/include/perf_regs.h     |  5 ++-
 tools/perf/arch/powerpc/util/perf_regs.c        | 55 +++++++++++++++++++++++
 8 files changed, 131 insertions(+), 6 deletions(-)

-- 
1.8.3.1
[PATCH] powerpc/bpf: Enable bpf_probe_read{, str}() on powerpc again
The commit 0ebeea8ca8a4d1d453a ("bpf: Restrict bpf_probe_read{, str}() only to archs where they work") meant that the bpf_probe_read{, str}() functions were no longer available on architectures where the same logical address might have different content in kernel and user memory mapping. These architectures should use the probe_read_{user,kernel}_str helpers instead.

For backward compatibility, the problematic functions are still available on architectures where the user and kernel address spaces are not overlapping. This is indicated by defining CONFIG_ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE.

At the moment, these backward compatible functions are enabled only on x86_64, arm, and arm64. Let's do it also on powerpc, which has a non-overlapping address space as well.

Signed-off-by: Petr Mladek
To: Michael Ellerman
Cc: Benjamin Herrenschmidt
Cc: Paul Mackerras
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-ker...@vger.kernel.org
Cc: Daniel Borkmann
Cc: Alexei Starovoitov
Cc: Masami Hiramatsu
Cc: Brendan Gregg
Cc: Christoph Hellwig
---
 arch/powerpc/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index d13b5328ca10..b29d7cb38368 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -126,6 +126,7 @@ config PPC
 	select ARCH_HAS_MMIOWB			if PPC64
 	select ARCH_HAS_PHYS_TO_DMA
 	select ARCH_HAS_PMEM_API
+	select ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE
 	select ARCH_HAS_PTE_DEVMAP		if PPC_BOOK3S_64
 	select ARCH_HAS_PTE_SPECIAL
 	select ARCH_HAS_MEMBARRIER_CALLBACKS
-- 
2.26.2
Re: [PATCH 2/3] powerpc/pci: unmap legacy INTx interrupts of passthrough IO adapters
Hello Michael, On 4/29/20 9:51 AM, Cédric Le Goater wrote: > When a passthrough IO adapter is removed from a pseries machine using > hash MMU and the XIVE interrupt mode, the POWER hypervisor, pHyp, > expects the guest OS to have cleared all page table entries related to > the adapter. If some are still present, the RTAS call which isolates > the PCI slot returns error 9001 "valid outstanding translations" and > the removal of the IO adapter fails. > > INTx interrupt numbers need special care because Linux maps the > interrupts automatically in the Linux interrupt number space if they > are presented in the device tree node describing the IO adapter. These > interrupts are not un-mapped automatically and in case of an hot-plug > adapter, the PCI hot-plug layer needs to handle the cleanup to make > sure that all the page table entries of the XIVE ESB pages are > cleared. It seems this patch needs more digging to make sure we are handling the IRQ unmapping in the correct PCI handler. Could you please keep it back for the moment ? Thanks, C.
Re: [PATCH 2/3] powerpc/pci: unmap legacy INTx interrupts of passthrough IO adapters
On 5/27/20 2:57 AM, Oliver O'Halloran wrote: > On Wed, Apr 29, 2020 at 5:51 PM Cédric Le Goater wrote: >> >> When a passthrough IO adapter is removed from a pseries machine using >> hash MMU and the XIVE interrupt mode, the POWER hypervisor, pHyp, >> expects the guest OS to have cleared all page table entries related to >> the adapter. If some are still present, the RTAS call which isolates >> the PCI slot returns error 9001 "valid outstanding translations" and >> the removal of the IO adapter fails. >> >> INTx interrupt numbers need special care because Linux maps the >> interrupts automatically in the Linux interrupt number space if they >> are presented in the device tree node describing the IO adapter. These >> interrupts are not un-mapped automatically and in case of an hot-plug >> adapter, the PCI hot-plug layer needs to handle the cleanup to make >> sure that all the page table entries of the XIVE ESB pages are >> cleared. >> >> Cc: "Oliver O'Halloran" >> Signed-off-by: Cédric Le Goater >> --- >> arch/powerpc/kernel/pci-hotplug.c | 2 ++ >> 1 file changed, 2 insertions(+) >> >> diff --git a/arch/powerpc/kernel/pci-hotplug.c >> b/arch/powerpc/kernel/pci-hotplug.c >> index bf83f76563a3..9e9c6befd7ea 100644 >> --- a/arch/powerpc/kernel/pci-hotplug.c >> +++ b/arch/powerpc/kernel/pci-hotplug.c >> @@ -57,6 +57,8 @@ void pcibios_release_device(struct pci_dev *dev) >> struct pci_controller *phb = pci_bus_to_host(dev->bus); >> struct pci_dn *pdn = pci_get_pdn(dev); >> >> + irq_dispose_mapping(dev->irq); > > What does the original mapping? Powerpc arch code or the PCI core? Powerpc. In pci_read_irq_line() when a device is added. > Tearing down the mapping in pcibios_release_device() seems a bit fishy > to me since the PCI core has already torn down the device state at > that point. If the release is delayed it's possible that another > pci_dev has mapped the IRQ before we get here, but maybe that's ok. Which scenario would that be ? 
multiple devices mapping the same INTx interrupt because all are used already ? Where should we drop the mapping ? Thanks, C. >> + >> eeh_remove_device(dev); >> >> if (phb->controller_ops.release_device) >> -- >> 2.25.4 >>
Re: [PATCH v8 5/5] powerpc/papr_scm: Implement support for PAPR_SCM_PDSM_HEALTH
Vaibhav Jain writes: > This patch implements support for PDSM request 'PAPR_SCM_PDSM_HEALTH' > that returns a newly introduced 'struct nd_papr_pdsm_health' instance > containing dimm health information back to user space in response to > ND_CMD_CALL. This functionality is implemented in newly introduced > papr_scm_get_health() that queries the scm-dimm health information and > then copies this information to the package payload whose layout is > defined by 'struct nd_papr_pdsm_health'. > > The patch also introduces a new member 'struct papr_scm_priv.health' > thats an instance of 'struct nd_papr_pdsm_health' to cache the health > information of a nvdimm. As a result functions drc_pmem_query_health() > and flags_show() are updated to populate and use this new struct > instead of a u64 integer that was earlier used. > Reviewed-by: Aneesh Kumar K.V > Cc: "Aneesh Kumar K . V" > Cc: Dan Williams > Cc: Michael Ellerman > Cc: Ira Weiny > Signed-off-by: Vaibhav Jain -aneesh
Re: [PATCH v8 4/5] ndctl/papr_scm,uapi: Add support for PAPR nvdimm specific methods
Vaibhav Jain writes: > Introduce support for PAPR NVDIMM Specific Methods (PDSM) in papr_scm > module and add the command family to the white list of NVDIMM command > sets. Also advertise support for ND_CMD_CALL for the nvdimm > command mask and implement necessary scaffolding in the module to > handle ND_CMD_CALL ioctl and PDSM requests that we receive. > > The layout of the PDSM request as we expect from libnvdimm/libndctl is > described in newly introduced uapi header 'papr_scm_pdsm.h' which > defines a new 'struct nd_pdsm_cmd_pkg' header. This header is used > to communicate the PDSM request via member > 'nd_pkg_papr_scm->nd_command' and size of payload that need to be > sent/received for servicing the PDSM. > > A new function is_cmd_valid() is implemented that reads the args to > papr_scm_ndctl() and performs sanity tests on them. A new function > papr_scm_service_pdsm() is introduced and is called from > papr_scm_ndctl() in case of a PDSM request is received via ND_CMD_CALL > command from libnvdimm. > Reviewed-by: Aneesh Kumar K.V > Cc: "Aneesh Kumar K . V" > Cc: Dan Williams > Cc: Michael Ellerman > Cc: Ira Weiny > Signed-off-by: Vaibhav Jain > --- -aneesh
Re: [PATCH v8 3/5] powerpc/papr_scm: Fetch nvdimm health information from PHYP
Vaibhav Jain writes: > Implement support for fetching nvdimm health information via > H_SCM_HEALTH hcall as documented in Ref[1]. The hcall returns a pair > of 64-bit bitmap, bitwise-and of which is then stored in > 'struct papr_scm_priv' and subsequently partially exposed to > user-space via newly introduced dimm specific attribute > 'papr/flags'. Since the hcall is costly, the health information is > cached and only re-queried, 60s after the previous successful hcall. > > The patch also adds a documentation text describing flags reported by > the the new sysfs attribute 'papr/flags' is also introduced at > Documentation/ABI/testing/sysfs-bus-papr-scm. > > [1] commit 58b278f568f0 ("powerpc: Provide initial documentation for > PAPR hcalls") > Reviewed-by: Aneesh Kumar K.V > Cc: "Aneesh Kumar K . V" > Cc: Dan Williams > Cc: Michael Ellerman > Cc: Ira Weiny > Signed-off-by: Vaibhav Jain
[PATCH] powerpc/fadump: account for memory_limit while reserving memory
If the memory chunk found for reserving memory overshoots the memory limit imposed, do not proceed with reserving memory. Default behavior was this until commit 140777a3d8df ("powerpc/fadump: consider reserved ranges while reserving memory") changed it unwittingly.

Reported-by: kbuild test robot
Fixes: 140777a3d8df ("powerpc/fadump: consider reserved ranges while reserving memory")
Cc: sta...@vger.kernel.org
Signed-off-by: Hari Bathini
---

For reference:
- https://lists.ozlabs.org/pipermail/linuxppc-dev/2020-May/211136.html

 arch/powerpc/kernel/fadump.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
index 63aac8b..78ab9a6 100644
--- a/arch/powerpc/kernel/fadump.c
+++ b/arch/powerpc/kernel/fadump.c
@@ -603,7 +603,7 @@ int __init fadump_reserve_mem(void)
 	 */
 	base = fadump_locate_reserve_mem(base, size);

-	if (!base) {
+	if (!base || (base + size > mem_boundary)) {
 		pr_err("Failed to find memory chunk for reservation!\n");
 		goto error_out;
 	}
Re: [PATCH v2] KVM: PPC: Book3S HV: relax check on H_SVM_INIT_ABORT
On 27/05/2020 at 06:16, Paul Mackerras wrote:
> On Wed, May 20, 2020 at 07:43:08PM +0200, Laurent Dufour wrote:
>> The commit 8c47b6ff29e3 ("KVM: PPC: Book3S HV: Check caller of H_SVM_*
>> Hcalls") added checks of secure bit of SRR1 to filter out the Hcall
>> reserved to the Ultravisor.
>>
>> However, the Hcall H_SVM_INIT_ABORT is made by the Ultravisor passing
>> the context of the VM calling UV_ESM. This allows the Hypervisor to
>> return to the guest without going through the Ultravisor. Thus the
>> Secure bit of SRR1 is not set in that particular case.
>>
>> In the case a regular VM is calling H_SVM_INIT_ABORT, this hcall will
>> be filtered out in kvmppc_h_svm_init_abort() because
>> kvm->arch.secure_guest is not set in that case.
>>
>> Fixes: 8c47b6ff29e3 ("KVM: PPC: Book3S HV: Check caller of H_SVM_* Hcalls")
>> Signed-off-by: Laurent Dufour
>
> Thanks, applied to my kvm-ppc-next branch. I expanded the comment in
> the code a little.

Thanks, the comment is more explicit now.

Laurent.
Re: [PATCH] media: omap3isp: Shuffle cacheflush.h and include mm.h
Hi Geert, On Wed, May 27, 2020 at 09:02:51AM +0200, Geert Uytterhoeven wrote: > Hi Nathan, > > CC Laurent > > On Wed, May 27, 2020 at 6:37 AM Nathan Chancellor > wrote: > > After mm.h was removed from the asm-generic version of cacheflush.h, > > s390 allyesconfig shows several warnings of the following nature: > > > > In file included from ./arch/s390/include/generated/asm/cacheflush.h:1, > > from drivers/media/platform/omap3isp/isp.c:42: > > ./include/asm-generic/cacheflush.h:16:42: warning: 'struct mm_struct' > > declared inside parameter list will not be visible outside of this > > definition or declaration > > > > cacheflush.h does not include mm.h nor does it include any forward > > declaration of these structures hence the warning. To avoid this, > > include mm.h explicitly in this file and shuffle cacheflush.h below it. > > > > Fixes: 19c0054597a0 ("asm-generic: don't include in > > cacheflush.h") > > Signed-off-by: Nathan Chancellor > > Thanks for your patch! > > > I am aware the fixes tag is kind of irrelevant because that SHA will > > change in the next linux-next revision and this will probably get folded > > into the original patch anyways but still. > > > > The other solution would be to add forward declarations of these structs > > to the top of cacheflush.h, I just chose to do what Christoph did in the > > original patch. I am happy to do that instead if you all feel that is > > better. > > That actually looks like a better solution to me, as it would address the > problem for all users. 
> > > drivers/media/platform/omap3isp/isp.c | 5 +++-- > > 1 file changed, 3 insertions(+), 2 deletions(-) > > > > diff --git a/drivers/media/platform/omap3isp/isp.c > > b/drivers/media/platform/omap3isp/isp.c > > index a4ee6b86663e..54106a768e54 100644 > > --- a/drivers/media/platform/omap3isp/isp.c > > +++ b/drivers/media/platform/omap3isp/isp.c > > @@ -39,8 +39,6 @@ > > * Troy Laramy > > */ > > > > -#include > > - > > #include > > #include > > #include > > @@ -49,6 +47,7 @@ > > #include > > #include > > #include > > +#include > > #include > > #include > > #include > > @@ -58,6 +57,8 @@ > > #include > > #include > > > > +#include > > + > > #ifdef CONFIG_ARM_DMA_USE_IOMMU > > #include > > #endif > > Why does this file need at all? > It doesn't call any of the flush_*() functions, and seems to compile fine > without (on arm32). > > Perhaps it was included at the top intentionally, to override the definitions > of copy_{to,from}_user_page()? Fortunately that doesn't seem to be the > case, from a quick look at the assembler output. > > So let's just remove the #include instead? Sounds good to me. I can send a patch if needed or I suppose Andrew can just make a small fixup patch for it. Let me know what I should do. Cheers, Nathan
Re: [PATCH v3 1/3] riscv: Move kernel mapping to vmalloc zone
Hi Alexandre,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on powerpc/next]
[also build test WARNING on linus/master v5.7-rc7 next-20200526]
[if your patch is applied to the wrong git tree, please drop us a note to
help improve the system. BTW, we also suggest to use '--base' option to
specify the base tree in git format-patch, please see
https://stackoverflow.com/a/37406982]

url: https://github.com/0day-ci/linux/commits/Alexandre-Ghiti/vmalloc-kernel-mapping-and-relocatable-kernel/20200524-170109
base: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git next
config: riscv-allyesconfig (attached as .config)
compiler: riscv64-linux-gcc (GCC) 9.3.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross ARCH=riscv

If you fix the issue, kindly add following tag as appropriate
Reported-by: kbuild test robot

All warnings (new ones prefixed by >>, old ones prefixed by <<):

>> arch/riscv/mm/init.c:383:6: warning: no previous prototype for 'create_kernel_page_table' [-Wmissing-prototypes]
     383 | void create_kernel_page_table(pgd_t *pgdir, uintptr_t map_size)
         |      ^~~~

vim +/create_kernel_page_table +383 arch/riscv/mm/init.c

   382
 > 383	void create_kernel_page_table(pgd_t *pgdir, uintptr_t map_size)
   384	{
   385		uintptr_t va, end_va;
   386
   387		end_va = kernel_virt_addr + load_sz;
   388		for (va = kernel_virt_addr; va < end_va; va += map_size)
   389			create_pgd_mapping(pgdir, va,
   390					   load_pa + (va - kernel_virt_addr),
   391					   map_size, PAGE_KERNEL_EXEC);
   392	}
   393

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-...@lists.01.org

.config.gz
Description: application/gzip
Re: [PATCH v3 1/3] riscv: Move kernel mapping to vmalloc zone
Le 5/27/20 à 2:05 AM, Zong Li a écrit :
> On Wed, May 27, 2020 at 1:06 AM Alex Ghiti wrote:
>>
>> Hi Zong,
>>
>> Le 5/26/20 à 5:43 AM, Zong Li a écrit :
>>> On Sun, May 24, 2020 at 4:54 PM Alexandre Ghiti wrote:
>>>> This is a preparatory patch for relocatable kernel.
>>>>
>>>> The kernel used to be linked at PAGE_OFFSET address and used to be loaded
>>>> physically at the beginning of the main memory. Therefore, we could use
>>>> the linear mapping for the kernel mapping.
>>>>
>>>> But the relocated kernel base address will be different from PAGE_OFFSET
>>>> and since in the linear mapping, two different virtual addresses cannot
>>>> point to the same physical address, the kernel mapping needs to lie outside
>>>> the linear mapping.
>>>>
>>>> In addition, because modules and BPF must be close to the kernel (inside
>>>> +-2GB window), the kernel is placed at the end of the vmalloc zone minus
>>>> 2GB, which leaves room for modules and BPF. The kernel could not be
>>>> placed at the beginning of the vmalloc zone since other vmalloc
>>>> allocations from the kernel could get all the +-2GB window around the
>>>> kernel which would prevent new modules and BPF programs to be loaded.
>>>>
>>>> Signed-off-by: Alexandre Ghiti
>>>> ---
>>>>  arch/riscv/boot/loader.lds.S     |  3 +-
>>>>  arch/riscv/include/asm/page.h    | 10 +-
>>>>  arch/riscv/include/asm/pgtable.h | 37 +---
>>>>  arch/riscv/kernel/head.S         |  3 +-
>>>>  arch/riscv/kernel/module.c       |  4 +--
>>>>  arch/riscv/kernel/vmlinux.lds.S  |  3 +-
>>>>  arch/riscv/mm/init.c             | 58 +---
>>>>  arch/riscv/mm/physaddr.c         |  2 +-
>>>>  8 files changed, 87 insertions(+), 33 deletions(-)
>>>>
>>>> diff --git a/arch/riscv/boot/loader.lds.S b/arch/riscv/boot/loader.lds.S
>>>> index 47a5003c2e28..62d94696a19c 100644
>>>> --- a/arch/riscv/boot/loader.lds.S
>>>> +++ b/arch/riscv/boot/loader.lds.S
>>>> @@ -1,13 +1,14 @@
>>>>  /* SPDX-License-Identifier: GPL-2.0 */
>>>>
>>>>  #include
>>>> +#include
>>>>
>>>>  OUTPUT_ARCH(riscv)
>>>>  ENTRY(_start)
>>>>
>>>>  SECTIONS
>>>>  {
>>>> -       . = PAGE_OFFSET;
>>>> +       . = KERNEL_LINK_ADDR;
>>>>
>>>>         .payload : {
>>>>                 *(.payload)
>>>> diff --git a/arch/riscv/include/asm/page.h b/arch/riscv/include/asm/page.h
>>>> index 2d50f76efe48..48bb09b6a9b7 100644
>>>> --- a/arch/riscv/include/asm/page.h
>>>> +++ b/arch/riscv/include/asm/page.h
>>>> @@ -90,18 +90,26 @@ typedef struct page *pgtable_t;
>>>>
>>>>  #ifdef CONFIG_MMU
>>>>  extern unsigned long va_pa_offset;
>>>> +extern unsigned long va_kernel_pa_offset;
>>>>  extern unsigned long pfn_base;
>>>>  #define ARCH_PFN_OFFSET (pfn_base)
>>>>  #else
>>>>  #define va_pa_offset 0
>>>> +#define va_kernel_pa_offset 0
>>>>  #define ARCH_PFN_OFFSET (PAGE_OFFSET >> PAGE_SHIFT)
>>>>  #endif /* CONFIG_MMU */
>>>>
>>>>  extern unsigned long max_low_pfn;
>>>>  extern unsigned long min_low_pfn;
>>>> +extern unsigned long kernel_virt_addr;
>>>>
>>>>  #define __pa_to_va_nodebug(x) ((void *)((unsigned long) (x) + va_pa_offset))
>>>> -#define __va_to_pa_nodebug(x) ((unsigned long)(x) - va_pa_offset)
>>>> +#define linear_mapping_va_to_pa(x) ((unsigned long)(x) - va_pa_offset)
>>>> +#define kernel_mapping_va_to_pa(x) \
>>>> +       ((unsigned long)(x) - va_kernel_pa_offset)
>>>> +#define __va_to_pa_nodebug(x) \
>>>> +       (((x) >= PAGE_OFFSET) ? \
>>>> +       linear_mapping_va_to_pa(x) : kernel_mapping_va_to_pa(x))
>>>>
>>>>  #ifdef CONFIG_DEBUG_VIRTUAL
>>>>  extern phys_addr_t __virt_to_phys(unsigned long x);
>>>> diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
>>>> index 35b60035b6b0..25213cfaf680 100644
>>>> --- a/arch/riscv/include/asm/pgtable.h
>>>> +++ b/arch/riscv/include/asm/pgtable.h
>>>> @@ -11,23 +11,29 @@
>>>>
>>>>  #include
>>>>
>>>> -#ifndef __ASSEMBLY__
>>>> -
>>>> -/* Page Upper Directory not used in RISC-V */
>>>> -#include
>>>> -#include
>>>> -#include
>>>> -#include
>>>> -
>>>> -#ifdef CONFIG_MMU
>>>> +#ifndef CONFIG_MMU
>>>> +#define KERNEL_VIRT_ADDR PAGE_OFFSET
>>>> +#define KERNEL_LINK_ADDR PAGE_OFFSET
>>>> +#else
>>>> +/*
>>>> + * Leave 2GB for modules and BPF that must lie within a 2GB range around
>>>> + * the kernel.
>>>> + */
>>>> +#define KERNEL_VIRT_ADDR (VMALLOC_END - SZ_2G + 1)
>>>> +#define KERNEL_LINK_ADDR KERNEL_VIRT_ADDR
>>>>
>>>>  #define VMALLOC_SIZE (KERN_VIRT_SIZE >> 1)
>>>>  #define VMALLOC_END (PAGE_OFFSET - 1)
>>>>  #define VMALLOC_START (PAGE_OFFSET - VMALLOC_SIZE)
>>>>
>>>>  #define BPF_JIT_REGION_SIZE (SZ_128M)
>>>> -#define BPF_JIT_REGION_START (PAGE_OFFSET - BPF_JIT_REGION_SIZE)
>>>> -#define BPF_JIT_REGION_END (VMALLOC_END)
>>>> +#define BPF_JIT_REGION_START (kernel_virt_addr)
>>>> +#define BPF_JIT_REGION_END (kernel_virt_addr + BPF_JIT_REGION_SIZE)
>>>
>>> It seems to have a potential risk here: the bpf region overlaps the
>>> kernel mapping, so if the kernel size is bigger than 128MB, the bpf
>>> region would be occupied and used up by the kernel mapping.
>
> Is there the risk as I mentioned?

Sorry, I forgot to answer this one: I was confident that 128MB was large
enough for the kernel and BPF. But I see no reason to leave this risk, so
I'll change kernel_virt_addr for _end so that BPF will have its 128MB
reserved.

Thanks!
Alex

>>>> +
>>>> +#ifdef CONFIG_64BIT
>>>> +#define VMALLOC_MODULE_START BPF_JIT_REGION_END
Re: [PATCH] media: omap3isp: Shuffle cacheflush.h and include mm.h
Hi Nathan, CC Laurent On Wed, May 27, 2020 at 6:37 AM Nathan Chancellor wrote: > After mm.h was removed from the asm-generic version of cacheflush.h, > s390 allyesconfig shows several warnings of the following nature: > > In file included from ./arch/s390/include/generated/asm/cacheflush.h:1, > from drivers/media/platform/omap3isp/isp.c:42: > ./include/asm-generic/cacheflush.h:16:42: warning: 'struct mm_struct' > declared inside parameter list will not be visible outside of this > definition or declaration > > cacheflush.h does not include mm.h nor does it include any forward > declaration of these structures hence the warning. To avoid this, > include mm.h explicitly in this file and shuffle cacheflush.h below it. > > Fixes: 19c0054597a0 ("asm-generic: don't include in > cacheflush.h") > Signed-off-by: Nathan Chancellor Thanks for your patch! > I am aware the fixes tag is kind of irrelevant because that SHA will > change in the next linux-next revision and this will probably get folded > into the original patch anyways but still. > > The other solution would be to add forward declarations of these structs > to the top of cacheflush.h, I just chose to do what Christoph did in the > original patch. I am happy to do that instead if you all feel that is > better. That actually looks like a better solution to me, as it would address the problem for all users. 
> drivers/media/platform/omap3isp/isp.c | 5 +++--
> 1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/media/platform/omap3isp/isp.c b/drivers/media/platform/omap3isp/isp.c
> index a4ee6b86663e..54106a768e54 100644
> --- a/drivers/media/platform/omap3isp/isp.c
> +++ b/drivers/media/platform/omap3isp/isp.c
> @@ -39,8 +39,6 @@
>  * Troy Laramy
>  */
>
> -#include
> -
>  #include
>  #include
>  #include
> @@ -49,6 +47,7 @@
>  #include
>  #include
>  #include
> +#include
>  #include
>  #include
>  #include
> @@ -58,6 +57,8 @@
>  #include
>  #include
>
> +#include
> +
>  #ifdef CONFIG_ARM_DMA_USE_IOMMU
>  #include
>  #endif

Why does this file need cacheflush.h at all?
It doesn't call any of the flush_*() functions, and seems to compile fine
without it (on arm32).

Perhaps it was included at the top intentionally, to override the
definitions of copy_{to,from}_user_page()? Fortunately that doesn't seem
to be the case, from a quick look at the assembler output.

So let's just remove the #include instead?

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker.
But when I'm talking to journalists I just say "programmer" or something
like that.
-- Linus Torvalds
Re: [PATCH v4 6/7] KVM: MIPS: clean up redundant 'kvm_run' parameters
On 2020/4/27 13:40, Huacai Chen wrote:
> Reviewed-by: Huacai Chen
>
> On Mon, Apr 27, 2020 at 12:35 PM Tianjia Zhang wrote:
>> In the current kvm version, 'kvm_run' has been included in the 'kvm_vcpu'
>> structure. For historical reasons, many kvm-related function parameters
>> retain the 'kvm_run' and 'kvm_vcpu' parameters at the same time. This
>> patch does a unified cleanup of these remaining redundant parameters.
>>
>> Signed-off-by: Tianjia Zhang
>> ---
>>  arch/mips/include/asm/kvm_host.h |  28 +---
>>  arch/mips/kvm/emulate.c          |  59 ++--
>>  arch/mips/kvm/mips.c             |  11 ++-
>>  arch/mips/kvm/trap_emul.c        | 114 ++-
>>  arch/mips/kvm/vz.c               |  26 +++
>>  5 files changed, 87 insertions(+), 151 deletions(-)

Hi Huacai,

These two patches (6/7 and 7/7) should be merged through the MIPS
architecture tree separately. At present, there seems to be no good way
to merge the whole cross-architecture series at once: for this series,
some architectures' patches have already been merged, and some still
need an updated patch.

Thanks and best,
Tianjia
Re: [PATCH v3 1/3] riscv: Move kernel mapping to vmalloc zone
On Wed, May 27, 2020 at 1:06 AM Alex Ghiti wrote: > > Hi Zong, > > Le 5/26/20 à 5:43 AM, Zong Li a écrit : > > On Sun, May 24, 2020 at 4:54 PM Alexandre Ghiti wrote: > >> This is a preparatory patch for relocatable kernel. > >> > >> The kernel used to be linked at PAGE_OFFSET address and used to be loaded > >> physically at the beginning of the main memory. Therefore, we could use > >> the linear mapping for the kernel mapping. > >> > >> But the relocated kernel base address will be different from PAGE_OFFSET > >> and since in the linear mapping, two different virtual addresses cannot > >> point to the same physical address, the kernel mapping needs to lie outside > >> the linear mapping. > >> > >> In addition, because modules and BPF must be close to the kernel (inside > >> +-2GB window), the kernel is placed at the end of the vmalloc zone minus > >> 2GB, which leaves room for modules and BPF. The kernel could not be > >> placed at the beginning of the vmalloc zone since other vmalloc > >> allocations from the kernel could get all the +-2GB window around the > >> kernel which would prevent new modules and BPF programs to be loaded. > >> > >> Signed-off-by: Alexandre Ghiti > >> --- > >> arch/riscv/boot/loader.lds.S | 3 +- > >> arch/riscv/include/asm/page.h| 10 +- > >> arch/riscv/include/asm/pgtable.h | 37 +--- > >> arch/riscv/kernel/head.S | 3 +- > >> arch/riscv/kernel/module.c | 4 +-- > >> arch/riscv/kernel/vmlinux.lds.S | 3 +- > >> arch/riscv/mm/init.c | 58 +--- > >> arch/riscv/mm/physaddr.c | 2 +- > >> 8 files changed, 87 insertions(+), 33 deletions(-) > >> > >> diff --git a/arch/riscv/boot/loader.lds.S b/arch/riscv/boot/loader.lds.S > >> index 47a5003c2e28..62d94696a19c 100644 > >> --- a/arch/riscv/boot/loader.lds.S > >> +++ b/arch/riscv/boot/loader.lds.S > >> @@ -1,13 +1,14 @@ > >> /* SPDX-License-Identifier: GPL-2.0 */ > >> > >> #include > >> +#include > >> > >> OUTPUT_ARCH(riscv) > >> ENTRY(_start) > >> > >> SECTIONS > >> { > >> - . = PAGE_OFFSET; > >> + . 
= KERNEL_LINK_ADDR; > >> > >> .payload : { > >> *(.payload) > >> diff --git a/arch/riscv/include/asm/page.h b/arch/riscv/include/asm/page.h > >> index 2d50f76efe48..48bb09b6a9b7 100644 > >> --- a/arch/riscv/include/asm/page.h > >> +++ b/arch/riscv/include/asm/page.h > >> @@ -90,18 +90,26 @@ typedef struct page *pgtable_t; > >> > >> #ifdef CONFIG_MMU > >> extern unsigned long va_pa_offset; > >> +extern unsigned long va_kernel_pa_offset; > >> extern unsigned long pfn_base; > >> #define ARCH_PFN_OFFSET(pfn_base) > >> #else > >> #define va_pa_offset 0 > >> +#define va_kernel_pa_offset0 > >> #define ARCH_PFN_OFFSET(PAGE_OFFSET >> PAGE_SHIFT) > >> #endif /* CONFIG_MMU */ > >> > >> extern unsigned long max_low_pfn; > >> extern unsigned long min_low_pfn; > >> +extern unsigned long kernel_virt_addr; > >> > >> #define __pa_to_va_nodebug(x) ((void *)((unsigned long) (x) + > >> va_pa_offset)) > >> -#define __va_to_pa_nodebug(x) ((unsigned long)(x) - va_pa_offset) > >> +#define linear_mapping_va_to_pa(x) ((unsigned long)(x) - va_pa_offset) > >> +#define kernel_mapping_va_to_pa(x) \ > >> + ((unsigned long)(x) - va_kernel_pa_offset) > >> +#define __va_to_pa_nodebug(x) \ > >> + (((x) >= PAGE_OFFSET) ? 
\ > >> + linear_mapping_va_to_pa(x) : kernel_mapping_va_to_pa(x)) > >> > >> #ifdef CONFIG_DEBUG_VIRTUAL > >> extern phys_addr_t __virt_to_phys(unsigned long x); > >> diff --git a/arch/riscv/include/asm/pgtable.h > >> b/arch/riscv/include/asm/pgtable.h > >> index 35b60035b6b0..25213cfaf680 100644 > >> --- a/arch/riscv/include/asm/pgtable.h > >> +++ b/arch/riscv/include/asm/pgtable.h > >> @@ -11,23 +11,29 @@ > >> > >> #include > >> > >> -#ifndef __ASSEMBLY__ > >> - > >> -/* Page Upper Directory not used in RISC-V */ > >> -#include > >> -#include > >> -#include > >> -#include > >> - > >> -#ifdef CONFIG_MMU > >> +#ifndef CONFIG_MMU > >> +#define KERNEL_VIRT_ADDR PAGE_OFFSET > >> +#define KERNEL_LINK_ADDR PAGE_OFFSET > >> +#else > >> +/* > >> + * Leave 2GB for modules and BPF that must lie within a 2GB range around > >> + * the kernel. > >> + */ > >> +#define KERNEL_VIRT_ADDR (VMALLOC_END - SZ_2G + 1) > >> +#define KERNEL_LINK_ADDR KERNEL_VIRT_ADDR > >> > >> #define VMALLOC_SIZE (KERN_VIRT_SIZE >> 1) > >> #define VMALLOC_END (PAGE_OFFSET - 1) > >> #define VMALLOC_START(PAGE_OFFSET - VMALLOC_SIZE) > >> > >> #define BPF_JIT_REGION_SIZE(SZ_128M) > >> -#define BPF_JIT_REGION_START (PAGE_OFFSET - BPF_JIT_REGION_SIZE) > >> -#define BPF_JIT_REGION_END (VMALLOC_END) > >> +#define BPF_JIT_REGION_START (kernel_virt_addr) > >> +#define BPF_JIT_REGION_END (kernel_virt_addr + BPF_JIT_REGION_SIZE) > > It seems to have a