Re: [PATCH] tracing: fix UAF caused by memory ordering issue
Mark Rutland wrote on Tue, 14 Nov 2023 at 06:17:

Hi Mark and Steven, thank you so much for the detailed comments.

> On Sun, Nov 12, 2023 at 11:00:30PM +0800, Kairui Song wrote:
> > From: Kairui Song
> >
> > Following kernel panic was observed when doing ftrace stress test:
>
> Can you share some more details:
>
> * What test specifically are you running? Can you share this so that
>   others can try to reproduce the issue?

Yes, the panic happened when doing the LTP ftrace stress test:
https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/tracing/ftrace_test/ftrace_stress_test.sh

> * Which machines are you testing on (i.e. which CPU microarchitecture
>   is this seen with)?

The panic was seen on an ARM64 VM. lscpu output:

Architecture:         aarch64
CPU op-mode(s):       64-bit
Byte Order:           Little Endian
CPU(s):               4
On-line CPU(s) list:  0-3
Vendor ID:            HiSilicon
BIOS Vendor ID:       QEMU
Model name:           Kunpeng-920
BIOS Model name:      virt-rhel8.6.0 CPU @ 2.0GHz
BIOS CPU family:      1
Model:                0
Thread(s) per core:   1
Core(s) per socket:   1
Socket(s):            4
Stepping:             0x1
BogoMIPS:             200.00
Flags:                fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma dcpop asimddp asimdfhm

The host machine is a Kunpeng-920 with 4 NUMA nodes and 128 cores.

> * Which compiler are you using?

gcc 12.3.1

> * The log shows this is with v6.1.61+. Can you reproduce this with a
>   mainline kernel? e.g. v6.6 or v6.7-rc1?

It's reproducible with LTS; I haven't tested mainline yet. I'll try to
reproduce this with the latest mainline, but due to the low
reproducibility this may take a while.
> > > Unable to handle kernel paging request at virtual address 9699b0f8ece28240 > > Mem abort info: > > ESR = 0x9604 > > EC = 0x25: DABT (current EL), IL = 32 bits > > SET = 0, FnV = 0 > > EA = 0, S1PTW = 0 > > FSC = 0x04: level 0 translation fault > > Data abort info: > > ISV = 0, ISS = 0x0004 > > CM = 0, WnR = 0 > > [9699b0f8ece28240] address between user and kernel address ranges > > Internal error: Oops: 9604 [#1] SMP > > Modules linked in: rpcrdma rdma_cm iw_cm ib_cm ib_core rfkill vfat fat loop > > fuse nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables ext4 mbcache > > jbd2 sr_mod cdrom crct10dif_ce ghash_ce sha2_ce virtio_gpu virtio_dma_buf > > drm_shmem_helper virtio_blk drm_kms_helper syscopyarea sysfillrect > > sysimgblt fb_sys_fops virtio_console sha256_arm64 sha1_ce drm virtio_scsi > > i2c_core virtio_net net_failover failover virtio_mmio dm_multipath dm_mod > > autofs4 [last unloaded: ipmi_msghandler] > > CPU: 0 PID: 499719 Comm: sh Kdump: loaded Not tainted 6.1.61+ #2 > > Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015 > > pstate: 6045 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) > > pc : __kmem_cache_alloc_node+0x1dc/0x2e4 > > lr : __kmem_cache_alloc_node+0xac/0x2e4 > > sp : 8ad23aa0 > > x29: 8ad23ab0 x28: 0004052b8000 x27: c513863b > > x26: 0040 x25: c51384f21ca4 x24: > > x23: d615521430b1b1a5 x22: c51386044770 x21: > > x20: 0cc0 x19: c0001200 x18: > > x17: x16: x15: e65e1630 > > x14: 0004 x13: c513863e67a0 x12: c513863af6d8 > > x11: 0001 x10: 8ad23aa0 x9 : c51385058078 > > x8 : 0018 x7 : 0001 x6 : 0010 > > x5 : c09c2280 x4 : c51384f21ca4 x3 : 0040 > > x2 : 9699b0f8ece28240 x1 : c09c2280 x0 : 9699b0f8ece28200 > > Call trace: > > __kmem_cache_alloc_node+0x1dc/0x2e4 > > __kmalloc+0x6c/0x1c0 > > func_add+0x1a4/0x200 > > tracepoint_add_func+0x70/0x230 > > tracepoint_probe_register+0x6c/0xb4 > > trace_event_reg+0x8c/0xa0 > > __ftrace_event_enable_disable+0x17c/0x440 > > __ftrace_set_clr_event_nolock+0xe0/0x150 > > 
system_enable_write+0xe0/0x114 > > vfs_write+0xd0/0x2dc > > ksys_write+0x78/0x110 > > __arm64_sys_write+0x24/0x30 > > invoke_syscall.constprop.0+0x58/0xf0 > > el0_svc_common.constprop.0+0x54/0x160 > > do_el0_svc+0x2c/0x60 > > el0_svc+0x40/0x1ac > > el0t_64_sync_handler+0xf4/0x120 > > el0t_64_sync+0x19c/0x1a0 > > Code: b9402a63 f9405e77 8b030002 d5384101 (f8636803) > > > > Panic was caused by corrupted freelist pointer. After more debugging, > > I found the root
[PATCH] tracing: fix UAF caused by memory ordering issue
From: Kairui Song Following kernel panic was observed when doing ftrace stress test: Unable to handle kernel paging request at virtual address 9699b0f8ece28240 Mem abort info: ESR = 0x9604 EC = 0x25: DABT (current EL), IL = 32 bits SET = 0, FnV = 0 EA = 0, S1PTW = 0 FSC = 0x04: level 0 translation fault Data abort info: ISV = 0, ISS = 0x0004 CM = 0, WnR = 0 [9699b0f8ece28240] address between user and kernel address ranges Internal error: Oops: 9604 [#1] SMP Modules linked in: rpcrdma rdma_cm iw_cm ib_cm ib_core rfkill vfat fat loop fuse nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables ext4 mbcache jbd2 sr_mod cdrom crct10dif_ce ghash_ce sha2_ce virtio_gpu virtio_dma_buf drm_shmem_helper virtio_blk drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops virtio_console sha256_arm64 sha1_ce drm virtio_scsi i2c_core virtio_net net_failover failover virtio_mmio dm_multipath dm_mod autofs4 [last unloaded: ipmi_msghandler] CPU: 0 PID: 499719 Comm: sh Kdump: loaded Not tainted 6.1.61+ #2 Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015 pstate: 6045 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : __kmem_cache_alloc_node+0x1dc/0x2e4 lr : __kmem_cache_alloc_node+0xac/0x2e4 sp : 8ad23aa0 x29: 8ad23ab0 x28: 0004052b8000 x27: c513863b x26: 0040 x25: c51384f21ca4 x24: x23: d615521430b1b1a5 x22: c51386044770 x21: x20: 0cc0 x19: c0001200 x18: x17: x16: x15: e65e1630 x14: 0004 x13: c513863e67a0 x12: c513863af6d8 x11: 0001 x10: 8ad23aa0 x9 : c51385058078 x8 : 0018 x7 : 0001 x6 : 0010 x5 : c09c2280 x4 : c51384f21ca4 x3 : 0040 x2 : 9699b0f8ece28240 x1 : c09c2280 x0 : 9699b0f8ece28200 Call trace: __kmem_cache_alloc_node+0x1dc/0x2e4 __kmalloc+0x6c/0x1c0 func_add+0x1a4/0x200 tracepoint_add_func+0x70/0x230 tracepoint_probe_register+0x6c/0xb4 trace_event_reg+0x8c/0xa0 __ftrace_event_enable_disable+0x17c/0x440 __ftrace_set_clr_event_nolock+0xe0/0x150 system_enable_write+0xe0/0x114 vfs_write+0xd0/0x2dc ksys_write+0x78/0x110 __arm64_sys_write+0x24/0x30 
invoke_syscall.constprop.0+0x58/0xf0
el0_svc_common.constprop.0+0x54/0x160
do_el0_svc+0x2c/0x60
el0_svc+0x40/0x1ac
el0t_64_sync_handler+0xf4/0x120
el0t_64_sync+0x19c/0x1a0
Code: b9402a63 f9405e77 8b030002 d5384101 (f8636803)

Panic was caused by a corrupted freelist pointer. After more debugging,
I found the root cause is a use-after-free of a slab-allocated object in
ftrace, introduced by commit eecb91b9f98d ("tracing: Fix memleak due to
race between current_tracer and trace"). So far it's only reproducible
on some ARM64 machines. The use-after-free and free stacks are:

UAF:
kasan_report+0xa8/0x1bc
__asan_report_load8_noabort+0x28/0x3c
print_graph_function_flags+0x524/0x5a0
print_graph_function_event+0x28/0x40
print_trace_line+0x5c4/0x1030
s_show+0xf0/0x460
seq_read_iter+0x930/0xf5c
seq_read+0x130/0x1d0
vfs_read+0x288/0x840
ksys_read+0x130/0x270
__arm64_sys_read+0x78/0xac
invoke_syscall.constprop.0+0x90/0x224
do_el0_svc+0x118/0x3dc
el0_svc+0x54/0x120
el0t_64_sync_handler+0xf4/0x120
el0t_64_sync+0x19c/0x1a0

Freed by:
kasan_save_free_info+0x38/0x5c
__kasan_slab_free+0xe8/0x154
slab_free_freelist_hook+0xfc/0x1e0
__kmem_cache_free+0x138/0x260
kfree+0xd0/0x1d0
graph_trace_close+0x60/0x90
s_start+0x610/0x910
seq_read_iter+0x274/0xf5c
seq_read+0x130/0x1d0
vfs_read+0x288/0x840
ksys_read+0x130/0x270
__arm64_sys_read+0x78/0xac
invoke_syscall.constprop.0+0x90/0x224
do_el0_svc+0x118/0x3dc
el0_svc+0x54/0x120
el0t_64_sync_handler+0xf4/0x120
el0t_64_sync+0x19c/0x1a0

Despite s_start and s_show being serialized by the seq_file mutex, the
tracer struct copy in s_start introduced by the commit mentioned above
is neither atomic nor guaranteed to be seen by all CPUs, so the
following scenario is possible (and actually happened):

CPU 1                                    CPU 2
seq_read_iter                            seq_read_iter
  mutex_lock(&m->lock);
  s_start
    // iter->trace is graph_trace
    iter->trace->close(iter);
      graph_trace_close
        kfree(data) <- *** data released here ***
    // copy current_trace to iter->trace
    // but not synced to CPU 2
    *iter->trace = *tr->current_trace
  ... (goes on)
  mutex_unlock(&m->lock);
                                           mutex_lock(&m->lock);
                                           ... (s_start and other work)
                                           s_show
                                             print_trace_line(iter)
                                               // iter->trace is still the
                                               // old value (graph_trace)
                                               iter->trace->print_line()
[PATCH] efi: memmap insertion should adjust the vaddr as well
Currently when efi_memmap_insert is called, only the physical memory
addresses are re-calculated. The virtual addresses of the split entries
are untouched, so if any later operation depends on the virtual address
info, things will go wrong.

One case where this may fail is kexec on x86: after kexec, EFI is
already in virtual mode, so the kernel simply does a fixed mapping that
reuses the recorded virtual address. If the virtual address is
incorrect, the mapping will be invalid.

Update the virtual address as well when inserting a memmap entry to fix
this potential issue.

Signed-off-by: Kairui Song
---
 drivers/firmware/efi/memmap.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/drivers/firmware/efi/memmap.c b/drivers/firmware/efi/memmap.c
index 2ff1883dc788..de5c545b2074 100644
--- a/drivers/firmware/efi/memmap.c
+++ b/drivers/firmware/efi/memmap.c
@@ -292,7 +292,7 @@ void __init efi_memmap_insert(struct efi_memory_map *old_memmap, void *buf,
 {
 	u64 m_start, m_end, m_attr;
 	efi_memory_desc_t *md;
-	u64 start, end;
+	u64 start, end, virt_offset;
 	void *old, *new;
 
 	/* modifying range */
@@ -321,6 +321,11 @@ void __init efi_memmap_insert(struct efi_memory_map *old_memmap, void *buf,
 		start = md->phys_addr;
 		end = md->phys_addr + (md->num_pages << EFI_PAGE_SHIFT) - 1;
 
+		if (md->virt_addr)
+			virt_offset = md->virt_addr - md->phys_addr;
+		else
+			virt_offset = -1;
+
 		if (m_start <= start && end <= m_end)
 			md->attribute |= m_attr;
 
@@ -337,6 +342,8 @@ void __init efi_memmap_insert(struct efi_memory_map *old_memmap, void *buf,
 			md->phys_addr = m_end + 1;
 			md->num_pages = (end - md->phys_addr + 1) >> EFI_PAGE_SHIFT;
+			if (virt_offset != -1)
+				md->virt_addr = md->phys_addr + virt_offset;
 		}
 
 		if ((start < m_start && m_start < end) && m_end < end) {
@@ -351,6 +358,8 @@ void __init efi_memmap_insert(struct efi_memory_map *old_memmap, void *buf,
 			md->phys_addr = m_start;
 			md->num_pages = (m_end - m_start + 1) >> EFI_PAGE_SHIFT;
+			if (virt_offset != -1)
+				md->virt_addr = md->phys_addr + virt_offset;
 
 			/* last part */
 			new += old_memmap->desc_size;
 			memcpy(new, old, old_memmap->desc_size);
@@ -358,6 +367,8 @@ void __init efi_memmap_insert(struct efi_memory_map *old_memmap, void *buf,
 			md->phys_addr = m_end + 1;
 			md->num_pages = (end - m_end) >> EFI_PAGE_SHIFT;
+			if (virt_offset != -1)
+				md->virt_addr = md->phys_addr + virt_offset;
 		}
 
 		if ((start < m_start && m_start < end) &&
@@ -373,6 +384,8 @@ void __init efi_memmap_insert(struct efi_memory_map *old_memmap, void *buf,
 			md->num_pages = (end - md->phys_addr + 1) >> EFI_PAGE_SHIFT;
 			md->attribute |= m_attr;
+			if (virt_offset != -1)
+				md->virt_addr = md->phys_addr + virt_offset;
 		}
 	}
 }
-- 
2.29.2
Re: [PATCH v4 1/1] kernel/crash_core: Add crashkernel=auto for vmcore creation
On Wed, Feb 24, 2021 at 1:45 AM Saeed Mirzamohammadi wrote: > > This adds crashkernel=auto feature to configure reserved memory for > vmcore creation. CONFIG_CRASH_AUTO_STR is defined to be set for > different kernel distributions and different archs based on their > needs. > > Signed-off-by: Saeed Mirzamohammadi > Signed-off-by: John Donnelly > Tested-by: John Donnelly > --- > Documentation/admin-guide/kdump/kdump.rst | 3 ++- > .../admin-guide/kernel-parameters.txt | 6 ++ > arch/Kconfig | 20 +++ > kernel/crash_core.c | 7 +++ > 4 files changed, 35 insertions(+), 1 deletion(-) > > diff --git a/Documentation/admin-guide/kdump/kdump.rst > b/Documentation/admin-guide/kdump/kdump.rst > index 75a9dd98e76e..ae030111e22a 100644 > --- a/Documentation/admin-guide/kdump/kdump.rst > +++ b/Documentation/admin-guide/kdump/kdump.rst > @@ -285,7 +285,8 @@ This would mean: > 2) if the RAM size is between 512M and 2G (exclusive), then reserve 64M > 3) if the RAM size is larger than 2G, then reserve 128M > > - > +Or you can use crashkernel=auto to choose the crash kernel memory size > +based on the recommended configuration set for each arch. > > Boot into System Kernel > === > diff --git a/Documentation/admin-guide/kernel-parameters.txt > b/Documentation/admin-guide/kernel-parameters.txt > index 9e3cdb271d06..a5deda5c85fe 100644 > --- a/Documentation/admin-guide/kernel-parameters.txt > +++ b/Documentation/admin-guide/kernel-parameters.txt > @@ -747,6 +747,12 @@ > a memory unit (amount[KMG]). See also > Documentation/admin-guide/kdump/kdump.rst for an > example. > > + crashkernel=auto > + [KNL] This parameter will set the reserved memory for > + the crash kernel based on the value of the > CRASH_AUTO_STR > + that is the best effort estimation for each arch. See > also > + arch/Kconfig for further details. > + > crashkernel=size[KMG],high > [KNL, X86-64] range could be above 4G. 
Allow kernel > to allocate physical memory region from top, so could > diff --git a/arch/Kconfig b/arch/Kconfig > index 24862d15f3a3..23d047548772 100644 > --- a/arch/Kconfig > +++ b/arch/Kconfig > @@ -14,6 +14,26 @@ menu "General architecture-dependent options" > config CRASH_CORE > bool > > +config CRASH_AUTO_STR > + string "Memory reserved for crash kernel" > + depends on CRASH_CORE > + default "1G-64G:128M,64G-1T:256M,1T-:512M" > + help > + This configures the reserved memory dependent > + on the value of System RAM. The syntax is: > + crashkernel=:[,:,...][@offset] > + range=start-[end] > + > + For example: > + crashkernel=512M-2G:64M,2G-:128M > + > + This would mean: > + > + 1) if the RAM is smaller than 512M, then don't reserve anything > +(this is the "rescue" case) > + 2) if the RAM size is between 512M and 2G (exclusive), then > reserve 64M > + 3) if the RAM size is larger than 2G, then reserve 128M > + > config KEXEC_CORE > select CRASH_CORE > bool > diff --git a/kernel/crash_core.c b/kernel/crash_core.c > index 825284baaf46..90f9e4bb6704 100644 > --- a/kernel/crash_core.c > +++ b/kernel/crash_core.c > @@ -7,6 +7,7 @@ > #include > #include > #include > +#include > > #include > #include > @@ -250,6 +251,12 @@ static int __init __parse_crashkernel(char *cmdline, > if (suffix) > return parse_crashkernel_suffix(ck_cmdline, crash_size, > suffix); > +#ifdef CONFIG_CRASH_AUTO_STR > + if (strncmp(ck_cmdline, "auto", 4) == 0) { > + ck_cmdline = CONFIG_CRASH_AUTO_STR; > + pr_info("Using crashkernel=auto, the size chosen is a best > effort estimation.\n"); > + } > +#endif > /* > * if the commandline contains a ':', then that's the extended > * syntax -- if not, it must be the classic syntax > -- > 2.27.0 > > > ___ > kexec mailing list > ke...@lists.infradead.org > http://lists.infradead.org/mailman/listinfo/kexec > Thanks for help pushing the crashkernel=auto to upstream This patch works well. Tested-by: Kairui Song -- Best Regards, Kairui Song
Re: [PATCH v3 1/1] kernel/crash_core: Add crashkernel=auto for vmcore creation
rashkernel(char *cmdline, > > if (suffix) > > return parse_crashkernel_suffix(ck_cmdline, crash_size, > > suffix); > > +#ifdef CONFIG_CRASH_AUTO_STR > > + if (strncmp(ck_cmdline, "auto", 4) == 0) { > > + ck_cmdline = CONFIG_CRASH_AUTO_STR; > > + pr_info("Using crashkernel=auto, the size chosen is a best > > effort estimation.\n"); > > + } > > +#endif > > /* > >* if the commandline contains a ':', then that's the extended > >* syntax -- if not, it must be the classic syntax > > -- > > 2.27.0 > > > > > ___ > kexec mailing list > ke...@lists.infradead.org > http://lists.infradead.org/mailman/listinfo/kexec > -- Best Regards, Kairui Song
Re: [PATCH 1/1] kernel/crash_core.c - Add crashkernel=auto for x86 and ARM
On Fri, Nov 20, 2020 at 4:28 AM Saeed Mirzamohammadi wrote:
>
> Hi,
>
> > And I think crashkernel=auto could be used as an indicator that the
> > user wants the kernel to control the crashkernel size, so some further
> > work could be done to adjust the crashkernel more accordingly. e.g.
> > when memory encryption is enabled, increase the crashkernel value for
> > the auto estimation, as it's known to consume more crashkernel memory.
>
> Thanks for the suggestion! I tried to keep it simple and leave it to
> the user to change Kconfig in case a different range is needed. Based
> on experience, these ranges work well for most of the regular cases.

Yes, I think the current implementation is a very good start. There are
some use cases where the kernel is expected to reserve more memory,
like:

- when memory encryption is enabled, an extra swiotlb-sized chunk of
  memory should be reserved
- on ppc, fadump will expect more memory to be reserved

I believe there are a lot more cases like these. I tried to come up with
some patches to let the kernel reserve more memory automatically when
such conditions are detected, but changing the crashkernel= specified
value is really weird. If we have crashkernel=auto, though, letting the
kernel automatically reserve more memory will make sense.

> > But why not make it arch-independent? This crashkernel=auto idea
> > should simply work with every arch.
>
> Thanks! I’ll be making it arch-independent in the v2 patch.

> > @@ -41,6 +42,15 @@ static int __init parse_crashkernel_mem(char *cmdline,
> >                                         unsigned long long *crash_base)
> > {
> >         char *cur = cmdline, *tmp;
> > +       unsigned long long total_mem = system_ram;
> > +
> > +       /*
> > +        * Firmware sometimes reserves some memory regions for its own use,
> > +        * so we get less than the actual system memory size.
> > +        * Work around this by rounding up the total size to 128M, which is
> > +        * enough for most test cases.
> > +        */
> > +       total_mem = roundup(total_mem, SZ_128M);
>
> I think this rounding may be better moved to the arch-specific part
> where parse_crashkernel is called?
>
> Thanks for the suggestion. Could you please elaborate why do we need
> to do that?

Every arch gets its total memory value using different methods (just
check every parse_crashkernel call -- the system_ram param is filled in
many different ways), so I'm really not sure if this rounding is always
suitable.

> Thanks,
> Saeed

-- 
Best Regards,
Kairui Song
Re: [PATCH 1/1] kernel/crash_core.c - Add crashkernel=auto for x86 and ARM
Enable bzImage signature verification support. > > -config CRASH_DUMP > +menuconfig CRASH_DUMP > bool "kernel crash dumps" > depends on X86_64 || (X86_32 && HIGHMEM) > help > @@ -2049,6 +2049,30 @@ config CRASH_DUMP > (CONFIG_RELOCATABLE=y). > For more details see Documentation/admin-guide/kdump/kdump.rst > > +if CRASH_DUMP > + > +config CRASH_AUTO_STR > +string "Memory reserved for crash kernel" if X86_64 > + depends on CRASH_DUMP > +default "1G-64G:128M,64G-1T:256M,1T-:512M" > + help > + This configures the reserved memory dependent > + on the value of System RAM. The syntax is: > + crashkernel=:[,:,...][@offset] > + range=start-[end] > + > + For example: > + crashkernel=512M-2G:64M,2G-:128M > + > + This would mean: > + > + 1) if the RAM is smaller than 512M, then don't reserve anything > +(this is the "rescue" case) > + 2) if the RAM size is between 512M and 2G (exclusive), then > reserve 64M > + 3) if the RAM size is larger than 2G, then reserve 128M > + > +endif # CRASH_DUMP > + > config KEXEC_JUMP > bool "kexec jump" > depends on KEXEC && HIBERNATION > diff --git a/arch/x86/configs/x86_64_defconfig > b/arch/x86/configs/x86_64_defconfig > index 9936528e1939..7a87fbecf40b 100644 > --- a/arch/x86/configs/x86_64_defconfig > +++ b/arch/x86/configs/x86_64_defconfig > @@ -33,6 +33,7 @@ CONFIG_EFI_MIXED=y > CONFIG_HZ_1000=y > CONFIG_KEXEC=y > CONFIG_CRASH_DUMP=y > +# CONFIG_CRASH_AUTO_STR is not set > CONFIG_HIBERNATION=y > CONFIG_PM_DEBUG=y > CONFIG_PM_TRACE_RTC=y > diff --git a/kernel/crash_core.c b/kernel/crash_core.c > index 106e4500fd53..a44cd9cc12c4 100644 > --- a/kernel/crash_core.c > +++ b/kernel/crash_core.c > @@ -7,6 +7,7 @@ > #include > #include > #include > +#include > > #include > #include > @@ -41,6 +42,15 @@ static int __init parse_crashkernel_mem(char *cmdline, > unsigned long long *crash_base) > { > char *cur = cmdline, *tmp; > + unsigned long long total_mem = system_ram; > + > + /* > +* Firmware sometimes reserves some memory regions for it's own 
use. > +* so we get less than actual system memory size. > +* Workaround this by round up the total size to 128M which is > +* enough for most test cases. > +*/ > + total_mem = roundup(total_mem, SZ_128M); I think this rounding may be better moved to the arch specified part where parse_crashkernel is called? > > /* for each entry of the comma-separated list */ > do { > @@ -85,13 +95,13 @@ static int __init parse_crashkernel_mem(char *cmdline, > return -EINVAL; > } > cur = tmp; > - if (size >= system_ram) { > + if (size >= total_mem) { > pr_warn("crashkernel: invalid size\n"); > return -EINVAL; > } > > /* match ? */ > - if (system_ram >= start && system_ram < end) { > + if (total_mem >= start && total_mem < end) { > *crash_size = size; > break; > } > @@ -250,6 +260,12 @@ static int __init __parse_crashkernel(char *cmdline, > if (suffix) > return parse_crashkernel_suffix(ck_cmdline, crash_size, > suffix); > +#ifdef CONFIG_CRASH_AUTO_STR > + if (strncmp(ck_cmdline, "auto", 4) == 0) { > + ck_cmdline = CONFIG_CRASH_AUTO_STR; > + pr_info("Using crashkernel=auto, the size chosen is a best > effort estimation.\n"); > + } > +#endif > /* > * if the commandline contains a ':', then that's the extended > * syntax -- if not, it must be the classic syntax > -- > 2.18.4 > -- Best Regards, Kairui Song
[tip: x86/urgent] x86/kexec: Use up-to-dated screen_info copy to fill boot params
The following commit has been merged into the x86/urgent branch of tip:

Commit-ID:     afc18069a2cb7ead5f86623a5f3d4ad6e21f940d
Gitweb:        https://git.kernel.org/tip/afc18069a2cb7ead5f86623a5f3d4ad6e21f940d
Author:        Kairui Song
AuthorDate:    Wed, 14 Oct 2020 17:24:28 +08:00
Committer:     Ingo Molnar
CommitterDate: Wed, 14 Oct 2020 17:05:03 +02:00

x86/kexec: Use up-to-dated screen_info copy to fill boot params

kexec_file_load() currently reuses the old boot_params.screen_info,
but if drivers have changed the hardware state, boot_params.screen_info
could contain invalid info.

For example, the video type might be no longer VGA, or the frame buffer
address might be changed. If the kexec kernel keeps using the old
screen_info, the kexec'ed kernel may attempt to write to an invalid
framebuffer memory region.

There are two screen_info instances globally available,
boot_params.screen_info and screen_info. The latter is a copy, and is
updated by drivers. So let kexec_file_load use the updated copy.

[ mingo: Tidied up the changelog. ]

Signed-off-by: Kairui Song
Signed-off-by: Ingo Molnar
Link: https://lore.kernel.org/r/20201014092429.1415040-2-kas...@redhat.com
---
 arch/x86/kernel/kexec-bzimage64.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/x86/kernel/kexec-bzimage64.c b/arch/x86/kernel/kexec-bzimage64.c
index 57c2ecf..ce831f9 100644
--- a/arch/x86/kernel/kexec-bzimage64.c
+++ b/arch/x86/kernel/kexec-bzimage64.c
@@ -200,8 +200,7 @@ setup_boot_parameters(struct kimage *image, struct boot_params *params,
 	params->hdr.hardware_subarch = boot_params.hdr.hardware_subarch;
 
 	/* Copying screen_info will do? */
-	memcpy(&params->screen_info, &boot_params.screen_info,
-	       sizeof(struct screen_info));
+	memcpy(&params->screen_info, &screen_info, sizeof(struct screen_info));
 
 	/* Fill in memsize later */
 	params->screen_info.ext_mem_k = 0;
[tip: x86/urgent] hyperv_fb: Update screen_info after removing old framebuffer
The following commit has been merged into the x86/urgent branch of tip: Commit-ID: 3cb73bc3fa2a3cb80b88aa63b48409939e0d996b Gitweb: https://git.kernel.org/tip/3cb73bc3fa2a3cb80b88aa63b48409939e0d996b Author:Kairui Song AuthorDate:Wed, 14 Oct 2020 17:24:29 +08:00 Committer: Ingo Molnar CommitterDate: Wed, 14 Oct 2020 17:05:26 +02:00 hyperv_fb: Update screen_info after removing old framebuffer On gen2 HyperV VM, hyperv_fb will remove the old framebuffer, and the new allocated framebuffer address could be at a differnt location, and it might be no longer a VGA framebuffer. Update screen_info so that after kexec the kernel won't try to reuse the old invalid/stale framebuffer address as VGA, corrupting memory. [ mingo: Tidied up the changelog. ] Signed-off-by: Kairui Song Signed-off-by: Ingo Molnar Cc: Dexuan Cui Cc: Jake Oshins Cc: Wei Hu Cc: "K. Y. Srinivasan" Cc: Haiyang Zhang Cc: Stephen Hemminger Link: https://lore.kernel.org/r/20201014092429.1415040-3-kas...@redhat.com --- drivers/video/fbdev/hyperv_fb.c | 9 - 1 file changed, 8 insertions(+), 1 deletion(-) diff --git a/drivers/video/fbdev/hyperv_fb.c b/drivers/video/fbdev/hyperv_fb.c index 02411d8..e36fb1a 100644 --- a/drivers/video/fbdev/hyperv_fb.c +++ b/drivers/video/fbdev/hyperv_fb.c @@ -1114,8 +1114,15 @@ static int hvfb_getmem(struct hv_device *hdev, struct fb_info *info) getmem_done: remove_conflicting_framebuffers(info->apertures, KBUILD_MODNAME, false); - if (!gen2vm) + + if (gen2vm) { + /* framebuffer is reallocated, clear screen_info to avoid misuse from kexec */ + screen_info.lfb_size = 0; + screen_info.lfb_base = 0; + screen_info.orig_video_isVGA = 0; + } else { pci_dev_put(pdev); + } kfree(info->apertures); return 0;
[PATCH 1/2] x86/kexec: Use up-to-dated screen_info copy to fill boot params
kexec_file_load currently just reuses the old boot_params.screen_info.
But if drivers have changed the hardware state, boot_params.screen_info
could contain invalid info. For example, the video type might no longer
be VGA, or the frame buffer address might have changed. If the kexec
kernel keeps using the old screen_info, the kexec'ed kernel may attempt
to write to an invalid framebuffer memory region.

There are two screen_info instances globally available,
boot_params.screen_info and screen_info. The latter is a copy, and
could be updated by drivers. So let kexec_file_load use the updated
copy.

Signed-off-by: Kairui Song
---
 arch/x86/kernel/kexec-bzimage64.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/x86/kernel/kexec-bzimage64.c b/arch/x86/kernel/kexec-bzimage64.c
index 57c2ecf43134..ce831f9448e7 100644
--- a/arch/x86/kernel/kexec-bzimage64.c
+++ b/arch/x86/kernel/kexec-bzimage64.c
@@ -200,8 +200,7 @@ setup_boot_parameters(struct kimage *image, struct boot_params *params,
 	params->hdr.hardware_subarch = boot_params.hdr.hardware_subarch;
 
 	/* Copying screen_info will do? */
-	memcpy(&params->screen_info, &boot_params.screen_info,
-	       sizeof(struct screen_info));
+	memcpy(&params->screen_info, &screen_info, sizeof(struct screen_info));
 
 	/* Fill in memsize later */
 	params->screen_info.ext_mem_k = 0;
-- 
2.28.0
[PATCH 0/2] x86/hyperv: fix kexec/kdump hang on some VMs
On some HyperV machines, if kexec_file_load is used to load the kexec kernel, second kernel could hang with following stacktrace: [0.591705] efifb: probing for efifb [0.596869] efifb: framebuffer at 0xf800, using 3072k, total 3072k [0.605894] efifb: mode is 1024x768x32, linelength=4096, pages=1 [0.617926] efifb: scrolling: redraw [0.622715] efifb: Truecolor: size=8:8:8:8, shift=24:16:8:0 [ 28.039046] watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [swapper/0:1] [ 28.039046] Modules linked in: [ 28.039046] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.18.0-230.el8.x86_64 #1 [ 28.039046] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.0 12/17/2019 [ 28.039046] RIP: 0010:cfb_imageblit+0x450/0x4c0 [ 28.039046] Code: 89 f8 b9 08 00 00 00 48 89 04 24 eb 2d 41 0f be 30 29 e9 4c 8d 5f 04 d3 fe 44 21 ee 41 8b 04 b6 44 21 c8 89 c6 44 31 d6 89 37 <85> c9 75 09 49 83 c0 01 b9 08 00 00 00 4c 89 df 48 39 df 75 ce 83 [ 28.039046] RSP: 0018:c9087830 EFLAGS: 00010246 ORIG_RAX: ff12 [ 28.039046] RAX: RBX: c9542000 RCX: 0003 [ 28.039046] RDX: 000e RSI: RDI: c9541bf0 [ 28.039046] RBP: 0001 R08: 8880f555c8df R09: 00aa [ 28.039046] R10: R11: c9541bf4 R12: 1000 [ 28.039046] R13: 0001 R14: 81e9a460 R15: 8880f555c880 [ 28.039046] FS: () GS:8880f100() knlGS: [ 28.039046] CS: 0010 DS: ES: CR0: 80050033 [ 28.039046] CR2: 7f7b223b8000 CR3: f3a0a004 CR4: 003606b0 [ 28.039046] DR0: DR1: DR2: [ 28.039046] DR3: DR6: fffe0ff0 DR7: 0400 [ 28.039046] Call Trace: [ 28.039046] bit_putcs+0x2a1/0x550 [ 28.039046] ? fbcon_switch+0x33e/0x5b0 [ 28.039046] ? bit_clear+0x120/0x120 [ 28.039046] fbcon_putcs+0xe7/0x100 [ 28.039046] do_update_region+0x154/0x1a0 [ 28.039046] redraw_screen+0x209/0x240 [ 28.039046] ? 
vc_do_resize+0x5c9/0x660 [ 28.039046] fbcon_prepare_logo+0x3b3/0x430 [ 28.039046] fbcon_init+0x436/0x630 [ 28.039046] visual_init+0xce/0x130 [ 28.039046] do_bind_con_driver+0x1df/0x2d0 [ 28.039046] do_take_over_console+0x113/0x180 [ 28.039046] do_fbcon_takeover+0x58/0xb0 [ 28.039046] register_framebuffer+0x225/0x2f0 [ 28.039046] efifb_probe.cold.5+0x51a/0x55d [ 28.039046] platform_drv_probe+0x38/0x90 [ 28.039046] really_probe+0x212/0x440 [ 28.039046] driver_probe_device+0x49/0xc0 [ 28.039046] device_driver_attach+0x50/0x60 [ 28.039046] __driver_attach+0x61/0x130 [ 28.039046] ? device_driver_attach+0x60/0x60 [ 28.039046] bus_for_each_dev+0x77/0xc0 [ 28.039046] ? klist_add_tail+0x57/0x70 [ 28.039046] bus_add_driver+0x14d/0x1e0 [ 28.039046] ? vesafb_driver_init+0x13/0x13 [ 28.039046] ? do_early_param+0x91/0x91 [ 28.039046] driver_register+0x6b/0xb0 [ 28.039046] ? vesafb_driver_init+0x13/0x13 [ 28.039046] do_one_initcall+0x46/0x1c3 [ 28.039046] ? do_early_param+0x91/0x91 [ 28.039046] kernel_init_freeable+0x1b4/0x25d [ 28.039046] ? rest_init+0xaa/0xaa [ 28.039046] kernel_init+0xa/0xfa [ 28.039046] ret_from_fork+0x35/0x40 The root cause is that hyperv_fb driver will relocate the framebuffer address in first kernel, but kexec_file_load simply reuse the old framebuffer info from boot_params, which is now invalid, so second kernel will write to an invalid framebuffer address. This series fix this problem by: 1. Let kexec_file_load use the updated copy of screen_info. Instead of using boot_params.screen_info, use the globally available screen_info variable instead (which is just an copy of boot_params.screen_info on x86). This variable could be updated by arch indenpendent drivers. Just keep this variable updated should be a good way to keep screen_info consistent across kexec. 2. Let hyperv_fb clean the screen_info copy when the boot framebuffer is relocated outside the old framebuffer. 
After the relocation, the framebuffer is no longer a VGA framebuffer,
so simply cleaning it up should be good.

Kairui Song (2):
  x86/kexec: Use up-to-dated screen_info copy to fill boot params
  hyperv_fb: Update screen_info after removing old framebuffer

 arch/x86/kernel/kexec-bzimage64.c | 3 +--
 drivers/video/fbdev/hyperv_fb.c   | 8
 2 files changed, 9 insertions(+), 2 deletions(-)

-- 
2.28.0
[PATCH 2/2] hyperv_fb: Update screen_info after removing old framebuffer
On a gen2 Hyper-V VM, hyperv_fb will remove the old framebuffer; the
newly allocated framebuffer address could be at a different location,
and it's no longer a VGA framebuffer. Update screen_info so that after
kexec, the kernel won't try to reuse the old invalid framebuffer
address as VGA.

Signed-off-by: Kairui Song
---
 drivers/video/fbdev/hyperv_fb.c | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/drivers/video/fbdev/hyperv_fb.c b/drivers/video/fbdev/hyperv_fb.c
index 02411d89cb46..e36fb1a0ecdb 100644
--- a/drivers/video/fbdev/hyperv_fb.c
+++ b/drivers/video/fbdev/hyperv_fb.c
@@ -1114,8 +1114,15 @@ static int hvfb_getmem(struct hv_device *hdev, struct fb_info *info)
 getmem_done:
 	remove_conflicting_framebuffers(info->apertures, KBUILD_MODNAME, false);
-	if (!gen2vm)
+
+	if (gen2vm) {
+		/* framebuffer is reallocated, clear screen_info to avoid misuse from kexec */
+		screen_info.lfb_size = 0;
+		screen_info.lfb_base = 0;
+		screen_info.orig_video_isVGA = 0;
+	} else {
 		pci_dev_put(pdev);
+	}
 	kfree(info->apertures);
 
 	return 0;
-- 
2.28.0
Re: [RFC PATCH 0/3] Add writing support to vmcore for reusing oldmem
On Wed, Sep 9, 2020 at 10:04 PM Eric W. Biederman wrote:
>
> Kairui Song writes:
>
> > Currently vmcore only supports reading. This patch series is an RFC
> > to add writing support to vmcore. It's x86_64 only yet; I'll add other
> > architectures later if there is no problem with this idea.
> >
> > My purpose in adding writing support is to reuse the crashed kernel's
> > old memory in the kdump kernel, reduce kdump memory pressure, and
> > allow kdump to run with a smaller crashkernel reservation.
> >
> > This is doable because in most cases, after a kernel panic, users are
> > only interested in the crashed kernel itself, and userspace/cache/free
> > memory pages are not dumped. `makedumpfile` is widely used to skip
> > these pages. Kernel pages usually take only a small part of the whole
> > old memory, so there will be many reusable pages.
> >
> > By adding writing support, userspace can then use these pages as fast,
> > temporary storage. This helps reduce memory pressure in many ways.
> >
> > For example, I've written a POC program based on this: it finds the
> > reusable pages and creates an NBD device which maps to these pages.
> > The NBD device can then be used as swap, or to hold some temp files
> > which previously lived in RAM.
> >
> > The link of the POC tool: https://github.com/ryncsn/kdumpd
>
> A couple of thoughts.
>
> 1) Unless I am completely mistaken, treating this as an exercise in
>    memory hotplug would be much simpler.
>
>    AKA just plug in the memory that is not needed as part of the kdump.
>
>    I see below that you have problems doing this because of
>    fragmentation. I still think hotplug is doable using some kind of
>    fragmented memory zone.
>
> 2) The purpose of the memory reservation is that hardware is still
>    potentially running against the memory of the old kernel.
>    By the time we have brought up a new kernel, enough of the hardware
>    may have been reinitialized that we don't have to worry about
>    hardware randomly dma'ing into the memory used by the old kernel.
>
>    With IOMMUs and care we may be able to guarantee for some machine
>    configurations that it is impossible for DMA to come from some piece
>    of hardware that is present but that the kernel does not have a
>    driver loaded for.
>
> I really do not like this approach because it is fundamentally doing
> the wrong thing: adding write support to read-only drivers. I do not
> see anywhere that you even mentioned the hard problem and the reason we
> reserve memory in the first place: hardware spontaneously DMA'ing onto
> it.

That POC tool looks ugly for now as it is only a draft to prove this
works, sorry about that.

For the patch, yes, it is expecting the IOMMU to lower the chance of a
potential DMA issue, and expecting DMA will not hit userspace/free
pages, or at least won't override a massive amount of reusable old
memory. And I thought about some solutions for the potential DMA issue.

As the old memory is used as a block device, which is proxied by
userspace, upon each IO the userspace tool could do an integrity check
of the corresponding data stored in old mem, and keep multiple copies
of the data (e.g. use 512M of old memory to hold a 128M block device).
These copies will be kept far away from each other in terms of physical
memory location. The reusable old memory is sparse, so the actual
memory containing the data should also be sparse. So if some part is
corrupted, it is still recoverable. Unless the DMA went very wrong and
wiped a large region of memory; but if such a thing happens, most
likely kernel pages are also being wiped by DMA, so the vmcore is
already corrupted and kdump may not help.
But at least it won't fail silently; the userspace tool can still do
something like dump some available data to an easy-to-set-up target.

And that's also one of the reasons for not using the old memory as
kdump's memory directly.

> > It has been a long-standing issue that kdump suffers from OOM with
> > limited crashkernel memory, so reusing old memory could be very
> > helpful.
>
> There is a very fine line here between reusing existing code (aka
> drivers and userspace) and doing something that should work.
>
> It might make sense to figure out what is using so much memory that an
> OOM is triggered.
>
> Ages ago I did something that was essentially dumping the kernel's
> printk buffer to the serial console in case of a crash, and I had
> things down to something comparatively minuscule like 8M or less.
>
> My memory is that historically it has been high-performance SCSI RAID
> drivers or something like that that are behind the need to have such
> large memory reservations.
>
> Now that I think
[RFC PATCH 1/3] vmcore: simplify read_from_oldmem
Simplify the code logic; this also helps reduce object size and stack
usage.

Stack usage:
Before: fs/proc/vmcore.c:106:9: read_from_oldmem.part.0    80 static
        fs/proc/vmcore.c:106:9: read_from_oldmem           16 static
After:  fs/proc/vmcore.c:106:9: read_from_oldmem           80 static

Size of vmcore.o:
         text  data  bss   dec   hex  filename
Before:  7677   109   88  7874  1ec2  fs/proc/vmcore.o
After:   7669   109   88  7866  1eba  fs/proc/vmcore.o

Signed-off-by: Kairui Song
---
 fs/proc/vmcore.c | 27 +++++++++----------------
 1 file changed, 10 insertions(+), 17 deletions(-)

diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
index c3a345c28a93..124c2066f3e5 100644
--- a/fs/proc/vmcore.c
+++ b/fs/proc/vmcore.c
@@ -108,25 +108,19 @@ ssize_t read_from_oldmem(char *buf, size_t count,
 			 bool encrypted)
 {
 	unsigned long pfn, offset;
-	size_t nr_bytes;
-	ssize_t read = 0, tmp;
+	size_t nr_bytes, to_copy = count;
+	ssize_t tmp;

-	if (!count)
-		return 0;
-
-	offset = (unsigned long)(*ppos % PAGE_SIZE);
+	offset = (unsigned long)(*ppos & (PAGE_SIZE - 1));
 	pfn = (unsigned long)(*ppos / PAGE_SIZE);

-	do {
-		if (count > (PAGE_SIZE - offset))
-			nr_bytes = PAGE_SIZE - offset;
-		else
-			nr_bytes = count;
+	while (to_copy) {
+		nr_bytes = min(to_copy, PAGE_SIZE - offset);

 		/* If pfn is not ram, return zeros for sparse dump files */
-		if (pfn_is_ram(pfn) == 0)
+		if (pfn_is_ram(pfn) == 0) {
 			memset(buf, 0, nr_bytes);
-		else {
+		} else {
 			if (encrypted)
 				tmp = copy_oldmem_page_encrypted(pfn, buf,
 								 nr_bytes,
 								 offset,
@@ -140,14 +134,13 @@ ssize_t read_from_oldmem(char *buf, size_t count,
 			return tmp;
 		}

 		*ppos += nr_bytes;
-		count -= nr_bytes;
 		buf += nr_bytes;
-		read += nr_bytes;
+		to_copy -= nr_bytes;
 		++pfn;
 		offset = 0;
-	} while (count);
+	}

-	return read;
+	return count;
 }

 /*
--
2.26.2
[RFC PATCH 3/3] x86_64: implement copy_to_oldmem_page
The previous commit introduced writing support for vmcore; it requires
a per-architecture implementation of the writing function.

Signed-off-by: Kairui Song
---
 arch/x86/kernel/crash_dump_64.c | 49 ++++++++++++++++++++++++++-------
 1 file changed, 40 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kernel/crash_dump_64.c b/arch/x86/kernel/crash_dump_64.c
index 045e82e8945b..ec80da75b287 100644
--- a/arch/x86/kernel/crash_dump_64.c
+++ b/arch/x86/kernel/crash_dump_64.c
@@ -13,7 +13,7 @@
 static ssize_t __copy_oldmem_page(unsigned long pfn, char *buf, size_t csize,
 				  unsigned long offset, int userbuf,
-				  bool encrypted)
+				  bool encrypted, bool is_write)
 {
 	void *vaddr;

@@ -28,13 +28,25 @@ static ssize_t __copy_oldmem_page(unsigned long pfn, char *buf, size_t csize,
 	if (!vaddr)
 		return -ENOMEM;

-	if (userbuf) {
-		if (copy_to_user((void __user *)buf, vaddr + offset, csize)) {
-			iounmap((void __iomem *)vaddr);
-			return -EFAULT;
+	if (is_write) {
+		if (userbuf) {
+			if (copy_from_user(vaddr + offset, (void __user *)buf, csize)) {
+				iounmap((void __iomem *)vaddr);
+				return -EFAULT;
+			}
+		} else {
+			memcpy(vaddr + offset, buf, csize);
 		}
-	} else
-		memcpy(buf, vaddr + offset, csize);
+	} else {
+		if (userbuf) {
+			if (copy_to_user((void __user *)buf, vaddr + offset, csize)) {
+				iounmap((void __iomem *)vaddr);
+				return -EFAULT;
+			}
+		} else {
+			memcpy(buf, vaddr + offset, csize);
+		}
+	}

 	set_iounmap_nonlazy();
 	iounmap((void __iomem *)vaddr);

@@ -57,7 +69,7 @@ static ssize_t __copy_oldmem_page(unsigned long pfn, char *buf, size_t csize,
 ssize_t copy_oldmem_page(unsigned long pfn, char *buf, size_t csize,
 			 unsigned long offset, int userbuf)
 {
-	return __copy_oldmem_page(pfn, buf, csize, offset, userbuf, false);
+	return __copy_oldmem_page(pfn, buf, csize, offset, userbuf, false, false);
 }

 /**
@@ -68,7 +80,26 @@ ssize_t copy_oldmem_page(unsigned long pfn, char *buf, size_t csize,
 ssize_t copy_oldmem_page_encrypted(unsigned long pfn, char *buf, size_t csize,
 				   unsigned long offset, int userbuf)
 {
-	return __copy_oldmem_page(pfn, buf, csize, offset, userbuf, true);
+	return __copy_oldmem_page(pfn, buf, csize, offset, userbuf, true, false);
+}
+
+/**
+ * copy_to_oldmem_page - similar to copy_oldmem_page but in opposite direction.
+ */
+ssize_t copy_to_oldmem_page(unsigned long pfn, char *src, size_t csize,
+			    unsigned long offset, int userbuf)
+{
+	return __copy_oldmem_page(pfn, src, csize, offset, userbuf, false, true);
+}
+
+/**
+ * copy_to_oldmem_page_encrypted - similar to copy_oldmem_page_encrypted but
+ * in opposite direction.
+ */
+ssize_t copy_to_oldmem_page_encrypted(unsigned long pfn, char *src, size_t csize,
+				      unsigned long offset, int userbuf)
+{
+	return __copy_oldmem_page(pfn, src, csize, offset, userbuf, true, true);
 }

 ssize_t elfcorehdr_read(char *buf, size_t count, u64 *ppos)
--
2.26.2
[RFC PATCH 0/3] Add writing support to vmcore for reusing oldmem
Currently vmcore only supports reading. This patch series is an RFC to
add writing support to vmcore. It's x86_64 only yet; I'll add other
architectures later if there is no problem with this idea.

My purpose in adding writing support is to reuse the crashed kernel's
old memory in the kdump kernel, reduce kdump memory pressure, and allow
kdump to run with a smaller crashkernel reservation.

This is doable because in most cases, after a kernel panic, users are
only interested in the crashed kernel itself, and userspace/cache/free
memory pages are not dumped. `makedumpfile` is widely used to skip
these pages. Kernel pages usually take only a small part of the whole
old memory, so there will be many reusable pages.

By adding writing support, userspace can then use these pages as fast,
temporary storage. This helps reduce memory pressure in many ways.

For example, I've written a POC program based on this: it finds the
reusable pages and creates an NBD device which maps to these pages.
The NBD device can then be used as swap, or to hold some temp files
which previously lived in RAM.

The link of the POC tool: https://github.com/ryncsn/kdumpd

I tested it on x86_64 on the latest Fedora by using it as swap with the
following steps in the kdump kernel:

1. Install this tool in the kdump initramfs
2. Execute the following commands in kdump:
     /sbin/modprobe nbd nbds_max=1
     /bin/kdumpd &
     /sbin/mkswap /dev/nbd0
     /sbin/swapon /dev/nbd0
3. Observe the swap being used:
     SwapTotal:    131068 kB
     SwapFree:     121852 kB

It helped to reduce the crashkernel from 168M to 110M for a successful
kdump run over NFSv3. There are still many work items that could be
done based on this idea, e.g. move the initramfs content to the old
memory, which may help reduce another ~10-20M of memory.

It has been a long-standing issue that kdump suffers from OOM with
limited crashkernel memory, so reusing old memory could be very
helpful.

This method has its limitations:

- Swap only works for userspace. But kdump userspace is a major memory
  consumer, so in general this should be helpful enough.
- For users who want to dump the whole memory area, this won't help as
  there is no reusable page.

I've tried other ways to improve the crashkernel situation, e.g.:

- Reserve some smaller memory segments in the first kernel for
  crashkernel: it's only a supplement to the default crashkernel
  reservation and only makes the crashkernel value more adjustable,
  still not solving the real problem.
- Reuse old memory, but hotplug chunks of reusable old memory into the
  kdump kernel's memory: it's hard to find large chunks of contiguous
  memory; especially on systems with heavy workload, the reusable
  regions could be very fragmented. So it can only hotplug small
  fragments of memory, which looks hackish and may have a high page
  table overhead.
- Implement the old-memory-based block device as a kernel module. It
  doesn't look good to have a module for this sole usage, and it
  doesn't have much performance/implementation advantage compared to
  this RFC. Besides, keeping all the complex logic of parsing and
  reusing old memory in userspace seems a better idea.

And as a plus, this could make it more doable and reasonable to have a
crashkernel=auto param. If there is swap, then userspace will have less
memory pressure; crashkernel=auto can focus on the kernel usage.

Kairui Song (3):
  vmcore: simplify read_from_oldmem
  vmcore: Add interface to write to old mem
  x86_64: implement copy_to_oldmem_page

 arch/x86/kernel/crash_dump_64.c |  49 ++++++--
 fs/proc/vmcore.c                | 154 +++++++++++++++++++--
 include/linux/crash_dump.h      |  18 ++-
 3 files changed, 180 insertions(+), 41 deletions(-)

--
2.26.2
[RFC PATCH 2/3] vmcore: Add interface to write to old mem
vmcore is used as the interface to access the crashed kernel's memory
in kdump, and currently vmcore only supports reading.

Adding writing support is useful for enabling userspace to make better
use of the old memory.

For kdump, `makedumpfile` is widely used to reduce the dumped vmcore
size, and in most setups it will drop userspace memory and caches. This
means these memory pages are reusable. Kdump runs in a limited
pre-reserved memory region, so if these old memory pages are reused, it
can help reduce memory pressure in the kdump kernel, hence allowing the
first kernel to reserve less memory for kdump.

Adding write support to vmcore is the first step; then userspace can do
IO on the old mem. There are multiple ways to reuse the memory. For
example, userspace can register an NBD device and redirect the IO on
the device to old memory. The NBD device can be used as swap, or used
to hold some temp files.

Signed-off-by: Kairui Song
---
 fs/proc/vmcore.c           | 129 ++++++++++++++++++++++++++++++-------
 include/linux/crash_dump.h |  18 ++++--
 2 files changed, 131 insertions(+), 16 deletions(-)

diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
index 124c2066f3e5..23acc0f2ecd7 100644
--- a/fs/proc/vmcore.c
+++ b/fs/proc/vmcore.c
@@ -103,9 +103,9 @@ static int pfn_is_ram(unsigned long pfn)
 }

 /* Reads a page from the oldmem device from given offset. */
-ssize_t read_from_oldmem(char *buf, size_t count,
-			 u64 *ppos, int userbuf,
-			 bool encrypted)
+static ssize_t oldmem_rw_page(char *buf, size_t count,
+			      u64 *ppos, int userbuf,
+			      bool encrypted, bool is_write)
 {
 	unsigned long pfn, offset;
 	size_t nr_bytes, to_copy = count;
@@ -119,20 +119,33 @@ ssize_t read_from_oldmem(char *buf, size_t count,

 		/* If pfn is not ram, return zeros for sparse dump files */
 		if (pfn_is_ram(pfn) == 0) {
-			memset(buf, 0, nr_bytes);
-		} else {
-			if (encrypted)
-				tmp = copy_oldmem_page_encrypted(pfn, buf,
-								 nr_bytes,
-								 offset,
-								 userbuf);
+			if (is_write)
+				return -EINVAL;
 			else
-				tmp = copy_oldmem_page(pfn, buf, nr_bytes,
-						       offset, userbuf);
+				memset(buf, 0, nr_bytes);
+		} else {
+			if (encrypted) {
+				tmp = is_write ?
+					copy_to_oldmem_page_encrypted(pfn, buf,
+								      nr_bytes,
+								      offset,
+								      userbuf) :
+					copy_oldmem_page_encrypted(pfn, buf,
+								   nr_bytes,
+								   offset,
+								   userbuf);
+			} else {
+				tmp = is_write ?
+					copy_to_oldmem_page(pfn, buf, nr_bytes,
+							    offset, userbuf) :
+					copy_oldmem_page(pfn, buf, nr_bytes,
+							 offset, userbuf);
+			}

 			if (tmp < 0)
 				return tmp;
 		}
+
 		*ppos += nr_bytes;
 		buf += nr_bytes;
 		to_copy -= nr_bytes;
@@ -143,6 +156,22 @@ ssize_t read_from_oldmem(char *buf, size_t count,
 	return count;
 }

+/* Reads a page from the oldmem device from given offset. */
+ssize_t read_from_oldmem(char *buf, size_t count,
+			 u64 *ppos, int userbuf,
+			 bool encrypted)
+{
+	return oldmem_rw_page(buf, count, ppos, userbuf, encrypted, 0);
+}
+
+/* Writes a page to the oldmem device of given offset. */
+ssize_t write_to_oldmem(char *buf, size_t count,
+			u64 *ppos, int userbuf,
+			bool encrypted)
+{
+	return oldmem_rw_page(buf, count, ppos, userbuf, encrypted, 1);
+}
+
 /*
  * Architectures may override this function to allocate ELF header in 2nd kernel
  */
@@ -184,6 +213,26 @@ int __weak remap_oldmem_pfn_range(struct vm_area_struct *vma,
 	return remap_pfn_range(vma, from, pfn, size, prot);
 }

+/*
+ * Architectures which support writ
Re: [RFC PATCH] PCI, kdump: Clear bus master bit upon shutdown in kdump kernel
On Thu, Jul 23, 2020 at 8:00 AM Bjorn Helgaas wrote:
>
> On Wed, Jul 22, 2020 at 03:50:48PM -0600, Jerry Hoemann wrote:
> > On Wed, Jul 22, 2020 at 10:21:23AM -0500, Bjorn Helgaas wrote:
> > > On Wed, Jul 22, 2020 at 10:52:26PM +0800, Kairui Song wrote:
> > > >
> > > > I think I didn't make one thing clear: the PCI UR error never
> > > > arrives in the kernel. It's the iLO BMC on that HPE machine that
> > > > caught the error and sent the kernel an NMI. The kernel is
> > > > panicked by the NMI. I'm still trying to figure out why the NMI
> > > > hanged the kernel, even with panic=-1, panic_on_io_nmi and
> > > > panic_on_unknown_nmi all set. But if we can avoid the NMI by
> > > > shutting down the devices in the right order, that's also a
> > > > solution.
>
> ACPI v6.3, chapter 18, does mention NMIs several times, e.g., Table
> 18-394 and sec 18.4. I'm not familiar enough with APEI to know
> whether Linux correctly supports all those cases. Maybe this is a
> symptom that we don't?
>
> > > I'm not sure how much sympathy to have for this situation. A PCIe
> > > UR is fatal for the transaction and maybe even the device, but from
> > > the overall system point of view, it *should* be a recoverable
> > > error and we shouldn't panic.
> > >
> > > Errors like that should be reported via the normal AER or ACPI/APEI
> > > mechanisms. It sounds like in this case, the platform has decided
> > > these aren't enough and it is trying to force a reboot? If this is
> > > "special" platform behavior, I'm not sure how much we need to cater
> > > for it.
> >
> > Are these AER errors the type processed by the GHES code?
>
> My understanding from ACPI v6.3, sec 18.3.2, is that the Hardware
> Error Source Table may contain Error Source Descriptors of types like:
>
>   IA-32 Machine Check Exception
>   IA-32 Corrected Machine Check
>   IA-32 Non-Maskable Interrupt
>   PCIe Root Port AER
>   PCIe Device AER
>   Generic Hardware Error Source (GHES)
>   Hardware Error Notification
>   IA-32 Deferred Machine Check
>
> I would naively expect PCIe UR errors to be reported via one of the
> PCIe Error Sources, not GHES, but maybe there's some reason to use
> GHES.
>
> The kernel should already know how to deal with the PCIe AER errors,
> but we'd have to add new device-specific code to handle things
> reported via GHES, along the lines of what Shiju is doing here:
>
>   https://lore.kernel.org/r/20200722104245.1060-1-shiju.j...@huawei.com
>
> > I'll note that RedHat runs their crash kernel with: hest_disable.
> > So, the ghes code is disabled in the crash kernel.
>
> That would disable all the HEST error sources, including the PCIe AER
> ones as well as GHES ones. If we turn off some of the normal error
> handling mechanisms, I guess we have to expect that some errors won't
> be handled correctly.

Hi, that's true; hest_disable is added by default to reduce memory
usage in special cases. But even if I remove hest_disable and have GHES
enabled, the hanging issue still exists. From the iLO console log, it's
still sending an NMI to the kernel, and the kernel hangs. The NMI won't
hang the kernel 100 percent of the time: sometimes it will just panic
and reboot, and sometimes it hangs. This behavior didn't change
before/after enabling GHES.

Maybe this is a "special platform behavior". I'm also not 100 percent
sure if/how we can cover this in a good way for now. I'll try to figure
out how the NMI actually hanged the kernel and see if it could be fixed
in other ways.

--
Best Regards,
Kairui Song
Re: [RFC PATCH] PCI, kdump: Clear bus master bit upon shutdown in kdump kernel
On Fri, Mar 6, 2020 at 5:38 PM Baoquan He wrote:
>
> On 03/04/20 at 08:53pm, Deepa Dinamani wrote:
> > On Wed, Mar 4, 2020 at 7:53 PM Baoquan He wrote:
> > >
> > > +Joerg to CC.
> > >
> > > On 03/03/20 at 01:01pm, Deepa Dinamani wrote:
> > > > I looked at this some more. Looks like we do not clear irqs when
> > > > we do a kexec reboot. And, the bootup code maintains the same
> > > > table for the kexec-ed kernel. I'm looking at the following code
> > > > in
> > >
> > > I guess you are talking about kdump reboot here, right? Kexec and
> > > kdump boot take a similar mechanism, but differ a little.
> >
> > Right, I meant the kdump kernel here. And, clearly the
> > is_kdump_kernel() case below.
> >
> > > > intel_irq_remapping.c:
> > > >
> > > >   if (ir_pre_enabled(iommu)) {
> > > >           if (!is_kdump_kernel()) {
> > > >                   pr_warn("IRQ remapping was enabled on %s but we are not in kdump mode\n",
> > > >                           iommu->name);
> > > >                   clear_ir_pre_enabled(iommu);
> > > >                   iommu_disable_irq_remapping(iommu);
> > > >           } else if (iommu_load_old_irte(iommu))
> > >
> > > Here, it's for the kdump kernel to copy the old ir table from the
> > > 1st kernel.
> >
> > Correct.
> >
> > > >                   pr_err("Failed to copy IR table for %s from previous kernel\n",
> > > >                          iommu->name);
> > > >           else
> > > >                   pr_info("Copied IR table for %s from previous kernel\n",
> > > >                           iommu->name);
> > > >   }
> > > >
> > > > Would cleaning the interrupts (like in the non-kdump path above)
> > > > just before shutdown help here? This should clear the interrupts
> > > > enabled for all the devices in the current kernel. So when the
> > > > kdump kernel starts, it starts clean. This should probably help
> > > > block out the interrupts from a device that does not have a
> > > > driver.
> > >
> > > I think stopping those devices out of control from continuing to
> > > send interrupts is a good idea. While not sure if only clearing the
> > > interrupts will be enough.
> > > Those devices which will be initialized by their driver will stop,
> > > but devices whose drivers are not loaded into the kdump kernel may
> > > continue acting. Even though interrupts are cleared at this time,
> > > the in-flight DMA could continue triggering interrupts since the ir
> > > table and io-page table are rebuilt.
> >
> > This should be handled by the IOMMU, right? And, hence you are
> > getting UR. This seems like the correct execution flow to me.
>
> Sorry for the late reply.
> Yes, this is initializing the IOMMU device.
>
> > Anyway, you could just test this theory by removing the
> > is_kdump_kernel() check above and see if it solves your problem.
> > Obviously, check the VT-d spec to figure out the exact sequence to
> > turn off the IR.
>
> OK, I will talk to Kairui and get a machine to test it. Thanks for
> your nice idea; if you have a draft patch, we are happy to test it.
>
> > Note that the device that is causing the problem here is a legit
> > device. We want to have interrupts from devices we don't know about
> > blocked anyway because we can have compromised firmware/devices that
> > could cause a DoS attack. So blocking the unwanted interrupts seems
> > like the right thing to do here.
>
> Kairui said it's a device whose driver is not loaded in the kdump
> kernel because it's not needed by kdump. We try to only load kernel
> modules which are needed, e.g. one device is the dump target, so its
> driver has to be loaded. In this case, the device is more like an
> out-of-control device to the kdump kernel.

Hi Bao, Deepa, sorry for this very late response. The test machine was
not available for some time, and I restarted work on this problem.

For the workaround mentioned by Deepa (removing the is_kdump_kernel()
check), it didn't work; the machine still hangs upon shutdown. The
devices that were left in an unknown state and sending interrupts could
be a problem, but it's irrelevant to this hanging problem.
I think I didn't make one thing clear: the PCI UR error never arrives
in the kernel. It's the iLO BMC on that HPE machine that caught the
error and sent the kernel an NMI. The kernel is panicked by the NMI.
I'm still trying to figure out why the NMI hanged the kernel, even with
panic=-1, panic_on_io_nmi and panic_on_unknown_nmi all set. But if we
can avoid the NMI by shutting down the devices in the right order,
that's also a solution.

--
Best Regards,
Kairui Song
Re: [PATCH v2] x86, efi: never relocate kernel below lowest acceptable address
On Wed, Sep 25, 2019 at 5:55 PM Baoquan He wrote:
>
> On 09/20/19 at 12:05am, Kairui Song wrote:
> > Currently, the kernel fails to boot on some Hyper-V VMs when using
> > EFI. And it's a potential issue on all platforms.
> >
> > It's caused by a broken kernel relocation on EFI systems, when the
> > three conditions below are met:
> >
> > 1. The kernel image is not loaded to the default address
> >    (LOAD_PHYSICAL_ADDR) by the loader.
> > 2. There isn't enough room to contain the kernel, starting from the
> >    default load address (e.g. something else occupied part of the
> >    region).
> > 3. In the memmap provided by the EFI firmware, there is a memory
> >    region that starts below LOAD_PHYSICAL_ADDR and is suitable for
> >    containing the kernel.
>
> Thanks for the effort, Kairui.
>
> Let me summarize what I got from this issue; please correct me if
> anything is missed:
>
> ***
> Problem:
> This bug is reported on the Hyper-V platform. The kernel will
> sometimes reset to firmware w/o any console printing, in the 1st
> kernel and the kdump kernel.
>
> ***
> Root cause:
> With debugging, the resetting to firmware is triggered when executing
> the 'rep movsq' line of /boot/compressed/head_64.S. The reason is
> that the efi boot stub may put the kernel image below 16M; then later
> head_64.S will relocate the kernel to 16M directly. That relocation
> will conflict with some efi reserved region, causing the reset.
>
> A more detailed process, based on the problem occurring on that
> Hyper-V machine:
>
> - kernel (INIT_SIZE: 56820K) got loaded at 0x3c881000 (not aligned,
>   and not equal to pref_address 0x100), need to relocate.
>
> - efi_relocate_kernel is called, tries to allocate INIT_SIZE of
>   memory at pref_address, fails: something else occupied this region.
>
> - efi_relocate_kernel calls efi_low_alloc as fallback, and got the
>   address 0x80 (below 0x100).
>
> - Later in arch/x86/boot/compressed/head_64.S:108, LOAD_PHYSICAL_ADDR
>   is force-used as the new load address, as the current address is
>   lower than that.
> Then the kernel tries to relocate to 0x100.
>
> - However the memory starting from 0x100 was not allocated from the
>   EFI firmware; writing to this region caused the system to reset.
>
> ***
> Solution:
> Always search an area above LOAD_PHYSICAL_ADDR, namely 16M, to put
> the kernel image in /boot/compressed/eboot.c. Then the efi boot stub
> in eboot.c will search a suitable area in the efi memmap, to make
> sure no reserved region will conflict with the target area of the
> kernel image. Besides, the kernel won't be relocated in
> /boot/compressed/head_64.S since it is already above 16M.
>
> #ifdef CONFIG_RELOCATABLE
>         leaq    startup_32(%rip) /* - $startup_32 */, %rbp
>         movl    BP_kernel_alignment(%rsi), %eax
>         decl    %eax
>         addq    %rax, %rbp
>         notq    %rax
>         andq    %rax, %rbp
>         cmpq    $LOAD_PHYSICAL_ADDR, %rbp
>         jge     1f
> #endif
>         movq    $LOAD_PHYSICAL_ADDR, %rbp
> 1:
>
>         /* Target address to relocate to for decompression */
>         movl    BP_init_size(%rsi), %ebx
>         subl    $_end, %ebx
>         addq    %rbp, %rbx
>
Hi Baoquan,

Yes, it's all correct. Thanks for adding these details.

> ***
> I have one concern about this patch:
>
> Why does this only happen on the Hyper-V platform? Qemu/kvm, bare
> metal, and VMware ESXi don't have this issue? What's the difference?
Let me post part of the efi memmap on that machine (and btw the kernel
size is 55M):

kernel: efi: mem00: type=7, attr=0xf, range=[0x-0x0008) (0MB)
kernel: efi: mem01: type=4, attr=0xf, range=[0x0008-0x00081000) (0MB)
kernel: efi: mem02: type=2, attr=0xf, range=[0x00081000-0x00082000) (0MB)
kernel: efi: mem03: type=7, attr=0xf, range=[0x00082000-0x000a) (0MB)
kernel: efi: mem04: type=4, attr=0xf, range=[0x0010-0x0062a000) (5MB)
kernel: efi: mem05: type=7, attr=0xf, range=[0x0062a000-0x0420) (59MB)
kernel: efi: mem06: type=4, attr=0xf, range=[0x0420-0x0440) (2MB)
kernel: efi: mem07: type=7, attr=0xf, range=[0x0440-0x045c6000) (1MB)
kernel: efi: mem08: type=4, attr=0xf, range=[0x045c6000-0x045e6000) (0MB)
kernel: efi: mem09: type=3, attr=0xf, range=[0x045e6000-0x0460b000) (0MB)
kernel: efi: mem10: type=4, attr=0xf, range=[0x0460b000-0x04613000) (0MB)
kernel: efi: mem11: type=3, attr=0xf, range=[0x04613000-0x0462b000) (0MB)
kernel: efi: mem12: type=7, attr=0xf, range=[0x0462b000-0x0480) (1MB)
kernel: efi: mem13: type=2, attr=0xf, range=[0x0480-0x00
[tip:x86/boot] x86/kexec: Add the ACPI NVS region to the ident map
Commit-ID:  5a949b38839e284b1307540c56b03caf57da9736
Gitweb:     https://git.kernel.org/tip/5a949b38839e284b1307540c56b03caf57da9736
Author:     Kairui Song
AuthorDate: Mon, 10 Jun 2019 15:36:17 +0800
Committer:  Borislav Petkov
CommitDate: Mon, 10 Jun 2019 22:00:26 +0200

x86/kexec: Add the ACPI NVS region to the ident map

With the recent addition of RSDP parsing in the decompression stage, a
kexec-ed kernel now needs ACPI tables to be covered by the identity
mapping. And in commit

  6bbeb276b71f ("x86/kexec: Add the EFI system tables and ACPI tables to the ident map")

the ACPI tables memory region was added to the ident map.

But some machines have only an ACPI NVS memory region and the ACPI
tables are located in that region. In such a case, the kexec-ed kernel
will still fail when trying to access ACPI tables if they're not
mapped.

So add the NVS memory region to the ident map as well.

[ bp: Massage. ]

Fixes: 6bbeb276b71f ("x86/kexec: Add the EFI system tables and ACPI tables to the ident map")
Suggested-by: Junichi Nomura
Signed-off-by: Kairui Song
Signed-off-by: Borislav Petkov
Tested-by: Junichi Nomura
Cc: Baoquan He
Cc: Chao Fan
Cc: Dave Young
Cc: Dirk van der Merwe
Cc: "H. Peter Anvin"
Cc: Ingo Molnar
Cc: ke...@lists.infradead.org
Cc: Lianbo Jiang
Cc: "Rafael J. Wysocki"
Cc: Thomas Gleixner
Cc: x86-ml
Link: https://lkml.kernel.org/r/20190610073617.19767-1-kas...@redhat.com
---
 arch/x86/kernel/machine_kexec_64.c | 18 +++++++++++++---
 1 file changed, 15 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index 3c77bdf7b32a..b2b88dcaaf88 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -54,14 +54,26 @@ static int mem_region_callback(struct resource *res, void *arg)
 static int map_acpi_tables(struct x86_mapping_info *info, pgd_t *level4p)
 {
-	unsigned long flags = IORESOURCE_MEM | IORESOURCE_BUSY;
 	struct init_pgtable_data data;
+	unsigned long flags;
+	int ret;

 	data.info = info;
 	data.level4p = level4p;
 	flags = IORESOURCE_MEM | IORESOURCE_BUSY;
-	return walk_iomem_res_desc(IORES_DESC_ACPI_TABLES, flags, 0, -1,
-				   &data, mem_region_callback);
+
+	ret = walk_iomem_res_desc(IORES_DESC_ACPI_TABLES, flags, 0, -1,
+				  &data, mem_region_callback);
+	if (ret && ret != -EINVAL)
+		return ret;
+
+	/* ACPI tables could be located in ACPI Non-volatile Storage region */
+	ret = walk_iomem_res_desc(IORES_DESC_ACPI_NV_STORAGE, flags, 0, -1,
+				  &data, mem_region_callback);
+	if (ret && ret != -EINVAL)
+		return ret;
+
+	return 0;
 }
 #else
 static int map_acpi_tables(struct x86_mapping_info *info, pgd_t *level4p) { return 0; }
[tip:x86/boot] x86/kexec: Add the EFI system tables and ACPI tables to the ident map
Commit-ID:  6bbeb276b71f06c5267bfd154629b1bec82e7136
Gitweb:     https://git.kernel.org/tip/6bbeb276b71f06c5267bfd154629b1bec82e7136
Author:     Kairui Song
AuthorDate: Mon, 29 Apr 2019 08:23:18 +0800
Committer:  Borislav Petkov
CommitDate: Thu, 6 Jun 2019 20:13:48 +0200

x86/kexec: Add the EFI system tables and ACPI tables to the ident map

Currently, only the whole physical memory is identity-mapped for the
kexec kernel and the regions reserved by firmware are ignored.

However, the recent addition of RSDP parsing in the decompression stage
and especially:

  33f0df8d843d ("x86/boot: Search for RSDP in the EFI tables")

which tries to access EFI system tables and to dig out the RSDP address
from there, becomes a problem because in certain configurations, they
might not be mapped in the kexec'ed kernel's address space.

What is more, this problem doesn't appear on all systems because the
kexec kernel uses gigabyte pages to build the identity mapping. And the
EFI system tables and ACPI tables can, depending on the system
configuration, end up being mapped as part of all physical memory, if
they share the same 1 GB area with the physical memory.

Therefore, make sure they're always mapped.

[ bp: productize half-baked patch:
  - rewrite commit message.
  - correct the map_acpi_tables() function name in the !ACPI case. ]

Signed-off-by: Kairui Song
Signed-off-by: Baoquan He
Signed-off-by: Borislav Petkov
Tested-by: Dirk van der Merwe
Cc: dyo...@redhat.com
Cc: fanc.f...@cn.fujitsu.com
Cc: "H. Peter Anvin"
Cc: Ingo Molnar
Cc: j-nom...@ce.jp.nec.com
Cc: ke...@lists.infradead.org
Cc: "Kirill A. Shutemov"
Cc: Lianbo Jiang
Cc: Tetsuo Handa
Cc: Thomas Gleixner
Cc: x86-ml
Link: https://lkml.kernel.org/r/20190429002318.GA25400@MiWiFi-R3L-srv
---
 arch/x86/kernel/machine_kexec_64.c | 75 ++
 1 file changed, 75 insertions(+)

diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index ceba408ea982..3c77bdf7b32a 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -18,6 +18,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
@@ -29,6 +30,43 @@
 #include
 #include
 
+#ifdef CONFIG_ACPI
+/*
+ * Used while adding mapping for ACPI tables.
+ * Can be reused when other iomem regions need be mapped
+ */
+struct init_pgtable_data {
+	struct x86_mapping_info *info;
+	pgd_t *level4p;
+};
+
+static int mem_region_callback(struct resource *res, void *arg)
+{
+	struct init_pgtable_data *data = arg;
+	unsigned long mstart, mend;
+
+	mstart = res->start;
+	mend = mstart + resource_size(res) - 1;
+
+	return kernel_ident_mapping_init(data->info, data->level4p, mstart, mend);
+}
+
+static int
+map_acpi_tables(struct x86_mapping_info *info, pgd_t *level4p)
+{
+	unsigned long flags = IORESOURCE_MEM | IORESOURCE_BUSY;
+	struct init_pgtable_data data;
+
+	data.info = info;
+	data.level4p = level4p;
+	flags = IORESOURCE_MEM | IORESOURCE_BUSY;
+	return walk_iomem_res_desc(IORES_DESC_ACPI_TABLES, flags, 0, -1,
+				   &data, mem_region_callback);
+}
+#else
+static int map_acpi_tables(struct x86_mapping_info *info, pgd_t *level4p) { return 0; }
+#endif
+
 #ifdef CONFIG_KEXEC_FILE
 const struct kexec_file_ops * const kexec_file_loaders[] = {
 	&kexec_bzImage64_ops,
@@ -36,6 +74,31 @@ const struct kexec_file_ops * const kexec_file_loaders[] = {
 };
 #endif
 
+static int
+map_efi_systab(struct x86_mapping_info *info, pgd_t *level4p)
+{
+#ifdef CONFIG_EFI
+	unsigned long mstart, mend;
+
+	if (!efi_enabled(EFI_BOOT))
+		return 0;
+
+	mstart = (boot_params.efi_info.efi_systab |
+			((u64)boot_params.efi_info.efi_systab_hi<<32));
+
+	if (efi_enabled(EFI_64BIT))
+		mend = mstart + sizeof(efi_system_table_64_t);
+	else
+		mend = mstart + sizeof(efi_system_table_32_t);
+
+	if (!mstart)
+		return 0;
+
+	return kernel_ident_mapping_init(info, level4p, mstart, mend);
+#endif
+	return 0;
+}
+
 static void free_transition_pgtable(struct kimage *image)
 {
 	free_page((unsigned long)image->arch.p4d);
@@ -159,6 +222,18 @@ static int init_pgtable(struct kimage *image, unsigned long start_pgtable)
 		return result;
 	}
 
+	/*
+	 * Prepare EFI systab and ACPI tables for kexec kernel since they are
+	 * not covered by pfn_mapped.
+	 */
+	result = map_efi_systab(&info, level4p);
+	if (result)
+		return result;
+
+	result = map_acpi_tables(&info, level4p);
+	if (result)
+		return result;
+
 	return init_transition_pgtable(image, level4p);
 }
[PATCH v5] vmcore: Add a kernel parameter novmcoredd
Since commit 2724273e8fd0 ("vmcore: add API to collect hardware dump in second kernel"), drivers are allowed to add device-related dump data to vmcore as they want by using the device dump API. This has a potential issue: the data is stored in memory, and drivers may append too much data and use too much memory. The vmcore is typically used in a kdump kernel, which runs in a small pre-reserved chunk of memory, so as a result kdump can become entirely unusable due to OOM issues.

So introduce a new 'novmcoredd' command line option. Users can disable device dump to reduce memory usage. This is helpful if device dump is using too much memory: disabling device dump ensures a regular vmcore, without the device dump data, is still available.

Signed-off-by: Kairui Song
Reviewed-by: Bhupesh Sharma
Acked-by: Dave Young
---
Hi Andrew, sorry for the trouble, but could you help pick up this one instead for the "vmcore: Add a kernel parameter novmcoredd" patch? The previous one is in the mm tree but failed to compile when CONFIG_MODULES is not set. I fixed this issue and carried something else like your doc fix, thanks!

Update from V4:
- Document adjustment by Andrew Morton, also move the text to a better position
- Fix compile error when CONFIG_MODULES is not set
- Return EPERM instead of EINVAL when device dump is disabled, as suggested by Dave Young

Update from V3:
- Use novmcoredd instead of vmcore_device_dump. Using vmcore_device_dump and making it off by default is confusing; novmcoredd is a cleaner way to let user space disable device dump to save memory.

Update from V2:
- Improve related docs

Update from V1:
- Use a bool parameter to turn it on/off instead of letting the user give the size limit. The size of a device dump is hard to determine.
 Documentation/admin-guide/kernel-parameters.txt | 11 +++
 fs/proc/Kconfig                                 |  3 ++-
 fs/proc/vmcore.c                                |  9 +
 3 files changed, 22 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 138f6664b2e2..90b25234d965 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3088,6 +3088,17 @@
 	nosync		[HW,M68K] Disables sync negotiation for all devices.
 
+	novmcoredd	[KNL,KDUMP]
+			Disable device dump. Device dump allows drivers to
+			append dump data to vmcore so you can collect driver
+			specified debug info. Drivers can append the data
+			without any limit and this data is stored in memory,
+			so this may cause significant memory stress. Disabling
+			device dump can help save memory but the driver debug
+			data will be no longer available. This parameter
+			is only available when CONFIG_PROC_VMCORE_DEVICE_DUMP
+			is set.
+
 	nowatchdog	[KNL] Disable both lockup detectors, i.e.
 			soft-lockup and NMI watchdog (hard-lockup).

diff --git a/fs/proc/Kconfig b/fs/proc/Kconfig
index 62ee41b4bbd0..b74ea844abd5 100644
--- a/fs/proc/Kconfig
+++ b/fs/proc/Kconfig
@@ -58,7 +58,8 @@ config PROC_VMCORE_DEVICE_DUMP
 	  snapshot.
 
 	  If you say Y here, the collected device dumps will be added
-	  as ELF notes to /proc/vmcore.
+	  as ELF notes to /proc/vmcore. You can still disable device
+	  dump using the kernel command line option 'novmcoredd'.
 
 config PROC_SYSCTL
 	bool "Sysctl support (/proc/sys)" if EXPERT

diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
index 7bb96fdd38ad..936e9dbbfbec 100644
--- a/fs/proc/vmcore.c
+++ b/fs/proc/vmcore.c
@@ -26,6 +26,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include "internal.h"
@@ -54,6 +55,9 @@ static struct proc_dir_entry *proc_vmcore;
 /* Device Dump list and mutex to synchronize access to list */
 static LIST_HEAD(vmcoredd_list);
 static DEFINE_MUTEX(vmcoredd_mutex);
+
+static bool vmcoredd_disabled;
+core_param(novmcoredd, vmcoredd_disabled, bool, 0);
 #endif /* CONFIG_PROC_VMCORE_DEVICE_DUMP */
 
 /* Device Dump Size */
@@ -1452,6 +1456,11 @@ int vmcore_add_device_dump(struct vmcoredd_data *data)
 	size_t data_size;
 	int ret;
 
+	if (vmcoredd_disabled) {
+		pr_err_once("Device dump is disabled\n");
+		return -EPERM;
+	}
+
 	if (!data || !strlen(data->dump_name) || !data->vmcoredd_callback ||
 	    !data->size)
 		return -EINVAL;
-- 
2.21.0
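As a usage note (not part of the patch): since the option is registered with core_param(), it is simply appended to the kdump kernel's command line. A minimal sketch of how a script might detect it follows; the cmdline string here is a made-up stand-in for the contents of /proc/cmdline, and the accepted `=1`/`=y` spellings are assumed to follow the kernel's generic bool parameter parsing.

```shell
# Decide whether device dump was disabled on this boot by scanning the
# kernel command line for the novmcoredd parameter added by this patch.
cmdline="BOOT_IMAGE=/vmlinuz root=/dev/vda1 ro crashkernel=256M novmcoredd"  # stand-in for $(cat /proc/cmdline)

vmcoredd="enabled"
for param in $cmdline; do
    case "$param" in
        novmcoredd|novmcoredd=1|novmcoredd=y|novmcoredd=Y)
            vmcoredd="disabled" ;;
    esac
done
echo "device dump: $vmcoredd"
```

With novmcoredd present, as above, this reports the device dump as disabled, matching the -EPERM path in vmcore_add_device_dump().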
Re: Getting empty callchain from perf_callchain_kernel()
On Sat, May 25, 2019 at 7:23 AM Josh Poimboeuf wrote: > > On Fri, May 24, 2019 at 10:20:52AM +0800, Kairui Song wrote: > > On Fri, May 24, 2019 at 1:27 AM Josh Poimboeuf wrote: > > > > > > On Fri, May 24, 2019 at 12:41:59AM +0800, Kairui Song wrote: > > > > On Thu, May 23, 2019 at 11:24 PM Josh Poimboeuf > > > > wrote: > > > > > > > > > > On Thu, May 23, 2019 at 10:50:24PM +0800, Kairui Song wrote: > > > > > > > > Hi Josh, this still won't fix the problem. > > > > > > > > > > > > > > > > Problem is not (or not only) with ___bpf_prog_run, what > > > > > > > > actually went > > > > > > > > wrong is with the JITed bpf code. > > > > > > > > > > > > > > There seem to be a bunch of issues. My patch at least fixes the > > > > > > > failing > > > > > > > selftest reported by Alexei for ORC. > > > > > > > > > > > > > > How can I recreate your issue? > > > > > > > > > > > > Hmm, I used bcc's example to attach bpf to trace point, and with > > > > > > that > > > > > > fix stack trace is still invalid. > > > > > > > > > > > > CMD I used with bcc: > > > > > > python3 ./tools/stackcount.py t:sched:sched_fork > > > > > > > > > > I've had problems in the past getting bcc to build, so I was hoping it > > > > > was reproducible with a standalone selftest. > > > > > > > > > > > And I just had another try applying your patch, self test is also > > > > > > failing. > > > > > > > > > > Is it the same selftest reported by Alexei? > > > > > > > > > > test_stacktrace_map:FAIL:compare_map_keys stackid_hmap vs. stackmap > > > > > err -1 errno 2 > > > > > > > > > > > I'm applying on my local master branch, a few days older than > > > > > > upstream, I can update and try again, am I missing anything? > > > > > > > > > > The above patch had some issues, so with some configs you might see an > > > > > objtool warning for ___bpf_prog_run(), in which case the patch doesn't > > > > > fix the test_stacktrace_map selftest. 
> > > > > > > > > > Here's the latest version which should fix it in all cases (based on > > > > > tip/master): > > > > > > > > > > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/jpoimboe/linux.git/commit/?h=bpf-orc-fix > > > > > > > > Hmm, I still get the failure: > > > > test_stacktrace_map:FAIL:compare_map_keys stackid_hmap vs. stackmap > > > > err -1 errno 2 > > > > > > > > And I didn't see how this will fix the issue. As long as ORC need to > > > > unwind through the JITed code it will fail. And that will happen > > > > before reaching ___bpf_prog_run. > > > > > > Ok, I was able to recreate by doing > > > > > > echo 1 > /proc/sys/net/core/bpf_jit_enable > > > > > > first. I'm guessing you have CONFIG_BPF_JIT_ALWAYS_ON. > > > > > > > Yes, with JIT off it will be fixed. I can confirm that. > > Here's a tentative BPF fix for the JIT frame pointer issue. It was a > bit harder than I expected. Encoding r12 as a base register requires a > SIB byte, so I had to add support for encoding that. I also simplified > the prologue to resemble a GCC prologue, which decreases the prologue > size quite a bit. > > Next week I can work on the corresponding ORC change. Then I can clean > all the patches up and submit them properly. > > diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c > index afabf597c855..c9b4503558c9 100644 > --- a/arch/x86/net/bpf_jit_comp.c > +++ b/arch/x86/net/bpf_jit_comp.c > @@ -104,9 +104,8 @@ static int bpf_size_to_x86_bytes(int bpf_size) > /* > * The following table maps BPF registers to x86-64 registers. > * > - * x86-64 register R12 is unused, since if used as base address > - * register in load/store instructions, it always needs an > - * extra byte of encoding and is callee saved. > + * RBP isn't used; it needs to be preserved to allow the unwinder to move > + * through generated code stacks. > * > * Also x86-64 register R9 is unused. x86-64 register R10 is > * used fo
Re: Getting empty callchain from perf_callchain_kernel()
On Fri, May 24, 2019 at 1:27 AM Josh Poimboeuf wrote: > > On Fri, May 24, 2019 at 12:41:59AM +0800, Kairui Song wrote: > > On Thu, May 23, 2019 at 11:24 PM Josh Poimboeuf > > wrote: > > > > > > On Thu, May 23, 2019 at 10:50:24PM +0800, Kairui Song wrote: > > > > > > Hi Josh, this still won't fix the problem. > > > > > > > > > > > > Problem is not (or not only) with ___bpf_prog_run, what actually > > > > > > went > > > > > > wrong is with the JITed bpf code. > > > > > > > > > > There seem to be a bunch of issues. My patch at least fixes the > > > > > failing > > > > > selftest reported by Alexei for ORC. > > > > > > > > > > How can I recreate your issue? > > > > > > > > Hmm, I used bcc's example to attach bpf to trace point, and with that > > > > fix stack trace is still invalid. > > > > > > > > CMD I used with bcc: > > > > python3 ./tools/stackcount.py t:sched:sched_fork > > > > > > I've had problems in the past getting bcc to build, so I was hoping it > > > was reproducible with a standalone selftest. > > > > > > > And I just had another try applying your patch, self test is also > > > > failing. > > > > > > Is it the same selftest reported by Alexei? > > > > > > test_stacktrace_map:FAIL:compare_map_keys stackid_hmap vs. stackmap err > > > -1 errno 2 > > > > > > > I'm applying on my local master branch, a few days older than > > > > upstream, I can update and try again, am I missing anything? > > > > > > The above patch had some issues, so with some configs you might see an > > > objtool warning for ___bpf_prog_run(), in which case the patch doesn't > > > fix the test_stacktrace_map selftest. > > > > > > Here's the latest version which should fix it in all cases (based on > > > tip/master): > > > > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/jpoimboe/linux.git/commit/?h=bpf-orc-fix > > > > Hmm, I still get the failure: > > test_stacktrace_map:FAIL:compare_map_keys stackid_hmap vs. 
stackmap > > err -1 errno 2 > > > > And I didn't see how this will fix the issue. As long as ORC need to > > unwind through the JITed code it will fail. And that will happen > > before reaching ___bpf_prog_run. > > Ok, I was able to recreate by doing > > echo 1 > /proc/sys/net/core/bpf_jit_enable > > first. I'm guessing you have CONFIG_BPF_JIT_ALWAYS_ON. > Yes, with JIT off it will be fixed. I can confirm that. -- Best Regards, Kairui Song
Re: Getting empty callchain from perf_callchain_kernel()
On Thu, May 23, 2019 at 11:24 PM Josh Poimboeuf wrote: > > On Thu, May 23, 2019 at 10:50:24PM +0800, Kairui Song wrote: > > > > Hi Josh, this still won't fix the problem. > > > > > > > > Problem is not (or not only) with ___bpf_prog_run, what actually went > > > > wrong is with the JITed bpf code. > > > > > > There seem to be a bunch of issues. My patch at least fixes the failing > > > selftest reported by Alexei for ORC. > > > > > > How can I recreate your issue? > > > > Hmm, I used bcc's example to attach bpf to trace point, and with that > > fix stack trace is still invalid. > > > > CMD I used with bcc: > > python3 ./tools/stackcount.py t:sched:sched_fork > > I've had problems in the past getting bcc to build, so I was hoping it > was reproducible with a standalone selftest. > > > And I just had another try applying your patch, self test is also failing. > > Is it the same selftest reported by Alexei? > > test_stacktrace_map:FAIL:compare_map_keys stackid_hmap vs. stackmap err -1 > errno 2 > > > I'm applying on my local master branch, a few days older than > > upstream, I can update and try again, am I missing anything? > > The above patch had some issues, so with some configs you might see an > objtool warning for ___bpf_prog_run(), in which case the patch doesn't > fix the test_stacktrace_map selftest. > > Here's the latest version which should fix it in all cases (based on > tip/master): > > > https://git.kernel.org/pub/scm/linux/kernel/git/jpoimboe/linux.git/commit/?h=bpf-orc-fix Hmm, I still get the failure: test_stacktrace_map:FAIL:compare_map_keys stackid_hmap vs. stackmap err -1 errno 2 And I didn't see how this will fix the issue. As long as ORC need to unwind through the JITed code it will fail. And that will happen before reaching ___bpf_prog_run. > > > > > For frame pointer unwinder, it seems the JITed bpf code will have a > > > > shifted "BP" register? (arch/x86/net/bpf_jit_comp.c:217), so if we can > > > > unshift it properly then it will work. 
> > > > > > Yeah, that looks like a frame pointer bug in emit_prologue(). > > > > > > > I tried below code, and problem is fixed (only for frame pointer > > > > unwinder though). Need to find a better way to detect and do any > > > > similar trick for bpf part, if this is a feasible way to fix it: > > > > > > > > diff --git a/arch/x86/kernel/unwind_frame.c > > > > b/arch/x86/kernel/unwind_frame.c > > > > index 9b9fd4826e7a..2c0fa2aaa7e4 100644 > > > > --- a/arch/x86/kernel/unwind_frame.c > > > > +++ b/arch/x86/kernel/unwind_frame.c > > > > @@ -330,8 +330,17 @@ bool unwind_next_frame(struct unwind_state *state) > > > > } > > > > > > > > /* Move to the next frame if it's safe: */ > > > > - if (!update_stack_state(state, next_bp)) > > > > - goto bad_address; > > > > + if (!update_stack_state(state, next_bp)) { > > > > + // Try again with shifted BP > > > > + state->bp += 5; // see AUX_STACK_SPACE > > > > + next_bp = (unsigned long > > > > *)READ_ONCE_TASK_STACK(state->task, *state->bp); > > > > + // Clean and refetch stack info, it's marked as error > > > > outed > > > > + state->stack_mask = 0; > > > > + get_stack_info(next_bp, state->task, > > > > >stack_info, >stack_mask); > > > > + if (!update_stack_state(state, next_bp)) { > > > > + goto bad_address; > > > > + } > > > > + } > > > > > > > > return true; > > > > > > Nack. > > > > > > > For ORC unwinder, I think the unwinder can't find any info about the > > > > JITed part. Maybe if can let it just skip the JITed part and go to > > > > kernel context, then should be good enough. > > > > > > If it's starting from a fake pt_regs then that's going to be a > > > challenge. > > > > > > Will the JIT code always have the same stack layout? If so then we > > > could hard code that knowledge in ORC. Or even better, create a generic > > > interface for ORC to query the creator of the generated code about the > > > stack layout. > > > > I think yes. > > > > Not sure why we have the BP
Re: Getting empty callchain from perf_callchain_kernel()
On Thu, May 23, 2019 at 9:32 PM Josh Poimboeuf wrote: > > On Thu, May 23, 2019 at 02:48:11PM +0800, Kairui Song wrote: > > On Thu, May 23, 2019 at 7:46 AM Josh Poimboeuf wrote: > > > > > > On Wed, May 22, 2019 at 12:45:17PM -0500, Josh Poimboeuf wrote: > > > > On Wed, May 22, 2019 at 02:49:07PM +, Alexei Starovoitov wrote: > > > > > The one that is broken is prog_tests/stacktrace_map.c > > > > > There we attach bpf to standard tracepoint where > > > > > kernel suppose to collect pt_regs before calling into bpf. > > > > > And that's what bpf_get_stackid_tp() is doing. > > > > > It passes pt_regs (that was collected before any bpf) > > > > > into bpf_get_stackid() which calls get_perf_callchain(). > > > > > Same thing with kprobes, uprobes. > > > > > > > > Is it trying to unwind through ___bpf_prog_run()? > > > > > > > > If so, that would at least explain why ORC isn't working. Objtool > > > > currently ignores that function because it can't follow the jump table. > > > > > > Here's a tentative fix (for ORC, at least). I'll need to make sure this > > > doesn't break anything else. 
> > > diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> > > index 242a643af82f..1d9a7cc4b836 100644
> > > --- a/kernel/bpf/core.c
> > > +++ b/kernel/bpf/core.c
> > > @@ -1562,7 +1562,6 @@ static u64 ___bpf_prog_run(u64 *regs, const struct bpf_insn *insn, u64 *stack)
> > >  	BUG_ON(1);
> > >  	return 0;
> > >  }
> > > -STACK_FRAME_NON_STANDARD(___bpf_prog_run); /* jump table */
> > >
> > >  #define PROG_NAME(stack_size) __bpf_prog_run##stack_size
> > >  #define DEFINE_BPF_PROG_RUN(stack_size) \
> > > diff --git a/tools/objtool/check.c b/tools/objtool/check.c
> > > index 172f99195726..2567027fce95 100644
> > > --- a/tools/objtool/check.c
> > > +++ b/tools/objtool/check.c
> > > @@ -1033,13 +1033,6 @@ static struct rela *find_switch_table(struct objtool_file *file,
> > >  		if (text_rela->type == R_X86_64_PC32)
> > >  			table_offset += 4;
> > >
> > > -		/*
> > > -		 * Make sure the .rodata address isn't associated with a
> > > -		 * symbol. gcc jump tables are anonymous data.
> > > -		 */
> > > -		if (find_symbol_containing(rodata_sec, table_offset))
> > > -			continue;
> > > -
> > >  		rodata_rela = find_rela_by_dest(rodata_sec, table_offset);
> > >  		if (rodata_rela) {
> > >  			/*
> >
> > Hi Josh, this still won't fix the problem.
> >
> > Problem is not (or not only) with ___bpf_prog_run, what actually went
> > wrong is with the JITed bpf code.
>
> There seem to be a bunch of issues. My patch at least fixes the failing
> selftest reported by Alexei for ORC.
>
> How can I recreate your issue?

Hmm, I used bcc's example to attach bpf to trace point, and with that
fix stack trace is still invalid.

CMD I used with bcc:
python3 ./tools/stackcount.py t:sched:sched_fork

And I just had another try applying your patch, self test is also failing.

I'm applying on my local master branch, a few days older than
upstream, I can update and try again, am I missing anything?

> > For frame pointer unwinder, it seems the JITed bpf code will have a
> > shifted "BP" register?
> > (arch/x86/net/bpf_jit_comp.c:217), so if we can
> > unshift it properly then it will work.
>
> Yeah, that looks like a frame pointer bug in emit_prologue().
>
> > I tried below code, and problem is fixed (only for frame pointer
> > unwinder though). Need to find a better way to detect and do any
> > similar trick for bpf part, if this is a feasible way to fix it:
> >
> > diff --git a/arch/x86/kernel/unwind_frame.c b/arch/x86/kernel/unwind_frame.c
> > index 9b9fd4826e7a..2c0fa2aaa7e4 100644
> > --- a/arch/x86/kernel/unwind_frame.c
> > +++ b/arch/x86/kernel/unwind_frame.c
> > @@ -330,8 +330,17 @@ bool unwind_next_frame(struct unwind_state *state)
> >  	}
> >
> >  	/* Move to the next frame if it's safe: */
> > -	if (!update_stack_state(state, next_bp))
> > -		goto bad_address;
> > +	if (!update_stack_state(state, next_bp)) {
> > +		// Try again with shifted BP
> > +		state->bp +=
Re: Getting empty callchain from perf_callchain_kernel()
On Thu, May 23, 2019 at 4:28 PM Song Liu wrote: > > > On May 22, 2019, at 11:48 PM, Kairui Song wrote: > > > > On Thu, May 23, 2019 at 7:46 AM Josh Poimboeuf wrote: > >> > >> On Wed, May 22, 2019 at 12:45:17PM -0500, Josh Poimboeuf wrote: > >>> On Wed, May 22, 2019 at 02:49:07PM +, Alexei Starovoitov wrote: > >>>> The one that is broken is prog_tests/stacktrace_map.c > >>>> There we attach bpf to standard tracepoint where > >>>> kernel suppose to collect pt_regs before calling into bpf. > >>>> And that's what bpf_get_stackid_tp() is doing. > >>>> It passes pt_regs (that was collected before any bpf) > >>>> into bpf_get_stackid() which calls get_perf_callchain(). > >>>> Same thing with kprobes, uprobes. > >>> > >>> Is it trying to unwind through ___bpf_prog_run()? > >>> > >>> If so, that would at least explain why ORC isn't working. Objtool > >>> currently ignores that function because it can't follow the jump table. > >> > >> Here's a tentative fix (for ORC, at least). I'll need to make sure this > >> doesn't break anything else. > >> > >> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c > >> index 242a643af82f..1d9a7cc4b836 100644 > >> --- a/kernel/bpf/core.c > >> +++ b/kernel/bpf/core.c > >> @@ -1562,7 +1562,6 @@ static u64 ___bpf_prog_run(u64 *regs, const struct > >> bpf_insn *insn, u64 *stack) > >>BUG_ON(1); > >>return 0; > >> } > >> -STACK_FRAME_NON_STANDARD(___bpf_prog_run); /* jump table */ > >> > >> #define PROG_NAME(stack_size) __bpf_prog_run##stack_size > >> #define DEFINE_BPF_PROG_RUN(stack_size) \ > >> diff --git a/tools/objtool/check.c b/tools/objtool/check.c > >> index 172f99195726..2567027fce95 100644 > >> --- a/tools/objtool/check.c > >> +++ b/tools/objtool/check.c > >> @@ -1033,13 +1033,6 @@ static struct rela *find_switch_table(struct > >> objtool_file *file, > >>if (text_rela->type == R_X86_64_PC32) > >>table_offset += 4; > >> > >> - /* > >> -* Make sure the .rodata address isn't associated with a > >> -* symbol. 
gcc jump tables are anonymous data. > >> -*/ > >> - if (find_symbol_containing(rodata_sec, table_offset)) > >> - continue; > >> - > >>rodata_rela = find_rela_by_dest(rodata_sec, table_offset); > >>if (rodata_rela) { > >>/* > > > > Hi Josh, this still won't fix the problem. > > > > Problem is not (or not only) with ___bpf_prog_run, what actually went > > wrong is with the JITed bpf code. > > > > For frame pointer unwinder, it seems the JITed bpf code will have a > > shifted "BP" register? (arch/x86/net/bpf_jit_comp.c:217), so if we can > > unshift it properly then it will work. > > > > I tried below code, and problem is fixed (only for frame pointer > > unwinder though). Need to find a better way to detect and do any > > similar trick for bpf part, if this is a feasible way to fix it: > > > > diff --git a/arch/x86/kernel/unwind_frame.c b/arch/x86/kernel/unwind_frame.c > > index 9b9fd4826e7a..2c0fa2aaa7e4 100644 > > --- a/arch/x86/kernel/unwind_frame.c > > +++ b/arch/x86/kernel/unwind_frame.c > > @@ -330,8 +330,17 @@ bool unwind_next_frame(struct unwind_state *state) > >} > > > >/* Move to the next frame if it's safe: */ > > - if (!update_stack_state(state, next_bp)) > > - goto bad_address; > > + if (!update_stack_state(state, next_bp)) { > > + // Try again with shifted BP > > + state->bp += 5; // see AUX_STACK_SPACE > > + next_bp = (unsigned long > > *)READ_ONCE_TASK_STACK(state->task, *state->bp); > > + // Clean and refetch stack info, it's marked as error outed > > + state->stack_mask = 0; > > + get_stack_info(next_bp, state->task, > > >stack_info, >stack_mask); > > + if (!update_stack_state(state, next_bp)) { > > + goto bad_address; > > + } > > + } > > > >return true; > > > > For ORC unwinder, I think the unwinder can't find any info about the >
Re: Getting empty callchain from perf_callchain_kernel()
On Thu, May 23, 2019 at 7:46 AM Josh Poimboeuf wrote:
>
> On Wed, May 22, 2019 at 12:45:17PM -0500, Josh Poimboeuf wrote:
> > On Wed, May 22, 2019 at 02:49:07PM +, Alexei Starovoitov wrote:
> > > The one that is broken is prog_tests/stacktrace_map.c
> > > There we attach bpf to standard tracepoint where
> > > kernel suppose to collect pt_regs before calling into bpf.
> > > And that's what bpf_get_stackid_tp() is doing.
> > > It passes pt_regs (that was collected before any bpf)
> > > into bpf_get_stackid() which calls get_perf_callchain().
> > > Same thing with kprobes, uprobes.
> >
> > Is it trying to unwind through ___bpf_prog_run()?
> >
> > If so, that would at least explain why ORC isn't working. Objtool
> > currently ignores that function because it can't follow the jump table.
>
> Here's a tentative fix (for ORC, at least). I'll need to make sure this
> doesn't break anything else.
>
> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> index 242a643af82f..1d9a7cc4b836 100644
> --- a/kernel/bpf/core.c
> +++ b/kernel/bpf/core.c
> @@ -1562,7 +1562,6 @@ static u64 ___bpf_prog_run(u64 *regs, const struct bpf_insn *insn, u64 *stack)
>  	BUG_ON(1);
>  	return 0;
>  }
> -STACK_FRAME_NON_STANDARD(___bpf_prog_run); /* jump table */
>
>  #define PROG_NAME(stack_size) __bpf_prog_run##stack_size
>  #define DEFINE_BPF_PROG_RUN(stack_size) \
> diff --git a/tools/objtool/check.c b/tools/objtool/check.c
> index 172f99195726..2567027fce95 100644
> --- a/tools/objtool/check.c
> +++ b/tools/objtool/check.c
> @@ -1033,13 +1033,6 @@ static struct rela *find_switch_table(struct objtool_file *file,
>  		if (text_rela->type == R_X86_64_PC32)
>  			table_offset += 4;
>
> -		/*
> -		 * Make sure the .rodata address isn't associated with a
> -		 * symbol. gcc jump tables are anonymous data.
> -		 */
> -		if (find_symbol_containing(rodata_sec, table_offset))
> -			continue;
> -
>  		rodata_rela = find_rela_by_dest(rodata_sec, table_offset);
>  		if (rodata_rela) {
>  			/*

Hi Josh, this still won't fix the problem.
Problem is not (or not only) with ___bpf_prog_run, what actually went
wrong is with the JITed bpf code.

For frame pointer unwinder, it seems the JITed bpf code will have a
shifted "BP" register? (arch/x86/net/bpf_jit_comp.c:217), so if we can
unshift it properly then it will work.

I tried below code, and problem is fixed (only for frame pointer
unwinder though). Need to find a better way to detect and do any
similar trick for bpf part, if this is a feasible way to fix it:

diff --git a/arch/x86/kernel/unwind_frame.c b/arch/x86/kernel/unwind_frame.c
index 9b9fd4826e7a..2c0fa2aaa7e4 100644
--- a/arch/x86/kernel/unwind_frame.c
+++ b/arch/x86/kernel/unwind_frame.c
@@ -330,8 +330,17 @@ bool unwind_next_frame(struct unwind_state *state)
 	}
 
 	/* Move to the next frame if it's safe: */
-	if (!update_stack_state(state, next_bp))
-		goto bad_address;
+	if (!update_stack_state(state, next_bp)) {
+		// Try again with shifted BP
+		state->bp += 5; // see AUX_STACK_SPACE
+		next_bp = (unsigned long *)READ_ONCE_TASK_STACK(state->task, *state->bp);
+		// Clean and refetch stack info, it's marked as error outed
+		state->stack_mask = 0;
+		get_stack_info(next_bp, state->task, &state->stack_info, &state->stack_mask);
+		if (!update_stack_state(state, next_bp)) {
+			goto bad_address;
+		}
+	}
 
 	return true;

For ORC unwinder, I think the unwinder can't find any info about the
JITed part. Maybe if we can let it just skip the JITed part and go to
kernel context, that should be good enough.

--
Best Regards,
Kairui Song
Re: [PATCH v2] perf/x86: always include regs->ip in callchain
On Thu, May 23, 2019 at 1:34 PM Song Liu wrote:
>
> Commit d15d356887e7 removes regs->ip for !perf_hw_regs(regs) case. This
> patch adds regs->ip back.
>
> Fixes: d15d356887e7 ("perf/x86: Make perf callchains work without CONFIG_FRAME_POINTER")
> Cc: Kairui Song
> Cc: Peter Zijlstra (Intel)
> Signed-off-by: Song Liu
> ---
>  arch/x86/events/core.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
> index f315425d8468..7b8a9eb4d5fd 100644
> --- a/arch/x86/events/core.c
> +++ b/arch/x86/events/core.c
> @@ -2402,9 +2402,9 @@ perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs *re
>  		return;
>  	}
>
> +	if (perf_callchain_store(entry, regs->ip))
> +		return;
>  	if (perf_hw_regs(regs)) {
> -		if (perf_callchain_store(entry, regs->ip))
> -			return;
>  		unwind_start(&state, current, regs, NULL);
>  	} else {
>  		unwind_start(&state, current, NULL, (void *)regs->sp);
> --
> 2.17.1
>

Hi, this will make the !perf_hw_regs(regs) case record the first stack
level twice, which is wrong. And the actual problem, that the unwinder
gives an empty call trace in bpf, is still not fixed.

--
Best Regards,
Kairui Song
Re: Getting empty callchain from perf_callchain_kernel()
On Sat, May 18, 2019 at 5:48 AM Song Liu wrote: > > > > > On May 17, 2019, at 2:06 PM, Alexei Starovoitov wrote: > > > > On 5/17/19 11:40 AM, Song Liu wrote: > >> +Alexei, Daniel, and bpf > >> > >>> On May 17, 2019, at 2:10 AM, Peter Zijlstra wrote: > >>> > >>> On Fri, May 17, 2019 at 04:15:39PM +0800, Kairui Song wrote: > >>>> Hi, I think the actual problem is that bpf_get_stackid_tp (and maybe > >>>> some other bpf functions) is now broken, or, starting an unwind > >>>> directly inside a bpf program will end up strangely. It gives the following > >>>> kernel message: > >>> > >>> Urgh, what is that bpf_get_stackid_tp() doing to get the regs? I can't > >>> follow. > >> > >> I guess we need something like the following? (we should be able to > >> optimize the PER_CPU stuff). > >> > >> Thanks, > >> Song > >> > >> > >> diff --git i/kernel/trace/bpf_trace.c w/kernel/trace/bpf_trace.c > >> index f92d6ad5e080..c525149028a7 100644 > >> --- i/kernel/trace/bpf_trace.c > >> +++ w/kernel/trace/bpf_trace.c > >> @@ -696,11 +696,13 @@ static const struct bpf_func_proto > >> bpf_perf_event_output_proto_tp = { > >> .arg5_type = ARG_CONST_SIZE_OR_ZERO, > >> }; > >> > >> +static DEFINE_PER_CPU(struct pt_regs, bpf_stackid_tp_regs); > >> BPF_CALL_3(bpf_get_stackid_tp, void *, tp_buff, struct bpf_map *, map, > >>u64, flags) > >> { > >> - struct pt_regs *regs = *(struct pt_regs **)tp_buff; > >> + struct pt_regs *regs = this_cpu_ptr(&bpf_stackid_tp_regs); > >> > >> + perf_fetch_caller_regs(regs); > > > > No. pt_regs is already passed in. It's the first argument. > > If we call perf_fetch_caller_regs() again the stack trace will be wrong. > > bpf prog should not see itself, interpreter or all the frames in between. > > Thanks Alexei! I get it now. 
> > In bpf_get_stackid_tp(), the pt_regs is obtained by dereferencing the first field > of tp_buff: > > struct pt_regs *regs = *(struct pt_regs **)tp_buff; > > tp_buff points to something like > > struct sched_switch_args { > unsigned long long pad; > char prev_comm[16]; > int prev_pid; > int prev_prio; > long long prev_state; > char next_comm[16]; > int next_pid; > int next_prio; > }; > > where the first field "pad" is a pointer to pt_regs. > > @Kairui, I think you confirmed that current code will give empty call trace > with ORC unwinder? If that's the case, can we add regs->ip back? (as in the > first email of this thread.) > > Thanks, > Song > Hi, thanks for the suggestion. Yes, we can add it back; it is a good idea to always have the IP when a stack trace is not available. But the stack trace is actually still broken: it will always give only one level of stacktrace (the IP). -- Best Regards, Kairui Song
Re: Getting empty callchain from perf_callchain_kernel()
On Fri, May 17, 2019 at 5:10 PM Peter Zijlstra wrote: > > On Fri, May 17, 2019 at 04:15:39PM +0800, Kairui Song wrote: > > Hi, I think the actual problem is that bpf_get_stackid_tp (and maybe > > some other bpf functions) is now broken, or, starting an unwind > > directly inside a bpf program will end up strangely. It gives the following > > kernel message: > > Urgh, what is that bpf_get_stackid_tp() doing to get the regs? I can't > follow. bpf_get_stackid_tp will just use the regs passed to it from the trace point, and then it will eventually call perf_get_callchain to get the call chain. With a tracepoint we have fake regs, so the unwinder will start from where it is called, use the fake regs as the indicator of the target frame it wants, and keep unwinding until it reaches the actual callsite. But if the stack trace is started within a bpf func call then it's broken... If the unwinder could trace back through the bpf func call there would be no such problem. For the frame pointer unwinder, setting the indicator flag (X86_EFLAGS_FIXED) before the bpf call, and ensuring BP is also dumped, could fix it (so it will start using the regs for bpf calls, like before commit d15d356887e7). But for ORC I don't see a clear way to fix the problem. A first thought is to dump some of the callee's regs for ORC (IP, BP, SP, DI, DX, R10, R13, anything else?) in the trace point handler, then use the flag to tell ORC to do one more unwind (because we can't get the caller's regs, we get the callee's regs instead) before actually giving output? I had a try: for the frame pointer unwinder, mark the indicator flag before calling bpf functions, and dump BP as well in the trace point. 
Then with frame pointer, it works, test passed: diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h index 1392d5e6e8d6..6f1192e9776b 100644 --- a/arch/x86/include/asm/perf_event.h +++ b/arch/x86/include/asm/perf_event.h @@ -302,12 +302,25 @@ extern unsigned long perf_misc_flags(struct pt_regs *regs); #include +#ifdef CONFIG_FRAME_POINTER +static inline unsigned long caller_frame_pointer(void) +{ + return (unsigned long)__builtin_frame_address(1); +} +#else +static inline unsigned long caller_frame_pointer(void) +{ + return 0; +} +#endif + /* * We abuse bit 3 from flags to pass exact information, see perf_misc_flags * and the comment with PERF_EFLAGS_EXACT. */ #define perf_arch_fetch_caller_regs(regs, __ip){ \ (regs)->ip = (__ip);\ + (regs)->bp = caller_frame_pointer();\ (regs)->sp = (unsigned long)__builtin_frame_address(0); \ (regs)->cs = __KERNEL_CS; \ regs->flags = 0;\ diff --git a/kernel/events/core.c b/kernel/events/core.c index abbd4b3b96c2..ca7b95ee74f0 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -8549,6 +8549,7 @@ void perf_trace_run_bpf_submit(void *raw_data, int size, int rctx, struct task_struct *task) { if (bpf_prog_array_valid(call)) { + regs->flags |= X86_EFLAGS_FIXED; *(struct pt_regs **)raw_data = regs; if (!trace_call_bpf(call, raw_data) || hlist_empty(head)) { perf_swevent_put_recursion_context(rctx); @@ -8822,6 +8823,8 @@ static void bpf_overflow_handler(struct perf_event *event, int ret = 0; ctx.regs = perf_arch_bpf_user_pt_regs(regs); + ctx.regs->flags |= X86_EFLAGS_FIXED; + preempt_disable(); if (unlikely(__this_cpu_inc_return(bpf_prog_active) != 1)) goto out; diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c index f92d6ad5e080..e1fa656677dc 100644 --- a/kernel/trace/bpf_trace.c +++ b/kernel/trace/bpf_trace.c @@ -497,6 +497,8 @@ u64 bpf_event_output(struct bpf_map *map, u64 flags, void *meta, u64 meta_size, }; perf_fetch_caller_regs(regs); + regs->flags |= X86_EFLAGS_FIXED; + 
perf_sample_data_init(sd, 0, 0); sd->raw = &raw; @@ -831,6 +833,8 @@ BPF_CALL_5(bpf_perf_event_output_raw_tp, struct bpf_raw_tracepoint_args *, args, struct pt_regs *regs = this_cpu_ptr(&bpf_raw_tp_regs); perf_fetch_caller_regs(regs); + regs->flags |= X86_EFLAGS_FIXED; + return bpf_perf_event_output(regs, map, flags, data, size); } @@ -851,6 +855,8 @@ BPF_CALL_3(bpf_get_stackid_raw_tp, struct bpf_raw_tracepoint_args *, args, struct pt_regs *regs = this_cpu_ptr(&bpf_raw_tp_regs); perf_fetch_caller_regs(regs); + regs->flags |= X86_EFLAGS_FIXED; + /* similar to bpf_perf_event_output_tp, but pt_regs fetched differently */ return bpf_get_stackid((unsigned long) regs, (unsigned long) map, flags, 0, 0); @@ -871,6 +877,8 @@ BPF_CALL_4(bpf_get_stack_raw_
Re: Getting empty callchain from perf_callchain_kernel()
On Fri, May 17, 2019 at 4:15 PM Kairui Song wrote: > > On Fri, May 17, 2019 at 4:11 PM Peter Zijlstra wrote: > > > > On Fri, May 17, 2019 at 09:46:00AM +0200, Peter Zijlstra wrote: > > > On Thu, May 16, 2019 at 11:51:55PM +, Song Liu wrote: > > > > Hi, > > > > > > > > We found a failure with selftests/bpf/tests_prog in test_stacktrace_map > > > > (on bpf/master > > > > branch). > > > > > > > > After digging into the code, we found that perf_callchain_kernel() is > > > > giving empty > > > > callchain for tracepoint sched/sched_switch. And it seems related to > > > > commit > > > > > > > > d15d356887e770c5f2dcf963b52c7cb510c9e42d > > > > ("perf/x86: Make perf callchains work without CONFIG_FRAME_POINTER") > > > > > > > > Before this commit, perf_callchain_kernel() returns callchain with > > > > regs->ip. With > > > > this commit, regs->ip is not sent for !perf_hw_regs(regs) case. > > > > > > So while I think the below is indeed right; we should store regs->ip > > > regardless of the unwind path chosen, I still think there's something > > > fishy if this results in just the 1 entry. > > > > > > The sched/sched_switch event really should have a non-trivial stack. > > > > > > Let me see if I can reproduce with just perf. > > > > $ perf record -g -e "sched:sched_switch" -- make clean > > $ perf report -D > > > > 12 904071759467 0x1790 [0xd0]: PERF_RECORD_SAMPLE(IP, 0x1): 7236/7236: > > 0x81c29562 period: 1 addr: 0 > > ... FP chain: nr:10 > > . 0: ff80 > > . 1: 81c29562 > > . 2: 81c29933 > > . 3: 8111f688 > > . 4: 81120b9d > > . 5: 81120ce5 > > . 6: 8100254a > > . 7: 81e0007d > > . 8: fe00 > > . 9: 7f9b6cd9682a > > ... thread: sh:7236 > > .. dso: /lib/modules/5.1.0-12177-g41bbb9129767/build/vmlinux > > > > > > IOW, it seems to 'work'. > > > > Hi, I think the actual problem is that bpf_get_stackid_tp (and maybe > some other bfp functions) is now broken, or, strating an unwind > directly inside a bpf program will end up strangely. 
It have following > kernel message: > > WARNING: kernel stack frame pointer at 70cad07c in > test_progs:1242 has bad value ffd4497e > > And in the debug message: > > [ 160.460287] 6e117175: aa23a0e8 > (get_perf_callchain+0x148/0x280) > [ 160.460287] 02e8715f: 00c6bba0 (0xc6bba0) > [ 160.460288] b3d07758: 9ce3f979 (0x9ce3f979) > [ 160.460289] 55c97836: 9ce3f979 (0x9ce3f979) > [ 160.460289] 7cbb140a: 0001007f (0x1007f) > [ 160.460290] 7dc62ac9: ... > [ 160.460290] 6b41e346: 1c7da01cd70c4000 (0x1c7da01cd70c4000) > [ 160.460291] f23d1826: d89cffc3ae80 (0xd89cffc3ae80) > [ 160.460292] f9a16017: 007f (0x7f) > [ 160.460292] a8e62d44: ... > [ 160.460293] cbc83c97: b89d00d8d000 (0xb89d00d8d000) > [ 160.460293] 92842522: 007f (0x7f) > [ 160.460294] dafbec89: b89d00c6bc50 (0xb89d00c6bc50) > [ 160.460296] 0777e4cf: aa225d97 (bpf_get_stackid+0x77/0x470) > [ 160.460296] 9942ea16: ... > [ 160.460297] a08006b1: 0001 (0x1) > [ 160.460298] 9f03b438: b89d00c6bc30 (0xb89d00c6bc30) > [ 160.460299] 6caf8b32: aa074fe8 (__do_page_fault+0x58/0x90) > [ 160.460300] 3a13d702: ... > [ 160.460300] e2e2496d: 9ce3 (0x9ce3) > [ 160.460301] 8ee6b7c2: d89cffc3ae80 (0xd89cffc3ae80) > [ 160.460301] a8cf6d55: ... > [ 160.460302] 59078076: d89cffc3ae80 (0xd89cffc3ae80) > [ 160.460303] c6aac739: 9ce3f1e18eb0 (0x9ce3f1e18eb0) > [ 160.460303] a39aff92: b89d00c6bc60 (0xb89d00c6bc60) > [ 160.460305] 97498bf2: aa1f4791 > (bpf_get_stackid_tp+0x11/0x20) > [ 160.460306] 6992de1e: b89d00c6bc78 (0xb89d00c6bc78) > [ 160.460307] dacd0ce5: c0405676 (0xc0405676) > [ 160.460307] a81f2714: ... > > # Note here is the invalid frame pointer > [ 160.460308] 70cad07c: b89d
Re: Getting empty callchain from perf_callchain_kernel()
ab651be0 (event_sched_migrate_task+0xa0/0xa0) [ 160.460316] 355cf319: ... [ 160.460316] 3b67f2ad: d89cffc3ae80 (0xd89cffc3ae80) [ 160.460316] 9a77e20b: 9ce3fba25000 (0x9ce3fba25000) [ 160.460317] 32cf7376: 0001 (0x1) [ 160.460317] 0051db74: b89d00c6bd20 (0xb89d00c6bd20) [ 160.460318] 40eb3bf7: aa22be81 (perf_trace_run_bpf_submit+0x41/0xb0) Simply storing the IP still won't really fix the problem; it just makes the test pass. I had a try at having the bpf functions set X86_EFLAGS_FIXED in the flags and always dump BP, and it bypassed this specific problem. I used the frame pointer unwinder for testing this, and ORC seems fine with it. -- Best Regards, Kairui Song
[tip:perf/core] perf/x86: Make perf callchains work without CONFIG_FRAME_POINTER
Commit-ID: d15d356887e770c5f2dcf963b52c7cb510c9e42d Gitweb: https://git.kernel.org/tip/d15d356887e770c5f2dcf963b52c7cb510c9e42d Author: Kairui Song AuthorDate: Tue, 23 Apr 2019 00:26:52 +0800 Committer: Ingo Molnar CommitDate: Mon, 29 Apr 2019 08:25:05 +0200 perf/x86: Make perf callchains work without CONFIG_FRAME_POINTER Currently perf callchains don't work well with the ORC unwinder when sampling from a trace point. We get a useless in-kernel callchain like this: perf 6429 [000]22.498450: kmem:mm_page_alloc: page=0x176a17 pfn=1534487 order=0 migratetype=0 gfp_flags=GFP_KERNEL be23e32e __alloc_pages_nodemask+0x22e (/lib/modules/5.1.0-rc3+/build/vmlinux) 7efdf7f7d3e8 __poll+0x18 (/usr/lib64/libc-2.28.so) 5651468729c1 [unknown] (/usr/bin/perf) 5651467ee82a main+0x69a (/usr/bin/perf) 7efdf7eaf413 __libc_start_main+0xf3 (/usr/lib64/libc-2.28.so) 5541f689495641d7 [unknown] ([unknown]) The root cause is that for trace point events no real snapshot of the hardware registers is provided. Instead, perf tries to fetch the required caller's registers and compose a fake register snapshot which is supposed to contain enough information to start unwinding. However, without CONFIG_FRAME_POINTER the caller's BP cannot be fetched as the frame pointer, so the current frame pointer is returned instead. The result is an invalid register combination which confuses the unwinder and ends the stacktrace early. So in such a case just don't try to dump BP, and let the unwinder start directly when the registers are not a real snapshot. Use SP as the skip mark; the unwinder will skip all the frames until it meets the frame of the trace point caller. 
Tested with frame pointer unwinder and ORC unwinder, this makes perf callchain get the full kernel space stacktrace again like this: perf 6503 [000] 1567.570191: kmem:mm_page_alloc: page=0x16c904 pfn=1493252 order=0 migratetype=0 gfp_flags=GFP_KERNEL b523e2ae __alloc_pages_nodemask+0x22e (/lib/modules/5.1.0-rc3+/build/vmlinux) b52383bd __get_free_pages+0xd (/lib/modules/5.1.0-rc3+/build/vmlinux) b52fd28a __pollwait+0x8a (/lib/modules/5.1.0-rc3+/build/vmlinux) b521426f perf_poll+0x2f (/lib/modules/5.1.0-rc3+/build/vmlinux) b52fe3e2 do_sys_poll+0x252 (/lib/modules/5.1.0-rc3+/build/vmlinux) b52ff027 __x64_sys_poll+0x37 (/lib/modules/5.1.0-rc3+/build/vmlinux) b500418b do_syscall_64+0x5b (/lib/modules/5.1.0-rc3+/build/vmlinux) b5a0008c entry_SYSCALL_64_after_hwframe+0x44 (/lib/modules/5.1.0-rc3+/build/vmlinux) 7f71e92d03e8 __poll+0x18 (/usr/lib64/libc-2.28.so) 55a22960d9c1 [unknown] (/usr/bin/perf) 55a22958982a main+0x69a (/usr/bin/perf) 7f71e9202413 __libc_start_main+0xf3 (/usr/lib64/libc-2.28.so) 5541f689495641d7 [unknown] ([unknown]) Co-developed-by: Josh Poimboeuf Signed-off-by: Kairui Song Signed-off-by: Peter Zijlstra (Intel) Cc: Alexander Shishkin Cc: Alexei Starovoitov Cc: Arnaldo Carvalho de Melo Cc: Borislav Petkov Cc: Dave Young Cc: Jiri Olsa Cc: Linus Torvalds Cc: Namhyung Kim Cc: Peter Zijlstra Cc: Thomas Gleixner Link: https://lkml.kernel.org/r/20190422162652.15483-1-kas...@redhat.com Signed-off-by: Ingo Molnar --- arch/x86/events/core.c| 21 + arch/x86/include/asm/perf_event.h | 7 +-- arch/x86/include/asm/stacktrace.h | 13 - include/linux/perf_event.h| 14 ++ 4 files changed, 28 insertions(+), 27 deletions(-) diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c index de1a924a4914..f315425d8468 100644 --- a/arch/x86/events/core.c +++ b/arch/x86/events/core.c @@ -2382,6 +2382,15 @@ void arch_perf_update_userpage(struct perf_event *event, cyc2ns_read_end(); } +/* + * Determine whether the regs were taken from an irq/exception handler rather + * than 
from perf_arch_fetch_caller_regs(). + */ +static bool perf_hw_regs(struct pt_regs *regs) +{ + return regs->flags & X86_EFLAGS_FIXED; +} + void perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs) { @@ -2393,11 +2402,15 @@ perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs *re return; } - if (perf_callchain_store(entry, regs->ip)) - return; + if (perf_hw_regs(regs)) { + if (perf_callchain_store(entry, regs->ip)) + return; + unwind_start(&state, current, regs, NULL); + } else { + unwind_start(&state, current, NULL, (void *)regs->sp); + } - for (unwind_start(&state, current, regs, NULL); !unwind_done(&state); -unwind_next_frame(&state)) { + for (; !unwind_done(&state); unwind_next_frame(&state)) { addr = unwind_get_return_address(&state); if (!addr || perf_callchain_store(entry, addr)) return;
Re: [RFC PATCH v4] perf/x86: make perf callchain work without CONFIG_FRAME_POINTER
On Tue, Apr 23, 2019 at 7:35 AM Peter Zijlstra wrote: > > On Tue, Apr 23, 2019 at 12:26:52AM +0800, Kairui Song wrote: > > Currently perf callchain doesn't work well with ORC unwinder > > when sampling from trace point. We'll get useless in kernel callchain > > like this: > > > > perf 6429 [000]22.498450: kmem:mm_page_alloc: > > page=0x176a17 pfn=1534487 order=0 migratetype=0 gfp_flags=GFP_KERNEL > > be23e32e __alloc_pages_nodemask+0x22e > > (/lib/modules/5.1.0-rc3+/build/vmlinux) > > 7efdf7f7d3e8 __poll+0x18 (/usr/lib64/libc-2.28.so) > > 5651468729c1 [unknown] (/usr/bin/perf) > > 5651467ee82a main+0x69a (/usr/bin/perf) > > 7efdf7eaf413 __libc_start_main+0xf3 (/usr/lib64/libc-2.28.so) > > 5541f689495641d7 [unknown] ([unknown]) > > > > The root cause is that, for trace point events, it doesn't provide a > > real snapshot of the hardware registers. Instead perf tries to get > > required caller's registers and compose a fake register snapshot > > which suppose to contain enough information for start a unwinding. > > However without CONFIG_FRAME_POINTER, if failed to get caller's BP as the > > frame pointer, so current frame pointer is returned instead. We get > > a invalid register combination which confuse the unwinder, and end the > > stacktrace early. > > > > So in such case just don't try dump BP, and let the unwinder start > > directly when the register is not a real snapshot. And Use SP > > as the skip mark, unwinder will skip all the frames until it meet > > the frame of the trace point caller. 
> > > > Tested with frame pointer unwinder and ORC unwinder, this make perf > > callchain get the full kernel space stacktrace again like this: > > > > perf 6503 [000] 1567.570191: kmem:mm_page_alloc: > > page=0x16c904 pfn=1493252 order=0 migratetype=0 gfp_flags=GFP_KERNEL > > b523e2ae __alloc_pages_nodemask+0x22e > > (/lib/modules/5.1.0-rc3+/build/vmlinux) > > b52383bd __get_free_pages+0xd > > (/lib/modules/5.1.0-rc3+/build/vmlinux) > > b52fd28a __pollwait+0x8a (/lib/modules/5.1.0-rc3+/build/vmlinux) > > b521426f perf_poll+0x2f (/lib/modules/5.1.0-rc3+/build/vmlinux) > > b52fe3e2 do_sys_poll+0x252 > > (/lib/modules/5.1.0-rc3+/build/vmlinux) > > b52ff027 __x64_sys_poll+0x37 > > (/lib/modules/5.1.0-rc3+/build/vmlinux) > > b500418b do_syscall_64+0x5b > > (/lib/modules/5.1.0-rc3+/build/vmlinux) > > b5a0008c entry_SYSCALL_64_after_hwframe+0x44 > > (/lib/modules/5.1.0-rc3+/build/vmlinux) > > 7f71e92d03e8 __poll+0x18 (/usr/lib64/libc-2.28.so) > > 55a22960d9c1 [unknown] (/usr/bin/perf) > > 55a22958982a main+0x69a (/usr/bin/perf) > > 7f71e9202413 __libc_start_main+0xf3 (/usr/lib64/libc-2.28.so) > > 5541f689495641d7 [unknown] ([unknown]) > > > > Co-developed-by: Josh Poimboeuf > > Signed-off-by: Kairui Song > > Thanks! > > > diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h > > index e47ef764f613..ab135abe62e0 100644 > > --- a/include/linux/perf_event.h > > +++ b/include/linux/perf_event.h > > @@ -1059,7 +1059,7 @@ static inline void perf_arch_fetch_caller_regs(struct > > pt_regs *regs, unsigned lo > > * the nth caller. 
We only need a few of the regs: > > * - ip for PERF_SAMPLE_IP > > * - cs for user_mode() tests > > - * - bp for callchains > > + * - sp for callchains > > * - eflags, for future purposes, just in case > > */ > > static inline void perf_fetch_caller_regs(struct pt_regs *regs) > > I've extended that like so: > > --- a/include/linux/perf_event.h > +++ b/include/linux/perf_event.h > @@ -1058,12 +1058,18 @@ static inline void perf_arch_fetch_calle > #endif > > /* > - * Take a snapshot of the regs. Skip ip and frame pointer to > - * the nth caller. We only need a few of the regs: > + * When generating a perf sample in-line, instead of from an interrupt / > + * exception, we lack a pt_regs. This is typically used from software events > + * like: SW_CONTEXT_SWITCHES, SW_MIGRATIONS and the tie-in with tracepoints. > + * > + * We typically don't need a full set, but (for x86) do require: > * - ip for PERF_SAMPLE_IP > * - cs for user_mode() tests > - * - sp for callchains > - * - eflags, for future purposes, just in case > + * - sp for PERF_SAMPLE_CALLCHAIN > + * - eflags for MISC bits and CALLCHAIN (see: perf_hw_regs()) > + * > + * NOTE: assumes @regs is otherwise already 0 filled; this is important for > + * things like PERF_SAMPLE_REGS_INTR. > */ > static inline void perf_fetch_caller_regs(struct pt_regs *regs) > { Sure, the updated comments looks much better. Will the maintainer squash the comment update or should I send a V5? -- Best Regards, Kairui Song
[RFC PATCH v4] perf/x86: make perf callchain work without CONFIG_FRAME_POINTER
Currently perf callchains don't work well with the ORC unwinder when sampling from a trace point. We get a useless in-kernel callchain like this: perf 6429 [000]22.498450: kmem:mm_page_alloc: page=0x176a17 pfn=1534487 order=0 migratetype=0 gfp_flags=GFP_KERNEL be23e32e __alloc_pages_nodemask+0x22e (/lib/modules/5.1.0-rc3+/build/vmlinux) 7efdf7f7d3e8 __poll+0x18 (/usr/lib64/libc-2.28.so) 5651468729c1 [unknown] (/usr/bin/perf) 5651467ee82a main+0x69a (/usr/bin/perf) 7efdf7eaf413 __libc_start_main+0xf3 (/usr/lib64/libc-2.28.so) 5541f689495641d7 [unknown] ([unknown]) The root cause is that for trace point events no real snapshot of the hardware registers is provided. Instead, perf tries to fetch the required caller's registers and compose a fake register snapshot which is supposed to contain enough information to start unwinding. However, without CONFIG_FRAME_POINTER the caller's BP cannot be fetched as the frame pointer, so the current frame pointer is returned instead. The result is an invalid register combination which confuses the unwinder and ends the stacktrace early. So in such a case just don't try to dump BP, and let the unwinder start directly when the registers are not a real snapshot. And use SP as the skip mark; the unwinder will skip all the frames until it meets the frame of the trace point caller. 
Tested with frame pointer unwinder and ORC unwinder, this make perf callchain get the full kernel space stacktrace again like this: perf 6503 [000] 1567.570191: kmem:mm_page_alloc: page=0x16c904 pfn=1493252 order=0 migratetype=0 gfp_flags=GFP_KERNEL b523e2ae __alloc_pages_nodemask+0x22e (/lib/modules/5.1.0-rc3+/build/vmlinux) b52383bd __get_free_pages+0xd (/lib/modules/5.1.0-rc3+/build/vmlinux) b52fd28a __pollwait+0x8a (/lib/modules/5.1.0-rc3+/build/vmlinux) b521426f perf_poll+0x2f (/lib/modules/5.1.0-rc3+/build/vmlinux) b52fe3e2 do_sys_poll+0x252 (/lib/modules/5.1.0-rc3+/build/vmlinux) b52ff027 __x64_sys_poll+0x37 (/lib/modules/5.1.0-rc3+/build/vmlinux) b500418b do_syscall_64+0x5b (/lib/modules/5.1.0-rc3+/build/vmlinux) b5a0008c entry_SYSCALL_64_after_hwframe+0x44 (/lib/modules/5.1.0-rc3+/build/vmlinux) 7f71e92d03e8 __poll+0x18 (/usr/lib64/libc-2.28.so) 55a22960d9c1 [unknown] (/usr/bin/perf) 55a22958982a main+0x69a (/usr/bin/perf) 7f71e9202413 __libc_start_main+0xf3 (/usr/lib64/libc-2.28.so) 5541f689495641d7 [unknown] ([unknown]) Co-developed-by: Josh Poimboeuf Signed-off-by: Kairui Song --- Update from V3: - Alway start the unwinding directly on fake registers, so we have a unified path for both with/without frame pointer and simplify the code, as posted by Josh Poimboeuf Update from V2: - Instead of looking at if BP is 0, use X86_EFLAGS_FIXED flag bit as the indicator of where the pt_regs is valid for unwinding. As suggested by Peter Zijlstra - Update some comments accordingly. Update from V1: Get rid of a lot of unneccessary code and just don't dump a inaccurate BP, and use SP as the marker for target frame. 
 arch/x86/events/core.c            | 21 +
 arch/x86/include/asm/perf_event.h |  7 +--
 arch/x86/include/asm/stacktrace.h | 13 -
 include/linux/perf_event.h        |  2 +-
 4 files changed, 19 insertions(+), 24 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 81911e11a15d..9856b5b91b9c 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2348,6 +2348,15 @@ void arch_perf_update_userpage(struct perf_event *event,
 	cyc2ns_read_end();
 }
 
+/*
+ * Determine whether the regs were taken from an irq/exception handler rather
+ * than from perf_arch_fetch_caller_regs().
+ */
+static bool perf_hw_regs(struct pt_regs *regs)
+{
+	return regs->flags & X86_EFLAGS_FIXED;
+}
+
 void
 perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs)
 {
@@ -2359,11 +2368,15 @@ perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs *re
 		return;
 	}
 
-	if (perf_callchain_store(entry, regs->ip))
-		return;
+	if (perf_hw_regs(regs)) {
+		if (perf_callchain_store(entry, regs->ip))
+			return;
+		unwind_start(&state, current, regs, NULL);
+	} else {
+		unwind_start(&state, current, NULL, (void *)regs->sp);
+	}
 
-	for (unwind_start(&state, current, regs, NULL); !unwind_done(&state);
-	     unwind_next_frame(&state)) {
+	for (; !unwind_done(&state); unwind_next_frame(&state)) {
 		addr = unwind_get_return_address(&state);
 		if (!addr || perf_callchain_store(entry, addr))
 			return;

diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 8bdf74902293..f4854cd0905b 100644
--- a/arch
Re: [RFC PATCH v3] perf/x86: make perf callchain work without CONFIG_FRAME_POINTER
On Fri, Apr 19, 2019 at 5:43 PM Peter Zijlstra wrote: > > On Fri, Apr 19, 2019 at 10:17:49AM +0800, Kairui Song wrote: > > On Fri, Apr 19, 2019 at 8:58 AM Josh Poimboeuf wrote: > > > > > > I still don't like using regs->bp because it results in different code > > > paths for FP and ORC. In the FP case, the regs are treated like real > > > regs even though they're fake. > > > > > > Something like the below would be much simpler. Would this work? I don't > > > know if any other code relies on the fake regs->bp or regs->sp. > > > > Works perfectly. My only concern is that FP path used to work very > > well, not sure it's a good idea to change it, and this may bring some > > extra overhead for FP path. > > Given Josh wrote all that code, I'm fairly sure it is still OK :-) > > But also looking at the code in unwind_frame.c, __unwind_start() seems > to pretty much do what the removed caller_frame_pointer() did (when > .regs=NULL) but better. > OK, with FP we will also need to do a few extra unwinding steps: previously it started directly from the frame of the trace point, now it has to trace back to the trace point first. If that's fine I could post another update (it will be pretty much just a copy of the code Josh posted :P, is this OK?) -- Best Regards, Kairui Song
Re: [RFC PATCH v3] perf/x86: make perf callchain work without CONFIG_FRAME_POINTER
On Fri, Apr 19, 2019 at 8:58 AM Josh Poimboeuf wrote: > > I still don't like using regs->bp because it results in different code > paths for FP and ORC. In the FP case, the regs are treated like real > regs even though they're fake. > > Something like the below would be much simpler. Would this work? I don't > know if any other code relies on the fake regs->bp or regs->sp. Works perfectly. My only concern is that FP path used to work very well, not sure it's a good idea to change it, and this may bring some extra overhead for FP path. > > (BTW, tomorrow is a US holiday so I may not be very responsive until > Monday.) > > diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c > index de1a924a4914..f315425d8468 100644 > --- a/arch/x86/events/core.c > +++ b/arch/x86/events/core.c > @@ -2382,6 +2382,15 @@ void arch_perf_update_userpage(struct perf_event > *event, > cyc2ns_read_end(); > } > > +/* > + * Determine whether the regs were taken from an irq/exception handler rather > + * than from perf_arch_fetch_caller_regs(). 
> + */ > +static bool perf_hw_regs(struct pt_regs *regs) > +{ > + return regs->flags & X86_EFLAGS_FIXED; > +} > + > void > perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs > *regs) > { > @@ -2393,11 +2402,15 @@ perf_callchain_kernel(struct perf_callchain_entry_ctx > *entry, struct pt_regs *re > return; > } > > - if (perf_callchain_store(entry, regs->ip)) > - return; > + if (perf_hw_regs(regs)) { > + if (perf_callchain_store(entry, regs->ip)) > + return; > + unwind_start(&state, current, regs, NULL); > + } else { > + unwind_start(&state, current, NULL, (void *)regs->sp); > + } > > - for (unwind_start(&state, current, regs, NULL); !unwind_done(&state); > -unwind_next_frame(&state)) { > + for (; !unwind_done(&state); unwind_next_frame(&state)) { > addr = unwind_get_return_address(&state); > if (!addr || perf_callchain_store(entry, addr)) > return; > diff --git a/arch/x86/include/asm/perf_event.h > b/arch/x86/include/asm/perf_event.h > index 04768a3a5454..1392d5e6e8d6 100644 > --- a/arch/x86/include/asm/perf_event.h > +++ b/arch/x86/include/asm/perf_event.h > @@ -308,14 +308,9 @@ extern unsigned long perf_misc_flags(struct pt_regs > *regs); > */ > #define perf_arch_fetch_caller_regs(regs, __ip){ \ > (regs)->ip = (__ip);\ > - (regs)->bp = caller_frame_pointer();\ > + (regs)->sp = (unsigned long)__builtin_frame_address(0); \ > (regs)->cs = __KERNEL_CS; \ > regs->flags = 0;\ > - asm volatile( \ > - _ASM_MOV "%%"_ASM_SP ", %0\n" \ > - : "=m" ((regs)->sp) \ > - :: "memory" \ > - ); \ > } > > struct perf_guest_switch_msr { > diff --git a/arch/x86/include/asm/stacktrace.h > b/arch/x86/include/asm/stacktrace.h > index d6d758a187b6..a8d0cdf48616 100644 > --- a/arch/x86/include/asm/stacktrace.h > +++ b/arch/x86/include/asm/stacktrace.h > @@ -100,19 +100,6 @@ struct stack_frame_ia32 { > u32 return_address; > }; > > -static inline unsigned long caller_frame_pointer(void) > -{ > - struct stack_frame *frame; > - > - frame = __builtin_frame_address(0); > - > -#ifdef CONFIG_FRAME_POINTER > - frame = 
frame->next_frame; > -#endif > - > - return (unsigned long)frame; > -} > - > void show_opcodes(struct pt_regs *regs, const char *loglvl); > void show_ip(struct pt_regs *regs, const char *loglvl); > #endif /* _ASM_X86_STACKTRACE_H */ > diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h > index f3864e1c5569..0f560069aeec 100644 > --- a/include/linux/perf_event.h > +++ b/include/linux/perf_event.h > @@ -1062,7 +1062,7 @@ static inline void perf_arch_fetch_caller_regs(struct > pt_regs *regs, unsigned lo > * the nth caller. We only need a few of the regs: > * - ip for PERF_SAMPLE_IP > * - cs for user_mode() tests > - * - bp for callchains > + * - sp for callchains > * - eflags, for future purposes, just in case > */ > static inline void perf_fetch_caller_regs(struct pt_regs *regs) -- Best Regards, Kairui Song
[RFC PATCH v3] perf/x86: make perf callchain work without CONFIG_FRAME_POINTER
Currently perf callchains don't work well when sampling from a trace point, with the ORC unwinder enabled and CONFIG_FRAME_POINTER disabled. We get a useless in-kernel callchain like this: perf 6429 [000]22.498450: kmem:mm_page_alloc: page=0x176a17 pfn=1534487 order=0 migratetype=0 gfp_flags=GFP_KERNEL be23e32e __alloc_pages_nodemask+0x22e (/lib/modules/5.1.0-rc3+/build/vmlinux) 7efdf7f7d3e8 __poll+0x18 (/usr/lib64/libc-2.28.so) 5651468729c1 [unknown] (/usr/bin/perf) 5651467ee82a main+0x69a (/usr/bin/perf) 7efdf7eaf413 __libc_start_main+0xf3 (/usr/lib64/libc-2.28.so) 5541f689495641d7 [unknown] ([unknown]) The root cause is that within a trace point perf will try to dump the required caller's registers, but without CONFIG_FRAME_POINTER we can't get the caller's BP as the frame pointer, so the current frame pointer is returned instead. We get an invalid register combination which confuses the unwinder and ends the stacktrace early. So in such a case just don't try to dump BP when doing a partial register dump, and just let the unwinder start directly when the registers are incapable of providing an unwinding start point. Use SP as the skip mark: skip all the frames until we meet the frame we want. 
This make the callchain get the full kernel space stacktrace again: perf 6503 [000] 1567.570191: kmem:mm_page_alloc: page=0x16c904 pfn=1493252 order=0 migratetype=0 gfp_flags=GFP_KERNEL b523e2ae __alloc_pages_nodemask+0x22e (/lib/modules/5.1.0-rc3+/build/vmlinux) b52383bd __get_free_pages+0xd (/lib/modules/5.1.0-rc3+/build/vmlinux) b52fd28a __pollwait+0x8a (/lib/modules/5.1.0-rc3+/build/vmlinux) b521426f perf_poll+0x2f (/lib/modules/5.1.0-rc3+/build/vmlinux) b52fe3e2 do_sys_poll+0x252 (/lib/modules/5.1.0-rc3+/build/vmlinux) b52ff027 __x64_sys_poll+0x37 (/lib/modules/5.1.0-rc3+/build/vmlinux) b500418b do_syscall_64+0x5b (/lib/modules/5.1.0-rc3+/build/vmlinux) b5a0008c entry_SYSCALL_64_after_hwframe+0x44 (/lib/modules/5.1.0-rc3+/build/vmlinux) 7f71e92d03e8 __poll+0x18 (/usr/lib64/libc-2.28.so) 55a22960d9c1 [unknown] (/usr/bin/perf) 55a22958982a main+0x69a (/usr/bin/perf) 7f71e9202413 __libc_start_main+0xf3 (/usr/lib64/libc-2.28.so) 5541f689495641d7 [unknown] ([unknown]) Signed-off-by: Kairui Song --- Update from V2: - Instead of looking at if BP is 0, use X86_EFLAGS_FIXED flag bit as the indicator of where the pt_regs is valid for unwinding. As suggested by Peter Zijlstra - Update some comments accordingly. Update from V1: Get rid of a lot of unneccessary code and just don't dump a inaccurate BP, and use SP as the marker for target frame. 
 arch/x86/events/core.c            | 24 +---
 arch/x86/include/asm/perf_event.h |  5 +
 arch/x86/include/asm/stacktrace.h |  9 +++--
 include/linux/perf_event.h        |  6 +++---
 4 files changed, 36 insertions(+), 8 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index e2b1447192a8..e181e195fe5d 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2355,6 +2355,18 @@ void arch_perf_update_userpage(struct perf_event *event,
 	cyc2ns_read_end();
 }
 
+static inline int
+valid_unwinding_registers(struct pt_regs *regs)
+{
+	/*
+	 * regs might be a fake one, it won't dump the flags reg,
+	 * and without frame pointer, it won't have a valid BP.
+	 */
+	if (IS_ENABLED(CONFIG_FRAME_POINTER))
+		return 1;
+	return (regs->flags & PERF_EFLAGS_SNAP);
+}
+
 void
 perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs)
 {
@@ -2366,11 +2378,17 @@ perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs *re
 		return;
 	}
 
-	if (perf_callchain_store(entry, regs->ip))
+	if (valid_unwinding_registers(regs)) {
+		if (perf_callchain_store(entry, regs->ip))
+			return;
+		unwind_start(&state, current, regs, NULL);
+	} else if (regs->sp) {
+		unwind_start(&state, current, NULL, (unsigned long *)regs->sp);
+	} else {
 		return;
+	}
 
-	for (unwind_start(&state, current, regs, NULL); !unwind_done(&state);
-	     unwind_next_frame(&state)) {
+	for (; !unwind_done(&state); unwind_next_frame(&state)) {
 		addr = unwind_get_return_address(&state);
 		if (!addr || perf_callchain_store(entry, addr))
 			return;
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 8bdf74902293..77c8519512ff 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -239,11 +239,16 @@ extern void perf_events_lapic_init(void);
 /*
  * Abuse bits {3,5} of the cpu eflags register. These flags are otherwise
  * unused and ABI sp
Re: [RFC PATCH v2] perf/x86: make perf callchain work without CONFIG_FRAME_POINTER
On Wed, Apr 17, 2019 at 4:16 AM Josh Poimboeuf wrote: > > On Wed, Apr 17, 2019 at 01:39:19AM +0800, Kairui Song wrote: > > On Tue, Apr 16, 2019 at 7:30 PM Kairui Song wrote: > > > > > > On Tue, Apr 16, 2019 at 12:59 AM Josh Poimboeuf > > > wrote: > > > > > > > > On Mon, Apr 15, 2019 at 05:36:22PM +0200, Peter Zijlstra wrote: > > > > > > > > > > I'll mostly defer to Josh on unwinding, but a few comments below. > > > > > > > > > > On Tue, Apr 09, 2019 at 12:59:42AM +0800, Kairui Song wrote: > > > > > > diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c > > > > > > index e2b1447192a8..6075a4f94376 100644 > > > > > > --- a/arch/x86/events/core.c > > > > > > +++ b/arch/x86/events/core.c > > > > > > @@ -2355,6 +2355,12 @@ void arch_perf_update_userpage(struct > > > > > > perf_event *event, > > > > > > cyc2ns_read_end(); > > > > > > } > > > > > > > > > > > > +static inline int > > > > > > +valid_perf_registers(struct pt_regs *regs) > > > > > > +{ > > > > > > + return (regs->ip && regs->bp && regs->sp); > > > > > > +} > > > > > > > > > > I'm unconvinced by this, with both guess and orc having !bp is > > > > > perfectly > > > > > valid. 
> > > > > > > > > > > void > > > > > > perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, > > > > > > struct pt_regs *regs) > > > > > > { > > > > > > @@ -2366,11 +2372,17 @@ perf_callchain_kernel(struct > > > > > > perf_callchain_entry_ctx *entry, struct pt_regs *re > > > > > > return; > > > > > > } > > > > > > > > > > > > - if (perf_callchain_store(entry, regs->ip)) > > > > > > + if (valid_perf_registers(regs)) { > > > > > > + if (perf_callchain_store(entry, regs->ip)) > > > > > > + return; > > > > > > + unwind_start(, current, regs, NULL); > > > > > > + } else if (regs->sp) { > > > > > > + unwind_start(, current, NULL, (unsigned long > > > > > > *)regs->sp); > > > > > > + } else { > > > > > > return; > > > > > > + } > > > > > > > > > > AFAICT if we, by pure accident, end up with !bp for ORC, then we > > > > > initialize the unwind wrong. > > > > > > > > > > Note that @regs is mostly trivially correct, except for that > > > > > tracepoint > > > > > case. So I don't think we should magic here. > > > > > > > > Ah, I didn't quite understand this code before, and I still don't > > > > really, but I guess the issue is that @regs can be either real or fake. > > > > > > > > In the real @regs case, we just want to always unwind starting from > > > > regs->sp. > > > > > > > > But in the fake @regs case, we should instead unwind from the current > > > > frame, skipping all frames until we hit the fake regs->sp. Because > > > > starting from fake/incomplete regs is most likely going to cause > > > > problems with ORC (or DWARF for other arches). > > > > > > > > The idea of a fake regs is fragile and confusing. Is it possible to > > > > just pass in the "skip" stack pointer directly instead? That should > > > > work for both FP and non-FP. And I _think_ there's no need to ever > > > > capture regs->bp anyway -- the stack pointer should be sufficient. 
> > > > > > Hi, that will break some other usage, if perf_callchain_kernel is > > > called but it won't unwind to the callsite (could be produced by > > > attach an ebpf call to kprobe), things will also go wrong. It should > > > start with given registers when the register is valid. > > > And it's true with omit frame pointer BP value could be anything, so 0 > > > is also valid, I think I need to find a better way to tell if we could > > > start with the registers value or direct start unwinding and skip > > > until got the stack. > > > > > > > Hi, sorry I might have some misu
Re: [RFC PATCH v2] perf/x86: make perf callchain work without CONFIG_FRAME_POINTER
On Wed, Apr 17, 2019 at 1:45 AM Peter Zijlstra wrote: > > On Wed, Apr 17, 2019 at 01:39:19AM +0800, Kairui Song wrote: > > And I also think the "fake"/"real" reg is fragile, could we abuse > > another eflag (just like PERF_EFLAGS_EXACT) to indicate the regs are > > partially dumped fake registers? > > Sure, the SDM seems to suggest bits 1,3,5,15 are 'available'. We've > already used 3 and 5, and I think we can use !X86_EFLAGS_FIXED to > indicate a fake regs set. Any real regs set will always have that set. Thanks! This is a good idea. Will update accordingly in V3 later. -- Best Regards, Kairui Song
Re: [RFC PATCH v2] perf/x86: make perf callchain work without CONFIG_FRAME_POINTER
On Tue, Apr 16, 2019 at 7:30 PM Kairui Song wrote: > > On Tue, Apr 16, 2019 at 12:59 AM Josh Poimboeuf wrote: > > > > On Mon, Apr 15, 2019 at 05:36:22PM +0200, Peter Zijlstra wrote: > > > > > > I'll mostly defer to Josh on unwinding, but a few comments below. > > > > > > On Tue, Apr 09, 2019 at 12:59:42AM +0800, Kairui Song wrote: > > > > diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c > > > > index e2b1447192a8..6075a4f94376 100644 > > > > --- a/arch/x86/events/core.c > > > > +++ b/arch/x86/events/core.c > > > > @@ -2355,6 +2355,12 @@ void arch_perf_update_userpage(struct perf_event > > > > *event, > > > > cyc2ns_read_end(); > > > > } > > > > > > > > +static inline int > > > > +valid_perf_registers(struct pt_regs *regs) > > > > +{ > > > > + return (regs->ip && regs->bp && regs->sp); > > > > +} > > > > > > I'm unconvinced by this, with both guess and orc having !bp is perfectly > > > valid. > > > > > > > void > > > > perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct > > > > pt_regs *regs) > > > > { > > > > @@ -2366,11 +2372,17 @@ perf_callchain_kernel(struct > > > > perf_callchain_entry_ctx *entry, struct pt_regs *re > > > > return; > > > > } > > > > > > > > - if (perf_callchain_store(entry, regs->ip)) > > > > + if (valid_perf_registers(regs)) { > > > > + if (perf_callchain_store(entry, regs->ip)) > > > > + return; > > > > + unwind_start(, current, regs, NULL); > > > > + } else if (regs->sp) { > > > > + unwind_start(, current, NULL, (unsigned long > > > > *)regs->sp); > > > > + } else { > > > > return; > > > > + } > > > > > > AFAICT if we, by pure accident, end up with !bp for ORC, then we > > > initialize the unwind wrong. > > > > > > Note that @regs is mostly trivially correct, except for that tracepoint > > > case. So I don't think we should magic here. > > > > Ah, I didn't quite understand this code before, and I still don't > > really, but I guess the issue is that @regs can be either real or fake. 
> > > > In the real @regs case, we just want to always unwind starting from > > regs->sp. > > > > But in the fake @regs case, we should instead unwind from the current > > frame, skipping all frames until we hit the fake regs->sp. Because > > starting from fake/incomplete regs is most likely going to cause > > problems with ORC (or DWARF for other arches). > > > > The idea of a fake regs is fragile and confusing. Is it possible to > > just pass in the "skip" stack pointer directly instead? That should > > work for both FP and non-FP. And I _think_ there's no need to ever > > capture regs->bp anyway -- the stack pointer should be sufficient. > > Hi, that will break some other usage, if perf_callchain_kernel is > called but it won't unwind to the callsite (could be produced by > attach an ebpf call to kprobe), things will also go wrong. It should > start with given registers when the register is valid. > And it's true with omit frame pointer BP value could be anything, so 0 > is also valid, I think I need to find a better way to tell if we could > start with the registers value or direct start unwinding and skip > until got the stack. > Hi, sorry I might have some misunderstanding. Adding an extra argument (eg. skip_sp) to indicate if it should just unwind from the current frame, and use SP as the "skip mark", should work well. And I also think the "fake"/"real" reg is fragile, could we abuse another eflag (just like PERF_EFLAGS_EXACT) to indicate the regs are partially dumped fake registers? So perf_callchain_kernel just check if it's a "partial registers", and in such case it can start unwinding and skip until it get to SP. This make it easier to tell if the registers are "fake". -- Best Regards, Kairui Song
Re: [RFC PATCH v2] perf/x86: make perf callchain work without CONFIG_FRAME_POINTER
On Tue, Apr 16, 2019 at 12:59 AM Josh Poimboeuf wrote: > > On Mon, Apr 15, 2019 at 05:36:22PM +0200, Peter Zijlstra wrote: > > > > I'll mostly defer to Josh on unwinding, but a few comments below. > > > > On Tue, Apr 09, 2019 at 12:59:42AM +0800, Kairui Song wrote: > > > diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c > > > index e2b1447192a8..6075a4f94376 100644 > > > --- a/arch/x86/events/core.c > > > +++ b/arch/x86/events/core.c > > > @@ -2355,6 +2355,12 @@ void arch_perf_update_userpage(struct perf_event > > > *event, > > > cyc2ns_read_end(); > > > } > > > > > > +static inline int > > > +valid_perf_registers(struct pt_regs *regs) > > > +{ > > > + return (regs->ip && regs->bp && regs->sp); > > > +} > > > > I'm unconvinced by this, with both guess and orc having !bp is perfectly > > valid. > > > > > void > > > perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct > > > pt_regs *regs) > > > { > > > @@ -2366,11 +2372,17 @@ perf_callchain_kernel(struct > > > perf_callchain_entry_ctx *entry, struct pt_regs *re > > > return; > > > } > > > > > > - if (perf_callchain_store(entry, regs->ip)) > > > + if (valid_perf_registers(regs)) { > > > + if (perf_callchain_store(entry, regs->ip)) > > > + return; > > > + unwind_start(, current, regs, NULL); > > > + } else if (regs->sp) { > > > + unwind_start(, current, NULL, (unsigned long > > > *)regs->sp); > > > + } else { > > > return; > > > + } > > > > AFAICT if we, by pure accident, end up with !bp for ORC, then we > > initialize the unwind wrong. > > > > Note that @regs is mostly trivially correct, except for that tracepoint > > case. So I don't think we should magic here. > > Ah, I didn't quite understand this code before, and I still don't > really, but I guess the issue is that @regs can be either real or fake. > > In the real @regs case, we just want to always unwind starting from > regs->sp. 
> > But in the fake @regs case, we should instead unwind from the current > frame, skipping all frames until we hit the fake regs->sp. Because > starting from fake/incomplete regs is most likely going to cause > problems with ORC (or DWARF for other arches). > > The idea of a fake regs is fragile and confusing. Is it possible to > just pass in the "skip" stack pointer directly instead? That should > work for both FP and non-FP. And I _think_ there's no need to ever > capture regs->bp anyway -- the stack pointer should be sufficient. Hi, that will break some other usage, if perf_callchain_kernel is called but it won't unwind to the callsite (could be produced by attach an ebpf call to kprobe), things will also go wrong. It should start with given registers when the register is valid. And it's true with omit frame pointer BP value could be anything, so 0 is also valid, I think I need to find a better way to tell if we could start with the registers value or direct start unwinding and skip until got the stack. > > In other words, either regs should be "real", and skip_sp is NULL; or > regs should be NULL and skip_sp should have a value. > > -- > Josh -- Best Regards, Kairui Song
[RFC PATCH v2] perf/x86: make perf callchain work without CONFIG_FRAME_POINTER
Currently perf callchain is not working properly with the ORC unwinder when sampling events from a trace point. We get a useless in-kernel callchain like this:

perf  6429 [000]  22.498450: kmem:mm_page_alloc: page=0x176a17 pfn=1534487 order=0 migratetype=0 gfp_flags=GFP_KERNEL
            be23e32e __alloc_pages_nodemask+0x22e (/lib/modules/5.1.0-rc3+/build/vmlinux)
        7efdf7f7d3e8 __poll+0x18 (/usr/lib64/libc-2.28.so)
        5651468729c1 [unknown] (/usr/bin/perf)
        5651467ee82a main+0x69a (/usr/bin/perf)
        7efdf7eaf413 __libc_start_main+0xf3 (/usr/lib64/libc-2.28.so)
    5541f689495641d7 [unknown] ([unknown])

The root cause is that within a trace point perf tries to dump the caller's registers, but without CONFIG_FRAME_POINTER we can't get the caller's BP as the frame pointer, so the current frame pointer is returned instead. We get a register combination of the caller's IP and the current BP, which confuses the unwinder and ends the stacktrace early. So in such a case don't dump BP; just let the unwinder start directly and skip until we reach the stack we want.
This makes the callchain get the full kernel-space stacktrace again:

perf  6503 [000] 1567.570191: kmem:mm_page_alloc: page=0x16c904 pfn=1493252 order=0 migratetype=0 gfp_flags=GFP_KERNEL
            b523e2ae __alloc_pages_nodemask+0x22e (/lib/modules/5.1.0-rc3+/build/vmlinux)
            b52383bd __get_free_pages+0xd (/lib/modules/5.1.0-rc3+/build/vmlinux)
            b52fd28a __pollwait+0x8a (/lib/modules/5.1.0-rc3+/build/vmlinux)
            b521426f perf_poll+0x2f (/lib/modules/5.1.0-rc3+/build/vmlinux)
            b52fe3e2 do_sys_poll+0x252 (/lib/modules/5.1.0-rc3+/build/vmlinux)
            b52ff027 __x64_sys_poll+0x37 (/lib/modules/5.1.0-rc3+/build/vmlinux)
            b500418b do_syscall_64+0x5b (/lib/modules/5.1.0-rc3+/build/vmlinux)
            b5a0008c entry_SYSCALL_64_after_hwframe+0x44 (/lib/modules/5.1.0-rc3+/build/vmlinux)
        7f71e92d03e8 __poll+0x18 (/usr/lib64/libc-2.28.so)
        55a22960d9c1 [unknown] (/usr/bin/perf)
        55a22958982a main+0x69a (/usr/bin/perf)
        7f71e9202413 __libc_start_main+0xf3 (/usr/lib64/libc-2.28.so)
    5541f689495641d7 [unknown] ([unknown])

Signed-off-by: Kairui Song
---
Update from V1:
Get rid of a lot of unnecessary code; just don't dump an inaccurate BP, and use SP as the marker for the target frame.
 arch/x86/events/core.c            | 18 +++---
 arch/x86/include/asm/stacktrace.h |  9 +++--
 2 files changed, 22 insertions(+), 5 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index e2b1447192a8..6075a4f94376 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2355,6 +2355,12 @@ void arch_perf_update_userpage(struct perf_event *event,
 	cyc2ns_read_end();
 }
 
+static inline int
+valid_perf_registers(struct pt_regs *regs)
+{
+	return (regs->ip && regs->bp && regs->sp);
+}
+
 void
 perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs)
 {
@@ -2366,11 +2372,17 @@ perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs *re
 		return;
 	}
 
-	if (perf_callchain_store(entry, regs->ip))
+	if (valid_perf_registers(regs)) {
+		if (perf_callchain_store(entry, regs->ip))
+			return;
+		unwind_start(&state, current, regs, NULL);
+	} else if (regs->sp) {
+		unwind_start(&state, current, NULL, (unsigned long *)regs->sp);
+	} else {
 		return;
+	}
 
-	for (unwind_start(&state, current, regs, NULL); !unwind_done(&state);
-	     unwind_next_frame(&state)) {
+	for (; !unwind_done(&state); unwind_next_frame(&state)) {
 		addr = unwind_get_return_address(&state);
 		if (!addr || perf_callchain_store(entry, addr))
 			return;
diff --git a/arch/x86/include/asm/stacktrace.h b/arch/x86/include/asm/stacktrace.h
index f335aad404a4..226077e20412 100644
--- a/arch/x86/include/asm/stacktrace.h
+++ b/arch/x86/include/asm/stacktrace.h
@@ -98,18 +98,23 @@ struct stack_frame_ia32 {
 	u32 return_address;
 };
 
+#ifdef CONFIG_FRAME_POINTER
 static inline unsigned long caller_frame_pointer(void)
 {
 	struct stack_frame *frame;
 
 	frame = __builtin_frame_address(0);
-#ifdef CONFIG_FRAME_POINTER
 	frame = frame->next_frame;
-#endif
 
 	return (unsigned long)frame;
 }
+#else
+static inline unsigned long caller_frame_pointer(void)
+{
+	return 0;
+}
+#endif
 
 void show_opcodes(struct pt_regs *regs, const char *loglvl);
 void show_ip(struct pt_regs *regs, const char *loglvl);
-- 
2.20.1
Re: [RFC PATCH] perf/x86: make perf callchain work without CONFIG_FRAME_POINTER
On Sat, Apr 6, 2019 at 1:27 AM Josh Poimboeuf wrote: > > On Sat, Apr 06, 2019 at 01:05:55AM +0800, Kairui Song wrote: > > On Sat, Apr 6, 2019 at 12:57 AM Josh Poimboeuf wrote: > > > > > > On Fri, Apr 05, 2019 at 11:13:02PM +0800, Kairui Song wrote: > > > > Hi Josh, thanks for the review, I tried again, using latest upstream > > > > kernel commit ea2cec24c8d429ee6f99040e4eb6c7ad627fe777: > > > > # uname -a > > > > Linux localhost.localdomain 5.1.0-rc3+ #29 SMP Fri Apr 5 22:53:05 CST > > > > 2019 x86_64 x86_64 x86_64 GNU/Linux > > > > > > > > Having following config: > > > > > CONFIG_UNWINDER_ORC=y > > > > > # CONFIG_UNWINDER_FRAME_POINTER is not set > > > > and CONFIG_FRAME_POINTER is off too. > > > > > > > > Then record something with perf (also latest upstream version): > > > > ./perf record -g -e kmem:* -c 1 > > > > > > > > Interrupt it, then view the output: > > > > perf script | less > > > > > > > > Then I notice the stacktrace in kernle is incomplete like following. > > > > Did I miss anything? > > > > -- > > > > lvmetad 617 [000]55.600786: kmem:kfree: > > > > call_site=b219e269 ptr=(nil) > > > > b22b2d1c kfree+0x11c > > > > (/lib/modules/5.1.0-rc3+/build/vmlinux) > > > > 7fba7e58fd0f __select+0x5f (/usr/lib64/libc-2.28.so) > > > > > > > > kworker/u2:5-rp 171 [000]55.628529: > > > > kmem:kmem_cache_alloc: call_site=b20e963d > > > > ptr=0xa07f39c581e0 bytes_req=80 bytes_alloc=80 > > > > gfp_flags=GFP_ATOMIC > > > > b22b0dec kmem_cache_alloc+0x13c > > > > (/lib/modules/5.1.0-rc3+/build/vmlinux) > > > > - > > > > > > > > And for the patch, I debugged the problem, and found how it happend: > > > > The reason is that we use following code for fetching the registers on > > > > a trace point: > > > > ...snip... > > > > #define perf_arch_fetch_caller_regs(regs, __ip) { \ > > > > (regs)->ip = (__ip); \ > > > > (regs)->bp = caller_frame_pointer(); \ > > > > (regs)->cs = __KERNEL_CS; > > > > ...snip... > > > > > > Thanks, I was able to recreate. 
It only happens when unwinding from a > > > tracepoint. I haven't investigated yet, but > > > perf_arch_fetch_caller_regs() looks highly suspect, since it's doing > > > (regs)->bp = caller_frame_pointer(), even for ORC. > > > > > > My only explanation for how your patch works is that RBP just happens to > > > point to somewhere higher on the stack, causing the unwinder to start at > > > a semi-random location. I suspect the real "fix" is that you're no > > > longer passing the regs to unwind_start(). > > > > > > > Yes that's right. Simply not passing regs to unwind_start will let the > > unwind start from the perf sample handling functions, and introduce a > > lot of "noise", so I let it skipped the frames until it reached the > > frame of the trace point. The regs->bp should still points to the > > stack base of the function which get called in the tracepoint that > > trigger perf sample, so let unwinder skip all the frames above it made > > it work. > > Ah, now I think I understand, thanks. perf_arch_fetch_caller_regs() > puts it in regs->bp, and then perf_callchain_kernel() reads that value > to tell the unwinder where to start dumping the stack trace. I guess > that explains why your patch works, though it still seems very odd that > perf_arch_fetch_caller_regs() is using regs->bp to store the frame > address. Maybe regs->sp would be more appropriate. > > -- > Josh Right, thanks for the comment. And after second thought there are some other issues here in the patch indeed, it still won't fix the problem when used with ebpf and tracepoint, I made some mistake about handling the callchain with different ways, will rethink about this and post an update later. -- Best Regards, Kairui Song
Re: [RFC PATCH] perf/x86: make perf callchain work without CONFIG_FRAME_POINTER
On Sat, Apr 6, 2019 at 12:57 AM Josh Poimboeuf wrote: > > On Fri, Apr 05, 2019 at 11:13:02PM +0800, Kairui Song wrote: > > Hi Josh, thanks for the review, I tried again, using latest upstream > > kernel commit ea2cec24c8d429ee6f99040e4eb6c7ad627fe777: > > # uname -a > > Linux localhost.localdomain 5.1.0-rc3+ #29 SMP Fri Apr 5 22:53:05 CST > > 2019 x86_64 x86_64 x86_64 GNU/Linux > > > > Having following config: > > > CONFIG_UNWINDER_ORC=y > > > # CONFIG_UNWINDER_FRAME_POINTER is not set > > and CONFIG_FRAME_POINTER is off too. > > > > Then record something with perf (also latest upstream version): > > ./perf record -g -e kmem:* -c 1 > > > > Interrupt it, then view the output: > > perf script | less > > > > Then I notice the stacktrace in kernle is incomplete like following. > > Did I miss anything? > > -- > > lvmetad 617 [000]55.600786: kmem:kfree: > > call_site=b219e269 ptr=(nil) > > b22b2d1c kfree+0x11c (/lib/modules/5.1.0-rc3+/build/vmlinux) > > 7fba7e58fd0f __select+0x5f (/usr/lib64/libc-2.28.so) > > > > kworker/u2:5-rp 171 [000]55.628529: > > kmem:kmem_cache_alloc: call_site=b20e963d > > ptr=0xa07f39c581e0 bytes_req=80 bytes_alloc=80 > > gfp_flags=GFP_ATOMIC > > b22b0dec kmem_cache_alloc+0x13c > > (/lib/modules/5.1.0-rc3+/build/vmlinux) > > - > > > > And for the patch, I debugged the problem, and found how it happend: > > The reason is that we use following code for fetching the registers on > > a trace point: > > ...snip... > > #define perf_arch_fetch_caller_regs(regs, __ip) { \ > > (regs)->ip = (__ip); \ > > (regs)->bp = caller_frame_pointer(); \ > > (regs)->cs = __KERNEL_CS; > > ...snip... > > Thanks, I was able to recreate. It only happens when unwinding from a > tracepoint. I haven't investigated yet, but > perf_arch_fetch_caller_regs() looks highly suspect, since it's doing > (regs)->bp = caller_frame_pointer(), even for ORC. 
> > My only explanation for how your patch works is that RBP just happens to > point to somewhere higher on the stack, causing the unwinder to start at > a semi-random location. I suspect the real "fix" is that you're no > longer passing the regs to unwind_start(). > Yes that's right. Simply not passing regs to unwind_start will let the unwind start from the perf sample handling functions, and introduce a lot of "noise", so I let it skipped the frames until it reached the frame of the trace point. The regs->bp should still points to the stack base of the function which get called in the tracepoint that trigger perf sample, so let unwinder skip all the frames above it made it work. -- Best Regards, Kairui Song
Re: [RFC PATCH] perf/x86: make perf callchain work without CONFIG_FRAME_POINTER
On Fri, Apr 5, 2019 at 3:17 PM Peter Zijlstra wrote:
>
> And you forgot to Cc Josh..
>

Hi, thanks for the reply and for Cc'ing more people; I just copied the list from ./scripts/get_maintainer.pl, will pay more attention next time.

> > Just found with ORC unwinder the perf callchain is unusable, and this
> > seems fixes it well, any suggestion is welcome, thanks!
>
> That whole .direct stuff is horrible crap.
>

Sorry if I did anything dumb, but I didn't find a better way to make it work so I sent this RFC... Would you mind telling me what I'm doing wrong, or giving any suggestion about how I should improve it?

--
Best Regards,
Kairui Song
Re: [RFC PATCH] perf/x86: make perf callchain work without CONFIG_FRAME_POINTER
On Fri, Apr 5, 2019 at 10:09 PM Josh Poimboeuf wrote: > > On Fri, Apr 05, 2019 at 01:25:45AM +0800, Kairui Song wrote: > > Currently perf callchain is not working properly with ORC unwinder, > > we'll get useless in kernel callchain like this: > > > > perf 6429 [000]22.498450: kmem:mm_page_alloc: > > page=0x176a17 pfn=1534487 order=0 migratetype=0 gfp_flags=GFP_KERNEL > > be23e32e __alloc_pages_nodemask+0x22e > > (/lib/modules/5.1.0-rc3+/build/vmlinux) > > 7efdf7f7d3e8 __poll+0x18 (/usr/lib64/libc-2.28.so) > > 5651468729c1 [unknown] (/usr/bin/perf) > > 5651467ee82a main+0x69a (/usr/bin/perf) > > 7efdf7eaf413 __libc_start_main+0xf3 (/usr/lib64/libc-2.28.so) > > 5541f689495641d7 [unknown] ([unknown]) > > > > Without CONFIG_FRAME_POINTER, bp is not reserved as frame pointer so > > can't get callers frame pointer, instead current frame pointer is > > returned when trying to fetch caller registers. The unwinder will error > > out early, and end the stacktrace early. > > > > So instead of let the unwinder start with the dumped register, we start > > it right where the unwinding started when the stacktrace is triggered by > > trace event directly. And skip until the frame pointer is reached. 
> > > > This makes the callchain get the full in kernel stacktrace again: > > > > perf 6503 [000] 1567.570191: kmem:mm_page_alloc: > > page=0x16c904 pfn=1493252 order=0 migratetype=0 gfp_flags=GFP_KERNEL > > b523e2ae __alloc_pages_nodemask+0x22e > > (/lib/modules/5.1.0-rc3+/build/vmlinux) > > b52383bd __get_free_pages+0xd > > (/lib/modules/5.1.0-rc3+/build/vmlinux) > > b52fd28a __pollwait+0x8a > > (/lib/modules/5.1.0-rc3+/build/vmlinux) > > b521426f perf_poll+0x2f > > (/lib/modules/5.1.0-rc3+/build/vmlinux) > > b52fe3e2 do_sys_poll+0x252 > > (/lib/modules/5.1.0-rc3+/build/vmlinux) > > b52ff027 __x64_sys_poll+0x37 > > (/lib/modules/5.1.0-rc3+/build/vmlinux) > > b500418b do_syscall_64+0x5b > > (/lib/modules/5.1.0-rc3+/build/vmlinux) > > b5a0008c entry_SYSCALL_64_after_hwframe+0x44 > > (/lib/modules/5.1.0-rc3+/build/vmlinux) > > 7f71e92d03e8 __poll+0x18 (/usr/lib64/libc-2.28.so) > > 55a22960d9c1 [unknown] (/usr/bin/perf) > > 55a22958982a main+0x69a (/usr/bin/perf) > > 7f71e9202413 __libc_start_main+0xf3 (/usr/lib64/libc-2.28.so) > > 5541f689495641d7 [unknown] ([unknown]) > > > > > > > > Just found with ORC unwinder the perf callchain is unusable, and this > > seems fixes it well, any suggestion is welcome, thanks! > > Hi Kairui, > > Without CONFIG_FRAME_POINTER, the BP register has no meaning, so I don't > see how this patch could work. > > Also, perf stack traces seem to work fine for me with ORC. Can you give > some details on how to recreate the issue? > > -- > Josh Hi Josh, thanks for the review, I tried again, using latest upstream kernel commit ea2cec24c8d429ee6f99040e4eb6c7ad627fe777: # uname -a Linux localhost.localdomain 5.1.0-rc3+ #29 SMP Fri Apr 5 22:53:05 CST 2019 x86_64 x86_64 x86_64 GNU/Linux Having following config: > CONFIG_UNWINDER_ORC=y > # CONFIG_UNWINDER_FRAME_POINTER is not set and CONFIG_FRAME_POINTER is off too. 
Then record something with perf (also the latest upstream version):
./perf record -g -e kmem:* -c 1

Interrupt it, then view the output:
perf script | less

Then I notice the stacktrace in the kernel is incomplete like the following. Did I miss anything?
--
lvmetad   617 [000]  55.600786: kmem:kfree: call_site=b219e269 ptr=(nil)
            b22b2d1c kfree+0x11c (/lib/modules/5.1.0-rc3+/build/vmlinux)
        7fba7e58fd0f __select+0x5f (/usr/lib64/libc-2.28.so)

kworker/u2:5-rp   171 [000]  55.628529: kmem:kmem_cache_alloc: call_site=b20e963d ptr=0xa07f39c581e0 bytes_req=80 bytes_alloc=80 gfp_flags=GFP_ATOMIC
            b22b0dec kmem_cache_alloc+0x13c (/lib/modules/5.1.0-rc3+/build/vmlinux)
-

And for the patch, I debugged the problem and found how it happened: the reason is that we use the following code for fetching the registers on a trace point:
...snip...
#define perf_arch_fetch_caller_regs(regs, __ip) { \
	(regs)->ip = (__ip); \
	(regs)->bp = caller_frame_pointer(); \
	(regs)->cs = __KERNEL_CS;
...snip...

It tries to dump the caller's registers, but in the definition of caller_frame_pointer:

static inline unsigned long caller_frame_pointer(void) { struct
[RFC PATCH] perf/x86: make perf callchain work without CONFIG_FRAME_POINTER
Currently perf callchain is not working properly with the ORC unwinder; we'll
get a useless in-kernel callchain like this:

perf  6429 [000]    22.498450: kmem:mm_page_alloc: page=0x176a17 pfn=1534487 order=0 migratetype=0 gfp_flags=GFP_KERNEL
            be23e32e __alloc_pages_nodemask+0x22e (/lib/modules/5.1.0-rc3+/build/vmlinux)
        7efdf7f7d3e8 __poll+0x18 (/usr/lib64/libc-2.28.so)
        5651468729c1 [unknown] (/usr/bin/perf)
        5651467ee82a main+0x69a (/usr/bin/perf)
        7efdf7eaf413 __libc_start_main+0xf3 (/usr/lib64/libc-2.28.so)
    5541f689495641d7 [unknown] ([unknown])

Without CONFIG_FRAME_POINTER, bp is not reserved as a frame pointer, so we
can't get the caller's frame pointer; the current frame pointer is returned
instead when trying to fetch caller registers. The unwinder will error out
early and end the stacktrace prematurely.

So instead of letting the unwinder start with the dumped registers, we start
it right where the unwinding started when the stacktrace is triggered by a
trace event directly, and skip frames until the dumped frame pointer is
reached. This makes the callchain show the full in-kernel stacktrace again:

perf  6503 [000]  1567.570191: kmem:mm_page_alloc: page=0x16c904 pfn=1493252 order=0 migratetype=0 gfp_flags=GFP_KERNEL
            b523e2ae __alloc_pages_nodemask+0x22e (/lib/modules/5.1.0-rc3+/build/vmlinux)
            b52383bd __get_free_pages+0xd (/lib/modules/5.1.0-rc3+/build/vmlinux)
            b52fd28a __pollwait+0x8a (/lib/modules/5.1.0-rc3+/build/vmlinux)
            b521426f perf_poll+0x2f (/lib/modules/5.1.0-rc3+/build/vmlinux)
            b52fe3e2 do_sys_poll+0x252 (/lib/modules/5.1.0-rc3+/build/vmlinux)
            b52ff027 __x64_sys_poll+0x37 (/lib/modules/5.1.0-rc3+/build/vmlinux)
            b500418b do_syscall_64+0x5b (/lib/modules/5.1.0-rc3+/build/vmlinux)
            b5a0008c entry_SYSCALL_64_after_hwframe+0x44 (/lib/modules/5.1.0-rc3+/build/vmlinux)
        7f71e92d03e8 __poll+0x18 (/usr/lib64/libc-2.28.so)
        55a22960d9c1 [unknown] (/usr/bin/perf)
        55a22958982a main+0x69a (/usr/bin/perf)
        7f71e9202413 __libc_start_main+0xf3 (/usr/lib64/libc-2.28.so)
    5541f689495641d7 [unknown] ([unknown])

Just found that with the ORC unwinder the perf callchain is unusable, and this
seems to fix it well; any suggestion is welcome, thanks!

---
 arch/x86/events/core.c     | 34 --
 include/linux/perf_event.h |  3 ++-
 kernel/bpf/stackmap.c      |  4 ++--
 kernel/events/callchain.c  | 13 +++--
 kernel/events/core.c       |  2 +-
 5 files changed, 44 insertions(+), 12 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index e2b1447192a8..3f3e110794ac 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2355,8 +2355,9 @@ void arch_perf_update_userpage(struct perf_event *event,
 	cyc2ns_read_end();
 }
 
-void
-perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs)
+static void
+__perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs,
+			bool direct_call)
 {
 	struct unwind_state state;
 	unsigned long addr;
@@ -2366,17 +2367,38 @@ perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs *re
 		return;
 	}
 
-	if (perf_callchain_store(entry, regs->ip))
-		return;
+	/*
+	 * Without frame pointer, we can't get a reliable caller bp value.
+	 * If this is called directly from a trace point, just start the
+	 * unwind from here and skip until the frame is reached.
+	 */
+	if (IS_ENABLED(CONFIG_FRAME_POINTER) || !direct_call) {
+		if (perf_callchain_store(entry, regs->ip))
+			return;
+		unwind_start(&state, current, regs, NULL);
+	} else {
+		unwind_start(&state, current, NULL, (unsigned long *)regs->bp);
+	}
 
-	for (unwind_start(&state, current, regs, NULL); !unwind_done(&state);
-	     unwind_next_frame(&state)) {
+	for (; !unwind_done(&state); unwind_next_frame(&state)) {
 		addr = unwind_get_return_address(&state);
 		if (!addr || perf_callchain_store(entry, addr))
 			return;
 	}
 }
 
+void
+perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs)
+{
+	__perf_callchain_kernel(entry, regs, false);
+}
+
+void
+perf_callchain_kernel_direct(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs)
+{
+	__perf_callchain_kernel(entry, regs, true);
+}
+
 static inline int valid_user_frame(const void __user *fp, unsigned long size)
 {
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index e47ef764f613..b0e33ba36695 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1154,9 +1154,10 @@ DECLARE_PER_CPU(struct perf_callchain_entry,
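The skip-until-frame idea in the patch can be sketched as a frame-pointer
walk in plain userspace C. This is an illustrative toy only (the `struct
frame` layout and function names are assumptions, not kernel code): each
frame stores the saved caller bp and a return address, and starting the walk
from a given bp naturally skips everything above it.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy model: a frame-pointer unwind follows a chain of saved
 * (bp, return address) pairs.  When a trace event fires, starting the
 * walk at the bp captured in the dumped registers skips the tracer's
 * own frames at the top of the stack.
 */
struct frame {
	struct frame *next_bp;   /* saved caller bp */
	unsigned long ret_addr;  /* saved return address */
};

/* Walk the bp chain, storing return addresses; returns frames stored. */
static int unwind_bp_chain(struct frame *bp, unsigned long *out, int max)
{
	int n = 0;

	while (bp && n < max) {
		out[n++] = bp->ret_addr;
		bp = bp->next_bp;
	}
	return n;
}
```

Passing a deeper frame's bp as the starting point yields a shorter chain,
which is exactly how the tracer frames get skipped.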
[tip:x86/urgent] x86/gart: Exclude GART aperture from kcore
Commit-ID:  ffc8599aa9763f39f6736a79da4d1575e7006f9a
Gitweb:     https://git.kernel.org/tip/ffc8599aa9763f39f6736a79da4d1575e7006f9a
Author:     Kairui Song
AuthorDate: Fri, 8 Mar 2019 11:05:08 +0800
Committer:  Thomas Gleixner
CommitDate: Sat, 23 Mar 2019 12:11:49 +0100

x86/gart: Exclude GART aperture from kcore

On machines where the GART aperture is mapped over physical RAM,
/proc/kcore contains the GART aperture range. Accessing the GART range
via /proc/kcore results in a kernel crash.

vmcore used to have the same issue, until it was fixed with commit
2a3e83c6f96c ("x86/gart: Exclude GART aperture from vmcore"), leveraging
the existing hook infrastructure in vmcore to let /proc/vmcore return
zeroes when attempting to read the aperture region, so it won't read
from the actual memory.

Apply the same workaround for kcore. First implement the same hook
infrastructure for kcore, then reuse the hook functions introduced in
the previous vmcore fix, with some minor adjustments: rename some
functions for more general usage, and simplify the hook infrastructure
a bit as there is no module usage yet.

Suggested-by: Baoquan He
Signed-off-by: Kairui Song
Signed-off-by: Thomas Gleixner
Reviewed-by: Jiri Bohac
Acked-by: Baoquan He
Cc: Borislav Petkov
Cc: "H.
Peter Anvin"
Cc: Alexey Dobriyan
Cc: Andrew Morton
Cc: Omar Sandoval
Cc: Dave Young
Link: https://lkml.kernel.org/r/20190308030508.13548-1-kas...@redhat.com
---
 arch/x86/kernel/aperture_64.c | 20 +---
 fs/proc/kcore.c               | 27 +++
 include/linux/kcore.h         |  2 ++
 3 files changed, 42 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kernel/aperture_64.c b/arch/x86/kernel/aperture_64.c
index 58176b56354e..294ed4392a0e 100644
--- a/arch/x86/kernel/aperture_64.c
+++ b/arch/x86/kernel/aperture_64.c
@@ -14,6 +14,7 @@
 #define pr_fmt(fmt) "AGP: " fmt
 
 #include
+#include <linux/kcore.h>
 #include
 #include
 #include
@@ -57,7 +58,7 @@ int fallback_aper_force __initdata;
 
 int fix_aperture __initdata = 1;
 
-#ifdef CONFIG_PROC_VMCORE
+#if defined(CONFIG_PROC_VMCORE) || defined(CONFIG_PROC_KCORE)
 /*
  * If the first kernel maps the aperture over e820 RAM, the kdump kernel will
  * use the same range because it will remain configured in the northbridge.
@@ -66,20 +67,25 @@ int fix_aperture __initdata = 1;
  */
 static unsigned long aperture_pfn_start, aperture_page_count;
 
-static int gart_oldmem_pfn_is_ram(unsigned long pfn)
+static int gart_mem_pfn_is_ram(unsigned long pfn)
 {
 	return likely((pfn < aperture_pfn_start) ||
 		      (pfn >= aperture_pfn_start + aperture_page_count));
 }
 
-static void exclude_from_vmcore(u64 aper_base, u32 aper_order)
+static void __init exclude_from_core(u64 aper_base, u32 aper_order)
 {
 	aperture_pfn_start = aper_base >> PAGE_SHIFT;
 	aperture_page_count = (32 * 1024 * 1024) << aper_order >> PAGE_SHIFT;
-	WARN_ON(register_oldmem_pfn_is_ram(&gart_oldmem_pfn_is_ram));
+#ifdef CONFIG_PROC_VMCORE
+	WARN_ON(register_oldmem_pfn_is_ram(&gart_mem_pfn_is_ram));
+#endif
+#ifdef CONFIG_PROC_KCORE
+	WARN_ON(register_mem_pfn_is_ram(&gart_mem_pfn_is_ram));
+#endif
 }
 #else
-static void exclude_from_vmcore(u64 aper_base, u32 aper_order)
+static void exclude_from_core(u64 aper_base, u32 aper_order)
 {
 }
 #endif
@@ -474,7 +480,7 @@ out:
 	 * may have allocated the range over its e820 RAM
 	 * and fixed up the northbridge
 	 */
-	exclude_from_vmcore(last_aper_base, last_aper_order);
+	exclude_from_core(last_aper_base, last_aper_order);
 
 	return 1;
 }
@@ -520,7 +526,7 @@ out:
 	 * overlap with the first kernel's memory. We can't access the
 	 * range through vmcore even though it should be part of the dump.
	 */
-	exclude_from_vmcore(aper_alloc, aper_order);
+	exclude_from_core(aper_alloc, aper_order);
 
 	/* Fix up the north bridges */
 	for (i = 0; i < amd_nb_bus_dev_ranges[i].dev_limit; i++) {
diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c
index bbcc185062bb..d29d869abec1 100644
--- a/fs/proc/kcore.c
+++ b/fs/proc/kcore.c
@@ -54,6 +54,28 @@ static LIST_HEAD(kclist_head);
 static DECLARE_RWSEM(kclist_lock);
 static int kcore_need_update = 1;
 
+/*
+ * Returns > 0 for RAM pages, 0 for non-RAM pages, < 0 on error
+ * Same as oldmem_pfn_is_ram in vmcore
+ */
+static int (*mem_pfn_is_ram)(unsigned long pfn);
+
+int __init register_mem_pfn_is_ram(int (*fn)(unsigned long pfn))
+{
+	if (mem_pfn_is_ram)
+		return -EBUSY;
+	mem_pfn_is_ram = fn;
+	return 0;
+}
+
+static int pfn_is_ram(unsigned long pfn)
+{
+	if (mem_pfn_is_ram)
+		return mem_pfn_is_ram(pfn);
+	else
+		return 1;
+}
+
 /* This doesn't grab kclist_lock, so it shou
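The register-once hook pattern used in the kcore fix can be demonstrated
as a self-contained userspace sketch (names mirror the patch, but
`toy_gart_pfn_is_ram` and the pfn ranges are invented for illustration;
the kernel returns -EBUSY where this toy returns -1):

```c
#include <assert.h>
#include <stddef.h>

/*
 * A single callback may be installed exactly once; while no callback is
 * installed, every pfn is conservatively treated as RAM.
 */
static int (*mem_pfn_is_ram)(unsigned long pfn);

static int register_mem_pfn_is_ram(int (*fn)(unsigned long pfn))
{
	if (mem_pfn_is_ram)
		return -1;	/* -EBUSY in the kernel */
	mem_pfn_is_ram = fn;
	return 0;
}

static int pfn_is_ram(unsigned long pfn)
{
	if (mem_pfn_is_ram)
		return mem_pfn_is_ram(pfn);
	return 1;	/* default: assume RAM */
}

/* Example hook: pfns 100..199 are the (pretend) aperture, not RAM. */
static int toy_gart_pfn_is_ram(unsigned long pfn)
{
	return pfn < 100 || pfn >= 200;
}
```

Refusing a second registration is what lets the kernel side get away
without any locking around the pointer: once set, it never changes.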
Re: [PATCH v5] x86/gart/kcore: Exclude GART aperture from kcore
On Fri, Mar 8, 2019 at 11:06 AM Kairui Song wrote:
>
> On machines where the GART aperture is mapped over physical RAM,
> /proc/kcore contains the GART aperture range and reading it may lead
> to kernel panic.
>
> Vmcore used to have the same issue, until we fixed it in commit
> 2a3e83c6f96c ("x86/gart: Exclude GART aperture from vmcore"),
> leveraging the existing hook infrastructure in vmcore to let
> /proc/vmcore return zeroes when attempting to read the aperture
> region, so it won't read from the actual memory.
>
> We apply the same workaround for kcore. First implement the same hook
> infrastructure for kcore, then reuse the hook functions introduced in
> the previous vmcore fix, with some minor adjustments: rename some
> functions for more general usage, and simplify the hook infrastructure
> a bit as there is no module usage yet.
>
> Suggested-by: Baoquan He
> Signed-off-by: Kairui Song
>
> ---
>
> Update from V4:
> - Remove the unregistering function and move functions never used
>   after init to .init
>
> Update from V3:
> - Reuse the approach in V2, as Jiri noticed the V3 approach may fail
>   some use cases. It introduces overlapped regions in kcore, and can't
>   guarantee the read request will fall into the region we wanted.
> - Improve some function naming suggested by Baoquan in V2.
> - Simplify the hook registering and checking; we are not exporting the
>   hook register function for now, no need to make it that complex.
>
> Update from V2:
> Instead of repeating the same hook infrastructure for kcore, introduce
> a new kcore area type to avoid reading from, and let kcore always
> bypass this kind of area.
> > Update from V1: > Fix a complie error when CONFIG_PROC_KCORE is not set > > arch/x86/kernel/aperture_64.c | 20 +--- > fs/proc/kcore.c | 27 +++ > include/linux/kcore.h | 2 ++ > 3 files changed, 42 insertions(+), 7 deletions(-) > > diff --git a/arch/x86/kernel/aperture_64.c b/arch/x86/kernel/aperture_64.c > index 58176b56354e..294ed4392a0e 100644 > --- a/arch/x86/kernel/aperture_64.c > +++ b/arch/x86/kernel/aperture_64.c > @@ -14,6 +14,7 @@ > #define pr_fmt(fmt) "AGP: " fmt > > #include > +#include > #include > #include > #include > @@ -57,7 +58,7 @@ int fallback_aper_force __initdata; > > int fix_aperture __initdata = 1; > > -#ifdef CONFIG_PROC_VMCORE > +#if defined(CONFIG_PROC_VMCORE) || defined(CONFIG_PROC_KCORE) > /* > * If the first kernel maps the aperture over e820 RAM, the kdump kernel will > * use the same range because it will remain configured in the northbridge. > @@ -66,20 +67,25 @@ int fix_aperture __initdata = 1; > */ > static unsigned long aperture_pfn_start, aperture_page_count; > > -static int gart_oldmem_pfn_is_ram(unsigned long pfn) > +static int gart_mem_pfn_is_ram(unsigned long pfn) > { > return likely((pfn < aperture_pfn_start) || > (pfn >= aperture_pfn_start + aperture_page_count)); > } > > -static void exclude_from_vmcore(u64 aper_base, u32 aper_order) > +static void __init exclude_from_core(u64 aper_base, u32 aper_order) > { > aperture_pfn_start = aper_base >> PAGE_SHIFT; > aperture_page_count = (32 * 1024 * 1024) << aper_order >> PAGE_SHIFT; > - WARN_ON(register_oldmem_pfn_is_ram(_oldmem_pfn_is_ram)); > +#ifdef CONFIG_PROC_VMCORE > + WARN_ON(register_oldmem_pfn_is_ram(_mem_pfn_is_ram)); > +#endif > +#ifdef CONFIG_PROC_KCORE > + WARN_ON(register_mem_pfn_is_ram(_mem_pfn_is_ram)); > +#endif > } > #else > -static void exclude_from_vmcore(u64 aper_base, u32 aper_order) > +static void exclude_from_core(u64 aper_base, u32 aper_order) > { > } > #endif > @@ -474,7 +480,7 @@ int __init gart_iommu_hole_init(void) > * may have allocated the range 
over its e820 RAM > * and fixed up the northbridge > */ > - exclude_from_vmcore(last_aper_base, last_aper_order); > + exclude_from_core(last_aper_base, last_aper_order); > > return 1; > } > @@ -520,7 +526,7 @@ int __init gart_iommu_hole_init(void) > * overlap with the first kernel's memory. We can't access the > * range through vmcore even though it should be part of the dump. > */ > - exclude_from_vmcore(aper_alloc, aper_order); > + exclude_from_core(aper_alloc, aper_order); > > /* Fix up the north bridges */ > for (i = 0; i < amd_nb_bus_dev_ranges[i].dev_limit; i++) { > dif
[PATCH v5] x86/gart/kcore: Exclude GART aperture from kcore
On machines where the GART aperture is mapped over physical RAM,
/proc/kcore contains the GART aperture range and reading it may lead
to kernel panic.

Vmcore used to have the same issue, until we fixed it in commit
2a3e83c6f96c ("x86/gart: Exclude GART aperture from vmcore"),
leveraging the existing hook infrastructure in vmcore to let
/proc/vmcore return zeroes when attempting to read the aperture region,
so it won't read from the actual memory.

We apply the same workaround for kcore. First implement the same hook
infrastructure for kcore, then reuse the hook functions introduced in
the previous vmcore fix, with some minor adjustments: rename some
functions for more general usage, and simplify the hook infrastructure
a bit as there is no module usage yet.

Suggested-by: Baoquan He
Signed-off-by: Kairui Song

---

Update from V4:
- Remove the unregistering function and move functions never used after
  init to .init

Update from V3:
- Reuse the approach in V2, as Jiri noticed the V3 approach may fail
  some use cases. It introduces overlapped regions in kcore, and can't
  guarantee the read request will fall into the region we wanted.
- Improve some function naming suggested by Baoquan in V2.
- Simplify the hook registering and checking; we are not exporting the
  hook register function for now, no need to make it that complex.

Update from V2:
Instead of repeating the same hook infrastructure for kcore, introduce
a new kcore area type to avoid reading from, and let kcore always
bypass this kind of area.
Update from V1: Fix a complie error when CONFIG_PROC_KCORE is not set arch/x86/kernel/aperture_64.c | 20 +--- fs/proc/kcore.c | 27 +++ include/linux/kcore.h | 2 ++ 3 files changed, 42 insertions(+), 7 deletions(-) diff --git a/arch/x86/kernel/aperture_64.c b/arch/x86/kernel/aperture_64.c index 58176b56354e..294ed4392a0e 100644 --- a/arch/x86/kernel/aperture_64.c +++ b/arch/x86/kernel/aperture_64.c @@ -14,6 +14,7 @@ #define pr_fmt(fmt) "AGP: " fmt #include +#include #include #include #include @@ -57,7 +58,7 @@ int fallback_aper_force __initdata; int fix_aperture __initdata = 1; -#ifdef CONFIG_PROC_VMCORE +#if defined(CONFIG_PROC_VMCORE) || defined(CONFIG_PROC_KCORE) /* * If the first kernel maps the aperture over e820 RAM, the kdump kernel will * use the same range because it will remain configured in the northbridge. @@ -66,20 +67,25 @@ int fix_aperture __initdata = 1; */ static unsigned long aperture_pfn_start, aperture_page_count; -static int gart_oldmem_pfn_is_ram(unsigned long pfn) +static int gart_mem_pfn_is_ram(unsigned long pfn) { return likely((pfn < aperture_pfn_start) || (pfn >= aperture_pfn_start + aperture_page_count)); } -static void exclude_from_vmcore(u64 aper_base, u32 aper_order) +static void __init exclude_from_core(u64 aper_base, u32 aper_order) { aperture_pfn_start = aper_base >> PAGE_SHIFT; aperture_page_count = (32 * 1024 * 1024) << aper_order >> PAGE_SHIFT; - WARN_ON(register_oldmem_pfn_is_ram(_oldmem_pfn_is_ram)); +#ifdef CONFIG_PROC_VMCORE + WARN_ON(register_oldmem_pfn_is_ram(_mem_pfn_is_ram)); +#endif +#ifdef CONFIG_PROC_KCORE + WARN_ON(register_mem_pfn_is_ram(_mem_pfn_is_ram)); +#endif } #else -static void exclude_from_vmcore(u64 aper_base, u32 aper_order) +static void exclude_from_core(u64 aper_base, u32 aper_order) { } #endif @@ -474,7 +480,7 @@ int __init gart_iommu_hole_init(void) * may have allocated the range over its e820 RAM * and fixed up the northbridge */ - exclude_from_vmcore(last_aper_base, last_aper_order); + 
exclude_from_core(last_aper_base, last_aper_order); return 1; } @@ -520,7 +526,7 @@ int __init gart_iommu_hole_init(void) * overlap with the first kernel's memory. We can't access the * range through vmcore even though it should be part of the dump. */ - exclude_from_vmcore(aper_alloc, aper_order); + exclude_from_core(aper_alloc, aper_order); /* Fix up the north bridges */ for (i = 0; i < amd_nb_bus_dev_ranges[i].dev_limit; i++) { diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c index bbcc185062bb..d29d869abec1 100644 --- a/fs/proc/kcore.c +++ b/fs/proc/kcore.c @@ -54,6 +54,28 @@ static LIST_HEAD(kclist_head); static DECLARE_RWSEM(kclist_lock); static int kcore_need_update = 1; +/* + * Returns > 0 for RAM pages, 0 for non-RAM pages, < 0 on error + * Same as oldmem_pfn_is_ram in vmcore + */ +static int (*mem_pfn_is_ram)(unsigned long pfn); + +int __init register_mem_pfn_is_ram(int (*fn)(unsigned long pfn)) +{ + if (mem_pfn_is_ram) + return -EBUSY; + mem_pfn_is_ram = fn; + re
Re: [PATCH v4] x86/gart/kcore: Exclude GART aperture from kcore
On Thu, Mar 7, 2019 at 1:03 AM Jiri Bohac wrote:
>
> Hi,
>
> On Wed, Mar 06, 2019 at 07:38:59PM +0800, Kairui Song wrote:
> > +int register_mem_pfn_is_ram(int (*fn)(unsigned long pfn))
> > +{
> > +	if (mem_pfn_is_ram)
> > +		return -EBUSY;
> > +	mem_pfn_is_ram = fn;
> > +	return 0;
> > +}
> > +
> > +void unregister_mem_pfn_is_ram(void)
> > +{
> > +	mem_pfn_is_ram = NULL;
> > +}
> > +
> > +static int pfn_is_ram(unsigned long pfn)
> > +{
> > +	if (mem_pfn_is_ram)
> > +		return mem_pfn_is_ram(pfn);
> > +	else
> > +		return 1;
> > +}
> > +
>
> If anyone were ever to use unregister_mem_pfn_is_ram(),
> pfn_is_ram() would become racy.
>
> In V2 you had this:
> +	fn = mem_pfn_is_ram;
> +	if (fn)
> +		ret = fn(pfn);
>
> I agree it's unnecessary since nothing uses
> unregister_mem_pfn_is_ram(). But then I think it would be best to
> just drop the unregister function.
>
> Otherwise the patch looks good to me.
>

Good catch, let me remove the unregister function. Also, I'd like to
add an __init prefix to register_mem_pfn_is_ram; will update in V5.

--
Best Regards,
Kairui Song
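The race Jiri describes, and the V2 pattern that avoids it, can be shown
side by side in a small userspace sketch (the `odd_pfn_is_ram` hook is
invented for illustration; in the kernel the snapshot would also need
READ_ONCE() to be formally correct, which this toy omits):

```c
#include <assert.h>
#include <stddef.h>

static int (*hook)(unsigned long);

/* Example hook for the test below: odd pfns are "RAM". */
static int odd_pfn_is_ram(unsigned long pfn)
{
	return (int)(pfn & 1);
}

static int pfn_is_ram_racy(unsigned long pfn)
{
	/*
	 * BAD if the hook can be unregistered concurrently: the global
	 * pointer is read twice, and may become NULL between the test
	 * and the call.
	 */
	if (hook)
		return hook(pfn);
	return 1;
}

static int pfn_is_ram_safe(unsigned long pfn)
{
	int (*fn)(unsigned long) = hook;	/* single snapshot */

	if (fn)
		return fn(pfn);
	return 1;
}
```

With no unregister function at all (the V5 resolution), both variants are
equivalent, which is why the snapshot could be dropped.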
[tip:x86/urgent] x86/hyperv: Fix kernel panic when kexec on HyperV
Commit-ID:  179fb36abb097976997f50733d5b122a29158cba
Gitweb:     https://git.kernel.org/tip/179fb36abb097976997f50733d5b122a29158cba
Author:     Kairui Song
AuthorDate: Wed, 6 Mar 2019 19:18:27 +0800
Committer:  Thomas Gleixner
CommitDate: Wed, 6 Mar 2019 23:27:44 +0100

x86/hyperv: Fix kernel panic when kexec on HyperV

After commit 68bb7bfb7985 ("X86/Hyper-V: Enable IPI enlightenments"),
kexec fails with a kernel panic:

 kexec_core: Starting new kernel
 BUG: unable to handle kernel NULL pointer dereference at
 Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v3.0 03/02/2018
 RIP: 0010:0xc901d000
 Call Trace:
  ? __send_ipi_mask+0x1c6/0x2d0
  ? hv_send_ipi_mask_allbutself+0x6d/0xb0
  ? mp_save_irq+0x70/0x70
  ? __ioapic_read_entry+0x32/0x50
  ? ioapic_read_entry+0x39/0x50
  ? clear_IO_APIC_pin+0xb8/0x110
  ? native_stop_other_cpus+0x6e/0x170
  ? native_machine_shutdown+0x22/0x40
  ? kernel_kexec+0x136/0x156

That happens if hypercall based IPIs are used, because the hypercall
page is reset very early upon kexec reboot, but kexec sends IPIs to stop
CPUs, which invokes the hypercall and dereferences the unusable page.

To fix this, reset hv_hypercall_pg to NULL before the page is reset to
avoid any misuse; IPI sending will fall back to the non hypercall based
method. This only happens on kexec / kdump, so just setting the pointer
to NULL is good enough.

Fixes: 68bb7bfb7985 ("X86/Hyper-V: Enable IPI enlightenments")
Signed-off-by: Kairui Song
Signed-off-by: Thomas Gleixner
Cc: "K. Y. Srinivasan"
Cc: Haiyang Zhang
Cc: Stephen Hemminger
Cc: Sasha Levin
Cc: Borislav Petkov
Cc: "H.
Peter Anvin"
Cc: Vitaly Kuznetsov
Cc: Dave Young
Cc: de...@linuxdriverproject.org
Link: https://lkml.kernel.org/r/20190306111827.14131-1-kas...@redhat.com
---
 arch/x86/hyperv/hv_init.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
index 7abb09e2eeb8..d3f42b6bbdac 100644
--- a/arch/x86/hyperv/hv_init.c
+++ b/arch/x86/hyperv/hv_init.c
@@ -406,6 +406,13 @@ void hyperv_cleanup(void)
 	/* Reset our OS id */
 	wrmsrl(HV_X64_MSR_GUEST_OS_ID, 0);
 
+	/*
+	 * Reset hypercall page reference before reset the page,
+	 * let hypercall operations fail safely rather than
+	 * panic the kernel for using invalid hypercall page
+	 */
+	hv_hypercall_pg = NULL;
+
 	/* Reset the hypercall page */
 	hypercall_msr.as_uint64 = 0;
 	wrmsrl(HV_X64_MSR_HYPERCALL, hypercall_msr.as_uint64);
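The ordering the fix relies on, i.e. clearing the function-pointer reference
before tearing down the resource it points at so that racing callers take a
fallback path, can be sketched as a userspace toy (all names here,
`hypercall`, `send_ipi`, `slow_fallback`, are invented for illustration and
do not correspond to real Hyper-V APIs):

```c
#include <assert.h>
#include <stddef.h>

static int (*hypercall)(int arg);

static int fast_hypercall(int arg) { return arg * 2; }
static int slow_fallback(int arg)  { return arg + 1000; }

static int send_ipi(int arg)
{
	int (*fn)(int) = hypercall;	/* snapshot the reference */

	if (fn)
		return fn(arg);		/* fast path via the "page" */
	return slow_fallback(arg);	/* non-hypercall method */
}

static void cleanup(void)
{
	/* Step 1: drop the reference so callers fail safely ... */
	hypercall = NULL;
	/* Step 2: ... only now would the real page be reset (wrmsrl). */
}
```

Doing the two steps in the opposite order reopens the window in which a
caller jumps through a pointer to an already-invalidated page, which is
exactly the kexec panic being fixed.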
[PATCH v4] x86/gart/kcore: Exclude GART aperture from kcore
On machines where the GART aperture is mapped over physical RAM,
/proc/kcore contains the GART aperture range and reading it may lead
to kernel panic.

Vmcore used to have the same issue, until we fixed it in commit
2a3e83c6f96c ("x86/gart: Exclude GART aperture from vmcore"),
leveraging the existing hook infrastructure in vmcore to let
/proc/vmcore return zeroes when attempting to read the aperture region,
so it won't read from the actual memory.

We apply the same workaround for kcore. First implement the same hook
infrastructure for kcore, then reuse the hook function introduced in
the previous vmcore fix, with some minor adjustments: rename some
functions for more general usage, and simplify the hook infrastructure
a bit as there is no module usage yet.

Suggested-by: Baoquan He
Signed-off-by: Kairui Song

---

Update from V3:
- Reuse the approach in V2, as Jiri noticed the V3 approach may fail
  some use cases. It introduces overlapped regions in kcore, and can't
  guarantee the read request will fall into the region we wanted.
- Improve some function naming suggested by Baoquan in V2.
- Simplify the hook registering and checking; we are not exporting the
  hook register function for now, no need to make it that complex.
- Simplify the commit message

Update from V2:
Instead of repeating the same hook infrastructure for kcore, introduce
a new kcore area type to avoid reading from, and let kcore always
bypass this kind of area.
Update from V1: Fix a complie error when CONFIG_PROC_KCORE is not set arch/x86/kernel/aperture_64.c | 20 +--- fs/proc/kcore.c | 32 include/linux/kcore.h | 3 +++ 3 files changed, 48 insertions(+), 7 deletions(-) diff --git a/arch/x86/kernel/aperture_64.c b/arch/x86/kernel/aperture_64.c index 58176b56354e..c1319567b441 100644 --- a/arch/x86/kernel/aperture_64.c +++ b/arch/x86/kernel/aperture_64.c @@ -14,6 +14,7 @@ #define pr_fmt(fmt) "AGP: " fmt #include +#include #include #include #include @@ -57,7 +58,7 @@ int fallback_aper_force __initdata; int fix_aperture __initdata = 1; -#ifdef CONFIG_PROC_VMCORE +#if defined(CONFIG_PROC_VMCORE) || defined(CONFIG_PROC_KCORE) /* * If the first kernel maps the aperture over e820 RAM, the kdump kernel will * use the same range because it will remain configured in the northbridge. @@ -66,20 +67,25 @@ int fix_aperture __initdata = 1; */ static unsigned long aperture_pfn_start, aperture_page_count; -static int gart_oldmem_pfn_is_ram(unsigned long pfn) +static int gart_mem_pfn_is_ram(unsigned long pfn) { return likely((pfn < aperture_pfn_start) || (pfn >= aperture_pfn_start + aperture_page_count)); } -static void exclude_from_vmcore(u64 aper_base, u32 aper_order) +static void exclude_from_core(u64 aper_base, u32 aper_order) { aperture_pfn_start = aper_base >> PAGE_SHIFT; aperture_page_count = (32 * 1024 * 1024) << aper_order >> PAGE_SHIFT; - WARN_ON(register_oldmem_pfn_is_ram(_oldmem_pfn_is_ram)); +#ifdef CONFIG_PROC_VMCORE + WARN_ON(register_oldmem_pfn_is_ram(_mem_pfn_is_ram)); +#endif +#ifdef CONFIG_PROC_KCORE + WARN_ON(register_mem_pfn_is_ram(_mem_pfn_is_ram)); +#endif } #else -static void exclude_from_vmcore(u64 aper_base, u32 aper_order) +static void exclude_from_core(u64 aper_base, u32 aper_order) { } #endif @@ -474,7 +480,7 @@ int __init gart_iommu_hole_init(void) * may have allocated the range over its e820 RAM * and fixed up the northbridge */ - exclude_from_vmcore(last_aper_base, last_aper_order); + 
exclude_from_core(last_aper_base, last_aper_order); return 1; } @@ -520,7 +526,7 @@ int __init gart_iommu_hole_init(void) * overlap with the first kernel's memory. We can't access the * range through vmcore even though it should be part of the dump. */ - exclude_from_vmcore(aper_alloc, aper_order); + exclude_from_core(aper_alloc, aper_order); /* Fix up the north bridges */ for (i = 0; i < amd_nb_bus_dev_ranges[i].dev_limit; i++) { diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c index bbcc185062bb..e51b324450d6 100644 --- a/fs/proc/kcore.c +++ b/fs/proc/kcore.c @@ -54,6 +54,33 @@ static LIST_HEAD(kclist_head); static DECLARE_RWSEM(kclist_lock); static int kcore_need_update = 1; +/* + * Returns > 0 for RAM pages, 0 for non-RAM pages, < 0 on error + * Same as oldmem_pfn_is_ram in vmcore + */ +static int (*mem_pfn_is_ram)(unsigned long pfn); + +int register_mem_pfn_is_ram(int (*fn)(unsigned long pfn)) +{ + if (mem_pfn_is_ram) + return -EBUSY; + mem_pfn_is_ram = fn; + return 0; +} + +void unregister_mem_pfn_is_ram(void) +{ + mem_pfn_is_ram
[PATCH v3] x86, hyperv: fix kernel panic when kexec on HyperV
After commit 68bb7bfb7985 ("X86/Hyper-V: Enable IPI enlightenments"), kexec will fail with a kernel panic like this: kexec_core: Starting new kernel BUG: unable to handle kernel NULL pointer dereference at PGD 800057995067 P4D 800057995067 PUD 57990067 PMD 0 Oops: 0002 [#1] SMP PTI CPU: 0 PID: 1016 Comm: kexec Not tainted 4.18.16-300.fc29.x86_64 #1 Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v3.0 03/02/2018 RIP: 0010:0xc901d000 Code: Bad RIP value. RSP: 0018:c9000495bcf0 EFLAGS: 00010046 RAX: RBX: c901d000 RCX: 00020015 RDX: 7f553000 RSI: RDI: c9000495bd28 RBP: 0002 R08: R09: 8238aaf8 R10: 8238aae0 R11: R12: 88007f553008 R13: 0001 R14: 8800ff553000 R15: FS: 7ff5c0e67b80() GS:880078e0() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: c901cfd6 CR3: 4f678006 CR4: 003606f0 Call Trace: ? __send_ipi_mask+0x1c6/0x2d0 ? hv_send_ipi_mask_allbutself+0x6d/0xb0 ? mp_save_irq+0x70/0x70 ? __ioapic_read_entry+0x32/0x50 ? ioapic_read_entry+0x39/0x50 ? clear_IO_APIC_pin+0xb8/0x110 ? native_stop_other_cpus+0x6e/0x170 ? native_machine_shutdown+0x22/0x40 ? kernel_kexec+0x136/0x156 ? __do_sys_reboot+0x1be/0x210 ? kmem_cache_free+0x1b1/0x1e0 ? __dentry_kill+0x10b/0x160 ? _cond_resched+0x15/0x30 ? dentry_kill+0x47/0x170 ? dput.part.34+0xc6/0x100 ? __fput+0x147/0x220 ? _cond_resched+0x15/0x30 ? task_work_run+0x38/0xa0 ? do_syscall_64+0x5b/0x160 ? 
entry_SYSCALL_64_after_hwframe+0x44/0xa9 Modules linked in: ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ebtable_nat ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat iptable_mangle iptable_raw iptable_security nf_conntrack ip_set nfnetlink ebtable_filter ebtables ip6table_filter ip6_tables sunrpc vfat fat crct10dif_pclmul crc32_pclmul ghash_clmulni_intel intel_rapl_perf hv_balloon joydev xfs libcrc32c hv_storvsc serio_raw scsi_transport_fc hv_netvsc hyperv_keyboard hyperv_fb hid_hyperv crc32c_intel hv_vmbus That's because now we may use hypercall for sending IPI, the hypercall page will be reset very early upon kexec reboot, but kexec will need to send IPI for stopping CPUs, and it will reference this no longer usable page, then kernel panics. To fix it, simply reset hv_hypercall_pg to NULL before the page is reset to avoid any misuse, IPI sending will fallback to use non hypercall based method. This only happens on kexec / kdump so setting to NULL should be good enough. Fixes: 68bb7bfb7985 ("X86/Hyper-V: Enable IPI enlightenments") Signed-off-by: Kairui Song --- Update from V2: - The memory barrier is not needed, remove it. Update from V1: - Add comment for the wmb call. arch/x86/hyperv/hv_init.c | 7 +++ 1 file changed, 7 insertions(+) diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c index 7abb09e2eeb8..d3f42b6bbdac 100644 --- a/arch/x86/hyperv/hv_init.c +++ b/arch/x86/hyperv/hv_init.c @@ -406,6 +406,13 @@ void hyperv_cleanup(void) /* Reset our OS id */ wrmsrl(HV_X64_MSR_GUEST_OS_ID, 0); + /* +* Reset hypercall page reference before reset the page, +* let hypercall operations fail safely rather than +* panic the kernel for using invalid hypercall page +*/ + hv_hypercall_pg = NULL; + /* Reset the hypercall page */ hypercall_msr.as_uint64 = 0; wrmsrl(HV_X64_MSR_HYPERCALL, hypercall_msr.as_uint64); -- 2.20.1
Re: [PATCH v2] x86/gart/kcore: Exclude GART aperture from kcore
On Tue, Feb 19, 2019 at 4:00 PM Kairui Song wrote: > > On Thu, Jan 24, 2019 at 10:17 AM Baoquan He wrote: > > > > On 01/23/19 at 10:50pm, Kairui Song wrote: > > > > > int fix_aperture __initdata = 1; > > > > > > > > > > -#ifdef CONFIG_PROC_VMCORE > > > > > +#if defined(CONFIG_PROC_VMCORE) || defined(CONFIG_PROC_KCORE) > > > > > /* > > > > > * If the first kernel maps the aperture over e820 RAM, the kdump > > > > > kernel will > > > > > * use the same range because it will remain configured in the > > > > > northbridge. > > > > > @@ -66,7 +67,7 @@ int fix_aperture __initdata = 1; > > > > > */ > > > > > static unsigned long aperture_pfn_start, aperture_page_count; > > > > > > > > > > -static int gart_oldmem_pfn_is_ram(unsigned long pfn) > > > > > +static int gart_mem_pfn_is_ram(unsigned long pfn) > > > > > { > > > > > return likely((pfn < aperture_pfn_start) || > > > > > (pfn >= aperture_pfn_start + > > > > > aperture_page_count)); > > > > > @@ -76,7 +77,12 @@ static void exclude_from_vmcore(u64 aper_base, u32 > > > > > aper_order) > > > > > > > > Shouldn't this function name be changed? It's not only handling vmcore > > > > stuff any more, but also kcore. And this function is not excluding, but > > > > resgistering. > > > > > > > > Other than this, it looks good to me. > > > > > > > > Thanks > > > > Baoquan > > > > > > > > > > Good suggestion, it's good to change this function name too to avoid > > > any misleading. This patch hasn't got any other reviews recently, I'll > > > update it shortly. > > > > There's more. > > > > These two are doing the same thing: > > register_mem_pfn_is_ram > > register_oldmem_pfn_is_ram > > > > Need remove one of them and put it in a right place. Furthermore, may > > need see if there's existing function which is used to register a > > function to a hook. > > > > Secondly, exclude_from_vmcore() is not excluding anthing, it's only > > registering a function which is used to judge if oldmem/pfn is ram. Need > > rename it. 
> >
> > Thanks
> > Baoquan

Hi Baoquan, on second thought, vmcore and kcore are doing similar
things but are still quite independent of each other; I didn't see any
simple way to share the logic.

As for the naming issue, I think considering the context there is no
problem: "exclude_from_vmcore(aper_alloc, aper_order)" is clearly doing
what it literally means, excluding the aperture from vmcore.

Let me know if anything is wrong; I will send V4 later reusing this
approach.

--
Best Regards,
Kairui Song
Re: [PATCH v3] x86/gart/kcore: Exclude GART aperture from kcore
On Fri, Mar 1, 2019 at 7:12 AM Jiri Bohac wrote:
>
> On Wed, Feb 13, 2019 at 04:28:00PM +0800, Kairui Song wrote:
> > @@ -465,6 +472,12 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
> > 			goto out;
> > 		}
> > 		m = NULL;	/* skip the list anchor */
> > +	} else if (m->type == KCORE_NORAM) {
> > +		/* for NORAM area just fill zero */
> > +		if (clear_user(buffer, tsz)) {
> > +			ret = -EFAULT;
> > +			goto out;
> > +		}
>
> I don't think this works reliably. The loop filling the buffer
> has this logic at the top:
>
> 	while (buflen) {
> 		/*
> 		 * If this is the first iteration or the address is not within
> 		 * the previous entry, search for a matching entry.
> 		 */
> 		if (!m || start < m->addr || start >= m->addr + m->size) {
> 			list_for_each_entry(m, &kclist_head, list) {
> 				if (start >= m->addr &&
> 				    start < m->addr + m->size)
> 					break;
> 			}
> 		}
>
> This sets m to the kclist entry that contains the memory being
> read. But if we do a large read that starts in valid KCORE_RAM
> memory below the GART overlap and extends into the overlap, m
> will not be set to the KCORE_NORAM entry. It will keep pointing
> to the KCORE_RAM entry and the patch will have no effect.
>
> But this seems already broken in existing cases as well, various
> KCORE_* types overlap with KCORE_RAM, don't they? So maybe
> bf991c2231117d50a7645792b514354fc8d19dae ("proc/kcore: optimize
> multiple page reads") broke this and once fixed, this KCORE_NORAM
> approach will work. Omar?
>

Thanks for the review! You are right: although I hid the NORAM region
from the ELF header, I didn't notice this potential risk of having
overlapped regions. I didn't see other kcore regions overlap for now;
if so, the optimization should be totally fine. Better to keep using a
hook, just like what we did in vmcore, or we would have a performance
drop for "fixing" this.

Will send V4 using the previous approach if there are no further
comments.

--
Best Regards,
Kairui Song
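The lookup behavior Jiri objects to can be modeled in a few lines of
userspace C (a toy, not kernel code; the region values in the test are
invented): the matched region is cached and reused as long as the read
offset stays inside it, so an overlapping NORAM region is never selected
for offsets an earlier RAM region already covers.

```c
#include <assert.h>
#include <stddef.h>

struct region {
	unsigned long addr, size;
	int type;	/* 0 = RAM, 1 = NORAM */
};

/*
 * Mimics the read_kcore() loop head: keep the previously matched
 * region if the offset is still inside it, otherwise search the list
 * in order and take the first match.
 */
static const struct region *lookup(const struct region *list, int n,
				   const struct region *cached,
				   unsigned long start)
{
	int i;

	if (cached && start >= cached->addr &&
	    start < cached->addr + cached->size)
		return cached;	/* reuse the previous match */
	for (i = 0; i < n; i++)
		if (start >= list[i].addr &&
		    start < list[i].addr + list[i].size)
			return &list[i];
	return NULL;
}
```

A large read starting in the RAM region and extending into the overlap
never switches to the NORAM entry, which is why the KCORE_NORAM approach
could not work on top of this lookup.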
Re: [PATCH v2] x86, hyperv: fix kernel panic when kexec on HyperV
On Tue, Mar 5, 2019 at 8:33 PM Peter Zijlstra wrote:
>
> On Tue, Mar 05, 2019 at 08:17:03PM +0800, Kairui Song wrote:
> > diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
> > index 7abb09e2eeb8..34aa1e953dfc 100644
> > --- a/arch/x86/hyperv/hv_init.c
> > +++ b/arch/x86/hyperv/hv_init.c
> > @@ -406,6 +406,12 @@ void hyperv_cleanup(void)
> > 	/* Reset our OS id */
> > 	wrmsrl(HV_X64_MSR_GUEST_OS_ID, 0);
> >
> > +	/* Cleanup hypercall page reference before reset the page */
> > +	hv_hypercall_pg = NULL;
> > +
> > +	/* Make sure page reference is cleared before wrmsr */
>
> This comment forgets to tell us who cares about this. And why the wrmsr
> itself isn't serializing enough.
>
> > +	wmb();
> > +
> > 	/* Reset the hypercall page */
> > 	hypercall_msr.as_uint64 = 0;
> > 	wrmsrl(HV_X64_MSR_HYPERCALL, hypercall_msr.as_uint64);
>
> That looks like a fake MSR; and you're telling me that VMEXIT doesn't
> serialize?

Thanks for the review, it seems I was being a bit paranoid about this.
Will drop it and send a v3 if no one has any other complaint.

--
Best Regards,
Kairui Song
Re: [RFC PATCH] x86, hyperv: fix kernel panic when kexec on HyperV VM
On Tue, Mar 5, 2019 at 8:28 PM Peter Zijlstra wrote: > > On Wed, Feb 27, 2019 at 10:55:46PM +0800, Kairui Song wrote: > > On Wed, Feb 27, 2019 at 8:02 PM Peter Zijlstra wrote: > > > > > > On Tue, Feb 26, 2019 at 11:56:15PM +0800, Kairui Song wrote: > > > > arch/x86/hyperv/hv_init.c | 4 > > > > 1 file changed, 4 insertions(+) > > > > > > > > diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c > > > > index 7abb09e2eeb8..92291c18d716 100644 > > > > --- a/arch/x86/hyperv/hv_init.c > > > > +++ b/arch/x86/hyperv/hv_init.c > > > > @@ -406,6 +406,10 @@ void hyperv_cleanup(void) > > > > /* Reset our OS id */ > > > > wrmsrl(HV_X64_MSR_GUEST_OS_ID, 0); > > > > > > > > + /* Cleanup page reference before reset the page */ > > > > + hv_hypercall_pg = NULL; > > > > + wmb(); > > > > > > What do we need that SFENCE for? Any why does it lack a comment? > > > > Hi, that's for ensuring the hv_hypercall_pg is reset to NULL before > > the following wrmsr call. The wrmsr call will make the pointer address > > invalid. > > WRMSR is a serializing instruction (except for TSC_DEADLINE and the > X2APIC). > Many thanks for the info, I'm not aware of the exception condition, V2 is sent, will drop the barrier in V3 then. -- Best Regards, Kairui Song
[PATCH v2] x86, hyperv: fix kernel panic when kexec on HyperV
After commit 68bb7bfb7985 ("X86/Hyper-V: Enable IPI enlightenments"), kexec will fail with a kernel panic like this:

kexec_core: Starting new kernel
BUG: unable to handle kernel NULL pointer dereference at
PGD 800057995067 P4D 800057995067 PUD 57990067 PMD 0
Oops: 0002 [#1] SMP PTI
CPU: 0 PID: 1016 Comm: kexec Not tainted 4.18.16-300.fc29.x86_64 #1
Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v3.0 03/02/2018
RIP: 0010:0xc901d000
Code: Bad RIP value.
RSP: 0018:c9000495bcf0 EFLAGS: 00010046
RAX: RBX: c901d000 RCX: 00020015
RDX: 7f553000 RSI: RDI: c9000495bd28
RBP: 0002 R08: R09: 8238aaf8
R10: 8238aae0 R11: R12: 88007f553008
R13: 0001 R14: 8800ff553000 R15:
FS: 7ff5c0e67b80() GS:880078e0() knlGS:
CS: 0010 DS: ES: CR0: 80050033
CR2: c901cfd6 CR3: 4f678006 CR4: 003606f0
Call Trace:
 ? __send_ipi_mask+0x1c6/0x2d0
 ? hv_send_ipi_mask_allbutself+0x6d/0xb0
 ? mp_save_irq+0x70/0x70
 ? __ioapic_read_entry+0x32/0x50
 ? ioapic_read_entry+0x39/0x50
 ? clear_IO_APIC_pin+0xb8/0x110
 ? native_stop_other_cpus+0x6e/0x170
 ? native_machine_shutdown+0x22/0x40
 ? kernel_kexec+0x136/0x156
 ? __do_sys_reboot+0x1be/0x210
 ? kmem_cache_free+0x1b1/0x1e0
 ? __dentry_kill+0x10b/0x160
 ? _cond_resched+0x15/0x30
 ? dentry_kill+0x47/0x170
 ? dput.part.34+0xc6/0x100
 ? __fput+0x147/0x220
 ? _cond_resched+0x15/0x30
 ? task_work_run+0x38/0xa0
 ? do_syscall_64+0x5b/0x160
 ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
Modules linked in: ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ebtable_nat ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat iptable_mangle iptable_raw iptable_security nf_conntrack ip_set nfnetlink ebtable_filter ebtables ip6table_filter ip6_tables sunrpc vfat fat crct10dif_pclmul crc32_pclmul ghash_clmulni_intel intel_rapl_perf hv_balloon joydev xfs libcrc32c hv_storvsc serio_raw scsi_transport_fc hv_netvsc hyperv_keyboard hyperv_fb hid_hyperv crc32c_intel hv_vmbus

That's because hypercalls may now be used for sending IPIs. The hypercall page is reset very early upon kexec reboot, but kexec still needs to send IPIs to stop CPUs, and it will reference this no-longer-usable page and panic the kernel. To fix it, simply reset hv_hypercall_pg to NULL before the page is reset, to avoid any misuse; IPI sending will fall back to the non-hypercall-based method. This only happens on kexec / kdump, so setting it to NULL should be good enough.

Fixes: 68bb7bfb7985 ("X86/Hyper-V: Enable IPI enlightenments")
Signed-off-by: Kairui Song
---
Update from V1:
- Add comment for the barrier.

 arch/x86/hyperv/hv_init.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
index 7abb09e2eeb8..34aa1e953dfc 100644
--- a/arch/x86/hyperv/hv_init.c
+++ b/arch/x86/hyperv/hv_init.c
@@ -406,6 +406,12 @@ void hyperv_cleanup(void)
 	/* Reset our OS id */
 	wrmsrl(HV_X64_MSR_GUEST_OS_ID, 0);
 
+	/* Cleanup hypercall page reference before reset the page */
+	hv_hypercall_pg = NULL;
+
+	/* Make sure page reference is cleared before wrmsr */
+	wmb();
+
 	/* Reset the hypercall page */
 	hypercall_msr.as_uint64 = 0;
 	wrmsrl(HV_X64_MSR_HYPERCALL, hypercall_msr.as_uint64);
-- 
2.20.1
Re: [RFC PATCH] x86, hyperv: fix kernel panic when kexec on HyperV VM
On Wed, Feb 27, 2019 at 8:02 PM Peter Zijlstra wrote:
> On Tue, Feb 26, 2019 at 11:56:15PM +0800, Kairui Song wrote:
> >  arch/x86/hyperv/hv_init.c | 4 ++++
> >  1 file changed, 4 insertions(+)
> >
> > diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
> > index 7abb09e2eeb8..92291c18d716 100644
> > --- a/arch/x86/hyperv/hv_init.c
> > +++ b/arch/x86/hyperv/hv_init.c
> > @@ -406,6 +406,10 @@ void hyperv_cleanup(void)
> >  	/* Reset our OS id */
> >  	wrmsrl(HV_X64_MSR_GUEST_OS_ID, 0);
> >
> > +	/* Cleanup page reference before reset the page */
> > +	hv_hypercall_pg = NULL;
> > +	wmb();
>
> What do we need that SFENCE for? And why does it lack a comment?

Hi, that's for ensuring the hv_hypercall_pg is reset to NULL before the following wrmsr call. The wrmsr call will make the pointer address invalid. I can add a comment in V2 if this is OK.

--
Best Regards,
Kairui Song
[RFC PATCH] x86, hyperv: fix kernel panic when kexec on HyperV VM
When hypercalls are used for sending IPIs, kexec will fail with a kernel panic like this:

kexec_core: Starting new kernel
BUG: unable to handle kernel NULL pointer dereference at
PGD 800057995067 P4D 800057995067 PUD 57990067 PMD 0
Oops: 0002 [#1] SMP PTI
CPU: 0 PID: 1016 Comm: kexec Not tainted 4.18.16-300.fc29.x86_64 #1
Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v3.0 03/02/2018
RIP: 0010:0xc901d000
Code: Bad RIP value.
RSP: 0018:c9000495bcf0 EFLAGS: 00010046
RAX: RBX: c901d000 RCX: 00020015
RDX: 7f553000 RSI: RDI: c9000495bd28
RBP: 0002 R08: R09: 8238aaf8
R10: 8238aae0 R11: R12: 88007f553008
R13: 0001 R14: 8800ff553000 R15:
FS: 7ff5c0e67b80() GS:880078e0() knlGS:
CS: 0010 DS: ES: CR0: 80050033
CR2: c901cfd6 CR3: 4f678006 CR4: 003606f0
Call Trace:
 ? __send_ipi_mask+0x1c6/0x2d0
 ? hv_send_ipi_mask_allbutself+0x6d/0xb0
 ? mp_save_irq+0x70/0x70
 ? __ioapic_read_entry+0x32/0x50
 ? ioapic_read_entry+0x39/0x50
 ? clear_IO_APIC_pin+0xb8/0x110
 ? native_stop_other_cpus+0x6e/0x170
 ? native_machine_shutdown+0x22/0x40
 ? kernel_kexec+0x136/0x156
 ? __do_sys_reboot+0x1be/0x210
 ? kmem_cache_free+0x1b1/0x1e0
 ? __dentry_kill+0x10b/0x160
 ? _cond_resched+0x15/0x30
 ? dentry_kill+0x47/0x170
 ? dput.part.34+0xc6/0x100
 ? __fput+0x147/0x220
 ? _cond_resched+0x15/0x30
 ? task_work_run+0x38/0xa0
 ? do_syscall_64+0x5b/0x160
 ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
Modules linked in: ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ebtable_nat ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat iptable_mangle iptable_raw iptable_security nf_conntrack ip_set nfnetlink ebtable_filter ebtables ip6table_filter ip6_tables sunrpc vfat fat crct10dif_pclmul crc32_pclmul ghash_clmulni_intel intel_rapl_perf hv_balloon joydev xfs libcrc32c hv_storvsc serio_raw scsi_transport_fc hv_netvsc hyperv_keyboard hyperv_fb hid_hyperv crc32c_intel hv_vmbus

That's because HyperV's machine_ops.shutdown allows registering a hook to be called upon shutdown, and hv_vmbus uses this hook to invalidate the hypercall page. But hv_hypercall_pg still points to this invalidated page, so any hypercall-based operation will panic the kernel, and the kexec process sends IPIs to stop CPUs. Fix this by simply resetting hv_hypercall_pg to NULL when the page is revoked, to avoid any misuse; IPI sending will fall back to the non-hypercall-based method. This only happens on kexec / kdump, so setting it to NULL should be good enough.

Fixes: 68bb7bfb7985 ("X86/Hyper-V: Enable IPI enlightenments")
Signed-off-by: Kairui Song
---
I'm not sure about the details of what happens after the wrmsrl(HV_X64_MSR_HYPERCALL, hypercall_msr.as_uint64); but this fix should be valid. Please let me know if I got anything wrong, thanks.

 arch/x86/hyperv/hv_init.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
index 7abb09e2eeb8..92291c18d716 100644
--- a/arch/x86/hyperv/hv_init.c
+++ b/arch/x86/hyperv/hv_init.c
@@ -406,6 +406,10 @@ void hyperv_cleanup(void)
 	/* Reset our OS id */
 	wrmsrl(HV_X64_MSR_GUEST_OS_ID, 0);
 
+	/* Cleanup page reference before reset the page */
+	hv_hypercall_pg = NULL;
+	wmb();
+
 	/* Reset the hypercall page */
 	hypercall_msr.as_uint64 = 0;
 	wrmsrl(HV_X64_MSR_HYPERCALL, hypercall_msr.as_uint64);
-- 
2.20.1
Re: [PATCH v3] x86/gart/kcore: Exclude GART aperture from kcore
On Wed, Feb 13, 2019 at 4:28 PM Kairui Song wrote:
>
> On machines where the GART aperture is mapped over physical RAM,
> /proc/kcore contains the GART aperture range and reading it may lead
> to a kernel panic.
>
> In commit 2a3e83c6f96c ("x86/gart: Exclude GART aperture from vmcore"),
> a workaround was applied for vmcore to let /proc/vmcore return zeroes
> when attempting to read the GART region, as vmcore has the same issue,
> and after commit 707d4eefbdb3 ("Revert "[PATCH] Insert GART region
> into resource map"") userspace tools won't be able to detect the GART
> region, so we have to avoid it being read in the kernel.
>
> This patch applies a similar workaround for kcore: let /proc/kcore
> return zeroes for the GART aperture.
>
> Both vmcore and kcore maintain a memory mapping list. In the vmcore
> workaround we exclude the GART region by registering a hook that checks
> if a PFN is valid before reading, because vmcore's memory mapping could
> be generated by userspace, which doesn't know about GART. But for kcore
> it is simpler to just alter the memory area list; kcore's area list is
> always generated by the kernel on init.
>
> Kcore's memory area list is generated very late, so we can't exclude
> the overlapped area when GART is initialized. Instead, simply introduce
> a new area enum type KCORE_NORAM, register the GART aperture as
> KCORE_NORAM, and let kcore return zeroes for all KCORE_NORAM areas.
> This fixes the problem well with minor code changes.
>
> ---
> Update from V2:
> Instead of repeating the same hook infrastructure for kcore, introduce
> a new kcore area type to avoid reading from, and let kcore always
> bypass this kind of area.
>
> Update from V1:
> Fix a compile error when CONFIG_PROC_KCORE is not set
>
>  arch/x86/kernel/aperture_64.c | 14 ++++++++++++++
>  fs/proc/kcore.c               | 13 +++++++++++++
>  include/linux/kcore.h         |  1 +
>  3 files changed, 28 insertions(+)
>
> diff --git a/arch/x86/kernel/aperture_64.c b/arch/x86/kernel/aperture_64.c
> index 58176b56354e..5fb04bdd3221 100644
> --- a/arch/x86/kernel/aperture_64.c
> +++ b/arch/x86/kernel/aperture_64.c
> @@ -31,6 +31,7 @@
>  #include
>  #include
>  #include
> +#include <linux/kcore.h>
>
>  /*
>   * Using 512M as goal, in case kexec will load kernel_big
> @@ -84,6 +85,17 @@ static void exclude_from_vmcore(u64 aper_base, u32 aper_order)
>  }
>  #endif
>
> +#ifdef CONFIG_PROC_KCORE
> +static struct kcore_list kcore_gart;
> +
> +static void __init exclude_from_kcore(u64 aper_base, u32 aper_order) {
> +	u32 aper_size = (32 * 1024 * 1024) << aper_order;
> +	kclist_add(&kcore_gart, __va(aper_base), aper_size, KCORE_NORAM);
> +}
> +#else
> +static inline void __init exclude_from_kcore(u64 aper_base, u32 aper_order) { }
> +#endif
> +
>  /* This code runs before the PCI subsystem is initialized, so just
>     access the northbridge directly. */
>
> @@ -475,6 +487,7 @@ int __init gart_iommu_hole_init(void)
>  	 * and fixed up the northbridge
>  	 */
>  	exclude_from_vmcore(last_aper_base, last_aper_order);
> +	exclude_from_kcore(last_aper_base, last_aper_order);
>
>  	return 1;
>  }
> @@ -521,6 +534,7 @@ int __init gart_iommu_hole_init(void)
>  	 * range through vmcore even though it should be part of the dump.
>  	 */
>  	exclude_from_vmcore(aper_alloc, aper_order);
> +	exclude_from_kcore(aper_alloc, aper_order);
>
>  	/* Fix up the north bridges */
>  	for (i = 0; i < amd_nb_bus_dev_ranges[i].dev_limit; i++) {
> diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c
> index bbcc185062bb..15e0d74d7c56 100644
> --- a/fs/proc/kcore.c
> +++ b/fs/proc/kcore.c
> @@ -75,6 +75,8 @@ static size_t get_kcore_size(int *nphdr, size_t *phdrs_len, size_t *notes_len,
>  	size = 0;
>
>  	list_for_each_entry(m, &kclist_head, list) {
> +		if (m->type == KCORE_NORAM)
> +			continue;
>  		try = kc_vaddr_to_offset((size_t)m->addr + m->size);
>  		if (try > size)
>  			size = try;
> @@ -256,6 +258,9 @@ static int kcore_update_ram(void)
>  	list_for_each_entry_safe(pos, tmp, &kclist_head, list) {
>  		if (pos->type == KCORE_RAM || pos->type == KCORE_VMEMMAP)
>  			list_move(&pos->list, &garbage);
> +		/* Move NORAM area to head of the new list */
> +		if (pos->type == KCORE_NORAM)
> +			list_move(&pos->list, &list);
>  	}
>  	list_splice_tail(&list, &kclist_head);
Re: [PATCH v2] x86/gart/kcore: Exclude GART aperture from kcore
On Thu, Jan 24, 2019 at 10:17 AM Baoquan He wrote:
> On 01/23/19 at 10:50pm, Kairui Song wrote:
> > > >  int fix_aperture __initdata = 1;
> > > >
> > > > -#ifdef CONFIG_PROC_VMCORE
> > > > +#if defined(CONFIG_PROC_VMCORE) || defined(CONFIG_PROC_KCORE)
> > > >  /*
> > > >   * If the first kernel maps the aperture over e820 RAM, the kdump kernel will
> > > >   * use the same range because it will remain configured in the northbridge.
> > > > @@ -66,7 +67,7 @@ int fix_aperture __initdata = 1;
> > > >   */
> > > >  static unsigned long aperture_pfn_start, aperture_page_count;
> > > >
> > > > -static int gart_oldmem_pfn_is_ram(unsigned long pfn)
> > > > +static int gart_mem_pfn_is_ram(unsigned long pfn)
> > > >  {
> > > >  	return likely((pfn < aperture_pfn_start) ||
> > > >  		(pfn >= aperture_pfn_start + aperture_page_count));
> > > > @@ -76,7 +77,12 @@ static void exclude_from_vmcore(u64 aper_base, u32 aper_order)
> > >
> > > Shouldn't this function name be changed? It's not only handling vmcore
> > > stuff any more, but also kcore. And this function is not excluding, but
> > > registering.
> > >
> > > Other than this, it looks good to me.
> > >
> > > Thanks
> > > Baoquan
> >
> > Good suggestion, it's good to change this function name too to avoid
> > any confusion. This patch hasn't got any other reviews recently, I'll
> > update it shortly.
>
> There's more.
>
> These two are doing the same thing:
> register_mem_pfn_is_ram
> register_oldmem_pfn_is_ram
>
> Need to remove one of them and put it in the right place. Furthermore,
> we may need to see if there's an existing function which is used to
> register a function to a hook.
>
> Secondly, exclude_from_vmcore() is not excluding anything, it's only
> registering a function which is used to judge if oldmem/pfn is ram.
> Need to rename it.
>
> Thanks
> Baoquan

Thanks a lot for the review! I've sent V3, using a different approach. It's true that repeating the hook infrastructure causes duplication, but I see vmcore/kcore don't share much code, so instead of sharing a common hook infrastructure / registering entry, I used a new kcore memory mapping list enum type to fix it; it also introduces less code. Please have a look at V3 and let me know what you think, thanks!

--
Best Regards,
Kairui Song
[PATCH v3] x86/gart/kcore: Exclude GART aperture from kcore
On machines where the GART aperture is mapped over physical RAM, /proc/kcore contains the GART aperture range and reading it may lead to a kernel panic.

In commit 2a3e83c6f96c ("x86/gart: Exclude GART aperture from vmcore"), a workaround was applied for vmcore to let /proc/vmcore return zeroes when attempting to read the GART region, as vmcore has the same issue. And after commit 707d4eefbdb3 ("Revert "[PATCH] Insert GART region into resource map"") userspace tools won't be able to detect the GART region, so we have to avoid it being read in the kernel.

This patch applies a similar workaround for kcore: let /proc/kcore return zeroes for the GART aperture.

Both vmcore and kcore maintain a memory mapping list. In the vmcore workaround we exclude the GART region by registering a hook that checks if a PFN is valid before reading, because vmcore's memory mapping could be generated by userspace, which doesn't know about GART. But for kcore it is simpler to just alter the memory area list; kcore's area list is always generated by the kernel on init.

Kcore's memory area list is generated very late, so we can't exclude the overlapped area when GART is initialized. Instead, simply introduce a new area enum type KCORE_NORAM, register the GART aperture as KCORE_NORAM, and let kcore return zeroes for all KCORE_NORAM areas. This fixes the problem well with minor code changes.

---
Update from V2:
Instead of repeating the same hook infrastructure for kcore, introduce a new kcore area type to avoid reading from, and let kcore always bypass this kind of area.

Update from V1:
Fix a compile error when CONFIG_PROC_KCORE is not set

 arch/x86/kernel/aperture_64.c | 14 ++++++++++++++
 fs/proc/kcore.c               | 13 +++++++++++++
 include/linux/kcore.h         |  1 +
 3 files changed, 28 insertions(+)

diff --git a/arch/x86/kernel/aperture_64.c b/arch/x86/kernel/aperture_64.c
index 58176b56354e..5fb04bdd3221 100644
--- a/arch/x86/kernel/aperture_64.c
+++ b/arch/x86/kernel/aperture_64.c
@@ -31,6 +31,7 @@
 #include
 #include
 #include
+#include <linux/kcore.h>

 /*
  * Using 512M as goal, in case kexec will load kernel_big
@@ -84,6 +85,17 @@ static void exclude_from_vmcore(u64 aper_base, u32 aper_order)
 }
 #endif
 
+#ifdef CONFIG_PROC_KCORE
+static struct kcore_list kcore_gart;
+
+static void __init exclude_from_kcore(u64 aper_base, u32 aper_order) {
+	u32 aper_size = (32 * 1024 * 1024) << aper_order;
+	kclist_add(&kcore_gart, __va(aper_base), aper_size, KCORE_NORAM);
+}
+#else
+static inline void __init exclude_from_kcore(u64 aper_base, u32 aper_order) { }
+#endif
+
 /* This code runs before the PCI subsystem is initialized, so just
    access the northbridge directly. */
 
@@ -475,6 +487,7 @@ int __init gart_iommu_hole_init(void)
 	 * and fixed up the northbridge
 	 */
 	exclude_from_vmcore(last_aper_base, last_aper_order);
+	exclude_from_kcore(last_aper_base, last_aper_order);
 
 	return 1;
 }
@@ -521,6 +534,7 @@ int __init gart_iommu_hole_init(void)
 	 * range through vmcore even though it should be part of the dump.
 	 */
 	exclude_from_vmcore(aper_alloc, aper_order);
+	exclude_from_kcore(aper_alloc, aper_order);
 
 	/* Fix up the north bridges */
 	for (i = 0; i < amd_nb_bus_dev_ranges[i].dev_limit; i++) {
diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c
index bbcc185062bb..15e0d74d7c56 100644
--- a/fs/proc/kcore.c
+++ b/fs/proc/kcore.c
@@ -75,6 +75,8 @@ static size_t get_kcore_size(int *nphdr, size_t *phdrs_len, size_t *notes_len,
 	size = 0;
 
 	list_for_each_entry(m, &kclist_head, list) {
+		if (m->type == KCORE_NORAM)
+			continue;
 		try = kc_vaddr_to_offset((size_t)m->addr + m->size);
 		if (try > size)
 			size = try;
@@ -256,6 +258,9 @@ static int kcore_update_ram(void)
 	list_for_each_entry_safe(pos, tmp, &kclist_head, list) {
 		if (pos->type == KCORE_RAM || pos->type == KCORE_VMEMMAP)
 			list_move(&pos->list, &garbage);
+		/* Move NORAM area to head of the new list */
+		if (pos->type == KCORE_NORAM)
+			list_move(&pos->list, &list);
 	}
 	list_splice_tail(&list, &kclist_head);
@@ -356,6 +361,8 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
 	phdr = &phdrs[1];
 	list_for_each_entry(m, &kclist_head, list) {
+		if (m->type == KCORE_NORAM)
+			continue;
 		phdr->p_type = PT_LOAD;
 		phdr->p_flags = PF_R | PF_W | PF_X;
 		phdr->p_offset = kc_vaddr_to_offset(m->addr) + data_offset;
@@ -465,6 +472,12 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
 			goto out;
 		}
[tip:x86/boot] x86/kexec: Fill in acpi_rsdp_addr from the first kernel
Commit-ID: ccec81e4251f5a5421e02874e394338a897056ca
Gitweb: https://git.kernel.org/tip/ccec81e4251f5a5421e02874e394338a897056ca
Author: Kairui Song
AuthorDate: Tue, 5 Feb 2019 01:38:52 +0800
Committer: Borislav Petkov
CommitDate: Wed, 6 Feb 2019 15:29:03 +0100

x86/kexec: Fill in acpi_rsdp_addr from the first kernel

When efi=noruntime or efi=oldmap is used on the kernel command line, EFI services won't be available in the second kernel, therefore the second kernel will not be able to get the ACPI RSDP address from firmware by calling EFI services and so it won't boot.

Commit e6e094e053af ("x86/acpi, x86/boot: Take RSDP address from boot params if available") added an acpi_rsdp_addr field to boot_params which stores the RSDP address for other kernel users.

Recently, after 3a63f70bf4c3 ("x86/boot: Early parse RSDP and save it in boot_params") the acpi_rsdp_addr will always be filled with a valid RSDP address. So fill in that value into the second kernel's boot_params thus ensuring that the second kernel receives the RSDP value from the first kernel.

[ bp: massage commit message. ]

Signed-off-by: Kairui Song
Signed-off-by: Borislav Petkov
Cc: AKASHI Takahiro
Cc: Andrew Morton
Cc: Baoquan He
Cc: Chao Fan
Cc: Dave Young
Cc: David Howells
Cc: "H. Peter Anvin"
Cc: Ingo Molnar
Cc: ke...@lists.infradead.org
Cc: Philipp Rudo
Cc: Thomas Gleixner
Cc: x86-ml
Cc: Yannik Sembritzki
Link: https://lkml.kernel.org/r/20190204173852.4863-1-kas...@redhat.com
---
 arch/x86/kernel/kexec-bzimage64.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/kexec-bzimage64.c b/arch/x86/kernel/kexec-bzimage64.c
index 0d5efa34f359..2a0ff871025a 100644
--- a/arch/x86/kernel/kexec-bzimage64.c
+++ b/arch/x86/kernel/kexec-bzimage64.c
@@ -215,6 +215,9 @@ setup_boot_parameters(struct kimage *image, struct boot_params *params,
 	params->screen_info.ext_mem_k = 0;
 	params->alt_mem_k = 0;
 
+	/* Always fill in RSDP: it is either 0 or a valid value */
+	params->acpi_rsdp_addr = boot_params.acpi_rsdp_addr;
+
 	/* Default APM info */
 	memset(&params->apm_bios_info, 0, sizeof(params->apm_bios_info));
 
@@ -253,7 +256,6 @@ setup_boot_parameters(struct kimage *image, struct boot_params *params,
 	setup_efi_state(params, params_load_addr, efi_map_offset, efi_map_sz,
 			efi_setup_data_offset);
 #endif
-
 	/* Setup EDD info */
 	memcpy(params->eddbuf, boot_params.eddbuf, EDDMAXNR * sizeof(struct edd_info));
[PATCH] x86, kexec_file_load: fill in acpi_rsdp_addr boot param unconditionally
When efi=noruntime or efi=oldmap is used, EFI services won't be available in the second kernel, therefore the second kernel will not be able to get the ACPI RSDP address from firmware by calling EFI services, so it won't boot. Previously we expected the user to set acpi_rsdp= on the second kernel's command line, as there was no other way to pass the RSDP address to the second kernel.

After commit e6e094e053af ("x86/acpi, x86/boot: Take RSDP address from boot params if available"), it's now possible to set an acpi_rsdp_addr parameter in the boot_params passed to the second kernel, and the kernel will prefer this value for the RSDP address when it's set. And with commit 3a63f70bf4c3 ("x86/boot: Early parse RSDP and save it in boot_params"), acpi_rsdp_addr will now always be filled with a valid RSDP address.

So just fill in that value for the second kernel's boot_params unconditionally; this ensures the second kernel always uses the same RSDP value as the first kernel.

Tested with an EFI-enabled KVM VM with efi=noruntime.

Signed-off-by: Kairui Song
---
This is an update of part of the patch series "[PATCH v3 0/3] make kexec work with efi=noruntime or efi=old_map." But "[PATCH v3 1/3] x86, kexec_file_load: Don't setup EFI info if EFI runtime is not enabled" is already in [tip:x86/urgent], and with Chao's commit 3a63f70bf4c3 in [tip:x86/boot], we can just fill in the acpi_rsdp_addr boot param unconditionally to fix the problem, so I only update and resend this patch.

 arch/x86/kernel/kexec-bzimage64.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/kexec-bzimage64.c b/arch/x86/kernel/kexec-bzimage64.c
index 53917a3ebf94..3611946dc7ea 100644
--- a/arch/x86/kernel/kexec-bzimage64.c
+++ b/arch/x86/kernel/kexec-bzimage64.c
@@ -218,6 +218,9 @@ setup_boot_parameters(struct kimage *image, struct boot_params *params,
 	params->screen_info.ext_mem_k = 0;
 	params->alt_mem_k = 0;
 
+	/* Always fill in RSDP, it's either 0 or a valid value */
+	params->acpi_rsdp_addr = boot_params.acpi_rsdp_addr;
+
 	/* Default APM info */
 	memset(&params->apm_bios_info, 0, sizeof(params->apm_bios_info));
 
@@ -256,7 +259,6 @@ setup_boot_parameters(struct kimage *image, struct boot_params *params,
 	setup_efi_state(params, params_load_addr, efi_map_offset, efi_map_sz,
 			efi_setup_data_offset);
 #endif
-
 	/* Setup EDD info */
 	memcpy(params->eddbuf, boot_params.eddbuf, EDDMAXNR * sizeof(struct edd_info));
-- 
2.20.1
[PATCH] integrity, KEYS: Fix build break with set_platform_trusted_keys
Commit 15ebb2eb0705 ("integrity, KEYS: add a reference to platform keyring") introduced set_platform_trusted_keys() and calls it in __integrity_init_keyring(). It only checks whether CONFIG_INTEGRITY_PLATFORM_KEYRING is enabled before enabling this function, but the function actually also depends on CONFIG_SYSTEM_TRUSTED_KEYRING. So when built with CONFIG_INTEGRITY_PLATFORM_KEYRING && !CONFIG_SYSTEM_TRUSTED_KEYRING, we will get the following error:

digsig.c:92: undefined reference to `set_platform_trusted_keys'

And it also mistakenly wrapped the function code in the ifdef block of CONFIG_SYSTEM_DATA_VERIFICATION.

This commit fixes the issue by adding the missing check of CONFIG_SYSTEM_TRUSTED_KEYRING and moving the function code out of CONFIG_SYSTEM_DATA_VERIFICATION's ifdef block.

Fixes: 15ebb2eb0705 ("integrity, KEYS: add a reference to platform keyring")
Signed-off-by: Kairui Song
---
 certs/system_keyring.c        | 4 ++--
 include/keys/system_keyring.h | 9 +++------
 2 files changed, 5 insertions(+), 8 deletions(-)

diff --git a/certs/system_keyring.c b/certs/system_keyring.c
index 19bd0504bbcb..c05c29ae4d5d 100644
--- a/certs/system_keyring.c
+++ b/certs/system_keyring.c
@@ -279,11 +279,11 @@ int verify_pkcs7_signature(const void *data, size_t len,
 }
 EXPORT_SYMBOL_GPL(verify_pkcs7_signature);
 
+#endif /* CONFIG_SYSTEM_DATA_VERIFICATION */
+
 #ifdef CONFIG_INTEGRITY_PLATFORM_KEYRING
 void __init set_platform_trusted_keys(struct key *keyring)
 {
 	platform_trusted_keys = keyring;
 }
 #endif
-
-#endif /* CONFIG_SYSTEM_DATA_VERIFICATION */
diff --git a/include/keys/system_keyring.h b/include/keys/system_keyring.h
index c7f899ee974e..42a93eda331c 100644
--- a/include/keys/system_keyring.h
+++ b/include/keys/system_keyring.h
@@ -61,16 +61,13 @@ static inline struct key *get_ima_blacklist_keyring(void)
 }
 #endif /* CONFIG_IMA_BLACKLIST_KEYRING */
 
-#ifdef CONFIG_INTEGRITY_PLATFORM_KEYRING
-
+#if defined(CONFIG_INTEGRITY_PLATFORM_KEYRING) && \
+    defined(CONFIG_SYSTEM_TRUSTED_KEYRING)
 extern void __init set_platform_trusted_keys(struct key *keyring);
-
 #else
-
 static inline void set_platform_trusted_keys(struct key *keyring)
 {
 }
-
-#endif /* CONFIG_INTEGRITY_PLATFORM_KEYRING */
+#endif
 
 #endif /* _KEYS_SYSTEM_KEYRING_H */
-- 
2.20.1
[tip:x86/urgent] x86/kexec: Don't setup EFI info if EFI runtime is not enabled
Commit-ID: 2aa958c99c7fd3162b089a1a56a34a0cdb778de1
Gitweb: https://git.kernel.org/tip/2aa958c99c7fd3162b089a1a56a34a0cdb778de1
Author: Kairui Song
AuthorDate: Fri, 18 Jan 2019 19:13:08 +0800
Committer: Borislav Petkov
CommitDate: Fri, 1 Feb 2019 18:18:54 +0100

x86/kexec: Don't setup EFI info if EFI runtime is not enabled

Kexec-ing a kernel with "efi=noruntime" on the first kernel's command line causes the following null pointer dereference:

BUG: unable to handle kernel NULL pointer dereference at
#PF error: [normal kernel read fault]
Call Trace:
 efi_runtime_map_copy+0x28/0x30
 bzImage64_load+0x688/0x872
 arch_kexec_kernel_image_load+0x6d/0x70
 kimage_file_alloc_init+0x13e/0x220
 __x64_sys_kexec_file_load+0x144/0x290
 do_syscall_64+0x55/0x1a0
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Just skip the EFI info setup if EFI runtime services are not enabled.

[ bp: Massage commit message. ]

Suggested-by: Dave Young
Signed-off-by: Kairui Song
Signed-off-by: Borislav Petkov
Acked-by: Dave Young
Cc: AKASHI Takahiro
Cc: Andrew Morton
Cc: Ard Biesheuvel
Cc: b...@redhat.com
Cc: David Howells
Cc: erik.schma...@intel.com
Cc: fanc.f...@cn.fujitsu.com
Cc: "H. Peter Anvin"
Cc: Ingo Molnar
Cc: ke...@lists.infradead.org
Cc: l...@kernel.org
Cc: linux-a...@vger.kernel.org
Cc: Philipp Rudo
Cc: rafael.j.wyso...@intel.com
Cc: robert.mo...@intel.com
Cc: Thomas Gleixner
Cc: x86-ml
Cc: Yannik Sembritzki
Link: https://lkml.kernel.org/r/20190118111310.29589-2-kas...@redhat.com
---
 arch/x86/kernel/kexec-bzimage64.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/x86/kernel/kexec-bzimage64.c b/arch/x86/kernel/kexec-bzimage64.c
index 0d5efa34f359..53917a3ebf94 100644
--- a/arch/x86/kernel/kexec-bzimage64.c
+++ b/arch/x86/kernel/kexec-bzimage64.c
@@ -167,6 +167,9 @@ setup_efi_state(struct boot_params *params, unsigned long params_load_addr,
 	struct efi_info *current_ei = &boot_params.efi_info;
 	struct efi_info *ei = &params->efi_info;
 
+	if (!efi_enabled(EFI_RUNTIME_SERVICES))
+		return 0;
+
 	if (!current_ei->efi_memmap_size)
 		return 0;
Re: [PATCH v2] x86/gart/kcore: Exclude GART aperture from kcore
On Wed, Jan 23, 2019 at 10:14 PM Baoquan He wrote:
> On 01/02/19 at 06:54pm, Kairui Song wrote:
> > diff --git a/arch/x86/kernel/aperture_64.c b/arch/x86/kernel/aperture_64.c
> > index 58176b56354e..c8a56f083419 100644
> > --- a/arch/x86/kernel/aperture_64.c
> > +++ b/arch/x86/kernel/aperture_64.c
> > @@ -14,6 +14,7 @@
> >  #define pr_fmt(fmt) "AGP: " fmt
> >
> >  #include
> > +#include
> >  #include
> >  #include
> >  #include
> > @@ -57,7 +58,7 @@ int fallback_aper_force __initdata;
> >
> >  int fix_aperture __initdata = 1;
> >
> > -#ifdef CONFIG_PROC_VMCORE
> > +#if defined(CONFIG_PROC_VMCORE) || defined(CONFIG_PROC_KCORE)
> >  /*
> >   * If the first kernel maps the aperture over e820 RAM, the kdump kernel will
> >   * use the same range because it will remain configured in the northbridge.
> > @@ -66,7 +67,7 @@ int fix_aperture __initdata = 1;
> >   */
> >  static unsigned long aperture_pfn_start, aperture_page_count;
> >
> > -static int gart_oldmem_pfn_is_ram(unsigned long pfn)
> > +static int gart_mem_pfn_is_ram(unsigned long pfn)
> >  {
> >  	return likely((pfn < aperture_pfn_start) ||
> >  		(pfn >= aperture_pfn_start + aperture_page_count));
> > @@ -76,7 +77,12 @@ static void exclude_from_vmcore(u64 aper_base, u32 aper_order)
>
> Shouldn't this function name be changed? It's not only handling vmcore
> stuff any more, but also kcore. And this function is not excluding, but
> registering.
>
> Other than this, it looks good to me.
>
> Thanks
> Baoquan

Good suggestion, it's good to change this function name too to avoid any confusion. This patch hasn't got any other reviews recently, I'll update it shortly.

--
Best Regards,
Kairui Song
Re: [PATCH v5 0/2] let kexec_file_load use platform keyring to verify the kernel image
On Mon, Jan 21, 2019 at 6:00 PM Kairui Song wrote:
>
> This patch series adds a .platform_trusted_keys in system_keyring as
> the reference to the .platform keyring in the integrity subsystem;
> when the platform keyring is being initialized it will be updated, so
> it will be accessible for verifying PE signed kernel images.
>
> This patch series lets kexec_file_load use the platform keyring as a
> fallback if it fails to verify the image against the secondary
> keyring, so the actual PE signature verification process will use keys
> provided by firmware.
>
> After this patch kexec_file_load will be able to verify a signed PE
> bzImage using keys in the platform keyring.
>
> Tested in a VM with a locally signed kernel, with pesign and the cert
> imported to EFI's MokList variable.
>
> To test this patch series on the latest kernel, you need to ensure
> this commit is applied, as there is a regression bug in
> sanity_check_segment_list():
>
> https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/commit/?id=993a110319a4a60aadbd02f6defdebe048f7773b
>
> Update from V4:
> - Drop ifdef in security/integrity/digsig.c to make code clearer
> - Fix a potential issue, set_platform_trusted_keys should not be
>   called when keyring initialization failed
>
> Update from V3:
> - Tweak and simplify commit message as suggested by Mimi Zohar
>
> Update from V2:
> - Use IS_ENABLED in kexec_file_load to judge if platform_trusted_keys
>   should be used for verifying the image, as suggested by Mimi Zohar
>
> Update from V1:
> - Make platform_trusted_keys static, and update commit message as
>   suggested by Mimi Zohar
> - Always check if the platform keyring is initialized before using it
>
> Kairui Song (2):
>   integrity, KEYS: add a reference to platform keyring
>   kexec, KEYS: Make use of platform keyring for signature verify
>
>  arch/x86/kernel/kexec-bzimage64.c | 13 ++---
>  certs/system_keyring.c            | 22 +-
>  include/keys/system_keyring.h     |  9 +
>  include/linux/verification.h      |  1 +
>  security/integrity/digsig.c       |  3 +++
>  5 files changed, 44 insertions(+), 4 deletions(-)
>
> --
> 2.20.1

Hi Mimi, I've updated the patch series again, and as the code changed a bit I didn't include the previous Reviewed-by / Tested-by tags. It worked with no problems; could you help review it again? Thank you.

--
Best Regards,
Kairui Song
[PATCH v5 0/2] let kexec_file_load use platform keyring to verify the kernel image
This patch series adds a .platform_trusted_keys in system_keyring as the
reference to the .platform keyring in the integrity subsystem; when the
platform keyring is initialized the reference is updated, so it becomes
accessible for verifying PE signed kernel images.

This patch series lets kexec_file_load use the platform keyring as a
fallback if it fails to verify the image against the secondary keyring,
so the actual PE signature verification will use keys provided by
firmware.

After this patch kexec_file_load will be able to verify a signed PE
bzImage using keys in the platform keyring.

Tested in a VM with a locally signed kernel (signed with pesign, cert
imported into EFI's MokList variable).

To test this patch series on the latest kernel, ensure the following
commit is applied, as there is a regression bug in
sanity_check_segment_list():

https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/commit/?id=993a110319a4a60aadbd02f6defdebe048f7773b

Update from V4:
- Drop ifdef in security/integrity/digsig.c to make code clearer
- Fix a potential issue: set_platform_trusted_keys should not be
  called when keyring initialization failed

Update from V3:
- Tweak and simplify commit message as suggested by Mimi Zohar

Update from V2:
- Use IS_ENABLED in kexec_file_load to judge if platform_trusted_keys
  should be used for verifying the image, as suggested by Mimi Zohar

Update from V1:
- Make platform_trusted_keys static, and update commit message as
  suggested by Mimi Zohar
- Always check if the platform keyring is initialized before using it

Kairui Song (2):
  integrity, KEYS: add a reference to platform keyring
  kexec, KEYS: Make use of platform keyring for signature verify

 arch/x86/kernel/kexec-bzimage64.c | 13 ++---
 certs/system_keyring.c            | 22 +-
 include/keys/system_keyring.h     |  9 +
 include/linux/verification.h      |  1 +
 security/integrity/digsig.c       |  3 +++
 5 files changed, 44 insertions(+), 4 deletions(-)

--
2.20.1
[PATCH v5 1/2] integrity, KEYS: add a reference to platform keyring
commit 9dc92c45177a ('integrity: Define a trusted platform keyring')
introduced a .platform keyring for storing preboot keys, used for
verifying kernel image signatures. Currently only IMA-appraisal is able
to use the keyring to verify kernel images that have their signature
stored in xattr.

This patch exposes the .platform keyring, making it accessible for
verifying PE signed kernel images as well.

Suggested-by: Mimi Zohar
Signed-off-by: Kairui Song
---
 certs/system_keyring.c        | 9 +
 include/keys/system_keyring.h | 9 +
 security/integrity/digsig.c   | 3 +++
 3 files changed, 21 insertions(+)

diff --git a/certs/system_keyring.c b/certs/system_keyring.c
index 81728717523d..4690ef9cda8a 100644
--- a/certs/system_keyring.c
+++ b/certs/system_keyring.c
@@ -24,6 +24,9 @@ static struct key *builtin_trusted_keys;
 #ifdef CONFIG_SECONDARY_TRUSTED_KEYRING
 static struct key *secondary_trusted_keys;
 #endif
+#ifdef CONFIG_INTEGRITY_PLATFORM_KEYRING
+static struct key *platform_trusted_keys;
+#endif

 extern __initconst const u8 system_certificate_list[];
 extern __initconst const unsigned long system_certificate_list_size;
@@ -265,4 +268,10 @@ int verify_pkcs7_signature(const void *data, size_t len,
 }
 EXPORT_SYMBOL_GPL(verify_pkcs7_signature);

+#ifdef CONFIG_INTEGRITY_PLATFORM_KEYRING
+void __init set_platform_trusted_keys(struct key *keyring) {
+	platform_trusted_keys = keyring;
+}
+#endif
+
 #endif /* CONFIG_SYSTEM_DATA_VERIFICATION */
diff --git a/include/keys/system_keyring.h b/include/keys/system_keyring.h
index 359c2f936004..df766ef8f03c 100644
--- a/include/keys/system_keyring.h
+++ b/include/keys/system_keyring.h
@@ -61,5 +61,14 @@ static inline struct key *get_ima_blacklist_keyring(void)
 }
 #endif /* CONFIG_IMA_BLACKLIST_KEYRING */

+#ifdef CONFIG_INTEGRITY_PLATFORM_KEYRING
+
+extern void __init set_platform_trusted_keys(struct key *keyring);
+
+#else
+
+static inline void set_platform_trusted_keys(struct key *keyring) { }
+
+#endif /* CONFIG_INTEGRITY_PLATFORM_KEYRING */
+
 #endif /* _KEYS_SYSTEM_KEYRING_H */
diff --git a/security/integrity/digsig.c b/security/integrity/digsig.c
index f45d6edecf99..e19c2eb72c51 100644
--- a/security/integrity/digsig.c
+++ b/security/integrity/digsig.c
@@ -87,6 +87,9 @@ static int __integrity_init_keyring(const unsigned int id, key_perm_t perm,
 		pr_info("Can't allocate %s keyring (%d)\n",
 			keyring_name[id], err);
 		keyring[id] = NULL;
+	} else {
+		if (id == INTEGRITY_KEYRING_PLATFORM)
+			set_platform_trusted_keys(keyring[id]);
 	}

 	return err;
--
2.20.1
[PATCH v5 2/2] kexec, KEYS: Make use of platform keyring for signature verify
This patch lets kexec_file_load make use of the .platform keyring as a
fallback if it fails to verify a PE signed image against the secondary
or builtin keyring, making it possible to verify kernel images signed
with preboot keys as well.

This commit adds a VERIFY_USE_PLATFORM_KEYRING, similar to the previous
VERIFY_USE_SECONDARY_KEYRING, indicating that verify_pkcs7_signature
should verify the signature using the platform keyring. Also, decrease
the error message log level when verification fails with -ENOKEY, so
that a caller trying multiple keyrings in turn won't generate extra
noise.

Signed-off-by: Kairui Song
---
 arch/x86/kernel/kexec-bzimage64.c | 13 ++---
 certs/system_keyring.c            | 13 -
 include/linux/verification.h      |  1 +
 3 files changed, 23 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/kexec-bzimage64.c b/arch/x86/kernel/kexec-bzimage64.c
index 7d97e432cbbc..2c007abd3d40 100644
--- a/arch/x86/kernel/kexec-bzimage64.c
+++ b/arch/x86/kernel/kexec-bzimage64.c
@@ -534,9 +534,16 @@ static int bzImage64_cleanup(void *loader_data)
 #ifdef CONFIG_KEXEC_BZIMAGE_VERIFY_SIG
 static int bzImage64_verify_sig(const char *kernel, unsigned long kernel_len)
 {
-	return verify_pefile_signature(kernel, kernel_len,
-				       VERIFY_USE_SECONDARY_KEYRING,
-				       VERIFYING_KEXEC_PE_SIGNATURE);
+	int ret;
+	ret = verify_pefile_signature(kernel, kernel_len,
+				      VERIFY_USE_SECONDARY_KEYRING,
+				      VERIFYING_KEXEC_PE_SIGNATURE);
+	if (ret == -ENOKEY && IS_ENABLED(CONFIG_INTEGRITY_PLATFORM_KEYRING)) {
+		ret = verify_pefile_signature(kernel, kernel_len,
+					      VERIFY_USE_PLATFORM_KEYRING,
+					      VERIFYING_KEXEC_PE_SIGNATURE);
+	}
+	return ret;
 }
 #endif

diff --git a/certs/system_keyring.c b/certs/system_keyring.c
index 4690ef9cda8a..7085c286f4bd 100644
--- a/certs/system_keyring.c
+++ b/certs/system_keyring.c
@@ -240,11 +240,22 @@ int verify_pkcs7_signature(const void *data, size_t len,
 #else
 		trusted_keys = builtin_trusted_keys;
 #endif
+	} else if (trusted_keys == VERIFY_USE_PLATFORM_KEYRING) {
+#ifdef CONFIG_INTEGRITY_PLATFORM_KEYRING
+		trusted_keys = platform_trusted_keys;
+#else
+		trusted_keys = NULL;
+#endif
+		if (!trusted_keys) {
+			ret = -ENOKEY;
+			pr_devel("PKCS#7 platform keyring is not available\n");
+			goto error;
+		}
 	}
 	ret = pkcs7_validate_trust(pkcs7, trusted_keys);
 	if (ret < 0) {
 		if (ret == -ENOKEY)
-			pr_err("PKCS#7 signature not signed with a trusted key\n");
+			pr_devel("PKCS#7 signature not signed with a trusted key\n");
 		goto error;
 	}

diff --git a/include/linux/verification.h b/include/linux/verification.h
index cfa4730d607a..018fb5f13d44 100644
--- a/include/linux/verification.h
+++ b/include/linux/verification.h
@@ -17,6 +17,7 @@
  * should be used.
  */
 #define VERIFY_USE_SECONDARY_KEYRING ((struct key *)1UL)
+#define VERIFY_USE_PLATFORM_KEYRING  ((struct key *)2UL)

 /*
  * The use to which an asymmetric key is being put.
--
2.20.1
Re: [PATCH v4 0/2] let kexec_file_load use platform keyring to verify the kernel image
On Fri, Jan 18, 2019 at 10:28 PM Kairui Song wrote:
>
> On Fri, Jan 18, 2019 at 9:42 PM Kairui Song wrote:
> >
> > On Fri, Jan 18, 2019 at 8:37 PM Dave Young wrote:
> > >
> > > On 01/18/19 at 08:34pm, Dave Young wrote:
> > > > On 01/18/19 at 06:53am, Mimi Zohar wrote:
> > > > > On Fri, 2019-01-18 at 17:17 +0800, Kairui Song wrote:
> > > > > > This patch series adds a .platform_trusted_keys in
> > > > > > system_keyring as the reference to .platform keyring in
> > > > > > integrity subsystem, when platform keyring is being
> > > > > > initialized it will be updated. So other component could
> > > > > > use this keyring as well.
> > > > >
> > > > > Kairui, when people review patches, the comments could be
> > > > > specific, but are normally generic.  My review included a
> > > > > couple of generic suggestions - not to use "#ifdef" in C code
> > > > > (eg. is_enabled), use the term "preboot" keys, and remove any
> > > > > references to "other components".
> > > > >
> > > > > After all the wording suggestions I've made, you are still
> > > > > saying, "So other components could use this keyring as well".
> > > > > Really?!  How the platform keyring will be used in the future,
> > > > > is up to you and others to convince Linus.  At least for now,
> > > > > please limit its usage to verifying the PE signed kernel
> > > > > image.  If this patch set needs to be reposted, please remove
> > > > > all references to "other components".
> > > > >
> > > > > Dave/David, are you ok with Kairui's usage of "#ifdef's"?
> > > > > Dave, you Acked the original post.  Can I include it?  Can we
> > > > > get some additional Ack's on these patches?
> > > >
> > > > It is better to update patch to use IS_ENABLED in patch 1/2 as well.
> > >
> > > Hmm, not only for patch 1/2, patch 2/2 also need an update
> > >
> > > > Other than that, for kexec part I'm fine with an ack.
> > > >
> > > > Thanks
> > > > Dave
> >
> > Thanks for the review again, will update the patch using IS_ENABLED
> > along with updating the cover letter shortly.
> >
> > --
> > Best Regards,
> > Kairui Song
>
> Hi, before I update it again: most of the new platform_trusted_keys
> related code follows how secondary_trusted_keys is implemented
> (surrounded by ifdefs). I thought this could reduce unused code when
> the keyring is not enabled. Else, all the ifdefs could simply be
> removed: when platform_keyring is not enabled, platform_trusted_keys
> will always be NULL, and verify_pkcs7_signature will simply return
> -ENOKEY if anyone tries to use the platform keyring.
>
> Any suggestions? Or I can just remove the ifdef in
> security/integrity/digsig.c and make set_platform_trusted_keys an
> empty inline function in system_keyring.h.
>
> --
> Best Regards,
> Kairui Song

Hi, after a second thought I'll drop the #ifdef in
security/integrity/digsig.c in PATCH 1/2, and make
set_platform_trusted_keys an empty inline function when
CONFIG_INTEGRITY_PLATFORM_KEYRING is undefined.

But for the other ifdefs in certs/system_keyring.c I think I'll keep
them untouched. They strip out the platform_trusted_keys variable and
the related function when CONFIG_INTEGRITY_PLATFORM_KEYRING is not set,
which helps reduce unused code, prevents compile errors, and keeps the
style aligned with the existing code in system_keyring.c.

Will send v5 with the above updates and a fix for a potential problem
found by Nayna.

--
Best Regards,
Kairui Song
Re: [PATCH v4 1/2] integrity, KEYS: add a reference to platform keyring
On Fri, Jan 18, 2019 at 10:36 PM Nayna wrote:
>
> On 01/18/2019 04:17 AM, Kairui Song wrote:
> > commit 9dc92c45177a ('integrity: Define a trusted platform keyring')
> > introduced a .platform keyring for storing preboot keys, used for
> > verifying kernel image signatures. Currently only IMA-appraisal is
> > able to use the keyring to verify kernel images that have their
> > signature stored in xattr.
> >
> > This patch exposes the .platform keyring, making it accessible for
> > verifying PE signed kernel images as well.
> >
> > Suggested-by: Mimi Zohar
> > Signed-off-by: Kairui Song
> > Reviewed-by: Mimi Zohar
> > Tested-by: Mimi Zohar
> > ---
> >  certs/system_keyring.c        | 9 +
> >  include/keys/system_keyring.h | 5 +
> >  security/integrity/digsig.c   | 6 ++
> >  3 files changed, 20 insertions(+)
> >
> > diff --git a/certs/system_keyring.c b/certs/system_keyring.c
> > index 81728717523d..4690ef9cda8a 100644
> > --- a/certs/system_keyring.c
> > +++ b/certs/system_keyring.c
> > @@ -24,6 +24,9 @@ static struct key *builtin_trusted_keys;
> >  #ifdef CONFIG_SECONDARY_TRUSTED_KEYRING
> >  static struct key *secondary_trusted_keys;
> >  #endif
> > +#ifdef CONFIG_INTEGRITY_PLATFORM_KEYRING
> > +static struct key *platform_trusted_keys;
> > +#endif
> >
> >  extern __initconst const u8 system_certificate_list[];
> >  extern __initconst const unsigned long system_certificate_list_size;
> > @@ -265,4 +268,10 @@ int verify_pkcs7_signature(const void *data, size_t len,
> >  }
> >  EXPORT_SYMBOL_GPL(verify_pkcs7_signature);
> >
> > +#ifdef CONFIG_INTEGRITY_PLATFORM_KEYRING
> > +void __init set_platform_trusted_keys(struct key *keyring) {
> > +	platform_trusted_keys = keyring;
> > +}
> > +#endif
> > +
> >  #endif /* CONFIG_SYSTEM_DATA_VERIFICATION */
> > diff --git a/include/keys/system_keyring.h b/include/keys/system_keyring.h
> > index 359c2f936004..9e1b7849b6aa 100644
> > --- a/include/keys/system_keyring.h
> > +++ b/include/keys/system_keyring.h
> > @@ -61,5 +61,10 @@ static inline struct key *get_ima_blacklist_keyring(void)
> >  }
> >  #endif /* CONFIG_IMA_BLACKLIST_KEYRING */
> >
> > +#ifdef CONFIG_INTEGRITY_PLATFORM_KEYRING
> > +
> > +extern void __init set_platform_trusted_keys(struct key *keyring);
> > +
> > +#endif /* CONFIG_INTEGRITY_PLATFORM_KEYRING */
> >
> >  #endif /* _KEYS_SYSTEM_KEYRING_H */
> > diff --git a/security/integrity/digsig.c b/security/integrity/digsig.c
> > index f45d6edecf99..bfabc2a8111d 100644
> > --- a/security/integrity/digsig.c
> > +++ b/security/integrity/digsig.c
> > @@ -89,6 +89,12 @@ static int __integrity_init_keyring(const unsigned int id, key_perm_t perm,
> >  		keyring[id] = NULL;
> >  	}
> >
> > +#ifdef CONFIG_INTEGRITY_PLATFORM_KEYRING
> > +	if (id == INTEGRITY_KEYRING_PLATFORM) {
>
> Shouldn't it also check that keyring[id] is not NULL?

Good catch. If it's NULL then platform_trusted_keys will be set to NULL
as well, which will still work fine, as in that case the platform
keyring is considered not initialized. I'll add a sanity check on the
err value just in case. Thanks for your suggestion!

> Thanks & Regards,
>     - Nayna
>
> > +		set_platform_trusted_keys(keyring[id]);
> > +	}
> > +#endif
> > +
> >  	return err;
> >  }
> >
>
> ___
> kexec mailing list
> ke...@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/kexec

--
Best Regards,
Kairui Song
Re: [PATCH v4 0/2] let kexec_file_load use platform keyring to verify the kernel image
On Fri, Jan 18, 2019 at 9:42 PM Kairui Song wrote:
>
> On Fri, Jan 18, 2019 at 8:37 PM Dave Young wrote:
> >
> > On 01/18/19 at 08:34pm, Dave Young wrote:
> > > On 01/18/19 at 06:53am, Mimi Zohar wrote:
> > > > On Fri, 2019-01-18 at 17:17 +0800, Kairui Song wrote:
> > > > > This patch series adds a .platform_trusted_keys in
> > > > > system_keyring as the reference to .platform keyring in
> > > > > integrity subsystem, when platform keyring is being
> > > > > initialized it will be updated. So other component could use
> > > > > this keyring as well.
> > > >
> > > > Kairui, when people review patches, the comments could be
> > > > specific, but are normally generic.  My review included a couple
> > > > of generic suggestions - not to use "#ifdef" in C code (eg.
> > > > is_enabled), use the term "preboot" keys, and remove any
> > > > references to "other components".
> > > >
> > > > After all the wording suggestions I've made, you are still
> > > > saying, "So other components could use this keyring as well".
> > > > Really?!  How the platform keyring will be used in the future,
> > > > is up to you and others to convince Linus.  At least for now,
> > > > please limit its usage to verifying the PE signed kernel image.
> > > > If this patch set needs to be reposted, please remove all
> > > > references to "other components".
> > > >
> > > > Dave/David, are you ok with Kairui's usage of "#ifdef's"?  Dave,
> > > > you Acked the original post.  Can I include it?  Can we get some
> > > > additional Ack's on these patches?
> > >
> > > It is better to update patch to use IS_ENABLED in patch 1/2 as well.
> >
> > Hmm, not only for patch 1/2, patch 2/2 also need an update
> >
> > > Other than that, for kexec part I'm fine with an ack.
> > >
> > > Thanks
> > > Dave
>
> Thanks for the review again, will update the patch using IS_ENABLED
> along with updating the cover letter shortly.
>
> --
> Best Regards,
> Kairui Song

Hi, before I update it again: most of the new platform_trusted_keys
related code follows how secondary_trusted_keys is implemented
(surrounded by ifdefs). I thought this could reduce unused code when the
keyring is not enabled. Else, all the ifdefs could simply be removed:
when platform_keyring is not enabled, platform_trusted_keys will always
be NULL, and verify_pkcs7_signature will simply return -ENOKEY if anyone
tries to use the platform keyring.

Any suggestions? Or I can just remove the ifdef in
security/integrity/digsig.c and make set_platform_trusted_keys an empty
inline function in system_keyring.h.

--
Best Regards,
Kairui Song
Re: [PATCH v4 0/2] let kexec_file_load use platform keyring to verify the kernel image
On Fri, Jan 18, 2019 at 8:37 PM Dave Young wrote:
>
> On 01/18/19 at 08:34pm, Dave Young wrote:
> > On 01/18/19 at 06:53am, Mimi Zohar wrote:
> > > On Fri, 2019-01-18 at 17:17 +0800, Kairui Song wrote:
> > > > This patch series adds a .platform_trusted_keys in system_keyring
> > > > as the reference to .platform keyring in integrity subsystem,
> > > > when platform keyring is being initialized it will be updated. So
> > > > other component could use this keyring as well.
> > >
> > > Kairui, when people review patches, the comments could be specific,
> > > but are normally generic.  My review included a couple of generic
> > > suggestions - not to use "#ifdef" in C code (eg. is_enabled), use
> > > the term "preboot" keys, and remove any references to "other
> > > components".
> > >
> > > After all the wording suggestions I've made, you are still saying,
> > > "So other components could use this keyring as well".  Really?!
> > > How the platform keyring will be used in the future, is up to you
> > > and others to convince Linus.  At least for now, please limit its
> > > usage to verifying the PE signed kernel image.  If this patch set
> > > needs to be reposted, please remove all references to "other
> > > components".
> > >
> > > Dave/David, are you ok with Kairui's usage of "#ifdef's"?  Dave,
> > > you Acked the original post.  Can I include it?  Can we get some
> > > additional Ack's on these patches?
> >
> > It is better to update patch to use IS_ENABLED in patch 1/2 as well.
>
> Hmm, not only for patch 1/2, patch 2/2 also need an update
>
> > Other than that, for kexec part I'm fine with an ack.
> >
> > Thanks
> > Dave

Thanks for the review again, will update the patch using IS_ENABLED
along with updating the cover letter shortly.

--
Best Regards,
Kairui Song
Re: [PATCH v4 0/2] let kexec_file_load use platform keyring to verify the kernel image
On Fri, Jan 18, 2019, 19:54 Mimi Zohar wrote:
>
> On Fri, 2019-01-18 at 17:17 +0800, Kairui Song wrote:
> > This patch series adds a .platform_trusted_keys in system_keyring as
> > the reference to .platform keyring in integrity subsystem, when
> > platform keyring is being initialized it will be updated. So other
> > component could use this keyring as well.
>
> Kairui, when people review patches, the comments could be specific,
> but are normally generic.  My review included a couple of generic
> suggestions - not to use "#ifdef" in C code (eg. is_enabled), use the
> term "preboot" keys, and remove any references to "other components".
>
> After all the wording suggestions I've made, you are still saying, "So
> other components could use this keyring as well".  Really?!  How the
> platform keyring will be used in the future, is up to you and others
> to convince Linus.  At least for now, please limit its usage to
> verifying the PE signed kernel image.  If this patch set needs to be
> reposted, please remove all references to "other components".
>
> Dave/David, are you ok with Kairui's usage of "#ifdef's"?  Dave, you
> Acked the original post.  Can I include it?  Can we get some
> additional Ack's on these patches?
>
> thanks!
>
> Mimi

Hi Mimi, thanks for your feedback. My bad, I reused the old cover letter
without checking it carefully; hopefully the commit messages have proper
wording now. If the cover letter needs to be updated I can resend the
patches, but let me hold off a while before updating again.
Re: [PATCH v3 2/3] acpi: store acpi_rsdp address for later kexec usage
On Fri, Jan 18, 2019 at 7:26 PM Borislav Petkov wrote:
>
> No, this is getting completely nuts: there's a bunch of functions which
> all end up returning boot_params's field except pvh_get_root_pointer().
>
> And now you're adding a late variant. And the cmdline parameter
> acpi_rsdp is in a CONFIG_KEXEC wrapper, and and...
>
> Wait until Chao Fan's stuff is applied, then do your changes ontop and
> drop all that ifdeffery. We will make this RSDP thing enabled
> unconditionally so that there's no need for ifdeffery and function
> wrappers.
>
> Also, after Chao's stuff, you won't need to call
> acpi_os_get_root_pointer() because the early code would've done that.
>
> --
> Regards/Gruss,
>     Boris.
>
> Good mailing practices for 400: avoid top-posting and trim the reply.

Good suggestion, will wait for Chao's update then.

--
Best Regards,
Kairui Song
[PATCH v3 1/3] x86, kexec_file_load: Don't setup EFI info if EFI runtime is not enabled
Currently with "efi=noruntime" on the kernel command line, calling
kexec_file_load triggers the following problem:

[   97.967067] BUG: unable to handle kernel NULL pointer dereference at
[   97.967894] #PF error: [normal kernel read fault]
...
[   97.980456] Call Trace:
[   97.980724]  efi_runtime_map_copy+0x28/0x30
[   97.981267]  bzImage64_load+0x688/0x872
[   97.981794]  arch_kexec_kernel_image_load+0x6d/0x70
[   97.982441]  kimage_file_alloc_init+0x13e/0x220
[   97.983035]  __x64_sys_kexec_file_load+0x144/0x290
[   97.983586]  do_syscall_64+0x55/0x1a0
[   97.983962]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

When EFI runtime services are not enabled, the EFI memmap is not mapped,
so just skip the EFI info setup.

Suggested-by: Dave Young
Signed-off-by: Kairui Song
---
 arch/x86/kernel/kexec-bzimage64.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/x86/kernel/kexec-bzimage64.c b/arch/x86/kernel/kexec-bzimage64.c
index 2c007abd3d40..097f52fb02e3 100644
--- a/arch/x86/kernel/kexec-bzimage64.c
+++ b/arch/x86/kernel/kexec-bzimage64.c
@@ -167,6 +167,9 @@ setup_efi_state(struct boot_params *params, unsigned long params_load_addr,
 	struct efi_info *current_ei = &boot_params.efi_info;
 	struct efi_info *ei = &params->efi_info;

+	if (!efi_enabled(EFI_RUNTIME_SERVICES))
+		return 0;
+
 	if (!current_ei->efi_memmap_size)
 		return 0;

--
2.20.1
[PATCH v3 3/3] x86, kexec_file_load: make it work with efi=noruntime or efi=old_map
When efi=noruntime or efi=old_map is used, EFI services won't be
available in the second kernel, therefore the second kernel will not be
able to get the ACPI RSDP address from firmware by calling EFI services
and won't boot. Previously we expected the user to set acpi_rsdp= on the
second kernel's command line, as there was no way to pass the RSDP
address to the second kernel.

After commit e6e094e053af ('x86/acpi, x86/boot: Take RSDP address from
boot params if available'), it's now possible to set an acpi_rsdp_addr
parameter in the boot_params passed to the second kernel. This commit
makes use of it: detect and set the RSDP address when it's required for
the second kernel to boot.

Tested with an EFI enabled KVM VM with efi=noruntime.

Suggested-by: Dave Young
Signed-off-by: Kairui Song
---
 arch/x86/kernel/kexec-bzimage64.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/arch/x86/kernel/kexec-bzimage64.c b/arch/x86/kernel/kexec-bzimage64.c
index 097f52fb02e3..63101b2194fb 100644
--- a/arch/x86/kernel/kexec-bzimage64.c
+++ b/arch/x86/kernel/kexec-bzimage64.c
@@ -20,6 +20,7 @@
 #include
 #include
 #include
+#include
 #include
 #include

@@ -255,8 +256,17 @@ setup_boot_parameters(struct kimage *image, struct boot_params *params,
 	/* Setup EFI state */
 	setup_efi_state(params, params_load_addr, efi_map_offset, efi_map_sz,
 			efi_setup_data_offset);
+
+#ifdef CONFIG_ACPI
+	/* Setup ACPI RSDP pointer in case EFI is not available in second kernel */
+	if (!acpi_disabled && (!efi_enabled(EFI_RUNTIME_SERVICES) || efi_enabled(EFI_OLD_MEMMAP))) {
+		params->acpi_rsdp_addr = acpi_os_get_root_pointer_late();
+		if (!params->acpi_rsdp_addr)
+			pr_warn("RSDP is not available for second kernel\n");
+	}
+#endif
 #endif

 	/* Setup EDD info */
 	memcpy(params->eddbuf, boot_params.eddbuf,
 				EDDMAXNR * sizeof(struct edd_info));
--
2.20.1
[PATCH v3 2/3] acpi: store acpi_rsdp address for later kexec usage
Currently we have acpi_os_get_root_pointer as the universal function to
get the RSDP address. But the function itself, and some functions it
depends on, are in the .init section, which makes it hard to retrieve
the RSDP value once the kernel is initialized. And for kexec, the RSDP
needs to be retrieved again if EFI is disabled, because the second
kernel will not be able to get the RSDP value in that case; it expects
either the user to specify the RSDP value on the kernel cmdline, or
kexec to retrieve and pass the RSDP value using boot_params.

This patch stores the RSDP address once initialization is done, and
introduces acpi_os_get_root_pointer_late for later kexec usage.

Signed-off-by: Kairui Song
---
 drivers/acpi/osl.c   | 10 ++
 include/linux/acpi.h |  3 +++
 2 files changed, 13 insertions(+)

diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
index f29e427d0d1d..6340d34d0df1 100644
--- a/drivers/acpi/osl.c
+++ b/drivers/acpi/osl.c
@@ -187,6 +187,16 @@ static int __init setup_acpi_rsdp(char *arg)
 	return kstrtoul(arg, 16, &acpi_rsdp);
 }
 early_param("acpi_rsdp", setup_acpi_rsdp);
+
+acpi_physical_address acpi_os_get_root_pointer_late(void) {
+	return acpi_rsdp;
+}
+
+static int __init acpi_store_root_pointer(void) {
+	acpi_rsdp = acpi_os_get_root_pointer();
+	return 0;
+}
+late_initcall(acpi_store_root_pointer);
 #endif

 acpi_physical_address __init acpi_os_get_root_pointer(void)
diff --git a/include/linux/acpi.h b/include/linux/acpi.h
index 87715f20b69a..226f2572eb8e 100644
--- a/include/linux/acpi.h
+++ b/include/linux/acpi.h
@@ -892,6 +892,9 @@ static inline void arch_reserve_mem_area(acpi_physical_address addr,
 { }
 #endif /* CONFIG_X86 */

+#ifdef CONFIG_KEXEC
+acpi_physical_address acpi_os_get_root_pointer_late(void);
+#endif
 #else
 #define acpi_os_set_prepare_sleep(func, pm1a_ctrl, pm1b_ctrl) do { } while (0)
 #endif
--
2.20.1
[PATCH v3 0/3] make kexec work with efi=noruntime or efi=old_map
This patch series fixes the kexec panic with efi=noruntime or
efi=old_map: it passes acpi_rsdp_addr to the second kernel so that it
boots up properly.

Update from V2:
- Store the acpi rsdp value, and add acpi_os_get_root_pointer_late as a
  helper, leveraging existing code so we don't need to reparse the RSDP

Update from V1:
- Add a cover letter and fix some typos in commit messages
- Previous patches were not sent in a single thread

Kairui Song (3):
  x86, kexec_file_load: Don't setup EFI info if EFI runtime is not
    enabled
  acpi: store acpi_rsdp address for later kexec usage
  x86, kexec_file_load: make it work with efi=noruntime or efi=old_map

 arch/x86/kernel/kexec-bzimage64.c | 13 +
 drivers/acpi/osl.c                | 10 ++
 include/linux/acpi.h              |  3 +++
 3 files changed, 26 insertions(+)

--
2.20.1
[PATCH v4 0/2] let kexec_file_load use platform keyring to verify the kernel image
This patch series adds a .platform_trusted_keys in system_keyring as the
reference to the .platform keyring in the integrity subsystem; when the
platform keyring is initialized the reference is updated. So other
components could use this keyring as well.

This patch series also lets kexec_file_load use the platform keyring as
a fallback if it fails to verify the image against the secondary
keyring, making it possible to load kernels signed with keys provided by
firmware.

After this patch kexec_file_load will be able to verify a signed PE
bzImage using keys in the platform keyring.

Tested in a VM with a locally signed kernel (signed with pesign, cert
imported into EFI's MokList variable).

To test this patch series on the latest kernel, ensure the following
commit is applied, as there is a regression bug in
sanity_check_segment_list():

https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/commit/?id=993a110319a4a60aadbd02f6defdebe048f7773b

Update from V3:
- Tweak and simplify commit message as suggested by Mimi Zohar

Update from V2:
- Use IS_ENABLED in kexec_file_load to judge if platform_trusted_keys
  should be used for verifying the image, as suggested by Mimi Zohar

Update from V1:
- Make platform_trusted_keys static, and update commit message as
  suggested by Mimi Zohar
- Always check if the platform keyring is initialized before using it

Kairui Song (2):
  integrity, KEYS: add a reference to platform keyring
  kexec, KEYS: Make use of platform keyring for signature verify

 arch/x86/kernel/kexec-bzimage64.c | 13 ++---
 certs/system_keyring.c            | 22 +-
 include/keys/system_keyring.h     |  5 +
 include/linux/verification.h      |  1 +
 security/integrity/digsig.c       |  6 ++
 5 files changed, 43 insertions(+), 4 deletions(-)

--
2.20.1
[PATCH v4 2/2] kexec, KEYS: Make use of platform keyring for signature verify
This patch lets kexec_file_load make use of the .platform keyring as a
fallback if it fails to verify a PE signed image against the secondary
or builtin keyring, making it possible to verify kernel images signed
with preboot keys as well.

This commit adds a VERIFY_USE_PLATFORM_KEYRING, similar to the previous
VERIFY_USE_SECONDARY_KEYRING, indicating that verify_pkcs7_signature
should verify the signature using the platform keyring. Also, decrease
the error message log level when verification fails with -ENOKEY, so
that a caller trying multiple keyrings in turn won't generate extra
noise.

Signed-off-by: Kairui Song
Reviewed-by: Mimi Zohar
Tested-by: Mimi Zohar
---
 arch/x86/kernel/kexec-bzimage64.c | 13 ++---
 certs/system_keyring.c            | 13 -
 include/linux/verification.h      |  1 +
 3 files changed, 23 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/kexec-bzimage64.c b/arch/x86/kernel/kexec-bzimage64.c
index 7d97e432cbbc..2c007abd3d40 100644
--- a/arch/x86/kernel/kexec-bzimage64.c
+++ b/arch/x86/kernel/kexec-bzimage64.c
@@ -534,9 +534,16 @@ static int bzImage64_cleanup(void *loader_data)
 #ifdef CONFIG_KEXEC_BZIMAGE_VERIFY_SIG
 static int bzImage64_verify_sig(const char *kernel, unsigned long kernel_len)
 {
-	return verify_pefile_signature(kernel, kernel_len,
-				       VERIFY_USE_SECONDARY_KEYRING,
-				       VERIFYING_KEXEC_PE_SIGNATURE);
+	int ret;
+	ret = verify_pefile_signature(kernel, kernel_len,
+				      VERIFY_USE_SECONDARY_KEYRING,
+				      VERIFYING_KEXEC_PE_SIGNATURE);
+	if (ret == -ENOKEY && IS_ENABLED(CONFIG_INTEGRITY_PLATFORM_KEYRING)) {
+		ret = verify_pefile_signature(kernel, kernel_len,
+					      VERIFY_USE_PLATFORM_KEYRING,
+					      VERIFYING_KEXEC_PE_SIGNATURE);
+	}
+	return ret;
 }
 #endif

diff --git a/certs/system_keyring.c b/certs/system_keyring.c
index 4690ef9cda8a..7085c286f4bd 100644
--- a/certs/system_keyring.c
+++ b/certs/system_keyring.c
@@ -240,11 +240,22 @@ int verify_pkcs7_signature(const void *data, size_t len,
 #else
 		trusted_keys = builtin_trusted_keys;
 #endif
+	} else if (trusted_keys == VERIFY_USE_PLATFORM_KEYRING) {
+#ifdef CONFIG_INTEGRITY_PLATFORM_KEYRING
+		trusted_keys = platform_trusted_keys;
+#else
+		trusted_keys = NULL;
+#endif
+		if (!trusted_keys) {
+			ret = -ENOKEY;
+			pr_devel("PKCS#7 platform keyring is not available\n");
+			goto error;
+		}
 	}
 	ret = pkcs7_validate_trust(pkcs7, trusted_keys);
 	if (ret < 0) {
 		if (ret == -ENOKEY)
-			pr_err("PKCS#7 signature not signed with a trusted key\n");
+			pr_devel("PKCS#7 signature not signed with a trusted key\n");
 		goto error;
 	}

diff --git a/include/linux/verification.h b/include/linux/verification.h
index cfa4730d607a..018fb5f13d44 100644
--- a/include/linux/verification.h
+++ b/include/linux/verification.h
@@ -17,6 +17,7 @@
  * should be used.
  */
 #define VERIFY_USE_SECONDARY_KEYRING ((struct key *)1UL)
+#define VERIFY_USE_PLATFORM_KEYRING  ((struct key *)2UL)

 /*
  * The use to which an asymmetric key is being put.
--
2.20.1
[PATCH v4 1/2] integrity, KEYS: add a reference to platform keyring
commit 9dc92c45177a ('integrity: Define a trusted platform keyring')
introduced a .platform keyring for storing preboot keys, used for
verifying kernel image signatures. Currently only IMA-appraisal is able
to use the keyring to verify kernel images that have their signature
stored in xattr. This patch exposes the .platform keyring, making it
accessible for verifying PE-signed kernel images as well.

Suggested-by: Mimi Zohar
Signed-off-by: Kairui Song
Reviewed-by: Mimi Zohar
Tested-by: Mimi Zohar
---
 certs/system_keyring.c        | 9 +
 include/keys/system_keyring.h | 5 +
 security/integrity/digsig.c   | 6 ++
 3 files changed, 20 insertions(+)

diff --git a/certs/system_keyring.c b/certs/system_keyring.c
index 81728717523d..4690ef9cda8a 100644
--- a/certs/system_keyring.c
+++ b/certs/system_keyring.c
@@ -24,6 +24,9 @@ static struct key *builtin_trusted_keys;
 #ifdef CONFIG_SECONDARY_TRUSTED_KEYRING
 static struct key *secondary_trusted_keys;
 #endif
+#ifdef CONFIG_INTEGRITY_PLATFORM_KEYRING
+static struct key *platform_trusted_keys;
+#endif

 extern __initconst const u8 system_certificate_list[];
 extern __initconst const unsigned long system_certificate_list_size;
@@ -265,4 +268,10 @@ int verify_pkcs7_signature(const void *data, size_t len,
 }
 EXPORT_SYMBOL_GPL(verify_pkcs7_signature);

+#ifdef CONFIG_INTEGRITY_PLATFORM_KEYRING
+void __init set_platform_trusted_keys(struct key *keyring) {
+        platform_trusted_keys = keyring;
+}
+#endif
+
 #endif /* CONFIG_SYSTEM_DATA_VERIFICATION */

diff --git a/include/keys/system_keyring.h b/include/keys/system_keyring.h
index 359c2f936004..9e1b7849b6aa 100644
--- a/include/keys/system_keyring.h
+++ b/include/keys/system_keyring.h
@@ -61,5 +61,10 @@ static inline struct key *get_ima_blacklist_keyring(void)
 }
 #endif /* CONFIG_IMA_BLACKLIST_KEYRING */

+#ifdef CONFIG_INTEGRITY_PLATFORM_KEYRING
+
+extern void __init set_platform_trusted_keys(struct key* keyring);
+
+#endif /* CONFIG_INTEGRITY_PLATFORM_KEYRING */
 #endif /* _KEYS_SYSTEM_KEYRING_H */

diff --git a/security/integrity/digsig.c b/security/integrity/digsig.c
index f45d6edecf99..bfabc2a8111d 100644
--- a/security/integrity/digsig.c
+++ b/security/integrity/digsig.c
@@ -89,6 +89,12 @@ static int __integrity_init_keyring(const unsigned int id, key_perm_t perm,
                 keyring[id] = NULL;
         }

+#ifdef CONFIG_INTEGRITY_PLATFORM_KEYRING
+        if (id == INTEGRITY_KEYRING_PLATFORM) {
+                set_platform_trusted_keys(keyring[id]);
+        }
+#endif
+
         return err;
 }
--
2.20.1
Re: [PATCH v2 2/2] x86, kexec_file_load: make it work with efi=noruntime or efi=old_map
On Thu, Jan 17, 2019 at 5:40 PM Rafael J. Wysocki wrote:
>
> On Thu, Jan 17, 2019 at 9:53 AM Dave Young wrote:
> >
> > Add linux-acpi list
>
> Well, thanks, but please resend the patches with a CC to linux-acpi.
>

Hi, sure, will do. Any thoughts on adding an acpi_os_get_root_pointer_late
and storing the rsdp pointer as mentioned? Will update the patch and post
V2, and cc linux-acpi as well later.

> > On 01/17/19 at 03:41pm, Kairui Song wrote:
> > > On Wed, Jan 16, 2019 at 5:46 PM Borislav Petkov wrote:
> > > >
> > > > On Wed, Jan 16, 2019 at 03:08:42PM +0800, Kairui Song wrote:
> > > > > I didn't see a way to reuse things in that patch series, situation is
> > > > > different, in that patch it needs to get RSDP in very early boot stage
> > > > > so it did everything from scratch, in this patch kexec_file_load need
> > > > > to get RSDP too, but everything is well setup so things are a lot
> > > > > easier, just read from current boot_prams, efi and fallback to
> > > > > acpi_find_root_pointer should be good.
> > > >
> > > > No no. Early code should find out that venerable RSDP thing once and
> > > > will save it somewhere for further use. No gazillion parsings of it.
> > > > Just once and share it with the rest of the code that needs it.
> > > >
> > >
> > > How about we refill the boot_params.acpi_rsdp_addr if it is not valid
> > > in early code, so it could be used as a reliable RSDP address source?
> > > That should make things easier.
> > >
> > > But if early code should parse it and store it, that should be done in
> > > Chao's patch, or I can post another patch to do it if Chao's patch is
> > > merged.
> > >
> > > For now I think it is good to have something like this in this patch
> > > series to always keep storing acpi_rsdp in late code;
> > > acpi_os_get_root_pointer_late (maybe come up with a better name later)
> > > could be used anytime to get RSDP with no extra parsing:
> > >
> > > --- a/drivers/acpi/osl.c
> > > +++ b/drivers/acpi/osl.c
> > > @@ -180,8 +180,8 @@ void acpi_os_vprintf(const char *fmt, va_list args)
> > >  #endif
> > >  }
> > >
> > > -#ifdef CONFIG_KEXEC
> > >  static unsigned long acpi_rsdp;
> > > +#ifdef CONFIG_KEXEC
> > >  static int __init setup_acpi_rsdp(char *arg)
> > >  {
> > >          return kstrtoul(arg, 16, &acpi_rsdp);
> > > @@ -189,28 +189,38 @@ static int __init setup_acpi_rsdp(char *arg)
> > >  early_param("acpi_rsdp", setup_acpi_rsdp);
> > >  #endif
> > >
> > > +acpi_physical_address acpi_os_get_root_pointer_late(void) {
> > > +        return acpi_rsdp;
> > > +}
> > > +
> > >  acpi_physical_address __init acpi_os_get_root_pointer(void)
> > >  {
> > >          acpi_physical_address pa;
> > >
> > > -#ifdef CONFIG_KEXEC
> > >          if (acpi_rsdp)
> > >                  return acpi_rsdp;
> > > -#endif
> > > +
> > >          pa = acpi_arch_get_root_pointer();
> > > -        if (pa)
> > > +        if (pa) {
> > > +                acpi_rsdp = pa;
> > >                  return pa;
> > > +        }
> > >
> > >          if (efi_enabled(EFI_CONFIG_TABLES)) {
> > > -                if (efi.acpi20 != EFI_INVALID_TABLE_ADDR)
> > > +                if (efi.acpi20 != EFI_INVALID_TABLE_ADDR) {
> > > +                        acpi_rsdp = efi.acpi20;
> > >                          return efi.acpi20;
> > > -                if (efi.acpi != EFI_INVALID_TABLE_ADDR)
> > > +                }
> > > +                if (efi.acpi != EFI_INVALID_TABLE_ADDR) {
> > > +                        acpi_rsdp = efi.acpi;
> > >                          return efi.acpi;
> > > +                }
> > >                  pr_err(PREFIX "System description tables not found\n");
> > >          } else if (IS_ENABLED(CONFIG_ACPI_LEGACY_TABLES_LOOKUP)) {
> > >                  acpi_find_root_pointer();
> > >          }
> > >
> > > +        acpi_rsdp = pa;
> > >          return pa;
> > >  }
> > >
> > > >
> > > > --
> > > > Regards/Gruss,
> > > > Boris.
> > > >
> > > > Good mailing practices for 400: avoid top-posting and trim the reply.
> > > --
> > > Best Regards,
> > > Kairui Song

--
Best Regards,
Kairui Song
Re: [PATCH v3 0/2] let kexec_file_load use platform keyring to verify the kernel image
On Fri, Jan 18, 2019 at 10:00 AM Dave Young wrote:
>
> On 01/18/19 at 09:35am, Dave Young wrote:
> > On 01/17/19 at 08:08pm, Mimi Zohar wrote:
> > > On Wed, 2019-01-16 at 18:16 +0800, Kairui Song wrote:
> > > > This patch series adds a .platform_trusted_keys in system_keyring as the
> > > > reference to .platform keyring in integrity subsystem, when platform
> > > > keyring is being initialized it will be updated. So other component
> > > > could use this keyring as well.
> > >
> > > Remove "other component could use ...".
> > >
> > > > This patch series also let kexec_file_load use platform keyring as fall
> > > > back if it failed to verify the image against secondary keyring, make it
> > > > possible to load kernel signed by third part key if third party key is
> > > > imported in the firmware.
> > >
> > > This is the only reason for these patches. Please remove "also".
> > >
> > > > After this patch kexec_file_load will be able to verify a signed PE
> > > > bzImage using keys in platform keyring.
> > > >
> > > > Tested in a VM with locally signed kernel with pesign and imported the
> > > > cert to EFI's MokList variable.
> > >
> > > It's taken so long for me to review/test this patch set due to a
> > > regression in sanity_check_segment_list(), introduced somewhere
> > > between 4.20 and 5.0.0-rc1. The segment overlap test - "if ((mend >
> > > pstart) && (mstart < pend))" - fails, returning a -EINVAL.
> > >
> > > Is anyone else seeing this?
> >
> > Mimi, should be this issue? I have sent a fix for that.
> > https://lore.kernel.org/lkml/20181228011247.ga9...@dhcp-128-65.nay.redhat.com/
>
> Hi, Kairui, I think you should know this while working on this series,
> it is good to mention the test dependency in the cover letter so that
> reviewers can save time.
>
> BTW, Boris took it in tip already:
> https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/commit/?id=993a110319a4a60aadbd02f6defdebe048f7773b
>

Hi, thanks for the suggestion. I did apply your patch to avoid the
failure; will add such info next time.

Will send out V4 and update the commit message as suggested by Mimi.

--
Best Regards,
Kairui Song
Re: [RFC PATCH 1/1] KEYS, integrity: Link .platform keyring to .secondary_trusted_keys
On Thu, Jan 17, 2019 at 11:04 PM David Howells wrote:
>
> Kairui Song wrote:
>
> > +extern const struct key* __init integrity_get_platform_keyring(void);
>
> This should really be in keys/system_keyring.h and probably shouldn't be
> exposed directly if it can be avoided.
>
> David

Thanks for the review. I've sent V3 of this patch series and the
implementation changed a bit; would you mind taking a look at that patch
instead?
https://lore.kernel.org/lkml/20190116101654.7288-1-kas...@redhat.com/

--
Best Regards,
Kairui Song
Re: [PATCH v15 5/6] x86/boot: Parse SRAT address from RSDP and store immovable memory
On Thu, Jan 17, 2019 at 3:58 PM Chao Fan wrote:
>
> On Wed, Jan 16, 2019 at 03:28:52PM +0800, Kairui Song wrote:
> >On Mon, Jan 7, 2019 at 11:24 AM Chao Fan wrote:
> >>
> >> +
> >> +/* Determine RSDP, based on acpi_os_get_root_pointer(). */
> >> +static acpi_physical_address get_rsdp_addr(void)
> >> +{
> >> +        acpi_physical_address pa;
> >> +
> >> +        pa = get_acpi_rsdp();
> >> +
> >> +        if (!pa)
> >> +                pa = efi_get_rsdp_addr();
> >> +
> >> +        if (!pa)
> >> +                pa = bios_get_rsdp_addr();
> >> +
> >> +        return pa;
> >> +}
> >
> >acpi_rsdp might be provided by boot_params.acpi_rsdp_addr too,
> >it's introduced in ae7e1238e68f2a for Xen PVH guest and later moved to
> >boot_params in e6e094e053af,
> >kexec could use it to pass the RSDP to the second kernel as well. Please
> >check it as well to make sure it always works.
> >
>
> Hi Kairui,
>
> I saw the parsing code has been added to the kernel, but I didn't see
> where 'acpi_rsdp_addr' is filled in. If only you (KEXEC) use it,
> I can add "#ifdef CONFIG_KEXEC", but you said Xen will use it also,
> so I didn't add an ifdef to control it.
>
> I was trying to do as below:
>
> static inline acpi_physical_address get_boot_params_rsdp(void)
> {
>         return boot_params->acpi_rsdp_addr;
> }
>
> static acpi_physical_address get_rsdp_addr(void)
> {
>         bool boot_params_rsdp_exist;
>         acpi_physical_address pa;
>
>         pa = get_acpi_rsdp();
>
>         if (!pa)
>                 pa = get_boot_params_rsdp();
>
>         if (!pa) {
>                 pa = efi_get_rsdp_addr();
>                 boot_params_rsdp_exist = false;
>         }
>         else
>                 boot_params_rsdp_exist = true;
>
>         if (!pa)
>                 pa = bios_get_rsdp_addr();
>
>         if (pa && !boot_params_rsdp_exist)
>                 boot_params.acpi_rsdp_addr = pa;
>
>         return pa;
> }
>
> At the same time, I notice the kernel only parses it when
> "#ifdef CONFIG_ACPI"; we should keep in sync with the kernel, but since
> we are parsing SRAT, CONFIG_ACPI is needed for sure, so I am going to
> update the definition of EARLY_SRAT_PARSE:
>
> config EARLY_SRAT_PARSE
>         bool "EARLY SRAT parsing"
>         def_bool y
>         depends on RANDOMIZE_BASE && MEMORY_HOTREMOVE && ACPI
>
> Boris, what do you think about the update for the new acpi_rsdp_addr?
> If I misunderstand something, please let me know.
>
> Thanks,
> Chao Fan
>

Hi, thanks for considering the kexec usage, but I think
"boot_params_rsdp_exist" is not necessary; boot_params->acpi_rsdp_addr
should be either NULL or a valid value, and later initialization code
considers it a valid value if it's not NULL.

For the usage by Xen I'm not sure either; the info comes from the commit
message of ae7e1238e68f2a, which is also where boot_params.acpi_rsdp_addr
was first introduced. Let's cc Juergen as well.

--
Best Regards,
Kairui Song
Re: [PATCH v2 2/2] x86, kexec_file_load: make it work with efi=noruntime or efi=old_map
On Thu, Jan 17, 2019 at 3:51 PM Chao Fan wrote:
>
> On Thu, Jan 17, 2019 at 03:41:13PM +0800, Kairui Song wrote:
> >On Wed, Jan 16, 2019 at 5:46 PM Borislav Petkov wrote:
> >>
> >> On Wed, Jan 16, 2019 at 03:08:42PM +0800, Kairui Song wrote:
> >> > I didn't see a way to reuse things in that patch series, situation is
> >> > different, in that patch it needs to get RSDP in very early boot stage
> >> > so it did everything from scratch, in this patch kexec_file_load need
> >> > to get RSDP too, but everything is well setup so things are a lot
> >> > easier, just read from current boot_prams, efi and fallback to
> >> > acpi_find_root_pointer should be good.
> >>
> >> No no. Early code should find out that venerable RSDP thing once and
> >> will save it somewhere for further use. No gazillion parsings of it.
> >> Just once and share it with the rest of the code that needs it.
> >>
> >
> >How about we refill the boot_params.acpi_rsdp_addr if it is not valid
> >in early code, so it could be used as a reliable RSDP address source?
> >That should make things easier.
>
> I think it's OK.
> Try to read it; if we get the RSDP, use it.
> If not, search in EFI/BIOS/... and refill the RSDP to
> boot_params.acpi_rsdp_addr.
> By the way, I searched the kernel code and didn't find other code that
> fills and uses it; only you (KEXEC) are trying to fill it.
> If I miss something, please let me know.

Yes, kexec would read the RSDP again to pass it to the second kernel, but
only if EFI is disabled (efi=noruntime/old_map; otherwise the second
kernel gets the RSDP just fine). Not sure if any other component would
use it.

>
> Thanks,
> Chao Fan
>

--
Best Regards,
Kairui Song
Re: [PATCH v2 2/2] x86, kexec_file_load: make it work with efi=noruntime or efi=old_map
On Wed, Jan 16, 2019 at 5:46 PM Borislav Petkov wrote:
>
> On Wed, Jan 16, 2019 at 03:08:42PM +0800, Kairui Song wrote:
> > I didn't see a way to reuse things in that patch series, situation is
> > different, in that patch it needs to get RSDP in very early boot stage
> > so it did everything from scratch, in this patch kexec_file_load need
> > to get RSDP too, but everything is well setup so things are a lot
> > easier, just read from current boot_prams, efi and fallback to
> > acpi_find_root_pointer should be good.
>
> No no. Early code should find out that venerable RSDP thing once and
> will save it somewhere for further use. No gazillion parsings of it.
> Just once and share it with the rest of the code that needs it.
>

How about we refill the boot_params.acpi_rsdp_addr if it is not valid in
early code, so it could be used as a reliable RSDP address source? That
should make things easier.

But if early code should parse and store it, that should be done in
Chao's patch, or I can post another patch to do it if Chao's patch is
merged.
For now I think it is good to have something like this in this patch
series to always keep storing acpi_rsdp in late code;
acpi_os_get_root_pointer_late (maybe come up with a better name later)
could be used anytime to get RSDP with no extra parsing:

--- a/drivers/acpi/osl.c
+++ b/drivers/acpi/osl.c
@@ -180,8 +180,8 @@ void acpi_os_vprintf(const char *fmt, va_list args)
 #endif
 }

-#ifdef CONFIG_KEXEC
 static unsigned long acpi_rsdp;
+#ifdef CONFIG_KEXEC
 static int __init setup_acpi_rsdp(char *arg)
 {
         return kstrtoul(arg, 16, &acpi_rsdp);
@@ -189,28 +189,38 @@ static int __init setup_acpi_rsdp(char *arg)
 early_param("acpi_rsdp", setup_acpi_rsdp);
 #endif

+acpi_physical_address acpi_os_get_root_pointer_late(void) {
+        return acpi_rsdp;
+}
+
 acpi_physical_address __init acpi_os_get_root_pointer(void)
 {
         acpi_physical_address pa;

-#ifdef CONFIG_KEXEC
         if (acpi_rsdp)
                 return acpi_rsdp;
-#endif
+
         pa = acpi_arch_get_root_pointer();
-        if (pa)
+        if (pa) {
+                acpi_rsdp = pa;
                 return pa;
+        }

         if (efi_enabled(EFI_CONFIG_TABLES)) {
-                if (efi.acpi20 != EFI_INVALID_TABLE_ADDR)
+                if (efi.acpi20 != EFI_INVALID_TABLE_ADDR) {
+                        acpi_rsdp = efi.acpi20;
                         return efi.acpi20;
-                if (efi.acpi != EFI_INVALID_TABLE_ADDR)
+                }
+                if (efi.acpi != EFI_INVALID_TABLE_ADDR) {
+                        acpi_rsdp = efi.acpi;
                         return efi.acpi;
+                }
                 pr_err(PREFIX "System description tables not found\n");
         } else if (IS_ENABLED(CONFIG_ACPI_LEGACY_TABLES_LOOKUP)) {
                 acpi_find_root_pointer();
         }

+        acpi_rsdp = pa;
         return pa;
 }

> --
> Regards/Gruss,
> Boris.
>
> Good mailing practices for 400: avoid top-posting and trim the reply.

--
Best Regards,
Kairui Song
[PATCH v3 1/2] integrity, KEYS: add a reference to platform keyring
Currently, when loading a new kernel via the kexec_file_load syscall, the
kernel can verify the signed PE bzImage against .builtin_trusted_keys or
.secondary_trusted_keys. But the image could be signed with third-party
keys provided by the platform or firmware as an EFI variable (e.g.
stored in the MokListRT EFI variable), and those keys won't be available
in the keyrings mentioned above.

After commit 9dc92c45177a ('integrity: Define a trusted platform
keyring'), a .platform keyring was introduced to store the keys provided
by platform or firmware; this keyring is intended to be used for
verifying kernel images being loaded by the kexec_file_load syscall.
With a few follow-up commits, keys provided by firmware are loaded into
this keyring, and IMA-appraisal is able to use the keyring to verify
kernel images. IMA is currently the only user of that keyring.

This patch exposes the .platform keyring and makes it usable by other
components. For example, kexec_file_load could use the .platform keyring
to verify the kernel image's signature.

Suggested-by: Mimi Zohar
Signed-off-by: Kairui Song
---
 certs/system_keyring.c        | 9 +
 include/keys/system_keyring.h | 5 +
 security/integrity/digsig.c   | 6 ++
 3 files changed, 20 insertions(+)

diff --git a/certs/system_keyring.c b/certs/system_keyring.c
index 81728717523d..4690ef9cda8a 100644
--- a/certs/system_keyring.c
+++ b/certs/system_keyring.c
@@ -24,6 +24,9 @@ static struct key *builtin_trusted_keys;
 #ifdef CONFIG_SECONDARY_TRUSTED_KEYRING
 static struct key *secondary_trusted_keys;
 #endif
+#ifdef CONFIG_INTEGRITY_PLATFORM_KEYRING
+static struct key *platform_trusted_keys;
+#endif

 extern __initconst const u8 system_certificate_list[];
 extern __initconst const unsigned long system_certificate_list_size;
@@ -265,4 +268,10 @@ int verify_pkcs7_signature(const void *data, size_t len,
 }
 EXPORT_SYMBOL_GPL(verify_pkcs7_signature);

+#ifdef CONFIG_INTEGRITY_PLATFORM_KEYRING
+void __init set_platform_trusted_keys(struct key *keyring) {
+        platform_trusted_keys = keyring;
+}
+#endif
+
 #endif /* CONFIG_SYSTEM_DATA_VERIFICATION */

diff --git a/include/keys/system_keyring.h b/include/keys/system_keyring.h
index 359c2f936004..9e1b7849b6aa 100644
--- a/include/keys/system_keyring.h
+++ b/include/keys/system_keyring.h
@@ -61,5 +61,10 @@ static inline struct key *get_ima_blacklist_keyring(void)
 }
 #endif /* CONFIG_IMA_BLACKLIST_KEYRING */

+#ifdef CONFIG_INTEGRITY_PLATFORM_KEYRING
+
+extern void __init set_platform_trusted_keys(struct key* keyring);
+
+#endif /* CONFIG_INTEGRITY_PLATFORM_KEYRING */
 #endif /* _KEYS_SYSTEM_KEYRING_H */

diff --git a/security/integrity/digsig.c b/security/integrity/digsig.c
index f45d6edecf99..bfabc2a8111d 100644
--- a/security/integrity/digsig.c
+++ b/security/integrity/digsig.c
@@ -89,6 +89,12 @@ static int __integrity_init_keyring(const unsigned int id, key_perm_t perm,
                 keyring[id] = NULL;
         }

+#ifdef CONFIG_INTEGRITY_PLATFORM_KEYRING
+        if (id == INTEGRITY_KEYRING_PLATFORM) {
+                set_platform_trusted_keys(keyring[id]);
+        }
+#endif
+
         return err;
 }
--
2.20.1
[PATCH v3 0/2] let kexec_file_load use platform keyring to verify the kernel image
This patch series adds a .platform_trusted_keys in system_keyring as the
reference to the .platform keyring in the integrity subsystem; when the
platform keyring is being initialized it will be updated. So other
components could use this keyring as well.

This patch series also lets kexec_file_load use the platform keyring as
a fallback if it fails to verify the image against the secondary
keyring, making it possible to load a kernel signed by a third-party key
if that key is imported in the firmware.

After this patch, kexec_file_load will be able to verify a signed PE
bzImage using keys in the platform keyring.

Tested in a VM with a locally signed kernel: signed with pesign and the
cert imported into EFI's MokList variable.

Kairui Song (2):
  integrity, KEYS: add a reference to platform keyring
  kexec, KEYS: Make use of platform keyring for signature verify

Update from V2:
  - Use IS_ENABLED in kexec_file_load to judge if platform_trusted_keys
    should be used for verifying the image, as suggested by Mimi Zohar

Update from V1:
  - Make platform_trusted_keys static, and update commit message as
    suggested by Mimi Zohar
  - Always check if the platform keyring is initialized before using it

Kairui Song (2):
  integrity, KEYS: add a reference to platform keyring
  kexec, KEYS: Make use of platform keyring for signature verify

 arch/x86/kernel/kexec-bzimage64.c | 13 ++---
 certs/system_keyring.c            | 22 +-
 include/keys/system_keyring.h     |  5 +
 include/linux/verification.h      |  1 +
 security/integrity/digsig.c       |  6 ++
 5 files changed, 43 insertions(+), 4 deletions(-)

--
2.20.1
[PATCH v3 2/2] kexec, KEYS: Make use of platform keyring for signature verify
With KEXEC_BZIMAGE_VERIFY_SIG enabled, kexec_file_load needs to verify
the kernel image. The image might be signed with third-party keys, and
those keys could be stored in firmware and then loaded into the
.platform keyring. Now that we have the symbol platform_trusted_keys as
the reference to the .platform keyring, this patch makes use of it and
allows kexec_file_load to verify the image against keys in the .platform
keyring.

This commit adds a VERIFY_USE_PLATFORM_KEYRING, similar to the earlier
VERIFY_USE_SECONDARY_KEYRING, indicating that verify_pkcs7_signature
should verify the signature using the platform keyring. Also, decrease
the error message log level when verification fails with -ENOKEY, so
that if the caller tried multiple times with different keyrings it
won't generate extra noise.

Signed-off-by: Kairui Song
---
 arch/x86/kernel/kexec-bzimage64.c | 13 ++---
 certs/system_keyring.c            | 13 -
 include/linux/verification.h      |  1 +
 3 files changed, 23 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/kexec-bzimage64.c b/arch/x86/kernel/kexec-bzimage64.c
index 7d97e432cbbc..2c007abd3d40 100644
--- a/arch/x86/kernel/kexec-bzimage64.c
+++ b/arch/x86/kernel/kexec-bzimage64.c
@@ -534,9 +534,16 @@ static int bzImage64_cleanup(void *loader_data)
 #ifdef CONFIG_KEXEC_BZIMAGE_VERIFY_SIG
 static int bzImage64_verify_sig(const char *kernel, unsigned long kernel_len)
 {
-        return verify_pefile_signature(kernel, kernel_len,
-                                       VERIFY_USE_SECONDARY_KEYRING,
-                                       VERIFYING_KEXEC_PE_SIGNATURE);
+        int ret;
+        ret = verify_pefile_signature(kernel, kernel_len,
+                                      VERIFY_USE_SECONDARY_KEYRING,
+                                      VERIFYING_KEXEC_PE_SIGNATURE);
+        if (ret == -ENOKEY && IS_ENABLED(CONFIG_INTEGRITY_PLATFORM_KEYRING)) {
+                ret = verify_pefile_signature(kernel, kernel_len,
+                                              VERIFY_USE_PLATFORM_KEYRING,
+                                              VERIFYING_KEXEC_PE_SIGNATURE);
+        }
+        return ret;
 }
 #endif

diff --git a/certs/system_keyring.c b/certs/system_keyring.c
index 4690ef9cda8a..7085c286f4bd 100644
--- a/certs/system_keyring.c
+++ b/certs/system_keyring.c
@@ -240,11 +240,22 @@ int verify_pkcs7_signature(const void *data, size_t len,
 #else
         trusted_keys = builtin_trusted_keys;
 #endif
+        } else if (trusted_keys == VERIFY_USE_PLATFORM_KEYRING) {
+#ifdef CONFIG_INTEGRITY_PLATFORM_KEYRING
+                trusted_keys = platform_trusted_keys;
+#else
+                trusted_keys = NULL;
+#endif
+                if (!trusted_keys) {
+                        ret = -ENOKEY;
+                        pr_devel("PKCS#7 platform keyring is not available\n");
+                        goto error;
+                }
         }
         ret = pkcs7_validate_trust(pkcs7, trusted_keys);
         if (ret < 0) {
                 if (ret == -ENOKEY)
-                        pr_err("PKCS#7 signature not signed with a trusted key\n");
+                        pr_devel("PKCS#7 signature not signed with a trusted key\n");
                 goto error;
         }

diff --git a/include/linux/verification.h b/include/linux/verification.h
index cfa4730d607a..018fb5f13d44 100644
--- a/include/linux/verification.h
+++ b/include/linux/verification.h
@@ -17,6 +17,7 @@
  * should be used.
  */
 #define VERIFY_USE_SECONDARY_KEYRING ((struct key *)1UL)
+#define VERIFY_USE_PLATFORM_KEYRING ((struct key *)2UL)

 /*
  * The use to which an asymmetric key is being put.
--
2.20.1
Re: [PATCH v15 5/6] x86/boot: Parse SRAT address from RSDP and store immovable memory
> +		}
> +		table = (struct acpi_subtable_header *)
> +			((unsigned long)table + table->length);
> +	}
> +	num_immovable_mem = i;
> +}
> diff --git a/arch/x86/boot/compressed/kaslr.c b/arch/x86/boot/compressed/kaslr.c
> index 9ed9709d9947..b251572e77af 100644
> --- a/arch/x86/boot/compressed/kaslr.c
> +++ b/arch/x86/boot/compressed/kaslr.c
> @@ -87,10 +87,6 @@ static unsigned long get_boot_seed(void)
>  #define KASLR_COMPRESSED_BOOT
>  #include "../../lib/kaslr.c"
>
> -struct mem_vector {
> -	unsigned long long start;
> -	unsigned long long size;
> -};
>
>  /* Only supporting at most 4 unusable memmap regions with kaslr */
>  #define MAX_MEMMAP_REGIONS 4
> diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h
> index a1d5918765f3..b49748366a5b 100644
> --- a/arch/x86/boot/compressed/misc.h
> +++ b/arch/x86/boot/compressed/misc.h
> @@ -77,6 +77,11 @@ void choose_random_location(unsigned long input,
>  			    unsigned long *output,
>  			    unsigned long output_size,
>  			    unsigned long *virt_addr);
> +struct mem_vector {
> +	unsigned long long start;
> +	unsigned long long size;
> +};
> +
>  /* cpuflags.c */
>  bool has_cpuflag(int flag);
>  #else
> @@ -116,3 +121,17 @@ static inline void console_init(void)
>  void set_sev_encryption_mask(void);
>
>  #endif
> +
> +/* acpi.c */
> +#ifdef CONFIG_RANDOMIZE_BASE
> +/* Amount of immovable memory regions */
> +int num_immovable_mem;
> +#endif
> +
> +#ifdef CONFIG_EARLY_SRAT_PARSE
> +void get_immovable_mem(void);
> +#else
> +static void get_immovable_mem(void)
> +{
> +}
> +#endif
> --
> 2.20.1
>

--
Best Regards,
Kairui Song