Re: [PATCH] tracing: fix UAF caused by memory ordering issue

2023-11-14 Thread Kairui Song
Mark Rutland wrote on Tue, Nov 14, 2023 at 06:17:
>

Hi Mark and Steven,

Thank you so much for the detailed comments.

> On Sun, Nov 12, 2023 at 11:00:30PM +0800, Kairui Song wrote:
> > From: Kairui Song 
> >
> > Following kernel panic was observed when doing ftrace stress test:
>
> Can you share some more details:
>
> * What test specifically are you running? Can you share this so that others can
>   try to reproduce the issue?

Yes, the panic happened when running the LTP ftrace stress test:
https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/tracing/ftrace_test/ftrace_stress_test.sh

>
> * Which machines are you testing on (i.e. which CPU microarchitecture is this
>   seen with) ?

The panic was seen on an ARM64 VM; lscpu output:
Architecture:   aarch64
  CPU op-mode(s):   64-bit
  Byte Order:   Little Endian
CPU(s): 4
  On-line CPU(s) list:  0-3
Vendor ID:  HiSilicon
  BIOS Vendor ID:   QEMU
  Model name:   Kunpeng-920
BIOS Model name:virt-rhel8.6.0  CPU @ 2.0GHz
BIOS CPU family:1
Model:  0
Thread(s) per core: 1
Core(s) per socket: 1
Socket(s):  4
Stepping:   0x1
BogoMIPS:   200.00
Flags:  fp asimd evtstrm aes pmull sha1 sha2 crc32
atomics fphp asimdhp cpuid asimdrdm jscvt fcma dcpop asimddp asimdfhm

The host machine is a Kunpeng-920 with 4 NUMA nodes and 128 cores.

>
> * Which compiler are you using?

gcc 12.3.1

>
> * The log shows this is with v6.1.61+. Can you reproduce this with a mainline
>   kernel? e.g. v6.6 or v6.7-rc1?

It's reproducible with the LTS kernel; I haven't tested mainline yet. I'll
try to reproduce this with the latest mainline, but due to the low
reproducibility this may take a while.

>
> > Unable to handle kernel paging request at virtual address 9699b0f8ece28240
> > Mem abort info:
> >   ESR = 0x9604
> >   EC = 0x25: DABT (current EL), IL = 32 bits
> >   SET = 0, FnV = 0
> >   EA = 0, S1PTW = 0
> >   FSC = 0x04: level 0 translation fault
> > Data abort info:
> >   ISV = 0, ISS = 0x0004
> >   CM = 0, WnR = 0
> > [9699b0f8ece28240] address between user and kernel address ranges
> > Internal error: Oops: 9604 [#1] SMP
> > Modules linked in: rpcrdma rdma_cm iw_cm ib_cm ib_core rfkill vfat fat loop 
> > fuse nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables ext4 mbcache 
> > jbd2 sr_mod cdrom crct10dif_ce ghash_ce sha2_ce virtio_gpu virtio_dma_buf 
> > drm_shmem_helper virtio_blk drm_kms_helper syscopyarea sysfillrect 
> > sysimgblt fb_sys_fops virtio_console sha256_arm64 sha1_ce drm virtio_scsi 
> > i2c_core virtio_net net_failover failover virtio_mmio dm_multipath dm_mod 
> > autofs4 [last unloaded: ipmi_msghandler]
> > CPU: 0 PID: 499719 Comm: sh Kdump: loaded Not tainted 6.1.61+ #2
> > Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
> > pstate: 6045 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> > pc : __kmem_cache_alloc_node+0x1dc/0x2e4
> > lr : __kmem_cache_alloc_node+0xac/0x2e4
> > sp : 8ad23aa0
> > x29: 8ad23ab0 x28: 0004052b8000 x27: c513863b
> > x26: 0040 x25: c51384f21ca4 x24: 
> > x23: d615521430b1b1a5 x22: c51386044770 x21: 
> > x20: 0cc0 x19: c0001200 x18: 
> > x17:  x16:  x15: e65e1630
> > x14: 0004 x13: c513863e67a0 x12: c513863af6d8
> > x11: 0001 x10: 8ad23aa0 x9 : c51385058078
> > x8 : 0018 x7 : 0001 x6 : 0010
> > x5 : c09c2280 x4 : c51384f21ca4 x3 : 0040
> > x2 : 9699b0f8ece28240 x1 : c09c2280 x0 : 9699b0f8ece28200
> > Call trace:
> >  __kmem_cache_alloc_node+0x1dc/0x2e4
> >  __kmalloc+0x6c/0x1c0
> >  func_add+0x1a4/0x200
> >  tracepoint_add_func+0x70/0x230
> >  tracepoint_probe_register+0x6c/0xb4
> >  trace_event_reg+0x8c/0xa0
> >  __ftrace_event_enable_disable+0x17c/0x440
> >  __ftrace_set_clr_event_nolock+0xe0/0x150
> >  system_enable_write+0xe0/0x114
> >  vfs_write+0xd0/0x2dc
> >  ksys_write+0x78/0x110
> >  __arm64_sys_write+0x24/0x30
> >  invoke_syscall.constprop.0+0x58/0xf0
> >  el0_svc_common.constprop.0+0x54/0x160
> >  do_el0_svc+0x2c/0x60
> >  el0_svc+0x40/0x1ac
> >  el0t_64_sync_handler+0xf4/0x120
> >  el0t_64_sync+0x19c/0x1a0
> > Code: b9402a63 f9405e77 8b030002 d5384101 (f8636803)
> >
> > Panic was caused by corrupted freelist pointer. After more debugging,
> > I found the root

[PATCH] tracing: fix UAF caused by memory ordering issue

2023-11-12 Thread Kairui Song
From: Kairui Song 

The following kernel panic was observed when doing an ftrace stress test:

Unable to handle kernel paging request at virtual address 9699b0f8ece28240
Mem abort info:
  ESR = 0x9604
  EC = 0x25: DABT (current EL), IL = 32 bits
  SET = 0, FnV = 0
  EA = 0, S1PTW = 0
  FSC = 0x04: level 0 translation fault
Data abort info:
  ISV = 0, ISS = 0x0004
  CM = 0, WnR = 0
[9699b0f8ece28240] address between user and kernel address ranges
Internal error: Oops: 9604 [#1] SMP
Modules linked in: rpcrdma rdma_cm iw_cm ib_cm ib_core rfkill vfat fat loop 
fuse nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables ext4 mbcache jbd2 
sr_mod cdrom crct10dif_ce ghash_ce sha2_ce virtio_gpu virtio_dma_buf 
drm_shmem_helper virtio_blk drm_kms_helper syscopyarea sysfillrect sysimgblt 
fb_sys_fops virtio_console sha256_arm64 sha1_ce drm virtio_scsi i2c_core 
virtio_net net_failover failover virtio_mmio dm_multipath dm_mod autofs4 [last 
unloaded: ipmi_msghandler]
CPU: 0 PID: 499719 Comm: sh Kdump: loaded Not tainted 6.1.61+ #2
Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
pstate: 6045 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : __kmem_cache_alloc_node+0x1dc/0x2e4
lr : __kmem_cache_alloc_node+0xac/0x2e4
sp : 8ad23aa0
x29: 8ad23ab0 x28: 0004052b8000 x27: c513863b
x26: 0040 x25: c51384f21ca4 x24: 
x23: d615521430b1b1a5 x22: c51386044770 x21: 
x20: 0cc0 x19: c0001200 x18: 
x17:  x16:  x15: e65e1630
x14: 0004 x13: c513863e67a0 x12: c513863af6d8
x11: 0001 x10: 8ad23aa0 x9 : c51385058078
x8 : 0018 x7 : 0001 x6 : 0010
x5 : c09c2280 x4 : c51384f21ca4 x3 : 0040
x2 : 9699b0f8ece28240 x1 : c09c2280 x0 : 9699b0f8ece28200
Call trace:
 __kmem_cache_alloc_node+0x1dc/0x2e4
 __kmalloc+0x6c/0x1c0
 func_add+0x1a4/0x200
 tracepoint_add_func+0x70/0x230
 tracepoint_probe_register+0x6c/0xb4
 trace_event_reg+0x8c/0xa0
 __ftrace_event_enable_disable+0x17c/0x440
 __ftrace_set_clr_event_nolock+0xe0/0x150
 system_enable_write+0xe0/0x114
 vfs_write+0xd0/0x2dc
 ksys_write+0x78/0x110
 __arm64_sys_write+0x24/0x30
 invoke_syscall.constprop.0+0x58/0xf0
 el0_svc_common.constprop.0+0x54/0x160
 do_el0_svc+0x2c/0x60
 el0_svc+0x40/0x1ac
 el0t_64_sync_handler+0xf4/0x120
 el0t_64_sync+0x19c/0x1a0
Code: b9402a63 f9405e77 8b030002 d5384101 (f8636803)

Panic was caused by a corrupted freelist pointer. After more debugging,
I found the root cause is a UAF of a slab-allocated object in ftrace,
introduced by commit eecb91b9f98d ("tracing: Fix memleak due to race
between current_tracer and trace"). So far it's only reproducible
on some ARM64 machines. The UAF and free stacks are:

UAF:
kasan_report+0xa8/0x1bc
__asan_report_load8_noabort+0x28/0x3c
print_graph_function_flags+0x524/0x5a0
print_graph_function_event+0x28/0x40
print_trace_line+0x5c4/0x1030
s_show+0xf0/0x460
seq_read_iter+0x930/0xf5c
seq_read+0x130/0x1d0
vfs_read+0x288/0x840
ksys_read+0x130/0x270
__arm64_sys_read+0x78/0xac
invoke_syscall.constprop.0+0x90/0x224
do_el0_svc+0x118/0x3dc
el0_svc+0x54/0x120
el0t_64_sync_handler+0xf4/0x120
el0t_64_sync+0x19c/0x1a0

Freed by:
kasan_save_free_info+0x38/0x5c
__kasan_slab_free+0xe8/0x154
slab_free_freelist_hook+0xfc/0x1e0
__kmem_cache_free+0x138/0x260
kfree+0xd0/0x1d0
graph_trace_close+0x60/0x90
s_start+0x610/0x910
seq_read_iter+0x274/0xf5c
seq_read+0x130/0x1d0
vfs_read+0x288/0x840
ksys_read+0x130/0x270
__arm64_sys_read+0x78/0xac
invoke_syscall.constprop.0+0x90/0x224
do_el0_svc+0x118/0x3dc
el0_svc+0x54/0x120
el0t_64_sync_handler+0xf4/0x120
el0t_64_sync+0x19c/0x1a0

Although s_start and s_show are serialized by the seq_file mutex,
the tracer struct copy in s_start introduced by the commit mentioned
above is neither atomic nor guaranteed to be seen by all CPUs. So the
following scenario is possible (and actually happened):

CPU 1                                  CPU 2
seq_read_iter                          seq_read_iter
  mutex_lock(&m->lock);
  s_start
    // iter->trace is graph_trace
    iter->trace->close(iter);
      graph_trace_close
        kfree(data) <- *** data released here ***
    // copy current_trace to iter->trace,
    // but not synced to CPU 2
    *iter->trace = *tr->current_trace
  ... (goes on)
  mutex_unlock(&m->lock);
                                       mutex_lock(&m->lock);
                                       ... (s_start and other work)
                                       s_show
                                         print_trace_line(iter)
                                           // iter->trace is still the
                                           // old value (graph_trace)
                                           iter->trace->print_line()
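
To make the ordering hazard concrete, here is a minimal standalone C sketch
of the unsafe pattern (simplified and hypothetical; these are not the real
ftrace types, and this is an illustration rather than the proposed fix):
a multi-word struct assignment is a plain memcpy with no atomicity or
barriers, so a reader on another CPU can keep seeing the stale function
pointer and the freed data it refers to.

/* Hypothetical illustration only -- not the actual ftrace structures. */
struct tracer {
	void (*print_line)(void *data);
	void *data;			/* freed by ->close() */
};

struct tracer *iter_trace;		/* shared between CPUs */

void cpu1_update(struct tracer *current_trace)
{
	/* ->close() has already kfree()d iter_trace->data here */
	*iter_trace = *current_trace;	/* plain struct copy: not atomic and
					 * unordered, so other CPUs may keep
					 * seeing the old fields */
}

void cpu2_show(void)
{
	/* on a weakly ordered CPU (e.g. ARM64) this can still call the old
	 * graph_trace callback against freed data: the UAF shown above */
	iter_trace->print_line(iter_trace->data);
}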
 

[PATCH] efi: memmap insertion should adjust the vaddr as well

2021-02-24 Thread Kairui Song
Currently when efi_memmap_insert is called, only the
physical memory addresses are re-calculated. The virtual
addresses of the split entries are untouched.

If any later operation depends on the virtual address info, things
will go wrong. One case where it may fail is kexec on x86: after kexec,
EFI is already in virtual mode, and the kernel simply does a fixed
mapping reusing the recorded virtual address. If the virtual address
is incorrect, the mapping will be invalid.

Update the virtual address as well when inserting a memmap entry to
fix this potential issue.
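
For illustration, here is a standalone sketch of the adjustment (simplified
types, not the kernel code): capture the entry's virt-phys offset before
the split, then re-apply it to every piece whose physical start moves.

#include <stdint.h>

struct desc { uint64_t phys_addr, virt_addr, num_pages; };

/* Move one split piece to a new physical start while keeping the
 * virt-phys offset the original entry had; entries that were never
 * mapped (virt_addr == 0) are left untouched. */
static void split_piece(struct desc *md, uint64_t new_phys)
{
	uint64_t virt_offset = md->virt_addr
		? md->virt_addr - md->phys_addr
		: (uint64_t)-1;

	md->phys_addr = new_phys;
	if (virt_offset != (uint64_t)-1)
		md->virt_addr = md->phys_addr + virt_offset;
}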

Signed-off-by: Kairui Song 
---
 drivers/firmware/efi/memmap.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/drivers/firmware/efi/memmap.c b/drivers/firmware/efi/memmap.c
index 2ff1883dc788..de5c545b2074 100644
--- a/drivers/firmware/efi/memmap.c
+++ b/drivers/firmware/efi/memmap.c
@@ -292,7 +292,7 @@ void __init efi_memmap_insert(struct efi_memory_map *old_memmap, void *buf,
 {
u64 m_start, m_end, m_attr;
efi_memory_desc_t *md;
-   u64 start, end;
+   u64 start, end, virt_offset;
void *old, *new;
 
/* modifying range */
@@ -321,6 +321,11 @@ void __init efi_memmap_insert(struct efi_memory_map *old_memmap, void *buf,
start = md->phys_addr;
end = md->phys_addr + (md->num_pages << EFI_PAGE_SHIFT) - 1;
 
+   if (md->virt_addr)
+   virt_offset = md->virt_addr - md->phys_addr;
+   else
+   virt_offset = -1;
+
if (m_start <= start && end <= m_end)
md->attribute |= m_attr;
 
@@ -337,6 +342,8 @@ void __init efi_memmap_insert(struct efi_memory_map *old_memmap, void *buf,
md->phys_addr = m_end + 1;
md->num_pages = (end - md->phys_addr + 1) >>
EFI_PAGE_SHIFT;
+   if (virt_offset != -1)
+   md->virt_addr = md->phys_addr + virt_offset;
}
 
if ((start < m_start && m_start < end) && m_end < end) {
@@ -351,6 +358,8 @@ void __init efi_memmap_insert(struct efi_memory_map *old_memmap, void *buf,
md->phys_addr = m_start;
md->num_pages = (m_end - m_start + 1) >>
EFI_PAGE_SHIFT;
+   if (virt_offset != -1)
+   md->virt_addr = md->phys_addr + virt_offset;
/* last part */
new += old_memmap->desc_size;
memcpy(new, old, old_memmap->desc_size);
@@ -358,6 +367,8 @@ void __init efi_memmap_insert(struct efi_memory_map *old_memmap, void *buf,
md->phys_addr = m_end + 1;
md->num_pages = (end - m_end) >>
EFI_PAGE_SHIFT;
+   if (virt_offset != -1)
+   md->virt_addr = md->phys_addr + virt_offset;
}
 
if ((start < m_start && m_start < end) &&
@@ -373,6 +384,8 @@ void __init efi_memmap_insert(struct efi_memory_map *old_memmap, void *buf,
md->num_pages = (end - md->phys_addr + 1) >>
EFI_PAGE_SHIFT;
md->attribute |= m_attr;
+   if (virt_offset != -1)
+   md->virt_addr = md->phys_addr + virt_offset;
}
}
 }
-- 
2.29.2



Re: [PATCH v4 1/1] kernel/crash_core: Add crashkernel=auto for vmcore creation

2021-02-23 Thread Kairui Song
On Wed, Feb 24, 2021 at 1:45 AM Saeed Mirzamohammadi wrote:
>
> This adds crashkernel=auto feature to configure reserved memory for
> vmcore creation. CONFIG_CRASH_AUTO_STR is defined to be set for
> different kernel distributions and different archs based on their
> needs.
>
> Signed-off-by: Saeed Mirzamohammadi 
> Signed-off-by: John Donnelly 
> Tested-by: John Donnelly 
> ---
>  Documentation/admin-guide/kdump/kdump.rst |  3 ++-
>  .../admin-guide/kernel-parameters.txt |  6 ++
>  arch/Kconfig  | 20 +++
>  kernel/crash_core.c   |  7 +++
>  4 files changed, 35 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/admin-guide/kdump/kdump.rst b/Documentation/admin-guide/kdump/kdump.rst
> index 75a9dd98e76e..ae030111e22a 100644
> --- a/Documentation/admin-guide/kdump/kdump.rst
> +++ b/Documentation/admin-guide/kdump/kdump.rst
> @@ -285,7 +285,8 @@ This would mean:
>  2) if the RAM size is between 512M and 2G (exclusive), then reserve 64M
>  3) if the RAM size is larger than 2G, then reserve 128M
>
> -
> +Or you can use crashkernel=auto to choose the crash kernel memory size
> +based on the recommended configuration set for each arch.
>
>  Boot into System Kernel
>  ===
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index 9e3cdb271d06..a5deda5c85fe 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -747,6 +747,12 @@
> a memory unit (amount[KMG]). See also
> Documentation/admin-guide/kdump/kdump.rst for an example.
>
> +   crashkernel=auto
> +   [KNL] This parameter will set the reserved memory for
> +   the crash kernel based on the value of the CRASH_AUTO_STR
> +   that is the best effort estimation for each arch. See also
> +   arch/Kconfig for further details.
> +
> crashkernel=size[KMG],high
> [KNL, X86-64] range could be above 4G. Allow kernel
> to allocate physical memory region from top, so could
> diff --git a/arch/Kconfig b/arch/Kconfig
> index 24862d15f3a3..23d047548772 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -14,6 +14,26 @@ menu "General architecture-dependent options"
>  config CRASH_CORE
> bool
>
> +config CRASH_AUTO_STR
> +   string "Memory reserved for crash kernel"
> +   depends on CRASH_CORE
> +   default "1G-64G:128M,64G-1T:256M,1T-:512M"
> +   help
> + This configures the reserved memory dependent
> + on the value of System RAM. The syntax is:
> + crashkernel=<range1>:<size1>[,<range2>:<size2>,...][@offset]
> + range=start-[end]
> +
> + For example:
> + crashkernel=512M-2G:64M,2G-:128M
> +
> + This would mean:
> +
> + 1) if the RAM is smaller than 512M, then don't reserve anything
> +(this is the "rescue" case)
> + 2) if the RAM size is between 512M and 2G (exclusive), then reserve 64M
> + 3) if the RAM size is larger than 2G, then reserve 128M
> +
>  config KEXEC_CORE
> select CRASH_CORE
> bool
> diff --git a/kernel/crash_core.c b/kernel/crash_core.c
> index 825284baaf46..90f9e4bb6704 100644
> --- a/kernel/crash_core.c
> +++ b/kernel/crash_core.c
> @@ -7,6 +7,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>
>  #include 
>  #include 
> @@ -250,6 +251,12 @@ static int __init __parse_crashkernel(char *cmdline,
> if (suffix)
> return parse_crashkernel_suffix(ck_cmdline, crash_size,
> suffix);
> +#ifdef CONFIG_CRASH_AUTO_STR
> +   if (strncmp(ck_cmdline, "auto", 4) == 0) {
> +   ck_cmdline = CONFIG_CRASH_AUTO_STR;
> +   pr_info("Using crashkernel=auto, the size chosen is a best effort estimation.\n");
> +   }
> +#endif
> /*
>  * if the commandline contains a ':', then that's the extended
>  * syntax -- if not, it must be the classic syntax
> --
> 2.27.0
>
>
> ___
> kexec mailing list
> ke...@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/kexec
>

Thanks for helping push crashkernel=auto upstream.
This patch works well.

Tested-by: Kairui Song 


--
Best Regards,
Kairui Song



Re: [PATCH v3 1/1] kernel/crash_core: Add crashkernel=auto for vmcore creation

2021-02-23 Thread Kairui Song
> > @@ -250,6 +251,12 @@ static int __init __parse_crashkernel(char *cmdline,
> >   if (suffix)
> >   return parse_crashkernel_suffix(ck_cmdline, crash_size,
> >   suffix);
> > +#ifdef CONFIG_CRASH_AUTO_STR
> > + if (strncmp(ck_cmdline, "auto", 4) == 0) {
> > + ck_cmdline = CONFIG_CRASH_AUTO_STR;
> > + pr_info("Using crashkernel=auto, the size chosen is a best effort estimation.\n");
> > + }
> > +#endif
> >   /*
> >* if the commandline contains a ':', then that's the extended
> >* syntax -- if not, it must be the classic syntax
> > --
> > 2.27.0
> >
>
>
> ___
> kexec mailing list
> ke...@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/kexec
>


-- 
Best Regards,
Kairui Song



Re: [PATCH 1/1] kernel/crash_core.c - Add crashkernel=auto for x86 and ARM

2020-11-20 Thread Kairui Song
On Fri, Nov 20, 2020 at 4:28 AM Saeed Mirzamohammadi wrote:
>
> Hi,
>
> And I think crashkernel=auto could be used as an indicator that user
> want the kernel to control the crashkernel size, so some further work
> could be done to adjust the crashkernel more accordingly. eg. when
> memory encryption is enabled, increase the crashkernel value for the
> auto estimation, as it's known to consume more crashkernel memory.
>
> Thanks for the suggestion! I tried to keep it simple and leave it to the user 
> to change Kconfig in case a different range is needed. Based on experience, 
> these ranges work well for most of the regular cases.

Yes, I think the current implementation is a very good start.

There are some use cases where the kernel is expected to reserve more memory, like:
- when memory encryption is enabled, an extra swiotlb-sized chunk of memory
should be reserved
- on ppc, fadump will expect more memory to be reserved

I believe there are a lot more cases like these.
I tried to come up with some patches to let the kernel reserve more
memory automatically when such conditions are detected, but changing
the value the user specified with crashkernel= is really weird.

But if we have crashkernel=auto, then having the kernel automatically
reserve more memory makes sense, as shown in the sketch below.
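
A purely hypothetical sketch of what that could look like (every name
below is invented for illustration; no such patch has been posted):

/* Hypothetical: adjust the crashkernel=auto estimation for features
 * known to consume extra crashkernel memory. All names are made up. */
static unsigned long long crash_auto_adjust(unsigned long long size)
{
	if (mem_encrypt_is_active())		/* assumed feature probe */
		size += 64ULL << 20;		/* extra room for swiotlb */
	if (fadump_is_expected())		/* assumed, ppc firmware dump */
		size += 128ULL << 20;
	return size;
}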

> But why not make it arch-independent? This crashkernel=auto idea
> should simply work with every arch.
>
>
> Thanks! I’ll be making it arch-independent in the v2 patch.
>
>
> #include 
> #include 
> @@ -41,6 +42,15 @@ static int __init parse_crashkernel_mem(char *cmdline,
>unsigned long long *crash_base)
> {
>char *cur = cmdline, *tmp;
> +   unsigned long long total_mem = system_ram;
> +
> +   /*
> +* Firmware sometimes reserves some memory regions for its own use,
> +* so we get less than the actual system memory size.
> +* Work around this by rounding up the total size to 128M, which is
> +* enough for most test cases.
> +*/
> +   total_mem = roundup(total_mem, SZ_128M);
>
>
> I think this rounding may be better moved to the arch-specific part
> where parse_crashkernel is called?
>
>
> Thanks for the suggestion. Could you please elaborate on why we need to do that?

Every arch gets its total memory value using different methods
(just check every parse_crashkernel call; the system_ram param is
filled in many different ways), so I'm really not sure this
rounding is always suitable.
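
For reference, a tiny standalone program showing what the
roundup(total_mem, SZ_128M) in question does (the kernel macros are
re-created here for illustration):

#include <stdio.h>

#define SZ_128M (128ULL << 20)
/* same definition as the kernel's roundup() */
#define roundup(x, y) ((((x) + ((y) - 1)) / (y)) * (y))

int main(void)
{
	/* e.g. firmware reserved ~80M, so the kernel sees 16304M, not 16G */
	unsigned long long ram = 16304ULL << 20;

	printf("%llu MiB\n", roundup(ram, SZ_128M) >> 20);	/* 16384 MiB */
	return 0;
}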

>
> Thanks,
> Saeed
>
>
--
Best Regards,
Kairui Song



Re: [PATCH 1/1] kernel/crash_core.c - Add crashkernel=auto for x86 and ARM

2020-11-18 Thread Kairui Song
  Enable bzImage signature verification support.
>
> -config CRASH_DUMP
> +menuconfig CRASH_DUMP
> bool "kernel crash dumps"
> depends on X86_64 || (X86_32 && HIGHMEM)
> help
> @@ -2049,6 +2049,30 @@ config CRASH_DUMP
>   (CONFIG_RELOCATABLE=y).
>   For more details see Documentation/admin-guide/kdump/kdump.rst
>
> +if CRASH_DUMP
> +
> +config CRASH_AUTO_STR
> +string "Memory reserved for crash kernel" if X86_64
> +   depends on CRASH_DUMP
> +default "1G-64G:128M,64G-1T:256M,1T-:512M"
> +   help
> + This configures the reserved memory dependent
> + on the value of System RAM. The syntax is:
> + crashkernel=<range1>:<size1>[,<range2>:<size2>,...][@offset]
> + range=start-[end]
> +
> + For example:
> + crashkernel=512M-2G:64M,2G-:128M
> +
> + This would mean:
> +
> + 1) if the RAM is smaller than 512M, then don't reserve anything
> +(this is the "rescue" case)
> + 2) if the RAM size is between 512M and 2G (exclusive), then reserve 64M
> + 3) if the RAM size is larger than 2G, then reserve 128M
> +
> +endif # CRASH_DUMP
> +
>  config KEXEC_JUMP
> bool "kexec jump"
> depends on KEXEC && HIBERNATION
> diff --git a/arch/x86/configs/x86_64_defconfig b/arch/x86/configs/x86_64_defconfig
> index 9936528e1939..7a87fbecf40b 100644
> --- a/arch/x86/configs/x86_64_defconfig
> +++ b/arch/x86/configs/x86_64_defconfig
> @@ -33,6 +33,7 @@ CONFIG_EFI_MIXED=y
>  CONFIG_HZ_1000=y
>  CONFIG_KEXEC=y
>  CONFIG_CRASH_DUMP=y
> +# CONFIG_CRASH_AUTO_STR is not set
>  CONFIG_HIBERNATION=y
>  CONFIG_PM_DEBUG=y
>  CONFIG_PM_TRACE_RTC=y
> diff --git a/kernel/crash_core.c b/kernel/crash_core.c
> index 106e4500fd53..a44cd9cc12c4 100644
> --- a/kernel/crash_core.c
> +++ b/kernel/crash_core.c
> @@ -7,6 +7,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>
>  #include 
>  #include 
> @@ -41,6 +42,15 @@ static int __init parse_crashkernel_mem(char *cmdline,
> unsigned long long *crash_base)
>  {
> char *cur = cmdline, *tmp;
> +   unsigned long long total_mem = system_ram;
> +
> +   /*
> +* Firmware sometimes reserves some memory regions for its own use,
> +* so we get less than the actual system memory size.
> +* Work around this by rounding up the total size to 128M, which is
> +* enough for most test cases.
> +*/
> +   total_mem = roundup(total_mem, SZ_128M);

I think this rounding may be better moved to the arch-specific part
where parse_crashkernel is called?

>
> /* for each entry of the comma-separated list */
> do {
> @@ -85,13 +95,13 @@ static int __init parse_crashkernel_mem(char *cmdline,
> return -EINVAL;
> }
> cur = tmp;
> -   if (size >= system_ram) {
> +   if (size >= total_mem) {
> pr_warn("crashkernel: invalid size\n");
> return -EINVAL;
> }
>
> /* match ? */
> -   if (system_ram >= start && system_ram < end) {
> +   if (total_mem >= start && total_mem < end) {
> *crash_size = size;
> break;
> }
> @@ -250,6 +260,12 @@ static int __init __parse_crashkernel(char *cmdline,
> if (suffix)
> return parse_crashkernel_suffix(ck_cmdline, crash_size,
> suffix);
> +#ifdef CONFIG_CRASH_AUTO_STR
> +   if (strncmp(ck_cmdline, "auto", 4) == 0) {
> +   ck_cmdline = CONFIG_CRASH_AUTO_STR;
> > +   pr_info("Using crashkernel=auto, the size chosen is a best effort estimation.\n");
> +   }
> +#endif
> /*
>  * if the commandline contains a ':', then that's the extended
>  * syntax -- if not, it must be the classic syntax
> --
> 2.18.4
>


--
Best Regards,
Kairui Song



[tip: x86/urgent] x86/kexec: Use up-to-date screen_info copy to fill boot params

2020-10-14 Thread tip-bot2 for Kairui Song
The following commit has been merged into the x86/urgent branch of tip:

Commit-ID: afc18069a2cb7ead5f86623a5f3d4ad6e21f940d
Gitweb:        https://git.kernel.org/tip/afc18069a2cb7ead5f86623a5f3d4ad6e21f940d
Author:        Kairui Song 
AuthorDate:    Wed, 14 Oct 2020 17:24:28 +08:00
Committer: Ingo Molnar 
CommitterDate: Wed, 14 Oct 2020 17:05:03 +02:00

x86/kexec: Use up-to-date screen_info copy to fill boot params

kexec_file_load() currently reuses the old boot_params.screen_info,
but if drivers have changed the hardware state, boot_params.screen_info
could contain invalid info.

For example, the video type might no longer be VGA, or the frame buffer
address might have changed. If the kexec kernel keeps using the old
screen_info, the kexec'ed kernel may attempt to write to an invalid
framebuffer memory region.

There are two screen_info instances globally available,
boot_params.screen_info and screen_info. The latter is a copy, and is
updated by drivers.

So let kexec_file_load use the updated copy.

[ mingo: Tidied up the changelog. ]

Signed-off-by: Kairui Song 
Signed-off-by: Ingo Molnar 
Link: https://lore.kernel.org/r/20201014092429.1415040-2-kas...@redhat.com
---
 arch/x86/kernel/kexec-bzimage64.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/x86/kernel/kexec-bzimage64.c b/arch/x86/kernel/kexec-bzimage64.c
index 57c2ecf..ce831f9 100644
--- a/arch/x86/kernel/kexec-bzimage64.c
+++ b/arch/x86/kernel/kexec-bzimage64.c
@@ -200,8 +200,7 @@ setup_boot_parameters(struct kimage *image, struct boot_params *params,
params->hdr.hardware_subarch = boot_params.hdr.hardware_subarch;
 
/* Copying screen_info will do? */
-   memcpy(&params->screen_info, &boot_params.screen_info,
-   sizeof(struct screen_info));
+   memcpy(&params->screen_info, &screen_info, sizeof(struct screen_info));
 
/* Fill in memsize later */
params->screen_info.ext_mem_k = 0;


[tip: x86/urgent] hyperv_fb: Update screen_info after removing old framebuffer

2020-10-14 Thread tip-bot2 for Kairui Song
The following commit has been merged into the x86/urgent branch of tip:

Commit-ID: 3cb73bc3fa2a3cb80b88aa63b48409939e0d996b
Gitweb:        https://git.kernel.org/tip/3cb73bc3fa2a3cb80b88aa63b48409939e0d996b
Author:        Kairui Song 
AuthorDate:    Wed, 14 Oct 2020 17:24:29 +08:00
Committer: Ingo Molnar 
CommitterDate: Wed, 14 Oct 2020 17:05:26 +02:00

hyperv_fb: Update screen_info after removing old framebuffer

On a gen2 Hyper-V VM, hyperv_fb will remove the old framebuffer, and the
newly allocated framebuffer address could be at a different location,
and it might no longer be a VGA framebuffer.

Update screen_info so that after kexec the kernel won't try to reuse
the old invalid/stale framebuffer address as VGA, corrupting memory.

[ mingo: Tidied up the changelog. ]

Signed-off-by: Kairui Song 
Signed-off-by: Ingo Molnar 
Cc: Dexuan Cui 
Cc: Jake Oshins 
Cc: Wei Hu 
Cc: "K. Y. Srinivasan" 
Cc: Haiyang Zhang 
Cc: Stephen Hemminger 
Link: https://lore.kernel.org/r/20201014092429.1415040-3-kas...@redhat.com
---
 drivers/video/fbdev/hyperv_fb.c |  9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/drivers/video/fbdev/hyperv_fb.c b/drivers/video/fbdev/hyperv_fb.c
index 02411d8..e36fb1a 100644
--- a/drivers/video/fbdev/hyperv_fb.c
+++ b/drivers/video/fbdev/hyperv_fb.c
@@ -1114,8 +1114,15 @@ static int hvfb_getmem(struct hv_device *hdev, struct fb_info *info)
 getmem_done:
remove_conflicting_framebuffers(info->apertures,
KBUILD_MODNAME, false);
-   if (!gen2vm)
+
+   if (gen2vm) {
+   /* framebuffer is reallocated, clear screen_info to avoid misuse from kexec */
+   screen_info.lfb_size = 0;
+   screen_info.lfb_base = 0;
+   screen_info.orig_video_isVGA = 0;
+   } else {
pci_dev_put(pdev);
+   }
kfree(info->apertures);
 
return 0;


[PATCH 1/2] x86/kexec: Use up-to-date screen_info copy to fill boot params

2020-10-14 Thread Kairui Song
kexec_file_load currently just reuses the old boot_params.screen_info.
But if drivers have changed the hardware state, boot_params.screen_info
could contain invalid info.

For example, the video type might no longer be VGA, or the frame buffer
address may have changed. If the kexec kernel keeps using the old
screen_info, the kexec'ed kernel may attempt to write to an invalid
framebuffer memory region.

There are two screen_info instances globally available,
boot_params.screen_info and screen_info. The latter is a copy, and can
be updated by drivers.

So let kexec_file_load use the updated copy.

Signed-off-by: Kairui Song 
---
 arch/x86/kernel/kexec-bzimage64.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/x86/kernel/kexec-bzimage64.c b/arch/x86/kernel/kexec-bzimage64.c
index 57c2ecf43134..ce831f9448e7 100644
--- a/arch/x86/kernel/kexec-bzimage64.c
+++ b/arch/x86/kernel/kexec-bzimage64.c
@@ -200,8 +200,7 @@ setup_boot_parameters(struct kimage *image, struct boot_params *params,
params->hdr.hardware_subarch = boot_params.hdr.hardware_subarch;
 
/* Copying screen_info will do? */
-   memcpy(&params->screen_info, &boot_params.screen_info,
-   sizeof(struct screen_info));
+   memcpy(&params->screen_info, &screen_info, sizeof(struct screen_info));
 
/* Fill in memsize later */
params->screen_info.ext_mem_k = 0;
-- 
2.28.0



[PATCH 0/2] x86/hyperv: fix kexec/kdump hang on some VMs

2020-10-14 Thread Kairui Song
On some Hyper-V machines, if kexec_file_load is used to load the kexec
kernel, the second kernel can hang with the following stacktrace:

[0.591705] efifb: probing for efifb
[0.596869] efifb: framebuffer at 0xf800, using 3072k, total 3072k
[0.605894] efifb: mode is 1024x768x32, linelength=4096, pages=1
[0.617926] efifb: scrolling: redraw
[0.622715] efifb: Truecolor: size=8:8:8:8, shift=24:16:8:0
[   28.039046] watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [swapper/0:1]
[   28.039046] Modules linked in:
[   28.039046] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.18.0-230.el8.x86_64 
#1
[   28.039046] Hardware name: Microsoft Corporation Virtual Machine/Virtual 
Machine, BIOS Hyper-V UEFI Release v4.0 12/17/2019
[   28.039046] RIP: 0010:cfb_imageblit+0x450/0x4c0
[   28.039046] Code: 89 f8 b9 08 00 00 00 48 89 04 24 eb 2d 41 0f be 30 29 e9 
4c 8d 5f 04 d3 fe 44 21 ee 41 8b 04 b6 44 21 c8 89 c6 44 31 d6 89 37 <85> c9 75 
09 49 83 c0 01 b9 08 00 00 00 4c 89 df 48 39 df 75 ce 83
[   28.039046] RSP: 0018:c9087830 EFLAGS: 00010246 ORIG_RAX: 
ff12
[   28.039046] RAX:  RBX: c9542000 RCX: 0003
[   28.039046] RDX: 000e RSI:  RDI: c9541bf0
[   28.039046] RBP: 0001 R08: 8880f555c8df R09: 00aa
[   28.039046] R10:  R11: c9541bf4 R12: 1000
[   28.039046] R13: 0001 R14: 81e9a460 R15: 8880f555c880
[   28.039046] FS:  () GS:8880f100() 
knlGS:
[   28.039046] CS:  0010 DS:  ES:  CR0: 80050033
[   28.039046] CR2: 7f7b223b8000 CR3: f3a0a004 CR4: 003606b0
[   28.039046] DR0:  DR1:  DR2: 
[   28.039046] DR3:  DR6: fffe0ff0 DR7: 0400
[   28.039046] Call Trace:
[   28.039046]  bit_putcs+0x2a1/0x550
[   28.039046]  ? fbcon_switch+0x33e/0x5b0
[   28.039046]  ? bit_clear+0x120/0x120
[   28.039046]  fbcon_putcs+0xe7/0x100
[   28.039046]  do_update_region+0x154/0x1a0
[   28.039046]  redraw_screen+0x209/0x240
[   28.039046]  ? vc_do_resize+0x5c9/0x660
[   28.039046]  fbcon_prepare_logo+0x3b3/0x430
[   28.039046]  fbcon_init+0x436/0x630
[   28.039046]  visual_init+0xce/0x130
[   28.039046]  do_bind_con_driver+0x1df/0x2d0
[   28.039046]  do_take_over_console+0x113/0x180
[   28.039046]  do_fbcon_takeover+0x58/0xb0
[   28.039046]  register_framebuffer+0x225/0x2f0
[   28.039046]  efifb_probe.cold.5+0x51a/0x55d
[   28.039046]  platform_drv_probe+0x38/0x90
[   28.039046]  really_probe+0x212/0x440
[   28.039046]  driver_probe_device+0x49/0xc0
[   28.039046]  device_driver_attach+0x50/0x60
[   28.039046]  __driver_attach+0x61/0x130
[   28.039046]  ? device_driver_attach+0x60/0x60
[   28.039046]  bus_for_each_dev+0x77/0xc0
[   28.039046]  ? klist_add_tail+0x57/0x70
[   28.039046]  bus_add_driver+0x14d/0x1e0
[   28.039046]  ? vesafb_driver_init+0x13/0x13
[   28.039046]  ? do_early_param+0x91/0x91
[   28.039046]  driver_register+0x6b/0xb0
[   28.039046]  ? vesafb_driver_init+0x13/0x13
[   28.039046]  do_one_initcall+0x46/0x1c3
[   28.039046]  ? do_early_param+0x91/0x91
[   28.039046]  kernel_init_freeable+0x1b4/0x25d
[   28.039046]  ? rest_init+0xaa/0xaa
[   28.039046]  kernel_init+0xa/0xfa
[   28.039046]  ret_from_fork+0x35/0x40

The root cause is that the hyperv_fb driver relocates the
framebuffer address in the first kernel, but kexec_file_load simply
reuses the old framebuffer info from boot_params, which is now invalid,
so the second kernel will write to an invalid framebuffer address.

This series fix this problem by:

1. Let kexec_file_load use the updated copy of screen_info.

  Instead of using boot_params.screen_info, use the globally available
  screen_info variable (which is just a copy of
  boot_params.screen_info on x86). This variable can be updated
  by arch-independent drivers. Keeping this variable updated should
  be a good way to keep screen_info consistent across kexec.

2. Let hyperv_fb clean the screen_info copy when the boot framebuffer
  is relocated outside the old framebuffer.

  After the relocation, the framebuffer is no longer a VGA
  framebuffer, so just cleaning it up should be fine.

Kairui Song (2):
  x86/kexec: Use up-to-date screen_info copy to fill boot params
  hyperv_fb: Update screen_info after removing old framebuffer

 arch/x86/kernel/kexec-bzimage64.c | 3 +--
 drivers/video/fbdev/hyperv_fb.c   | 8 ++++++++
 2 files changed, 9 insertions(+), 2 deletions(-)

-- 
2.28.0



[PATCH 2/2] hyperv_fb: Update screen_info after removing old framebuffer

2020-10-14 Thread Kairui Song
On a gen2 Hyper-V VM, hyperv_fb will remove the old framebuffer; the
newly allocated framebuffer address could be at a different location,
and it's no longer a VGA framebuffer. Update screen_info
so that after kexec, the kernel won't try to reuse the old invalid
framebuffer address as VGA.

Signed-off-by: Kairui Song 
---
 drivers/video/fbdev/hyperv_fb.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/drivers/video/fbdev/hyperv_fb.c b/drivers/video/fbdev/hyperv_fb.c
index 02411d89cb46..e36fb1a0ecdb 100644
--- a/drivers/video/fbdev/hyperv_fb.c
+++ b/drivers/video/fbdev/hyperv_fb.c
@@ -1114,8 +1114,15 @@ static int hvfb_getmem(struct hv_device *hdev, struct fb_info *info)
 getmem_done:
remove_conflicting_framebuffers(info->apertures,
KBUILD_MODNAME, false);
-   if (!gen2vm)
+
+   if (gen2vm) {
+   /* framebuffer is reallocated, clear screen_info to avoid misuse from kexec */
+   screen_info.lfb_size = 0;
+   screen_info.lfb_base = 0;
+   screen_info.orig_video_isVGA = 0;
+   } else {
pci_dev_put(pdev);
+   }
kfree(info->apertures);
 
return 0;
-- 
2.28.0



Re: [RFC PATCH 0/3] Add writing support to vmcore for reusing oldmem

2020-09-21 Thread Kairui Song
On Thu, Sep 10, 2020 at 12:43 AM Kairui Song  wrote:
>
> On Wed, Sep 9, 2020 at 10:04 PM Eric W. Biederman wrote:
> >
> > Kairui Song  writes:
> >
> > > Currently vmcore only supports reading, this patch series is an RFC
> > > to add writing support to vmcore. It's x86_64 only yet, I'll add other
> > > architecture later if there is no problem with this idea.
> > >
> > > My purpose of adding writing support is to reuse the crashed kernel's
> > > old memory in kdump kernel, reduce kdump memory pressure, and
> > > allow kdump to run with a smaller crashkernel reservation.
> > >
> > > This is doable because in most cases, after kernel panic, user only
> > > interested in the crashed kernel itself, and userspace/cache/free
> > > memory pages are not dumped. `makedumpfile` is widely used to skip
> > > these pages. Kernel pages usually only take a small part of
> > > the whole old memory. So there will be many reusable pages.
> > >
> > > By adding writing support, userspace then can use these pages as a fast
> > > and temporary storage. This helps reduce memory pressure in many ways.
> > >
> > > For example, I've written a POC program based on this, it will find
> > > the reusable pages, and creates an NBD device which maps to these pages.
> > > The NBD device can then be used as swap, or to hold some temp files
> > > which previouly live in RAM.
> > >
> > > The link of the POC tool: https://github.com/ryncsn/kdumpd
> >
> > A couple of thoughts.
> > 1) Unless I am completely mistaken treating this as a exercise in
> >memory hotplug would be much simpler.
> >
> >AKA just plug in the memory that is not needed as part of the kdump.
> >
> >I see below that you have problems doing this because
> >of fragmentation.  I still think hotplug is doable using some
> >kind of fragmented memory zone.
> >
> > 2) The purpose of the memory reservation is because hardware is
> >still potentially running agains the memory of the old kernel.
> >
> >By the time we have brought up a new kernel enough of the hardware
> >may have been reinitialized that we don't have to worry about
> >hardware randomly dma'ing into the memory used by the old kernel.
> >
> >With IOMMUs and care we may be able to guarantee for some machine
> >configurations it is impossible for DMA to come from some piece of
> >hardware that is present but the kernel does not have a driver
> >loaded for.\
> >
> > I really do not like this approach because it is fundamentlly doing the
> > wrong thing.  Adding write support to read-only drivers.  I do not see
> > anywhere that you even mentioned the hard problem and the reason we
> > reserve memory in the first place.  Hardware spontaneously DMA'ing onto
> > it.
> >
> That POC tool looks ugly for now as it is only a draft to prove this
> works, sorry about it.
>
> For the patch, yes, it is expecting the IOMMU to lower the chance of
> a potential DMA issue, and expecting DMA will not hit userspace/free
> pages, or at least won't overwrite a massive amount of reusable old
> memory. And I thought about some solutions for the potential DMA
> issue.
>
> As old memories are used as a block device, which is proxied by
> userspace, so upon each IO, the userspace tool could do an integrity
> check of the corresponding data stored in old mem, and keep multiple
> copies of the data. (eg. use 512M of old memory to hold a 128M block
> device). These copies will be kept far away from each other regarding
> the physical memory location. The reusable old memories are sparse so
> the actual memory containing the data should be also sparse.
> So if some part is corrupted, it is still recoverable. Unless the DMA
> went very wrong and wiped a large region of memory; but if such a thing
> happens, it's most likely kernel pages are also being wiped by DMA, so
> the vmcore is already corrupted and kdump may not help. But at least
> it won't fail silently; the userspace tool can still do something like
> dump some available data to an easy-to-set-up target.
>
> And that's also one of the reasons for not using old memory as kdump's
> memory directly.
>
> > It has been a long-standing issue that kdump suffers from OOM
> > > with limited crashkernel memory. So reusing old memory could be very
> > > helpful.
> >
> > There is a very fine line here between reusing existing code (aka
> > drivers and userspace) and doing something that should work.
> >
> > It might ma

Re: [RFC PATCH 0/3] Add writing support to vmcore for reusing oldmem

2020-09-09 Thread Kairui Song
On Wed, Sep 9, 2020 at 10:04 PM Eric W. Biederman  wrote:
>
> Kairui Song  writes:
>
> > Currently vmcore only supports reading, this patch series is an RFC
> > to add writing support to vmcore. It's x86_64 only yet, I'll add other
> > architecture later if there is no problem with this idea.
> >
> > My purpose of adding writing support is to reuse the crashed kernel's
> > old memory in kdump kernel, reduce kdump memory pressure, and
> > allow kdump to run with a smaller crashkernel reservation.
> >
> > This is doable because in most cases, after kernel panic, user only
> > interested in the crashed kernel itself, and userspace/cache/free
> > memory pages are not dumped. `makedumpfile` is widely used to skip
> > these pages. Kernel pages usually only take a small part of
> > the whole old memory. So there will be many reusable pages.
> >
> > By adding writing support, userspace then can use these pages as a fast
> > and temporary storage. This helps reduce memory pressure in many ways.
> >
> > For example, I've written a POC program based on this, it will find
> > the reusable pages, and creates an NBD device which maps to these pages.
> > The NBD device can then be used as swap, or to hold some temp files
> > which previouly live in RAM.
> >
> > The link of the POC tool: https://github.com/ryncsn/kdumpd
>
> A couple of thoughts.
> 1) Unless I am completely mistaken treating this as a exercise in
>memory hotplug would be much simpler.
>
>AKA just plug in the memory that is not needed as part of the kdump.
>
>I see below that you have problems doing this because
>of fragmentation.  I still think hotplug is doable using some
>kind of fragmented memory zone.
>
> 2) The purpose of the memory reservation is because hardware is
>still potentially running agains the memory of the old kernel.
>
>By the time we have brought up a new kernel enough of the hardware
>may have been reinitialized that we don't have to worry about
>hardware randomly dma'ing into the memory used by the old kernel.
>
>With IOMMUs and care we may be able to guarantee for some machine
>configurations it is impossible for DMA to come from some piece of
>hardware that is present but the kernel does not have a driver
>loaded for.\
>
> I really do not like this approach because it is fundamentlly doing the
> wrong thing.  Adding write support to read-only drivers.  I do not see
> anywhere that you even mentioned the hard problem and the reason we
> reserve memory in the first place.  Hardware spontaneously DMA'ing onto
> it.
>
That POC tool looks ugly for now as it is only a draft to prove this
works, sorry about it.

For the patch, yes, it is expecting the IOMMU to lower the chance of
a potential DMA issue, and expecting DMA will not hit userspace/free
pages, or at least won't overwrite a massive amount of reusable old
memory. And I thought about some solutions for the potential DMA
issue.

As old memories are used as a block device, which is proxied by
userspace, so upon each IO, the userspace tool could do an integrity
check of the corresponding data stored in old mem, and keep multiple
copies of the data. (eg. use 512M of old memory to hold a 128M block
device). These copies will be kept far away from each other regarding
the physical memory location. The reusable old memories are sparse so
the actual memory containing the data should be also sparse.
So if some part is corrupted, it is still recoverable. Unless the DMA
went very wrong and wiped a large region of memory; but if such a thing
happens, it's most likely kernel pages are also being wiped by DMA, so
the vmcore is already corrupted and kdump may not help. But at least
it won't fail silently; the userspace tool can still do something like
dump some available data to an easy-to-set-up target.
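
A hypothetical userspace-side sketch of that redundancy idea (the
checksum helper and layout are assumptions for illustration, not code
from the POC tool):

#include <stdint.h>
#include <string.h>

#define NR_COPIES 3

struct block {
	uint8_t  *copy[NR_COPIES];	/* mappings of sparse old-mem pages */
	uint32_t  csum;			/* checksum recorded at setup time */
};

uint32_t checksum(const void *buf, size_t len);	/* assumed helper */

/* Fill 'out' from the first intact replica; return -1 only if DMA
 * corrupted all of them, so the failure is at least loud, not silent. */
static int block_read(struct block *b, void *out, size_t len)
{
	for (int i = 0; i < NR_COPIES; i++) {
		if (checksum(b->copy[i], len) == b->csum) {
			memcpy(out, b->copy[i], len);
			return 0;
		}
	}
	return -1;
}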

And that's also one of the reasons for not using old memory as kdump's
memory directly.

> > It has been a long-standing issue that kdump suffers from OOM
> > with limited crashkernel memory. So reusing old memory could be very
> > helpful.
>
> There is a very fine line here between reusing existing code (aka
> drivers and userspace) and doing something that should work.
>
> It might make sense to figure out what is using so much memory
> that an OOM is triggered.
>
> Ages ago I did something that was essentially dumping the kernels printk
> buffer to the serial console in case of a crash and I had things down to
> something comparatively miniscule like 8M or less.
>
> My memory is that historically it has been high performance scsi raid
> drivers or something like that, that are behind the need to have such
> large memory reservations.
>
> Now that I think 

[RFC PATCH 1/3] vmcore: simplify read_from_oldmem

2020-09-09 Thread Kairui Song
Simplify the code logic; this also helps reduce object size and stack usage.

Stack usage:
  Before: fs/proc/vmcore.c:106:9: read_from_oldmem.part.0   80 static
          fs/proc/vmcore.c:106:9: read_from_oldmem          16 static
  After:  fs/proc/vmcore.c:106:9: read_from_oldmem          80 static

Size of vmcore.o:
             text    data    bss    dec    hex  filename
  Before:    7677     109     88   7874   1ec2  fs/proc/vmcore.o
  After:     7669     109     88   7866   1eba  fs/proc/vmcore.o

Signed-off-by: Kairui Song 
---
 fs/proc/vmcore.c | 27 ++++++++++-----------------
 1 file changed, 10 insertions(+), 17 deletions(-)

diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
index c3a345c28a93..124c2066f3e5 100644
--- a/fs/proc/vmcore.c
+++ b/fs/proc/vmcore.c
@@ -108,25 +108,19 @@ ssize_t read_from_oldmem(char *buf, size_t count,
 bool encrypted)
 {
unsigned long pfn, offset;
-   size_t nr_bytes;
-   ssize_t read = 0, tmp;
+   size_t nr_bytes, to_copy = count;
+   ssize_t tmp;
 
-   if (!count)
-   return 0;
-
-   offset = (unsigned long)(*ppos % PAGE_SIZE);
+   offset = (unsigned long)(*ppos & (PAGE_SIZE - 1));
pfn = (unsigned long)(*ppos / PAGE_SIZE);
 
-   do {
-   if (count > (PAGE_SIZE - offset))
-   nr_bytes = PAGE_SIZE - offset;
-   else
-   nr_bytes = count;
+   while (to_copy) {
+   nr_bytes = min(to_copy, PAGE_SIZE - offset);
 
/* If pfn is not ram, return zeros for sparse dump files */
-   if (pfn_is_ram(pfn) == 0)
+   if (pfn_is_ram(pfn) == 0) {
memset(buf, 0, nr_bytes);
-   else {
+   } else {
if (encrypted)
tmp = copy_oldmem_page_encrypted(pfn, buf,
 nr_bytes,
@@ -140,14 +134,13 @@ ssize_t read_from_oldmem(char *buf, size_t count,
return tmp;
}
*ppos += nr_bytes;
-   count -= nr_bytes;
buf += nr_bytes;
-   read += nr_bytes;
+   to_copy -= nr_bytes;
++pfn;
offset = 0;
-   } while (count);
+   }
 
-   return read;
+   return count;
 }
 
 /*
-- 
2.26.2



[RFC PATCH 3/3] x86_64: implement copy_to_oldmem_page

2020-09-09 Thread Kairui Song
Previous commit introduced writing support for vmcore, it requires
per-architecture implementation for the writing function.

Signed-off-by: Kairui Song 
---
 arch/x86/kernel/crash_dump_64.c | 49 ++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 40 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kernel/crash_dump_64.c b/arch/x86/kernel/crash_dump_64.c
index 045e82e8945b..ec80da75b287 100644
--- a/arch/x86/kernel/crash_dump_64.c
+++ b/arch/x86/kernel/crash_dump_64.c
@@ -13,7 +13,7 @@
 
 static ssize_t __copy_oldmem_page(unsigned long pfn, char *buf, size_t csize,
  unsigned long offset, int userbuf,
- bool encrypted)
+ bool encrypted, bool is_write)
 {
void  *vaddr;
 
@@ -28,13 +28,25 @@ static ssize_t __copy_oldmem_page(unsigned long pfn, char *buf, size_t csize,
if (!vaddr)
return -ENOMEM;
 
-   if (userbuf) {
-   if (copy_to_user((void __user *)buf, vaddr + offset, csize)) {
-   iounmap((void __iomem *)vaddr);
-   return -EFAULT;
+   if (is_write) {
+   if (userbuf) {
+   if (copy_from_user(vaddr + offset, (void __user *)buf, csize)) {
+   iounmap((void __iomem *)vaddr);
+   return -EFAULT;
+   }
+   } else {
+   memcpy(vaddr + offset, buf, csize);
}
-   } else
-   memcpy(buf, vaddr + offset, csize);
+   } else {
+   if (userbuf) {
+   if (copy_to_user((void __user *)buf, vaddr + offset, csize)) {
+   iounmap((void __iomem *)vaddr);
+   return -EFAULT;
+   }
+   } else {
+   memcpy(buf, vaddr + offset, csize);
+   }
+   }
 
set_iounmap_nonlazy();
iounmap((void __iomem *)vaddr);
@@ -57,7 +69,7 @@ static ssize_t __copy_oldmem_page(unsigned long pfn, char *buf, size_t csize,
 ssize_t copy_oldmem_page(unsigned long pfn, char *buf, size_t csize,
 unsigned long offset, int userbuf)
 {
-   return __copy_oldmem_page(pfn, buf, csize, offset, userbuf, false);
+   return __copy_oldmem_page(pfn, buf, csize, offset, userbuf, false, false);
 }
 
 /**
@@ -68,7 +80,26 @@ ssize_t copy_oldmem_page(unsigned long pfn, char *buf, size_t csize,
 ssize_t copy_oldmem_page_encrypted(unsigned long pfn, char *buf, size_t csize,
   unsigned long offset, int userbuf)
 {
-   return __copy_oldmem_page(pfn, buf, csize, offset, userbuf, true);
+   return __copy_oldmem_page(pfn, buf, csize, offset, userbuf, true, false);
+}
+
+/**
+ * copy_to_oldmem_page - similar to copy_oldmem_page but in opposite direction.
+ */
+ssize_t copy_to_oldmem_page(unsigned long pfn, char *src, size_t csize,
+   unsigned long offset, int userbuf)
+{
+   return __copy_oldmem_page(pfn, src, csize, offset, userbuf, false, true);
+}
+
+/**
+ * copy_to_oldmem_page_encrypted - similar to copy_oldmem_page_encrypted but
+ * in opposite direction.
+ */
+ssize_t copy_to_oldmem_page_encrypted(unsigned long pfn, char *src, size_t csize,
+   unsigned long offset, int userbuf)
+{
+   return __copy_oldmem_page(pfn, src, csize, offset, userbuf, true, true);
 }
 
 ssize_t elfcorehdr_read(char *buf, size_t count, u64 *ppos)
-- 
2.26.2



[RFC PATCH 0/3] Add writing support to vmcore for reusing oldmem

2020-09-09 Thread Kairui Song
Currently vmcore only supports reading, this patch series is an RFC
to add writing support to vmcore. It's x86_64 only yet, I'll add other
architecture later if there is no problem with this idea.

My purpose of adding writing support is to reuse the crashed kernel's
old memory in kdump kernel, reduce kdump memory pressure, and
allow kdump to run with a smaller crashkernel reservation.

This is doable because in most cases, after kernel panic, user only
interested in the crashed kernel itself, and userspace/cache/free
memory pages are not dumped. `makedumpfile` is widely used to skip
these pages. Kernel pages usually only take a small part of
the whole old memory. So there will be many reusable pages.

By adding writing support, userspace then can use these pages as a fast
and temporary storage. This helps reduce memory pressure in many ways.

For example, I've written a POC program based on this, it will find
the reusable pages, and creates an NBD device which maps to these pages.
The NBD device can then be used as swap, or to hold some temp files
which previouly live in RAM.

The link of the POC tool: https://github.com/ryncsn/kdumpd

I tested it on x86_64 on latest Fedora by using it as swap with
following step in kdump kernel:

  1. Install this tool in kdump initramfs
  2. Execute following command in kdump:
 /sbin/modprobe nbd nbds_max=1
 /bin/kdumpd &
 /sbin/mkswap /dev/nbd0
 /sbin/swapon /dev/nbd0
  3. Observe the swap is being used:
 SwapTotal:131068 kB
 SwapFree: 121852 kB

It helped to reduce the crashkernel from 168M to 110M for a successful
kdump run over NFSv3. There are still many work items that could be done
based on this idea, e.g. moving the initramfs content to the old memory,
which may help save another ~10-20M of memory.

It has been a long-standing issue that kdump suffers from OOM
with limited crashkernel memory. So reusing old memory could be very
helpful.

This method has its limitations:
- Swap only works for userspace. But kdump userspace is a major memory
  consumer, so in general this should be helpful enough.
- For users who want to dump the whole memory area, this won't help as
  there is no reusable page.

I've tried other ways to improve the crashkernel value, eg.
- Reserve some smaller memory segments in the first kernel for crashkernel: It's
  only a supplement to the default crashkernel reservation and only makes the
  crashkernel value more adjustable, still not solving the real problem.

- Reuse old memory, but hotplug chunk of reusable old memory into
  kdump kernel's memory:
  It's hard to find large chunk of continuous memory, especially on
  systems with heavy workload, the reusable regions could be very
  fragmental. So it can only hotplug small fragments of memories,
  which looks hackish, and may have a high page table overhead.

- Implement the old-memory-based block device as a kernel
  module. It doesn't look good to have a module for this sole
  usage, and it doesn't have much performance/implementation advantage
  compared to this RFC.

Besides, keeping all the complex logic of parsing and reusing old memory
in userspace seems a better idea.

And as a plus, this could make it more doable and reasonable to
have a crashkernel=auto param. If there is swap, then userspace
will have less memory pressure, and crashkernel=auto can focus on the
kernel usage.

Kairui Song (3):
  vmcore: simplify read_from_oldmem
  vmcore: Add interface to write to old mem
  x86_64: implement copy_to_oldmem_page

 arch/x86/kernel/crash_dump_64.c |  49 ++++++++++++++++++---------
 fs/proc/vmcore.c                | 154 +++++++++++++++++++++++++++++++++------
 include/linux/crash_dump.h      |  18 +++++++---
 3 files changed, 180 insertions(+), 41 deletions(-)

-- 
2.26.2



[RFC PATCH 2/3] vmcore: Add interface to write to old mem

2020-09-09 Thread Kairui Song
vmcore is used as the interface to access crashed kernel's memory in
kdump, and currently vmcore only supports reading.

Adding write support is useful for enabling userspace to make better
use of the old memory.

For kdump, `makedumpfile` is widely used to reduce the dumped vmcore
size, and in most setups, it will drop user space memory and caches. This
means these memory pages are reusable.

Kdump runs in a limited pre-reserved memory region, so if these old memory
pages are reused, it can help reduce memory pressure in the kdump kernel,
hence allowing the first kernel to reserve less memory for kdump.

Adding write support to vmcore is the first step; then user space can
do IO on the old mem. There are multiple ways to reuse the memory. For
example, userspace can register an NBD device and redirect the IO on the
device to old memory. The NBD device can be used as swap, or used to
hold some temp files, as sketched below.
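
As a rough usage sketch (assuming the new write path accepts plain
pwrite() at the same offsets the read path already exposes; the offset
below is hypothetical, and this is not code from the POC tool):

#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	char page[4096] = { 0 };
	off_t reusable = 0x1000000;	/* a page makedumpfile marked reusable */
	int fd = open("/proc/vmcore", O_RDWR);

	if (fd < 0)
		return 1;
	/* write one page of temp data back into reusable old memory */
	if (pwrite(fd, page, sizeof(page), reusable) != sizeof(page))
		return 1;
	close(fd);
	return 0;
}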

Signed-off-by: Kairui Song 
---
 fs/proc/vmcore.c           | 129 ++++++++++++++++++++++++++++++++++++------
 include/linux/crash_dump.h |  18 ++++--
 2 files changed, 131 insertions(+), 16 deletions(-)

diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
index 124c2066f3e5..23acc0f2ecd7 100644
--- a/fs/proc/vmcore.c
+++ b/fs/proc/vmcore.c
@@ -103,9 +103,9 @@ static int pfn_is_ram(unsigned long pfn)
 }
 
 /* Reads a page from the oldmem device from given offset. */
-ssize_t read_from_oldmem(char *buf, size_t count,
-u64 *ppos, int userbuf,
-bool encrypted)
+static ssize_t oldmem_rw_page(char *buf, size_t count,
+ u64 *ppos, int userbuf,
+ bool encrypted, bool is_write)
 {
unsigned long pfn, offset;
size_t nr_bytes, to_copy = count;
@@ -119,20 +119,33 @@ ssize_t read_from_oldmem(char *buf, size_t count,
 
/* If pfn is not ram, return zeros for sparse dump files */
if (pfn_is_ram(pfn) == 0) {
-   memset(buf, 0, nr_bytes);
-   } else {
-   if (encrypted)
-   tmp = copy_oldmem_page_encrypted(pfn, buf,
-nr_bytes,
-offset,
-userbuf);
+   if (is_write)
+   return -EINVAL;
else
-   tmp = copy_oldmem_page(pfn, buf, nr_bytes,
-  offset, userbuf);
+   memset(buf, 0, nr_bytes);
+   } else {
+   if (encrypted) {
+   tmp = is_write ?
+   copy_to_oldmem_page_encrypted(pfn, buf,
+ nr_bytes,
+ offset,
+ userbuf) :
+   copy_oldmem_page_encrypted(pfn, buf,
+  nr_bytes,
+  offset,
+  userbuf);
+   } else {
+   tmp = is_write ?
+   copy_to_oldmem_page(pfn, buf, nr_bytes,
+   offset, userbuf) :
+   copy_oldmem_page(pfn, buf, nr_bytes,
+   offset, userbuf);
+   }
 
if (tmp < 0)
return tmp;
}
+
*ppos += nr_bytes;
buf += nr_bytes;
to_copy -= nr_bytes;
@@ -143,6 +156,22 @@ ssize_t read_from_oldmem(char *buf, size_t count,
return count;
 }
 
+/* Reads a page from the oldmem device from given offset. */
+ssize_t read_from_oldmem(char *buf, size_t count,
+u64 *ppos, int userbuf,
+bool encrypted)
+{
+   return oldmem_rw_page(buf, count, ppos, userbuf, encrypted, 0);
+}
+
+/* Writes a page to the oldmem device of given offset. */
+ssize_t write_to_oldmem(char *buf, size_t count,
+   u64 *ppos, int userbuf,
+   bool encrypted)
+{
+   return oldmem_rw_page(buf, count, ppos, userbuf, encrypted, 1);
+}
+
 /*
  * Architectures may override this function to allocate ELF header in 2nd kernel
  */
@@ -184,6 +213,26 @@ int __weak remap_oldmem_pfn_range(struct vm_area_struct *vma,
return remap_pfn_range(vma, from, pfn, size, prot);
 }
 
+/*
+ * Architectures which support writ

Re: [RFC PATCH] PCI, kdump: Clear bus master bit upon shutdown in kdump kernel

2020-07-23 Thread Kairui Song
On Thu, Jul 23, 2020 at 8:00 AM Bjorn Helgaas  wrote:
>
> On Wed, Jul 22, 2020 at 03:50:48PM -0600, Jerry Hoemann wrote:
> > On Wed, Jul 22, 2020 at 10:21:23AM -0500, Bjorn Helgaas wrote:
> > > On Wed, Jul 22, 2020 at 10:52:26PM +0800, Kairui Song wrote:
>
> > > > I think I didn't make one thing clear, The PCI UR error never arrives
> > > > in kernel, it's the iLo BMC on that HPE machine caught the error, and
> > > > send kernel an NMI. kernel is panicked by NMI, I'm still trying to
> > > > figure out why the NMI hanged kernel, even with panic=-1,
> > > > panic_on_io_nmi, panic_on_unknown_nmi all set. But if we can avoid the
> > > > NMI by shutdown the devices in right order, that's also a solution.
>
> ACPI v6.3, chapter 18, does mention NMIs several times, e.g., Table
> 18-394 and sec 18.4.  I'm not familiar enough with APEI to know
> whether Linux correctly supports all those cases.  Maybe this is a
> symptom that we don't?
>
> > > I'm not sure how much sympathy to have for this situation.  A PCIe UR
> > > is fatal for the transaction and maybe even the device, but from the
> > > overall system point of view, it *should* be a recoverable error and
> > > we shouldn't panic.
> > >
> > > Errors like that should be reported via the normal AER or ACPI/APEI
> > > mechanisms.  It sounds like in this case, the platform has decided
> > > these aren't enough and it is trying to force a reboot?  If this is
> > > "special" platform behavior, I'm not sure how much we need to cater
> > > for it.
> >
> > Are these AER errors the type processed by the GHES code?
>
> My understanding from ACPI v6.3, sec 18.3.2, is that the Hardware
> Error Source Table may contain Error Source Descriptors of types like:
>
>   IA-32 Machine Check Exception
>   IA-32 Corrected Machine Check
>   IA-32 Non-Maskable Interrupt
>   PCIe Root Port AER
>   PCIe Device AER
>   Generic Hardware Error Source (GHES)
>   Hardware Error Notification
>   IA-32 Deferred Machine Check
>
> I would naively expect PCIe UR errors to be reported via one of the
> PCIe Error Sources, not GHES, but maybe there's some reason to use
> GHES.
>
> The kernel should already know how to deal with the PCIe AER errors,
> but we'd have to add new device-specific code to handle things
> reported via GHES, along the lines of what Shiju is doing here:
>
>   https://lore.kernel.org/r/20200722104245.1060-1-shiju.j...@huawei.com
>
> > I'll note that RedHat runs their crash kernel with:  hest_disable.
> > So, the ghes code is disabled in the crash kernel.
>
> That would disable all the HEST error sources, including the PCIe AER
> ones as well as GHES ones.  If we turn off some of the normal error
> handling mechanisms, I guess we have to expect that some errors won't
> be handled correctly.


Hi, that's true, hest_disable is added by default to reduce memory
usage in special cases.
But even if I remove hest_disable and have GHES enabled, the hanging
issue still exists: from the iLO console log, the platform is still
sending an NMI to the kernel, and the kernel hangs.

The NMI doesn't hang the kernel 100 percent of the time; sometimes it
just panics and reboots, and sometimes it hangs. This behavior didn't
change before/after enabling GHES.

Maybe this is a "special platform behavior". I'm also not 100 percent
sure if/how we can cover this in a good way for now.
I'll try to figure out how the NMI actually hangs the kernel and see if
it could be fixed in other ways.
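
For reference, those knobs gate an explicit panic in the x86 NMI
handlers; a rough sketch of where they take effect (cf.
arch/x86/kernel/nmi.c -- exact code differs by kernel version), which
is why a platform NMI that never cleanly reaches these handlers can
still hang us:

/* Sketch only -- where the two knobs are consulted in the NMI path. */
static void io_check_error(unsigned char reason, struct pt_regs *regs)
{
        if (panic_on_io_nmi)
                nmi_panic(regs, "NMI IOCK error: Not continuing");
        /* ... */
}

static void unknown_nmi_error(unsigned char reason, struct pt_regs *regs)
{
        if (panic_on_unknown_nmi)
                nmi_panic(regs, "NMI: Not continuing");
        /* ... */
}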

-- 
Best Regards,
Kairui Song



Re: [RFC PATCH] PCI, kdump: Clear bus master bit upon shutdown in kdump kernel

2020-07-22 Thread Kairui Song
On Fri, Mar 6, 2020 at 5:38 PM Baoquan He  wrote:
>
> On 03/04/20 at 08:53pm, Deepa Dinamani wrote:
> > On Wed, Mar 4, 2020 at 7:53 PM Baoquan He  wrote:
> > >
> > > +Joerg to CC.
> > >
> > > On 03/03/20 at 01:01pm, Deepa Dinamani wrote:
> > > > I looked at this some more. Looks like we do not clear irqs when we do
> > > > a kexec reboot. And, the bootup code maintains the same table for the
> > > > kexec-ed kernel. I'm looking at the following code in
> > >
> > > I guess you are talking about kdump reboot here, right? Kexec and kdump
> > > boot take the similar mechanism, but differ a little.
> >
> > Right I meant kdump kernel here. And, clearly the is_kdump_kernel() case 
> > below.
> >
> > >
> > > > intel_irq_remapping.c:
> > > >
> > > > if (ir_pre_enabled(iommu)) {
> > > > if (!is_kdump_kernel()) {
> > > > pr_warn("IRQ remapping was enabled on %s but
> > > > we are not in kdump mode\n",
> > > > iommu->name);
> > > > clear_ir_pre_enabled(iommu);
> > > > iommu_disable_irq_remapping(iommu);
> > > > } else if (iommu_load_old_irte(iommu))
> > >
> > > Here, it's for kdump kernel to copy old ir table from 1st kernel.
> >
> > Correct.
> >
> > > > pr_err("Failed to copy IR table for %s from
> > > > previous kernel\n",
> > > >iommu->name);
> > > > else
> > > > pr_info("Copied IR table for %s from previous 
> > > > kernel\n",
> > > > iommu->name);
> > > > }
> > > >
> > > > Would cleaning the interrupts(like in the non kdump path above) just
> > > > before shutdown help here? This should clear the interrupts enabled
> > > > for all the devices in the current kernel. So when kdump kernel
> > > > starts, it starts clean. This should probably help block out the
> > > > interrupts from a device that does not have a driver.
> > >
> > > I think stopping those out-of-control devices from continuing to send
> > > interrupts is a good idea, though I'm not sure if only clearing the
> > > interrupts will be enough. Devices which will be initialized by their
> > > driver will stop, but devices whose drivers are not loaded into the kdump
> > > kernel may continue acting. Even though interrupts are cleared at this
> > > time, the in-flight DMA could continue triggering interrupts since the ir
> > > table and io page table are rebuilt.
> >
> > This should be handled by the IOMMU, right? And, hence you are getting
> > UR. This seems like the correct execution flow to me.
>
> Sorry for late reply.
> Yes, this is initializing IOMMU device.
>
> >
> > Anyway, you could just test this theory by removing the
> > is_kdump_kernel() check above and see if it solves your problem.
> > Obviously, check the VT-d spec to figure out the exact sequence to
> > turn off the IR.
>
> OK, I will talk to Kairui and get a machine to test it. Thanks for your
> nice idea, if you have a draft patch, we are happy to test it.
>
> >
> > Note that the device that is causing the problem here is a legit
> > device. We want to have interrupts from devices we don't know about
> > blocked anyway because we can have compromised firmware/ devices that
> > could cause a DoS attack. So blocking the unwanted interrupts seems
> > like the right thing to do here.
>
> Kairui said it's a device which driver is not loaded in kdump kernel
> because it's not needed by kdump. We try to only load kernel modules
> which are needed, e.g one device is the dump target, its driver has to
> be loaded in. In this case, the device is more like a out of control
> device to kdump kernel.
>

Hi Bao, Deepa, sorry for this very late response. The test machine was
not available for some time, and I have now resumed work on this problem.

The workaround mentioned by Deepa (removing the is_kdump_kernel()
check) didn't work; the machine still hangs upon shutdown.
Devices left in an unknown state and still sending interrupts could be
a problem, but that's unrelated to this hanging problem.
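
For clarity, the tested change amounted to always taking the teardown
branch of the intel_irq_remapping.c snippet quoted above; roughly (a
sketch of what was tested, not an exact patch):

if (ir_pre_enabled(iommu)) {
        /*
         * Tear down the pre-enabled IR state unconditionally instead
         * of copying the old IRTE table when in the kdump kernel.
         */
        clear_ir_pre_enabled(iommu);
        iommu_disable_irq_remapping(iommu);
}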

I think I didn't make one thing clear: the PCI UR error never arrives
in the kernel; it's the iLO BMC on that HPE machine that caught the
error and sent the kernel an NMI. The kernel is panicked by the NMI; I'm
still trying to figure out why the NMI hangs the kernel, even with
panic=-1, panic_on_io_nmi, and panic_on_unknown_nmi all set. But if we
can avoid the NMI by shutting down the devices in the right order,
that's also a solution.

--
Best Regards,
Kairui Song



Re: [PATCH v2] x86, efi: never relocate kernel below lowest acceptable address

2019-09-25 Thread Kairui Song
On Wed, Sep 25, 2019 at 5:55 PM Baoquan He  wrote:
>
> On 09/20/19 at 12:05am, Kairui Song wrote:
> > Currently, kernel fails to boot on some HyperV VMs when using EFI.
> > And it's a potential issue on all platforms.
> >
> > It's caused by a broken kernel relocation on EFI systems, when the below three
> > conditions are met:
> >
> > 1. Kernel image is not loaded to the default address (LOAD_PHYSICAL_ADDR)
> >by the loader.
> > 2. There isn't enough room to contain the kernel, starting from the
> >default load address (eg. something else occupied part the region).
> > 3. In the memmap provided by EFI firmware, there is a memory region
> >starts below LOAD_PHYSICAL_ADDR, and suitable for containing the
> >kernel.
>
> Thanks for the effort, Kairui.
>
> Let me summarize what I got from this issue, please correct me if
> anything missed:
>
> ***
> Problem:
> This bug is reported on the Hyper-V platform. The kernel will sometimes
> reset to firmware w/o any console printing, in both the 1st kernel and
> the kdump kernel.
>
> ***
> Root cause:
> With debugging, the reset to firmware is triggered when executing the
> 'rep movsq' line of /boot/compressed/head_64.S. The reason is that the
> efi boot stub may put the kernel image below 16M, and later head_64.S will
> relocate the kernel to 16M directly. That relocation will conflict with some
> efi reserved region, and then cause the reset.
>
> A more detailed process, based on the problem that occurred on that HyperV
> machine:
>
> - kernel (INIT_SIZE: 56820K) got loaded at 0x3c881000 (not aligned,
>   and not equal to pref_address 0x100), need to relocate.
>
> - efi_relocate_kernel is called and tries to allocate INIT_SIZE of memory
>   at pref_address; it fails, as something else occupied this region.
>
> - efi_relocate_kernel calls efi_low_alloc as a fallback, and got the address
>   0x80 (below 0x100)
>
> - Later in arch/x86/boot/compressed/head_64.S:108, LOAD_PHYSICAL_ADDR is
>   forcibly used as the new load address since the current address is lower
>   than that. Then the kernel tries to relocate to 0x100.
>
> - However, the memory starting from 0x100 is not allocated from the EFI
>   firmware; writing to this region causes the system to reset.
>
> ***
> Solution:
> Always search for an area above LOAD_PHYSICAL_ADDR, namely 16M, to put the
> kernel image in /boot/compressed/eboot.c. Then the efi boot stub in eboot.c
> will search for a suitable area in the efi memmap, to make sure no reserved
> region will conflict with the target area of the kernel image. Besides, the
> kernel won't be relocated in /boot/compressed/head_64.S since it is already
> above 16M.
>
> #ifdef CONFIG_RELOCATABLE
> leaqstartup_32(%rip) /* - $startup_32 */, %rbp
> movlBP_kernel_alignment(%rsi), %eax
> decl%eax
> addq%rax, %rbp
> notq%rax
> andq%rax, %rbp
> cmpq$LOAD_PHYSICAL_ADDR, %rbp
> jge 1f
> #endif
> movq$LOAD_PHYSICAL_ADDR, %rbp
> 1:
>
> /* Target address to relocate to for decompression */
> movlBP_init_size(%rsi), %ebx
> subl$_end, %ebx
> addq%rbp, %rbx
>

Hi Baoquan,

Yes, it's all correct. Thanks for adding these details.
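
For readers following along, the core of the fix is to make the EFI
stub reject candidate addresses below LOAD_PHYSICAL_ADDR while scanning
the memmap. A minimal sketch of the idea (illustrative only, not the
actual eboot.c code; pick_load_addr is a made-up helper, and
LOAD_PHYSICAL_ADDR is assumed to come from asm/boot.h):

/*
 * For a free EFI memmap region, choose the lowest usable address that
 * is still >= LOAD_PHYSICAL_ADDR, so head_64.S never needs to relocate
 * the kernel again.  Returns 0 if the region can't hold the kernel.
 */
static unsigned long pick_load_addr(unsigned long region_start,
                                    unsigned long region_size,
                                    unsigned long init_size)
{
        unsigned long addr = region_start;

        if (addr < LOAD_PHYSICAL_ADDR)
                addr = LOAD_PHYSICAL_ADDR;

        if (addr + init_size > region_start + region_size)
                return 0;

        return addr;
}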

>
> ***
> I have one concerns about this patch:
>
> Why does this only happen on the Hyper-V platform? Why don't Qemu/KVM,
> bare metal, and VMware ESXi have this issue? What's the difference?

Let me post part of the efi memmap on that machine (and btw the kernel
size is 55M):

kernel: efi: mem00: type=7, attr=0xf,
range=[0x-0x0008) (0MB)
kernel: efi: mem01: type=4, attr=0xf,
range=[0x0008-0x00081000) (0MB)
kernel: efi: mem02: type=2, attr=0xf,
range=[0x00081000-0x00082000) (0MB)
kernel: efi: mem03: type=7, attr=0xf,
range=[0x00082000-0x000a) (0MB)
kernel: efi: mem04: type=4, attr=0xf,
range=[0x0010-0x0062a000) (5MB)
kernel: efi: mem05: type=7, attr=0xf,
range=[0x0062a000-0x0420) (59MB)
kernel: efi: mem06: type=4, attr=0xf,
range=[0x0420-0x0440) (2MB)
kernel: efi: mem07: type=7, attr=0xf,
range=[0x0440-0x045c6000) (1MB)
kernel: efi: mem08: type=4, attr=0xf,
range=[0x045c6000-0x045e6000) (0MB)
kernel: efi: mem09: type=3, attr=0xf,
range=[0x045e6000-0x0460b000) (0MB)
kernel: efi: mem10: type=4, attr=0xf,
range=[0x0460b000-0x04613000) (0MB)
kernel: efi: mem11: type=3, attr=0xf,
range=[0x04613000-0x0462b000) (0MB)
kernel: efi: mem12: type=7, attr=0xf,
range=[0x0462b000-0x0480) (1MB)
kernel: efi: mem13: type=2, attr=0xf,
range=[0x0480-0x00

[tip:x86/boot] x86/kexec: Add the ACPI NVS region to the ident map

2019-06-10 Thread tip-bot for Kairui Song
Commit-ID:  5a949b38839e284b1307540c56b03caf57da9736
Gitweb: https://git.kernel.org/tip/5a949b38839e284b1307540c56b03caf57da9736
Author: Kairui Song 
AuthorDate: Mon, 10 Jun 2019 15:36:17 +0800
Committer:  Borislav Petkov 
CommitDate: Mon, 10 Jun 2019 22:00:26 +0200

x86/kexec: Add the ACPI NVS region to the ident map

With the recent addition of RSDP parsing in the decompression stage,
a kexec-ed kernel now needs ACPI tables to be covered by the identity
mapping. And in commit

  6bbeb276b71f ("x86/kexec: Add the EFI system tables and ACPI tables to the 
ident map")

the ACPI tables memory region was added to the ident map.

But some machines have only an ACPI NVS memory region and the ACPI
tables are located in that region. In such case, the kexec-ed kernel
will still fail when trying to access ACPI tables if they're not mapped.

So add the NVS memory region to the ident map as well.

 [ bp: Massage. ]

Fixes: 6bbeb276b71f ("x86/kexec: Add the EFI system tables and ACPI tables to 
the ident map")
Suggested-by: Junichi Nomura 
Signed-off-by: Kairui Song 
Signed-off-by: Borislav Petkov 
Tested-by: Junichi Nomura 
Cc: Baoquan He 
Cc: Chao Fan 
Cc: Dave Young 
Cc: Dirk van der Merwe 
Cc: "H. Peter Anvin" 
Cc: Ingo Molnar 
Cc: ke...@lists.infradead.org
Cc: Lianbo Jiang 
Cc: "Rafael J. Wysocki" 
Cc: Thomas Gleixner 
Cc: x86-ml 
Link: https://lkml.kernel.org/r/20190610073617.19767-1-kas...@redhat.com
---
 arch/x86/kernel/machine_kexec_64.c | 18 +++---
 1 file changed, 15 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index 3c77bdf7b32a..b2b88dcaaf88 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -54,14 +54,26 @@ static int mem_region_callback(struct resource *res, void *arg)
 static int
 map_acpi_tables(struct x86_mapping_info *info, pgd_t *level4p)
 {
-   unsigned long flags = IORESOURCE_MEM | IORESOURCE_BUSY;
struct init_pgtable_data data;
+   unsigned long flags;
+   int ret;
 
data.info = info;
data.level4p = level4p;
flags = IORESOURCE_MEM | IORESOURCE_BUSY;
-   return walk_iomem_res_desc(IORES_DESC_ACPI_TABLES, flags, 0, -1,
-  &data, mem_region_callback);
+
+   ret = walk_iomem_res_desc(IORES_DESC_ACPI_TABLES, flags, 0, -1,
+ &data, mem_region_callback);
+   if (ret && ret != -EINVAL)
+   return ret;
+
+   /* ACPI tables could be located in ACPI Non-volatile Storage region */
+   ret = walk_iomem_res_desc(IORES_DESC_ACPI_NV_STORAGE, flags, 0, -1,
+ &data, mem_region_callback);
+   if (ret && ret != -EINVAL)
+   return ret;
+
+   return 0;
 }
 #else
static int map_acpi_tables(struct x86_mapping_info *info, pgd_t *level4p) { return 0; }


[tip:x86/boot] x86/kexec: Add the EFI system tables and ACPI tables to the ident map

2019-06-06 Thread tip-bot for Kairui Song
Commit-ID:  6bbeb276b71f06c5267bfd154629b1bec82e7136
Gitweb: https://git.kernel.org/tip/6bbeb276b71f06c5267bfd154629b1bec82e7136
Author: Kairui Song 
AuthorDate: Mon, 29 Apr 2019 08:23:18 +0800
Committer:  Borislav Petkov 
CommitDate: Thu, 6 Jun 2019 20:13:48 +0200

x86/kexec: Add the EFI system tables and ACPI tables to the ident map

Currently, only the whole physical memory is identity-mapped for the
kexec kernel and the regions reserved by firmware are ignored.

However, the recent addition of RSDP parsing in the decompression stage
and especially:

  33f0df8d843d ("x86/boot: Search for RSDP in the EFI tables")

which tries to access EFI system tables and to dig out the RSDP address
from there, becomes a problem because in certain configurations, they
might not be mapped in the kexec'ed kernel's address space.

What is more, this problem doesn't appear on all systems because the
kexec kernel uses gigabyte pages to build the identity mapping. And
the EFI system tables and ACPI tables can, depending on the system
configuration, end up being mapped as part of all physical memory, if
they share the same 1 GB area with the physical memory.

Therefore, make sure they're always mapped.

 [ bp: productize half-baked patch:
   - rewrite commit message.
   - correct the map_acpi_tables() function name in the !ACPI case. ]

Signed-off-by: Kairui Song 
Signed-off-by: Baoquan He 
Signed-off-by: Borislav Petkov 
Tested-by: Dirk van der Merwe 
Cc: dyo...@redhat.com
Cc: fanc.f...@cn.fujitsu.com
Cc: "H. Peter Anvin" 
Cc: Ingo Molnar 
Cc: j-nom...@ce.jp.nec.com
Cc: ke...@lists.infradead.org
Cc: "Kirill A. Shutemov" 
Cc: Lianbo Jiang 
Cc: Tetsuo Handa 
Cc: Thomas Gleixner 
Cc: x86-ml 
Link: https://lkml.kernel.org/r/20190429002318.GA25400@MiWiFi-R3L-srv
---
 arch/x86/kernel/machine_kexec_64.c | 75 ++
 1 file changed, 75 insertions(+)

diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index ceba408ea982..3c77bdf7b32a 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -18,6 +18,7 @@
 #include 
 #include 
 #include 
+#include <linux/efi.h>
 
 #include 
 #include 
@@ -29,6 +30,43 @@
 #include 
 #include 
 
+#ifdef CONFIG_ACPI
+/*
+ * Used while adding mapping for ACPI tables.
+ * Can be reused when other iomem regions need be mapped
+ */
+struct init_pgtable_data {
+   struct x86_mapping_info *info;
+   pgd_t *level4p;
+};
+
+static int mem_region_callback(struct resource *res, void *arg)
+{
+   struct init_pgtable_data *data = arg;
+   unsigned long mstart, mend;
+
+   mstart = res->start;
+   mend = mstart + resource_size(res) - 1;
+
+   return kernel_ident_mapping_init(data->info, data->level4p, mstart, mend);
+}
+
+static int
+map_acpi_tables(struct x86_mapping_info *info, pgd_t *level4p)
+{
+   unsigned long flags = IORESOURCE_MEM | IORESOURCE_BUSY;
+   struct init_pgtable_data data;
+
+   data.info = info;
+   data.level4p = level4p;
+   flags = IORESOURCE_MEM | IORESOURCE_BUSY;
+   return walk_iomem_res_desc(IORES_DESC_ACPI_TABLES, flags, 0, -1,
+  &data, mem_region_callback);
+}
+#else
+static int map_acpi_tables(struct x86_mapping_info *info, pgd_t *level4p) { return 0; }
+#endif
+
 #ifdef CONFIG_KEXEC_FILE
 const struct kexec_file_ops * const kexec_file_loaders[] = {
&kexec_bzImage64_ops,
@@ -36,6 +74,31 @@ const struct kexec_file_ops * const kexec_file_loaders[] = {
 };
 #endif
 
+static int
+map_efi_systab(struct x86_mapping_info *info, pgd_t *level4p)
+{
+#ifdef CONFIG_EFI
+   unsigned long mstart, mend;
+
+   if (!efi_enabled(EFI_BOOT))
+   return 0;
+
+   mstart = (boot_params.efi_info.efi_systab |
+   ((u64)boot_params.efi_info.efi_systab_hi<<32));
+
+   if (efi_enabled(EFI_64BIT))
+   mend = mstart + sizeof(efi_system_table_64_t);
+   else
+   mend = mstart + sizeof(efi_system_table_32_t);
+
+   if (!mstart)
+   return 0;
+
+   return kernel_ident_mapping_init(info, level4p, mstart, mend);
+#endif
+   return 0;
+}
+
 static void free_transition_pgtable(struct kimage *image)
 {
free_page((unsigned long)image->arch.p4d);
@@ -159,6 +222,18 @@ static int init_pgtable(struct kimage *image, unsigned long start_pgtable)
return result;
}
 
+   /*
+* Prepare EFI systab and ACPI tables for kexec kernel since they are
+* not covered by pfn_mapped.
+*/
+   result = map_efi_systab(&info, level4p);
+   if (result)
+   return result;
+
+   result = map_acpi_tables(&info, level4p);
+   if (result)
+   return result;
+
return init_transition_pgtable(image, level4p);
 }
 


[PATCH v5] vmcore: Add a kernel parameter novmcoredd

2019-05-30 Thread Kairui Song
Since commit 2724273e8fd0 ("vmcore: add API to collect hardware dump in
second kernel"), drivers are allowed to add device-related dump data to
vmcore as they want by using the device dump API. This has a potential
issue: the data is stored in memory, and drivers may append too much
data and use too much memory. The vmcore is typically used in a kdump
kernel which runs in a pre-reserved small chunk of memory, so as a
result this can make kdump entirely unusable due to OOM issues.

So introduce a new 'novmcoredd' command line option. Users can disable
device dump to reduce memory usage. This is helpful if device dump is
using too much memory; disabling device dump makes sure a regular
vmcore without device dump data is still available.

Signed-off-by: Kairui Song 
Reviewed-by: Bhupesh Sharma 
Acked-by: Dave Young 

---

Hi Andrew, sorry for the trouble, but could you help pick up this one
instead for the "vmcore: Add a kernel parameter novmcoredd" patch? The
previous one is in the mm tree but failed to compile when CONFIG_MODULES
is not set; I fixed this issue and carried over other things like your
doc fix, thanks!

 Update from V4:
  - Document adjust by Andrew Morton, also move the text to a better
position
  - Fix compile error when CONFIG_MODULES is not set
  - Return EPERM instead of EINVAL when device dump is disabled as
suggested by Dave Young

 Update from V3:
  - Use novmcoredd instead of vmcore_device_dump. Use
vmcore_device_dump and make it off by default is confusing,
novmcoredd is a cleaner way to let user space be able to disable
device dump to save memory.

 Update from V2:
  - Improve related docs

 Update from V1:
  - Use bool parameter to turn it on/off instead of letting user give
the size limit. Size of device dump is hard to determine.

 Documentation/admin-guide/kernel-parameters.txt | 11 +++
 fs/proc/Kconfig |  3 ++-
 fs/proc/vmcore.c|  9 +
 3 files changed, 22 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 138f6664b2e2..90b25234d965 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3088,6 +3088,17 @@
 
nosync  [HW,M68K] Disables sync negotiation for all devices.
 
+   novmcoredd  [KNL,KDUMP]
+   Disable device dump. Device dump allows drivers to
+   append dump data to vmcore so you can collect driver
+   specified debug info.  Drivers can append the data
+   without any limit and this data is stored in memory,
+   so this may cause significant memory stress.  Disabling
+   device dump can help save memory but the driver debug
+   data will be no longer available.  This parameter
+   is only available when CONFIG_PROC_VMCORE_DEVICE_DUMP
+   is set.
+
nowatchdog  [KNL] Disable both lockup detectors, i.e.
soft-lockup and NMI watchdog (hard-lockup).
 
diff --git a/fs/proc/Kconfig b/fs/proc/Kconfig
index 62ee41b4bbd0..b74ea844abd5 100644
--- a/fs/proc/Kconfig
+++ b/fs/proc/Kconfig
@@ -58,7 +58,8 @@ config PROC_VMCORE_DEVICE_DUMP
  snapshot.
 
  If you say Y here, the collected device dumps will be added
- as ELF notes to /proc/vmcore.
+ as ELF notes to /proc/vmcore. You can still disable device
+ dump using the kernel command line option 'novmcoredd'.
 
 config PROC_SYSCTL
bool "Sysctl support (/proc/sys)" if EXPERT
diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
index 7bb96fdd38ad..936e9dbbfbec 100644
--- a/fs/proc/vmcore.c
+++ b/fs/proc/vmcore.c
@@ -26,6 +26,7 @@
 #include 
 #include 
 #include 
+#include <linux/moduleparam.h>
 #include 
 #include 
 #include "internal.h"
@@ -54,6 +55,9 @@ static struct proc_dir_entry *proc_vmcore;
 /* Device Dump list and mutex to synchronize access to list */
 static LIST_HEAD(vmcoredd_list);
 static DEFINE_MUTEX(vmcoredd_mutex);
+
+static bool vmcoredd_disabled;
+core_param(novmcoredd, vmcoredd_disabled, bool, 0);
 #endif /* CONFIG_PROC_VMCORE_DEVICE_DUMP */
 
 /* Device Dump Size */
@@ -1452,6 +1456,11 @@ int vmcore_add_device_dump(struct vmcoredd_data *data)
size_t data_size;
int ret;
 
+   if (vmcoredd_disabled) {
+   pr_err_once("Device dump is disabled\n");
+   return -EPERM;
+   }
+
if (!data || !strlen(data->dump_name) ||
!data->vmcoredd_callback || !data->size)
return -EINVAL;
-- 
2.21.0
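
As a usage note: with novmcoredd on the command line, a driver's dump
registration now fails early with -EPERM. A minimal sketch of a caller,
using the struct vmcoredd_data fields checked above (the mydrv_* names
are hypothetical, and the declarations are assumed to come from
linux/crash_dump.h):

#include <linux/crash_dump.h>

static int mydrv_collect(struct vmcoredd_data *data, void *buf)
{
        /* Fill buf with up to data->size bytes of device state. */
        return 0;
}

static int mydrv_register_dump(void)
{
        static struct vmcoredd_data dd = {
                .dump_name         = "mydrv",
                .size              = 2 * 1024 * 1024,
                .vmcoredd_callback = mydrv_collect,
        };
        int ret = vmcore_add_device_dump(&dd);

        if (ret == -EPERM)      /* novmcoredd was set on the command line */
                pr_info("mydrv: device dump disabled by novmcoredd\n");
        return ret;
}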



Re: Getting empty callchain from perf_callchain_kernel()

2019-05-27 Thread Kairui Song
On Sat, May 25, 2019 at 7:23 AM Josh Poimboeuf  wrote:
>
> On Fri, May 24, 2019 at 10:20:52AM +0800, Kairui Song wrote:
> > On Fri, May 24, 2019 at 1:27 AM Josh Poimboeuf  wrote:
> > >
> > > On Fri, May 24, 2019 at 12:41:59AM +0800, Kairui Song wrote:
> > > >  On Thu, May 23, 2019 at 11:24 PM Josh Poimboeuf  
> > > > wrote:
> > > > >
> > > > > On Thu, May 23, 2019 at 10:50:24PM +0800, Kairui Song wrote:
> > > > > > > > Hi Josh, this still won't fix the problem.
> > > > > > > >
> > > > > > > > Problem is not (or not only) with ___bpf_prog_run, what 
> > > > > > > > actually went
> > > > > > > > wrong is with the JITed bpf code.
> > > > > > >
> > > > > > > There seem to be a bunch of issues.  My patch at least fixes the 
> > > > > > > failing
> > > > > > > selftest reported by Alexei for ORC.
> > > > > > >
> > > > > > > How can I recreate your issue?
> > > > > >
> > > > > > Hmm, I used bcc's example to attach bpf to trace point, and with 
> > > > > > that
> > > > > > fix stack trace is still invalid.
> > > > > >
> > > > > > CMD I used with bcc:
> > > > > > python3 ./tools/stackcount.py t:sched:sched_fork
> > > > >
> > > > > I've had problems in the past getting bcc to build, so I was hoping it
> > > > > was reproducible with a standalone selftest.
> > > > >
> > > > > > And I just had another try applying your patch, self test is also 
> > > > > > failing.
> > > > >
> > > > > Is it the same selftest reported by Alexei?
> > > > >
> > > > >   test_stacktrace_map:FAIL:compare_map_keys stackid_hmap vs. stackmap 
> > > > > err -1 errno 2
> > > > >
> > > > > > I'm applying on my local master branch, a few days older than
> > > > > > upstream, I can update and try again, am I missing anything?
> > > > >
> > > > > The above patch had some issues, so with some configs you might see an
> > > > > objtool warning for ___bpf_prog_run(), in which case the patch doesn't
> > > > > fix the test_stacktrace_map selftest.
> > > > >
> > > > > Here's the latest version which should fix it in all cases (based on
> > > > > tip/master):
> > > > >
> > > > >   
> > > > > https://git.kernel.org/pub/scm/linux/kernel/git/jpoimboe/linux.git/commit/?h=bpf-orc-fix
> > > >
> > > > Hmm, I still get the failure:
> > > > test_stacktrace_map:FAIL:compare_map_keys stackid_hmap vs. stackmap
> > > > err -1 errno 2
> > > >
> > > > And I didn't see how this will fix the issue. As long as ORC need to
> > > > unwind through the JITed code it will fail. And that will happen
> > > > before reaching ___bpf_prog_run.
> > >
> > > Ok, I was able to recreate by doing
> > >
> > >   echo 1 > /proc/sys/net/core/bpf_jit_enable
> > >
> > > first.  I'm guessing you have CONFIG_BPF_JIT_ALWAYS_ON.
> > >
> >
> > Yes, with JIT off it will be fixed. I can confirm that.
>
> Here's a tentative BPF fix for the JIT frame pointer issue.  It was a
> bit harder than I expected.  Encoding r12 as a base register requires a
> SIB byte, so I had to add support for encoding that.  I also simplified
> the prologue to resemble a GCC prologue, which decreases the prologue
> size quite a bit.
>
> Next week I can work on the corresponding ORC change.  Then I can clean
> all the patches up and submit them properly.
>
> diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
> index afabf597c855..c9b4503558c9 100644
> --- a/arch/x86/net/bpf_jit_comp.c
> +++ b/arch/x86/net/bpf_jit_comp.c
> @@ -104,9 +104,8 @@ static int bpf_size_to_x86_bytes(int bpf_size)
>  /*
>   * The following table maps BPF registers to x86-64 registers.
>   *
> - * x86-64 register R12 is unused, since if used as base address
> - * register in load/store instructions, it always needs an
> - * extra byte of encoding and is callee saved.
> + * RBP isn't used; it needs to be preserved to allow the unwinder to move
> + * through generated code stacks.
>   *
>   * Also x86-64 register R9 is unused. x86-64 register R10 is
>   * used fo

Re: Getting empty callchain from perf_callchain_kernel()

2019-05-23 Thread Kairui Song
On Fri, May 24, 2019 at 1:27 AM Josh Poimboeuf  wrote:
>
> On Fri, May 24, 2019 at 12:41:59AM +0800, Kairui Song wrote:
> >  On Thu, May 23, 2019 at 11:24 PM Josh Poimboeuf  
> > wrote:
> > >
> > > On Thu, May 23, 2019 at 10:50:24PM +0800, Kairui Song wrote:
> > > > > > Hi Josh, this still won't fix the problem.
> > > > > >
> > > > > > Problem is not (or not only) with ___bpf_prog_run, what actually 
> > > > > > went
> > > > > > wrong is with the JITed bpf code.
> > > > >
> > > > > There seem to be a bunch of issues.  My patch at least fixes the 
> > > > > failing
> > > > > selftest reported by Alexei for ORC.
> > > > >
> > > > > How can I recreate your issue?
> > > >
> > > > Hmm, I used bcc's example to attach bpf to trace point, and with that
> > > > fix stack trace is still invalid.
> > > >
> > > > CMD I used with bcc:
> > > > python3 ./tools/stackcount.py t:sched:sched_fork
> > >
> > > I've had problems in the past getting bcc to build, so I was hoping it
> > > was reproducible with a standalone selftest.
> > >
> > > > And I just had another try applying your patch, self test is also 
> > > > failing.
> > >
> > > Is it the same selftest reported by Alexei?
> > >
> > >   test_stacktrace_map:FAIL:compare_map_keys stackid_hmap vs. stackmap err 
> > > -1 errno 2
> > >
> > > > I'm applying on my local master branch, a few days older than
> > > > upstream, I can update and try again, am I missing anything?
> > >
> > > The above patch had some issues, so with some configs you might see an
> > > objtool warning for ___bpf_prog_run(), in which case the patch doesn't
> > > fix the test_stacktrace_map selftest.
> > >
> > > Here's the latest version which should fix it in all cases (based on
> > > tip/master):
> > >
> > >   
> > > https://git.kernel.org/pub/scm/linux/kernel/git/jpoimboe/linux.git/commit/?h=bpf-orc-fix
> >
> > Hmm, I still get the failure:
> > test_stacktrace_map:FAIL:compare_map_keys stackid_hmap vs. stackmap
> > err -1 errno 2
> >
> > And I didn't see how this will fix the issue. As long as ORC need to
> > unwind through the JITed code it will fail. And that will happen
> > before reaching ___bpf_prog_run.
>
> Ok, I was able to recreate by doing
>
>   echo 1 > /proc/sys/net/core/bpf_jit_enable
>
> first.  I'm guessing you have CONFIG_BPF_JIT_ALWAYS_ON.
>

Yes, with JIT off it will be fixed. I can confirm that.

--
Best Regards,
Kairui Song


Re: Getting empty callchain from perf_callchain_kernel()

2019-05-23 Thread Kairui Song
 On Thu, May 23, 2019 at 11:24 PM Josh Poimboeuf  wrote:
>
> On Thu, May 23, 2019 at 10:50:24PM +0800, Kairui Song wrote:
> > > > Hi Josh, this still won't fix the problem.
> > > >
> > > > Problem is not (or not only) with ___bpf_prog_run, what actually went
> > > > wrong is with the JITed bpf code.
> > >
> > > There seem to be a bunch of issues.  My patch at least fixes the failing
> > > selftest reported by Alexei for ORC.
> > >
> > > How can I recreate your issue?
> >
> > Hmm, I used bcc's example to attach bpf to trace point, and with that
> > fix stack trace is still invalid.
> >
> > CMD I used with bcc:
> > python3 ./tools/stackcount.py t:sched:sched_fork
>
> I've had problems in the past getting bcc to build, so I was hoping it
> was reproducible with a standalone selftest.
>
> > And I just had another try applying your patch, self test is also failing.
>
> Is it the same selftest reported by Alexei?
>
>   test_stacktrace_map:FAIL:compare_map_keys stackid_hmap vs. stackmap err -1 
> errno 2
>
> > I'm applying on my local master branch, a few days older than
> > upstream, I can update and try again, am I missing anything?
>
> The above patch had some issues, so with some configs you might see an
> objtool warning for ___bpf_prog_run(), in which case the patch doesn't
> fix the test_stacktrace_map selftest.
>
> Here's the latest version which should fix it in all cases (based on
> tip/master):
>
>   
> https://git.kernel.org/pub/scm/linux/kernel/git/jpoimboe/linux.git/commit/?h=bpf-orc-fix

Hmm, I still get the failure:
test_stacktrace_map:FAIL:compare_map_keys stackid_hmap vs. stackmap
err -1 errno 2

And I don't see how this will fix the issue. As long as ORC needs to
unwind through the JITed code it will fail, and that will happen
before reaching ___bpf_prog_run.

>
> > > > For frame pointer unwinder, it seems the JITed bpf code will have a
> > > > shifted "BP" register? (arch/x86/net/bpf_jit_comp.c:217), so if we can
> > > > unshift it properly then it will work.
> > >
> > > Yeah, that looks like a frame pointer bug in emit_prologue().
> > >
> > > > I tried below code, and problem is fixed (only for frame pointer
> > > > unwinder though). Need to find a better way to detect and do any
> > > > similar trick for bpf part, if this is a feasible way to fix it:
> > > >
> > > > diff --git a/arch/x86/kernel/unwind_frame.c 
> > > > b/arch/x86/kernel/unwind_frame.c
> > > > index 9b9fd4826e7a..2c0fa2aaa7e4 100644
> > > > --- a/arch/x86/kernel/unwind_frame.c
> > > > +++ b/arch/x86/kernel/unwind_frame.c
> > > > @@ -330,8 +330,17 @@ bool unwind_next_frame(struct unwind_state *state)
> > > > }
> > > >
> > > > /* Move to the next frame if it's safe: */
> > > > -   if (!update_stack_state(state, next_bp))
> > > > -   goto bad_address;
> > > > +   if (!update_stack_state(state, next_bp)) {
> > > > +   // Try again with shifted BP
> > > > +   state->bp += 5; // see AUX_STACK_SPACE
> > > > +   next_bp = (unsigned long *)READ_ONCE_TASK_STACK(state->task, *state->bp);
> > > > +   // Clean and refetch stack info, it's marked as error 
> > > > outed
> > > > +   state->stack_mask = 0;
> > > > +   get_stack_info(next_bp, state->task,
> > > > +   &state->stack_info, &state->stack_mask);
> > > > +   if (!update_stack_state(state, next_bp)) {
> > > > +   goto bad_address;
> > > > +   }
> > > > +   }
> > > >
> > > > return true;
> > >
> > > Nack.
> > >
> > > > For ORC unwinder, I think the unwinder can't find any info about the
> > > > JITed part. Maybe if can let it just skip the JITed part and go to
> > > > kernel context, then should be good enough.
> > >
> > > If it's starting from a fake pt_regs then that's going to be a
> > > challenge.
> > >
> > > Will the JIT code always have the same stack layout?  If so then we
> > > could hard code that knowledge in ORC.  Or even better, create a generic
> > > interface for ORC to query the creator of the generated code about the
> > > stack layout.
> >
> > I think yes.
> >
> > Not sure why we have the BP

Re: Getting empty callchain from perf_callchain_kernel()

2019-05-23 Thread Kairui Song
On Thu, May 23, 2019 at 9:32 PM Josh Poimboeuf  wrote:
>
> On Thu, May 23, 2019 at 02:48:11PM +0800, Kairui Song wrote:
> > On Thu, May 23, 2019 at 7:46 AM Josh Poimboeuf  wrote:
> > >
> > > On Wed, May 22, 2019 at 12:45:17PM -0500, Josh Poimboeuf wrote:
> > > > On Wed, May 22, 2019 at 02:49:07PM +, Alexei Starovoitov wrote:
> > > > > The one that is broken is prog_tests/stacktrace_map.c
> > > > > There we attach bpf to standard tracepoint where
> > > > > kernel suppose to collect pt_regs before calling into bpf.
> > > > > And that's what bpf_get_stackid_tp() is doing.
> > > > > It passes pt_regs (that was collected before any bpf)
> > > > > into bpf_get_stackid() which calls get_perf_callchain().
> > > > > Same thing with kprobes, uprobes.
> > > >
> > > > Is it trying to unwind through ___bpf_prog_run()?
> > > >
> > > > If so, that would at least explain why ORC isn't working.  Objtool
> > > > currently ignores that function because it can't follow the jump table.
> > >
> > > Here's a tentative fix (for ORC, at least).  I'll need to make sure this
> > > doesn't break anything else.
> > >
> > > diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> > > index 242a643af82f..1d9a7cc4b836 100644
> > > --- a/kernel/bpf/core.c
> > > +++ b/kernel/bpf/core.c
> > > @@ -1562,7 +1562,6 @@ static u64 ___bpf_prog_run(u64 *regs, const struct 
> > > bpf_insn *insn, u64 *stack)
> > > BUG_ON(1);
> > > return 0;
> > >  }
> > > -STACK_FRAME_NON_STANDARD(___bpf_prog_run); /* jump table */
> > >
> > >  #define PROG_NAME(stack_size) __bpf_prog_run##stack_size
> > >  #define DEFINE_BPF_PROG_RUN(stack_size) \
> > > diff --git a/tools/objtool/check.c b/tools/objtool/check.c
> > > index 172f99195726..2567027fce95 100644
> > > --- a/tools/objtool/check.c
> > > +++ b/tools/objtool/check.c
> > > @@ -1033,13 +1033,6 @@ static struct rela *find_switch_table(struct 
> > > objtool_file *file,
> > > if (text_rela->type == R_X86_64_PC32)
> > > table_offset += 4;
> > >
> > > -   /*
> > > -* Make sure the .rodata address isn't associated with a
> > > -* symbol.  gcc jump tables are anonymous data.
> > > -*/
> > > -   if (find_symbol_containing(rodata_sec, table_offset))
> > > -   continue;
> > > -
> > > rodata_rela = find_rela_by_dest(rodata_sec, table_offset);
> > > if (rodata_rela) {
> > > /*
> >
> > Hi Josh, this still won't fix the problem.
> >
> > Problem is not (or not only) with ___bpf_prog_run, what actually went
> > wrong is with the JITed bpf code.
>
> There seem to be a bunch of issues.  My patch at least fixes the failing
> selftest reported by Alexei for ORC.
>
> How can I recreate your issue?

Hmm, I used bcc's example to attach bpf to a tracepoint, and with that
fix the stack trace is still invalid.

CMD I used with bcc:
python3 ./tools/stackcount.py t:sched:sched_fork

And I just had another try applying your patch; the self test is also failing.

I'm applying it on my local master branch, a few days older than
upstream. I can update and try again; am I missing anything?

>
> > For frame pointer unwinder, it seems the JITed bpf code will have a
> > shifted "BP" register? (arch/x86/net/bpf_jit_comp.c:217), so if we can
> > unshift it properly then it will work.
>
> Yeah, that looks like a frame pointer bug in emit_prologue().
>
> > I tried below code, and problem is fixed (only for frame pointer
> > unwinder though). Need to find a better way to detect and do any
> > similar trick for bpf part, if this is a feasible way to fix it:
> >
> > diff --git a/arch/x86/kernel/unwind_frame.c b/arch/x86/kernel/unwind_frame.c
> > index 9b9fd4826e7a..2c0fa2aaa7e4 100644
> > --- a/arch/x86/kernel/unwind_frame.c
> > +++ b/arch/x86/kernel/unwind_frame.c
> > @@ -330,8 +330,17 @@ bool unwind_next_frame(struct unwind_state *state)
> > }
> >
> > /* Move to the next frame if it's safe: */
> > -   if (!update_stack_state(state, next_bp))
> > -   goto bad_address;
> > +   if (!update_stack_state(state, next_bp)) {
> > +   // Try again with shifted BP
> > +   state->bp += 

Re: Getting empty callchain from perf_callchain_kernel()

2019-05-23 Thread Kairui Song
 On Thu, May 23, 2019 at 4:28 PM Song Liu  wrote:
>
> > On May 22, 2019, at 11:48 PM, Kairui Song  wrote:
> >
> > On Thu, May 23, 2019 at 7:46 AM Josh Poimboeuf  wrote:
> >>
> >> On Wed, May 22, 2019 at 12:45:17PM -0500, Josh Poimboeuf wrote:
> >>> On Wed, May 22, 2019 at 02:49:07PM +, Alexei Starovoitov wrote:
> >>>> The one that is broken is prog_tests/stacktrace_map.c
> >>>> There we attach bpf to standard tracepoint where
> >>>> kernel suppose to collect pt_regs before calling into bpf.
> >>>> And that's what bpf_get_stackid_tp() is doing.
> >>>> It passes pt_regs (that was collected before any bpf)
> >>>> into bpf_get_stackid() which calls get_perf_callchain().
> >>>> Same thing with kprobes, uprobes.
> >>>
> >>> Is it trying to unwind through ___bpf_prog_run()?
> >>>
> >>> If so, that would at least explain why ORC isn't working.  Objtool
> >>> currently ignores that function because it can't follow the jump table.
> >>
> >> Here's a tentative fix (for ORC, at least).  I'll need to make sure this
> >> doesn't break anything else.
> >>
> >> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> >> index 242a643af82f..1d9a7cc4b836 100644
> >> --- a/kernel/bpf/core.c
> >> +++ b/kernel/bpf/core.c
> >> @@ -1562,7 +1562,6 @@ static u64 ___bpf_prog_run(u64 *regs, const struct 
> >> bpf_insn *insn, u64 *stack)
> >>BUG_ON(1);
> >>return 0;
> >> }
> >> -STACK_FRAME_NON_STANDARD(___bpf_prog_run); /* jump table */
> >>
> >> #define PROG_NAME(stack_size) __bpf_prog_run##stack_size
> >> #define DEFINE_BPF_PROG_RUN(stack_size) \
> >> diff --git a/tools/objtool/check.c b/tools/objtool/check.c
> >> index 172f99195726..2567027fce95 100644
> >> --- a/tools/objtool/check.c
> >> +++ b/tools/objtool/check.c
> >> @@ -1033,13 +1033,6 @@ static struct rela *find_switch_table(struct 
> >> objtool_file *file,
> >>if (text_rela->type == R_X86_64_PC32)
> >>table_offset += 4;
> >>
> >> -   /*
> >> -* Make sure the .rodata address isn't associated with a
> >> -* symbol.  gcc jump tables are anonymous data.
> >> -*/
> >> -   if (find_symbol_containing(rodata_sec, table_offset))
> >> -   continue;
> >> -
> >>rodata_rela = find_rela_by_dest(rodata_sec, table_offset);
> >>if (rodata_rela) {
> >>/*
> >
> > Hi Josh, this still won't fix the problem.
> >
> > Problem is not (or not only) with ___bpf_prog_run, what actually went
> > wrong is with the JITed bpf code.
> >
> > For frame pointer unwinder, it seems the JITed bpf code will have a
> > shifted "BP" register? (arch/x86/net/bpf_jit_comp.c:217), so if we can
> > unshift it properly then it will work.
> >
> > I tried below code, and problem is fixed (only for frame pointer
> > unwinder though). Need to find a better way to detect and do any
> > similar trick for bpf part, if this is a feasible way to fix it:
> >
> > diff --git a/arch/x86/kernel/unwind_frame.c b/arch/x86/kernel/unwind_frame.c
> > index 9b9fd4826e7a..2c0fa2aaa7e4 100644
> > --- a/arch/x86/kernel/unwind_frame.c
> > +++ b/arch/x86/kernel/unwind_frame.c
> > @@ -330,8 +330,17 @@ bool unwind_next_frame(struct unwind_state *state)
> >}
> >
> >/* Move to the next frame if it's safe: */
> > -   if (!update_stack_state(state, next_bp))
> > -   goto bad_address;
> > +   if (!update_stack_state(state, next_bp)) {
> > +   // Try again with shifted BP
> > +   state->bp += 5; // see AUX_STACK_SPACE
> > +   next_bp = (unsigned long *)READ_ONCE_TASK_STACK(state->task, *state->bp);
> > +   // Clean and refetch stack info, it's marked as error outed
> > +   state->stack_mask = 0;
> > +   get_stack_info(next_bp, state->task,
> > +   &state->stack_info, &state->stack_mask);
> > +   if (!update_stack_state(state, next_bp)) {
> > +   goto bad_address;
> > +   }
> > +   }
> >
> >return true;
> >
> > For ORC unwinder, I think the unwinder can't find any info about the
>

Re: Getting empty callchain from perf_callchain_kernel()

2019-05-23 Thread Kairui Song
On Thu, May 23, 2019 at 7:46 AM Josh Poimboeuf  wrote:
>
> On Wed, May 22, 2019 at 12:45:17PM -0500, Josh Poimboeuf wrote:
> > On Wed, May 22, 2019 at 02:49:07PM +, Alexei Starovoitov wrote:
> > > The one that is broken is prog_tests/stacktrace_map.c
> > > There we attach bpf to standard tracepoint where
> > > kernel suppose to collect pt_regs before calling into bpf.
> > > And that's what bpf_get_stackid_tp() is doing.
> > > It passes pt_regs (that was collected before any bpf)
> > > into bpf_get_stackid() which calls get_perf_callchain().
> > > Same thing with kprobes, uprobes.
> >
> > Is it trying to unwind through ___bpf_prog_run()?
> >
> > If so, that would at least explain why ORC isn't working.  Objtool
> > currently ignores that function because it can't follow the jump table.
>
> Here's a tentative fix (for ORC, at least).  I'll need to make sure this
> doesn't break anything else.
>
> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> index 242a643af82f..1d9a7cc4b836 100644
> --- a/kernel/bpf/core.c
> +++ b/kernel/bpf/core.c
> @@ -1562,7 +1562,6 @@ static u64 ___bpf_prog_run(u64 *regs, const struct bpf_insn *insn, u64 *stack)
> BUG_ON(1);
> return 0;
>  }
> -STACK_FRAME_NON_STANDARD(___bpf_prog_run); /* jump table */
>
>  #define PROG_NAME(stack_size) __bpf_prog_run##stack_size
>  #define DEFINE_BPF_PROG_RUN(stack_size) \
> diff --git a/tools/objtool/check.c b/tools/objtool/check.c
> index 172f99195726..2567027fce95 100644
> --- a/tools/objtool/check.c
> +++ b/tools/objtool/check.c
> @@ -1033,13 +1033,6 @@ static struct rela *find_switch_table(struct objtool_file *file,
> if (text_rela->type == R_X86_64_PC32)
> table_offset += 4;
>
> -   /*
> -* Make sure the .rodata address isn't associated with a
> -* symbol.  gcc jump tables are anonymous data.
> -*/
> -   if (find_symbol_containing(rodata_sec, table_offset))
> -   continue;
> -
> rodata_rela = find_rela_by_dest(rodata_sec, table_offset);
> if (rodata_rela) {
> /*

Hi Josh, this still won't fix the problem.

The problem is not (or not only) with ___bpf_prog_run; what actually
goes wrong is the JITed bpf code.

For the frame pointer unwinder, it seems the JITed bpf code has a
shifted "BP" register (see arch/x86/net/bpf_jit_comp.c:217), so if we
can unshift it properly then it will work.

I tried the code below, and the problem is fixed (only for the frame
pointer unwinder though). We need to find a better way to detect the
bpf part and apply a similar trick, if this is a feasible way to fix it:

diff --git a/arch/x86/kernel/unwind_frame.c b/arch/x86/kernel/unwind_frame.c
index 9b9fd4826e7a..2c0fa2aaa7e4 100644
--- a/arch/x86/kernel/unwind_frame.c
+++ b/arch/x86/kernel/unwind_frame.c
@@ -330,8 +330,17 @@ bool unwind_next_frame(struct unwind_state *state)
}

/* Move to the next frame if it's safe: */
-   if (!update_stack_state(state, next_bp))
-   goto bad_address;
+   if (!update_stack_state(state, next_bp)) {
+   // Try again with shifted BP
+   state->bp += 5; // see AUX_STACK_SPACE
+   next_bp = (unsigned long *)READ_ONCE_TASK_STACK(state->task, *state->bp);
+   // Clean and refetch the stack info, it was marked as errored out
+   state->stack_mask = 0;
+   get_stack_info(next_bp, state->task,
+   &state->stack_info, &state->stack_mask);
+   if (!update_stack_state(state, next_bp)) {
+   goto bad_address;
+       }
+   }

return true;
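
For context on the magic number: state->bp is an unsigned long *, so
"+= 5" skips 5 * sizeof(long) == 40 bytes, which is my reading of the
AUX_STACK_SPACE offset the JIT prologue of that era subtracted from
RBP (treat the 40-byte value as an assumption):

/*
 * Illustrative sketch, not kernel code: undo the JIT prologue's
 * "sub rbp, AUX_STACK_SPACE" so BP points at the saved frame again.
 */
#define AUX_STACK_SPACE 40      /* assumed, matching "bp += 5" above */

static unsigned long *unshift_jit_bp(unsigned long *jit_bp)
{
        return jit_bp + AUX_STACK_SPACE / sizeof(unsigned long);
}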

For the ORC unwinder, I think the unwinder can't find any info about
the JITed part. Maybe if we can let it just skip the JITed part and go
to the kernel context, that should be good enough.





--
Best Regards,
Kairui Song


Re: [PATCH v2] perf/x86: always include regs->ip in callchain

2019-05-23 Thread Kairui Song
On Thu, May 23, 2019 at 1:34 PM Song Liu  wrote:
>
> Commit d15d356887e7 removes regs->ip for !perf_hw_regs(regs) case. This
> patch adds regs->ip back.
>
> Fixes: d15d356887e7 ("perf/x86: Make perf callchains work without 
> CONFIG_FRAME_POINTER")
> Cc: Kairui Song 
> Cc: Peter Zijlstra (Intel) 
> Signed-off-by: Song Liu 
> ---
>  arch/x86/events/core.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
> index f315425d8468..7b8a9eb4d5fd 100644
> --- a/arch/x86/events/core.c
> +++ b/arch/x86/events/core.c
> @@ -2402,9 +2402,9 @@ perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs)
> return;
> }
>
> +   if (perf_callchain_store(entry, regs->ip))
> +   return;
> if (perf_hw_regs(regs)) {
> -   if (perf_callchain_store(entry, regs->ip))
> -   return;
> unwind_start(&state, current, regs, NULL);
> } else {
> unwind_start(&state, current, NULL, (void *)regs->sp);
> --
> 2.17.1
>

Hi, this will make the !perf_hw_regs(regs) case print a doubled first
level in the stack trace, which is wrong. And the actual problem, that
the unwinder gives an empty calltrace in bpf, is still not fixed.
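
In other words, the pre-d15d356887e7 ordering only stored regs->ip for
real register snapshots; hoisting the store above the branch makes the
fake-regs path record the same frame twice. Roughly (a sketch of the
point being made, not a submitted patch):

if (perf_hw_regs(regs)) {
        /* Real snapshot: regs->ip is the sample point, store it. */
        if (perf_callchain_store(entry, regs->ip))
                return;
        unwind_start(&state, current, regs, NULL);
} else {
        /*
         * Fake regs: the unwinder already reports this frame, so
         * storing regs->ip up front would duplicate it.
         */
        unwind_start(&state, current, NULL, (void *)regs->sp);
}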

-- 
Best Regards,
Kairui Song


Re: Getting empty callchain from perf_callchain_kernel()

2019-05-19 Thread Kairui Song
On Sat, May 18, 2019 at 5:48 AM Song Liu  wrote:
>
>
>
> > On May 17, 2019, at 2:06 PM, Alexei Starovoitov  wrote:
> >
> > On 5/17/19 11:40 AM, Song Liu wrote:
> >> +Alexei, Daniel, and bpf
> >>
> >>> On May 17, 2019, at 2:10 AM, Peter Zijlstra  wrote:
> >>>
> >>> On Fri, May 17, 2019 at 04:15:39PM +0800, Kairui Song wrote:
> >>>> Hi, I think the actual problem is that bpf_get_stackid_tp (and maybe
> >>>> some other bfp functions) is now broken, or, strating an unwind
> >>>> directly inside a bpf program will end up strangely. It have following
> >>>> kernel message:
> >>>
> >>> Urgh, what is that bpf_get_stackid_tp() doing to get the regs? I can't
> >>> follow.
> >>
> >> I guess we need something like the following? (we should be able to
> >> optimize the PER_CPU stuff).
> >>
> >> Thanks,
> >> Song
> >>
> >>
> >> diff --git i/kernel/trace/bpf_trace.c w/kernel/trace/bpf_trace.c
> >> index f92d6ad5e080..c525149028a7 100644
> >> --- i/kernel/trace/bpf_trace.c
> >> +++ w/kernel/trace/bpf_trace.c
> >> @@ -696,11 +696,13 @@ static const struct bpf_func_proto bpf_perf_event_output_proto_tp = {
> >> .arg5_type  = ARG_CONST_SIZE_OR_ZERO,
> >>  };
> >>
> >> +static DEFINE_PER_CPU(struct pt_regs, bpf_stackid_tp_regs);
> >>  BPF_CALL_3(bpf_get_stackid_tp, void *, tp_buff, struct bpf_map *, map,
> >>u64, flags)
> >>  {
> >> -   struct pt_regs *regs = *(struct pt_regs **)tp_buff;
> >> +   struct pt_regs *regs = this_cpu_ptr(&bpf_stackid_tp_regs);
> >>
> >> +   perf_fetch_caller_regs(regs);
> >
> > No. pt_regs is already passed in. It's the first argument.
> > If we call perf_fetch_caller_regs() again the stack trace will be wrong.
> > bpf prog should not see itself, interpreter or all the frames in between.
>
> Thanks Alexei! I get it now.
>
> In bpf_get_stackid_tp(), the pt_regs is get by dereferencing the first field
> of tp_buff:
>
> struct pt_regs *regs = *(struct pt_regs **)tp_buff;
>
> tp_buff points to something like
>
> struct sched_switch_args {
> unsigned long long pad;
> char prev_comm[16];
> int prev_pid;
> int prev_prio;
> long long prev_state;
> char next_comm[16];
> int next_pid;
> int next_prio;
> };
>
> where the first field "pad" is a pointer to pt_regs.
>
> @Kairui, I think you confirmed that current code will give empty call trace
> with ORC unwinder? If that's the case, can we add regs->ip back? (as in the
> first email of this thread.
>
> Thanks,
> Song
>

Hi, thanks for the suggestion. Yes, we can add it; it should be a good
idea to always have the IP when the stack trace is not available.
But the stack trace is actually still broken: it will always give only
one level of stacktrace (the IP).

-- 
Best Regards,
Kairui Song


Re: Getting empty callchain from perf_callchain_kernel()

2019-05-19 Thread Kairui Song
On Fri, May 17, 2019 at 5:10 PM Peter Zijlstra  wrote:
>
> On Fri, May 17, 2019 at 04:15:39PM +0800, Kairui Song wrote:
> > Hi, I think the actual problem is that bpf_get_stackid_tp (and maybe
> > some other bfp functions) is now broken, or, strating an unwind
> > directly inside a bpf program will end up strangely. It have following
> > kernel message:
>
> Urgh, what is that bpf_get_stackid_tp() doing to get the regs? I can't
> follow.

bpf_get_stackid_tp will just use the regs passed to it from the
tracepoint, and then it will eventually call perf_get_callchain to get
the call chain.
With a tracepoint we have the fake regs, so the unwinder will start
from where it is called, use the fake regs as the indicator of the
target frame it wants, and keep unwinding until it reaches the actual
callsite.
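
A simplified sketch of that mechanism (illustrative, not the exact
kernel code):

/*
 * The tracepoint records a fake snapshot of its caller; the unwinder
 * then starts from its own location and uses the recorded SP as the
 * "first frame" marker, skipping everything below it.
 */
perf_fetch_caller_regs(&regs);                          /* fake regs */
unwind_start(&state, current, NULL, (void *)regs.sp);   /* sp = skip mark */
while (!unwind_done(&state)) {
        unsigned long addr = unwind_get_return_address(&state);

        if (!addr || perf_callchain_store(entry, addr))
                break;
        unwind_next_frame(&state);
}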

But if the stack trace is started within a bpf func call then it's broken...

If the unwinder could trace back through the bpf func call then there
would be no such problem.

For the frame pointer unwinder, setting the indicator flag
(X86_EFLAGS_FIXED) before the bpf call, and ensuring bp is also dumped,
could fix it (so it will start using the regs for bpf calls, like
before commit d15d356887e7). But for ORC I don't see a clear way to fix
the problem. A first thought is to dump some of the callee's regs for
ORC (IP, BP, SP, DI, DX, R10, R13, anything else?) in the tracepoint
handler, then use the flag to make ORC do one more unwind (because we
can't get the caller's regs, so use the callee's regs instead) before
actually giving output.

I had a try: for the frame pointer unwinder, mark the indicator flag
before calling bpf functions, and dump bp as well in the tracepoint.
Then with the frame pointer unwinder it works, test passed:

diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 1392d5e6e8d6..6f1192e9776b 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -302,12 +302,25 @@ extern unsigned long perf_misc_flags(struct pt_regs *regs);

 #include 

+#ifdef CONFIG_FRAME_POINTER
+static inline unsigned long caller_frame_pointer(void)
+{
+   return (unsigned long)__builtin_frame_address(1);
+}
+#else
+static inline unsigned long caller_frame_pointer(void)
+{
+   return 0;
+}
+#endif
+
 /*
  * We abuse bit 3 from flags to pass exact information, see perf_misc_flags
  * and the comment with PERF_EFLAGS_EXACT.
  */
 #define perf_arch_fetch_caller_regs(regs, __ip){   \
(regs)->ip = (__ip);\
+   (regs)->bp = caller_frame_pointer();\
(regs)->sp = (unsigned long)__builtin_frame_address(0); \
(regs)->cs = __KERNEL_CS;   \
regs->flags = 0;\
diff --git a/kernel/events/core.c b/kernel/events/core.c
index abbd4b3b96c2..ca7b95ee74f0 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -8549,6 +8549,7 @@ void perf_trace_run_bpf_submit(void *raw_data,
int size, int rctx,
   struct task_struct *task)
 {
if (bpf_prog_array_valid(call)) {
+   regs->flags |= X86_EFLAGS_FIXED;
*(struct pt_regs **)raw_data = regs;
if (!trace_call_bpf(call, raw_data) || hlist_empty(head)) {
perf_swevent_put_recursion_context(rctx);
@@ -8822,6 +8823,8 @@ static void bpf_overflow_handler(struct perf_event *event,
int ret = 0;

ctx.regs = perf_arch_bpf_user_pt_regs(regs);
+   ctx.regs->flags |= X86_EFLAGS_FIXED;
+
preempt_disable();
if (unlikely(__this_cpu_inc_return(bpf_prog_active) != 1))
goto out;
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index f92d6ad5e080..e1fa656677dc 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -497,6 +497,8 @@ u64 bpf_event_output(struct bpf_map *map, u64 flags, void *meta, u64 meta_size,
};

perf_fetch_caller_regs(regs);
+   regs->flags |= X86_EFLAGS_FIXED;
+
perf_sample_data_init(sd, 0, 0);
sd->raw = &raw;

@@ -831,6 +833,8 @@ BPF_CALL_5(bpf_perf_event_output_raw_tp, struct bpf_raw_tracepoint_args *, args,
struct pt_regs *regs = this_cpu_ptr(&bpf_raw_tp_regs);

perf_fetch_caller_regs(regs);
+   regs->flags |= X86_EFLAGS_FIXED;
+
return bpf_perf_event_output(regs, map, flags, data, size);
 }

@@ -851,6 +855,8 @@ BPF_CALL_3(bpf_get_stackid_raw_tp, struct bpf_raw_tracepoint_args *, args,
struct pt_regs *regs = this_cpu_ptr(&bpf_raw_tp_regs);

perf_fetch_caller_regs(regs);
+   regs->flags |= X86_EFLAGS_FIXED;
+
/* similar to bpf_perf_event_output_tp, but pt_regs fetched
differently */
return bpf_get_stackid((unsigned long) regs, (unsigned long) map,
   flags, 0, 0);
@@ -871,6 +877,8 @@ BPF_CALL_4(bpf_get_stack_raw_

Re: Getting empty callchain from perf_callchain_kernel()

2019-05-17 Thread Kairui Song
On Fri, May 17, 2019 at 4:15 PM Kairui Song  wrote:
>
> On Fri, May 17, 2019 at 4:11 PM Peter Zijlstra  wrote:
> >
> > On Fri, May 17, 2019 at 09:46:00AM +0200, Peter Zijlstra wrote:
> > > On Thu, May 16, 2019 at 11:51:55PM +, Song Liu wrote:
> > > > Hi,
> > > >
> > > > We found a failure with selftests/bpf/tests_prog in test_stacktrace_map 
> > > > (on bpf/master
> > > > branch).
> > > >
> > > > After digging into the code, we found that perf_callchain_kernel() is 
> > > > giving empty
> > > > callchain for tracepoint sched/sched_switch. And it seems related to 
> > > > commit
> > > >
> > > > d15d356887e770c5f2dcf963b52c7cb510c9e42d
> > > > ("perf/x86: Make perf callchains work without CONFIG_FRAME_POINTER")
> > > >
> > > > Before this commit, perf_callchain_kernel() returns callchain with 
> > > > regs->ip. With
> > > > this commit, regs->ip is not sent for !perf_hw_regs(regs) case.
> > >
> > > So while I think the below is indeed right; we should store regs->ip
> > > regardless of the unwind path chosen, I still think there's something
> > > fishy if this results in just the 1 entry.
> > >
> > > The sched/sched_switch event really should have a non-trivial stack.
> > >
> > > Let me see if I can reproduce with just perf.
> >
> > $ perf record -g -e "sched:sched_switch" -- make clean
> > $ perf report -D
> >
> > 12 904071759467 0x1790 [0xd0]: PERF_RECORD_SAMPLE(IP, 0x1): 7236/7236: 
> > 0x81c29562 period: 1 addr: 0
> > ... FP chain: nr:10
> > .  0: ff80
> > .  1: 81c29562
> > .  2: 81c29933
> > .  3: 8111f688
> > .  4: 81120b9d
> > .  5: 81120ce5
> > .  6: 8100254a
> > .  7: 81e0007d
> > .  8: fe00
> > .  9: 7f9b6cd9682a
> > ... thread: sh:7236
> > .. dso: /lib/modules/5.1.0-12177-g41bbb9129767/build/vmlinux
> >
> >
> > IOW, it seems to 'work'.
> >
>
> Hi, I think the actual problem is that bpf_get_stackid_tp (and maybe
> some other bpf functions) is now broken, or, starting an unwind
> directly inside a bpf program will end up strangely. It gives the
> following kernel message:
>
> WARNING: kernel stack frame pointer at 70cad07c in
> test_progs:1242 has bad value ffd4497e
>
> And in the debug message:
>
> [  160.460287] 6e117175: aa23a0e8
> (get_perf_callchain+0x148/0x280)
> [  160.460287] 02e8715f: 00c6bba0 (0xc6bba0)
> [  160.460288] b3d07758: 9ce3f979 (0x9ce3f979)
> [  160.460289] 55c97836: 9ce3f979 (0x9ce3f979)
> [  160.460289] 7cbb140a: 0001007f (0x1007f)
> [  160.460290] 7dc62ac9:  ...
> [  160.460290] 6b41e346: 1c7da01cd70c4000 (0x1c7da01cd70c4000)
> [  160.460291] f23d1826: d89cffc3ae80 (0xd89cffc3ae80)
> [  160.460292] f9a16017: 007f (0x7f)
> [  160.460292] a8e62d44:  ...
> [  160.460293] cbc83c97: b89d00d8d000 (0xb89d00d8d000)
> [  160.460293] 92842522: 007f (0x7f)
> [  160.460294] dafbec89: b89d00c6bc50 (0xb89d00c6bc50)
> [  160.460296] 0777e4cf: aa225d97 (bpf_get_stackid+0x77/0x470)
> [  160.460296] 9942ea16:  ...
> [  160.460297] a08006b1: 0001 (0x1)
> [  160.460298] 9f03b438: b89d00c6bc30 (0xb89d00c6bc30)
> [  160.460299] 6caf8b32: aa074fe8 (__do_page_fault+0x58/0x90)
> [  160.460300] 3a13d702:  ...
> [  160.460300] e2e2496d: 9ce3 (0x9ce3)
> [  160.460301] 8ee6b7c2: d89cffc3ae80 (0xd89cffc3ae80)
> [  160.460301] a8cf6d55:  ...
> [  160.460302] 59078076: d89cffc3ae80 (0xd89cffc3ae80)
> [  160.460303] c6aac739: 9ce3f1e18eb0 (0x9ce3f1e18eb0)
> [  160.460303] a39aff92: b89d00c6bc60 (0xb89d00c6bc60)
> [  160.460305] 97498bf2: aa1f4791 
> (bpf_get_stackid_tp+0x11/0x20)
> [  160.460306] 6992de1e: b89d00c6bc78 (0xb89d00c6bc78)
> [  160.460307] dacd0ce5: c0405676 (0xc0405676)
> [  160.460307] a81f2714:  ...
>
> # Note here is the invalid frame pointer
> [  160.460308] 70cad07c: b89d

Re: Getting empty callchain from perf_callchain_kernel()

2019-05-17 Thread Kairui Song
ab651be0
(event_sched_migrate_task+0xa0/0xa0)
[  160.460316] 355cf319:  ...
[  160.460316] 3b67f2ad: d89cffc3ae80 (0xd89cffc3ae80)
[  160.460316] 9a77e20b: 9ce3fba25000 (0x9ce3fba25000)
[  160.460317] 32cf7376: 0001 (0x1)
[  160.460317] 0051db74: b89d00c6bd20 (0xb89d00c6bd20)
[  160.460318] 40eb3bf7: aa22be81
(perf_trace_run_bpf_submit+0x41/0xb0)

Simply storing the IP still won't really fix the problem; it just
makes the test pass. I just had a try having the bpf functions set
X86_EFLAGS_FIXED in the flags and always dump bp, and that bypassed
this specific problem.

I was using the frame pointer unwinder to test this, and ORC seems
fine with it.

-- 
Best Regards,
Kairui Song


[tip:perf/core] perf/x86: Make perf callchains work without CONFIG_FRAME_POINTER

2019-04-29 Thread tip-bot for Kairui Song
Commit-ID:  d15d356887e770c5f2dcf963b52c7cb510c9e42d
Gitweb: https://git.kernel.org/tip/d15d356887e770c5f2dcf963b52c7cb510c9e42d
Author: Kairui Song 
AuthorDate: Tue, 23 Apr 2019 00:26:52 +0800
Committer:  Ingo Molnar 
CommitDate: Mon, 29 Apr 2019 08:25:05 +0200

perf/x86: Make perf callchains work without CONFIG_FRAME_POINTER

Currently perf callchain doesn't work well with the ORC unwinder
when sampling from a trace point. We'll get a useless in-kernel callchain
like this:

perf  6429 [000]22.498450: kmem:mm_page_alloc: page=0x176a17 
pfn=1534487 order=0 migratetype=0 gfp_flags=GFP_KERNEL
be23e32e __alloc_pages_nodemask+0x22e 
(/lib/modules/5.1.0-rc3+/build/vmlinux)
7efdf7f7d3e8 __poll+0x18 (/usr/lib64/libc-2.28.so)
5651468729c1 [unknown] (/usr/bin/perf)
5651467ee82a main+0x69a (/usr/bin/perf)
7efdf7eaf413 __libc_start_main+0xf3 (/usr/lib64/libc-2.28.so)
5541f689495641d7 [unknown] ([unknown])

The root cause is that, for trace point events, perf doesn't get a
real snapshot of the hardware registers. Instead it tries to fetch the
required caller's registers and composes a fake register snapshot
which is supposed to contain enough information to start unwinding.
However, without CONFIG_FRAME_POINTER we fail to get the caller's BP as
the frame pointer, so the current frame pointer is returned instead. We
get an invalid register combination which confuses the unwinder and ends
the stacktrace early.

So in such a case just don't try to dump BP, and let the unwinder start
directly when the registers are not a real snapshot. Use SP
as the skip mark: the unwinder will skip all the frames until it meets
the frame of the trace point caller.

Tested with the frame pointer unwinder and the ORC unwinder; this makes
the perf callchain get the full kernel-space stacktrace again, like this:

perf  6503 [000]  1567.570191: kmem:mm_page_alloc: page=0x16c904 
pfn=1493252 order=0 migratetype=0 gfp_flags=GFP_KERNEL
b523e2ae __alloc_pages_nodemask+0x22e 
(/lib/modules/5.1.0-rc3+/build/vmlinux)
b52383bd __get_free_pages+0xd 
(/lib/modules/5.1.0-rc3+/build/vmlinux)
b52fd28a __pollwait+0x8a (/lib/modules/5.1.0-rc3+/build/vmlinux)
b521426f perf_poll+0x2f (/lib/modules/5.1.0-rc3+/build/vmlinux)
b52fe3e2 do_sys_poll+0x252 (/lib/modules/5.1.0-rc3+/build/vmlinux)
b52ff027 __x64_sys_poll+0x37 (/lib/modules/5.1.0-rc3+/build/vmlinux)
b500418b do_syscall_64+0x5b (/lib/modules/5.1.0-rc3+/build/vmlinux)
b5a0008c entry_SYSCALL_64_after_hwframe+0x44 
(/lib/modules/5.1.0-rc3+/build/vmlinux)
7f71e92d03e8 __poll+0x18 (/usr/lib64/libc-2.28.so)
55a22960d9c1 [unknown] (/usr/bin/perf)
55a22958982a main+0x69a (/usr/bin/perf)
7f71e9202413 __libc_start_main+0xf3 (/usr/lib64/libc-2.28.so)
5541f689495641d7 [unknown] ([unknown])

Co-developed-by: Josh Poimboeuf 
Signed-off-by: Kairui Song 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Alexander Shishkin 
Cc: Alexei Starovoitov 
Cc: Arnaldo Carvalho de Melo 
Cc: Borislav Petkov 
Cc: Dave Young 
Cc: Jiri Olsa 
Cc: Linus Torvalds 
Cc: Namhyung Kim 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Link: https://lkml.kernel.org/r/20190422162652.15483-1-kas...@redhat.com
Signed-off-by: Ingo Molnar 
---
 arch/x86/events/core.c| 21 +
 arch/x86/include/asm/perf_event.h |  7 +--
 arch/x86/include/asm/stacktrace.h | 13 -
 include/linux/perf_event.h| 14 ++
 4 files changed, 28 insertions(+), 27 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index de1a924a4914..f315425d8468 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2382,6 +2382,15 @@ void arch_perf_update_userpage(struct perf_event *event,
cyc2ns_read_end();
 }
 
+/*
+ * Determine whether the regs were taken from an irq/exception handler rather
+ * than from perf_arch_fetch_caller_regs().
+ */
+static bool perf_hw_regs(struct pt_regs *regs)
+{
+   return regs->flags & X86_EFLAGS_FIXED;
+}
+
 void
 perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs 
*regs)
 {
@@ -2393,11 +2402,15 @@ perf_callchain_kernel(struct perf_callchain_entry_ctx 
*entry, struct pt_regs *re
return;
}
 
-   if (perf_callchain_store(entry, regs->ip))
-   return;
+   if (perf_hw_regs(regs)) {
+   if (perf_callchain_store(entry, regs->ip))
+   return;
+   unwind_start(&state, current, regs, NULL);
+   } else {
+   unwind_start(&state, current, NULL, (void *)regs->sp);
+   }
 
-   for (unwind_start(&state, current, regs, NULL); !unwind_done(&state);
-unwind_next_frame(&state)) {
+   for (; !unwind_done(&state); unwind_next_frame(&state)) {
addr = unwind_get_return_address(&state);
if (!addr || perf_callchain_store(entry, addr))
return;

Re: [RFC PATCH v4] perf/x86: make perf callchain work without CONFIG_FRAME_POINTER

2019-04-24 Thread Kairui Song
On Tue, Apr 23, 2019 at 7:35 AM Peter Zijlstra  wrote:
>
> On Tue, Apr 23, 2019 at 12:26:52AM +0800, Kairui Song wrote:
> > Currently perf callchain doesn't work well with the ORC unwinder
> > when sampling from a trace point. We'll get a useless in-kernel callchain
> > like this:
> >
> > perf  6429 [000]22.498450: kmem:mm_page_alloc: 
> > page=0x176a17 pfn=1534487 order=0 migratetype=0 gfp_flags=GFP_KERNEL
> > be23e32e __alloc_pages_nodemask+0x22e 
> > (/lib/modules/5.1.0-rc3+/build/vmlinux)
> >   7efdf7f7d3e8 __poll+0x18 (/usr/lib64/libc-2.28.so)
> >   5651468729c1 [unknown] (/usr/bin/perf)
> >   5651467ee82a main+0x69a (/usr/bin/perf)
> >   7efdf7eaf413 __libc_start_main+0xf3 (/usr/lib64/libc-2.28.so)
> > 5541f689495641d7 [unknown] ([unknown])
> >
> > The root cause is that, for trace point events, perf doesn't get a
> > real snapshot of the hardware registers. Instead it tries to fetch the
> > required caller's registers and composes a fake register snapshot
> > which is supposed to contain enough information to start unwinding.
> > However, without CONFIG_FRAME_POINTER we fail to get the caller's BP as
> > the frame pointer, so the current frame pointer is returned instead. We
> > get an invalid register combination which confuses the unwinder and ends
> > the stacktrace early.
> >
> > So in such a case just don't try to dump BP, and let the unwinder start
> > directly when the registers are not a real snapshot. And use SP
> > as the skip mark: the unwinder will skip all the frames until it meets
> > the frame of the trace point caller.
> >
> > Tested with the frame pointer unwinder and the ORC unwinder; this makes
> > the perf callchain get the full kernel-space stacktrace again, like this:
> >
> > perf  6503 [000]  1567.570191: kmem:mm_page_alloc: 
> > page=0x16c904 pfn=1493252 order=0 migratetype=0 gfp_flags=GFP_KERNEL
> > b523e2ae __alloc_pages_nodemask+0x22e 
> > (/lib/modules/5.1.0-rc3+/build/vmlinux)
> > b52383bd __get_free_pages+0xd 
> > (/lib/modules/5.1.0-rc3+/build/vmlinux)
> > b52fd28a __pollwait+0x8a (/lib/modules/5.1.0-rc3+/build/vmlinux)
> > b521426f perf_poll+0x2f (/lib/modules/5.1.0-rc3+/build/vmlinux)
> > b52fe3e2 do_sys_poll+0x252 
> > (/lib/modules/5.1.0-rc3+/build/vmlinux)
> > b52ff027 __x64_sys_poll+0x37 
> > (/lib/modules/5.1.0-rc3+/build/vmlinux)
> > b500418b do_syscall_64+0x5b 
> > (/lib/modules/5.1.0-rc3+/build/vmlinux)
> > b5a0008c entry_SYSCALL_64_after_hwframe+0x44 
> > (/lib/modules/5.1.0-rc3+/build/vmlinux)
> >   7f71e92d03e8 __poll+0x18 (/usr/lib64/libc-2.28.so)
> >   55a22960d9c1 [unknown] (/usr/bin/perf)
> >   55a22958982a main+0x69a (/usr/bin/perf)
> >   7f71e9202413 __libc_start_main+0xf3 (/usr/lib64/libc-2.28.so)
> > 5541f689495641d7 [unknown] ([unknown])
> >
> > Co-developed-by: Josh Poimboeuf 
> > Signed-off-by: Kairui Song 
>
> Thanks!
>
> > diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> > index e47ef764f613..ab135abe62e0 100644
> > --- a/include/linux/perf_event.h
> > +++ b/include/linux/perf_event.h
> > @@ -1059,7 +1059,7 @@ static inline void perf_arch_fetch_caller_regs(struct 
> > pt_regs *regs, unsigned lo
> >   * the nth caller. We only need a few of the regs:
> >   * - ip for PERF_SAMPLE_IP
> >   * - cs for user_mode() tests
> > - * - bp for callchains
> > + * - sp for callchains
> >   * - eflags, for future purposes, just in case
> >   */
> >  static inline void perf_fetch_caller_regs(struct pt_regs *regs)
>
> I've extended that like so:
>
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -1058,12 +1058,18 @@ static inline void perf_arch_fetch_calle
>  #endif
>
>  /*
> - * Take a snapshot of the regs. Skip ip and frame pointer to
> - * the nth caller. We only need a few of the regs:
> + * When generating a perf sample in-line, instead of from an interrupt /
> + * exception, we lack a pt_regs. This is typically used from software events
> + * like: SW_CONTEXT_SWITCHES, SW_MIGRATIONS and the tie-in with tracepoints.
> + *
> + * We typically don't need a full set, but (for x86) do require:
>   * - ip for PERF_SAMPLE_IP
>   * - cs for user_mode() tests
> - * - sp for callchains
> - * - eflags, for future purposes, just in case
> + * - sp for PERF_SAMPLE_CALLCHAIN
> + * - eflags for MISC bits and CALLCHAIN (see: perf_hw_regs())
> + *
> + * NOTE: assumes @regs is otherwise already 0 filled; this is important for
> + * things like PERF_SAMPLE_REGS_INTR.
>   */
>  static inline void perf_fetch_caller_regs(struct pt_regs *regs)
>  {

Sure, the updated comments look much better. Will the maintainer
squash the comment update or should I send a V5?

--
Best Regards,
Kairui Song


[RFC PATCH v4] perf/x86: make perf callchain work without CONFIG_FRAME_POINTER

2019-04-22 Thread Kairui Song
Currently perf callchain doesn't work well with the ORC unwinder
when sampling from a trace point. We'll get a useless in-kernel callchain
like this:

perf  6429 [000]22.498450: kmem:mm_page_alloc: page=0x176a17 
pfn=1534487 order=0 migratetype=0 gfp_flags=GFP_KERNEL
be23e32e __alloc_pages_nodemask+0x22e 
(/lib/modules/5.1.0-rc3+/build/vmlinux)
7efdf7f7d3e8 __poll+0x18 (/usr/lib64/libc-2.28.so)
5651468729c1 [unknown] (/usr/bin/perf)
5651467ee82a main+0x69a (/usr/bin/perf)
7efdf7eaf413 __libc_start_main+0xf3 (/usr/lib64/libc-2.28.so)
5541f689495641d7 [unknown] ([unknown])

The root cause is that, for trace point events, perf doesn't get a
real snapshot of the hardware registers. Instead it tries to fetch the
required caller's registers and composes a fake register snapshot
which is supposed to contain enough information to start unwinding.
However, without CONFIG_FRAME_POINTER we fail to get the caller's BP as
the frame pointer, so the current frame pointer is returned instead. We
get an invalid register combination which confuses the unwinder and ends
the stacktrace early.

So in such a case just don't try to dump BP, and let the unwinder start
directly when the registers are not a real snapshot. And use SP
as the skip mark: the unwinder will skip all the frames until it meets
the frame of the trace point caller.

Tested with the frame pointer unwinder and the ORC unwinder; this makes
the perf callchain get the full kernel-space stacktrace again, like this:

perf  6503 [000]  1567.570191: kmem:mm_page_alloc: page=0x16c904 
pfn=1493252 order=0 migratetype=0 gfp_flags=GFP_KERNEL
b523e2ae __alloc_pages_nodemask+0x22e 
(/lib/modules/5.1.0-rc3+/build/vmlinux)
b52383bd __get_free_pages+0xd 
(/lib/modules/5.1.0-rc3+/build/vmlinux)
b52fd28a __pollwait+0x8a (/lib/modules/5.1.0-rc3+/build/vmlinux)
b521426f perf_poll+0x2f (/lib/modules/5.1.0-rc3+/build/vmlinux)
b52fe3e2 do_sys_poll+0x252 (/lib/modules/5.1.0-rc3+/build/vmlinux)
b52ff027 __x64_sys_poll+0x37 (/lib/modules/5.1.0-rc3+/build/vmlinux)
b500418b do_syscall_64+0x5b (/lib/modules/5.1.0-rc3+/build/vmlinux)
b5a0008c entry_SYSCALL_64_after_hwframe+0x44 
(/lib/modules/5.1.0-rc3+/build/vmlinux)
7f71e92d03e8 __poll+0x18 (/usr/lib64/libc-2.28.so)
55a22960d9c1 [unknown] (/usr/bin/perf)
55a22958982a main+0x69a (/usr/bin/perf)
7f71e9202413 __libc_start_main+0xf3 (/usr/lib64/libc-2.28.so)
5541f689495641d7 [unknown] ([unknown])

Co-developed-by: Josh Poimboeuf 
Signed-off-by: Kairui Song 
---

Update from V3:
  - Always start the unwinding directly on fake registers, so we have
a unified path for both with/without frame pointer and simplify
the code, as posted by Josh Poimboeuf

Update from V2:
  - Instead of checking whether BP is 0, use the X86_EFLAGS_FIXED flag bit as
the indicator of whether the pt_regs is valid for unwinding. As
suggested by Peter Zijlstra
  - Update some comments accordingly.

Update from V1:
  Get rid of a lot of unnecessary code and just don't dump an inaccurate
  BP, and use SP as the marker for the target frame.

 arch/x86/events/core.c| 21 +
 arch/x86/include/asm/perf_event.h |  7 +--
 arch/x86/include/asm/stacktrace.h | 13 -
 include/linux/perf_event.h|  2 +-
 4 files changed, 19 insertions(+), 24 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 81911e11a15d..9856b5b91b9c 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2348,6 +2348,15 @@ void arch_perf_update_userpage(struct perf_event *event,
cyc2ns_read_end();
 }
 
+/*
+ * Determine whether the regs were taken from an irq/exception handler rather
+ * than from perf_arch_fetch_caller_regs().
+ */
+static bool perf_hw_regs(struct pt_regs *regs)
+{
+   return regs->flags & X86_EFLAGS_FIXED;
+}
+
 void
 perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs 
*regs)
 {
@@ -2359,11 +2368,15 @@ perf_callchain_kernel(struct perf_callchain_entry_ctx 
*entry, struct pt_regs *re
return;
}
 
-   if (perf_callchain_store(entry, regs->ip))
-   return;
+   if (perf_hw_regs(regs)) {
+   if (perf_callchain_store(entry, regs->ip))
+   return;
+   unwind_start(&state, current, regs, NULL);
+   } else {
+   unwind_start(&state, current, NULL, (void *)regs->sp);
+   }
 
-   for (unwind_start(&state, current, regs, NULL); !unwind_done(&state);
-unwind_next_frame(&state)) {
+   for (; !unwind_done(&state); unwind_next_frame(&state)) {
addr = unwind_get_return_address(&state);
if (!addr || perf_callchain_store(entry, addr))
return;
diff --git a/arch/x86/include/asm/perf_event.h 
b/arch/x86/include/asm/perf_event.h
index 8bdf74902293..f4854cd0905b 100644
--- a/arch

Re: [RFC PATCH v3] perf/x86: make perf callchain work without CONFIG_FRAME_POINTER

2019-04-19 Thread Kairui Song
On Fri, Apr 19, 2019 at 5:43 PM Peter Zijlstra  wrote:
>
> On Fri, Apr 19, 2019 at 10:17:49AM +0800, Kairui Song wrote:
> > On Fri, Apr 19, 2019 at 8:58 AM Josh Poimboeuf  wrote:
> > >
> > > I still don't like using regs->bp because it results in different code
> > > paths for FP and ORC.  In the FP case, the regs are treated like real
> > > regs even though they're fake.
> > >
> > > Something like the below would be much simpler.  Would this work?  I don't
> > > know if any other code relies on the fake regs->bp or regs->sp.
> >
> > Works perfectly. My only concern is that the FP path used to work very
> > well; I'm not sure it's a good idea to change it, and this may bring some
> > extra overhead to the FP path.
>
> Given Josh wrote all that code, I'm fairly sure it is still OK :-)
>
> But also looking at the code in unwind_frame.c, __unwind_start() seems
> to pretty much do what the removed caller_frame_pointer() did (when
> .regs=NULL) but better.
>

OK, with FP we will also need to do a bit of extra unwinding:
previously it started directly from the frame of the trace point, but now
it has to trace back to the trace point first.
If that's fine I could post another update (it will be pretty much
just a copy of the code Josh posted :P , is this OK?)
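
(For reference, the skip behaviour being relied on here, conceptually:
a simplified sketch of what __unwind_start() does with a first_frame
hint, with the on-stack sanity checks elided -- not the exact kernel
code:)

	/* Start from the current frame, then pop frames until the stack
	 * pointer reaches first_frame, i.e. the frame of the trace point
	 * caller that we actually want the callchain to begin at. */
	while (!unwind_done(state) &&
	       state->sp < (unsigned long)first_frame)
		unwind_next_frame(state);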





--
Best Regards,
Kairui Song


Re: [RFC PATCH v3] perf/x86: make perf callchain work without CONFIG_FRAME_POINTER

2019-04-18 Thread Kairui Song
On Fri, Apr 19, 2019 at 8:58 AM Josh Poimboeuf  wrote:
>
> I still don't like using regs->bp because it results in different code
> paths for FP and ORC.  In the FP case, the regs are treated like real
> regs even though they're fake.
>
> Something like the below would be much simpler.  Would this work?  I don't
> know if any other code relies on the fake regs->bp or regs->sp.

Works perfectly. My only concern is that the FP path used to work very
well; I'm not sure it's a good idea to change it, and this may bring some
extra overhead to the FP path.

>
> (BTW, tomorrow is a US holiday so I may not be very responsive until
> Monday.)
>
> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
> index de1a924a4914..f315425d8468 100644
> --- a/arch/x86/events/core.c
> +++ b/arch/x86/events/core.c
> @@ -2382,6 +2382,15 @@ void arch_perf_update_userpage(struct perf_event 
> *event,
> cyc2ns_read_end();
>  }
>
> +/*
> + * Determine whether the regs were taken from an irq/exception handler rather
> + * than from perf_arch_fetch_caller_regs().
> + */
> +static bool perf_hw_regs(struct pt_regs *regs)
> +{
> +   return regs->flags & X86_EFLAGS_FIXED;
> +}
> +
>  void
>  perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs 
> *regs)
>  {
> @@ -2393,11 +2402,15 @@ perf_callchain_kernel(struct perf_callchain_entry_ctx 
> *entry, struct pt_regs *re
> return;
> }
>
> -   if (perf_callchain_store(entry, regs->ip))
> -   return;
> +   if (perf_hw_regs(regs)) {
> +   if (perf_callchain_store(entry, regs->ip))
> +   return;
> +   unwind_start(&state, current, regs, NULL);
> +   } else {
> +   unwind_start(&state, current, NULL, (void *)regs->sp);
> +   }
>
> -   for (unwind_start(&state, current, regs, NULL); !unwind_done(&state);
> -unwind_next_frame(&state)) {
> +   for (; !unwind_done(&state); unwind_next_frame(&state)) {
> addr = unwind_get_return_address(&state);
> if (!addr || perf_callchain_store(entry, addr))
> return;
> diff --git a/arch/x86/include/asm/perf_event.h 
> b/arch/x86/include/asm/perf_event.h
> index 04768a3a5454..1392d5e6e8d6 100644
> --- a/arch/x86/include/asm/perf_event.h
> +++ b/arch/x86/include/asm/perf_event.h
> @@ -308,14 +308,9 @@ extern unsigned long perf_misc_flags(struct pt_regs 
> *regs);
>   */
>  #define perf_arch_fetch_caller_regs(regs, __ip){   \
> (regs)->ip = (__ip);\
> -   (regs)->bp = caller_frame_pointer();\
> +   (regs)->sp = (unsigned long)__builtin_frame_address(0); \
> (regs)->cs = __KERNEL_CS;   \
> regs->flags = 0;\
> -   asm volatile(   \
> -   _ASM_MOV "%%"_ASM_SP ", %0\n"   \
> -   : "=m" ((regs)->sp) \
> -   :: "memory" \
> -   );  \
>  }
>
>  struct perf_guest_switch_msr {
> diff --git a/arch/x86/include/asm/stacktrace.h 
> b/arch/x86/include/asm/stacktrace.h
> index d6d758a187b6..a8d0cdf48616 100644
> --- a/arch/x86/include/asm/stacktrace.h
> +++ b/arch/x86/include/asm/stacktrace.h
> @@ -100,19 +100,6 @@ struct stack_frame_ia32 {
>  u32 return_address;
>  };
>
> -static inline unsigned long caller_frame_pointer(void)
> -{
> -   struct stack_frame *frame;
> -
> -   frame = __builtin_frame_address(0);
> -
> -#ifdef CONFIG_FRAME_POINTER
> -   frame = frame->next_frame;
> -#endif
> -
> -   return (unsigned long)frame;
> -}
> -
>  void show_opcodes(struct pt_regs *regs, const char *loglvl);
>  void show_ip(struct pt_regs *regs, const char *loglvl);
>  #endif /* _ASM_X86_STACKTRACE_H */
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index f3864e1c5569..0f560069aeec 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -1062,7 +1062,7 @@ static inline void perf_arch_fetch_caller_regs(struct 
> pt_regs *regs, unsigned lo
>   * the nth caller. We only need a few of the regs:
>   * - ip for PERF_SAMPLE_IP
>   * - cs for user_mode() tests
> - * - bp for callchains
> + * - sp for callchains
>   * - eflags, for future purposes, just in case
>   */
>  static inline void perf_fetch_caller_regs(struct pt_regs *regs)

-- 
Best Regards,
Kairui Song


[RFC PATCH v3] perf/x86: make perf callchain work without CONFIG_FRAME_POINTER

2019-04-18 Thread Kairui Song
Currently perf callchain doesn't work well when sampling from a trace
point, with the ORC unwinder enabled and CONFIG_FRAME_POINTER disabled.
We'll get a useless in-kernel callchain like this:

perf  6429 [000]22.498450: kmem:mm_page_alloc: page=0x176a17 
pfn=1534487 order=0 migratetype=0 gfp_flags=GFP_KERNEL
be23e32e __alloc_pages_nodemask+0x22e 
(/lib/modules/5.1.0-rc3+/build/vmlinux)
7efdf7f7d3e8 __poll+0x18 (/usr/lib64/libc-2.28.so)
5651468729c1 [unknown] (/usr/bin/perf)
5651467ee82a main+0x69a (/usr/bin/perf)
7efdf7eaf413 __libc_start_main+0xf3 (/usr/lib64/libc-2.28.so)
5541f689495641d7 [unknown] ([unknown])

The root cause is that within a trace point perf will try to dump the
required caller's registers, but without CONFIG_FRAME_POINTER we
can't get the caller's BP as the frame pointer, so the current frame
pointer is returned instead. We get an invalid register combination
which confuses the unwinder and ends the stacktrace early.

So in such a case just don't try to dump BP when doing a partial
register dump, and just let the unwinder start directly when the
registers are incapable of providing an unwinding start point. Use SP
as the skip mark, skipping all the frames until we meet the frame we want.

This makes the callchain get the full kernel-space stacktrace again:

perf  6503 [000]  1567.570191: kmem:mm_page_alloc: page=0x16c904 
pfn=1493252 order=0 migratetype=0 gfp_flags=GFP_KERNEL
b523e2ae __alloc_pages_nodemask+0x22e 
(/lib/modules/5.1.0-rc3+/build/vmlinux)
b52383bd __get_free_pages+0xd 
(/lib/modules/5.1.0-rc3+/build/vmlinux)
b52fd28a __pollwait+0x8a (/lib/modules/5.1.0-rc3+/build/vmlinux)
b521426f perf_poll+0x2f (/lib/modules/5.1.0-rc3+/build/vmlinux)
b52fe3e2 do_sys_poll+0x252 (/lib/modules/5.1.0-rc3+/build/vmlinux)
b52ff027 __x64_sys_poll+0x37 (/lib/modules/5.1.0-rc3+/build/vmlinux)
b500418b do_syscall_64+0x5b (/lib/modules/5.1.0-rc3+/build/vmlinux)
b5a0008c entry_SYSCALL_64_after_hwframe+0x44 
(/lib/modules/5.1.0-rc3+/build/vmlinux)
7f71e92d03e8 __poll+0x18 (/usr/lib64/libc-2.28.so)
55a22960d9c1 [unknown] (/usr/bin/perf)
55a22958982a main+0x69a (/usr/bin/perf)
7f71e9202413 __libc_start_main+0xf3 (/usr/lib64/libc-2.28.so)
5541f689495641d7 [unknown] ([unknown])

Signed-off-by: Kairui Song 
---

Update from V2:
  - Instead of checking whether BP is 0, use the X86_EFLAGS_FIXED flag bit as
the indicator of whether the pt_regs is valid for unwinding. As
suggested by Peter Zijlstra
  - Update some comments accordingly.

Update from V1:
  Get rid of a lot of unnecessary code and just don't dump an inaccurate
  BP, and use SP as the marker for the target frame.

 arch/x86/events/core.c| 24 +---
 arch/x86/include/asm/perf_event.h |  5 +
 arch/x86/include/asm/stacktrace.h |  9 +++--
 include/linux/perf_event.h|  6 +++---
 4 files changed, 36 insertions(+), 8 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index e2b1447192a8..e181e195fe5d 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2355,6 +2355,18 @@ void arch_perf_update_userpage(struct perf_event *event,
cyc2ns_read_end();
 }
 
+static inline int
+valid_unwinding_registers(struct pt_regs *regs)
+{
+   /*
+* regs might be a fake one, it won't dump the flags reg,
+* and without frame pointer, it won't have a valid BP.
+*/
+   if (IS_ENABLED(CONFIG_FRAME_POINTER))
+   return 1;
+   return (regs->flags & PERF_EFLAGS_SNAP);
+}
+
 void
 perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs 
*regs)
 {
@@ -2366,11 +2378,17 @@ perf_callchain_kernel(struct perf_callchain_entry_ctx 
*entry, struct pt_regs *re
return;
}
 
-   if (perf_callchain_store(entry, regs->ip))
+   if (valid_unwinding_registers(regs)) {
+   if (perf_callchain_store(entry, regs->ip))
+   return;
+   unwind_start(&state, current, regs, NULL);
+   } else if (regs->sp) {
+   unwind_start(&state, current, NULL, (unsigned long *)regs->sp);
+   } else {
return;
+   }
 
-   for (unwind_start(&state, current, regs, NULL); !unwind_done(&state);
-unwind_next_frame(&state)) {
+   for (; !unwind_done(&state); unwind_next_frame(&state)) {
addr = unwind_get_return_address(&state);
if (!addr || perf_callchain_store(entry, addr))
return;
diff --git a/arch/x86/include/asm/perf_event.h 
b/arch/x86/include/asm/perf_event.h
index 8bdf74902293..77c8519512ff 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -239,11 +239,16 @@ extern void perf_events_lapic_init(void);
  * Abuse bits {3,5} of the cpu eflags register. These flags are otherwise
  * unused and ABI sp

Re: [RFC PATCH v2] perf/x86: make perf callchain work without CONFIG_FRAME_POINTER

2019-04-17 Thread Kairui Song
On Wed, Apr 17, 2019 at 4:16 AM Josh Poimboeuf  wrote:
>
> On Wed, Apr 17, 2019 at 01:39:19AM +0800, Kairui Song wrote:
> > On Tue, Apr 16, 2019 at 7:30 PM Kairui Song  wrote:
> > >
> > > On Tue, Apr 16, 2019 at 12:59 AM Josh Poimboeuf  
> > > wrote:
> > > >
> > > > On Mon, Apr 15, 2019 at 05:36:22PM +0200, Peter Zijlstra wrote:
> > > > >
> > > > > I'll mostly defer to Josh on unwinding, but a few comments below.
> > > > >
> > > > > On Tue, Apr 09, 2019 at 12:59:42AM +0800, Kairui Song wrote:
> > > > > > diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
> > > > > > index e2b1447192a8..6075a4f94376 100644
> > > > > > --- a/arch/x86/events/core.c
> > > > > > +++ b/arch/x86/events/core.c
> > > > > > @@ -2355,6 +2355,12 @@ void arch_perf_update_userpage(struct 
> > > > > > perf_event *event,
> > > > > > cyc2ns_read_end();
> > > > > >  }
> > > > > >
> > > > > > +static inline int
> > > > > > +valid_perf_registers(struct pt_regs *regs)
> > > > > > +{
> > > > > > +   return (regs->ip && regs->bp && regs->sp);
> > > > > > +}
> > > > >
> > > > > I'm unconvinced by this, with both guess and orc having !bp is 
> > > > > perfectly
> > > > > valid.
> > > > >
> > > > > >  void
> > > > > >  perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, 
> > > > > > struct pt_regs *regs)
> > > > > >  {
> > > > > > @@ -2366,11 +2372,17 @@ perf_callchain_kernel(struct 
> > > > > > perf_callchain_entry_ctx *entry, struct pt_regs *re
> > > > > > return;
> > > > > > }
> > > > > >
> > > > > > -   if (perf_callchain_store(entry, regs->ip))
> > > > > > +   if (valid_perf_registers(regs)) {
> > > > > > +   if (perf_callchain_store(entry, regs->ip))
> > > > > > +   return;
> > > > > > +   unwind_start(&state, current, regs, NULL);
> > > > > > +   } else if (regs->sp) {
> > > > > > +   unwind_start(&state, current, NULL, (unsigned long 
> > > > > > *)regs->sp);
> > > > > > +   } else {
> > > > > > return;
> > > > > > +   }
> > > > >
> > > > > AFAICT if we, by pure accident, end up with !bp for ORC, then we
> > > > > initialize the unwind wrong.
> > > > >
> > > > > Note that @regs is mostly trivially correct, except for that 
> > > > > tracepoint
> > > > > case. So I don't think we should magic here.
> > > >
> > > > Ah, I didn't quite understand this code before, and I still don't
> > > > really, but I guess the issue is that @regs can be either real or fake.
> > > >
> > > > In the real @regs case, we just want to always unwind starting from
> > > > regs->sp.
> > > >
> > > > But in the fake @regs case, we should instead unwind from the current
> > > > frame, skipping all frames until we hit the fake regs->sp.  Because
> > > > starting from fake/incomplete regs is most likely going to cause
> > > > problems with ORC (or DWARF for other arches).
> > > >
> > > > The idea of a fake regs is fragile and confusing.  Is it possible to
> > > > just pass in the "skip" stack pointer directly instead?  That should
> > > > work for both FP and non-FP.  And I _think_ there's no need to ever
> > > > capture regs->bp anyway -- the stack pointer should be sufficient.
> > >
> > > Hi, that will break some other usage: if perf_callchain_kernel is
> > > called but doesn't unwind to the callsite (which could be produced by
> > > attaching an ebpf call to a kprobe), things will also go wrong. It should
> > > start with the given registers when the registers are valid.
> > > And it's true that with frame pointers omitted the BP value could be
> > > anything, so 0 is also valid; I think I need to find a better way to tell
> > > whether we can start from the register values or start unwinding directly
> > > and skip until we've got to the stack we want.
> > >
> >
> > Hi, sorry I might have some misu

Re: [RFC PATCH v2] perf/x86: make perf callchain work without CONFIG_FRAME_POINTER

2019-04-17 Thread Kairui Song
On Wed, Apr 17, 2019 at 1:45 AM Peter Zijlstra  wrote:
>
> On Wed, Apr 17, 2019 at 01:39:19AM +0800, Kairui Song wrote:
> > And I also think the "fake"/"real" reg is fragile, could we abuse
> > another eflag (just like PERF_EFLAGS_EXACT) to indicate the regs are
> > partially dumped fake registers?
>
> Sure, the SDM seems to suggest bits 1,3,5,15 are 'available'. We've
> already used 3 and 5, and I think we can use !X86_EFLAGS_FIXED to
> indicate a fake regs set. Any real regs set will always have that set.

Thanks! This is a good idea. Will update accordingly in V3 later.
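
(For context, this is the shape the indicator eventually took in the
merged patch; see the v4 diff elsewhere in this thread:)

/*
 * Real register snapshots taken at irq/exception entry always have the
 * always-one X86_EFLAGS_FIXED bit set, while perf_arch_fetch_caller_regs()
 * leaves regs->flags zeroed, so the bit doubles as a "real regs" marker.
 */
static bool perf_hw_regs(struct pt_regs *regs)
{
	return regs->flags & X86_EFLAGS_FIXED;
}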





--
Best Regards,
Kairui Song


Re: [RFC PATCH v2] perf/x86: make perf callchain work without CONFIG_FRAME_POINTER

2019-04-16 Thread Kairui Song
On Tue, Apr 16, 2019 at 7:30 PM Kairui Song  wrote:
>
> On Tue, Apr 16, 2019 at 12:59 AM Josh Poimboeuf  wrote:
> >
> > On Mon, Apr 15, 2019 at 05:36:22PM +0200, Peter Zijlstra wrote:
> > >
> > > I'll mostly defer to Josh on unwinding, but a few comments below.
> > >
> > > On Tue, Apr 09, 2019 at 12:59:42AM +0800, Kairui Song wrote:
> > > > diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
> > > > index e2b1447192a8..6075a4f94376 100644
> > > > --- a/arch/x86/events/core.c
> > > > +++ b/arch/x86/events/core.c
> > > > @@ -2355,6 +2355,12 @@ void arch_perf_update_userpage(struct perf_event 
> > > > *event,
> > > > cyc2ns_read_end();
> > > >  }
> > > >
> > > > +static inline int
> > > > +valid_perf_registers(struct pt_regs *regs)
> > > > +{
> > > > +   return (regs->ip && regs->bp && regs->sp);
> > > > +}
> > >
> > > I'm unconvinced by this, with both guess and orc having !bp is perfectly
> > > valid.
> > >
> > > >  void
> > > >  perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct 
> > > > pt_regs *regs)
> > > >  {
> > > > @@ -2366,11 +2372,17 @@ perf_callchain_kernel(struct 
> > > > perf_callchain_entry_ctx *entry, struct pt_regs *re
> > > > return;
> > > > }
> > > >
> > > > -   if (perf_callchain_store(entry, regs->ip))
> > > > +   if (valid_perf_registers(regs)) {
> > > > +   if (perf_callchain_store(entry, regs->ip))
> > > > +   return;
> > > > +   unwind_start(&state, current, regs, NULL);
> > > > +   } else if (regs->sp) {
> > > > +   unwind_start(&state, current, NULL, (unsigned long 
> > > > *)regs->sp);
> > > > +   } else {
> > > > return;
> > > > +   }
> > >
> > > AFAICT if we, by pure accident, end up with !bp for ORC, then we
> > > initialize the unwind wrong.
> > >
> > > Note that @regs is mostly trivially correct, except for that tracepoint
> > > case. So I don't think we should magic here.
> >
> > Ah, I didn't quite understand this code before, and I still don't
> > really, but I guess the issue is that @regs can be either real or fake.
> >
> > In the real @regs case, we just want to always unwind starting from
> > regs->sp.
> >
> > But in the fake @regs case, we should instead unwind from the current
> > frame, skipping all frames until we hit the fake regs->sp.  Because
> > starting from fake/incomplete regs is most likely going to cause
> > problems with ORC (or DWARF for other arches).
> >
> > The idea of a fake regs is fragile and confusing.  Is it possible to
> > just pass in the "skip" stack pointer directly instead?  That should
> > work for both FP and non-FP.  And I _think_ there's no need to ever
> > capture regs->bp anyway -- the stack pointer should be sufficient.
>
> Hi, that will break some other usage: if perf_callchain_kernel is
> called but doesn't unwind to the callsite (which could be produced by
> attaching an ebpf call to a kprobe), things will also go wrong. It should
> start with the given registers when the registers are valid.
> And it's true that with frame pointers omitted the BP value could be
> anything, so 0 is also valid; I think I need to find a better way to tell
> whether we can start from the register values or start unwinding directly
> and skip until we've got to the stack we want.
>

Hi, sorry, I might have had some misunderstanding. Adding an extra argument
(e.g. skip_sp) to indicate whether it should just unwind from the current
frame, and using SP as the "skip mark", should work well.

And I also think the "fake"/"real" regs distinction is fragile; could we
abuse another eflag (just like PERF_EFLAGS_EXACT) to indicate that the regs
are partially dumped fake registers? Then perf_callchain_kernel just checks
whether they are "partial registers", and in that case it can start unwinding
and skip until it gets to SP. This makes it easier to tell whether the
registers are "fake".

-- 
Best Regards,
Kairui Song


Re: [RFC PATCH v2] perf/x86: make perf callchain work without CONFIG_FRAME_POINTER

2019-04-16 Thread Kairui Song
On Tue, Apr 16, 2019 at 12:59 AM Josh Poimboeuf  wrote:
>
> On Mon, Apr 15, 2019 at 05:36:22PM +0200, Peter Zijlstra wrote:
> >
> > I'll mostly defer to Josh on unwinding, but a few comments below.
> >
> > On Tue, Apr 09, 2019 at 12:59:42AM +0800, Kairui Song wrote:
> > > diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
> > > index e2b1447192a8..6075a4f94376 100644
> > > --- a/arch/x86/events/core.c
> > > +++ b/arch/x86/events/core.c
> > > @@ -2355,6 +2355,12 @@ void arch_perf_update_userpage(struct perf_event 
> > > *event,
> > > cyc2ns_read_end();
> > >  }
> > >
> > > +static inline int
> > > +valid_perf_registers(struct pt_regs *regs)
> > > +{
> > > +   return (regs->ip && regs->bp && regs->sp);
> > > +}
> >
> > I'm unconvinced by this, with both guess and orc having !bp is perfectly
> > valid.
> >
> > >  void
> > >  perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct 
> > > pt_regs *regs)
> > >  {
> > > @@ -2366,11 +2372,17 @@ perf_callchain_kernel(struct 
> > > perf_callchain_entry_ctx *entry, struct pt_regs *re
> > > return;
> > > }
> > >
> > > -   if (perf_callchain_store(entry, regs->ip))
> > > +   if (valid_perf_registers(regs)) {
> > > +   if (perf_callchain_store(entry, regs->ip))
> > > +   return;
> > > > +   unwind_start(&state, current, regs, NULL);
> > > +   } else if (regs->sp) {
> > > > +   unwind_start(&state, current, NULL, (unsigned long 
> > > *)regs->sp);
> > > +   } else {
> > > return;
> > > +   }
> >
> > AFAICT if we, by pure accident, end up with !bp for ORC, then we
> > initialize the unwind wrong.
> >
> > Note that @regs is mostly trivially correct, except for that tracepoint
> > case. So I don't think we should magic here.
>
> Ah, I didn't quite understand this code before, and I still don't
> really, but I guess the issue is that @regs can be either real or fake.
>
> In the real @regs case, we just want to always unwind starting from
> regs->sp.
>
> But in the fake @regs case, we should instead unwind from the current
> frame, skipping all frames until we hit the fake regs->sp.  Because
> starting from fake/incomplete regs is most likely going to cause
> problems with ORC (or DWARF for other arches).
>
> The idea of a fake regs is fragile and confusing.  Is it possible to
> just pass in the "skip" stack pointer directly instead?  That should
> work for both FP and non-FP.  And I _think_ there's no need to ever
> capture regs->bp anyway -- the stack pointer should be sufficient.

Hi, that will break some other usage: if perf_callchain_kernel is
called but doesn't unwind to the callsite (which could be produced by
attaching an ebpf call to a kprobe), things will also go wrong. It should
start with the given registers when the registers are valid.
And it's true that with frame pointers omitted the BP value could be
anything, so 0 is also valid; I think I need to find a better way to tell
whether we can start from the register values or start unwinding directly
and skip until we've got to the stack we want.

>
> In other words, either regs should be "real", and skip_sp is NULL; or
> regs should be NULL and skip_sp should have a value.
>
> --
> Josh
--
Best Regards,
Kairui Song


[RFC PATCH v2] perf/x86: make perf callchain work without CONFIG_FRAME_POINTER

2019-04-08 Thread Kairui Song
Currently perf callchain is not working properly with the ORC unwinder
when sampling events from a trace point. We'll get a useless in-kernel
callchain like this:

perf  6429 [000]22.498450: kmem:mm_page_alloc: page=0x176a17 
pfn=1534487 order=0 migratetype=0 gfp_flags=GFP_KERNEL
be23e32e __alloc_pages_nodemask+0x22e 
(/lib/modules/5.1.0-rc3+/build/vmlinux)
7efdf7f7d3e8 __poll+0x18 (/usr/lib64/libc-2.28.so)
5651468729c1 [unknown] (/usr/bin/perf)
5651467ee82a main+0x69a (/usr/bin/perf)
7efdf7eaf413 __libc_start_main+0xf3 (/usr/lib64/libc-2.28.so)
5541f689495641d7 [unknown] ([unknown])

The root cause is that within a trace point perf will try to dump the
caller's registers, but without CONFIG_FRAME_POINTER we can't get the
caller's BP as the frame pointer, so the current frame pointer is returned
instead. We get a register combination of the caller's IP and the current
BP, which confuses the unwinder and ends the stacktrace early.

So in such a case don't dump BP, and just let the unwinder start directly
and skip until we reach the stack frame we want.

This makes the callchain get the full kernel-space stacktrace again:

perf  6503 [000]  1567.570191: kmem:mm_page_alloc: page=0x16c904 
pfn=1493252 order=0 migratetype=0 gfp_flags=GFP_KERNEL
b523e2ae __alloc_pages_nodemask+0x22e 
(/lib/modules/5.1.0-rc3+/build/vmlinux)
b52383bd __get_free_pages+0xd 
(/lib/modules/5.1.0-rc3+/build/vmlinux)
b52fd28a __pollwait+0x8a (/lib/modules/5.1.0-rc3+/build/vmlinux)
b521426f perf_poll+0x2f (/lib/modules/5.1.0-rc3+/build/vmlinux)
b52fe3e2 do_sys_poll+0x252 (/lib/modules/5.1.0-rc3+/build/vmlinux)
b52ff027 __x64_sys_poll+0x37 (/lib/modules/5.1.0-rc3+/build/vmlinux)
b500418b do_syscall_64+0x5b (/lib/modules/5.1.0-rc3+/build/vmlinux)
b5a0008c entry_SYSCALL_64_after_hwframe+0x44 
(/lib/modules/5.1.0-rc3+/build/vmlinux)
7f71e92d03e8 __poll+0x18 (/usr/lib64/libc-2.28.so)
55a22960d9c1 [unknown] (/usr/bin/perf)
55a22958982a main+0x69a (/usr/bin/perf)
7f71e9202413 __libc_start_main+0xf3 (/usr/lib64/libc-2.28.so)
5541f689495641d7 [unknown] ([unknown])

Signed-off-by: Kairui Song 
---

Update from V1:
  Get rid of a lot of unnecessary code and just don't dump an inaccurate
  BP, and use SP as the marker for the target frame.

 arch/x86/events/core.c| 18 +++---
 arch/x86/include/asm/stacktrace.h |  9 +++--
 2 files changed, 22 insertions(+), 5 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index e2b1447192a8..6075a4f94376 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2355,6 +2355,12 @@ void arch_perf_update_userpage(struct perf_event *event,
cyc2ns_read_end();
 }
 
+static inline int
+valid_perf_registers(struct pt_regs *regs)
+{
+   return (regs->ip && regs->bp && regs->sp);
+}
+
 void
 perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs 
*regs)
 {
@@ -2366,11 +2372,17 @@ perf_callchain_kernel(struct perf_callchain_entry_ctx 
*entry, struct pt_regs *re
return;
}
 
-   if (perf_callchain_store(entry, regs->ip))
+   if (valid_perf_registers(regs)) {
+   if (perf_callchain_store(entry, regs->ip))
+   return;
+   unwind_start(&state, current, regs, NULL);
+   } else if (regs->sp) {
+   unwind_start(&state, current, NULL, (unsigned long *)regs->sp);
+   } else {
return;
+   }
 
-   for (unwind_start(&state, current, regs, NULL); !unwind_done(&state);
-unwind_next_frame(&state)) {
+   for (; !unwind_done(&state); unwind_next_frame(&state)) {
addr = unwind_get_return_address(&state);
if (!addr || perf_callchain_store(entry, addr))
return;
diff --git a/arch/x86/include/asm/stacktrace.h 
b/arch/x86/include/asm/stacktrace.h
index f335aad404a4..226077e20412 100644
--- a/arch/x86/include/asm/stacktrace.h
+++ b/arch/x86/include/asm/stacktrace.h
@@ -98,18 +98,23 @@ struct stack_frame_ia32 {
 u32 return_address;
 };
 
+#ifdef CONFIG_FRAME_POINTER
 static inline unsigned long caller_frame_pointer(void)
 {
struct stack_frame *frame;
 
frame = __builtin_frame_address(0);
 
-#ifdef CONFIG_FRAME_POINTER
frame = frame->next_frame;
-#endif
 
return (unsigned long)frame;
 }
+#else
+static inline unsigned long caller_frame_pointer(void)
+{
+   return 0;
+}
+#endif
 
 void show_opcodes(struct pt_regs *regs, const char *loglvl);
 void show_ip(struct pt_regs *regs, const char *loglvl);
-- 
2.20.1



Re: [RFC PATCH] perf/x86: make perf callchain work without CONFIG_FRAME_POINTER

2019-04-05 Thread Kairui Song
On Sat, Apr 6, 2019 at 1:27 AM Josh Poimboeuf  wrote:
>
> On Sat, Apr 06, 2019 at 01:05:55AM +0800, Kairui Song wrote:
> > On Sat, Apr 6, 2019 at 12:57 AM Josh Poimboeuf  wrote:
> > >
> > > On Fri, Apr 05, 2019 at 11:13:02PM +0800, Kairui Song wrote:
> > > > Hi Josh, thanks for the review, I tried again, using latest upstream
> > > > kernel commit ea2cec24c8d429ee6f99040e4eb6c7ad627fe777:
> > > > # uname -a
> > > > Linux localhost.localdomain 5.1.0-rc3+ #29 SMP Fri Apr 5 22:53:05 CST
> > > > 2019 x86_64 x86_64 x86_64 GNU/Linux
> > > >
> > > > Having following config:
> > > > > CONFIG_UNWINDER_ORC=y
> > > > > # CONFIG_UNWINDER_FRAME_POINTER is not set
> > > > and CONFIG_FRAME_POINTER is off too.
> > > >
> > > > Then record something with perf (also latest upstream version):
> > > > ./perf record -g -e kmem:* -c 1
> > > >
> > > > Interrupt it, then view the output:
> > > > perf script | less
> > > >
> > > > Then I noticed the stacktrace in the kernel is incomplete, like the following.
> > > > Did I miss anything?
> > > > --
> > > > lvmetad   617 [000]55.600786: kmem:kfree:
> > > > call_site=b219e269 ptr=(nil)
> > > > b22b2d1c kfree+0x11c 
> > > > (/lib/modules/5.1.0-rc3+/build/vmlinux)
> > > > 7fba7e58fd0f __select+0x5f (/usr/lib64/libc-2.28.so)
> > > >
> > > > kworker/u2:5-rp   171 [000]55.628529:
> > > > kmem:kmem_cache_alloc: call_site=b20e963d
> > > > ptr=0xa07f39c581e0 bytes_req=80 bytes_alloc=80
> > > > gfp_flags=GFP_ATOMIC
> > > > b22b0dec kmem_cache_alloc+0x13c
> > > > (/lib/modules/5.1.0-rc3+/build/vmlinux)
> > > > -
> > > >
> > > > And for the patch, I debugged the problem and found how it happened:
> > > > the reason is that we use the following code for fetching the registers
> > > > at a trace point:
> > > > ...snip...
> > > > #define perf_arch_fetch_caller_regs(regs, __ip) { \
> > > > (regs)->ip = (__ip); \
> > > > (regs)->bp = caller_frame_pointer(); \
> > > > (regs)->cs = __KERNEL_CS;
> > > > ...snip...
> > >
> > > Thanks, I was able to recreate.  It only happens when unwinding from a
> > > tracepoint.  I haven't investigated yet, but
> > > perf_arch_fetch_caller_regs() looks highly suspect, since it's doing
> > > (regs)->bp = caller_frame_pointer(), even for ORC.
> > >
> > > My only explanation for how your patch works is that RBP just happens to
> > > point to somewhere higher on the stack, causing the unwinder to start at
> > > a semi-random location.  I suspect the real "fix" is that you're no
> > > longer passing the regs to unwind_start().
> > >
> >
> > Yes, that's right. Simply not passing regs to unwind_start will let the
> > unwind start from the perf sample handling functions and introduce a
> > lot of "noise", so I let it skip the frames until it reaches the
> > frame of the trace point. regs->bp should still point to the
> > stack base of the function which gets called in the tracepoint that
> > triggers the perf sample, so letting the unwinder skip all the frames
> > above it makes it work.
>
> Ah, now I think I understand, thanks.  perf_arch_fetch_caller_regs()
> puts it in regs->bp, and then perf_callchain_kernel() reads that value
> to tell the unwinder where to start dumping the stack trace.  I guess
> that explains why your patch works, though it still seems very odd that
> perf_arch_fetch_caller_regs() is using regs->bp to store the frame
> address.  Maybe regs->sp would be more appropriate.
>
> --
> Josh

Right, thanks for the comment. On second thought there are indeed some
other issues in the patch: it still won't fix the problem when used with
ebpf and a tracepoint, and I made some mistakes in handling the callchain
in the different cases. I will rethink this and post an update later.


--
Best Regards,
Kairui Song


Re: [RFC PATCH] perf/x86: make perf callchain work without CONFIG_FRAME_POINTER

2019-04-05 Thread Kairui Song
On Sat, Apr 6, 2019 at 12:57 AM Josh Poimboeuf  wrote:
>
> On Fri, Apr 05, 2019 at 11:13:02PM +0800, Kairui Song wrote:
> > Hi Josh, thanks for the review, I tried again, using latest upstream
> > kernel commit ea2cec24c8d429ee6f99040e4eb6c7ad627fe777:
> > # uname -a
> > Linux localhost.localdomain 5.1.0-rc3+ #29 SMP Fri Apr 5 22:53:05 CST
> > 2019 x86_64 x86_64 x86_64 GNU/Linux
> >
> > Having following config:
> > > CONFIG_UNWINDER_ORC=y
> > > # CONFIG_UNWINDER_FRAME_POINTER is not set
> > and CONFIG_FRAME_POINTER is off too.
> >
> > Then record something with perf (also latest upstream version):
> > ./perf record -g -e kmem:* -c 1
> >
> > Interrupt it, then view the output:
> > perf script | less
> >
> > Then I noticed the stacktrace in the kernel is incomplete, like the following.
> > Did I miss anything?
> > --
> > lvmetad   617 [000]55.600786: kmem:kfree:
> > call_site=b219e269 ptr=(nil)
> > b22b2d1c kfree+0x11c (/lib/modules/5.1.0-rc3+/build/vmlinux)
> > 7fba7e58fd0f __select+0x5f (/usr/lib64/libc-2.28.so)
> >
> > kworker/u2:5-rp   171 [000]55.628529:
> > kmem:kmem_cache_alloc: call_site=b20e963d
> > ptr=0xa07f39c581e0 bytes_req=80 bytes_alloc=80
> > gfp_flags=GFP_ATOMIC
> > b22b0dec kmem_cache_alloc+0x13c
> > (/lib/modules/5.1.0-rc3+/build/vmlinux)
> > -
> >
> > And for the patch, I debugged the problem and found how it happened:
> > the reason is that we use the following code for fetching the registers
> > at a trace point:
> > ...snip...
> > #define perf_arch_fetch_caller_regs(regs, __ip) { \
> > (regs)->ip = (__ip); \
> > (regs)->bp = caller_frame_pointer(); \
> > (regs)->cs = __KERNEL_CS;
> > ...snip...
>
> Thanks, I was able to recreate.  It only happens when unwinding from a
> tracepoint.  I haven't investigated yet, but
> perf_arch_fetch_caller_regs() looks highly suspect, since it's doing
> (regs)->bp = caller_frame_pointer(), even for ORC.
>
> My only explanation for how your patch works is that RBP just happens to
> point to somewhere higher on the stack, causing the unwinder to start at
> a semi-random location.  I suspect the real "fix" is that you're no
> longer passing the regs to unwind_start().
>

Yes, that's right. Simply not passing regs to unwind_start will let the
unwind start from the perf sample handling functions and introduce a
lot of "noise", so I let it skip the frames until it reaches the
frame of the trace point. regs->bp should still point to the
stack base of the function which gets called in the tracepoint that
triggers the perf sample, so letting the unwinder skip all the frames
above it makes it work.

-- 
Best Regards,
Kairui Song


Re: [RFC PATCH] perf/x86: make perf callchain work without CONFIG_FRAME_POINTER

2019-04-05 Thread Kairui Song
On Fri, Apr 5, 2019 at 3:17 PM Peter Zijlstra  wrote:
>
> And you forgot to Cc Josh..
>

Hi, thanks for the reply and for Cc'ing more people. I just copied the list
from ./scripts/get_maintainer.pl, will pay more attention next time.

> >
> > Just found that with the ORC unwinder the perf callchain is unusable, and
> > this seems to fix it well; any suggestion is welcome, thanks!
>
> That whole .direct stuff is horrible crap.
>

Sorry if I did anything dumb, but I didn't find a better way to make
it work, so I sent this RFC... Would you mind telling me what I'm doing
wrong, or giving any suggestion about how I should improve it?

--
Best Regards,
Kairui Song


Re: [RFC PATCH] perf/x86: make perf callchain work without CONFIG_FRAME_POINTER

2019-04-05 Thread Kairui Song
On Fri, Apr 5, 2019 at 10:09 PM Josh Poimboeuf  wrote:
>
> On Fri, Apr 05, 2019 at 01:25:45AM +0800, Kairui Song wrote:
> > Currently perf callchain is not working properly with the ORC unwinder;
> > we'll get a useless in-kernel callchain like this:
> >
> > perf  6429 [000]22.498450: kmem:mm_page_alloc: 
> > page=0x176a17 pfn=1534487 order=0 migratetype=0 gfp_flags=GFP_KERNEL
> > be23e32e __alloc_pages_nodemask+0x22e 
> > (/lib/modules/5.1.0-rc3+/build/vmlinux)
> > 7efdf7f7d3e8 __poll+0x18 (/usr/lib64/libc-2.28.so)
> > 5651468729c1 [unknown] (/usr/bin/perf)
> > 5651467ee82a main+0x69a (/usr/bin/perf)
> > 7efdf7eaf413 __libc_start_main+0xf3 (/usr/lib64/libc-2.28.so)
> > 5541f689495641d7 [unknown] ([unknown])
> >
> > Without CONFIG_FRAME_POINTER, BP is not reserved as the frame pointer,
> > so we can't get the caller's frame pointer; instead the current frame
> > pointer is returned when trying to fetch the caller's registers. The
> > unwinder will error out early and end the stacktrace early.
> >
> > So instead of letting the unwinder start with the dumped registers, we
> > start it right where the unwinding began when the stacktrace was
> > triggered by the trace event directly, and skip until the frame pointer
> > is reached.
> >
> > This makes the callchain get the full in-kernel stacktrace again:
> >
> > perf  6503 [000]  1567.570191: kmem:mm_page_alloc: 
> > page=0x16c904 pfn=1493252 order=0 migratetype=0 gfp_flags=GFP_KERNEL
> > b523e2ae __alloc_pages_nodemask+0x22e 
> > (/lib/modules/5.1.0-rc3+/build/vmlinux)
> > b52383bd __get_free_pages+0xd 
> > (/lib/modules/5.1.0-rc3+/build/vmlinux)
> > b52fd28a __pollwait+0x8a 
> > (/lib/modules/5.1.0-rc3+/build/vmlinux)
> > b521426f perf_poll+0x2f 
> > (/lib/modules/5.1.0-rc3+/build/vmlinux)
> > b52fe3e2 do_sys_poll+0x252 
> > (/lib/modules/5.1.0-rc3+/build/vmlinux)
> > b52ff027 __x64_sys_poll+0x37 
> > (/lib/modules/5.1.0-rc3+/build/vmlinux)
> > b500418b do_syscall_64+0x5b 
> > (/lib/modules/5.1.0-rc3+/build/vmlinux)
> > b5a0008c entry_SYSCALL_64_after_hwframe+0x44 
> > (/lib/modules/5.1.0-rc3+/build/vmlinux)
> > 7f71e92d03e8 __poll+0x18 (/usr/lib64/libc-2.28.so)
> > 55a22960d9c1 [unknown] (/usr/bin/perf)
> > 55a22958982a main+0x69a (/usr/bin/perf)
> > 7f71e9202413 __libc_start_main+0xf3 (/usr/lib64/libc-2.28.so)
> > 5541f689495641d7 [unknown] ([unknown])
> >
> > 
> >
> > Just found that with the ORC unwinder the perf callchain is unusable, and
> > this seems to fix it well; any suggestion is welcome, thanks!
>
> Hi Kairui,
>
> Without CONFIG_FRAME_POINTER, the BP register has no meaning, so I don't
> see how this patch could work.
>
> Also, perf stack traces seem to work fine for me with ORC.  Can you give
> some details on how to recreate the issue?
>
> --
> Josh


Hi Josh, thanks for the review, I tried again, using latest upstream
kernel commit ea2cec24c8d429ee6f99040e4eb6c7ad627fe777:
# uname -a
Linux localhost.localdomain 5.1.0-rc3+ #29 SMP Fri Apr 5 22:53:05 CST
2019 x86_64 x86_64 x86_64 GNU/Linux

Having following config:
> CONFIG_UNWINDER_ORC=y
> # CONFIG_UNWINDER_FRAME_POINTER is not set
and CONFIG_FRAME_POINTER is off too.

Then record something with perf (also latest upstream version):
./perf record -g -e kmem:* -c 1

Interrupt it, then view the output:
perf script | less

Then I noticed the stacktrace in the kernel is incomplete, like the following.
Did I miss anything?
--
lvmetad   617 [000]55.600786: kmem:kfree:
call_site=b219e269 ptr=(nil)
b22b2d1c kfree+0x11c (/lib/modules/5.1.0-rc3+/build/vmlinux)
7fba7e58fd0f __select+0x5f (/usr/lib64/libc-2.28.so)

kworker/u2:5-rp   171 [000]55.628529:
kmem:kmem_cache_alloc: call_site=b20e963d
ptr=0xa07f39c581e0 bytes_req=80 bytes_alloc=80
gfp_flags=GFP_ATOMIC
b22b0dec kmem_cache_alloc+0x13c
(/lib/modules/5.1.0-rc3+/build/vmlinux)
-

And for the patch, I debugged the problem and found how it happened:
the reason is that we use the following code for fetching the registers at
a trace point:
...snip...
#define perf_arch_fetch_caller_regs(regs, __ip) { \
(regs)->ip = (__ip); \
(regs)->bp = caller_frame_pointer(); \
(regs)->cs = __KERNEL_CS;
...snip...

It tries to dump the registers of caller, but in the definition of
caller_frame_pointer:
static inline unsigned long caller_frame_pointer(void)
{
	struct stack_frame *frame;

	frame = __builtin_frame_address(0);

#ifdef CONFIG_FRAME_POINTER
	frame = frame->next_frame;
#endif

	return (unsigned long)frame;
}

Without CONFIG_FRAME_POINTER the next_frame dereference is skipped, so the
address of the *current* frame is returned rather than the caller's BP.

[RFC PATCH] perf/x86: make perf callchain work without CONFIG_FRAME_POINTER

2019-04-04 Thread Kairui Song
Currently perf callchain is not working properly with the ORC unwinder;
we'll get a useless in-kernel callchain like this:

perf  6429 [000]22.498450: kmem:mm_page_alloc: page=0x176a17 
pfn=1534487 order=0 migratetype=0 gfp_flags=GFP_KERNEL
be23e32e __alloc_pages_nodemask+0x22e 
(/lib/modules/5.1.0-rc3+/build/vmlinux)
7efdf7f7d3e8 __poll+0x18 (/usr/lib64/libc-2.28.so)
5651468729c1 [unknown] (/usr/bin/perf)
5651467ee82a main+0x69a (/usr/bin/perf)
7efdf7eaf413 __libc_start_main+0xf3 (/usr/lib64/libc-2.28.so)
5541f689495641d7 [unknown] ([unknown])

Without CONFIG_FRAME_POINTER, BP is not reserved as the frame pointer,
so we can't get the caller's frame pointer; instead the current frame
pointer is returned when trying to fetch the caller's registers. The
unwinder will error out early and end the stacktrace early.

So instead of letting the unwinder start with the dumped registers, we
start it right where the unwinding began when the stacktrace was triggered
by the trace event directly, and skip until the frame pointer is reached.

This makes the callchain get the full in-kernel stacktrace again:

perf  6503 [000]  1567.570191: kmem:mm_page_alloc: page=0x16c904 
pfn=1493252 order=0 migratetype=0 gfp_flags=GFP_KERNEL
b523e2ae __alloc_pages_nodemask+0x22e 
(/lib/modules/5.1.0-rc3+/build/vmlinux)
b52383bd __get_free_pages+0xd 
(/lib/modules/5.1.0-rc3+/build/vmlinux)
b52fd28a __pollwait+0x8a (/lib/modules/5.1.0-rc3+/build/vmlinux)
b521426f perf_poll+0x2f (/lib/modules/5.1.0-rc3+/build/vmlinux)
b52fe3e2 do_sys_poll+0x252 
(/lib/modules/5.1.0-rc3+/build/vmlinux)
b52ff027 __x64_sys_poll+0x37 
(/lib/modules/5.1.0-rc3+/build/vmlinux)
b500418b do_syscall_64+0x5b 
(/lib/modules/5.1.0-rc3+/build/vmlinux)
b5a0008c entry_SYSCALL_64_after_hwframe+0x44 
(/lib/modules/5.1.0-rc3+/build/vmlinux)
7f71e92d03e8 __poll+0x18 (/usr/lib64/libc-2.28.so)
55a22960d9c1 [unknown] (/usr/bin/perf)
55a22958982a main+0x69a (/usr/bin/perf)
7f71e9202413 __libc_start_main+0xf3 (/usr/lib64/libc-2.28.so)
5541f689495641d7 [unknown] ([unknown])



Just found that with the ORC unwinder the perf callchain is unusable, and
this seems to fix it well; any suggestion is welcome, thanks!

---
 arch/x86/events/core.c | 34 --
 include/linux/perf_event.h |  3 ++-
 kernel/bpf/stackmap.c  |  4 ++--
 kernel/events/callchain.c  | 13 +++--
 kernel/events/core.c   |  2 +-
 5 files changed, 44 insertions(+), 12 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index e2b1447192a8..3f3e110794ac 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2355,8 +2355,9 @@ void arch_perf_update_userpage(struct perf_event *event,
cyc2ns_read_end();
 }
 
-void
-perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs 
*regs)
+static void
+__perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs 
*regs,
+   bool direct_call)
 {
struct unwind_state state;
unsigned long addr;
@@ -2366,17 +2367,38 @@ perf_callchain_kernel(struct perf_callchain_entry_ctx 
*entry, struct pt_regs *re
return;
}
 
-   if (perf_callchain_store(entry, regs->ip))
-   return;
+   /*
+* Without frame pointer, we can't get a reliable caller bp value.
+* If this is called directly from a trace point, just start the
+* unwind from here and skip until the frame is reached.
+*/
+   if (IS_ENABLED(CONFIG_FRAME_POINTER) || !direct_call) {
+   if (perf_callchain_store(entry, regs->ip))
+   return;
+   unwind_start(&state, current, regs, NULL);
+   } else {
+   unwind_start(&state, current, NULL, (unsigned long*)regs->bp);
+   }
 
-   for (unwind_start(&state, current, regs, NULL); !unwind_done(&state);
-unwind_next_frame(&state)) {
+   for (; !unwind_done(&state); unwind_next_frame(&state)) {
addr = unwind_get_return_address(&state);
if (!addr || perf_callchain_store(entry, addr))
return;
}
 }
 
+void
+perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs 
*regs)
+{
+   __perf_callchain_kernel(entry, regs, false);
+}
+
+void
+perf_callchain_kernel_direct(struct perf_callchain_entry_ctx *entry, struct 
pt_regs *regs)
+{
+   __perf_callchain_kernel(entry, regs, true);
+}
+
 static inline int
 valid_user_frame(const void __user *fp, unsigned long size)
 {
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index e47ef764f613..b0e33ba36695 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1154,9 +1154,10 @@ DECLARE_PER_CPU(struct perf_callchain_entry, 

[tip:x86/urgent] x86/gart: Exclude GART aperture from kcore

2019-03-23 Thread tip-bot for Kairui Song
Commit-ID:  ffc8599aa9763f39f6736a79da4d1575e7006f9a
Gitweb: https://git.kernel.org/tip/ffc8599aa9763f39f6736a79da4d1575e7006f9a
Author: Kairui Song 
AuthorDate: Fri, 8 Mar 2019 11:05:08 +0800
Committer:  Thomas Gleixner 
CommitDate: Sat, 23 Mar 2019 12:11:49 +0100

x86/gart: Exclude GART aperture from kcore

On machines where the GART aperture is mapped over physical RAM,
/proc/kcore contains the GART aperture range. Accessing the GART range via
/proc/kcore results in a kernel crash.

vmcore used to have the same issue, until it was fixed with commit
2a3e83c6f96c ("x86/gart: Exclude GART aperture from vmcore")', leveraging
existing hook infrastructure in vmcore to let /proc/vmcore return zeroes
when attempting to read the aperture region, and so it won't read from the
actual memory.

Apply the same workaround for kcore. First implement the same hook
infrastructure for kcore, then reuse the hook functions introduced in the
previous vmcore fix, with some minor adjustments: rename some functions
for more general usage, and simplify the hook infrastructure a bit as there
is no module usage yet.

Suggested-by: Baoquan He 
Signed-off-by: Kairui Song 
Signed-off-by: Thomas Gleixner 
Reviewed-by: Jiri Bohac 
Acked-by: Baoquan He 
Cc: Borislav Petkov 
Cc: "H. Peter Anvin" 
Cc: Alexey Dobriyan 
Cc: Andrew Morton 
Cc: Omar Sandoval 
Cc: Dave Young 
Link: https://lkml.kernel.org/r/20190308030508.13548-1-kas...@redhat.com


---
 arch/x86/kernel/aperture_64.c | 20 +---
 fs/proc/kcore.c   | 27 +++
 include/linux/kcore.h |  2 ++
 3 files changed, 42 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kernel/aperture_64.c b/arch/x86/kernel/aperture_64.c
index 58176b56354e..294ed4392a0e 100644
--- a/arch/x86/kernel/aperture_64.c
+++ b/arch/x86/kernel/aperture_64.c
@@ -14,6 +14,7 @@
 #define pr_fmt(fmt) "AGP: " fmt
 
 #include <linux/kernel.h>
+#include <linux/kcore.h>
 #include <linux/types.h>
 #include <linux/init.h>
 #include <linux/memblock.h>
@@ -57,7 +58,7 @@ int fallback_aper_force __initdata;
 
 int fix_aperture __initdata = 1;
 
-#ifdef CONFIG_PROC_VMCORE
+#if defined(CONFIG_PROC_VMCORE) || defined(CONFIG_PROC_KCORE)
 /*
  * If the first kernel maps the aperture over e820 RAM, the kdump kernel will
  * use the same range because it will remain configured in the northbridge.
@@ -66,20 +67,25 @@ int fix_aperture __initdata = 1;
  */
 static unsigned long aperture_pfn_start, aperture_page_count;
 
-static int gart_oldmem_pfn_is_ram(unsigned long pfn)
+static int gart_mem_pfn_is_ram(unsigned long pfn)
 {
return likely((pfn < aperture_pfn_start) ||
  (pfn >= aperture_pfn_start + aperture_page_count));
 }
 
-static void exclude_from_vmcore(u64 aper_base, u32 aper_order)
+static void __init exclude_from_core(u64 aper_base, u32 aper_order)
 {
aperture_pfn_start = aper_base >> PAGE_SHIFT;
aperture_page_count = (32 * 1024 * 1024) << aper_order >> PAGE_SHIFT;
-   WARN_ON(register_oldmem_pfn_is_ram(&gart_oldmem_pfn_is_ram));
+#ifdef CONFIG_PROC_VMCORE
+   WARN_ON(register_oldmem_pfn_is_ram(&gart_mem_pfn_is_ram));
+#endif
+#ifdef CONFIG_PROC_KCORE
+   WARN_ON(register_mem_pfn_is_ram(&gart_mem_pfn_is_ram));
+#endif
 }
 #else
-static void exclude_from_vmcore(u64 aper_base, u32 aper_order)
+static void exclude_from_core(u64 aper_base, u32 aper_order)
 {
 }
 #endif
@@ -474,7 +480,7 @@ out:
 * may have allocated the range over its e820 RAM
 * and fixed up the northbridge
 */
-   exclude_from_vmcore(last_aper_base, last_aper_order);
+   exclude_from_core(last_aper_base, last_aper_order);
 
return 1;
}
@@ -520,7 +526,7 @@ out:
 * overlap with the first kernel's memory. We can't access the
 * range through vmcore even though it should be part of the dump.
 */
-   exclude_from_vmcore(aper_alloc, aper_order);
+   exclude_from_core(aper_alloc, aper_order);
 
/* Fix up the north bridges */
for (i = 0; i < amd_nb_bus_dev_ranges[i].dev_limit; i++) {
diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c
index bbcc185062bb..d29d869abec1 100644
--- a/fs/proc/kcore.c
+++ b/fs/proc/kcore.c
@@ -54,6 +54,28 @@ static LIST_HEAD(kclist_head);
 static DECLARE_RWSEM(kclist_lock);
 static int kcore_need_update = 1;
 
+/*
+ * Returns > 0 for RAM pages, 0 for non-RAM pages, < 0 on error
+ * Same as oldmem_pfn_is_ram in vmcore
+ */
+static int (*mem_pfn_is_ram)(unsigned long pfn);
+
+int __init register_mem_pfn_is_ram(int (*fn)(unsigned long pfn))
+{
+   if (mem_pfn_is_ram)
+   return -EBUSY;
+   mem_pfn_is_ram = fn;
+   return 0;
+}
+
+static int pfn_is_ram(unsigned long pfn)
+{
+   if (mem_pfn_is_ram)
+   return mem_pfn_is_ram(pfn);
+   else
+   return 1;
+}
+
 /* This doesn't grab kclist_lock, so it should only be used at init time. */
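
For context, a minimal hedged sketch of how the kcore read path is
expected to consult this hook: pages the hook rejects are zero-filled
instead of being read. This is simplified from read_kcore() (the real
code goes through a bounce buffer); the helper name is illustrative.

static int kcore_read_ram_page(char __user *buffer, unsigned long start,
                               size_t tsz)
{
    /* GART aperture PFNs fail this check, so the reader sees zeroes */
    if (!pfn_is_ram(__pa(start) >> PAGE_SHIFT))
        return clear_user(buffer, tsz) ? -EFAULT : 0;

    return copy_to_user(buffer, (char *)start, tsz) ? -EFAULT : 0;
}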

Re: [PATCH v5] x86/gart/kcore: Exclude GART aperture from kcore

2019-03-21 Thread Kairui Song
On Fri, Mar 8, 2019 at 11:06 AM Kairui Song  wrote:
>
> On machines where the GART aperture is mapped over physical RAM,
> /proc/kcore contains the GART aperture range and reading it may lead
> to kernel panic.
>
> Vmcore used to have the same issue, until we fixed it in
> commit 2a3e83c6f96c ("x86/gart: Exclude GART aperture from vmcore"),
> leveraging existing hook infrastructure in vmcore to let /proc/vmcore
> return zeroes when attempting to read the aperture region, and so it
> won't read from the actual memory.
>
> We apply the same workaround for kcore. First implement the same hook
> infrastructure for kcore, then reuse the hook functions introduced in
> the previous vmcore fix, with some minor adjustments: rename some
> functions for more general usage, and simplify the hook infrastructure
> a bit, as there is no module user yet.
>
> Suggested-by: Baoquan He 
> Signed-off-by: Kairui Song 
>
> ---
>
> Update from V4:
> - Remove the unregistering function and move functions never used after
>   init to .init
>
> Update from V3:
> - Reuse the approach in V2, as Jiri noticed the V3 approach may fail in
>   some use cases. It introduces overlapped regions in kcore, and can't
>   guarantee the read request will fall into the region we wanted.
> - Improve some function naming suggested by Baoquan in V2.
> - Simplify the hook registering and checking, we are not exporting the
>   hook register function for now, no need to make it that complex.
>
> Update from V2:
> Instead of repeating the same hook infrastructure for kcore, introduce
> a new kcore area type to avoid reading from, and let kcore always bypass
> this kind of area.
>
> Update from V1:
> Fix a compile error when CONFIG_PROC_KCORE is not set
>
>  arch/x86/kernel/aperture_64.c | 20 +---
>  fs/proc/kcore.c   | 27 +++
>  include/linux/kcore.h |  2 ++
>  3 files changed, 42 insertions(+), 7 deletions(-)
>
> diff --git a/arch/x86/kernel/aperture_64.c b/arch/x86/kernel/aperture_64.c
> index 58176b56354e..294ed4392a0e 100644
> --- a/arch/x86/kernel/aperture_64.c
> +++ b/arch/x86/kernel/aperture_64.c
> @@ -14,6 +14,7 @@
>  #define pr_fmt(fmt) "AGP: " fmt
>
>  #include <linux/kernel.h>
> +#include <linux/kcore.h>
>  #include <linux/types.h>
>  #include <linux/init.h>
>  #include <linux/memblock.h>
> @@ -57,7 +58,7 @@ int fallback_aper_force __initdata;
>
>  int fix_aperture __initdata = 1;
>
> -#ifdef CONFIG_PROC_VMCORE
> +#if defined(CONFIG_PROC_VMCORE) || defined(CONFIG_PROC_KCORE)
>  /*
>   * If the first kernel maps the aperture over e820 RAM, the kdump kernel will
>   * use the same range because it will remain configured in the northbridge.
> @@ -66,20 +67,25 @@ int fix_aperture __initdata = 1;
>   */
>  static unsigned long aperture_pfn_start, aperture_page_count;
>
> -static int gart_oldmem_pfn_is_ram(unsigned long pfn)
> +static int gart_mem_pfn_is_ram(unsigned long pfn)
>  {
> return likely((pfn < aperture_pfn_start) ||
>   (pfn >= aperture_pfn_start + aperture_page_count));
>  }
>
> -static void exclude_from_vmcore(u64 aper_base, u32 aper_order)
> +static void __init exclude_from_core(u64 aper_base, u32 aper_order)
>  {
> aperture_pfn_start = aper_base >> PAGE_SHIFT;
> aperture_page_count = (32 * 1024 * 1024) << aper_order >> PAGE_SHIFT;
> -   WARN_ON(register_oldmem_pfn_is_ram(&gart_oldmem_pfn_is_ram));
> +#ifdef CONFIG_PROC_VMCORE
> +   WARN_ON(register_oldmem_pfn_is_ram(&gart_mem_pfn_is_ram));
> +#endif
> +#ifdef CONFIG_PROC_KCORE
> +   WARN_ON(register_mem_pfn_is_ram(&gart_mem_pfn_is_ram));
> +#endif
>  }
>  #else
> -static void exclude_from_vmcore(u64 aper_base, u32 aper_order)
> +static void exclude_from_core(u64 aper_base, u32 aper_order)
>  {
>  }
>  #endif
> @@ -474,7 +480,7 @@ int __init gart_iommu_hole_init(void)
>  * may have allocated the range over its e820 RAM
>  * and fixed up the northbridge
>  */
> -   exclude_from_vmcore(last_aper_base, last_aper_order);
> +   exclude_from_core(last_aper_base, last_aper_order);
>
> return 1;
> }
> @@ -520,7 +526,7 @@ int __init gart_iommu_hole_init(void)
>  * overlap with the first kernel's memory. We can't access the
>  * range through vmcore even though it should be part of the dump.
>  */
> -   exclude_from_vmcore(aper_alloc, aper_order);
> +   exclude_from_core(aper_alloc, aper_order);
>
> /* Fix up the north bridges */
> for (i = 0; i < amd_nb_bus_dev_ranges[i].dev_limit; i++) {
> dif

[PATCH v5] x86/gart/kcore: Exclude GART aperture from kcore

2019-03-07 Thread Kairui Song
On machines where the GART aperture is mapped over physical RAM,
/proc/kcore contains the GART aperture range and reading it may lead
to kernel panic.

Vmcore used to have the same issue, until we fixed it in
commit 2a3e83c6f96c ("x86/gart: Exclude GART aperture from vmcore"),
leveraging existing hook infrastructure in vmcore to let /proc/vmcore
return zeroes when attempting to read the aperture region, and so it
won't read from the actual memory.

We apply the same workaround for kcore. First implement the same hook
infrastructure for kcore, then reuse the hook functions introduced in
the previous vmcore fix, with some minor adjustments: rename some
functions for more general usage, and simplify the hook infrastructure
a bit, as there is no module user yet.

Suggested-by: Baoquan He 
Signed-off-by: Kairui Song 

---

Update from V4:
- Remove the unregistering function and move functions never used after
  init to .init

Update from V3:
- Reuse the approach in V2, as Jiri noticed the V3 approach may fail in
  some use cases. It introduces overlapped regions in kcore, and can't
  guarantee the read request will fall into the region we wanted.
- Improve some function naming suggested by Baoquan in V2.
- Simplify the hook registering and checking, we are not exporting the
  hook register function for now, no need to make it that complex.

Update from V2:
Instead of repeating the same hook infrastructure for kcore, introduce
a new kcore area type to avoid reading from, and let kcore always bypass
this kind of area.

Update from V1:
Fix a compile error when CONFIG_PROC_KCORE is not set

 arch/x86/kernel/aperture_64.c | 20 +---
 fs/proc/kcore.c   | 27 +++
 include/linux/kcore.h |  2 ++
 3 files changed, 42 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kernel/aperture_64.c b/arch/x86/kernel/aperture_64.c
index 58176b56354e..294ed4392a0e 100644
--- a/arch/x86/kernel/aperture_64.c
+++ b/arch/x86/kernel/aperture_64.c
@@ -14,6 +14,7 @@
 #define pr_fmt(fmt) "AGP: " fmt
 
 #include <linux/kernel.h>
+#include <linux/kcore.h>
 #include <linux/types.h>
 #include <linux/init.h>
 #include <linux/memblock.h>
@@ -57,7 +58,7 @@ int fallback_aper_force __initdata;
 
 int fix_aperture __initdata = 1;
 
-#ifdef CONFIG_PROC_VMCORE
+#if defined(CONFIG_PROC_VMCORE) || defined(CONFIG_PROC_KCORE)
 /*
  * If the first kernel maps the aperture over e820 RAM, the kdump kernel will
  * use the same range because it will remain configured in the northbridge.
@@ -66,20 +67,25 @@ int fix_aperture __initdata = 1;
  */
 static unsigned long aperture_pfn_start, aperture_page_count;
 
-static int gart_oldmem_pfn_is_ram(unsigned long pfn)
+static int gart_mem_pfn_is_ram(unsigned long pfn)
 {
return likely((pfn < aperture_pfn_start) ||
  (pfn >= aperture_pfn_start + aperture_page_count));
 }
 
-static void exclude_from_vmcore(u64 aper_base, u32 aper_order)
+static void __init exclude_from_core(u64 aper_base, u32 aper_order)
 {
aperture_pfn_start = aper_base >> PAGE_SHIFT;
aperture_page_count = (32 * 1024 * 1024) << aper_order >> PAGE_SHIFT;
-   WARN_ON(register_oldmem_pfn_is_ram(&gart_oldmem_pfn_is_ram));
+#ifdef CONFIG_PROC_VMCORE
+   WARN_ON(register_oldmem_pfn_is_ram(&gart_mem_pfn_is_ram));
+#endif
+#ifdef CONFIG_PROC_KCORE
+   WARN_ON(register_mem_pfn_is_ram(&gart_mem_pfn_is_ram));
+#endif
 }
 #else
-static void exclude_from_vmcore(u64 aper_base, u32 aper_order)
+static void exclude_from_core(u64 aper_base, u32 aper_order)
 {
 }
 #endif
@@ -474,7 +480,7 @@ int __init gart_iommu_hole_init(void)
 * may have allocated the range over its e820 RAM
 * and fixed up the northbridge
 */
-   exclude_from_vmcore(last_aper_base, last_aper_order);
+   exclude_from_core(last_aper_base, last_aper_order);
 
return 1;
}
@@ -520,7 +526,7 @@ int __init gart_iommu_hole_init(void)
 * overlap with the first kernel's memory. We can't access the
 * range through vmcore even though it should be part of the dump.
 */
-   exclude_from_vmcore(aper_alloc, aper_order);
+   exclude_from_core(aper_alloc, aper_order);
 
/* Fix up the north bridges */
for (i = 0; i < amd_nb_bus_dev_ranges[i].dev_limit; i++) {
diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c
index bbcc185062bb..d29d869abec1 100644
--- a/fs/proc/kcore.c
+++ b/fs/proc/kcore.c
@@ -54,6 +54,28 @@ static LIST_HEAD(kclist_head);
 static DECLARE_RWSEM(kclist_lock);
 static int kcore_need_update = 1;
 
+/*
+ * Returns > 0 for RAM pages, 0 for non-RAM pages, < 0 on error
+ * Same as oldmem_pfn_is_ram in vmcore
+ */
+static int (*mem_pfn_is_ram)(unsigned long pfn);
+
+int __init register_mem_pfn_is_ram(int (*fn)(unsigned long pfn))
+{
+   if (mem_pfn_is_ram)
+   return -EBUSY;
+   mem_pfn_is_ram = fn;
+   return 0;
+}

Re: [PATCH v4] x86/gart/kcore: Exclude GART aperture from kcore

2019-03-06 Thread Kairui Song
On Thu, Mar 7, 2019 at 1:03 AM Jiri Bohac  wrote:
>
> Hi,
>
> On Wed, Mar 06, 2019 at 07:38:59PM +0800, Kairui Song wrote:
> > +int register_mem_pfn_is_ram(int (*fn)(unsigned long pfn))
> > +{
> > + if (mem_pfn_is_ram)
> > + return -EBUSY;
> > + mem_pfn_is_ram = fn;
> > + return 0;
> > +}
> > +
> > +void unregister_mem_pfn_is_ram(void)
> > +{
> > + mem_pfn_is_ram = NULL;
> > +}
> > +
> > +static int pfn_is_ram(unsigned long pfn)
> > +{
> > + if (mem_pfn_is_ram)
> > + return mem_pfn_is_ram(pfn);
> > + else
> > + return 1;
> > +}
> > +
>
> If anyone were ever to use unregister_mem_pfn_is_ram(),
> pfn_is_ram() would become racy.
>
> In V2 you had this:
> +   fn = mem_pfn_is_ram;
> +   if (fn)
> +   ret = fn(pfn);
>
> I agree it's unnecessary since nothing uses
> unregister_mem_pfn_is_ram(). But then I think it would be best to
> just drop the unregister function.
>
> Otherwise the patch looks good to me.
>

Good catch, let me remove the unregister function.
Also, I'd like to have an __init prefix for register_mem_pfn_is_ram,
will update in V5.

--
Best Regards,
Kairui Song
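
A hedged sketch of the race discussed above, for the record: if the
unregister function were kept, the hook pointer would have to be
snapshotted before the call, otherwise it could become NULL between the
check and the use. The function name here is hypothetical.

static int pfn_is_ram_raceless(unsigned long pfn)
{
    int (*fn)(unsigned long) = READ_ONCE(mem_pfn_is_ram);

    return fn ? fn(pfn) : 1;    /* no hook registered: treat as RAM */
}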


[tip:x86/urgent] x86/hyperv: Fix kernel panic when kexec on HyperV

2019-03-06 Thread tip-bot for Kairui Song
Commit-ID:  179fb36abb097976997f50733d5b122a29158cba
Gitweb: https://git.kernel.org/tip/179fb36abb097976997f50733d5b122a29158cba
Author: Kairui Song 
AuthorDate: Wed, 6 Mar 2019 19:18:27 +0800
Committer:  Thomas Gleixner 
CommitDate: Wed, 6 Mar 2019 23:27:44 +0100

x86/hyperv: Fix kernel panic when kexec on HyperV

After commit 68bb7bfb7985 ("X86/Hyper-V: Enable IPI enlightenments"),
kexec fails with a kernel panic:

kexec_core: Starting new kernel
BUG: unable to handle kernel NULL pointer dereference at 
Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 
Hyper-V UEFI Release v3.0 03/02/2018
RIP: 0010:0xc901d000

Call Trace:
 ? __send_ipi_mask+0x1c6/0x2d0
 ? hv_send_ipi_mask_allbutself+0x6d/0xb0
 ? mp_save_irq+0x70/0x70
 ? __ioapic_read_entry+0x32/0x50
 ? ioapic_read_entry+0x39/0x50
 ? clear_IO_APIC_pin+0xb8/0x110
 ? native_stop_other_cpus+0x6e/0x170
 ? native_machine_shutdown+0x22/0x40
 ? kernel_kexec+0x136/0x156

That happens if hypercall based IPIs are used because the hypercall page is
reset very early upon kexec reboot, but kexec sends IPIs to stop CPUs,
which invokes the hypercall and dereferences the unusable page.

To fix this, reset hv_hypercall_pg to NULL before the page is reset to avoid
any misuse; IPI sending will fall back to the non-hypercall-based
method. This only happens on kexec / kdump, so just setting the pointer to
NULL is good enough.

Fixes: 68bb7bfb7985 ("X86/Hyper-V: Enable IPI enlightenments")
Signed-off-by: Kairui Song 
Signed-off-by: Thomas Gleixner 
Cc: "K. Y. Srinivasan" 
Cc: Haiyang Zhang 
Cc: Stephen Hemminger 
Cc: Sasha Levin 
Cc: Borislav Petkov 
Cc: "H. Peter Anvin" 
Cc: Vitaly Kuznetsov 
Cc: Dave Young 
Cc: de...@linuxdriverproject.org
Link: https://lkml.kernel.org/r/20190306111827.14131-1-kas...@redhat.com
---
 arch/x86/hyperv/hv_init.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
index 7abb09e2eeb8..d3f42b6bbdac 100644
--- a/arch/x86/hyperv/hv_init.c
+++ b/arch/x86/hyperv/hv_init.c
@@ -406,6 +406,13 @@ void hyperv_cleanup(void)
/* Reset our OS id */
wrmsrl(HV_X64_MSR_GUEST_OS_ID, 0);
 
+   /*
+* Reset hypercall page reference before reset the page,
+* let hypercall operations fail safely rather than
+* panic the kernel for using invalid hypercall page
+*/
+   hv_hypercall_pg = NULL;
+
/* Reset the hypercall page */
hypercall_msr.as_uint64 = 0;
wrmsrl(HV_X64_MSR_HYPERCALL, hypercall_msr.as_uint64);


[PATCH v4] x86/gart/kcore: Exclude GART aperture from kcore

2019-03-06 Thread Kairui Song
On machines where the GART aperture is mapped over physical RAM,
/proc/kcore contains the GART aperture range and reading it may lead
to kernel panic.

Vmcore used to have the same issue, until we fixed it in
commit 2a3e83c6f96c ("x86/gart: Exclude GART aperture from vmcore"),
leveraging existing hook infrastructure in vmcore to let /proc/vmcore
return zeroes when attempting to read the aperture region, and so it
won't read from the actual memory.

We apply the same workaround for kcore. First implement the same hook
infrastructure for kcore, then reuse the hook function introduced in
the previous vmcore fix, with some minor adjustments: rename some
functions for more general usage, and simplify the hook infrastructure
a bit, as there is no module user yet.

Suggested-by: Baoquan He 
Signed-off-by: Kairui Song 

---

Update from V3:
- Reuse the approach in V2, as Jiri noticed the V3 approach may fail in
  some use cases. It introduces overlapped regions in kcore, and can't
  guarantee the read request will fall into the region we wanted.
- Improve some function naming suggested by Baoquan in V2.
- Simplify the hook registering and checking, we are not exporting the
  hook register function for now, no need to make it that complex.
- Simplify the commit message

Update from V2:
Instead of repeating the same hook infrastructure for kcore, introduce
a new kcore area type to avoid reading from, and let kcore always bypass
this kind of area.

Update from V1:
Fix a compile error when CONFIG_PROC_KCORE is not set

 arch/x86/kernel/aperture_64.c | 20 +---
 fs/proc/kcore.c   | 32 
 include/linux/kcore.h |  3 +++
 3 files changed, 48 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kernel/aperture_64.c b/arch/x86/kernel/aperture_64.c
index 58176b56354e..c1319567b441 100644
--- a/arch/x86/kernel/aperture_64.c
+++ b/arch/x86/kernel/aperture_64.c
@@ -14,6 +14,7 @@
 #define pr_fmt(fmt) "AGP: " fmt
 
 #include <linux/kernel.h>
+#include <linux/kcore.h>
 #include <linux/types.h>
 #include <linux/init.h>
 #include <linux/memblock.h>
@@ -57,7 +58,7 @@ int fallback_aper_force __initdata;
 
 int fix_aperture __initdata = 1;
 
-#ifdef CONFIG_PROC_VMCORE
+#if defined(CONFIG_PROC_VMCORE) || defined(CONFIG_PROC_KCORE)
 /*
  * If the first kernel maps the aperture over e820 RAM, the kdump kernel will
  * use the same range because it will remain configured in the northbridge.
@@ -66,20 +67,25 @@ int fix_aperture __initdata = 1;
  */
 static unsigned long aperture_pfn_start, aperture_page_count;
 
-static int gart_oldmem_pfn_is_ram(unsigned long pfn)
+static int gart_mem_pfn_is_ram(unsigned long pfn)
 {
return likely((pfn < aperture_pfn_start) ||
  (pfn >= aperture_pfn_start + aperture_page_count));
 }
 
-static void exclude_from_vmcore(u64 aper_base, u32 aper_order)
+static void exclude_from_core(u64 aper_base, u32 aper_order)
 {
aperture_pfn_start = aper_base >> PAGE_SHIFT;
aperture_page_count = (32 * 1024 * 1024) << aper_order >> PAGE_SHIFT;
-   WARN_ON(register_oldmem_pfn_is_ram(&gart_oldmem_pfn_is_ram));
+#ifdef CONFIG_PROC_VMCORE
+   WARN_ON(register_oldmem_pfn_is_ram(&gart_mem_pfn_is_ram));
+#endif
+#ifdef CONFIG_PROC_KCORE
+   WARN_ON(register_mem_pfn_is_ram(&gart_mem_pfn_is_ram));
+#endif
 }
 #else
-static void exclude_from_vmcore(u64 aper_base, u32 aper_order)
+static void exclude_from_core(u64 aper_base, u32 aper_order)
 {
 }
 #endif
@@ -474,7 +480,7 @@ int __init gart_iommu_hole_init(void)
 * may have allocated the range over its e820 RAM
 * and fixed up the northbridge
 */
-   exclude_from_vmcore(last_aper_base, last_aper_order);
+   exclude_from_core(last_aper_base, last_aper_order);
 
return 1;
}
@@ -520,7 +526,7 @@ int __init gart_iommu_hole_init(void)
 * overlap with the first kernel's memory. We can't access the
 * range through vmcore even though it should be part of the dump.
 */
-   exclude_from_vmcore(aper_alloc, aper_order);
+   exclude_from_core(aper_alloc, aper_order);
 
/* Fix up the north bridges */
for (i = 0; i < amd_nb_bus_dev_ranges[i].dev_limit; i++) {
diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c
index bbcc185062bb..e51b324450d6 100644
--- a/fs/proc/kcore.c
+++ b/fs/proc/kcore.c
@@ -54,6 +54,33 @@ static LIST_HEAD(kclist_head);
 static DECLARE_RWSEM(kclist_lock);
 static int kcore_need_update = 1;
 
+/*
+ * Returns > 0 for RAM pages, 0 for non-RAM pages, < 0 on error
+ * Same as oldmem_pfn_is_ram in vmcore
+ */
+static int (*mem_pfn_is_ram)(unsigned long pfn);
+
+int register_mem_pfn_is_ram(int (*fn)(unsigned long pfn))
+{
+   if (mem_pfn_is_ram)
+   return -EBUSY;
+   mem_pfn_is_ram = fn;
+   return 0;
+}
+
+void unregister_mem_pfn_is_ram(void)
+{
+   mem_pfn_is_ram = NULL;
+}

[PATCH v3] x86, hyperv: fix kernel panic when kexec on HyperV

2019-03-06 Thread Kairui Song
After commit 68bb7bfb7985 ("X86/Hyper-V: Enable IPI enlightenments"),
kexec will fail with a kernel panic like this:

kexec_core: Starting new kernel
BUG: unable to handle kernel NULL pointer dereference at 
PGD 800057995067 P4D 800057995067 PUD 57990067 PMD 0
Oops: 0002 [#1] SMP PTI
CPU: 0 PID: 1016 Comm: kexec Not tainted 4.18.16-300.fc29.x86_64 #1
Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 
Hyper-V UEFI Release v3.0 03/02/2018
RIP: 0010:0xc901d000
Code: Bad RIP value.
RSP: 0018:c9000495bcf0 EFLAGS: 00010046
RAX:  RBX: c901d000 RCX: 00020015
RDX: 7f553000 RSI:  RDI: c9000495bd28
RBP: 0002 R08:  R09: 8238aaf8
R10: 8238aae0 R11:  R12: 88007f553008
R13: 0001 R14: 8800ff553000 R15: 
FS:  7ff5c0e67b80() GS:880078e0() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: c901cfd6 CR3: 4f678006 CR4: 003606f0
Call Trace:
 ? __send_ipi_mask+0x1c6/0x2d0
 ? hv_send_ipi_mask_allbutself+0x6d/0xb0
 ? mp_save_irq+0x70/0x70
 ? __ioapic_read_entry+0x32/0x50
 ? ioapic_read_entry+0x39/0x50
 ? clear_IO_APIC_pin+0xb8/0x110
 ? native_stop_other_cpus+0x6e/0x170
 ? native_machine_shutdown+0x22/0x40
 ? kernel_kexec+0x136/0x156
 ? __do_sys_reboot+0x1be/0x210
 ? kmem_cache_free+0x1b1/0x1e0
 ? __dentry_kill+0x10b/0x160
 ? _cond_resched+0x15/0x30
 ? dentry_kill+0x47/0x170
 ? dput.part.34+0xc6/0x100
 ? __fput+0x147/0x220
 ? _cond_resched+0x15/0x30
 ? task_work_run+0x38/0xa0
 ? do_syscall_64+0x5b/0x160
 ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
Modules linked in: ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack 
ebtable_nat ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 
ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_conntrack_ipv4 
nf_defrag_ipv4 nf_nat_ipv4 nf_nat iptable_mangle iptable_raw iptable_security 
nf_conntrack ip_set nfnetlink ebtable_filter ebtables ip6table_filter 
ip6_tables sunrpc vfat fat crct10dif_pclmul crc32_pclmul ghash_clmulni_intel 
intel_rapl_perf hv_balloon joydev xfs libcrc32c hv_storvsc serio_raw 
scsi_transport_fc hv_netvsc hyperv_keyboard hyperv_fb hid_hyperv crc32c_intel 
hv_vmbus

That's because hypercalls may now be used for sending IPIs. The
hypercall page is reset very early upon kexec reboot, but kexec
needs to send IPIs to stop CPUs, so it references this no longer
usable page and the kernel panics.

To fix it, simply reset hv_hypercall_pg to NULL before the page is
reset to avoid any misuse; IPI sending will fall back to the
non-hypercall-based method. This only happens on kexec / kdump, so
setting it to NULL should be good enough.

Fixes: 68bb7bfb7985 ("X86/Hyper-V: Enable IPI enlightenments")
Signed-off-by: Kairui Song 

---

Update from V2:
- The memory barrier is not needed, remove it.

Update from V1:
- Add comment for the wmb call.

 arch/x86/hyperv/hv_init.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
index 7abb09e2eeb8..d3f42b6bbdac 100644
--- a/arch/x86/hyperv/hv_init.c
+++ b/arch/x86/hyperv/hv_init.c
@@ -406,6 +406,13 @@ void hyperv_cleanup(void)
/* Reset our OS id */
wrmsrl(HV_X64_MSR_GUEST_OS_ID, 0);
 
+   /*
+* Reset hypercall page reference before reset the page,
+* let hypercall operations fail safely rather than
+* panic the kernel for using invalid hypercall page
+*/
+   hv_hypercall_pg = NULL;
+
/* Reset the hypercall page */
hypercall_msr.as_uint64 = 0;
wrmsrl(HV_X64_MSR_HYPERCALL, hypercall_msr.as_uint64);
-- 
2.20.1



Re: [PATCH v2] x86/gart/kcore: Exclude GART aperture from kcore

2019-03-06 Thread Kairui Song
On Tue, Feb 19, 2019 at 4:00 PM Kairui Song  wrote:
>
> On Thu, Jan 24, 2019 at 10:17 AM Baoquan He  wrote:
> >
> > On 01/23/19 at 10:50pm, Kairui Song wrote:
> > > > >  int fix_aperture __initdata = 1;
> > > > >
> > > > > -#ifdef CONFIG_PROC_VMCORE
> > > > > +#if defined(CONFIG_PROC_VMCORE) || defined(CONFIG_PROC_KCORE)
> > > > >  /*
> > > > >   * If the first kernel maps the aperture over e820 RAM, the kdump 
> > > > > kernel will
> > > > >   * use the same range because it will remain configured in the 
> > > > > northbridge.
> > > > > @@ -66,7 +67,7 @@ int fix_aperture __initdata = 1;
> > > > >   */
> > > > >  static unsigned long aperture_pfn_start, aperture_page_count;
> > > > >
> > > > > -static int gart_oldmem_pfn_is_ram(unsigned long pfn)
> > > > > +static int gart_mem_pfn_is_ram(unsigned long pfn)
> > > > >  {
> > > > >   return likely((pfn < aperture_pfn_start) ||
> > > > > (pfn >= aperture_pfn_start + 
> > > > > aperture_page_count));
> > > > > @@ -76,7 +77,12 @@ static void exclude_from_vmcore(u64 aper_base, u32 
> > > > > aper_order)
> > > >
> > > > Shouldn't this function name be changed? It's not only handling vmcore
> > > > stuff any more, but also kcore. And this function is not excluding, but
> > > > resgistering.
> > > >
> > > > Other than this, it looks good to me.
> > > >
> > > > Thanks
> > > > Baoquan
> > > >
> > >
> > > Good suggestion, it's good to change this function name too to avoid
> > > any misleading. This patch hasn't got any other reviews recently, I'll
> > > update it shortly.
> >
> > There's more.
> >
> > These two are doing the same thing:
> >   register_mem_pfn_is_ram
> >   register_oldmem_pfn_is_ram
> >
> > Need remove one of them and put it in a right place. Furthermore, may
> > need see if there's existing function which is used to register a
> > function to a hook.
> >
> > Secondly, exclude_from_vmcore() is not excluding anthing, it's only
> > registering a function which is used to judge if oldmem/pfn is ram. Need
> > rename it.
> >
> > Thanks
> > Baoquan
>

Hi Baoquan, on second thought, vmcore and kcore are doing similar
things but are still quite independent of each other; I didn't see any
simple way to share the logic.
And for the naming issue, I think in context there is no problem:
"exclude_from_vmcore(aper_alloc, aper_order)" is clearly doing what it
literally says, excluding the aperture from vmcore.

Let me know if anything is wrong; I will send a V4 later reusing this
approach.

--
Best Regards,
Kairui Song


Re: [PATCH v3] x86/gart/kcore: Exclude GART aperture from kcore

2019-03-06 Thread Kairui Song
On Fri, Mar 1, 2019 at 7:12 AM Jiri Bohac  wrote:
>
> On Wed, Feb 13, 2019 at 04:28:00PM +0800, Kairui Song wrote:
> > @@ -465,6 +472,12 @@ read_kcore(struct file *file, char __user *buffer, 
> > size_t buflen, loff_t *fpos)
> >   goto out;
> >   }
> >   m = NULL;   /* skip the list anchor */
> > + } else if (m->type == KCORE_NORAM) {
> > + /* for NORAM area just fill zero */
> > + if (clear_user(buffer, tsz)) {
> > + ret = -EFAULT;
> > + goto out;
> > + }
>
> I don't think this works reliably. The loop filling the buffer
> has this logic at the top:
>
> while (buflen) {
> /*
>  * If this is the first iteration or the address is not within
>  * the previous entry, search for a matching entry.
>  */
> if (!m || start < m->addr || start >= m->addr + m->size) {
> list_for_each_entry(m, &kclist_head, list) {
> if (start >= m->addr &&
> start < m->addr + m->size)
> break;
> }
> }
>
> This sets m to the kclist entry that contains the memory being
> read. But if we do a large read that starts in valid KCORE_RAM
> memory below the GART overlap and extends into the overlap, m
> will not be set to the KCORE_NORAM entry. It will keep pointing
> to the KCORE_RAM entry and the patch will have no effect.
>
> But this seems already broken in existing cases as well, various
> KCORE_* types overlap with KCORE_RAM, don't they?  So maybe
> bf991c2231117d50a7645792b514354fc8d19dae ("proc/kcore: optimize
> multiple page reads") broke this and once fixed, this KCORE_NORAM
> approach will work. Omar?
>

Thanks for the review! You are right: although I hid the NORAM region
from the ELF header, I didn't notice this potential risk of having an
overlapped region.
I didn't see other kcore regions overlap for now; if that holds, the
optimization should be fine.
Better to keep using a hook just like what we did in vmcore, or we will
have a performance drop for "fixing" this.
Will send V4 using the previous approach if there are no further comments.

-- 
Best Regards,
Kairui Song
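
To make the failure mode concrete, a hedged, condensed restatement of
the read loop quoted above (not the full read_kcore(); size handling
trimmed). m is only re-looked-up when start leaves the current entry,
so a single read that begins inside a KCORE_RAM entry and runs into an
overlapping KCORE_NORAM entry keeps copying from the RAM entry all the
way through:

while (buflen) {
    /* m persists across iterations and is only re-searched when
     * 'start' falls outside it; an overlapping entry later in the
     * list is never selected mid-read. */
    if (!m || start < m->addr || start >= m->addr + m->size) {
        list_for_each_entry(m, &kclist_head, list) {
            if (start >= m->addr && start < m->addr + m->size)
                break;  /* first match wins */
        }
    }
    /* ... copy tsz bytes according to m->type ... */
    start += tsz;
    buflen -= tsz;
}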


Re: [PATCH v2] x86, hyperv: fix kernel panic when kexec on HyperV

2019-03-05 Thread Kairui Song
On Tue, Mar 5, 2019 at 8:33 PM Peter Zijlstra  wrote:
>
> On Tue, Mar 05, 2019 at 08:17:03PM +0800, Kairui Song wrote:
> > diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
> > index 7abb09e2eeb8..34aa1e953dfc 100644
> > --- a/arch/x86/hyperv/hv_init.c
> > +++ b/arch/x86/hyperv/hv_init.c
> > @@ -406,6 +406,12 @@ void hyperv_cleanup(void)
> >   /* Reset our OS id */
> >   wrmsrl(HV_X64_MSR_GUEST_OS_ID, 0);
> >
> > + /* Cleanup hypercall page reference before reset the page */
> > + hv_hypercall_pg = NULL;
> > +
> > + /* Make sure page reference is cleared before wrmsr */
>
> This comment forgets to tell us who cares about this. And why the wrmsr
> itself isn't serializing enough.
>
> > + wmb();
> > +
> >   /* Reset the hypercall page */
> >   hypercall_msr.as_uint64 = 0;
> >   wrmsrl(HV_X64_MSR_HYPERCALL, hypercall_msr.as_uint64);
>
> That looks like a fake MSR; and you're telling me that VMEXIT doesn't
> serialize?

Thanks for the review, it seems I was being a bit paranoid about this.
Will drop it and send a v3 if no one has any other complaints.

--
Best Regards,
Kairui Song
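
The ordering conclusion from this thread, distilled into a hedged
two-line sketch: WRMSR is a serializing instruction (except for
TSC_DEADLINE and X2APIC MSRs), so the pointer store cannot be reordered
past the MSR write and no explicit wmb() is needed.

hv_hypercall_pg = NULL;             /* hypercalls now fail safely */
wrmsrl(HV_X64_MSR_HYPERCALL, 0);    /* serializing WRMSR: the store above
                                       cannot pass it */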


Re: [RFC PATCH] x86, hyperv: fix kernel panic when kexec on HyperV VM

2019-03-05 Thread Kairui Song
On Tue, Mar 5, 2019 at 8:28 PM Peter Zijlstra  wrote:
>
> On Wed, Feb 27, 2019 at 10:55:46PM +0800, Kairui Song wrote:
> > On Wed, Feb 27, 2019 at 8:02 PM Peter Zijlstra  wrote:
> > >
> > > On Tue, Feb 26, 2019 at 11:56:15PM +0800, Kairui Song wrote:
> > > >  arch/x86/hyperv/hv_init.c | 4 
> > > >  1 file changed, 4 insertions(+)
> > > >
> > > > diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
> > > > index 7abb09e2eeb8..92291c18d716 100644
> > > > --- a/arch/x86/hyperv/hv_init.c
> > > > +++ b/arch/x86/hyperv/hv_init.c
> > > > @@ -406,6 +406,10 @@ void hyperv_cleanup(void)
> > > >   /* Reset our OS id */
> > > >   wrmsrl(HV_X64_MSR_GUEST_OS_ID, 0);
> > > >
> > > > + /* Cleanup page reference before reset the page */
> > > > + hv_hypercall_pg = NULL;
> > > > + wmb();
> > >
> > > What do we need that SFENCE for? Any why does it lack a comment?
> >
> > Hi, that's for ensuring the hv_hypercall_pg is reset to NULL before
> > the following wrmsr call. The wrmsr call will make the pointer address
> > invalid.
>
> WRMSR is a serializing instruction (except for TSC_DEADLINE and the
> X2APIC).
>

Many thanks for the info, I wasn't aware of the exception condition. V2
is sent; will drop the barrier in V3 then.

-- 
Best Regards,
Kairui Song


[PATCH v2] x86, hyperv: fix kernel panic when kexec on HyperV

2019-03-05 Thread Kairui Song
After commit 68bb7bfb7985 ("X86/Hyper-V: Enable IPI enlightenments"),
kexec will fail with a kernel panic like this:

kexec_core: Starting new kernel
BUG: unable to handle kernel NULL pointer dereference at 
PGD 800057995067 P4D 800057995067 PUD 57990067 PMD 0
Oops: 0002 [#1] SMP PTI
CPU: 0 PID: 1016 Comm: kexec Not tainted 4.18.16-300.fc29.x86_64 #1
Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 
Hyper-V UEFI Release v3.0 03/02/2018
RIP: 0010:0xc901d000
Code: Bad RIP value.
RSP: 0018:c9000495bcf0 EFLAGS: 00010046
RAX:  RBX: c901d000 RCX: 00020015
RDX: 7f553000 RSI:  RDI: c9000495bd28
RBP: 0002 R08:  R09: 8238aaf8
R10: 8238aae0 R11:  R12: 88007f553008
R13: 0001 R14: 8800ff553000 R15: 
FS:  7ff5c0e67b80() GS:880078e0() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: c901cfd6 CR3: 4f678006 CR4: 003606f0
Call Trace:
 ? __send_ipi_mask+0x1c6/0x2d0
 ? hv_send_ipi_mask_allbutself+0x6d/0xb0
 ? mp_save_irq+0x70/0x70
 ? __ioapic_read_entry+0x32/0x50
 ? ioapic_read_entry+0x39/0x50
 ? clear_IO_APIC_pin+0xb8/0x110
 ? native_stop_other_cpus+0x6e/0x170
 ? native_machine_shutdown+0x22/0x40
 ? kernel_kexec+0x136/0x156
 ? __do_sys_reboot+0x1be/0x210
 ? kmem_cache_free+0x1b1/0x1e0
 ? __dentry_kill+0x10b/0x160
 ? _cond_resched+0x15/0x30
 ? dentry_kill+0x47/0x170
 ? dput.part.34+0xc6/0x100
 ? __fput+0x147/0x220
 ? _cond_resched+0x15/0x30
 ? task_work_run+0x38/0xa0
 ? do_syscall_64+0x5b/0x160
 ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
Modules linked in: ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack 
ebtable_nat ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 
ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_conntrack_ipv4 
nf_defrag_ipv4 nf_nat_ipv4 nf_nat iptable_mangle iptable_raw iptable_security 
nf_conntrack ip_set nfnetlink ebtable_filter ebtables ip6table_filter 
ip6_tables sunrpc vfat fat crct10dif_pclmul crc32_pclmul ghash_clmulni_intel 
intel_rapl_perf hv_balloon joydev xfs libcrc32c hv_storvsc serio_raw 
scsi_transport_fc hv_netvsc hyperv_keyboard hyperv_fb hid_hyperv crc32c_intel 
hv_vmbus

That's because hypercalls may now be used for sending IPIs. The
hypercall page is reset very early upon kexec reboot, but kexec
needs to send IPIs to stop CPUs, so it references this no longer
usable page and the kernel panics.

To fix it, simply reset hv_hypercall_pg to NULL before the page is
reset to avoid any misuse; IPI sending will fall back to the
non-hypercall-based method. This only happens on kexec / kdump, so
setting it to NULL should be good enough.

Fixes: 68bb7bfb7985 ("X86/Hyper-V: Enable IPI enlightenments")
Signed-off-by: Kairui Song 

---

Update from V1:
- Add comment for the barrier.

 arch/x86/hyperv/hv_init.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
index 7abb09e2eeb8..34aa1e953dfc 100644
--- a/arch/x86/hyperv/hv_init.c
+++ b/arch/x86/hyperv/hv_init.c
@@ -406,6 +406,12 @@ void hyperv_cleanup(void)
/* Reset our OS id */
wrmsrl(HV_X64_MSR_GUEST_OS_ID, 0);
 
+   /* Cleanup hypercall page reference before reset the page */
+   hv_hypercall_pg = NULL;
+
+   /* Make sure page reference is cleared before wrmsr */
+   wmb();
+
/* Reset the hypercall page */
hypercall_msr.as_uint64 = 0;
wrmsrl(HV_X64_MSR_HYPERCALL, hypercall_msr.as_uint64);
-- 
2.20.1



Re: [RFC PATCH] x86, hyperv: fix kernel panic when kexec on HyperV VM

2019-02-27 Thread Kairui Song
On Wed, Feb 27, 2019 at 8:02 PM Peter Zijlstra  wrote:
>
> On Tue, Feb 26, 2019 at 11:56:15PM +0800, Kairui Song wrote:
> >  arch/x86/hyperv/hv_init.c | 4 
> >  1 file changed, 4 insertions(+)
> >
> > diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
> > index 7abb09e2eeb8..92291c18d716 100644
> > --- a/arch/x86/hyperv/hv_init.c
> > +++ b/arch/x86/hyperv/hv_init.c
> > @@ -406,6 +406,10 @@ void hyperv_cleanup(void)
> >   /* Reset our OS id */
> >   wrmsrl(HV_X64_MSR_GUEST_OS_ID, 0);
> >
> > + /* Cleanup page reference before reset the page */
> > + hv_hypercall_pg = NULL;
> > + wmb();
>
> What do we need that SFENCE for? Any why does it lack a comment?

Hi, that's for ensuring the hv_hypercall_pg is reset to NULL before
the following wrmsr call. The wrmsr call will make the pointer address
invalid.
I can add a comment in V2 if this is OK.


--
Best Regards,
Kairui Song


[RFC PATCH] x86, hyperv: fix kernel panic when kexec on HyperV VM

2019-02-26 Thread Kairui Song
When hypercalls is used for sending IPIs, kexec will fail with a kernel
panic like this:

kexec_core: Starting new kernel
BUG: unable to handle kernel NULL pointer dereference at 
PGD 800057995067 P4D 800057995067 PUD 57990067 PMD 0
Oops: 0002 [#1] SMP PTI
CPU: 0 PID: 1016 Comm: kexec Not tainted 4.18.16-300.fc29.x86_64 #1
Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 
Hyper-V UEFI Release v3.0 03/02/2018
RIP: 0010:0xc901d000
Code: Bad RIP value.
RSP: 0018:c9000495bcf0 EFLAGS: 00010046
RAX:  RBX: c901d000 RCX: 00020015
RDX: 7f553000 RSI:  RDI: c9000495bd28
RBP: 0002 R08:  R09: 8238aaf8
R10: 8238aae0 R11:  R12: 88007f553008
R13: 0001 R14: 8800ff553000 R15: 
FS:  7ff5c0e67b80() GS:880078e0() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: c901cfd6 CR3: 4f678006 CR4: 003606f0
Call Trace:
 ? __send_ipi_mask+0x1c6/0x2d0
 ? hv_send_ipi_mask_allbutself+0x6d/0xb0
 ? mp_save_irq+0x70/0x70
 ? __ioapic_read_entry+0x32/0x50
 ? ioapic_read_entry+0x39/0x50
 ? clear_IO_APIC_pin+0xb8/0x110
 ? native_stop_other_cpus+0x6e/0x170
 ? native_machine_shutdown+0x22/0x40
 ? kernel_kexec+0x136/0x156
 ? __do_sys_reboot+0x1be/0x210
 ? kmem_cache_free+0x1b1/0x1e0
 ? __dentry_kill+0x10b/0x160
 ? _cond_resched+0x15/0x30
 ? dentry_kill+0x47/0x170
 ? dput.part.34+0xc6/0x100
 ? __fput+0x147/0x220
 ? _cond_resched+0x15/0x30
 ? task_work_run+0x38/0xa0
 ? do_syscall_64+0x5b/0x160
 ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
Modules linked in: ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack 
ebtable_nat ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 
ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_conntrack_ipv4 
nf_defrag_ipv4 nf_nat_ipv4 nf_nat iptable_mangle iptable_raw iptable_security 
nf_conntrack ip_set nfnetlink ebtable_filter ebtables ip6table_filter 
ip6_tables sunrpc vfat fat crct10dif_pclmul crc32_pclmul ghash_clmulni_intel 
intel_rapl_perf hv_balloon joydev xfs libcrc32c hv_storvsc serio_raw 
scsi_transport_fc hv_netvsc hyperv_keyboard hyperv_fb hid_hyperv crc32c_intel 
hv_vmbus

That's because HyperV's machine_ops.shutdown allows registering a hook
to be called upon shutdown, and hv_vmbus invalidates the hypercall page
using this hook. But hv_hypercall_pg still points to this invalid page,
so any hypercall-based operation will panic the kernel. And the kexec
process will send IPIs to stop CPUs.

Fix this by simply resetting hv_hypercall_pg to NULL when the page is
revoked, to avoid any misuse. IPI sending will fall back to the
non-hypercall-based method. This only happens on kexec / kdump, so
setting it to NULL should be good enough.

Fixes: 68bb7bfb7985 ("X86/Hyper-V: Enable IPI enlightenments")
Signed-off-by: Kairui Song 

---

I'm not sure about the details of what happened after the

wrmsrl(HV_X64_MSR_HYPERCALL, hypercall_msr.as_uint64);

But this fix should be valid; please let me know if I got anything
wrong, thanks.

 arch/x86/hyperv/hv_init.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
index 7abb09e2eeb8..92291c18d716 100644
--- a/arch/x86/hyperv/hv_init.c
+++ b/arch/x86/hyperv/hv_init.c
@@ -406,6 +406,10 @@ void hyperv_cleanup(void)
/* Reset our OS id */
wrmsrl(HV_X64_MSR_GUEST_OS_ID, 0);
 
+   /* Cleanup page reference before reset the page */
+   hv_hypercall_pg = NULL;
+   wmb();
+
/* Reset the hypercall page */
hypercall_msr.as_uint64 = 0;
wrmsrl(HV_X64_MSR_HYPERCALL, hypercall_msr.as_uint64);
-- 
2.20.1



Re: [PATCH v3] x86/gart/kcore: Exclude GART aperture from kcore

2019-02-24 Thread Kairui Song
On Wed, Feb 13, 2019 at 4:28 PM Kairui Song  wrote:
>
> On machines where the GART aperture is mapped over physical RAM,
> /proc/kcore contains the GART aperture range and reading it may lead
> to kernel panic.
>
> In 'commit 2a3e83c6f96c ("x86/gart: Exclude GART aperture from vmcore")',
> a workaround is applied for vmcore to let /proc/vmcore return zeroes
> when attempting to read the GART region, as vmcore has the same issue;
> and after 'commit 707d4eefbdb3 ("Revert "[PATCH] Insert GART region
> into resource map"")', userspace tools can't detect the GART region,
> so it has to be avoided from being read in the kernel.
>
> This patch applies a similar workaround for kcore. Let /proc/kcore
> return zeroes for GART aperture.
>
> Both vmcore and kcore maintain a memory mapping list. In the vmcore
> workaround we exclude the GART region by registering a hook that checks
> whether a PFN is valid before reading, because vmcore's memory mapping
> could be generated by userspace, which doesn't know about GART. But for
> kcore it is simpler to just alter the memory area list, as kcore's area
> list is always generated by the kernel at init.
>
> Kcore's memory area list is generated very late, so the overlapped area
> can't be excluded when GART is initialized. Instead, simply introduce a
> new area enum type KCORE_NORAM, register the GART aperture as
> KCORE_NORAM, and let kcore return zeroes for all KCORE_NORAM areas.
> This fixes the problem well with minor code changes.
>
> ---
> Update from V2:
> Instead of repeating the same hook infrastructure for kcore, introduce
> a new kcore area type to avoid reading from, and let kcore always bypass
> this kind of area.
>
> Update from V1:
> Fix a complie error when CONFIG_PROC_KCORE is not set
>
>  arch/x86/kernel/aperture_64.c | 14 ++
>  fs/proc/kcore.c   | 13 +
>  include/linux/kcore.h |  1 +
>  3 files changed, 28 insertions(+)
>
> diff --git a/arch/x86/kernel/aperture_64.c b/arch/x86/kernel/aperture_64.c
> index 58176b56354e..5fb04bdd3221 100644
> --- a/arch/x86/kernel/aperture_64.c
> +++ b/arch/x86/kernel/aperture_64.c
> @@ -31,6 +31,7 @@
>  #include <asm/amd_nb.h>
>  #include <asm/x86_init.h>
>  #include <linux/crash_dump.h>
> +#include <linux/kcore.h>
>
>  /*
>   * Using 512M as goal, in case kexec will load kernel_big
> @@ -84,6 +85,17 @@ static void exclude_from_vmcore(u64 aper_base, u32 
> aper_order)
>  }
>  #endif
>
> +#ifdef CONFIG_PROC_KCORE
> +static struct kcore_list kcore_gart;
> +
> +static void __init exclude_from_kcore(u64 aper_base, u32 aper_order) {
> +   u32 aper_size = (32 * 1024 * 1024) << aper_order;
> +   kclist_add(&kcore_gart, __va(aper_base), aper_size, KCORE_NORAM);
> +}
> +#else
> +static inline void __init exclude_from_kcore(u64 aper_base, u32 aper_order) 
> { }
> +#endif
> +
>  /* This code runs before the PCI subsystem is initialized, so just
> access the northbridge directly. */
>
> @@ -475,6 +487,7 @@ int __init gart_iommu_hole_init(void)
>  * and fixed up the northbridge
>  */
> exclude_from_vmcore(last_aper_base, last_aper_order);
> +   exclude_from_kcore(last_aper_base, last_aper_order);
>
> return 1;
> }
> @@ -521,6 +534,7 @@ int __init gart_iommu_hole_init(void)
>  * range through vmcore even though it should be part of the dump.
>  */
> exclude_from_vmcore(aper_alloc, aper_order);
> +   exclude_from_kcore(aper_alloc, aper_order);
>
> /* Fix up the north bridges */
> for (i = 0; i < amd_nb_bus_dev_ranges[i].dev_limit; i++) {
> diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c
> index bbcc185062bb..15e0d74d7c56 100644
> --- a/fs/proc/kcore.c
> +++ b/fs/proc/kcore.c
> @@ -75,6 +75,8 @@ static size_t get_kcore_size(int *nphdr, size_t *phdrs_len, 
> size_t *notes_len,
> size = 0;
>
> list_for_each_entry(m, &kclist_head, list) {
> +   if (m->type == KCORE_NORAM)
> +   continue;
> try = kc_vaddr_to_offset((size_t)m->addr + m->size);
> if (try > size)
> size = try;
> @@ -256,6 +258,9 @@ static int kcore_update_ram(void)
> list_for_each_entry_safe(pos, tmp, &kclist_head, list) {
> if (pos->type == KCORE_RAM || pos->type == KCORE_VMEMMAP)
> list_move(&pos->list, &garbage);
> +   /* Move NORAM area to head of the new list */
> +   if (pos->type == KCORE_NORAM)
> +   list_move(&pos->list, &list);
> }
> list_splice_tail(&list, &kclist_head);

Re: [PATCH v2] x86/gart/kcore: Exclude GART aperture from kcore

2019-02-19 Thread Kairui Song
On Thu, Jan 24, 2019 at 10:17 AM Baoquan He  wrote:
>
> On 01/23/19 at 10:50pm, Kairui Song wrote:
> > > >  int fix_aperture __initdata = 1;
> > > >
> > > > -#ifdef CONFIG_PROC_VMCORE
> > > > +#if defined(CONFIG_PROC_VMCORE) || defined(CONFIG_PROC_KCORE)
> > > >  /*
> > > >   * If the first kernel maps the aperture over e820 RAM, the kdump 
> > > > kernel will
> > > >   * use the same range because it will remain configured in the 
> > > > northbridge.
> > > > @@ -66,7 +67,7 @@ int fix_aperture __initdata = 1;
> > > >   */
> > > >  static unsigned long aperture_pfn_start, aperture_page_count;
> > > >
> > > > -static int gart_oldmem_pfn_is_ram(unsigned long pfn)
> > > > +static int gart_mem_pfn_is_ram(unsigned long pfn)
> > > >  {
> > > >   return likely((pfn < aperture_pfn_start) ||
> > > > (pfn >= aperture_pfn_start + aperture_page_count));
> > > > @@ -76,7 +77,12 @@ static void exclude_from_vmcore(u64 aper_base, u32 
> > > > aper_order)
> > >
> > > Shouldn't this function name be changed? It's not only handling vmcore
> > > stuff any more, but also kcore. And this function is not excluding, but
> > > resgistering.
> > >
> > > Other than this, it looks good to me.
> > >
> > > Thanks
> > > Baoquan
> > >
> >
> > Good suggestion, it's good to change this function name too to avoid
> > any misleading. This patch hasn't got any other reviews recently, I'll
> > update it shortly.
>
> There's more.
>
> These two are doing the same thing:
>   register_mem_pfn_is_ram
>   register_oldmem_pfn_is_ram
>
> Need remove one of them and put it in a right place. Furthermore, may
> need see if there's existing function which is used to register a
> function to a hook.
>
> Secondly, exclude_from_vmcore() is not excluding anthing, it's only
> registering a function which is used to judge if oldmem/pfn is ram. Need
> rename it.
>
> Thanks
> Baoquan

Thanks a lot for the review! I've sent V3, using a different approach.
It's true that repeating the hook infrastructure causes duplication,
but I see vmcore/kcore don't share much code, so instead of sharing a
common hook infrastructure / registration entry, I used a new kcore
memory mapping list enum type to fix it; it also introduces less code.
Please have a look at V3 and let me know what you think, thanks!


--
Best Regards,
Kairui Song


[PATCH v3] x86/gart/kcore: Exclude GART aperture from kcore

2019-02-13 Thread Kairui Song
On machines where the GART aperture is mapped over physical RAM,
/proc/kcore contains the GART aperture range and reading it may lead
to kernel panic.

In 'commit 2a3e83c6f96c ("x86/gart: Exclude GART aperture from vmcore")',
a workaround is applied for vmcore to let /proc/vmcore return zeroes
when attempting to read the GART region, as vmcore has the same issue;
and after 'commit 707d4eefbdb3 ("Revert "[PATCH] Insert GART region
into resource map"")', userspace tools can't detect the GART region,
so it has to be avoided from being read in the kernel.

This patch applies a similar workaround for kcore. Let /proc/kcore
return zeroes for GART aperture.

Both vmcore and kcore maintain a memory mapping list. In the vmcore
workaround we exclude the GART region by registering a hook that checks
whether a PFN is valid before reading, because vmcore's memory mapping
could be generated by userspace, which doesn't know about GART. But for
kcore it is simpler to just alter the memory area list, as kcore's area
list is always generated by the kernel at init.

Kcore's memory area list is generated very late, so the overlapped area
can't be excluded when GART is initialized. Instead, simply introduce a
new area enum type KCORE_NORAM, register the GART aperture as
KCORE_NORAM, and let kcore return zeroes for all KCORE_NORAM areas.
This fixes the problem well with minor code changes.

---
Update from V2:
Instead of repeating the same hook infrastructure for kcore, introduce
a new kcore area type to avoid reading from, and let kcore always bypass
this kind of area.

Update from V1:
Fix a complie error when CONFIG_PROC_KCORE is not set

 arch/x86/kernel/aperture_64.c | 14 ++
 fs/proc/kcore.c   | 13 +
 include/linux/kcore.h |  1 +
 3 files changed, 28 insertions(+)

diff --git a/arch/x86/kernel/aperture_64.c b/arch/x86/kernel/aperture_64.c
index 58176b56354e..5fb04bdd3221 100644
--- a/arch/x86/kernel/aperture_64.c
+++ b/arch/x86/kernel/aperture_64.c
@@ -31,6 +31,7 @@
 #include <asm/amd_nb.h>
 #include <asm/x86_init.h>
 #include <linux/crash_dump.h>
+#include <linux/kcore.h>
 
 /*
  * Using 512M as goal, in case kexec will load kernel_big
@@ -84,6 +85,17 @@ static void exclude_from_vmcore(u64 aper_base, u32 
aper_order)
 }
 #endif
 
+#ifdef CONFIG_PROC_KCORE
+static struct kcore_list kcore_gart;
+
+static void __init exclude_from_kcore(u64 aper_base, u32 aper_order) {
+   u32 aper_size = (32 * 1024 * 1024) << aper_order;
+   kclist_add(&kcore_gart, __va(aper_base), aper_size, KCORE_NORAM);
+}
+#else
+static inline void __init exclude_from_kcore(u64 aper_base, u32 aper_order) { }
+#endif
+
 /* This code runs before the PCI subsystem is initialized, so just
access the northbridge directly. */
 
@@ -475,6 +487,7 @@ int __init gart_iommu_hole_init(void)
 * and fixed up the northbridge
 */
exclude_from_vmcore(last_aper_base, last_aper_order);
+   exclude_from_kcore(last_aper_base, last_aper_order);
 
return 1;
}
@@ -521,6 +534,7 @@ int __init gart_iommu_hole_init(void)
 * range through vmcore even though it should be part of the dump.
 */
exclude_from_vmcore(aper_alloc, aper_order);
+   exclude_from_kcore(aper_alloc, aper_order);
 
/* Fix up the north bridges */
for (i = 0; i < amd_nb_bus_dev_ranges[i].dev_limit; i++) {
diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c
index bbcc185062bb..15e0d74d7c56 100644
--- a/fs/proc/kcore.c
+++ b/fs/proc/kcore.c
@@ -75,6 +75,8 @@ static size_t get_kcore_size(int *nphdr, size_t *phdrs_len, 
size_t *notes_len,
size = 0;
 
list_for_each_entry(m, &kclist_head, list) {
+   if (m->type == KCORE_NORAM)
+   continue;
try = kc_vaddr_to_offset((size_t)m->addr + m->size);
if (try > size)
size = try;
@@ -256,6 +258,9 @@ static int kcore_update_ram(void)
list_for_each_entry_safe(pos, tmp, &kclist_head, list) {
if (pos->type == KCORE_RAM || pos->type == KCORE_VMEMMAP)
list_move(&pos->list, &garbage);
+   /* Move NORAM area to head of the new list */
+   if (pos->type == KCORE_NORAM)
+   list_move(&pos->list, &list);
}
list_splice_tail(&list, &kclist_head);
 
@@ -356,6 +361,8 @@ read_kcore(struct file *file, char __user *buffer, size_t 
buflen, loff_t *fpos)
 
phdr = &phdrs[1];
list_for_each_entry(m, &kclist_head, list) {
+   if (m->type == KCORE_NORAM)
+   continue;
phdr->p_type = PT_LOAD;
phdr->p_flags = PF_R | PF_W | PF_X;
phdr->p_offset = kc_vaddr_to_offset(m->addr) + 
data_offset;
@@ -465,6 +472,12 @@ read_kcore(struct file *file, char __user *buffer, size_t 
buflen, loff_t *fpos)
goto out;
}
m = NULL;   /* skip the list anchor */
+   } else if (m->type == KCORE_NORAM) {
+   /* for NORAM area just fill zero */
+   if (clear_user(buffer, tsz)) {
+   ret = -EFAULT;
+   goto out;
+   }
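
For the aperture math used by exclude_from_kcore() above, a quick worked
example (values illustrative; PAGE_SHIFT is 12 for 4K pages):

u32 aper_order = 1;                                  /* from the northbridge */
u32 aper_size  = (32 * 1024 * 1024) << aper_order;   /* 32MB << 1 = 64MB     */
unsigned long pages = aper_size >> PAGE_SHIFT;       /* 64MB / 4K = 16384    */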

[tip:x86/boot] x86/kexec: Fill in acpi_rsdp_addr from the first kernel

2019-02-06 Thread tip-bot for Kairui Song
Commit-ID:  ccec81e4251f5a5421e02874e394338a897056ca
Gitweb: https://git.kernel.org/tip/ccec81e4251f5a5421e02874e394338a897056ca
Author: Kairui Song 
AuthorDate: Tue, 5 Feb 2019 01:38:52 +0800
Committer:  Borislav Petkov 
CommitDate: Wed, 6 Feb 2019 15:29:03 +0100

x86/kexec: Fill in acpi_rsdp_addr from the first kernel

When efi=noruntime or efi=oldmap is used on the kernel command line, EFI
services won't be available in the second kernel, therefore the second
kernel will not be able to get the ACPI RSDP address from firmware by
calling EFI services and so it won't boot.

Commit

  e6e094e053af ("x86/acpi, x86/boot: Take RSDP address from boot params if 
available")

added an acpi_rsdp_addr field to boot_params which stores the RSDP
address for other kernel users.

Recently, after

  3a63f70bf4c3 ("x86/boot: Early parse RSDP and save it in boot_params")

the acpi_rsdp_addr will always be filled with a valid RSDP address.

So fill in that value into the second kernel's boot_params thus ensuring
that the second kernel receives the RSDP value from the first kernel.

 [ bp: massage commit message. ]

Signed-off-by: Kairui Song 
Signed-off-by: Borislav Petkov 
Cc: AKASHI Takahiro 
Cc: Andrew Morton 
Cc: Baoquan He 
Cc: Chao Fan 
Cc: Dave Young 
Cc: David Howells 
Cc: "H. Peter Anvin" 
Cc: Ingo Molnar 
Cc: ke...@lists.infradead.org
Cc: Philipp Rudo 
Cc: Thomas Gleixner 
Cc: x86-ml 
Cc: Yannik Sembritzki 
Link: https://lkml.kernel.org/r/20190204173852.4863-1-kas...@redhat.com
---
 arch/x86/kernel/kexec-bzimage64.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/kexec-bzimage64.c 
b/arch/x86/kernel/kexec-bzimage64.c
index 0d5efa34f359..2a0ff871025a 100644
--- a/arch/x86/kernel/kexec-bzimage64.c
+++ b/arch/x86/kernel/kexec-bzimage64.c
@@ -215,6 +215,9 @@ setup_boot_parameters(struct kimage *image, struct 
boot_params *params,
params->screen_info.ext_mem_k = 0;
params->alt_mem_k = 0;
 
+   /* Always fill in RSDP: it is either 0 or a valid value */
+   params->acpi_rsdp_addr = boot_params.acpi_rsdp_addr;
+
/* Default APM info */
memset(&params->apm_bios_info, 0, sizeof(params->apm_bios_info));
 
@@ -253,7 +256,6 @@ setup_boot_parameters(struct kimage *image, struct 
boot_params *params,
setup_efi_state(params, params_load_addr, efi_map_offset, efi_map_sz,
efi_setup_data_offset);
 #endif
-
/* Setup EDD info */
memcpy(params->eddbuf, boot_params.eddbuf,
EDDMAXNR * sizeof(struct edd_info));

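A hedged sketch of the consumer side in the second kernel: early ACPI
setup prefers the boot_params value when it is non-zero and only falls
back to EFI/BIOS scanning otherwise. The function name is illustrative,
simplified from the x86 early-boot RSDP logic.

static unsigned long get_rsdp_addr_sketch(void)
{
    if (boot_params.acpi_rsdp_addr)     /* filled by the first kernel */
        return boot_params.acpi_rsdp_addr;

    return 0;   /* caller falls back to EFI tables / BIOS scan */
}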

[PATCH] x86, kexec_file_load: fill in acpi_rsdp_addr boot param unconditionally

2019-02-04 Thread Kairui Song
When efi=noruntime or efi=oldmap is used, EFI services won't be available
in the second kernel, therefore the second kernel will not be able to get
the ACPI RSDP address from firmware by calling EFI services so it won't
boot. Previously we were expecting the user to set acpi_rsdp= on the
kernel command line for the second kernel, as there was no other way to
pass the RSDP address to the second kernel.

After commit e6e094e053af ("x86/acpi, x86/boot: Take RSDP address from
boot params if available"), now it's possible to set an acpi_rsdp_addr
parameter in the boot_params passed to second kernel, and kernel will
prefer using this value for the RSDP address when it's set.

And with commit 3a63f70bf4c3 ("x86/boot: Early parse RSDP and save it in
boot_params"), now the acpi_rsdp_addr will always be filled with valid
RSDP address. So we just fill in that value for second kernel's
boot_params unconditionally, this ensure second kernel always use the
same RSDP value as the first kernel.

Tested with an EFI enabled KVM VM with efi=noruntime.

Signed-off-by: Kairui Song 
---

This is an update of part of the patch series: "[PATCH v3 0/3] make kexec
work with efi=noruntime or efi=old_map."

But "[PATCH v3 1/3] x86, kexec_file_load: Don't setup EFI info if EFI
runtime is not enabled" is already in [tip:x86/urgent], and with Chao's
commit 3a63f70bf4c3 in [tip:x86/boot], we can just fill in the
acpi_rsdp_addr boot param unconditionally to fix the problem, so I only
update and resend this patch.

 arch/x86/kernel/kexec-bzimage64.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/kexec-bzimage64.c 
b/arch/x86/kernel/kexec-bzimage64.c
index 53917a3ebf94..3611946dc7ea 100644
--- a/arch/x86/kernel/kexec-bzimage64.c
+++ b/arch/x86/kernel/kexec-bzimage64.c
@@ -218,6 +218,9 @@ setup_boot_parameters(struct kimage *image, struct 
boot_params *params,
params->screen_info.ext_mem_k = 0;
params->alt_mem_k = 0;
 
+   /* Always fill in RSDP, it's either 0 or a valid value */
+   params->acpi_rsdp_addr = boot_params.acpi_rsdp_addr;
+
/* Default APM info */
memset(&params->apm_bios_info, 0, sizeof(params->apm_bios_info));
 
@@ -256,7 +259,6 @@ setup_boot_parameters(struct kimage *image, struct 
boot_params *params,
setup_efi_state(params, params_load_addr, efi_map_offset, efi_map_sz,
efi_setup_data_offset);
 #endif
-
/* Setup EDD info */
memcpy(params->eddbuf, boot_params.eddbuf,
EDDMAXNR * sizeof(struct edd_info));
-- 
2.20.1



[PATCH] integrity, KEYS: Fix build break with set_platform_trusted_keys

2019-02-03 Thread Kairui Song
Commit 15ebb2eb0705 ("integrity, KEYS: add a reference to platform
keyring") introduced a function set_platform_trusted_keys
and calls the function in __integrity_init_keyring.

It only checks if CONFIG_INTEGRITY_PLATFORM_KEYRING is enabled when
enabling this function, but actually this function also depends on
CONFIG_SYSTEM_TRUSTED_KEYRING.

So when built with CONFIG_INTEGRITY_PLATFORM_KEYRING &&
!CONFIG_SYSTEM_TRUSTED_KEYRING, we will get the following error:

digsig.c:92: undefined reference to `set_platform_trusted_keys'

And it also mistakenly wrapped the function code in the ifdef block of
CONFIG_SYSTEM_DATA_VERIFICATION.

This commit fixes the issue by adding the missing check of
CONFIG_SYSTEM_TRUSTED_KEYRING and moving the function code out of
CONFIG_SYSTEM_DATA_VERIFICATION's ifdef block.

Fixes: 15ebb2eb0705 ("integrity, KEYS: add a reference to platform keyring")
Signed-off-by: Kairui Song 
---
 certs/system_keyring.c| 4 ++--
 include/keys/system_keyring.h | 9 +++--
 2 files changed, 5 insertions(+), 8 deletions(-)

diff --git a/certs/system_keyring.c b/certs/system_keyring.c
index 19bd0504bbcb..c05c29ae4d5d 100644
--- a/certs/system_keyring.c
+++ b/certs/system_keyring.c
@@ -279,11 +279,11 @@ int verify_pkcs7_signature(const void *data, size_t len,
 }
 EXPORT_SYMBOL_GPL(verify_pkcs7_signature);
 
+#endif /* CONFIG_SYSTEM_DATA_VERIFICATION */
+
 #ifdef CONFIG_INTEGRITY_PLATFORM_KEYRING
 void __init set_platform_trusted_keys(struct key *keyring)
 {
platform_trusted_keys = keyring;
 }
 #endif
-
-#endif /* CONFIG_SYSTEM_DATA_VERIFICATION */
diff --git a/include/keys/system_keyring.h b/include/keys/system_keyring.h
index c7f899ee974e..42a93eda331c 100644
--- a/include/keys/system_keyring.h
+++ b/include/keys/system_keyring.h
@@ -61,16 +61,13 @@ static inline struct key *get_ima_blacklist_keyring(void)
 }
 #endif /* CONFIG_IMA_BLACKLIST_KEYRING */
 
-#ifdef CONFIG_INTEGRITY_PLATFORM_KEYRING
-
+#if defined(CONFIG_INTEGRITY_PLATFORM_KEYRING) && \
+   defined(CONFIG_SYSTEM_TRUSTED_KEYRING)
 extern void __init set_platform_trusted_keys(struct key *keyring);
-
 #else
-
 static inline void set_platform_trusted_keys(struct key *keyring)
 {
 }
-
-#endif /* CONFIG_INTEGRITY_PLATFORM_KEYRING */
+#endif
 
 #endif /* _KEYS_SYSTEM_KEYRING_H */
-- 
2.20.1



[tip:x86/urgent] x86/kexec: Don't setup EFI info if EFI runtime is not enabled

2019-02-01 Thread tip-bot for Kairui Song
Commit-ID:  2aa958c99c7fd3162b089a1a56a34a0cdb778de1
Gitweb: https://git.kernel.org/tip/2aa958c99c7fd3162b089a1a56a34a0cdb778de1
Author: Kairui Song 
AuthorDate: Fri, 18 Jan 2019 19:13:08 +0800
Committer:  Borislav Petkov 
CommitDate: Fri, 1 Feb 2019 18:18:54 +0100

x86/kexec: Don't setup EFI info if EFI runtime is not enabled

Kexec-ing a kernel with "efi=noruntime" on the first kernel's command
line causes the following null pointer dereference:

  BUG: unable to handle kernel NULL pointer dereference at 
  #PF error: [normal kernel read fault]
  Call Trace:
   efi_runtime_map_copy+0x28/0x30
   bzImage64_load+0x688/0x872
   arch_kexec_kernel_image_load+0x6d/0x70
   kimage_file_alloc_init+0x13e/0x220
   __x64_sys_kexec_file_load+0x144/0x290
   do_syscall_64+0x55/0x1a0
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

Just skip the EFI info setup if EFI runtime services are not enabled.

 [ bp: Massage commit message. ]

Suggested-by: Dave Young 
Signed-off-by: Kairui Song 
Signed-off-by: Borislav Petkov 
Acked-by: Dave Young 
Cc: AKASHI Takahiro 
Cc: Andrew Morton 
Cc: Ard Biesheuvel 
Cc: b...@redhat.com
Cc: David Howells 
Cc: erik.schma...@intel.com
Cc: fanc.f...@cn.fujitsu.com
Cc: "H. Peter Anvin" 
Cc: Ingo Molnar 
Cc: ke...@lists.infradead.org
Cc: l...@kernel.org
Cc: linux-a...@vger.kernel.org
Cc: Philipp Rudo 
Cc: rafael.j.wyso...@intel.com
Cc: robert.mo...@intel.com
Cc: Thomas Gleixner 
Cc: x86-ml 
Cc: Yannik Sembritzki 
Link: https://lkml.kernel.org/r/20190118111310.29589-2-kas...@redhat.com
---
 arch/x86/kernel/kexec-bzimage64.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/x86/kernel/kexec-bzimage64.c b/arch/x86/kernel/kexec-bzimage64.c
index 0d5efa34f359..53917a3ebf94 100644
--- a/arch/x86/kernel/kexec-bzimage64.c
+++ b/arch/x86/kernel/kexec-bzimage64.c
@@ -167,6 +167,9 @@ setup_efi_state(struct boot_params *params, unsigned long params_load_addr,
struct efi_info *current_ei = &boot_params.efi_info;
struct efi_info *ei = &params->efi_info;
 
+   if (!efi_enabled(EFI_RUNTIME_SERVICES))
+   return 0;
+
if (!current_ei->efi_memmap_size)
return 0;
 


Re: [PATCH v2] x86/gart/kcore: Exclude GART aperture from kcore

2019-01-23 Thread Kairui Song
On Wed, Jan 23, 2019 at 10:14 PM Baoquan He  wrote:
>
> On 01/02/19 at 06:54pm, Kairui Song wrote:
> > diff --git a/arch/x86/kernel/aperture_64.c b/arch/x86/kernel/aperture_64.c
> > index 58176b56354e..c8a56f083419 100644
> > --- a/arch/x86/kernel/aperture_64.c
> > +++ b/arch/x86/kernel/aperture_64.c
> > @@ -14,6 +14,7 @@
> >  #define pr_fmt(fmt) "AGP: " fmt
> >
> >  #include 
> > +#include 
> >  #include 
> >  #include 
> >  #include 
> > @@ -57,7 +58,7 @@ int fallback_aper_force __initdata;
> >
> >  int fix_aperture __initdata = 1;
> >
> > -#ifdef CONFIG_PROC_VMCORE
> > +#if defined(CONFIG_PROC_VMCORE) || defined(CONFIG_PROC_KCORE)
> >  /*
> >   * If the first kernel maps the aperture over e820 RAM, the kdump kernel 
> > will
> >   * use the same range because it will remain configured in the northbridge.
> > @@ -66,7 +67,7 @@ int fix_aperture __initdata = 1;
> >   */
> >  static unsigned long aperture_pfn_start, aperture_page_count;
> >
> > -static int gart_oldmem_pfn_is_ram(unsigned long pfn)
> > +static int gart_mem_pfn_is_ram(unsigned long pfn)
> >  {
> >   return likely((pfn < aperture_pfn_start) ||
> > (pfn >= aperture_pfn_start + aperture_page_count));
> > @@ -76,7 +77,12 @@ static void exclude_from_vmcore(u64 aper_base, u32 
> > aper_order)
>
> Shouldn't this function name be changed? It's not only handling vmcore
> stuff any more, but also kcore. And this function is not excluding, but
> registering.
>
> Other than this, it looks good to me.
>
> Thanks
> Baoquan
>

Good suggestion, it's worth changing this function name too to avoid
any confusion. This patch hasn't received any other reviews recently;
I'll update it shortly.

-- 
Best Regards,
Kairui Song


Re: [PATCH v5 0/2] let kexec_file_load use platform keyring to verify the kernel image

2019-01-21 Thread Kairui Song
On Mon, Jan 21, 2019 at 6:00 PM Kairui Song  wrote:
>
> This patch series adds a .platform_trusted_keys in system_keyring as a
> reference to the .platform keyring in the integrity subsystem; when the
> platform keyring is initialized the reference is updated, so it becomes
> accessible for verifying PE signed kernel images.
>
> This patch series lets kexec_file_load fall back to the platform keyring
> if it fails to verify the image against the secondary keyring, so the
> actual PE signature verification process can use keys provided by
> firmware.
>
> After this patch kexec_file_load will be able to verify a signed PE
> bzImage using keys in the platform keyring.
>
> Tested in a VM with a kernel locally signed with pesign, with the cert
> imported into EFI's MokList variable.
>
> To test this patch series on the latest kernel, you need to ensure this
> commit is applied, as there is a regression in sanity_check_segment_list():
>
> https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/commit/?id=993a110319a4a60aadbd02f6defdebe048f7773b
>
> Update from V4:
>   - Drop ifdef in security/integrity/digsig.c to make code clearer
>   - Fix a potential issue, set_platform_trusted_keys should not be
> called when keyring initialization failed
>
> Update from V3:
>   - Tweak and simplify commit message as suggested by Mimi Zohar
>
> Update from V2:
>   - Use IS_ENABLED in kexec_file_load to judge if platform_trusted_keys
> should be used for verifying image as suggested by Mimi Zohar
>
> Update from V1:
>   - Make platform_trusted_keys static, and update commit message as suggested
> by Mimi Zohar
>   - Always check if the platform keyring is initialized before using it
>
> Kairui Song (2):
>   integrity, KEYS: add a reference to platform keyring
>   kexec, KEYS: Make use of platform keyring for signature verify
>
>  arch/x86/kernel/kexec-bzimage64.c | 13 ++---
>  certs/system_keyring.c| 22 +-
>  include/keys/system_keyring.h |  9 +
>  include/linux/verification.h  |  1 +
>  security/integrity/digsig.c   |  3 +++
>  5 files changed, 44 insertions(+), 4 deletions(-)
>
> --
> 2.20.1
>

Hi Mimi,

I've updated the patch series again; as the code changed a bit I didn't
include the previous Reviewed-by / Tested-by tags. It worked with no
problem; could you help review it again? Thank you.

-- 
Best Regards,
Kairui Song


[PATCH v5 0/2] let kexec_file_load use platform keyring to verify the kernel image

2019-01-21 Thread Kairui Song
This patch series adds a .platform_trusted_keys in system_keyring as a
reference to the .platform keyring in the integrity subsystem; when the
platform keyring is initialized the reference is updated, so it becomes
accessible for verifying PE signed kernel images.

This patch series lets kexec_file_load fall back to the platform keyring
if it fails to verify the image against the secondary keyring, so the
actual PE signature verification process can use keys provided by
firmware.

After this patch kexec_file_load will be able to verify a signed PE
bzImage using keys in the platform keyring.

Tested in a VM with a kernel locally signed with pesign, with the cert
imported into EFI's MokList variable.

To test this patch series on the latest kernel, you need to ensure this
commit is applied, as there is a regression in sanity_check_segment_list():

https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/commit/?id=993a110319a4a60aadbd02f6defdebe048f7773b

Update from V4:
  - Drop ifdef in security/integrity/digsig.c to make code clearer
  - Fix a potential issue, set_platform_trusted_keys should not be
called when keyring initialization failed

Update from V3:
  - Tweak and simplify commit message as suggested by Mimi Zohar

Update from V2:
  - Use IS_ENABLED in kexec_file_load to judge if platform_trusted_keys
should be used for verifying image as suggested by Mimi Zohar

Update from V1:
  - Make platform_trusted_keys static, and update commit message as suggested
by Mimi Zohar
  - Always check if the platform keyring is initialized before using it

Kairui Song (2):
  integrity, KEYS: add a reference to platform keyring
  kexec, KEYS: Make use of platform keyring for signature verify

 arch/x86/kernel/kexec-bzimage64.c | 13 ++---
 certs/system_keyring.c| 22 +-
 include/keys/system_keyring.h |  9 +
 include/linux/verification.h  |  1 +
 security/integrity/digsig.c   |  3 +++
 5 files changed, 44 insertions(+), 4 deletions(-)

-- 
2.20.1



[PATCH v5 1/2] integrity, KEYS: add a reference to platform keyring

2019-01-21 Thread Kairui Song
commit 9dc92c45177a ('integrity: Define a trusted platform keyring')
introduced a .platform keyring for storing preboot keys, used for
verifying kernel images' signatures. Currently only IMA-appraisal is able
to use the keyring to verify kernel images that have their signature
stored in xattr.

This patch exposes the .platform keyring, making it
accessible for verifying PE signed kernel images as well.

Suggested-by: Mimi Zohar 
Signed-off-by: Kairui Song 
---
 certs/system_keyring.c| 9 +
 include/keys/system_keyring.h | 9 +
 security/integrity/digsig.c   | 3 +++
 3 files changed, 21 insertions(+)

diff --git a/certs/system_keyring.c b/certs/system_keyring.c
index 81728717523d..4690ef9cda8a 100644
--- a/certs/system_keyring.c
+++ b/certs/system_keyring.c
@@ -24,6 +24,9 @@ static struct key *builtin_trusted_keys;
 #ifdef CONFIG_SECONDARY_TRUSTED_KEYRING
 static struct key *secondary_trusted_keys;
 #endif
+#ifdef CONFIG_INTEGRITY_PLATFORM_KEYRING
+static struct key *platform_trusted_keys;
+#endif
 
 extern __initconst const u8 system_certificate_list[];
 extern __initconst const unsigned long system_certificate_list_size;
@@ -265,4 +268,10 @@ int verify_pkcs7_signature(const void *data, size_t len,
 }
 EXPORT_SYMBOL_GPL(verify_pkcs7_signature);
 
+#ifdef CONFIG_INTEGRITY_PLATFORM_KEYRING
+void __init set_platform_trusted_keys(struct key *keyring) {
+   platform_trusted_keys = keyring;
+}
+#endif
+
 #endif /* CONFIG_SYSTEM_DATA_VERIFICATION */
diff --git a/include/keys/system_keyring.h b/include/keys/system_keyring.h
index 359c2f936004..df766ef8f03c 100644
--- a/include/keys/system_keyring.h
+++ b/include/keys/system_keyring.h
@@ -61,5 +61,14 @@ static inline struct key *get_ima_blacklist_keyring(void)
 }
 #endif /* CONFIG_IMA_BLACKLIST_KEYRING */
 
+#ifdef CONFIG_INTEGRITY_PLATFORM_KEYRING
+
+extern void __init set_platform_trusted_keys(struct key* keyring);
+
+#else
+
+static inline void set_platform_trusted_keys(struct key* keyring) { };
+
+#endif /* CONFIG_INTEGRITY_PLATFORM_KEYRING */
 
 #endif /* _KEYS_SYSTEM_KEYRING_H */
diff --git a/security/integrity/digsig.c b/security/integrity/digsig.c
index f45d6edecf99..e19c2eb72c51 100644
--- a/security/integrity/digsig.c
+++ b/security/integrity/digsig.c
@@ -87,6 +87,9 @@ static int __integrity_init_keyring(const unsigned int id, key_perm_t perm,
pr_info("Can't allocate %s keyring (%d)\n",
keyring_name[id], err);
keyring[id] = NULL;
+   } else {
+   if (id == INTEGRITY_KEYRING_PLATFORM)
+   set_platform_trusted_keys(keyring[id]);
}
 
return err;
-- 
2.20.1



[PATCH v5 2/2] kexec, KEYS: Make use of platform keyring for signature verify

2019-01-21 Thread Kairui Song
This patch lets kexec_file_load make use of the .platform keyring as a
fallback if it fails to verify a PE signed image against the secondary or
builtin keyring, making it possible to verify kernel images signed with
preboot keys as well.

This commit adds a VERIFY_USE_PLATFORM_KEYRING, similar to the previous
VERIFY_USE_SECONDARY_KEYRING, indicating that verify_pkcs7_signature
should verify the signature using the platform keyring. Also, decrease
the error message log level when verification fails with -ENOKEY, so that
a caller trying multiple keyrings won't generate extra noise.

Signed-off-by: Kairui Song 
---
 arch/x86/kernel/kexec-bzimage64.c | 13 ++---
 certs/system_keyring.c| 13 -
 include/linux/verification.h  |  1 +
 3 files changed, 23 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/kexec-bzimage64.c b/arch/x86/kernel/kexec-bzimage64.c
index 7d97e432cbbc..2c007abd3d40 100644
--- a/arch/x86/kernel/kexec-bzimage64.c
+++ b/arch/x86/kernel/kexec-bzimage64.c
@@ -534,9 +534,16 @@ static int bzImage64_cleanup(void *loader_data)
 #ifdef CONFIG_KEXEC_BZIMAGE_VERIFY_SIG
 static int bzImage64_verify_sig(const char *kernel, unsigned long kernel_len)
 {
-   return verify_pefile_signature(kernel, kernel_len,
-  VERIFY_USE_SECONDARY_KEYRING,
-  VERIFYING_KEXEC_PE_SIGNATURE);
+   int ret;
+   ret = verify_pefile_signature(kernel, kernel_len,
+ VERIFY_USE_SECONDARY_KEYRING,
+ VERIFYING_KEXEC_PE_SIGNATURE);
+   if (ret == -ENOKEY && IS_ENABLED(CONFIG_INTEGRITY_PLATFORM_KEYRING)) {
+   ret = verify_pefile_signature(kernel, kernel_len,
+ VERIFY_USE_PLATFORM_KEYRING,
+ VERIFYING_KEXEC_PE_SIGNATURE);
+   }
+   return ret;
 }
 #endif
 
diff --git a/certs/system_keyring.c b/certs/system_keyring.c
index 4690ef9cda8a..7085c286f4bd 100644
--- a/certs/system_keyring.c
+++ b/certs/system_keyring.c
@@ -240,11 +240,22 @@ int verify_pkcs7_signature(const void *data, size_t len,
 #else
trusted_keys = builtin_trusted_keys;
 #endif
+   } else if (trusted_keys == VERIFY_USE_PLATFORM_KEYRING) {
+#ifdef CONFIG_INTEGRITY_PLATFORM_KEYRING
+   trusted_keys = platform_trusted_keys;
+#else
+   trusted_keys = NULL;
+#endif
+   if (!trusted_keys) {
+   ret = -ENOKEY;
+   pr_devel("PKCS#7 platform keyring is not available\n");
+   goto error;
+   }
}
ret = pkcs7_validate_trust(pkcs7, trusted_keys);
if (ret < 0) {
if (ret == -ENOKEY)
-   pr_err("PKCS#7 signature not signed with a trusted key\n");
+   pr_devel("PKCS#7 signature not signed with a trusted key\n");
goto error;
}
 
diff --git a/include/linux/verification.h b/include/linux/verification.h
index cfa4730d607a..018fb5f13d44 100644
--- a/include/linux/verification.h
+++ b/include/linux/verification.h
@@ -17,6 +17,7 @@
  * should be used.
  */
 #define VERIFY_USE_SECONDARY_KEYRING ((struct key *)1UL)
+#define VERIFY_USE_PLATFORM_KEYRING  ((struct key *)2UL)
 
 /*
  * The use to which an asymmetric key is being put.
-- 
2.20.1



Re: [PATCH v4 0/2] let kexec_file_load use platform keyring to verify the kernel image

2019-01-21 Thread Kairui Song
On Fri, Jan 18, 2019 at 10:28 PM Kairui Song  wrote:
>
> On Fri, Jan 18, 2019 at 9:42 PM Kairui Song  wrote:
> >
> > On Fri, Jan 18, 2019 at 8:37 PM Dave Young  wrote:
> > >
> > > On 01/18/19 at 08:34pm, Dave Young wrote:
> > > > On 01/18/19 at 06:53am, Mimi Zohar wrote:
> > > > > On Fri, 2019-01-18 at 17:17 +0800, Kairui Song wrote:
> > > > > > This patch series adds a .platform_trusted_keys in system_keyring 
> > > > > > as the
> > > > > > reference to .platform keyring in integrity subsystem, when platform
> > > > > > keyring is being initialized it will be updated. So other component 
> > > > > > could
> > > > > > use this keyring as well.
> > > > >
> > > > > Kairui, when people review patches, the comments could be specific,
> > > > > but are normally generic.  My review included a couple of generic
> > > > > suggestions - not to use "#ifdef" in C code (eg. is_enabled), use the
> > > > > term "preboot" keys, and remove any references to "other components".
> > > > >
> > > > > After all the wording suggestions I've made, you are still saying, "So
> > > > > other components could use this keyring as well".  Really?!  How the
> > > > > platform keyring will be used in the future, is up to you and others
> > > > > to convince Linus.  At least for now, please limit its usage to
> > > > > verifying the PE signed kernel image.  If this patch set needs to be
> > > > > reposted, please remove all references to "other components".
> > > > >
> > > > > Dave/David, are you ok with Kairui's usage of "#ifdef's"?  Dave, you
> > > > > Acked the original post.  Can I include it?  Can we get some
> > > > > additional Ack's on these patches?
> > > >
> > > > It is better to update patch to use IS_ENABLED in patch 1/2 as well.
> > >
> > > Hmm, not only for patch 1/2, patch 2/2 also need an update
> > >
> > > > Other than that, for kexec part I'm fine with an ack.
> > > >
> > > > Thanks
> > > > Dave
> >
> > Thanks for the review again, will update the patch using IS_ENABLED
> > along with update the cover letter shortly.
> >
> > --
> > Best Regards,
> > Kairui Song
>
> Hi, before I update it again: most of the new platform_trusted_keys
> related code follows how secondary_trusted_keys is implemented
> (surrounded by ifdefs). I thought this could reduce unused code when the
> keyring is not enabled. Otherwise, all the ifdefs could simply be
> removed; when the platform keyring is not enabled, platform_trusted_keys
> will always be NULL, and verify_pkcs7_signature will simply return
> -ENOKEY if anyone tries to use the platform keyring.
>
> Any suggestions? Or I can just remove the ifdef in
> security/integrity/digsig.c and make set_platform_trusted_keys an empty
> inline function in system_keyring.h.
>
> --
> Best Regards,
> Kairui Song

Hi, after a second thought I'll drop the #ifdef in
security/integrity/digsig.c in PATCH 1/2, and make
set_platform_trusted_keys an empty inline function when
CONFIG_INTEGRITY_PLATFORM_KEYRING is undefined.
But I think the other ifdefs in certs/system_keyring.c can just be kept
untouched. They strip out the platform_trusted_keys variable and related
function when CONFIG_INTEGRITY_PLATFORM_KEYRING is not used; this helps
reduce unused code, prevents compile errors, and keeps the code style
aligned with the existing code in system_keyring.c.
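
For reference, a minimal sketch of the header pattern being described
(illustrative only; the v5 patch as posted is authoritative). The stub
compiles away when CONFIG_INTEGRITY_PLATFORM_KEYRING is off:

  /* include/keys/system_keyring.h (sketch of the fallback stub) */
  #ifdef CONFIG_INTEGRITY_PLATFORM_KEYRING
  extern void __init set_platform_trusted_keys(struct key *keyring);
  #else
  /* No platform keyring configured: calls become no-ops */
  static inline void set_platform_trusted_keys(struct key *keyring)
  {
  }
  #endif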

Will send v5 with the above updates and a fix for a potential problem found by Nayna.


--
Best Regards,
Kairui Song


Re: [PATCH v4 1/2] integrity, KEYS: add a reference to platform keyring

2019-01-18 Thread Kairui Song
On Fri, Jan 18, 2019 at 10:36 PM Nayna  wrote:
> On 01/18/2019 04:17 AM, Kairui Song wrote:
> > commit 9dc92c45177a ('integrity: Define a trusted platform keyring')
> > introduced a .platform keyring for storing preboot keys, used for
> > verifying kernel images' signature. Currently only IMA-appraisal is able
> > to use the keyring to verify kernel images that have their signature
> > stored in xattr.
> >
> > This patch exposes the .platform keyring, making it accessible for
> > verifying PE signed kernel images as well.
> >
> > Suggested-by: Mimi Zohar 
> > Signed-off-by: Kairui Song 
> > Reviewed-by: Mimi Zohar 
> > Tested-by: Mimi Zohar 
> > ---
> >   certs/system_keyring.c| 9 +
> >   include/keys/system_keyring.h | 5 +
> >   security/integrity/digsig.c   | 6 ++
> >   3 files changed, 20 insertions(+)
> >
> > diff --git a/certs/system_keyring.c b/certs/system_keyring.c
> > index 81728717523d..4690ef9cda8a 100644
> > --- a/certs/system_keyring.c
> > +++ b/certs/system_keyring.c
> > @@ -24,6 +24,9 @@ static struct key *builtin_trusted_keys;
> >   #ifdef CONFIG_SECONDARY_TRUSTED_KEYRING
> >   static struct key *secondary_trusted_keys;
> >   #endif
> > +#ifdef CONFIG_INTEGRITY_PLATFORM_KEYRING
> > +static struct key *platform_trusted_keys;
> > +#endif
> >
> >   extern __initconst const u8 system_certificate_list[];
> >   extern __initconst const unsigned long system_certificate_list_size;
> > @@ -265,4 +268,10 @@ int verify_pkcs7_signature(const void *data, size_t 
> > len,
> >   }
> >   EXPORT_SYMBOL_GPL(verify_pkcs7_signature);
> >
> > +#ifdef CONFIG_INTEGRITY_PLATFORM_KEYRING
> > +void __init set_platform_trusted_keys(struct key *keyring) {
> > + platform_trusted_keys = keyring;
> > +}
> > +#endif
> > +
> >   #endif /* CONFIG_SYSTEM_DATA_VERIFICATION */
> > diff --git a/include/keys/system_keyring.h b/include/keys/system_keyring.h
> > index 359c2f936004..9e1b7849b6aa 100644
> > --- a/include/keys/system_keyring.h
> > +++ b/include/keys/system_keyring.h
> > @@ -61,5 +61,10 @@ static inline struct key *get_ima_blacklist_keyring(void)
> >   }
> >   #endif /* CONFIG_IMA_BLACKLIST_KEYRING */
> >
> > +#ifdef CONFIG_INTEGRITY_PLATFORM_KEYRING
> > +
> > +extern void __init set_platform_trusted_keys(struct key* keyring);
> > +
> > +#endif /* CONFIG_INTEGRITY_PLATFORM_KEYRING */
> >
> >   #endif /* _KEYS_SYSTEM_KEYRING_H */
> > diff --git a/security/integrity/digsig.c b/security/integrity/digsig.c
> > index f45d6edecf99..bfabc2a8111d 100644
> > --- a/security/integrity/digsig.c
> > +++ b/security/integrity/digsig.c
> > @@ -89,6 +89,12 @@ static int __integrity_init_keyring(const unsigned int 
> > id, key_perm_t perm,
> >   keyring[id] = NULL;
> >   }
> >
> > +#ifdef CONFIG_INTEGRITY_PLATFORM_KEYRING
> > + if (id == INTEGRITY_KEYRING_PLATFORM) {
>
> Shouldn't it also check that keyring[id] is not NULL ?

Good catch; if it's NULL then platform_trusted_keys will be set to
NULL as well, which works just fine, as in this case the platform
keyring is still considered not initialized. I'll add a sanity check
of the err value here just in case.
Thanks for your suggestion!
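
Roughly, the intent is to hang the call off the existing error handling
in __integrity_init_keyring(), something like this (a sketch of the
planned v5 shape, not the final hunk; the allocation call is elided):

  keyring[id] = keyring_alloc(...);       /* existing allocation */
  if (IS_ERR(keyring[id])) {
          err = PTR_ERR(keyring[id]);
          pr_info("Can't allocate %s keyring (%d)\n",
                  keyring_name[id], err);
          keyring[id] = NULL;
  } else {
          /* Only reached on success, so the reference is never bogus */
          if (id == INTEGRITY_KEYRING_PLATFORM)
                  set_platform_trusted_keys(keyring[id]);
  }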

>
> Thanks & Regards,
>  - Nayna
>
> > + set_platform_trusted_keys(keyring[id]);
> > + }
> > +#endif
> > +
> >   return err;
> >   }
> >
>
>
> ___
> kexec mailing list
> ke...@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/kexec



-- 
Best Regards,
Kairui Song


Re: [PATCH v4 0/2] let kexec_file_load use platform keyring to verify the kernel image

2019-01-18 Thread Kairui Song
On Fri, Jan 18, 2019 at 9:42 PM Kairui Song  wrote:
>
> On Fri, Jan 18, 2019 at 8:37 PM Dave Young  wrote:
> >
> > On 01/18/19 at 08:34pm, Dave Young wrote:
> > > On 01/18/19 at 06:53am, Mimi Zohar wrote:
> > > > On Fri, 2019-01-18 at 17:17 +0800, Kairui Song wrote:
> > > > > This patch series adds a .platform_trusted_keys in system_keyring as 
> > > > > the
> > > > > reference to .platform keyring in integrity subsystem, when platform
> > > > > keyring is being initialized it will be updated. So other component 
> > > > > could
> > > > > use this keyring as well.
> > > >
> > > > Kairui, when people review patches, the comments could be specific,
> > > > but are normally generic.  My review included a couple of generic
> > > > suggestions - not to use "#ifdef" in C code (eg. is_enabled), use the
> > > > term "preboot" keys, and remove any references to "other components".
> > > >
> > > > After all the wording suggestions I've made, you are still saying, "So
> > > > other components could use this keyring as well".  Really?!  How the
> > > > platform keyring will be used in the future, is up to you and others
> > > > to convince Linus.  At least for now, please limit its usage to
> > > > verifying the PE signed kernel image.  If this patch set needs to be
> > > > reposted, please remove all references to "other components".
> > > >
> > > > Dave/David, are you ok with Kairui's usage of "#ifdef's"?  Dave, you
> > > > Acked the original post.  Can I include it?  Can we get some
> > > > additional Ack's on these patches?
> > >
> > > It is better to update patch to use IS_ENABLED in patch 1/2 as well.
> >
> > Hmm, not only for patch 1/2, patch 2/2 also need an update
> >
> > > Other than that, for kexec part I'm fine with an ack.
> > >
> > > Thanks
> > > Dave
>
> Thanks for the review again, will update the patch using IS_ENABLED
> along with update the cover letter shortly.
>
> --
> Best Regards,
> Kairui Song

Hi, before I update it again: most of the new platform_trusted_keys
related code follows how secondary_trusted_keys is implemented
(surrounded by ifdefs). I thought this could reduce unused code when the
keyring is not enabled. Otherwise, all the ifdefs could simply be
removed; when the platform keyring is not enabled, platform_trusted_keys
will always be NULL, and verify_pkcs7_signature will simply return
-ENOKEY if anyone tries to use the platform keyring.

Any suggestions? Or I can just remove the ifdef in
security/integrity/digsig.c and make set_platform_trusted_keys an empty
inline function in system_keyring.h.

-- 
Best Regards,
Kairui Song


Re: [PATCH v4 0/2] let kexec_file_load use platform keyring to verify the kernel image

2019-01-18 Thread Kairui Song
On Fri, Jan 18, 2019 at 8:37 PM Dave Young  wrote:
>
> On 01/18/19 at 08:34pm, Dave Young wrote:
> > On 01/18/19 at 06:53am, Mimi Zohar wrote:
> > > On Fri, 2019-01-18 at 17:17 +0800, Kairui Song wrote:
> > > > This patch series adds a .platform_trusted_keys in system_keyring as the
> > > > reference to .platform keyring in integrity subsystem, when platform
> > > > keyring is being initialized it will be updated. So other component 
> > > > could
> > > > use this keyring as well.
> > >
> > > Kairui, when people review patches, the comments could be specific,
> > > but are normally generic.  My review included a couple of generic
> > > suggestions - not to use "#ifdef" in C code (eg. is_enabled), use the
> > > term "preboot" keys, and remove any references to "other components".
> > >
> > > After all the wording suggestions I've made, you are still saying, "So
> > > other components could use this keyring as well".  Really?!  How the
> > > platform keyring will be used in the future, is up to you and others
> > > to convince Linus.  At least for now, please limit its usage to
> > > verifying the PE signed kernel image.  If this patch set needs to be
> > > reposted, please remove all references to "other components".
> > >
> > > Dave/David, are you ok with Kairui's usage of "#ifdef's"?  Dave, you
> > > Acked the original post.  Can I include it?  Can we get some
> > > additional Ack's on these patches?
> >
> > It is better to update patch to use IS_ENABLED in patch 1/2 as well.
>
> Hmm, not only for patch 1/2, patch 2/2 also need an update
>
> > Other than that, for kexec part I'm fine with an ack.
> >
> > Thanks
> > Dave

Thanks for the review again, will update the patch to use IS_ENABLED
and update the cover letter shortly.

-- 
Best Regards,
Kairui Song


Re: [PATCH v4 0/2] let kexec_file_load use platform keyring to verify the kernel image

2019-01-18 Thread Kairui Song
On Fri, Jan 18, 2019, 19:54 Mimi Zohar 
> On Fri, 2019-01-18 at 17:17 +0800, Kairui Song wrote:
> > This patch series adds a .platform_trusted_keys in system_keyring as the
> > reference to .platform keyring in integrity subsystem, when platform
> > keyring is being initialized it will be updated. So other component could
> > use this keyring as well.
>
> Kairui, when people review patches, the comments could be specific,
> but are normally generic.  My review included a couple of generic
> suggestions - not to use "#ifdef" in C code (eg. is_enabled), use the
> term "preboot" keys, and remove any references to "other components".
>
> After all the wording suggestions I've made, you are still saying, "So
> other components could use this keyring as well".  Really?!  How the
> platform keyring will be used in the future, is up to you and others
> to convince Linus.  At least for now, please limit its usage to
> verifying the PE signed kernel image.  If this patch set needs to be
> reposted, please remove all references to "other components".
>
> Dave/David, are you ok with Kairui's usage of "#ifdef's"?  Dave, you
> Acked the original post.  Can I include it?  Can we get some
> additional Ack's on these patches?
>
> thanks!
>
> Mimi
>

Hi, Mimi, thanks for your feedback. My bad, I reused the old cover
letter without checking it carefully; hopefully the commit messages
have proper wording now. If the cover letter needs to be updated I can
resend the patch; let me just hold off a while before updating again.


Re: [PATCH v3 2/3] acpi: store acpi_rsdp address for later kexec usage

2019-01-18 Thread Kairui Song
On Fri, Jan 18, 2019 at 7:26 PM Borislav Petkov  wrote:

> No, this is getting completely nuts: there's a bunch of functions which
> all end up returning boot_params's field except pvh_get_root_pointer().
>
> And now you're adding a late variant. And the cmdline paramater
> acpi_rsdp is in a CONFIG_KEXEC wrapper, and and...
>
> Wait until Chao Fan's stuff is applied, then do your changes ontop
> an drop all that ifdeffery. We will make this RDSP thing enabled
> unconditionally so that there's no need for ifdeffery and function
> wrappers.
>
> Also, after Chao's stuff, you won't need to call
> acpi_os_get_root_pointer() because the early code would've done that.
>
> --
> Regards/Gruss,
> Boris.
>
> Good mailing practices for 400: avoid top-posting and trim the reply.

Good suggestion, will wait for Chao's update then.


--
Best Regards,
Kairui Song


[PATCH v3 1/3] x86, kexec_file_load: Don't setup EFI info if EFI runtime is not enabled

2019-01-18 Thread Kairui Song
Currently with "efi=noruntime" in kernel command line, calling
kexec_file_load will raise below problem:

[   97.967067] BUG: unable to handle kernel NULL pointer dereference at 

[   97.967894] #PF error: [normal kernel read fault]
...
[   97.980456] Call Trace:
[   97.980724]  efi_runtime_map_copy+0x28/0x30
[   97.981267]  bzImage64_load+0x688/0x872
[   97.981794]  arch_kexec_kernel_image_load+0x6d/0x70
[   97.982441]  kimage_file_alloc_init+0x13e/0x220
[   97.983035]  __x64_sys_kexec_file_load+0x144/0x290
[   97.983586]  do_syscall_64+0x55/0x1a0
[   97.983962]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

When EFI runtime services are not enabled, the EFI memmap is not mapped,
so just skip the EFI info setup.

Suggested-by: Dave Young 
Signed-off-by: Kairui Song 
---
 arch/x86/kernel/kexec-bzimage64.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/x86/kernel/kexec-bzimage64.c b/arch/x86/kernel/kexec-bzimage64.c
index 2c007abd3d40..097f52fb02e3 100644
--- a/arch/x86/kernel/kexec-bzimage64.c
+++ b/arch/x86/kernel/kexec-bzimage64.c
@@ -167,6 +167,9 @@ setup_efi_state(struct boot_params *params, unsigned long params_load_addr,
struct efi_info *current_ei = &boot_params.efi_info;
struct efi_info *ei = &params->efi_info;
 
+   if (!efi_enabled(EFI_RUNTIME_SERVICES))
+   return 0;
+
if (!current_ei->efi_memmap_size)
return 0;
 
-- 
2.20.1



[PATCH v3 3/3] x86, kexec_file_load: make it work with efi=noruntime or efi=old_map

2019-01-18 Thread Kairui Song
When efi=noruntime or efi=oldmap is used, EFI services won't be available
in the second kernel, therefore the second kernel will not be able to get
the ACPI RSDP address from firmware by calling EFI services, and won't
boot. Previously we expected the user to set acpi_rsdp= on the kernel
command line for the second kernel, as there was no way to pass the RSDP
address to the second kernel.

After commit e6e094e053af ('x86/acpi, x86/boot: Take RSDP address from
boot params if available'), it's now possible to set an acpi_rsdp_addr
parameter in the boot_params passed to the second kernel. This commit
makes use of it, detecting and setting the RSDP address when it's
required for the second kernel to boot.

Tested with an EFI enabled KVM VM with efi=noruntime.

Suggested-by: Dave Young 
Signed-off-by: Kairui Song 
---
 arch/x86/kernel/kexec-bzimage64.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/arch/x86/kernel/kexec-bzimage64.c b/arch/x86/kernel/kexec-bzimage64.c
index 097f52fb02e3..63101b2194fb 100644
--- a/arch/x86/kernel/kexec-bzimage64.c
+++ b/arch/x86/kernel/kexec-bzimage64.c
@@ -20,6 +20,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -255,8 +256,17 @@ setup_boot_parameters(struct kimage *image, struct boot_params *params,
/* Setup EFI state */
setup_efi_state(params, params_load_addr, efi_map_offset, efi_map_sz,
efi_setup_data_offset);
+
+#ifdef CONFIG_ACPI
+   /* Setup ACPI RSDP pointer in case EFI is not available in second kernel */
+   if (!acpi_disabled && (!efi_enabled(EFI_RUNTIME_SERVICES) || efi_enabled(EFI_OLD_MEMMAP))) {
+   params->acpi_rsdp_addr = acpi_os_get_root_pointer_late();
+   if (!params->acpi_rsdp_addr)
+   pr_warn("RSDP is not available for second kernel\n");
+   }
 #endif
 
+#endif
/* Setup EDD info */
memcpy(params->eddbuf, boot_params.eddbuf,
EDDMAXNR * sizeof(struct edd_info));
-- 
2.20.1



[PATCH v3 2/3] acpi: store acpi_rsdp address for later kexec usage

2019-01-18 Thread Kairui Song
Currently we have acpi_os_get_root_pointer as the universal function
to get the RSDP address. But the function itself and some functions it
depends on are in the .init section, which makes it not easy to retrieve
the RSDP value once the kernel is initialized.

For kexec, the RSDP needs to be retrieved again if EFI is disabled,
because the second kernel will not be able to get the RSDP value in that
case; it expects either the user to specify the RSDP value on the kernel
cmdline, or kexec to retrieve and pass the RSDP value using boot_params.

This patch stores the RSDP address when initialization is done, and
introduces acpi_os_get_root_pointer_late for later kexec usage.

Signed-off-by: Kairui Song 
---
 drivers/acpi/osl.c   | 10 ++
 include/linux/acpi.h |  3 +++
 2 files changed, 13 insertions(+)

diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
index f29e427d0d1d..6340d34d0df1 100644
--- a/drivers/acpi/osl.c
+++ b/drivers/acpi/osl.c
@@ -187,6 +187,16 @@ static int __init setup_acpi_rsdp(char *arg)
return kstrtoul(arg, 16, &acpi_rsdp);
 }
 early_param("acpi_rsdp", setup_acpi_rsdp);
+
+acpi_physical_address acpi_os_get_root_pointer_late(void) {
+   return acpi_rsdp;
+}
+
+static int __init acpi_store_root_pointer(void) {
+   acpi_rsdp = acpi_os_get_root_pointer();
+   return 0;
+}
+late_initcall(acpi_store_root_pointer);
 #endif
 
 acpi_physical_address __init acpi_os_get_root_pointer(void)
diff --git a/include/linux/acpi.h b/include/linux/acpi.h
index 87715f20b69a..226f2572eb8e 100644
--- a/include/linux/acpi.h
+++ b/include/linux/acpi.h
@@ -892,6 +892,9 @@ static inline void arch_reserve_mem_area(acpi_physical_address addr,
 {
 }
 #endif /* CONFIG_X86 */
+#ifdef CONFIG_KEXEC
+acpi_physical_address acpi_os_get_root_pointer_late(void);
+#endif
 #else
 #define acpi_os_set_prepare_sleep(func, pm1a_ctrl, pm1b_ctrl) do { } while (0)
 #endif
-- 
2.20.1



[PATCH v3 0/3] make kexec work with efi=noruntime or efi=old_map

2019-01-18 Thread Kairui Song
This patch series fixes the kexec panic with efi=noruntime or efi=old_map,
passing acpi_rsdp_addr to the second kernel so it boots up properly.

Update from V2:
 - Store the acpi rsdp value, and add acpi_os_get_root_pointer_late as
   a helper, leveraging existing code so we don't need to reparse the RSDP.

Update from V1:
 - Add a cover letter and fix some typos in commit messages
 - Previous patches were not sent in a single thread

Kairui Song (3):
  x86, kexec_file_load: Don't setup EFI info if EFI runtime is not
enabled
  acpi: store acpi_rsdp address for later kexec usage
  x86, kexec_file_load: make it work with efi=noruntime or efi=old_map

 arch/x86/kernel/kexec-bzimage64.c | 13 +
 drivers/acpi/osl.c| 10 ++
 include/linux/acpi.h  |  3 +++
 3 files changed, 26 insertions(+)

-- 
2.20.1



[PATCH v4 0/2] let kexec_file_load use platform keyring to verify the kernel image

2019-01-18 Thread Kairui Song
This patch series adds a .platform_trusted_keys in system_keyring as the
reference to .platform keyring in integrity subsystem, when platform
keyring is being initialized it will be updated. So other component could
use this keyring as well.

This patch series also let kexec_file_load use platform keyring as fall
back if it failed to verify the image against secondary keyring, make it
possible to load kernel signed by keys provides by firmware.

After this patch kexec_file_load will be able to verify a signed PE
bzImage using keys in platform keyring.

Tested in a VM with locally signed kernel with pesign and imported the
cert to EFI's MokList variable.

To test this patch series on the latest kernel, you need to ensure this
commit is applied, as there is a regression in sanity_check_segment_list():

https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/commit/?id=993a110319a4a60aadbd02f6defdebe048f7773b

Update from V3:
  - Tweak and simplify commit message as suggested by Mimi Zohar

Update from V2:
  - Use IS_ENABLED in kexec_file_load to judge if platform_trusted_keys
should be used for verifying image as suggested by Mimi Zohar

Update from V1:
  - Make platform_trusted_keys static, and update commit message as suggested
by Mimi Zohar
  - Always check if the platform keyring is initialized before using it

Kairui Song (2):
  integrity, KEYS: add a reference to platform keyring
  kexec, KEYS: Make use of platform keyring for signature verify

 arch/x86/kernel/kexec-bzimage64.c | 13 ++---
 certs/system_keyring.c| 22 +-
 include/keys/system_keyring.h |  5 +
 include/linux/verification.h  |  1 +
 security/integrity/digsig.c   |  6 ++
 5 files changed, 43 insertions(+), 4 deletions(-)

-- 
2.20.1



[PATCH v4 2/2] kexec, KEYS: Make use of platform keyring for signature verify

2019-01-18 Thread Kairui Song
This patch lets kexec_file_load make use of the .platform keyring as a
fallback if it fails to verify a PE signed image against the secondary or
builtin keyring, making it possible to verify kernel images signed with
preboot keys as well.

This commit adds a VERIFY_USE_PLATFORM_KEYRING, similar to the previous
VERIFY_USE_SECONDARY_KEYRING, indicating that verify_pkcs7_signature
should verify the signature using the platform keyring. Also, decrease
the error message log level when verification fails with -ENOKEY, so that
a caller trying multiple keyrings won't generate extra noise.

Signed-off-by: Kairui Song 
Reviewed-by: Mimi Zohar 
Tested-by: Mimi Zohar 
---
 arch/x86/kernel/kexec-bzimage64.c | 13 ++---
 certs/system_keyring.c| 13 -
 include/linux/verification.h  |  1 +
 3 files changed, 23 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/kexec-bzimage64.c b/arch/x86/kernel/kexec-bzimage64.c
index 7d97e432cbbc..2c007abd3d40 100644
--- a/arch/x86/kernel/kexec-bzimage64.c
+++ b/arch/x86/kernel/kexec-bzimage64.c
@@ -534,9 +534,16 @@ static int bzImage64_cleanup(void *loader_data)
 #ifdef CONFIG_KEXEC_BZIMAGE_VERIFY_SIG
 static int bzImage64_verify_sig(const char *kernel, unsigned long kernel_len)
 {
-   return verify_pefile_signature(kernel, kernel_len,
-  VERIFY_USE_SECONDARY_KEYRING,
-  VERIFYING_KEXEC_PE_SIGNATURE);
+   int ret;
+   ret = verify_pefile_signature(kernel, kernel_len,
+ VERIFY_USE_SECONDARY_KEYRING,
+ VERIFYING_KEXEC_PE_SIGNATURE);
+   if (ret == -ENOKEY && IS_ENABLED(CONFIG_INTEGRITY_PLATFORM_KEYRING)) {
+   ret = verify_pefile_signature(kernel, kernel_len,
+ VERIFY_USE_PLATFORM_KEYRING,
+ VERIFYING_KEXEC_PE_SIGNATURE);
+   }
+   return ret;
 }
 #endif
 
diff --git a/certs/system_keyring.c b/certs/system_keyring.c
index 4690ef9cda8a..7085c286f4bd 100644
--- a/certs/system_keyring.c
+++ b/certs/system_keyring.c
@@ -240,11 +240,22 @@ int verify_pkcs7_signature(const void *data, size_t len,
 #else
trusted_keys = builtin_trusted_keys;
 #endif
+   } else if (trusted_keys == VERIFY_USE_PLATFORM_KEYRING) {
+#ifdef CONFIG_INTEGRITY_PLATFORM_KEYRING
+   trusted_keys = platform_trusted_keys;
+#else
+   trusted_keys = NULL;
+#endif
+   if (!trusted_keys) {
+   ret = -ENOKEY;
+   pr_devel("PKCS#7 platform keyring is not available\n");
+   goto error;
+   }
}
ret = pkcs7_validate_trust(pkcs7, trusted_keys);
if (ret < 0) {
if (ret == -ENOKEY)
-   pr_err("PKCS#7 signature not signed with a trusted key\n");
+   pr_devel("PKCS#7 signature not signed with a trusted key\n");
goto error;
}
 
diff --git a/include/linux/verification.h b/include/linux/verification.h
index cfa4730d607a..018fb5f13d44 100644
--- a/include/linux/verification.h
+++ b/include/linux/verification.h
@@ -17,6 +17,7 @@
  * should be used.
  */
 #define VERIFY_USE_SECONDARY_KEYRING ((struct key *)1UL)
+#define VERIFY_USE_PLATFORM_KEYRING  ((struct key *)2UL)
 
 /*
  * The use to which an asymmetric key is being put.
-- 
2.20.1



[PATCH v4 1/2] integrity, KEYS: add a reference to platform keyring

2019-01-18 Thread Kairui Song
commit 9dc92c45177a ('integrity: Define a trusted platform keyring')
introduced a .platform keyring for storing preboot keys, used for
verifying kernel images' signatures. Currently only IMA-appraisal is able
to use the keyring to verify kernel images that have their signature
stored in xattr.

This patch exposes the .platform keyring, making it accessible for
verifying PE signed kernel images as well.

Suggested-by: Mimi Zohar 
Signed-off-by: Kairui Song 
Reviewed-by: Mimi Zohar 
Tested-by: Mimi Zohar 
---
 certs/system_keyring.c| 9 +
 include/keys/system_keyring.h | 5 +
 security/integrity/digsig.c   | 6 ++
 3 files changed, 20 insertions(+)

diff --git a/certs/system_keyring.c b/certs/system_keyring.c
index 81728717523d..4690ef9cda8a 100644
--- a/certs/system_keyring.c
+++ b/certs/system_keyring.c
@@ -24,6 +24,9 @@ static struct key *builtin_trusted_keys;
 #ifdef CONFIG_SECONDARY_TRUSTED_KEYRING
 static struct key *secondary_trusted_keys;
 #endif
+#ifdef CONFIG_INTEGRITY_PLATFORM_KEYRING
+static struct key *platform_trusted_keys;
+#endif
 
 extern __initconst const u8 system_certificate_list[];
 extern __initconst const unsigned long system_certificate_list_size;
@@ -265,4 +268,10 @@ int verify_pkcs7_signature(const void *data, size_t len,
 }
 EXPORT_SYMBOL_GPL(verify_pkcs7_signature);
 
+#ifdef CONFIG_INTEGRITY_PLATFORM_KEYRING
+void __init set_platform_trusted_keys(struct key *keyring) {
+   platform_trusted_keys = keyring;
+}
+#endif
+
 #endif /* CONFIG_SYSTEM_DATA_VERIFICATION */
diff --git a/include/keys/system_keyring.h b/include/keys/system_keyring.h
index 359c2f936004..9e1b7849b6aa 100644
--- a/include/keys/system_keyring.h
+++ b/include/keys/system_keyring.h
@@ -61,5 +61,10 @@ static inline struct key *get_ima_blacklist_keyring(void)
 }
 #endif /* CONFIG_IMA_BLACKLIST_KEYRING */
 
+#ifdef CONFIG_INTEGRITY_PLATFORM_KEYRING
+
+extern void __init set_platform_trusted_keys(struct key* keyring);
+
+#endif /* CONFIG_INTEGRITY_PLATFORM_KEYRING */
 
 #endif /* _KEYS_SYSTEM_KEYRING_H */
diff --git a/security/integrity/digsig.c b/security/integrity/digsig.c
index f45d6edecf99..bfabc2a8111d 100644
--- a/security/integrity/digsig.c
+++ b/security/integrity/digsig.c
@@ -89,6 +89,12 @@ static int __integrity_init_keyring(const unsigned int id, key_perm_t perm,
keyring[id] = NULL;
}
 
+#ifdef CONFIG_INTEGRITY_PLATFORM_KEYRING
+   if (id == INTEGRITY_KEYRING_PLATFORM) {
+   set_platform_trusted_keys(keyring[id]);
+   }
+#endif
+
return err;
 }
 
-- 
2.20.1



Re: [PATCH v2 2/2] x86, kexec_file_load: make it work with efi=noruntime or efi=old_map

2019-01-17 Thread Kairui Song
On Thu, Jan 17, 2019 at 5:40 PM Rafael J. Wysocki  wrote:
>
> On Thu, Jan 17, 2019 at 9:53 AM Dave Young  wrote:
> >
> > Add linux-acpi list
>
> Well, thanks, but please resend the patches with a CC to linux-acpi.
>

Hi, sure, will do.
Any thoughts on adding an acpi_os_get_root_pointer_late and storing the
rsdp pointer as mentioned? Will update the patch and post V2, and cc
linux-acpi as well later.

> > On 01/17/19 at 03:41pm, Kairui Song wrote:
> > > On Wed, Jan 16, 2019 at 5:46 PM Borislav Petkov  wrote:
> > > >
> > > > On Wed, Jan 16, 2019 at 03:08:42PM +0800, Kairui Song wrote:
> > > > > I didn't see a way to reuse things in that patch series, situation is
> > > > > different, in that patch it needs to get RSDP in very early boot stage
> > > > > so it did everything from scratch, in this patch kexec_file_load need
> > > > > to get RSDP too, but everything is well setup so things are a lot
> > > > > easier, just read from current boot_prams, efi and fallback to
> > > > > acpi_find_root_pointer should be good.
> > > >
> > > > No no. Early code should find out that venerable RSDP thing once and
> > > > will save it somewhere for further use. No gazillion parsings of it.
> > > > Just once and share it with the rest of the code that needs it.
> > > >
> > >
> > > How about we refill the boot_params.acpi_rsdp_addr if it is not valid
> > > in early code, so it could be used as a reliable RSDP address source?
> > > That should make things easier.
> > >
> > > But if early code should parse it and store it should be done in
> > > Chao's patch, or I can post another patch to do it if Chao's patch is
> > > merged.
> > >
> > > For now I think good to have something like this in this patch series
> > > to always keep storing acpi_rsdp in late code,
> > > acpi_os_get_root_pointer_late (maybe comeup with a better name later)
> > > could be used anytime to get RSDP and no extra parsing:
> > >
> > > --- a/drivers/acpi/osl.c
> > > +++ b/drivers/acpi/osl.c
> > > @@ -180,8 +180,8 @@ void acpi_os_vprintf(const char *fmt, va_list args)
> > >  #endif
> > >  }
> > >
> > > -#ifdef CONFIG_KEXEC
> > >  static unsigned long acpi_rsdp;
> > > +#ifdef CONFIG_KEXEC
> > >  static int __init setup_acpi_rsdp(char *arg)
> > >  {
> > > return kstrtoul(arg, 16, &acpi_rsdp);
> > > @@ -189,28 +189,38 @@ static int __init setup_acpi_rsdp(char *arg)
> > >  early_param("acpi_rsdp", setup_acpi_rsdp);
> > >  #endif
> > >
> > > +acpi_physical_address acpi_os_get_root_pointer_late(void) {
> > > +   return acpi_rsdp;
> > > +}
> > > +
> > >  acpi_physical_address __init acpi_os_get_root_pointer(void)
> > >  {
> > > acpi_physical_address pa;
> > >
> > > -#ifdef CONFIG_KEXEC
> > > if (acpi_rsdp)
> > > return acpi_rsdp;
> > > -#endif
> > > +
> > > pa = acpi_arch_get_root_pointer();
> > > -   if (pa)
> > > +   if (pa) {
> > > +   acpi_rsdp = pa;
> > > return pa;
> > > +   }
> > >
> > > if (efi_enabled(EFI_CONFIG_TABLES)) {
> > > -   if (efi.acpi20 != EFI_INVALID_TABLE_ADDR)
> > > +   if (efi.acpi20 != EFI_INVALID_TABLE_ADDR) {
> > > +   acpi_rsdp = efi.acpi20;
> > > return efi.acpi20;
> > > -       if (efi.acpi != EFI_INVALID_TABLE_ADDR)
> > > +   }
> > > +   if (efi.acpi != EFI_INVALID_TABLE_ADDR) {
> > > +   acpi_rsdp = efi.acpi;
> > > return efi.acpi;
> > > +   }
> > > pr_err(PREFIX "System description tables not found\n");
> > > } else if (IS_ENABLED(CONFIG_ACPI_LEGACY_TABLES_LOOKUP)) {
> > > acpi_find_root_pointer();
> > > }
> > >
> > >  +   acpi_rsdp = pa;
> > > return pa;
> > >  }
> > >
> > > > --
> > > > Regards/Gruss,
> > > > Boris.
> > > >
> > > > Good mailing practices for 400: avoid top-posting and trim the reply.
> > > --
> > > Best Regards,
> > > Kairui Song



-- 
Best Regards,
Kairui Song


Re: [PATCH v3 0/2] let kexec_file_load use platform keyring to verify the kernel image

2019-01-17 Thread Kairui Song
On Fri, Jan 18, 2019 at 10:00 AM Dave Young  wrote:
>
> On 01/18/19 at 09:35am, Dave Young wrote:
> > On 01/17/19 at 08:08pm, Mimi Zohar wrote:
> > > On Wed, 2019-01-16 at 18:16 +0800, Kairui Song wrote:
> > > > This patch series adds a .platform_trusted_keys in system_keyring as the
> > > > reference to .platform keyring in integrity subsystem, when platform
> > > > keyring is being initialized it will be updated. So other component 
> > > > could
> > > > use this keyring as well.
> > >
> > > Remove "other component could use ...".
> > > >
> > > > This patch series also let kexec_file_load use platform keyring as fall
> > > > back if it failed to verify the image against secondary keyring, make it
> > > > possible to load kernel signed by third part key if third party key is
> > > > imported in the firmware.
> > >
> > > This is the only reason for these patches.  Please remove "also".
> > >
> > > >
> > > > After this patch kexec_file_load will be able to verify a signed PE
> > > > bzImage using keys in platform keyring.
> > > >
> > > > Tested in a VM with locally signed kernel with pesign and imported the
> > > > cert to EFI's MokList variable.
> > >
> > > It's taken so long for me to review/test this patch set due to a
> > > regression in sanity_check_segment_list(), introduced somewhere
> > > between 4.20 and 5.0.0-rc1.  The sgement overlap test - "if ((mend >
> > > pstart) && (mstart < pend))" - fails, returning a -EINVAL.
> > >
> > > Is anyone else seeing this?
> >
> > Mimi, should be this issue?  I have sent a fix for that.
> > https://lore.kernel.org/lkml/20181228011247.ga9...@dhcp-128-65.nay.redhat.com/
>
> Hi, Kairui, I think you should know this while working on this series,
> It is good to mention the test dependency in cover letter so that reviewers
> can save time.
>
> BTW, Boris took it in tip already:
> https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/commit/?id=993a110319a4a60aadbd02f6defdebe048f7773b
>

Hi, thanks for the suggestion, I did apply your patch to avoid the
failure. Will add such info next time.

Will send out V4 and update the commit messages as suggested by Mimi.


--
Best Regards,
Kairui Song


Re: [RFC PATCH 1/1] KEYS, integrity: Link .platform keyring to .secondary_trusted_keys

2019-01-17 Thread Kairui Song
On Thu, Jan 17, 2019 at 11:04 PM David Howells  wrote:
>
> Kairui Song  wrote:
>
> > +extern const struct key* __init integrity_get_platform_keyring(void);
>
> This should really be in keys/system_keyring.h and probably shouldn't be
> exposed directly if it can be avoided.
>
> David

Thanks for the review, I've sent V3 of this patch series; the
implementation changed a bit, would you mind taking a look at that patch
instead?
https://lore.kernel.org/lkml/20190116101654.7288-1-kas...@redhat.com/

-- 
Best Regards,
Kairui Song


Re: [PATCH v15 5/6] x86/boot: Parse SRAT address from RSDP and store immovable memory

2019-01-17 Thread Kairui Song
On Thu, Jan 17, 2019 at 3:58 PM Chao Fan  wrote:
>
> On Wed, Jan 16, 2019 at 03:28:52PM +0800, Kairui Song wrote:
> >On Mon, Jan 7, 2019 at 11:24 AM Chao Fan  wrote:
> >>
> >> +
> >> +/* Determine RSDP, based on acpi_os_get_root_pointer(). */
> >> +static acpi_physical_address get_rsdp_addr(void)
> >> +{
> >> +   acpi_physical_address pa;
> >> +
> >> +   pa = get_acpi_rsdp();
> >> +
> >> +   if (!pa)
> >> +   pa = efi_get_rsdp_addr();
> >> +
> >> +   if (!pa)
> >> +   pa = bios_get_rsdp_addr();
> >> +
> >> +   return pa;
> >> +}
> >
> >acpi_rsdp might be provided by boot_params.acpi_rsdp_addr too,
> >it's introduced in ae7e1238e68f2a for Xen PVH guest and later move to
> >boot_params in e6e094e053af,
> >kexec could use it to pass RSDP to second kernel as well. Please check
> >it as well make sure it always works.
> >
>
> Hi Kairui,
>
> I saw the parsing code has been added to kernel, but I didn't see
> where to fill in the 'acpi_rsdp_addr'. If only you(KEXEC) use it,
> I can add "#ifdef CONFIG_KEXEC", but you said Xen will use it also,
> so I didn't add ifdef to control it. I was trying to do as below:
>
> static inline acpi_physical_address get_boot_params_rsdp(void)
> {
> return boot_params->acpi_rsdp_addr;
> }
>
> static acpi_physical_address get_rsdp_addr(void)
> {
> bool boot_params_rsdp_exist;
> acpi_physical_address pa;
>
> pa = get_acpi_rsdp();
>
> if (!pa)
> pa = get_boot_params_rsdp();
>
> if (!pa) {
> pa = efi_get_rsdp_addr();
> boot_params_rsdp_exist = false;
> }
> else
> boot_params_rsdp_exist = true;
>
> if (!pa)
> pa = bios_get_rsdp_addr();
>
> if (pa && !boot_params_rsdp_exist)
> boot_params.acpi_rsdp_addr = pa;
>
> return pa;
> }
>
> At the same time, I notice kernel only parses it when
> "#ifdef CONFIG_ACPI", we should keep sync with kernel, but I think
> we are parsing SRAT, CONFIG_ACPI is needed sure, so I am going to
> update the define of EARLY_SRAT_PARSE:
>
> config EARLY_SRAT_PARSE
> bool "EARLY SRAT parsing"
> def_bool y
> depends on RANDOMIZE_BASE && MEMORY_HOTREMOVE && ACPI
>
> Boris, how do you think the update for the new acpi_rsdp_addr?
> If I misunderstand something, please let me know.
>
> Thanks,
> Chao Fan
>

Hi, thanks for considering the kexec usage,

but I think "boot_params_rsdp_exist" is not necessary;
boot_params->acpi_rsdp_addr should be either NULL or a valid value, and
later initialization code considers it a valid value if it's not
NULL.
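
So the fallback chain can stay linear, and the result can just be
written back unconditionally at the end. A rough sketch of the
suggestion, reusing the helper names from the quoted patch (not actual
kernel code):

  static acpi_physical_address get_rsdp_addr(void)
  {
          acpi_physical_address pa;

          pa = get_acpi_rsdp();               /* acpi_rsdp= on the cmdline */

          if (!pa)
                  pa = boot_params->acpi_rsdp_addr;   /* kexec/Xen provided */

          if (!pa)
                  pa = efi_get_rsdp_addr();

          if (!pa)
                  pa = bios_get_rsdp_addr();

          /* NULL means "not found"; any non-NULL value is treated as valid */
          boot_params->acpi_rsdp_addr = pa;

          return pa;
  }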

For the Xen usage I'm not sure either; the info comes from the commit
message of ae7e1238e68f2a, which is also where boot_params.acpi_rsdp_addr
was first introduced. Let's cc Juergen as well.


--
Best Regards,
Kairui Song


Re: [PATCH v2 2/2] x86, kexec_file_load: make it work with efi=noruntime or efi=old_map

2019-01-17 Thread Kairui Song
On Thu, Jan 17, 2019 at 3:51 PM Chao Fan  wrote:
>
> On Thu, Jan 17, 2019 at 03:41:13PM +0800, Kairui Song wrote:
> >On Wed, Jan 16, 2019 at 5:46 PM Borislav Petkov  wrote:
> >>
> >> On Wed, Jan 16, 2019 at 03:08:42PM +0800, Kairui Song wrote:
> >> > I didn't see a way to reuse things in that patch series, situation is
> >> > different, in that patch it needs to get RSDP in very early boot stage
> >> > so it did everything from scratch, in this patch kexec_file_load need
> >> > to get RSDP too, but everything is well setup so things are a lot
> >> > easier, just read from current boot_prams, efi and fallback to
> >> > acpi_find_root_pointer should be good.
> >>
> >> No no. Early code should find out that venerable RSDP thing once and
> >> will save it somewhere for further use. No gazillion parsings of it.
> >> Just once and share it with the rest of the code that needs it.
> >>
> >
> >How about we refill the boot_params.acpi_rsdp_addr if it is not valid
> >in early code, so it could be used as a reliable RSDP address source?
> >That should make things easier.
>
> I think it's OK.
> Try to read it, if get RSDP, use it.
> If not, search in EFI/BIOS/... and refill the RSDP to
> boot_params.acpi_rsdp_addr.
> By the way, I search kernel code, I didn't find other code fill and
> use it, only you(KEXEC) are trying to fill it.
> If I miss something, please let me know.

Yes, kexec would read the RSDP again to pass it to the second kernel,
and only if EFI is disabled (efi=noruntime/old_map; otherwise the second
kernel gets the RSDP just fine). Not sure if any other component would
use it.

>
> Thanks,
> Chao Fan

>

--
Best Regards,
Kairui Song


Re: [PATCH v2 2/2] x86, kexec_file_load: make it work with efi=noruntime or efi=old_map

2019-01-16 Thread Kairui Song
On Wed, Jan 16, 2019 at 5:46 PM Borislav Petkov  wrote:
>
> On Wed, Jan 16, 2019 at 03:08:42PM +0800, Kairui Song wrote:
> > I didn't see a way to reuse things in that patch series, situation is
> > different, in that patch it needs to get RSDP in very early boot stage
> > so it did everything from scratch, in this patch kexec_file_load need
> > to get RSDP too, but everything is well setup so things are a lot
> > easier, just read from current boot_prams, efi and fallback to
> > acpi_find_root_pointer should be good.
>
> No no. Early code should find out that venerable RSDP thing once and
> will save it somewhere for further use. No gazillion parsings of it.
> Just once and share it with the rest of the code that needs it.
>

How about we refill the boot_params.acpi_rsdp_addr if it is not valid
in early code, so it could be used as a reliable RSDP address source?
That should make things easier.

But if early code should parse and store it, that should be done in
Chao's patch, or I can post another patch to do it once Chao's patch is
merged.

For now I think it's good to have something like this in this patch
series to always keep storing acpi_rsdp in late code;
acpi_os_get_root_pointer_late (maybe come up with a better name later)
could be used anytime to get the RSDP with no extra parsing:

--- a/drivers/acpi/osl.c
+++ b/drivers/acpi/osl.c
@@ -180,8 +180,8 @@ void acpi_os_vprintf(const char *fmt, va_list args)
 #endif
 }

-#ifdef CONFIG_KEXEC
 static unsigned long acpi_rsdp;
+#ifdef CONFIG_KEXEC
 static int __init setup_acpi_rsdp(char *arg)
 {
return kstrtoul(arg, 16, &acpi_rsdp);
@@ -189,28 +189,38 @@ static int __init setup_acpi_rsdp(char *arg)
 early_param("acpi_rsdp", setup_acpi_rsdp);
 #endif

+acpi_physical_address acpi_os_get_root_pointer_late(void)
+{
+   return acpi_rsdp;
+}
+
 acpi_physical_address __init acpi_os_get_root_pointer(void)
 {
acpi_physical_address pa;

-#ifdef CONFIG_KEXEC
if (acpi_rsdp)
return acpi_rsdp;
-#endif
+
pa = acpi_arch_get_root_pointer();
-   if (pa)
+   if (pa) {
+   acpi_rsdp = pa;
return pa;
+   }

if (efi_enabled(EFI_CONFIG_TABLES)) {
-   if (efi.acpi20 != EFI_INVALID_TABLE_ADDR)
+   if (efi.acpi20 != EFI_INVALID_TABLE_ADDR) {
+   acpi_rsdp = efi.acpi20;
return efi.acpi20;
-   if (efi.acpi != EFI_INVALID_TABLE_ADDR)
+   }
+   if (efi.acpi != EFI_INVALID_TABLE_ADDR) {
+   acpi_rsdp = efi.acpi;
return efi.acpi;
+   }
pr_err(PREFIX "System description tables not found\n");
} else if (IS_ENABLED(CONFIG_ACPI_LEGACY_TABLES_LOOKUP)) {
acpi_find_root_pointer(&pa);
}

+   acpi_rsdp = pa;
return pa;
 }
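
As a usage illustration (hypothetical caller name, not part of the
diff above), the kexec side could then consume the cached value like:

static void kexec_fill_acpi_rsdp(struct boot_params *params)
{
	acpi_physical_address rsdp = acpi_os_get_root_pointer_late();

	/* The second kernel reads this field instead of re-parsing. */
	if (rsdp)
		params->acpi_rsdp_addr = rsdp;
}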

> --
> Regards/Gruss,
> Boris.
>
> Good mailing practices for 400: avoid top-posting and trim the reply.
--
Best Regards,
Kairui Song


[PATCH v3 1/2] integrity, KEYS: add a reference to platform keyring

2019-01-16 Thread Kairui Song
Currently, when loading a new kernel via the kexec_file_load syscall,
the kernel is able to verify a signed PE bzImage against
.builtin_trusted_keys or .secondary_trusted_keys. But the image could
be signed with third-party keys, which are provided by the platform or
firmware as EFI variables (e.g. stored in the MokListRT EFI variable),
and those keys won't be available in the keyrings mentioned above.

After commit 9dc92c45177a ('integrity: Define a trusted platform
keyring'), a .platform keyring is introduced to store the keys provided
by platform or firmware. This keyring is intended to be used for
verifying kernel images being loaded by the kexec_file_load syscall,
and with a few follow-up commits, keys provided by firmware are loaded
into this keyring and IMA-appraisal is able to use it to verify kernel
images. IMA is currently the only user of that keyring.

This patch exposes the .platform keyring and makes it usable by other
components. For example, kexec_file_load could use it to verify the
kernel image's signature.

Suggested-by: Mimi Zohar 
Signed-off-by: Kairui Song 
---
 certs/system_keyring.c| 9 +
 include/keys/system_keyring.h | 5 +
 security/integrity/digsig.c   | 6 ++
 3 files changed, 20 insertions(+)

diff --git a/certs/system_keyring.c b/certs/system_keyring.c
index 81728717523d..4690ef9cda8a 100644
--- a/certs/system_keyring.c
+++ b/certs/system_keyring.c
@@ -24,6 +24,9 @@ static struct key *builtin_trusted_keys;
 #ifdef CONFIG_SECONDARY_TRUSTED_KEYRING
 static struct key *secondary_trusted_keys;
 #endif
+#ifdef CONFIG_INTEGRITY_PLATFORM_KEYRING
+static struct key *platform_trusted_keys;
+#endif
 
 extern __initconst const u8 system_certificate_list[];
 extern __initconst const unsigned long system_certificate_list_size;
@@ -265,4 +268,10 @@ int verify_pkcs7_signature(const void *data, size_t len,
 }
 EXPORT_SYMBOL_GPL(verify_pkcs7_signature);
 
+#ifdef CONFIG_INTEGRITY_PLATFORM_KEYRING
+void __init set_platform_trusted_keys(struct key *keyring)
+{
+   platform_trusted_keys = keyring;
+}
+#endif
+
 #endif /* CONFIG_SYSTEM_DATA_VERIFICATION */
diff --git a/include/keys/system_keyring.h b/include/keys/system_keyring.h
index 359c2f936004..9e1b7849b6aa 100644
--- a/include/keys/system_keyring.h
+++ b/include/keys/system_keyring.h
@@ -61,5 +61,10 @@ static inline struct key *get_ima_blacklist_keyring(void)
 }
 #endif /* CONFIG_IMA_BLACKLIST_KEYRING */
 
+#ifdef CONFIG_INTEGRITY_PLATFORM_KEYRING
+
extern void __init set_platform_trusted_keys(struct key *keyring);
+
+#endif /* CONFIG_INTEGRITY_PLATFORM_KEYRING */
 
 #endif /* _KEYS_SYSTEM_KEYRING_H */
diff --git a/security/integrity/digsig.c b/security/integrity/digsig.c
index f45d6edecf99..bfabc2a8111d 100644
--- a/security/integrity/digsig.c
+++ b/security/integrity/digsig.c
@@ -89,6 +89,12 @@ static int __integrity_init_keyring(const unsigned int id, key_perm_t perm,
keyring[id] = NULL;
}
 
+#ifdef CONFIG_INTEGRITY_PLATFORM_KEYRING
+   if (id == INTEGRITY_KEYRING_PLATFORM) {
+   set_platform_trusted_keys(keyring[id]);
+   }
+#endif
+
return err;
 }
 
-- 
2.20.1



[PATCH v3 0/2] let kexec_file_load use platform keyring to verify the kernel image

2019-01-16 Thread Kairui Song
This patch series adds a .platform_trusted_keys pointer in
system_keyring as a reference to the .platform keyring in the integrity
subsystem; it is updated when the platform keyring is initialized, so
other components can use this keyring as well.

This patch series also lets kexec_file_load fall back to the platform
keyring if it fails to verify the image against the secondary keyring,
making it possible to load a kernel signed with a third-party key when
that key is imported in the firmware.

After this patch, kexec_file_load will be able to verify a signed PE
bzImage using keys in the platform keyring.

Tested in a VM: the kernel was signed locally with pesign and the cert
imported into EFI's MokList variable.
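
The intended fallback order, sketched (illustrative only; the actual
change is in patch 2/2):

	ret = verify_pefile_signature(kernel, kernel_len,
				      VERIFY_USE_SECONDARY_KEYRING, ...);
	if (ret == -ENOKEY && IS_ENABLED(CONFIG_INTEGRITY_PLATFORM_KEYRING))
		ret = verify_pefile_signature(kernel, kernel_len,
					      VERIFY_USE_PLATFORM_KEYRING, ...);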

Update from V2:
  - Use IS_ENABLED in kexec_file_load to judge if platform_trusted_keys
should be used for verifying image as suggested by Mimi Zohar

Update from V1:
  - Make platform_trusted_keys static, and update commit message as suggested
by Mimi Zohar
  - Always check if platform keyring is initialized before use it

Kairui Song (2):
  integrity, KEYS: add a reference to platform keyring
  kexec, KEYS: Make use of platform keyring for signature verify

 arch/x86/kernel/kexec-bzimage64.c | 13 ++---
 certs/system_keyring.c| 22 +-
 include/keys/system_keyring.h |  5 +
 include/linux/verification.h  |  1 +
 security/integrity/digsig.c   |  6 ++
 5 files changed, 43 insertions(+), 4 deletions(-)

-- 
2.20.1


[PATCH v3 2/2] kexec, KEYS: Make use of platform keyring for signature verify

2019-01-16 Thread Kairui Song
With KEXEC_BZIMAGE_VERIFY_SIG enabled, kexec_file_load needs to verify
the kernel image. The image might be signed with third-party keys, and
those keys could be stored in firmware and then loaded into the
.platform keyring. Now that we have the symbol platform_trusted_keys as
a reference to the .platform keyring, this patch makes use of it and
allows kexec_file_load to verify the image against keys in the
.platform keyring.

This commit adds a VERIFY_USE_PLATFORM_KEYRING, similar to the previous
VERIFY_USE_SECONDARY_KEYRING, indicating that verify_pkcs7_signature
should verify the signature using the platform keyring. Also, decrease
the error message log level when verification fails with -ENOKEY, so
that if the caller tries multiple times with different keyrings it
won't generate extra noise.

Signed-off-by: Kairui Song 
---
 arch/x86/kernel/kexec-bzimage64.c | 13 ++---
 certs/system_keyring.c| 13 -
 include/linux/verification.h  |  1 +
 3 files changed, 23 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/kexec-bzimage64.c b/arch/x86/kernel/kexec-bzimage64.c
index 7d97e432cbbc..2c007abd3d40 100644
--- a/arch/x86/kernel/kexec-bzimage64.c
+++ b/arch/x86/kernel/kexec-bzimage64.c
@@ -534,9 +534,16 @@ static int bzImage64_cleanup(void *loader_data)
 #ifdef CONFIG_KEXEC_BZIMAGE_VERIFY_SIG
 static int bzImage64_verify_sig(const char *kernel, unsigned long kernel_len)
 {
-   return verify_pefile_signature(kernel, kernel_len,
-  VERIFY_USE_SECONDARY_KEYRING,
-  VERIFYING_KEXEC_PE_SIGNATURE);
+   int ret;
+
+   ret = verify_pefile_signature(kernel, kernel_len,
+ VERIFY_USE_SECONDARY_KEYRING,
+ VERIFYING_KEXEC_PE_SIGNATURE);
+   if (ret == -ENOKEY && IS_ENABLED(CONFIG_INTEGRITY_PLATFORM_KEYRING)) {
+   ret = verify_pefile_signature(kernel, kernel_len,
+ VERIFY_USE_PLATFORM_KEYRING,
+ VERIFYING_KEXEC_PE_SIGNATURE);
+   }
+   return ret;
 }
 #endif
 
diff --git a/certs/system_keyring.c b/certs/system_keyring.c
index 4690ef9cda8a..7085c286f4bd 100644
--- a/certs/system_keyring.c
+++ b/certs/system_keyring.c
@@ -240,11 +240,22 @@ int verify_pkcs7_signature(const void *data, size_t len,
 #else
trusted_keys = builtin_trusted_keys;
 #endif
+   } else if (trusted_keys == VERIFY_USE_PLATFORM_KEYRING) {
+#ifdef CONFIG_INTEGRITY_PLATFORM_KEYRING
+   trusted_keys = platform_trusted_keys;
+#else
+   trusted_keys = NULL;
+#endif
+   if (!trusted_keys) {
+   ret = -ENOKEY;
+   pr_devel("PKCS#7 platform keyring is not available\n");
+   goto error;
+   }
}
ret = pkcs7_validate_trust(pkcs7, trusted_keys);
if (ret < 0) {
if (ret == -ENOKEY)
-   pr_err("PKCS#7 signature not signed with a trusted key\n");
+   pr_devel("PKCS#7 signature not signed with a trusted key\n");
goto error;
}
 
diff --git a/include/linux/verification.h b/include/linux/verification.h
index cfa4730d607a..018fb5f13d44 100644
--- a/include/linux/verification.h
+++ b/include/linux/verification.h
@@ -17,6 +17,7 @@
  * should be used.
  */
 #define VERIFY_USE_SECONDARY_KEYRING ((struct key *)1UL)
+#define VERIFY_USE_PLATFORM_KEYRING  ((struct key *)2UL)
 
 /*
  * The use to which an asymmetric key is being put.
-- 
2.20.1
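
For reference, a hedged usage sketch of the new marker from a
hypothetical caller (the data and signature buffers are placeholders;
only the keyring argument is what this patch adds):

	/* Verify a detached PKCS#7 signature against the platform keyring. */
	ret = verify_pkcs7_signature(data, data_len, sig, sig_len,
				     VERIFY_USE_PLATFORM_KEYRING,
				     VERIFYING_UNSPECIFIED_SIGNATURE,
				     NULL, NULL);
	if (ret == -ENOKEY)
		pr_devel("no matching key in the platform keyring\n");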



Re: [PATCH v15 5/6] x86/boot: Parse SRAT address from RSDP and store immovable memory

2019-01-15 Thread Kairui Song
> +   }
> +   table = (struct acpi_subtable_header *)
> +   ((unsigned long)table + table->length);
> +   }
> +   num_immovable_mem = i;
> +}
> diff --git a/arch/x86/boot/compressed/kaslr.c 
> b/arch/x86/boot/compressed/kaslr.c
> index 9ed9709d9947..b251572e77af 100644
> --- a/arch/x86/boot/compressed/kaslr.c
> +++ b/arch/x86/boot/compressed/kaslr.c
> @@ -87,10 +87,6 @@ static unsigned long get_boot_seed(void)
>  #define KASLR_COMPRESSED_BOOT
>  #include "../../lib/kaslr.c"
>
> -struct mem_vector {
> -   unsigned long long start;
> -   unsigned long long size;
> -};
>
>  /* Only supporting at most 4 unusable memmap regions with kaslr */
>  #define MAX_MEMMAP_REGIONS 4
> diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h
> index a1d5918765f3..b49748366a5b 100644
> --- a/arch/x86/boot/compressed/misc.h
> +++ b/arch/x86/boot/compressed/misc.h
> @@ -77,6 +77,11 @@ void choose_random_location(unsigned long input,
> unsigned long *output,
> unsigned long output_size,
> unsigned long *virt_addr);
> +struct mem_vector {
> +   unsigned long long start;
> +   unsigned long long size;
> +};
> +
>  /* cpuflags.c */
>  bool has_cpuflag(int flag);
>  #else
> @@ -116,3 +121,17 @@ static inline void console_init(void)
>  void set_sev_encryption_mask(void);
>
>  #endif
> +
> +/* acpi.c */
> +#ifdef CONFIG_RANDOMIZE_BASE
> +/* Amount of immovable memory regions */
> +int num_immovable_mem;
> +#endif
> +
> +#ifdef CONFIG_EARLY_SRAT_PARSE
> +void get_immovable_mem(void);
> +#else
> +static void get_immovable_mem(void)
> +{
> +}
> +#endif
> --
> 2.20.1
>
>
>


-- 
Best Regards,
Kairui Song

