[PATCH] x86/efi: update e820 about reserved EFI boot services data to fix kexec breakage

2019-12-03 Thread Dave Young
Michael Weiser reported he got below error during a kexec rebooting:
esrt: Unsupported ESRT version 2904149718861218184.

The ESRT memory stays in EFI boot services data, and it was reserved
in kernel via efi_mem_reserve().  The initial purpose of the reservation
is to reuse the EFI boot services data across kexec reboot. For example
the BGRT image data and some ESRT memory like Michael reported. 

But although the memory is reserved it is not updated in X86 e820 table.
And kexec_file_load iterate system ram in io resource list to find places
for kernel, initramfs and other stuff. In Michael's case the kexec loaded
initramfs overwritten the ESRT memory and then the failure happened.

Since kexec_file_load depends on the e820 to be updated, just fix this
by updating the reserved EFI boot services memory as reserved type in e820.

Originally any memory descriptors with EFI_MEMORY_RUNTIME attribute are
bypassed in the reservation code path because they are assumed as reserved.
But the reservation is still needed for multiple kexec reboot.
And it is the only possible case we come here thus just drop the code
chunk then everything works without side effects. 

On my machine the ESRT memory sits in an EFI runtime data range, it does
not trigger the problem, but I successfully tested with BGRT instead.
both kexec_load and kexec_file_load work and kdump works as well.

Signed-off-by: Dave Young 
---
 arch/x86/platform/efi/quirks.c |6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

--- linux-x86.orig/arch/x86/platform/efi/quirks.c
+++ linux-x86/arch/x86/platform/efi/quirks.c
@@ -260,10 +260,6 @@ void __init efi_arch_mem_reserve(phys_ad
return;
}
 
-   /* No need to reserve regions that will never be freed. */
-   if (md.attribute & EFI_MEMORY_RUNTIME)
-   return;
-
size += addr % EFI_PAGE_SIZE;
size = round_up(size, EFI_PAGE_SIZE);
addr = round_down(addr, EFI_PAGE_SIZE);
@@ -293,6 +289,8 @@ void __init efi_arch_mem_reserve(phys_ad
early_memunmap(new, new_size);
 
efi_memmap_install(new_phys, num_entries);
+   e820__range_update(addr, size, E820_TYPE_RAM, E820_TYPE_RESERVED);
+   e820__update_table(e820_table);
 }
 
 /*


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: kexec_file overwrites reserved EFI ESRT memory

2019-12-03 Thread Dave Young
On 12/03/19 at 10:11pm, Michael Weiser wrote:
> Hi Dave,
> 
> On Tue, Dec 03, 2019 at 07:54:35PM +0800, Dave Young wrote:
> 
> > > Neither adding add_efi_memmap nor adding your patch and setting that 
> > > option
> > > does make the ESRT memory region appear in /proc/iomem. kexec_file still
> > > loads the kernel across the ESRT region.
> > Hmm, sorry, my bad, actuall add_efi_memmap does not consider the
> > EFI_MEMORY_RUNTIME attribute, it only reads the memory descriptor types.
> 
> > Will read your replied information later, did not get time today, but
> > probably below chunk can help?
> 
> > diff --git a/arch/x86/platform/efi/quirks.c b/arch/x86/platform/efi/quirks.c
> > index 3b9fd679cea9..516307617621 100644
> > --- a/arch/x86/platform/efi/quirks.c
> > +++ b/arch/x86/platform/efi/quirks.c
> > @@ -293,6 +293,8 @@ void __init efi_arch_mem_reserve(phys_addr_t addr, u64 
> > size)
> > early_memunmap(new, new_size);
> 
> > efi_memmap_install(new_phys, num_entries);
> > +   e820__range_update(addr, size, E820_TYPE_RAM, E820_TYPE_RESERVED);
> > +   e820__update_table(e820_table);
> >  }
> 
> >  /*
> 
> Yes, that did it:
> 
> -0fff : Reserved
> 1000-0009efff : System RAM
> 0009f000-000f : Reserved
>   000a-000b : PCI Bus :00
>   000e-000e3fff : PCI Bus :00
>   000e4000-000e7fff : PCI Bus :00
>   000e8000-000ebfff : PCI Bus :00
>   000ec000-000e : PCI Bus :00
>   000f-000f : PCI Bus :00
> 000f-000f : System ROM
> 0010-74dd1fff : System RAM
>   6500-6aff : Crash kernel
> 74dd2000-74dd2fff : Reserved   <- ESRT
> 74dd3000-763f5fff : System RAM
> 763f6000-79974fff : Reserved
> 79975000-799f1fff : ACPI Tables
> 799f2000-79aa6fff : ACPI Non-volatile Storage
>   79a17000-79a17fff : USBC000:00

Ok, good to know it works.  I will think about it and file a patch
later.  There are more things to consider, eg. kexec reboot multiple
times, userspace kexec loader etc.

If we choose to fix it in kexec_file path to avoid those region then we
need to do same in userspace, there will be compatibility issues so I
would still prefer to go with this way you tested.

BTW, on my laptop the ESRT stays in EFI runtime area so I do not see the
problem.  This should be machine/firmware specific.

Here is the info on my laptop:
[0.00] efi: mem34: [Runtime Data   |RUN|  |  |  |  |  |  |   
|WB|WT|WC|UC] range=[0x7a4b-0x7a676fff] (1MB)
[0.020670] esrt: Reserving ESRT space from 0x7a4ec000 to 
0x7a4ec088.

Thanks
Dave


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


[PATCH] makedumpfile/s390: Add get_kaslr_offset() for s390x

2019-12-03 Thread Mikhail Zaslonko
Since kernel v5.2 KASLR is supported on s390. In makedumpfile however no
support has been added yet. This patch adds the arch specific function
get_kaslr_offset() for s390x.
Since the values in vmcoreinfo are already relocated, the patch is
mainly relevant for vmlinux processing (-x option).

Signed-off-by: Philipp Rudo 
Signed-off-by: Mikhail Zaslonko 
---
 arch/s390x.c   | 32 
 makedumpfile.h |  3 ++-
 2 files changed, 34 insertions(+), 1 deletion(-)

diff --git a/arch/s390x.c b/arch/s390x.c
index bf9d58e..892df14 100644
--- a/arch/s390x.c
+++ b/arch/s390x.c
@@ -122,6 +122,38 @@ get_machdep_info_s390x(void)
return TRUE;
 }
 
+unsigned long
+get_kaslr_offset_s390x(unsigned long vaddr)
+{
+   unsigned int i;
+   char buf[BUFSIZE_FGETS], *endp;
+
+   if (!info->file_vmcoreinfo)
+   return FALSE;
+
+   if (fseek(info->file_vmcoreinfo, 0, SEEK_SET) < 0) {
+   ERRMSG("Can't seek the vmcoreinfo file(%s). %s\n",
+  info->name_vmcoreinfo, strerror(errno));
+   return FALSE;
+   }
+
+   while (fgets(buf, BUFSIZE_FGETS, info->file_vmcoreinfo)) {
+   i = strlen(buf);
+   if (!i)
+   break;
+   if (buf[i - 1] == '\n')
+   buf[i - 1] = '\0';
+   if (strncmp(buf, STR_KERNELOFFSET,
+   strlen(STR_KERNELOFFSET)) == 0) {
+   info->kaslr_offset =
+   strtoul(buf + strlen(STR_KERNELOFFSET), , 
16);
+   DEBUG_MSG("info->kaslr_offset: %lx\n", 
info->kaslr_offset);
+   }
+   }
+
+   return info->kaslr_offset;
+}
+
 static int
 is_vmalloc_addr_s390x(unsigned long vaddr)
 {
diff --git a/makedumpfile.h b/makedumpfile.h
index ac11e90..26f6247 100644
--- a/makedumpfile.h
+++ b/makedumpfile.h
@@ -1071,11 +1071,12 @@ unsigned long long vaddr_to_paddr_ppc(unsigned long 
vaddr);
 int get_machdep_info_s390x(void);
 unsigned long long vaddr_to_paddr_s390x(unsigned long vaddr);
 int is_iomem_phys_addr_s390x(unsigned long addr);
+unsigned long get_kaslr_offset_s390x(unsigned long vaddr);
 #define find_vmemmap() stub_false()
 #define get_phys_base()stub_true()
 #define get_machdep_info() get_machdep_info_s390x()
 #define get_versiondep_info()  stub_true()
-#define get_kaslr_offset(X)stub_false()
+#define get_kaslr_offset(X)get_kaslr_offset_s390x(X)
 #define vaddr_to_paddr(X)  vaddr_to_paddr_s390x(X)
 #define paddr_to_vaddr(X)  paddr_to_vaddr_general(X)
 #define is_phys_addr(X)is_iomem_phys_addr_s390x(X)
-- 
2.17.1


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: kexec_file overwrites reserved EFI ESRT memory

2019-12-03 Thread Michael Weiser
Hi Dave,

On Tue, Dec 03, 2019 at 07:54:35PM +0800, Dave Young wrote:

> > Neither adding add_efi_memmap nor adding your patch and setting that option
> > does make the ESRT memory region appear in /proc/iomem. kexec_file still
> > loads the kernel across the ESRT region.
> Hmm, sorry, my bad, actuall add_efi_memmap does not consider the
> EFI_MEMORY_RUNTIME attribute, it only reads the memory descriptor types.

> Will read your replied information later, did not get time today, but
> probably below chunk can help?

> diff --git a/arch/x86/platform/efi/quirks.c b/arch/x86/platform/efi/quirks.c
> index 3b9fd679cea9..516307617621 100644
> --- a/arch/x86/platform/efi/quirks.c
> +++ b/arch/x86/platform/efi/quirks.c
> @@ -293,6 +293,8 @@ void __init efi_arch_mem_reserve(phys_addr_t addr, u64 
> size)
>   early_memunmap(new, new_size);

>   efi_memmap_install(new_phys, num_entries);
> + e820__range_update(addr, size, E820_TYPE_RAM, E820_TYPE_RESERVED);
> + e820__update_table(e820_table);
>  }

>  /*

Yes, that did it:

-0fff : Reserved
1000-0009efff : System RAM
0009f000-000f : Reserved
  000a-000b : PCI Bus :00
  000e-000e3fff : PCI Bus :00
  000e4000-000e7fff : PCI Bus :00
  000e8000-000ebfff : PCI Bus :00
  000ec000-000e : PCI Bus :00
  000f-000f : PCI Bus :00
000f-000f : System ROM
0010-74dd1fff : System RAM
  6500-6aff : Crash kernel
74dd2000-74dd2fff : Reserved   <- ESRT
74dd3000-763f5fff : System RAM
763f6000-79974fff : Reserved
79975000-799f1fff : ACPI Tables
799f2000-79aa6fff : ACPI Non-volatile Storage
  79a17000-79a17fff : USBC000:00

[0.001381] esrt: Reserving ESRT space from 0x74dd2f98 to 
0x74dd2fd0.
[0.001382] memblock_reserve: [0x74dd2f98-0x74dd2fcf] 
efi_mem_reserve+0x1d/0x2b
[0.001383] memblock_reserve: [0x0009e640-0x0009efcf] 
memblock_alloc_range_nid+0x93/0xfa
[0.001384] e820: update [mem 0x74dd2000-0x74dd2fff] usable ==> reserved
[...]
[0.043610] PM: Registered nosave memory: [mem 0x-0x0fff]
[0.043611] memblock_alloc_try_nid: 32 bytes align=0x40 nid=-1 
from=0x max_addr=0x 
__register_nosave_region+0x6b/0xca
[0.043612] memblock_reserve: [0x00047dff95c0-0x00047dff95df] 
memblock_alloc_range_nid+0x93/0xfa
[0.043613] PM: Registered nosave memory: [mem 0x0009f000-0x000f]
[0.043615] memblock_alloc_try_nid: 32 bytes align=0x40 nid=-1 
from=0x max_addr=0x 
__register_nosave_region+0x6b/0xca
[0.043616] memblock_reserve: [0x00047dff9580-0x00047dff959f] 
memblock_alloc_range_nid+0x93/0xfa
[0.043617] PM: Registered nosave memory: [mem 0x74dd2000-0x74dd2fff]   
< ESRT
[0.043618] memblock_alloc_try_nid: 32 bytes align=0x40 nid=-1 
from=0x max_addr=0x 
__register_nosave_region+0x6b/0xca
[0.043619] memblock_reserve: [0x00047dff9540-0x00047dff955f] 
memblock_alloc_range_nid+0x93/0xfa
[0.043620] PM: Registered nosave memory: [mem 0x763f6000-0x79974fff]
[0.043620] PM: Registered nosave memory: [mem 0x79975000-0x799f1fff]
[0.043621] PM: Registered nosave memory: [mem 0x799f2000-0x79aa6fff]
[0.043621] PM: Registered nosave memory: [mem 0x79aa7000-0x7a40dfff]
[...]
[5.993928] PCI: pci_cache_line_size set to 64 bytes
[5.994563] e820: reserve RAM buffer [mem 0x0009f000-0x0009]
[5.994565] e820: reserve RAM buffer [mem 0x74dd2000-0x77ff]
<- ESRT
[5.994565] e820: reserve RAM buffer [mem 0x763f6000-0x77ff]
[5.994566] e820: reserve RAM buffer [mem 0x7a40f000-0x7bff]
[5.994567] e820: reserve RAM buffer [mem 0x47e00-0x47fff]
[5.995513] acpi PNP0C14:02: duplicate WMI GUID 
05901221-D566-11D1-B2F0-00A0C9062910 (first instance was on PNP0C14:01)
[5.995549] acpi PNP0C14:03: duplicate WMI GUID 
05901221-D566-11D1-B2F0-00A0C9062910 (first instance was on PNP0C14:01)
[...]
[   86.508053] kexec-bzImage64: Loaded purgatory at 0x98000
[   86.508056] kexec_file: Considering 0x1000-0x9efff
[   86.508057] kexec-bzImage64: Loaded boot_param, command line and misc at 
0x96000 bufsz=0x1240 memsz=0x1240
[   86.508057] kexec_file: Considering 0x10-0x74dd1fff
[   86.508058] kexec-bzImage64: Loaded 64bit kernel at 0x7200 
bufsz=0x1140888 memsz=0x24b7000
[   86.508058] kexec-bzImage64: Final command line is: 
[   86.584668] kexec_file: Loading segment 0: buf=0xd5ec82bc 
bufsz=0x5000 mem=0x98000 memsz=0x6000
[   86.584672] kexec_file: Loading segment 1: buf=0xaf539c69 
bufsz=0x1240 mem=0x96000 memsz=0x2000
[   86.584674] kexec_file: Loading segment 2: buf=0x29f9b9a8 
bufsz=0x1140888 mem=0x7200 memsz=0x24b7000   < not ESRT :)

And no more invalid version error message from the kexec'd kernel.
-- 
Thanks,
Michael


[PATCH v2 2/2] efi: arm64: Introduce /proc/efi/memreserve to tell the persistent pages

2019-12-03 Thread Masayoshi Mizuma
From: Masayoshi Mizuma 

kexec reboot stops in early boot sequence because efi_config_parse_tables()
refers garbage data. We can see the log with memblock=debug kernel option:

  efi:  ACPI 2.0=0x9821790014  PROP=0x8757f5c0  SMBIOS 3.0=0x982074  
MEMRESERVE=0x9820bfdc58
  memblock_reserve: [0x009820bfdc58-0x009820bfdc67] 
efi_config_parse_tables+0x228/0x278
  memblock_reserve: [0x8276-0x324d07ff] 
efi_config_parse_tables+0x228/0x278
  memblock_reserve: [0xcc4f84ecc0511670-0x5f6e5214a7fd91f9] 
efi_config_parse_tables+0x244/0x278
  memblock_reserve: [0xd2fd4144b9af693d-0xad0c1db1086f40a2] 
efi_config_parse_tables+0x244/0x278
  memblock_reserve: [0x0c719bb159b1fadc-0x5aa6e62a1417ce12] 
efi_config_parse_tables+0x244/0x278
  ...

That happens because 0x8276, struct linux_efi_memreserve, is destroyed.
0x8276 is pointed from efi.mem_reseve, and efi.mem_reserve points the
head page of LPI pending table and LPI property table which are allocated by
gic_reserve_range().

The destroyer is kexec. kexec locates the initrd to the area:

  ]# kexec -d -l /boot/vmlinuz-5.4.0-rc7 /boot/initramfs-5.4.0-rc7.img 
--reuse-cmdline
  ...
  initrd: base 8229, size 388dd8ah (59301258)
  ...

>From dynamic debug log. initrd is located in segment[1]:
  machine_kexec_prepare:70:
kexec kimage info:
  type:0
  start:   85b30680
  head:0
  nr_segments: 4
segment[0]: 8048 - 8229, 0x1e1 bytes, 481 
pages
segment[1]: 8229 - 85b2, 0x389 bytes, 905 
pages
segment[2]: 85b2 - 85b3, 0x1 bytes, 1 pages
segment[3]: 85b3 - 85b4, 0x1 bytes, 1 pages

kexec searches the memory region to locate initrd through
"System RAM" in /proc/iomem. The pending tables are included in
"System RAM" because they are allocated by alloc_pages(), so kexec
destroys the LPI pending tables.

Introduce /proc/efi/memreserve to tell the pages pointed by
efi.mem_reserve so that kexec can avoid the area to locate initrd.

Signed-off-by: Masayoshi Mizuma 
---
 drivers/firmware/efi/efi.c | 75 --
 1 file changed, 72 insertions(+), 3 deletions(-)

diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
index d8157cb34..80bbe0b3e 100644
--- a/drivers/firmware/efi/efi.c
+++ b/drivers/firmware/efi/efi.c
@@ -325,17 +325,87 @@ static __init int efivar_ssdt_load(void)
 static inline int efivar_ssdt_load(void) { return 0; }
 #endif
 
+static struct linux_efi_memreserve *efi_memreserve_root __ro_after_init;
+
 #ifdef CONFIG_PROC_FS
 static struct proc_dir_entry *proc_efi;
+#ifdef CONFIG_KEXEC
+static int memreserve_show(struct seq_file *m, void *v)
+{
+   struct linux_efi_memreserve *rsv;
+   phys_addr_t start, end;
+   unsigned long prsv;
+   int count, i;
+
+   if ((efi_memreserve_root == (void *)ULONG_MAX) ||
+   (!efi_memreserve_root))
+   return -ENODEV;
+
+   for (prsv = efi_memreserve_root->next; prsv; prsv = rsv->next) {
+   rsv = memremap(prsv, sizeof(*rsv), MEMREMAP_WB);
+   if (!rsv) {
+   pr_err("Could not map efi_memreserve\n");
+   return -ENOMEM;
+   }
+   count = atomic_read(>count);
+   for (i = 0; i < count; i++) {
+   start = rsv->entry[i].base;
+   end = start + rsv->entry[i].size - 1;
+
+   seq_printf(m, "%pa-%pa\n", , );
+   }
+   memunmap(rsv);
+   }
+
+   return 0;
+}
+
+static int memreserve_open(struct inode *inode, struct file *filp)
+{
+   return single_open(filp, memreserve_show, NULL);
+}
+
+static const struct file_operations memreserve_fops = {
+   .owner  = THIS_MODULE,
+   .open   = memreserve_open,
+   .read   = seq_read,
+   .llseek = seq_lseek,
+   .release= single_release,
+};
+
+static int __init efi_proc_memreserve(void)
+{
+   struct proc_dir_entry *pde;
+
+   if ((efi_memreserve_root == (void *)ULONG_MAX) ||
+   (!efi_memreserve_root))
+   return 0;
+
+   pde = proc_create("memreserve", 0444, proc_efi, _fops);
+   if (!pde) {
+   pr_err("/proc/efi: Cannot create /proc/efi/memreserve file.\n");
+   return 1;
+   }
+
+   return 0;
+}
+#else
+static inline int efi_proc_memreserve(void) { return 0; }
+#endif /* CONFIG_KEXEC */
+
 static int __init efi_proc_init(void)
 {
+   int error = 1;
+
proc_efi = proc_mkdir("efi", NULL);
if (!proc_efi) {
pr_err("/proc/efi: Cannot create /proc/efi directroy.\n");
-   return 1;
+   return error;
}
 
-   return 0;
+   error = efi_proc_memreserve();
+
+   return error;
 }
 #else
 

[PATCH v2 1/2] efi: add /proc/efi directory

2019-12-03 Thread Masayoshi Mizuma
From: Masayoshi Mizuma 

Add /proc/efi directory to show some efi internal information.

Signed-off-by: Masayoshi Mizuma 
---
 drivers/firmware/efi/efi.c | 22 ++
 1 file changed, 22 insertions(+)

diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
index d101f072c..d8157cb34 100644
--- a/drivers/firmware/efi/efi.c
+++ b/drivers/firmware/efi/efi.c
@@ -325,6 +325,22 @@ static __init int efivar_ssdt_load(void)
 static inline int efivar_ssdt_load(void) { return 0; }
 #endif
 
+#ifdef CONFIG_PROC_FS
+static struct proc_dir_entry *proc_efi;
+static int __init efi_proc_init(void)
+{
+   proc_efi = proc_mkdir("efi", NULL);
+   if (!proc_efi) {
+   pr_err("/proc/efi: Cannot create /proc/efi directory.\n");
+   return 1;
+   }
+
+   return 0;
+}
+#else
+static inline int efi_proc_init(void) { return 0; }
+#endif /* CONFIG_PROC_FS */
+
 /*
  * We register the efi subsystem with the firmware subsystem and the
  * efivars subsystem with the efi subsystem, if the system was booted with
@@ -381,6 +397,12 @@ static int __init efisubsys_init(void)
goto err_remove_group;
}
 
+   error = efi_proc_init();
+   if (error) {
+   sysfs_remove_mount_point(efi_kobj, "efivars");
+   goto err_remove_group;
+   }
+
return 0;
 
 err_remove_group:
-- 
2.18.1


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


[PATCH v2 0/2] efi: arm64: Introduce /proc/efi/memreserve to tell the persistent pages

2019-12-03 Thread Masayoshi Mizuma
From: Masayoshi Mizuma 

kexec reboot sometime fails in early boot sequence on aarch64 machine.
That is because kexec overwrites the LPI property tables and pending
tables with the initrd.

To avoid the overwrite, introduce /proc/efi/memreserve to tell the
tables region to kexec so that kexec can avoid the memory region to
locate initrd.

kexec also needs a patch to handle /proc/efi/memreserve. I'm preparing
the patch for kexec.

Changelog
v2: - Change memreserve file location from sysfs to procfs.
  memreserve may exceed the PAGE_SIZE in case efi_memreserve_root
  has a lot of entries. So we cannot use sysfs_kf_seq_show().
  Use seq_printf() in procfs instead.

Masayoshi Mizuma (2):
  efi: add /proc/efi directory
  efi: arm64: Introduce /proc/efi/memreserve to tell the persistent
pages

 drivers/firmware/efi/efi.c | 93 +-
 1 file changed, 92 insertions(+), 1 deletion(-)

-- 
2.18.1


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH v4 7/8] linux/log2.h: Fix 64bit calculations in roundup/down_pow_two()

2019-12-03 Thread Chuck Lever



> On Dec 3, 2019, at 6:47 AM, Nicolas Saenz Julienne  
> wrote:
> 
> Some users need to make sure their rounding function accepts and returns
> 64bit long variables regardless of the architecture. Sadly
> roundup/rounddown_pow_two() takes and returns unsigned longs. It turns
> out ilog2() already handles 32/64bit calculations properly, and being
> the building block to the round functions we can rework them as a
> wrapper around it.
> 
> Suggested-by: Robin Murphy 
> Signed-off-by: Nicolas Saenz Julienne 
> ---
> drivers/clk/clk-divider.c|  8 ++--
> drivers/clk/sunxi/clk-sunxi.c|  2 +-
> drivers/infiniband/hw/hfi1/chip.c|  4 +-
> drivers/infiniband/hw/hfi1/init.c|  4 +-
> drivers/infiniband/hw/mlx4/srq.c |  2 +-
> drivers/infiniband/hw/mthca/mthca_srq.c  |  2 +-
> drivers/infiniband/sw/rxe/rxe_qp.c   |  4 +-
> drivers/iommu/intel-iommu.c  |  4 +-
> drivers/iommu/intel-svm.c|  4 +-
> drivers/iommu/intel_irq_remapping.c  |  2 +-
> drivers/net/ethernet/amd/xgbe/xgbe-ethtool.c |  4 +-
> drivers/net/ethernet/marvell/sky2.c  |  2 +-
> drivers/net/ethernet/rocker/rocker_hw.h  |  4 +-
> drivers/net/ethernet/sfc/ef10.c  |  2 +-
> drivers/net/ethernet/sfc/efx.h   |  2 +-
> drivers/net/ethernet/sfc/falcon/efx.h|  2 +-
> drivers/pci/msi.c|  2 +-
> include/linux/log2.h | 44 +---
> kernel/kexec_core.c  |  3 +-
> lib/rhashtable.c |  2 +-
> net/sunrpc/xprtrdma/verbs.c  |  2 +-
> 21 files changed, 41 insertions(+), 64 deletions(-)
> 
> diff --git a/drivers/clk/clk-divider.c b/drivers/clk/clk-divider.c
> index 098b2b01f0af..ba947e4c8193 100644
> --- a/drivers/clk/clk-divider.c
> +++ b/drivers/clk/clk-divider.c
> @@ -222,7 +222,7 @@ static int _div_round_up(const struct clk_div_table 
> *table,
>   int div = DIV_ROUND_UP_ULL((u64)parent_rate, rate);
> 
>   if (flags & CLK_DIVIDER_POWER_OF_TWO)
> - div = __roundup_pow_of_two(div);
> + div = roundup_pow_of_two(div);
>   if (table)
>   div = _round_up_table(table, div);
> 
> @@ -240,8 +240,8 @@ static int _div_round_closest(const struct clk_div_table 
> *table,
>   down = parent_rate / rate;
> 
>   if (flags & CLK_DIVIDER_POWER_OF_TWO) {
> - up = __roundup_pow_of_two(up);
> - down = __rounddown_pow_of_two(down);
> + up = roundup_pow_of_two(up);
> + down = rounddown_pow_of_two(down);
>   } else if (table) {
>   up = _round_up_table(table, up);
>   down = _round_down_table(table, down);
> @@ -278,7 +278,7 @@ static int _next_div(const struct clk_div_table *table, 
> int div,
>   div++;
> 
>   if (flags & CLK_DIVIDER_POWER_OF_TWO)
> - return __roundup_pow_of_two(div);
> + return roundup_pow_of_two(div);
>   if (table)
>   return _round_up_table(table, div);
> 
> diff --git a/drivers/clk/sunxi/clk-sunxi.c b/drivers/clk/sunxi/clk-sunxi.c
> index 27201fd26e44..faec99dc09c0 100644
> --- a/drivers/clk/sunxi/clk-sunxi.c
> +++ b/drivers/clk/sunxi/clk-sunxi.c
> @@ -311,7 +311,7 @@ static void sun6i_get_ahb1_factors(struct factors_request 
> *req)
> 
>   calcm = DIV_ROUND_UP(div, 1 << calcp);
>   } else {
> - calcp = __roundup_pow_of_two(div);
> + calcp = roundup_pow_of_two(div);
>   calcp = calcp > 3 ? 3 : calcp;
>   }
> 
> diff --git a/drivers/infiniband/hw/hfi1/chip.c 
> b/drivers/infiniband/hw/hfi1/chip.c
> index 9b1fb84a3d45..96b1d343c32f 100644
> --- a/drivers/infiniband/hw/hfi1/chip.c
> +++ b/drivers/infiniband/hw/hfi1/chip.c
> @@ -14199,10 +14199,10 @@ static int qos_rmt_entries(struct hfi1_devdata *dd, 
> unsigned int *mp,
>   max_by_vl = krcvqs[i];
>   if (max_by_vl > 32)
>   goto no_qos;
> - m = ilog2(__roundup_pow_of_two(max_by_vl));
> + m = ilog2(roundup_pow_of_two(max_by_vl));
> 
>   /* determine bits for vl */
> - n = ilog2(__roundup_pow_of_two(num_vls));
> + n = ilog2(roundup_pow_of_two(num_vls));
> 
>   /* reject if too much is used */
>   if ((m + n) > 7)
> diff --git a/drivers/infiniband/hw/hfi1/init.c 
> b/drivers/infiniband/hw/hfi1/init.c
> index 26b792bb1027..838c789c7cce 100644
> --- a/drivers/infiniband/hw/hfi1/init.c
> +++ b/drivers/infiniband/hw/hfi1/init.c
> @@ -467,7 +467,7 @@ int hfi1_create_ctxtdata(struct hfi1_pportdata *ppd, int 
> numa,
>* MTU supported.
>*/
>   if (rcd->egrbufs.size < hfi1_max_mtu) {
> - rcd->egrbufs.size = __roundup_pow_of_two(hfi1_max_mtu);
> + rcd->egrbufs.size = roundup_pow_of_two(hfi1_max_mtu);
>   hfi1_cdbg(PROC,
>  

Re: [RFC PATCH v5 1/3] printk-rb: new printk ringbuffer implementation (writer)

2019-12-03 Thread Petr Mladek
On Tue 2019-12-03 15:13:36, John Ogness wrote:
> On 2019-12-02, Petr Mladek  wrote:
> >> +/*
> >> + * Sanity checker for reserve size. The ringbuffer code assumes that a 
> >> data
> >> + * block does not exceed the maximum possible size that could fit within 
> >> the
> >> + * ringbuffer. This function provides that basic size check so that the
> >> + * assumption is safe.
> >> + */
> >> +static bool data_check_size(struct prb_data_ring *data_ring, unsigned int 
> >> size)
> >> +{
> >> +  struct prb_data_block *db = NULL;
> >> +
> >> +  /* Writers are not allowed to write data-less records. */
> >> +  if (size == 0)
> >> +  return false;
> >
> > I would personally let this decision to the API caller.
> >
> > The code actually have to support data-less records. They are used
> > when the descriptor is reserved but the data block can't get reserved.
> 
> Exactly. Data-less records are how the ringbuffer identifies that data
> has been lost. If users were allowed to store data-less records, that
> destinction is no longer possible (unless I created some extra field in
> the descriptor). Does it even make sense for printk to store data-less
> records explicitly?

>From my POV, it does not make much sense.

> The current printk implementation silently ignores empty messages.

I am not able to find it. I only see that empty continuous framgments
are ignored in log_output(). Normal empty lines seems to be strored.

Well, I can't see any usecase for this. I think that we could ignore
all empty strings.


> > The above statement might create false feeling that it could not
> > happen later in the code. It might lead to bugs in writer code.
> 
> Then let me change the statement to describe that data-less records are
> used internally by the ringbuffer and cannot be explicitly created by
> writers.

Sounds godo to me.

> > Also reader API users might not expect to get a "valid" data-less
> > record.
> 
> Readers will never see them. The reader API implementation skips over
> data-less records.

Yeah. They will see bump in the seq number. We would need to
distinguish empty records and lost records as you wrote above.
It looks better to ignore empty records already during write.

Best Regards,
Petr

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [RFC PATCH v5 1/3] printk-rb: new printk ringbuffer implementation (writer)

2019-12-03 Thread Steven Rostedt
On Tue, 3 Dec 2019 10:17:21 +0900
Sergey Senozhatsky  wrote:

> > > BTW: If I am counting correctly. The ABA problem would happen when
> > > exactly 2^30 (1G) messages is written in the mean time.  
> > 
> > All the ringbuffer code assumes that the use of index numbers handles
> > the ABA problem (i.e. there must not be 1 billion printk's within an
> > NMI). If we want to support 1 billion+ printk's within an NMI, then
> > perhaps the index number should be increased. For 64-bit systems it
> > would be no problem to go to 62 bits. For 32-bit systems, I don't know
> > how well the 64-bit atomic operations are supported.  
> 
> ftrace dumps from NMI (DUMP_ALL type ftrace_dump_on_oops on a $BIG
> machine)? 1G seems large enough, but who knows.

ftrace dump from NMI is the most likely case to hit this, but when that
happens, you are in debugging mode, and the system usually becomes
unreliable at this moment. I agree with Petr, that we should not
complicate the code more to handle this theoretical condition.

-- Steve

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [RFC PATCH v5 1/3] printk-rb: new printk ringbuffer implementation (writer)

2019-12-03 Thread John Ogness
On 2019-12-02, Petr Mladek  wrote:
>> +/*
>> + * Sanity checker for reserve size. The ringbuffer code assumes that a data
>> + * block does not exceed the maximum possible size that could fit within the
>> + * ringbuffer. This function provides that basic size check so that the
>> + * assumption is safe.
>> + */
>> +static bool data_check_size(struct prb_data_ring *data_ring, unsigned int 
>> size)
>> +{
>> +struct prb_data_block *db = NULL;
>> +
>> +/* Writers are not allowed to write data-less records. */
>> +if (size == 0)
>> +return false;
>
> I would personally let this decision to the API caller.
>
> The code actually have to support data-less records. They are used
> when the descriptor is reserved but the data block can't get reserved.

Exactly. Data-less records are how the ringbuffer identifies that data
has been lost. If users were allowed to store data-less records, that
destinction is no longer possible (unless I created some extra field in
the descriptor). Does it even make sense for printk to store data-less
records explicitly?

The current printk implementation silently ignores empty messages.

> The above statement might create false feeling that it could not
> happen later in the code. It might lead to bugs in writer code.

Then let me change the statement to describe that data-less records are
used internally by the ringbuffer and cannot be explicitly created by
writers.

> Also reader API users might not expect to get a "valid" data-less
> record.

Readers will never see them. The reader API implementation skips over
data-less records.

John Ogness

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [RFC PATCH v5 2/3] printk-rb: new printk ringbuffer implementation (reader)

2019-12-03 Thread John Ogness
On 2019-12-03, Petr Mladek  wrote:
>> Add the reader implementation for the new ringbuffer.
>> 
>> Signed-off-by: John Ogness 
>> ---
>>  kernel/printk/printk_ringbuffer.c | 234 ++
>>  kernel/printk/printk_ringbuffer.h |  12 +-
>>  2 files changed, 245 insertions(+), 1 deletion(-)
>> 
>> diff --git a/kernel/printk/printk_ringbuffer.c 
>> b/kernel/printk/printk_ringbuffer.c
>> index 09c32e52fd40..f85762713583 100644
>> --- a/kernel/printk/printk_ringbuffer.c
>> +++ b/kernel/printk/printk_ringbuffer.c
>> @@ -674,3 +674,237 @@ void prb_commit(struct prb_reserved_entry *e)
>>  local_irq_restore(e->irqflags);
>>  }
>>  EXPORT_SYMBOL(prb_commit);
>> +
>> +/*
>> + * Given @blk_lpos, return a pointer to the raw data from the data block
>> + * and calculate the size of the data part. A NULL pointer is returned
>> + * if @blk_lpos specifies values that could never be legal.
>> + *
>> + * This function (used by readers) performs strict validation on the lpos
>> + * values to possibly detect bugs in the writer code. A WARN_ON_ONCE() is
>> + * triggered if an internal error is detected.
>> + */
>> +static char *get_data(struct prb_data_ring *data_ring,
>> +  struct prb_data_blk_lpos *blk_lpos,
>> +  unsigned long *data_size)
>> +{
>> +struct prb_data_block *db;
>> +
>> +if (blk_lpos->begin == INVALID_LPOS &&
>> +blk_lpos->next == INVALID_LPOS) {
>> +/* descriptor without a data block */
>> +return NULL;
>> +} else if (DATA_WRAPS(data_ring, blk_lpos->begin) ==
>> +   DATA_WRAPS(data_ring, blk_lpos->next)) {
>> +/* regular data block */
>> +if (WARN_ON_ONCE(blk_lpos->next <= blk_lpos->begin))
>> +return NULL;
>> +db = to_block(data_ring, blk_lpos->begin);
>> +*data_size = blk_lpos->next - blk_lpos->begin;
>> +
>> +} else if ((DATA_WRAPS(data_ring, blk_lpos->begin) + 1 ==
>> +DATA_WRAPS(data_ring, blk_lpos->next)) ||
>> +   ((DATA_WRAPS(data_ring, blk_lpos->begin) ==
>> + DATA_WRAPS(data_ring, -1UL)) &&
>> +(DATA_WRAPS(data_ring, blk_lpos->next) == 0))) {
>
> I am a bit confused. I would expect that (-1UL + 1) = 0. So the second
> condition after || looks just like a special variant of the first
> valid condition.
>
> Or do I miss anything? Is there a problems with type casting?

Sorry, this code deserves a comment.

Here we are only comparing the number of wraps. For a wrapping data
block, @begin will be 1 wrap less than @next. The first part of the
check is checking the typical case, making sure that:

   1 + WRAPS(@begin) == WRAPS(@next)

There is also the case when the lpos overflows. In that case the number
of wraps starts over at zero (without having overflowed). (Note: The
lpos overflows, _not_ the number of wraps. This is why the first check
is not enough.) In this case, the number of wraps of the highest
possible lpos value (-1UL) should be the same as the number of wraps of
@begin. And the number of wraps of @next should be 0. The simplified
pseudo-code check is:

   WRAPS(@begin) == WRAPS(-1UL)
  &&
   WRAPS(@next) == 0

>> +/* wrapping data block */
>> +db = to_block(data_ring, 0);
>> +*data_size = DATA_INDEX(data_ring, blk_lpos->next);
>> +
>> +} else {
>> +WARN_ON_ONCE(1);
>> +return NULL;
>> +}
>> +
>> +/* A valid data block will always be aligned to the ID size. */
>> +if (WARN_ON_ONCE(blk_lpos->begin !=
>> + ALIGN(blk_lpos->begin, sizeof(db->id))) ||
>> +WARN_ON_ONCE(blk_lpos->next !=
>> + ALIGN(blk_lpos->next, sizeof(db->id {
>> +return NULL;
>> +}
>> +
>> +/* A valid data block will always have at least an ID. */
>> +if (WARN_ON_ONCE(*data_size < sizeof(db->id)))
>> +return NULL;
>> +
>> +/* Subtract descriptor ID space from size. */
>> +*data_size -= sizeof(db->id);
>> +
>> +return >data[0];
>> +}
>> +
>> +/* Given @blk_lpos, copy an expected @len of data into the provided buffer. 
>> */
>> +static bool copy_data(struct prb_data_ring *data_ring,
>> +  struct prb_data_blk_lpos *blk_lpos, u16 len, char *buf,
>> +  unsigned int buf_size)
>> +{
>> +unsigned long data_size;
>> +char *data;
>> +
>> +/* Caller might not want the data. */
>> +if (!buf || !buf_size)
>> +return true;
>> +
>> +data = get_data(data_ring, blk_lpos, _size);
>> +if (!data)
>> +return false;
>> +
>> +/* Actual cannot be less than expected. */
>> +if (WARN_ON_ONCE(data_size < len))
>> +return false;
>
> I do not have a good feeling that the record gets lost here.
>
> I could imagine that a writer would reserve more space than
> needed in the end. Then it would want to modify desc.info.text_len
> and could do a 

Re: [RFC PATCH v5 2/3] printk-rb: new printk ringbuffer implementation (reader)

2019-12-03 Thread Petr Mladek
On Thu 2019-11-28 02:58:34, John Ogness wrote:
> Add the reader implementation for the new ringbuffer.
> 
> Signed-off-by: John Ogness 
> ---
>  kernel/printk/printk_ringbuffer.c | 234 ++
>  kernel/printk/printk_ringbuffer.h |  12 +-
>  2 files changed, 245 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/printk/printk_ringbuffer.c 
> b/kernel/printk/printk_ringbuffer.c
> index 09c32e52fd40..f85762713583 100644
> --- a/kernel/printk/printk_ringbuffer.c
> +++ b/kernel/printk/printk_ringbuffer.c
> @@ -674,3 +674,237 @@ void prb_commit(struct prb_reserved_entry *e)
>   local_irq_restore(e->irqflags);
>  }
>  EXPORT_SYMBOL(prb_commit);
> +
> +/*
> + * Given @blk_lpos, return a pointer to the raw data from the data block
> + * and calculate the size of the data part. A NULL pointer is returned
> + * if @blk_lpos specifies values that could never be legal.
> + *
> + * This function (used by readers) performs strict validation on the lpos
> + * values to possibly detect bugs in the writer code. A WARN_ON_ONCE() is
> + * triggered if an internal error is detected.
> + */
> +static char *get_data(struct prb_data_ring *data_ring,
> +   struct prb_data_blk_lpos *blk_lpos,
> +   unsigned long *data_size)
> +{
> + struct prb_data_block *db;
> +
> + if (blk_lpos->begin == INVALID_LPOS &&
> + blk_lpos->next == INVALID_LPOS) {
> + /* descriptor without a data block */
> + return NULL;
> + } else if (DATA_WRAPS(data_ring, blk_lpos->begin) ==
> +DATA_WRAPS(data_ring, blk_lpos->next)) {
> + /* regular data block */
> + if (WARN_ON_ONCE(blk_lpos->next <= blk_lpos->begin))
> + return NULL;
> + db = to_block(data_ring, blk_lpos->begin);
> + *data_size = blk_lpos->next - blk_lpos->begin;
> +
> + } else if ((DATA_WRAPS(data_ring, blk_lpos->begin) + 1 ==
> + DATA_WRAPS(data_ring, blk_lpos->next)) ||
> +((DATA_WRAPS(data_ring, blk_lpos->begin) ==
> +  DATA_WRAPS(data_ring, -1UL)) &&
> + (DATA_WRAPS(data_ring, blk_lpos->next) == 0))) {

I am a bit confused. I would expect that (-1UL + 1) = 0. So the second
condition after || looks just like a special variant of the first
valid condition.

Or do I miss anything? Is there a problems with type casting?


> + /* wrapping data block */
> + db = to_block(data_ring, 0);
> + *data_size = DATA_INDEX(data_ring, blk_lpos->next);
> +
> + } else {
> + WARN_ON_ONCE(1);
> + return NULL;
> + }
> +
> + /* A valid data block will always be aligned to the ID size. */
> + if (WARN_ON_ONCE(blk_lpos->begin !=
> +  ALIGN(blk_lpos->begin, sizeof(db->id))) ||
> + WARN_ON_ONCE(blk_lpos->next !=
> +  ALIGN(blk_lpos->next, sizeof(db->id {
> + return NULL;
> + }
> +
> + /* A valid data block will always have at least an ID. */
> + if (WARN_ON_ONCE(*data_size < sizeof(db->id)))
> + return NULL;
> +
> + /* Subtract descriptor ID space from size. */
> + *data_size -= sizeof(db->id);
> +
> + return >data[0];
> +}
> +
> +/* Given @blk_lpos, copy an expected @len of data into the provided buffer. 
> */
> +static bool copy_data(struct prb_data_ring *data_ring,
> +   struct prb_data_blk_lpos *blk_lpos, u16 len, char *buf,
> +   unsigned int buf_size)
> +{
> + unsigned long data_size;
> + char *data;
> +
> + /* Caller might not want the data. */
> + if (!buf || !buf_size)
> + return true;
> +
> + data = get_data(data_ring, blk_lpos, _size);
> + if (!data)
> + return false;
> +
> + /* Actual cannot be less than expected. */
> + if (WARN_ON_ONCE(data_size < len))
> + return false;

I do not have a good feeling that the record gets lost here.

I could imagine that a writer would reserve more space than
needed in the end. Then it would want to modify desc.info.text_len
and could do a mistake.

By other words, I would expect a bug on the writer side here.
And I would try to preserve the data by calling:

pr_warn_once("Wrong data_size (%lu) for data: %.*s\n", data_size,
data_size, data);

Well, I do not resist on it. WARN_ON_ONCE() is fine as well.

> +
> + data_size = min_t(u16, buf_size, len);
> +
> + if (!WARN_ON_ONCE(!data_size))
> + memcpy([0], data, data_size);
> + return true;
> +}
> +

Otherwise it looks good to me. I wonder how the conversion
of the printk.c code will look with this API.

Best Regards,
Petr

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: kexec_file overwrites reserved EFI ESRT memory

2019-12-03 Thread Dave Young
On 12/03/19 at 10:01am, Ard Biesheuvel wrote:
> On Mon, 2 Dec 2019 at 09:05, Dave Young  wrote:
> >
> > Add more cc
> > On 12/02/19 at 04:58pm, Dave Young wrote:
> > > On 11/29/19 at 04:27pm, Michael Weiser wrote:
> > > > Hello Dave,
> > > >
> > > > On Mon, Nov 25, 2019 at 01:52:01PM +0800, Dave Young wrote:
> > > >
> > > > > > > Fundamentally when deciding where to place a new kernel kexec 
> > > > > > > (either
> > > > > > > user space or the in kernel kexec_file implementation) needs to 
> > > > > > > be able
> > > > > > > to ask the question which memory ares are reserved.
> > > > [...]
> > > > > > > So my question is why doesn't the ESRT reservation wind up in
> > > > > > > /proc/iomem?
> > > > > >
> > > > > > My guess is that the focus was that some EFI structures need to be 
> > > > > > kept
> > > > > > around accross the life cycle of *one* running kernel and
> > > > > > memblock_reserve() was enough for that. Marking them so they survive
> > > > > > kexecing another kernel might just never have cropped up thus far. 
> > > > > > Ard
> > > > > > or Matt would know.
> > > > > Can you check your un-reserved memory, if your memory falls into EFI
> > > > > BOOT* then in X86 you can use something like below if it is not 
> > > > > covered:
> > > >
> > > > > void __init efi_esrt_init(void)
> > > > > {
> > > > > ...
> > > > >   pr_info("Reserving ESRT space from %pa to %pa.\n", _data, 
> > > > > );
> > > > >   if (md.type == EFI_BOOT_SERVICES_DATA)
> > > > >   efi_mem_reserve(esrt_data, esrt_data_size);
> > > > > ...
> > > > > }
> > > >
> > > > Please bear with me if I'm a bit slow on the uptake here: On my machine,
> > > > the esrt module reports at boot:
> > > >
> > > > [0.001244] esrt: Reserving ESRT space from 0x74dd2f98 to 
> > > > 0x74dd2fd0.
> > > >
> > > > This area is of type "Boot Data" (== BOOT_SERVICES_DATA) which makes the
> > > > code you quote reserve it using memblock_reserve() shown by
> > > > memblock=debug:
> > > >
> > > > [0.001246] memblock_reserve: 
> > > > [0x74dd2f98-0x74dd2fcf] efi_mem_reserve+0x1d/0x2b
> > > >
> > > > It also calls into arch/x86/platform/efi/quirks.c:efi_arch_mem_reserve()
> > > > which tags it as EFI_MEMORY_RUNTIME while the surrounding ones aren't
> > > > as shown by efi=debug:
> > > >
> > > > [0.178111] efi: mem10: [Boot Data  |   |  |  |  |  |  |  |  
> > > > |   |WB|WT|WC|UC] range=[0x74dd3000-0x75becfff] (14MB)
> > > > [0.178113] efi: mem11: [Boot Data  |RUN|  |  |  |  |  |  |  
> > > > |   |WB|WT|WC|UC] range=[0x74dd2000-0x74dd2fff] (0MB)
> > > > [0.178114] efi: mem12: [Boot Data  |   |  |  |  |  |  |  |  
> > > > |   |WB|WT|WC|UC] range=[0x6d635000-0x74dd1fff] (119MB)
> > > >
> > > > This prevents arch/x86/platform/efi/quirks.c:efi_free_boot_services()
> > > > from calling __memblock_free_late() on it. And indeed, memblock=debug 
> > > > does
> > > > not report this area as being free'd while the surrounding ones are:
> > > >
> > > > [0.178369] __memblock_free_late: 
> > > > [0x74dd3000-0x75becfff] 
> > > > efi_free_boot_services+0x126/0x1f8
> > > > [0.178658] __memblock_free_late: 
> > > > [0x6d635000-0x74dd1fff] 
> > > > efi_free_boot_services+0x126/0x1f8
> > > >
> > > > The esrt area does not show up in /proc/iomem though:
> > > >
> > > > 0010-763f5fff : System RAM
> > > >   6200-62a00d80 : Kernel code
> > > >   62c0-62f15fff : Kernel rodata
> > > >   6300-630ea8bf : Kernel data
> > > >   63fed000-641f : Kernel bss
> > > >   6500-6aff : Crash kernel
> > > >
> > > > And thus kexec loads the new kernel right over that area as shown when
> > > > enabling -DDEBUG on kexec_file.c (0x74dd3000 being inbetween 0x7300
> > > > and 0x7300+0x24be000 = 0x754be000):
> > > >
> > > > [  650.007695] kexec_file: Loading segment 0: buf=0x3a9c84d6 
> > > > bufsz=0x5000 mem=0x98000 memsz=0x6000
> > > > [  650.007699] kexec_file: Loading segment 1: buf=0x17b2b9e6 
> > > > bufsz=0x1240 mem=0x96000 memsz=0x2000
> > > > [  650.007703] kexec_file: Loading segment 2: buf=0xfdf72ba2 
> > > > bufsz=0x1150888 mem=0x7300 memsz=0x24be000
> > > >
> > > > ... because it looks for any memory hole large enough in iomem resources
> > > > tagged as System RAM, which 0x74dd2000-0x74dd2fff would then need to be
> > > > excluded from on my system.
> > > >
> > > > Looking some more at efi_arch_mem_reserve() I see that it also registers
> > > > the area with efi.memmap and installs it using efi_memmap_install().
> > > > which seems to call memremap(MEMREMAP_WB) on it. From my understanding
> > > > of the comments in the source of memremap(), MEMREMAP_WB does 
> > > > specifically
> > > > *not* reserve that memory in any way.
> > > >
> > > > > Unfortunately I noticed there are different requirements/ways for
> > > > > different types of "reserved" memory.  But 

Re: kexec_file overwrites reserved EFI ESRT memory

2019-12-03 Thread Dave Young
On 12/03/19 at 12:45am, Michael Weiser wrote:
> Hi Dave,
> 
> On Mon, Dec 02, 2019 at 05:05:20PM +0800, Dave Young wrote:
> 
> > > It seems a serious problem, the EFI modified memmap does not get an
> > > /proc/iomem resource update, but kexec_file relies on /proc/iomem in
> > > X86.
> > > 
> > > There is an question from Sai about why add_efi_memmap is not enabled by
> > > default:
> > > https://www.spinics.net/lists/linux-mm/msg185166.html
> 
> Incidentally, a data point I did not think to mention: I do boot the
> kernel as EFI application directly from the firmware as a boot entry
> with compiled in initrd and command line:
> 
> $ grep EFI nobak/kernel/linux/.config
> CONFIG_EFI=y
> CONFIG_EFI_STUB=y
> # CONFIG_EFI_MIXED is not set
> CONFIG_DMI_SCAN_MACHINE_NON_EFI_FALLBACK=y
> # EFI (Extensible Firmware Interface) Support
> CONFIG_EFI_VARS=m
> CONFIG_EFI_ESRT=y
> CONFIG_EFI_VARS_PSTORE=m
> # CONFIG_EFI_VARS_PSTORE_DEFAULT_DISABLE is not set
> CONFIG_EFI_RUNTIME_MAP=y
> # CONFIG_EFI_FAKE_MEMMAP is not set
> CONFIG_EFI_RUNTIME_WRAPPERS=y
> # CONFIG_EFI_BOOTLOADER_CONTROL is not set
> # CONFIG_EFI_CAPSULE_LOADER is not set
> # CONFIG_EFI_TEST is not set
> # CONFIG_EFI_RCI2_TABLE is not set
> # end of EFI (Extensible Firmware Interface) Support
> CONFIG_UEFI_CPER=y
> CONFIG_UEFI_CPER_X86=y
> CONFIG_EFI_EARLYCON=y
> CONFIG_EFI_PARTITION=y
> CONFIG_FB_EFI=y
> CONFIG_EFIVAR_FS=y
> # CONFIG_EFI_PGT_DUMP is not set
> 
> $ grep CMDLINE nobak/kernel/linux/.config
> CONFIG_CMDLINE_BOOL=y
> CONFIG_CMDLINE="root=UUID=97[...]e4 rd.luks.uuid=8a[...]c3 
> rd.luks.allow-discards=8a[...]c3 mem_sleep_default=deep resume=UUID=97[...]e4 
> resume_offset=96256 efi=debug memblock=debug"
> CONFIG_CMDLINE_OVERRIDE=y
> # CONFIG_BLK_CMDLINE_PARSER is not set
> # CONFIG_CMDLINE_PARTITION is not set
> CONFIG_FB_CMDLINE=y
> 
> $ efibootmgr -v
> BootCurrent: 000A
> Timeout: 2 seconds
> BootOrder: 000A,0009,0008,0005,0007,0006,0004,0002,0001,,0003
> [...]
> Boot0005* gentoo-5.4.0-next-20191127+-clear
> HD(1,GPT,e7[...]f2,0x800,0x64000)/File(\kernel-5.4.0-next-20191127+-clear)
> [...]
> Boot000A* gentoo-5.4.1-gentoo
> HD(1,GPT,e7[...]f2,0x800,0x64000)/File(\kernel-5.4.1-gentoo)
> 
> So there's no boot loader that could construct an e820 table for the
> kernel to consume. I understand it's then up to the EFI stub to come up
> with a e820 table from the EFI memory map.
> 
> > > Long time ago the add_efi_memmap is only enabled in case we explict
> > > enable it on cmdline, I'm not sure if we can do it by default, maybe we
> > > should.   Need opinion from X86 maintainers..
> > > Can you try below diff see if it works for you? (not tested, and need
> > > explicitly 'add_efi_memmap' in kernel cmdline param)
> 
> Neither adding add_efi_memmap nor adding your patch and setting that option
> does make the ESRT memory region appear in /proc/iomem. kexec_file still
> loads the kernel across the ESRT region.
> 

Hmm, sorry, my bad, actuall add_efi_memmap does not consider the
EFI_MEMORY_RUNTIME attribute, it only reads the memory descriptor types.

Will read your replied information later, did not get time today, but
probably below chunk can help?

diff --git a/arch/x86/platform/efi/quirks.c b/arch/x86/platform/efi/quirks.c
index 3b9fd679cea9..516307617621 100644
--- a/arch/x86/platform/efi/quirks.c
+++ b/arch/x86/platform/efi/quirks.c
@@ -293,6 +293,8 @@ void __init efi_arch_mem_reserve(phys_addr_t addr, u64 size)
early_memunmap(new, new_size);
 
efi_memmap_install(new_phys, num_entries);
+   e820__range_update(addr, size, E820_TYPE_RAM, E820_TYPE_RESERVED);
+   e820__update_table(e820_table);
 }
 
 /*

Thanks
Dave


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


[PATCH v4 7/8] linux/log2.h: Fix 64bit calculations in roundup/down_pow_two()

2019-12-03 Thread Nicolas Saenz Julienne
Some users need to make sure their rounding function accepts and returns
64bit long variables regardless of the architecture. Sadly
roundup/rounddown_pow_two() takes and returns unsigned longs. It turns
out ilog2() already handles 32/64bit calculations properly, and being
the building block to the round functions we can rework them as a
wrapper around it.

Suggested-by: Robin Murphy 
Signed-off-by: Nicolas Saenz Julienne 
---
 drivers/clk/clk-divider.c|  8 ++--
 drivers/clk/sunxi/clk-sunxi.c|  2 +-
 drivers/infiniband/hw/hfi1/chip.c|  4 +-
 drivers/infiniband/hw/hfi1/init.c|  4 +-
 drivers/infiniband/hw/mlx4/srq.c |  2 +-
 drivers/infiniband/hw/mthca/mthca_srq.c  |  2 +-
 drivers/infiniband/sw/rxe/rxe_qp.c   |  4 +-
 drivers/iommu/intel-iommu.c  |  4 +-
 drivers/iommu/intel-svm.c|  4 +-
 drivers/iommu/intel_irq_remapping.c  |  2 +-
 drivers/net/ethernet/amd/xgbe/xgbe-ethtool.c |  4 +-
 drivers/net/ethernet/marvell/sky2.c  |  2 +-
 drivers/net/ethernet/rocker/rocker_hw.h  |  4 +-
 drivers/net/ethernet/sfc/ef10.c  |  2 +-
 drivers/net/ethernet/sfc/efx.h   |  2 +-
 drivers/net/ethernet/sfc/falcon/efx.h|  2 +-
 drivers/pci/msi.c|  2 +-
 include/linux/log2.h | 44 +---
 kernel/kexec_core.c  |  3 +-
 lib/rhashtable.c |  2 +-
 net/sunrpc/xprtrdma/verbs.c  |  2 +-
 21 files changed, 41 insertions(+), 64 deletions(-)

diff --git a/drivers/clk/clk-divider.c b/drivers/clk/clk-divider.c
index 098b2b01f0af..ba947e4c8193 100644
--- a/drivers/clk/clk-divider.c
+++ b/drivers/clk/clk-divider.c
@@ -222,7 +222,7 @@ static int _div_round_up(const struct clk_div_table *table,
int div = DIV_ROUND_UP_ULL((u64)parent_rate, rate);
 
if (flags & CLK_DIVIDER_POWER_OF_TWO)
-   div = __roundup_pow_of_two(div);
+   div = roundup_pow_of_two(div);
if (table)
div = _round_up_table(table, div);
 
@@ -240,8 +240,8 @@ static int _div_round_closest(const struct clk_div_table 
*table,
down = parent_rate / rate;
 
if (flags & CLK_DIVIDER_POWER_OF_TWO) {
-   up = __roundup_pow_of_two(up);
-   down = __rounddown_pow_of_two(down);
+   up = roundup_pow_of_two(up);
+   down = rounddown_pow_of_two(down);
} else if (table) {
up = _round_up_table(table, up);
down = _round_down_table(table, down);
@@ -278,7 +278,7 @@ static int _next_div(const struct clk_div_table *table, int 
div,
div++;
 
if (flags & CLK_DIVIDER_POWER_OF_TWO)
-   return __roundup_pow_of_two(div);
+   return roundup_pow_of_two(div);
if (table)
return _round_up_table(table, div);
 
diff --git a/drivers/clk/sunxi/clk-sunxi.c b/drivers/clk/sunxi/clk-sunxi.c
index 27201fd26e44..faec99dc09c0 100644
--- a/drivers/clk/sunxi/clk-sunxi.c
+++ b/drivers/clk/sunxi/clk-sunxi.c
@@ -311,7 +311,7 @@ static void sun6i_get_ahb1_factors(struct factors_request 
*req)
 
calcm = DIV_ROUND_UP(div, 1 << calcp);
} else {
-   calcp = __roundup_pow_of_two(div);
+   calcp = roundup_pow_of_two(div);
calcp = calcp > 3 ? 3 : calcp;
}
 
diff --git a/drivers/infiniband/hw/hfi1/chip.c 
b/drivers/infiniband/hw/hfi1/chip.c
index 9b1fb84a3d45..96b1d343c32f 100644
--- a/drivers/infiniband/hw/hfi1/chip.c
+++ b/drivers/infiniband/hw/hfi1/chip.c
@@ -14199,10 +14199,10 @@ static int qos_rmt_entries(struct hfi1_devdata *dd, 
unsigned int *mp,
max_by_vl = krcvqs[i];
if (max_by_vl > 32)
goto no_qos;
-   m = ilog2(__roundup_pow_of_two(max_by_vl));
+   m = ilog2(roundup_pow_of_two(max_by_vl));
 
/* determine bits for vl */
-   n = ilog2(__roundup_pow_of_two(num_vls));
+   n = ilog2(roundup_pow_of_two(num_vls));
 
/* reject if too much is used */
if ((m + n) > 7)
diff --git a/drivers/infiniband/hw/hfi1/init.c 
b/drivers/infiniband/hw/hfi1/init.c
index 26b792bb1027..838c789c7cce 100644
--- a/drivers/infiniband/hw/hfi1/init.c
+++ b/drivers/infiniband/hw/hfi1/init.c
@@ -467,7 +467,7 @@ int hfi1_create_ctxtdata(struct hfi1_pportdata *ppd, int 
numa,
 * MTU supported.
 */
if (rcd->egrbufs.size < hfi1_max_mtu) {
-   rcd->egrbufs.size = __roundup_pow_of_two(hfi1_max_mtu);
+   rcd->egrbufs.size = roundup_pow_of_two(hfi1_max_mtu);
hfi1_cdbg(PROC,
  "ctxt%u: eager bufs size too small. Adjusting 
to %u\n",
rcd->ctxt, rcd->egrbufs.size);
@@ -1959,7 +1959,7 @@ int 

[PATCH v4 0/8] Raspberry Pi 4 PCIe support

2019-12-03 Thread Nicolas Saenz Julienne
This series aims at providing support for Raspberry Pi 4's PCIe
controller, which is also shared with the Broadcom STB family of
devices.

There was a previous attempt to upstream this some years ago[1] but was
blocked as most STB PCIe integrations have a sparse DMA mapping[2] which
is something currently not supported by the kernel.  Luckily this is not
the case for the Raspberry Pi 4.

Note that the driver code is to be based on top of Rob Herring's series
simplifying inbound and outbound range parsing.

[1] https://patchwork.kernel.org/cover/10605933/
[2] https://patchwork.kernel.org/patch/10605957/

---

Changes since v3:
  - Moved all the log2.h related changes at the end of the series, as I
presume they will be contentious and I don't want the PCIe patches
to depend on them. Ultimately I think I'll respin them on their own
series but wanted to keep them in for this submission just for the
sake of continuity.
  - Addressed small nits here and there.

Changes since v2:
  - Redo register access in driver avoiding indirection while keeping
the naming intact
  - Add patch editing ARM64's config
  - Last MSI cleanups, notably removing MSIX flag
  - Got rid of all _RB writes
  - Got rid of all of_data
  - Overall churn removal
  - Address the rest of Andrew's comments

Changes since v1:
  - add generic rounddown/roundup_pow_two64() patch
  - Add MAINTAINERS patch
  - Fix Kconfig
  - Cleanup probe, use up to date APIs, exit on MSI failure
  - Get rid of linux,pci-domain and other unused constructs
  - Use edge triggered setup for MSI
  - Cleanup MSI implementation
  - Fix multiple cosmetic issues
  - Remove supend/resume code

Jim Quinlan (3):
  dt-bindings: PCI: Add bindings for brcmstb's PCIe device
  PCI: brcmstb: Add Broadcom STB PCIe host controller driver
  PCI: brcmstb: Add MSI support

Nicolas Saenz Julienne (5):
  ARM: dts: bcm2711: Enable PCIe controller
  MAINTAINERS: Add brcmstb PCIe controller
  arm64: defconfig: Enable Broadcom's STB PCIe controller
  linux/log2.h: Fix 64bit calculations in roundup/down_pow_two()
  linux/log2.h: Use roundup/dow_pow_two() on 64bit calculations

 .../bindings/pci/brcm,stb-pcie.yaml   |   97 ++
 MAINTAINERS   |4 +
 arch/arm/boot/dts/bcm2711.dtsi|   37 +
 arch/arm64/configs/defconfig  |1 +
 drivers/acpi/arm64/iort.c |2 +-
 drivers/clk/clk-divider.c |8 +-
 drivers/clk/sunxi/clk-sunxi.c |2 +-
 drivers/infiniband/hw/hfi1/chip.c |4 +-
 drivers/infiniband/hw/hfi1/init.c |4 +-
 drivers/infiniband/hw/mlx4/srq.c  |2 +-
 drivers/infiniband/hw/mthca/mthca_srq.c   |2 +-
 drivers/infiniband/sw/rxe/rxe_qp.c|4 +-
 drivers/iommu/intel-iommu.c   |4 +-
 drivers/iommu/intel-svm.c |4 +-
 drivers/iommu/intel_irq_remapping.c   |2 +-
 drivers/net/ethernet/amd/xgbe/xgbe-ethtool.c  |4 +-
 drivers/net/ethernet/marvell/sky2.c   |2 +-
 drivers/net/ethernet/mellanox/mlx4/en_clock.c |3 +-
 drivers/net/ethernet/rocker/rocker_hw.h   |4 +-
 drivers/net/ethernet/sfc/ef10.c   |2 +-
 drivers/net/ethernet/sfc/efx.h|2 +-
 drivers/net/ethernet/sfc/falcon/efx.h |2 +-
 drivers/of/device.c   |3 +-
 drivers/pci/controller/Kconfig|9 +
 drivers/pci/controller/Makefile   |1 +
 .../pci/controller/cadence/pcie-cadence-ep.c  |3 +-
 drivers/pci/controller/cadence/pcie-cadence.c |3 +-
 drivers/pci/controller/pcie-brcmstb.c | 1008 +
 drivers/pci/controller/pcie-rockchip-ep.c |5 +-
 drivers/pci/msi.c |2 +-
 include/linux/log2.h  |   44 +-
 kernel/dma/direct.c   |2 +-
 kernel/kexec_core.c   |3 +-
 lib/rhashtable.c  |2 +-
 net/sunrpc/xprtrdma/verbs.c   |2 +-
 35 files changed, 1211 insertions(+), 72 deletions(-)
 create mode 100644 Documentation/devicetree/bindings/pci/brcm,stb-pcie.yaml
 create mode 100644 drivers/pci/controller/pcie-brcmstb.c

-- 
2.24.0


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Using kexec for rpi-4-b

2019-12-03 Thread Khouloud Touil
Hello,

I am trying to boot another kernel for rpi4, using a kexec tools,
The kernel booted already, is compiled with KEXEC=y,
using the official sources for rpi (4.19.86-v8+)
The rootfs generated with debootstarp for arch 64
The command I am trying to execute is:
kexec -l /boot/Image --dtb=/boot/bcm2711-rpi-4-b.dtb --ramdisk=/boot/rootfs.cpio
-- reuse-cmdline
this doesn't work and the logs are:
[99.107018] Can't kexec: CPUs are stuck in the kernel.
kexec_load failed: Device or resource busy
entry  = 0x3d9d690 flags = 0xb7
nr_segments = 4
segment[0].buf = 0x7f84316010
segment[0].bufsz = 0x1951a99
segment[0].mem = 0x8
segment[0].memsz = 0x19ca000
segment[1].buf = 0x7f81fc9010
segment[1.bufsz = 0x234c800
segment[1].mem = 0x1a4a000
segment[1].memsz = 0x234d000
segment[2].buf = 0x5597abe9b0
segment[2].bufsz = 0x521b
segment[2].mem = 3d97000
segment[2].memsz = 0x6000
segment[3].buf = 0x5597ac4030
segment[3].bufsz = 0x3458
segment[3].mem = 0x3d9d000
segment[3].memsz =  0x4000

Could someone please help to figure out what's the problem ?

Regards,

Khouloud

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH v5 0/5] Append new variables to vmcoreinfo (TCR_EL1.T1SZ for arm64 and MAX_PHYSMEM_BITS for all archs)

2019-12-03 Thread Will Deacon
On Sat, Nov 30, 2019 at 01:35:36AM +0530, Bhupesh Sharma wrote:
> On Fri, Nov 29, 2019 at 3:54 PM Will Deacon  wrote:
> > On Fri, Nov 29, 2019 at 01:53:36AM +0530, Bhupesh Sharma wrote:
> > > Changes since v4:
> > > 
> > > - v4 can be seen here:
> > >   http://lists.infradead.org/pipermail/kexec/2019-November/023961.html
> > > - Addressed comments from Dave and added patches for documenting
> > >   new variables appended to vmcoreinfo documentation.
> > > - Added testing report shared by Akashi for PATCH 2/5.
> >
> > Please can you fix your mail setup? The last two times you've sent this
> > series it seems to get split into two threads, which is really hard to
> > track in my inbox:
> >
> > First thread:
> >
> > https://lore.kernel.org/lkml/1574972621-25750-1-git-send-email-bhsha...@redhat.com/
> >
> > Second thread:
> >
> > https://lore.kernel.org/lkml/1574972716-25858-1-git-send-email-bhsha...@redhat.com/
> 
> There seems to be some issue with my server's msmtp settings. I have
> tried resending the v5 (see
> ).
> 
> I hope the threading is ok this time.

Much better now, thanks for sorting it out.

Will

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: kexec_file overwrites reserved EFI ESRT memory

2019-12-03 Thread Ard Biesheuvel
On Mon, 2 Dec 2019 at 09:05, Dave Young  wrote:
>
> Add more cc
> On 12/02/19 at 04:58pm, Dave Young wrote:
> > On 11/29/19 at 04:27pm, Michael Weiser wrote:
> > > Hello Dave,
> > >
> > > On Mon, Nov 25, 2019 at 01:52:01PM +0800, Dave Young wrote:
> > >
> > > > > > Fundamentally when deciding where to place a new kernel kexec 
> > > > > > (either
> > > > > > user space or the in kernel kexec_file implementation) needs to be 
> > > > > > able
> > > > > > to ask the question which memory ares are reserved.
> > > [...]
> > > > > > So my question is why doesn't the ESRT reservation wind up in
> > > > > > /proc/iomem?
> > > > >
> > > > > My guess is that the focus was that some EFI structures need to be 
> > > > > kept
> > > > > around accross the life cycle of *one* running kernel and
> > > > > memblock_reserve() was enough for that. Marking them so they survive
> > > > > kexecing another kernel might just never have cropped up thus far. Ard
> > > > > or Matt would know.
> > > > Can you check your un-reserved memory, if your memory falls into EFI
> > > > BOOT* then in X86 you can use something like below if it is not covered:
> > >
> > > > void __init efi_esrt_init(void)
> > > > {
> > > > ...
> > > >   pr_info("Reserving ESRT space from %pa to %pa.\n", _data, );
> > > >   if (md.type == EFI_BOOT_SERVICES_DATA)
> > > >   efi_mem_reserve(esrt_data, esrt_data_size);
> > > > ...
> > > > }
> > >
> > > Please bear with me if I'm a bit slow on the uptake here: On my machine,
> > > the esrt module reports at boot:
> > >
> > > [0.001244] esrt: Reserving ESRT space from 0x74dd2f98 to 
> > > 0x74dd2fd0.
> > >
> > > This area is of type "Boot Data" (== BOOT_SERVICES_DATA) which makes the
> > > code you quote reserve it using memblock_reserve() shown by
> > > memblock=debug:
> > >
> > > [0.001246] memblock_reserve: [0x74dd2f98-0x74dd2fcf] 
> > > efi_mem_reserve+0x1d/0x2b
> > >
> > > It also calls into arch/x86/platform/efi/quirks.c:efi_arch_mem_reserve()
> > > which tags it as EFI_MEMORY_RUNTIME while the surrounding ones aren't
> > > as shown by efi=debug:
> > >
> > > [0.178111] efi: mem10: [Boot Data  |   |  |  |  |  |  |  |  | 
> > >   |WB|WT|WC|UC] range=[0x74dd3000-0x75becfff] (14MB)
> > > [0.178113] efi: mem11: [Boot Data  |RUN|  |  |  |  |  |  |  | 
> > >   |WB|WT|WC|UC] range=[0x74dd2000-0x74dd2fff] (0MB)
> > > [0.178114] efi: mem12: [Boot Data  |   |  |  |  |  |  |  |  | 
> > >   |WB|WT|WC|UC] range=[0x6d635000-0x74dd1fff] (119MB)
> > >
> > > This prevents arch/x86/platform/efi/quirks.c:efi_free_boot_services()
> > > from calling __memblock_free_late() on it. And indeed, memblock=debug does
> > > not report this area as being free'd while the surrounding ones are:
> > >
> > > [0.178369] __memblock_free_late: 
> > > [0x74dd3000-0x75becfff] efi_free_boot_services+0x126/0x1f8
> > > [0.178658] __memblock_free_late: 
> > > [0x6d635000-0x74dd1fff] efi_free_boot_services+0x126/0x1f8
> > >
> > > The esrt area does not show up in /proc/iomem though:
> > >
> > > 0010-763f5fff : System RAM
> > >   6200-62a00d80 : Kernel code
> > >   62c0-62f15fff : Kernel rodata
> > >   6300-630ea8bf : Kernel data
> > >   63fed000-641f : Kernel bss
> > >   6500-6aff : Crash kernel
> > >
> > > And thus kexec loads the new kernel right over that area as shown when
> > > enabling -DDEBUG on kexec_file.c (0x74dd3000 being inbetween 0x7300
> > > and 0x7300+0x24be000 = 0x754be000):
> > >
> > > [  650.007695] kexec_file: Loading segment 0: buf=0x3a9c84d6 
> > > bufsz=0x5000 mem=0x98000 memsz=0x6000
> > > [  650.007699] kexec_file: Loading segment 1: buf=0x17b2b9e6 
> > > bufsz=0x1240 mem=0x96000 memsz=0x2000
> > > [  650.007703] kexec_file: Loading segment 2: buf=0xfdf72ba2 
> > > bufsz=0x1150888 mem=0x7300 memsz=0x24be000
> > >
> > > ... because it looks for any memory hole large enough in iomem resources
> > > tagged as System RAM, which 0x74dd2000-0x74dd2fff would then need to be
> > > excluded from on my system.
> > >
> > > Looking some more at efi_arch_mem_reserve() I see that it also registers
> > > the area with efi.memmap and installs it using efi_memmap_install().
> > > which seems to call memremap(MEMREMAP_WB) on it. From my understanding
> > > of the comments in the source of memremap(), MEMREMAP_WB does specifically
> > > *not* reserve that memory in any way.
> > >
> > > > Unfortunately I noticed there are different requirements/ways for
> > > > different types of "reserved" memory.  But that is another topic..
> > >
> > > I tried to reserve the area with something like this:
> > >
> > > t a/arch/x86/platform/efi/quirks.c b/arch/x86/platform/efi/quirks.c
> > > index 4de244683a7e..b86a5df027a2 100644
> > > --- a/arch/x86/platform/efi/quirks.c
> > > +++ b/arch/x86/platform/efi/quirks.c
> > > @@ -249,6 

Re: [RFC PATCH v5 1/3] printk-rb: new printk ringbuffer implementation (writer)

2019-12-03 Thread Petr Mladek
On Mon 2019-12-02 17:37:26, John Ogness wrote:
> On 2019-12-02, Petr Mladek  wrote:
> >> > +/* Reserve a new descriptor, invalidating the oldest if necessary. */
> >> > +static bool desc_reserve(struct printk_ringbuffer *rb, u32 *id_out)
> >> > +{
> >> > +struct prb_desc_ring *desc_ring = >desc_ring;
> >> > +struct prb_desc *desc;
> >> > +u32 id_prev_wrap;
> >> > +u32 head_id;
> >> > +u32 id;
> >> > +
> >> > +head_id = atomic_read(_ring->head_id);
> >> > +
> >> > +do {
> >> > +desc = to_desc(desc_ring, head_id);
> >> > +
> >> > +id = DESC_ID(head_id + 1);
> >> > +id_prev_wrap = DESC_ID_PREV_WRAP(desc_ring, id);
> >> > +
> >> > +if (id_prev_wrap == atomic_read(_ring->tail_id)) {
> >> > +if (!desc_push_tail(rb, id_prev_wrap))
> >> > +return false;
> >> > +}
> >> > +} while (!atomic_try_cmpxchg(_ring->head_id, _id, 
> >> > id));
> >> 
> >> Hmm, in theory, ABA problem might cause that we successfully
> >> move desc_ring->head_id when tail has not been pushed yet.
> >> 
> >> As a result we would never call desc_push_tail() until
> >> it overflows again.
> >> 
> >> I am not sure if we need to take care of it. The code is called with
> >> interrupts disabled. IMHO, only NMI could cause ABA problem
> >> in reality. But the game (debugging) is lost anyway when NMI ovewrites
> >> the buffer several times.
> >
> > BTW: If I am counting correctly. The ABA problem would happen when
> > exactly 2^30 (1G) messages is written in the mean time.
> 
> All the ringbuffer code assumes that the use of index numbers handles
> the ABA problem (i.e. there must not be 1 billion printk's within an
> NMI). If we want to support 1 billion+ printk's within an NMI, then
> perhaps the index number should be increased. For 64-bit systems it
> would be no problem to go to 62 bits. For 32-bit systems, I don't know
> how well the 64-bit atomic operations are supported.

I am not super happy but I really think that it does not make sense
to complicate the code too much because of this theoretical race.

1 billion printks in NMI is crazy. Also the race is a problem only
when we hit exactly the 2^30 message. If we hit 2^30 + 1 and more
than everything is fine again.

In the worst case, printk() will stop working. I think that there
are other situations that are much more likely when printk() will
not work (people will not see the messages).


> >> Also it should not be a complete failure. We still could bail out when
> >> the previous state was not reusable. We are on the safe side
> >> when it was reusable.
> >> 
> >> Also we could probably detect when desc_ring->tail_id is not
> >> updated any longer and do something about it.
> >> 
> >> > +
> >> > +desc = to_desc(desc_ring, id);
> >> > +
> >> > +/* Set the descriptor's ID and also set its state to reserved. 
> >> > */
> >> > +atomic_set(>state_var, id | 0);
> >> 
> >> We should check here that the original state id from the previous
> >> wrap in reusable state (or 0 for bootstrap). On error, we know that
> >> there was the ABA problem and would need to do something about it.
> >
> > I believe that it should be enough to add this check and
> > bail out in case of error.
> 
> I need to go through the code again in detail and see how many locations
> are affected by ABA. All the code was written with the assumption that
> this type of ABA will not happen.
>
> As you've stated, we could add minimal handling so that the ringbuffer
> at least does not break or get stuck.
> 
> BTW: The same assumption is made for logical positions. There are 4
> times as many of these (on 32-bit systems) but logical positions advance
> much faster. I will review these as well.

The logical positions are assigned only when a descriptor is reserved.
Such a descriptor could not be reused until committed and marked
reusable. It limits the logical position movement to:

   max_record_size * number_of_descriptors

Printk records are limited to 1k. So we could safely support
up to 1 million fully sized lines printed when NMI interrupted
printk() on one CPU.

The most important thing is that it must not crash the system.
It would be nice if we are able to detect this situation and
do something about it. But IMHO, it is perfectly fine when
printk() would stop working in this corner case.

The only problem is that it might be hard to debug. But it
should be possible with crashdump. And I think that we will
first hit some other bugs that we do not see at the moment ;-)

Best Regards,
Petr

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec