Re: kexec_file overwrites reserved EFI ESRT memory

2019-12-03 Thread Dave Young
On 12/03/19 at 10:11pm, Michael Weiser wrote:
> Hi Dave,
> 
> On Tue, Dec 03, 2019 at 07:54:35PM +0800, Dave Young wrote:
> 
> > > Neither adding add_efi_memmap nor adding your patch and setting that option
> > > does make the ESRT memory region appear in /proc/iomem. kexec_file still
> > > loads the kernel across the ESRT region.
> > Hmm, sorry, my bad, actually add_efi_memmap does not consider the
> > EFI_MEMORY_RUNTIME attribute, it only reads the memory descriptor types.
> 
> > Will read your replied information later, did not get time today, but
> > probably below chunk can help?
> 
> > diff --git a/arch/x86/platform/efi/quirks.c b/arch/x86/platform/efi/quirks.c
> > index 3b9fd679cea9..516307617621 100644
> > --- a/arch/x86/platform/efi/quirks.c
> > +++ b/arch/x86/platform/efi/quirks.c
> > @@ -293,6 +293,8 @@ void __init efi_arch_mem_reserve(phys_addr_t addr, u64 size)
> >  	early_memunmap(new, new_size);
> >  
> >  	efi_memmap_install(new_phys, num_entries);
> > +	e820__range_update(addr, size, E820_TYPE_RAM, E820_TYPE_RESERVED);
> > +	e820__update_table(e820_table);
> >  }
> 
> >  /*
> 
> Yes, that did it:
> 
> 00000000-00000fff : Reserved
> 00001000-0009efff : System RAM
> 0009f000-000fffff : Reserved
>   000a0000-000bffff : PCI Bus 0000:00
>   000e0000-000e3fff : PCI Bus 0000:00
>   000e4000-000e7fff : PCI Bus 0000:00
>   000e8000-000ebfff : PCI Bus 0000:00
>   000ec000-000effff : PCI Bus 0000:00
>   000f0000-000fffff : PCI Bus 0000:00
> 000f0000-000fffff : System ROM
> 00100000-74dd1fff : System RAM
>   65000000-6affffff : Crash kernel
> 74dd2000-74dd2fff : Reserved   <- ESRT
> 74dd3000-763f5fff : System RAM
> 763f6000-79974fff : Reserved
> 79975000-799f1fff : ACPI Tables
> 799f2000-79aa6fff : ACPI Non-volatile Storage
>   79a17000-79a17fff : USBC000:00

OK, good to know it works.  I will think about it and send a patch
later.  There are more things to consider, e.g. multiple consecutive kexec
reboots, the userspace kexec loader, etc.

If we chose to fix it in the kexec_file path by avoiding those regions, we
would need to do the same in userspace, and there would be compatibility
issues, so I would still prefer the approach you tested.
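
To make the idea concrete, here is a minimal sketch of what the tested
change amounts to, under stated assumptions: the reservation is widened to
whole pages and the surrounding RAM range is split around it. The names
page_span(), carve_reserved() and struct range are hypothetical and are not
the kernel's e820 implementation; the addresses are the ones reported in
this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 0x1000ULL	/* assumption: 4 KiB pages */

enum range_type { RANGE_RAM, RANGE_RESERVED };

struct range {
	uint64_t start;		/* inclusive */
	uint64_t end;		/* inclusive */
	enum range_type type;
};

/* Widen [addr, addr + size) to page boundaries; results are inclusive. */
static void page_span(uint64_t addr, uint64_t size,
		      uint64_t *first, uint64_t *last)
{
	assert(size > 0);
	*first = addr & ~(SKETCH_PAGE_SIZE - 1);
	*last = ((addr + size + SKETCH_PAGE_SIZE - 1) &
		 ~(SKETCH_PAGE_SIZE - 1)) - 1;
}

/*
 * Split one RAM range around [first, last] and mark the middle reserved.
 * Writes up to three ranges to out[] and returns how many were written.
 * Assumes [first, last] lies entirely inside the RAM range.
 */
static size_t carve_reserved(struct range ram, uint64_t first, uint64_t last,
			     struct range out[3])
{
	size_t n = 0;

	assert(ram.type == RANGE_RAM);
	assert(first <= last && first >= ram.start && last <= ram.end);

	if (first > ram.start) {
		out[n].start = ram.start;
		out[n].end = first - 1;
		out[n].type = RANGE_RAM;
		n++;
	}
	out[n].start = first;
	out[n].end = last;
	out[n].type = RANGE_RESERVED;
	n++;
	if (last < ram.end) {
		out[n].start = last + 1;
		out[n].end = ram.end;
		out[n].type = RANGE_RAM;
		n++;
	}
	return n;
}
```

For the ESRT at 0x74dd2f98 (0x38 bytes) inside a RAM range of
0x00100000-0x763f5fff this yields RAM up to 0x74dd1fff, a reserved page
0x74dd2000-0x74dd2fff, and RAM from 0x74dd3000 on, which is the layout
/proc/iomem shows with the patch applied.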

BTW, on my laptop the ESRT sits in an EFI runtime area, so I do not see the
problem.  This should be machine/firmware specific.

Here is the info on my laptop:
[0.000000] efi: mem34: [Runtime Data   |RUN|  |  |  |  |  |  |   |WB|WT|WC|UC] range=[0x7a4b0000-0x7a676fff] (1MB)
[0.020670] esrt: Reserving ESRT space from 0x7a4ec000 to 0x7a4ec088.

Thanks
Dave


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: kexec_file overwrites reserved EFI ESRT memory

2019-12-03 Thread Michael Weiser
Hi Dave,

On Tue, Dec 03, 2019 at 07:54:35PM +0800, Dave Young wrote:

> > Neither adding add_efi_memmap nor adding your patch and setting that option
> > does make the ESRT memory region appear in /proc/iomem. kexec_file still
> > loads the kernel across the ESRT region.
> Hmm, sorry, my bad, actually add_efi_memmap does not consider the
> EFI_MEMORY_RUNTIME attribute, it only reads the memory descriptor types.

> Will read your replied information later, did not get time today, but
> probably below chunk can help?

> diff --git a/arch/x86/platform/efi/quirks.c b/arch/x86/platform/efi/quirks.c
> index 3b9fd679cea9..516307617621 100644
> --- a/arch/x86/platform/efi/quirks.c
> +++ b/arch/x86/platform/efi/quirks.c
> @@ -293,6 +293,8 @@ void __init efi_arch_mem_reserve(phys_addr_t addr, u64 size)
>   early_memunmap(new, new_size);

>   efi_memmap_install(new_phys, num_entries);
> + e820__range_update(addr, size, E820_TYPE_RAM, E820_TYPE_RESERVED);
> + e820__update_table(e820_table);
>  }

>  /*

Yes, that did it:

00000000-00000fff : Reserved
00001000-0009efff : System RAM
0009f000-000fffff : Reserved
  000a0000-000bffff : PCI Bus 0000:00
  000e0000-000e3fff : PCI Bus 0000:00
  000e4000-000e7fff : PCI Bus 0000:00
  000e8000-000ebfff : PCI Bus 0000:00
  000ec000-000effff : PCI Bus 0000:00
  000f0000-000fffff : PCI Bus 0000:00
000f0000-000fffff : System ROM
00100000-74dd1fff : System RAM
  65000000-6affffff : Crash kernel
74dd2000-74dd2fff : Reserved   <- ESRT
74dd3000-763f5fff : System RAM
763f6000-79974fff : Reserved
79975000-799f1fff : ACPI Tables
799f2000-79aa6fff : ACPI Non-volatile Storage
  79a17000-79a17fff : USBC000:00

[0.001381] esrt: Reserving ESRT space from 0x74dd2f98 to 0x74dd2fd0.
[0.001382] memblock_reserve: [0x74dd2f98-0x74dd2fcf] efi_mem_reserve+0x1d/0x2b
[0.001383] memblock_reserve: [0x0009e640-0x0009efcf] memblock_alloc_range_nid+0x93/0xfa
[0.001384] e820: update [mem 0x74dd2000-0x74dd2fff] usable ==> reserved
[...]
[0.043610] PM: Registered nosave memory: [mem 0x00000000-0x00000fff]
[0.043611] memblock_alloc_try_nid: 32 bytes align=0x40 nid=-1 from=0x0000000000000000 max_addr=0x0000000000000000 __register_nosave_region+0x6b/0xca
[0.043612] memblock_reserve: [0x00047dff95c0-0x00047dff95df] memblock_alloc_range_nid+0x93/0xfa
[0.043613] PM: Registered nosave memory: [mem 0x0009f000-0x000fffff]
[0.043615] memblock_alloc_try_nid: 32 bytes align=0x40 nid=-1 from=0x0000000000000000 max_addr=0x0000000000000000 __register_nosave_region+0x6b/0xca
[0.043616] memblock_reserve: [0x00047dff9580-0x00047dff959f] memblock_alloc_range_nid+0x93/0xfa
[0.043617] PM: Registered nosave memory: [mem 0x74dd2000-0x74dd2fff]   <- ESRT
[0.043618] memblock_alloc_try_nid: 32 bytes align=0x40 nid=-1 from=0x0000000000000000 max_addr=0x0000000000000000 __register_nosave_region+0x6b/0xca
[0.043619] memblock_reserve: [0x00047dff9540-0x00047dff955f] memblock_alloc_range_nid+0x93/0xfa
[0.043620] PM: Registered nosave memory: [mem 0x763f6000-0x79974fff]
[0.043620] PM: Registered nosave memory: [mem 0x79975000-0x799f1fff]
[0.043621] PM: Registered nosave memory: [mem 0x799f2000-0x79aa6fff]
[0.043621] PM: Registered nosave memory: [mem 0x79aa7000-0x7a40dfff]
[...]
[5.993928] PCI: pci_cache_line_size set to 64 bytes
[5.994563] e820: reserve RAM buffer [mem 0x0009f000-0x0009ffff]
[5.994565] e820: reserve RAM buffer [mem 0x74dd2000-0x77ffffff]   <- ESRT
[5.994565] e820: reserve RAM buffer [mem 0x763f6000-0x77ffffff]
[5.994566] e820: reserve RAM buffer [mem 0x7a40f000-0x7bffffff]
[5.994567] e820: reserve RAM buffer [mem 0x47e000000-0x47fffffff]
[5.995513] acpi PNP0C14:02: duplicate WMI GUID 05901221-D566-11D1-B2F0-00A0C9062910 (first instance was on PNP0C14:01)
[5.995549] acpi PNP0C14:03: duplicate WMI GUID 05901221-D566-11D1-B2F0-00A0C9062910 (first instance was on PNP0C14:01)
[...]
[   86.508053] kexec-bzImage64: Loaded purgatory at 0x98000
[   86.508056] kexec_file: Considering 0x1000-0x9efff
[   86.508057] kexec-bzImage64: Loaded boot_param, command line and misc at 0x96000 bufsz=0x1240 memsz=0x1240
[   86.508057] kexec_file: Considering 0x100000-0x74dd1fff
[   86.508058] kexec-bzImage64: Loaded 64bit kernel at 0x72000000 bufsz=0x1140888 memsz=0x24b7000
[   86.508058] kexec-bzImage64: Final command line is: 
[   86.584668] kexec_file: Loading segment 0: buf=0xd5ec82bc bufsz=0x5000 mem=0x98000 memsz=0x6000
[   86.584672] kexec_file: Loading segment 1: buf=0xaf539c69 bufsz=0x1240 mem=0x96000 memsz=0x2000
[   86.584674] kexec_file: Loading segment 2: buf=0x29f9b9a8 bufsz=0x1140888 mem=0x72000000 memsz=0x24b7000   <- not ESRT :)

And no more invalid version error message from the kexec'd kernel.
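
As a quick sanity check of the two placements, here is a small sketch that
tests whether a loaded segment overlaps the ESRT page. The helper name
overlaps() is made up for illustration; the addresses are taken from the
kexec_file debug output before and after the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if [a_start, a_start + a_size) and [b_start, b_start + b_size) share any byte. */
static bool overlaps(uint64_t a_start, uint64_t a_size,
		     uint64_t b_start, uint64_t b_size)
{
	assert(a_size > 0 && b_size > 0);
	return a_start < b_start + b_size && b_start < a_start + a_size;
}
```

Before the patch, segment 2 at 0x73000000 with memsz 0x24be000 ends at
0x754be000 and therefore covers the ESRT page at 0x74dd2000. With the patch,
segment 2 at 0x72000000 with memsz 0x24b7000 ends at 0x744b7000, below the
ESRT page.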
-- 
Thanks,
Michael


Re: kexec_file overwrites reserved EFI ESRT memory

2019-12-03 Thread Dave Young
On 12/03/19 at 10:01am, Ard Biesheuvel wrote:
> On Mon, 2 Dec 2019 at 09:05, Dave Young  wrote:
> >
> > Add more cc
> > On 12/02/19 at 04:58pm, Dave Young wrote:
> > > On 11/29/19 at 04:27pm, Michael Weiser wrote:
> > > > Hello Dave,
> > > >
> > > > On Mon, Nov 25, 2019 at 01:52:01PM +0800, Dave Young wrote:
> > > >
> > > > > > > Fundamentally when deciding where to place a new kernel kexec 
> > > > > > > (either
> > > > > > > user space or the in kernel kexec_file implementation) needs to 
> > > > > > > be able
> > > > > > > to ask the question which memory ares are reserved.
> > > > [...]
> > > > > > > So my question is why doesn't the ESRT reservation wind up in
> > > > > > > /proc/iomem?
> > > > > >
> > > > > > My guess is that the focus was that some EFI structures need to be 
> > > > > > kept
> > > > > > around accross the life cycle of *one* running kernel and
> > > > > > memblock_reserve() was enough for that. Marking them so they survive
> > > > > > kexecing another kernel might just never have cropped up thus far. 
> > > > > > Ard
> > > > > > or Matt would know.
> > > > > Can you check your un-reserved memory, if your memory falls into EFI
> > > > > BOOT* then in X86 you can use something like below if it is not 
> > > > > covered:
> > > >
> > > > > void __init efi_esrt_init(void)
> > > > > {
> > > > > ...
> > > > > >   pr_info("Reserving ESRT space from %pa to %pa.\n", &esrt_data, &end);
> > > > > >   if (md.type == EFI_BOOT_SERVICES_DATA)
> > > > > >           efi_mem_reserve(esrt_data, esrt_data_size);
> > > > > ...
> > > > > }
> > > >
> > > > Please bear with me if I'm a bit slow on the uptake here: On my machine,
> > > > the esrt module reports at boot:
> > > >
> > > > [0.001244] esrt: Reserving ESRT space from 0x74dd2f98 to 
> > > > 0x74dd2fd0.
> > > >
> > > > This area is of type "Boot Data" (== BOOT_SERVICES_DATA) which makes the
> > > > code you quote reserve it using memblock_reserve() shown by
> > > > memblock=debug:
> > > >
> > > > [0.001246] memblock_reserve: 
> > > > [0x74dd2f98-0x74dd2fcf] efi_mem_reserve+0x1d/0x2b
> > > >
> > > > It also calls into arch/x86/platform/efi/quirks.c:efi_arch_mem_reserve()
> > > > which tags it as EFI_MEMORY_RUNTIME while the surrounding ones aren't
> > > > as shown by efi=debug:
> > > >
> > > > [0.178111] efi: mem10: [Boot Data  |   |  |  |  |  |  |  |  
> > > > |   |WB|WT|WC|UC] range=[0x74dd3000-0x75becfff] (14MB)
> > > > [0.178113] efi: mem11: [Boot Data  |RUN|  |  |  |  |  |  |  
> > > > |   |WB|WT|WC|UC] range=[0x74dd2000-0x74dd2fff] (0MB)
> > > > [0.178114] efi: mem12: [Boot Data  |   |  |  |  |  |  |  |  
> > > > |   |WB|WT|WC|UC] range=[0x6d635000-0x74dd1fff] (119MB)
> > > >
> > > > This prevents arch/x86/platform/efi/quirks.c:efi_free_boot_services()
> > > > from calling __memblock_free_late() on it. And indeed, memblock=debug 
> > > > does
> > > > not report this area as being free'd while the surrounding ones are:
> > > >
> > > > [0.178369] __memblock_free_late: 
> > > > [0x74dd3000-0x75becfff] 
> > > > efi_free_boot_services+0x126/0x1f8
> > > > [0.178658] __memblock_free_late: 
> > > > [0x6d635000-0x74dd1fff] 
> > > > efi_free_boot_services+0x126/0x1f8
> > > >
> > > > The esrt area does not show up in /proc/iomem though:
> > > >
> > > > 0010-763f5fff : System RAM
> > > >   6200-62a00d80 : Kernel code
> > > >   62c0-62f15fff : Kernel rodata
> > > >   6300-630ea8bf : Kernel data
> > > >   63fed000-641f : Kernel bss
> > > >   6500-6aff : Crash kernel
> > > >
> > > > And thus kexec loads the new kernel right over that area as shown when
> > > > enabling -DDEBUG on kexec_file.c (0x74dd3000 being inbetween 0x7300
> > > > and 0x7300+0x24be000 = 0x754be000):
> > > >
> > > > [  650.007695] kexec_file: Loading segment 0: buf=0x3a9c84d6 
> > > > bufsz=0x5000 mem=0x98000 memsz=0x6000
> > > > [  650.007699] kexec_file: Loading segment 1: buf=0x17b2b9e6 
> > > > bufsz=0x1240 mem=0x96000 memsz=0x2000
> > > > [  650.007703] kexec_file: Loading segment 2: buf=0xfdf72ba2 
> > > > bufsz=0x1150888 mem=0x7300 memsz=0x24be000
> > > >
> > > > ... because it looks for any memory hole large enough in iomem resources
> > > > tagged as System RAM, which 0x74dd2000-0x74dd2fff would then need to be
> > > > excluded from on my system.
> > > >
> > > > Looking some more at efi_arch_mem_reserve() I see that it also registers
> > > > the area with efi.memmap and installs it using efi_memmap_install().
> > > > which seems to call memremap(MEMREMAP_WB) on it. From my understanding
> > > > of the comments in the source of memremap(), MEMREMAP_WB does 
> > > > specifically
> > > > *not* reserve that memory in any way.
> > > >
> > > > > Unfortunately I noticed there are different requirements/ways for
> > > > > different types of "reserved" memory.  But 

Re: kexec_file overwrites reserved EFI ESRT memory

2019-12-03 Thread Dave Young
On 12/03/19 at 12:45am, Michael Weiser wrote:
> Hi Dave,
> 
> On Mon, Dec 02, 2019 at 05:05:20PM +0800, Dave Young wrote:
> 
> > > It seems a serious problem, the EFI modified memmap does not get an
> > > /proc/iomem resource update, but kexec_file relies on /proc/iomem in
> > > X86.
> > > 
> > > There is an question from Sai about why add_efi_memmap is not enabled by
> > > default:
> > > https://www.spinics.net/lists/linux-mm/msg185166.html
> 
> Incidentally, a data point I did not think to mention: I do boot the
> kernel as EFI application directly from the firmware as a boot entry
> with compiled in initrd and command line:
> 
> $ grep EFI nobak/kernel/linux/.config
> CONFIG_EFI=y
> CONFIG_EFI_STUB=y
> # CONFIG_EFI_MIXED is not set
> CONFIG_DMI_SCAN_MACHINE_NON_EFI_FALLBACK=y
> # EFI (Extensible Firmware Interface) Support
> CONFIG_EFI_VARS=m
> CONFIG_EFI_ESRT=y
> CONFIG_EFI_VARS_PSTORE=m
> # CONFIG_EFI_VARS_PSTORE_DEFAULT_DISABLE is not set
> CONFIG_EFI_RUNTIME_MAP=y
> # CONFIG_EFI_FAKE_MEMMAP is not set
> CONFIG_EFI_RUNTIME_WRAPPERS=y
> # CONFIG_EFI_BOOTLOADER_CONTROL is not set
> # CONFIG_EFI_CAPSULE_LOADER is not set
> # CONFIG_EFI_TEST is not set
> # CONFIG_EFI_RCI2_TABLE is not set
> # end of EFI (Extensible Firmware Interface) Support
> CONFIG_UEFI_CPER=y
> CONFIG_UEFI_CPER_X86=y
> CONFIG_EFI_EARLYCON=y
> CONFIG_EFI_PARTITION=y
> CONFIG_FB_EFI=y
> CONFIG_EFIVAR_FS=y
> # CONFIG_EFI_PGT_DUMP is not set
> 
> $ grep CMDLINE nobak/kernel/linux/.config
> CONFIG_CMDLINE_BOOL=y
> CONFIG_CMDLINE="root=UUID=97[...]e4 rd.luks.uuid=8a[...]c3 
> rd.luks.allow-discards=8a[...]c3 mem_sleep_default=deep resume=UUID=97[...]e4 
> resume_offset=96256 efi=debug memblock=debug"
> CONFIG_CMDLINE_OVERRIDE=y
> # CONFIG_BLK_CMDLINE_PARSER is not set
> # CONFIG_CMDLINE_PARTITION is not set
> CONFIG_FB_CMDLINE=y
> 
> $ efibootmgr -v
> BootCurrent: 000A
> Timeout: 2 seconds
> BootOrder: 000A,0009,0008,0005,0007,0006,0004,0002,0001,,0003
> [...]
> Boot0005* gentoo-5.4.0-next-20191127+-clear
> HD(1,GPT,e7[...]f2,0x800,0x64000)/File(\kernel-5.4.0-next-20191127+-clear)
> [...]
> Boot000A* gentoo-5.4.1-gentoo
> HD(1,GPT,e7[...]f2,0x800,0x64000)/File(\kernel-5.4.1-gentoo)
> 
> So there's no boot loader that could construct an e820 table for the
> kernel to consume. I understand it's then up to the EFI stub to come up
> with a e820 table from the EFI memory map.
> 
> > > Long time ago the add_efi_memmap is only enabled in case we explict
> > > enable it on cmdline, I'm not sure if we can do it by default, maybe we
> > > should.   Need opinion from X86 maintainers..
> > > Can you try below diff see if it works for you? (not tested, and need
> > > explicitly 'add_efi_memmap' in kernel cmdline param)
> 
> Neither adding add_efi_memmap nor adding your patch and setting that option
> does make the ESRT memory region appear in /proc/iomem. kexec_file still
> loads the kernel across the ESRT region.
> 

Hmm, sorry, my bad, actually add_efi_memmap does not consider the
EFI_MEMORY_RUNTIME attribute, it only reads the memory descriptor types.
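
Roughly, the difference is the following. This is only a simplified sketch
with made-up names (enum desc_type, struct desc, usable_by_type() and
usable_by_type_and_attr() are hypothetical), not the actual add_efi_memmap
code; the only value taken from the UEFI spec is EFI_MEMORY_RUNTIME being
bit 63 of the attribute field.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, trimmed-down stand-ins for EFI memory descriptor fields. */
enum desc_type {
	DESC_BOOT_SERVICES_DATA,
	DESC_CONVENTIONAL_MEMORY,
	DESC_RUNTIME_SERVICES_DATA,
};

#define ATTR_RUNTIME (1ULL << 63)	/* same bit as EFI_MEMORY_RUNTIME */

struct desc {
	enum desc_type type;
	uint64_t attribute;
};

/* Type-only decision: boot services data counts as usable RAM. */
static bool usable_by_type(const struct desc *d)
{
	assert(d != NULL);
	return d->type != DESC_RUNTIME_SERVICES_DATA;
}

/* Type plus attribute: anything tagged RUNTIME stays reserved. */
static bool usable_by_type_and_attr(const struct desc *d)
{
	return usable_by_type(d) && !(d->attribute & ATTR_RUNTIME);
}
```

A "Boot Data" descriptor that efi_arch_mem_reserve() has tagged with the
RUNTIME attribute is still usable RAM under the first check and only becomes
reserved under the second one.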

Will read your replied information later, did not get time today, but
probably below chunk can help?

diff --git a/arch/x86/platform/efi/quirks.c b/arch/x86/platform/efi/quirks.c
index 3b9fd679cea9..516307617621 100644
--- a/arch/x86/platform/efi/quirks.c
+++ b/arch/x86/platform/efi/quirks.c
@@ -293,6 +293,8 @@ void __init efi_arch_mem_reserve(phys_addr_t addr, u64 size)
 	early_memunmap(new, new_size);
 
 	efi_memmap_install(new_phys, num_entries);
+	e820__range_update(addr, size, E820_TYPE_RAM, E820_TYPE_RESERVED);
+	e820__update_table(e820_table);
 }
 
 /*

Thanks
Dave




Re: kexec_file overwrites reserved EFI ESRT memory

2019-12-03 Thread Ard Biesheuvel
On Mon, 2 Dec 2019 at 09:05, Dave Young  wrote:
>
> Add more cc
> On 12/02/19 at 04:58pm, Dave Young wrote:
> > On 11/29/19 at 04:27pm, Michael Weiser wrote:
> > > Hello Dave,
> > >
> > > On Mon, Nov 25, 2019 at 01:52:01PM +0800, Dave Young wrote:
> > >
> > > > > > Fundamentally when deciding where to place a new kernel kexec 
> > > > > > (either
> > > > > > user space or the in kernel kexec_file implementation) needs to be 
> > > > > > able
> > > > > > to ask the question which memory ares are reserved.
> > > [...]
> > > > > > So my question is why doesn't the ESRT reservation wind up in
> > > > > > /proc/iomem?
> > > > >
> > > > > My guess is that the focus was that some EFI structures need to be 
> > > > > kept
> > > > > around accross the life cycle of *one* running kernel and
> > > > > memblock_reserve() was enough for that. Marking them so they survive
> > > > > kexecing another kernel might just never have cropped up thus far. Ard
> > > > > or Matt would know.
> > > > Can you check your un-reserved memory, if your memory falls into EFI
> > > > BOOT* then in X86 you can use something like below if it is not covered:
> > >
> > > > void __init efi_esrt_init(void)
> > > > {
> > > > ...
> > > > >   pr_info("Reserving ESRT space from %pa to %pa.\n", &esrt_data, &end);
> > > > >   if (md.type == EFI_BOOT_SERVICES_DATA)
> > > > >           efi_mem_reserve(esrt_data, esrt_data_size);
> > > > ...
> > > > }
> > >
> > > Please bear with me if I'm a bit slow on the uptake here: On my machine,
> > > the esrt module reports at boot:
> > >
> > > [0.001244] esrt: Reserving ESRT space from 0x74dd2f98 to 
> > > 0x74dd2fd0.
> > >
> > > This area is of type "Boot Data" (== BOOT_SERVICES_DATA) which makes the
> > > code you quote reserve it using memblock_reserve() shown by
> > > memblock=debug:
> > >
> > > [0.001246] memblock_reserve: [0x74dd2f98-0x74dd2fcf] 
> > > efi_mem_reserve+0x1d/0x2b
> > >
> > > It also calls into arch/x86/platform/efi/quirks.c:efi_arch_mem_reserve()
> > > which tags it as EFI_MEMORY_RUNTIME while the surrounding ones aren't
> > > as shown by efi=debug:
> > >
> > > [0.178111] efi: mem10: [Boot Data  |   |  |  |  |  |  |  |  | 
> > >   |WB|WT|WC|UC] range=[0x74dd3000-0x75becfff] (14MB)
> > > [0.178113] efi: mem11: [Boot Data  |RUN|  |  |  |  |  |  |  | 
> > >   |WB|WT|WC|UC] range=[0x74dd2000-0x74dd2fff] (0MB)
> > > [0.178114] efi: mem12: [Boot Data  |   |  |  |  |  |  |  |  | 
> > >   |WB|WT|WC|UC] range=[0x6d635000-0x74dd1fff] (119MB)
> > >
> > > This prevents arch/x86/platform/efi/quirks.c:efi_free_boot_services()
> > > from calling __memblock_free_late() on it. And indeed, memblock=debug does
> > > not report this area as being free'd while the surrounding ones are:
> > >
> > > [0.178369] __memblock_free_late: 
> > > [0x74dd3000-0x75becfff] efi_free_boot_services+0x126/0x1f8
> > > [0.178658] __memblock_free_late: 
> > > [0x6d635000-0x74dd1fff] efi_free_boot_services+0x126/0x1f8
> > >
> > > The esrt area does not show up in /proc/iomem though:
> > >
> > > 0010-763f5fff : System RAM
> > >   6200-62a00d80 : Kernel code
> > >   62c0-62f15fff : Kernel rodata
> > >   6300-630ea8bf : Kernel data
> > >   63fed000-641f : Kernel bss
> > >   6500-6aff : Crash kernel
> > >
> > > And thus kexec loads the new kernel right over that area as shown when
> > > enabling -DDEBUG on kexec_file.c (0x74dd3000 being inbetween 0x7300
> > > and 0x7300+0x24be000 = 0x754be000):
> > >
> > > [  650.007695] kexec_file: Loading segment 0: buf=0x3a9c84d6 
> > > bufsz=0x5000 mem=0x98000 memsz=0x6000
> > > [  650.007699] kexec_file: Loading segment 1: buf=0x17b2b9e6 
> > > bufsz=0x1240 mem=0x96000 memsz=0x2000
> > > [  650.007703] kexec_file: Loading segment 2: buf=0xfdf72ba2 
> > > bufsz=0x1150888 mem=0x7300 memsz=0x24be000
> > >
> > > ... because it looks for any memory hole large enough in iomem resources
> > > tagged as System RAM, which 0x74dd2000-0x74dd2fff would then need to be
> > > excluded from on my system.
> > >
> > > Looking some more at efi_arch_mem_reserve() I see that it also registers
> > > the area with efi.memmap and installs it using efi_memmap_install().
> > > which seems to call memremap(MEMREMAP_WB) on it. From my understanding
> > > of the comments in the source of memremap(), MEMREMAP_WB does specifically
> > > *not* reserve that memory in any way.
> > >
> > > > Unfortunately I noticed there are different requirements/ways for
> > > > different types of "reserved" memory.  But that is another topic..
> > >
> > > I tried to reserve the area with something like this:
> > >
> > > t a/arch/x86/platform/efi/quirks.c b/arch/x86/platform/efi/quirks.c
> > > index 4de244683a7e..b86a5df027a2 100644
> > > --- a/arch/x86/platform/efi/quirks.c
> > > +++ b/arch/x86/platform/efi/quirks.c
> > > @@ -249,6 

Re: kexec_file overwrites reserved EFI ESRT memory

2019-12-02 Thread Michael Weiser
Hi Dave,

On Mon, Dec 02, 2019 at 05:05:20PM +0800, Dave Young wrote:

> > It seems a serious problem, the EFI modified memmap does not get an
> > /proc/iomem resource update, but kexec_file relies on /proc/iomem in
> > X86.
> > 
> > There is an question from Sai about why add_efi_memmap is not enabled by
> > default:
> > https://www.spinics.net/lists/linux-mm/msg185166.html

Incidentally, a data point I did not think to mention: I do boot the
kernel as EFI application directly from the firmware as a boot entry
with compiled in initrd and command line:

$ grep EFI nobak/kernel/linux/.config
CONFIG_EFI=y
CONFIG_EFI_STUB=y
# CONFIG_EFI_MIXED is not set
CONFIG_DMI_SCAN_MACHINE_NON_EFI_FALLBACK=y
# EFI (Extensible Firmware Interface) Support
CONFIG_EFI_VARS=m
CONFIG_EFI_ESRT=y
CONFIG_EFI_VARS_PSTORE=m
# CONFIG_EFI_VARS_PSTORE_DEFAULT_DISABLE is not set
CONFIG_EFI_RUNTIME_MAP=y
# CONFIG_EFI_FAKE_MEMMAP is not set
CONFIG_EFI_RUNTIME_WRAPPERS=y
# CONFIG_EFI_BOOTLOADER_CONTROL is not set
# CONFIG_EFI_CAPSULE_LOADER is not set
# CONFIG_EFI_TEST is not set
# CONFIG_EFI_RCI2_TABLE is not set
# end of EFI (Extensible Firmware Interface) Support
CONFIG_UEFI_CPER=y
CONFIG_UEFI_CPER_X86=y
CONFIG_EFI_EARLYCON=y
CONFIG_EFI_PARTITION=y
CONFIG_FB_EFI=y
CONFIG_EFIVAR_FS=y
# CONFIG_EFI_PGT_DUMP is not set

$ grep CMDLINE nobak/kernel/linux/.config
CONFIG_CMDLINE_BOOL=y
CONFIG_CMDLINE="root=UUID=97[...]e4 rd.luks.uuid=8a[...]c3 rd.luks.allow-discards=8a[...]c3 mem_sleep_default=deep resume=UUID=97[...]e4 resume_offset=96256 efi=debug memblock=debug"
CONFIG_CMDLINE_OVERRIDE=y
# CONFIG_BLK_CMDLINE_PARSER is not set
# CONFIG_CMDLINE_PARTITION is not set
CONFIG_FB_CMDLINE=y

$ efibootmgr -v
BootCurrent: 000A
Timeout: 2 seconds
BootOrder: 000A,0009,0008,0005,0007,0006,0004,0002,0001,0000,0003
[...]
Boot0005* gentoo-5.4.0-next-20191127+-clear HD(1,GPT,e7[...]f2,0x800,0x64000)/File(\kernel-5.4.0-next-20191127+-clear)
[...]
Boot000A* gentoo-5.4.1-gentoo HD(1,GPT,e7[...]f2,0x800,0x64000)/File(\kernel-5.4.1-gentoo)

So there's no boot loader that could construct an e820 table for the
kernel to consume. As I understand it, it's then up to the EFI stub to
derive an e820 table from the EFI memory map.

> > Long time ago the add_efi_memmap is only enabled in case we explict
> > enable it on cmdline, I'm not sure if we can do it by default, maybe we
> > should.   Need opinion from X86 maintainers..
> > Can you try below diff see if it works for you? (not tested, and need
> > explicitly 'add_efi_memmap' in kernel cmdline param)

Neither adding add_efi_memmap nor applying your patch and setting that option
makes the ESRT memory region appear in /proc/iomem. kexec_file still
loads the kernel across the ESRT region.

What occurs to me is that nowhere does the ESRT memory region appear in
any externally provided memory map. Neither e820 nor EFI seem to declare
it. Is that expected or a bug of my particular system?

For example, the e820 map (derived from the EFI map by the EFI stub?)
has these regions:

BIOS-provided physical RAM map:
BIOS-e820: [mem 0x0000000000000000-0x000000000009efff] usable
BIOS-e820: [mem 0x000000000009f000-0x00000000000fffff] reserved
BIOS-e820: [mem 0x0000000000100000-0x00000000763f5fff] usable
BIOS-e820: [mem 0x00000000763f6000-0x0000000079974fff] reserved
BIOS-e820: [mem 0x0000000079975000-0x00000000799f1fff] ACPI data
BIOS-e820: [mem 0x00000000799f2000-0x0000000079aa6fff] ACPI NVS
BIOS-e820: [mem 0x0000000079aa7000-0x000000007a40dfff] reserved
BIOS-e820: [mem 0x000000007a40e000-0x000000007a40efff] usable
BIOS-e820: [mem 0x000000007a40f000-0x000000007fffffff] reserved
BIOS-e820: [mem 0x00000000f0000000-0x00000000f7ffffff] reserved
BIOS-e820: [mem 0x00000000fe000000-0x00000000fe010fff] reserved
BIOS-e820: [mem 0x00000000fec00000-0x00000000fec00fff] reserved
BIOS-e820: [mem 0x00000000fed00000-0x00000000fed03fff] reserved
BIOS-e820: [mem 0x00000000fee00000-0x00000000fee00fff] reserved
BIOS-e820: [mem 0x00000000ff000000-0x00000000ffffffff] reserved
BIOS-e820: [mem 0x0000000100000000-0x000000047dffffff] usable

The ESRT region sits smack in the middle of a large system RAM region:

BIOS-e820: [mem 0x0000000000100000-0x00000000763f5fff] usable

Consequently, the relevant part of /proc/iomem looks like this:

00000000-00000fff : Reserved
00001000-0009efff : System RAM
0009f000-000fffff : Reserved
  000a0000-000bffff : PCI Bus 0000:00
  000e0000-000e3fff : PCI Bus 0000:00
  000e4000-000e7fff : PCI Bus 0000:00
  000e8000-000ebfff : PCI Bus 0000:00
  000ec000-000effff : PCI Bus 0000:00
  000f0000-000fffff : PCI Bus 0000:00
000f0000-000fffff : System ROM
00100000-763f5fff : System RAM
  65000000-6affffff : Crash kernel
763f6000-79974fff : Reserved
79975000-799f1fff : ACPI Tables
799f2000-79aa6fff : ACPI Non-volatile Storage
  79a17000-79a17fff : USBC000:00
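
To illustrate why the ESRT page gets hit with this layout, here is a
simplified sketch of a top-down placement within a single System RAM
resource. It is not the kernel's kexec_file code: place_top_down() is a
hypothetical helper, and the 16 MiB alignment is an assumption chosen
because it reproduces the load addresses seen in the kexec_file debug
output.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Find the highest aligned start so that [start, start + memsz) fits into
 * the inclusive range [ram_start, ram_end]. Returns false if nothing fits.
 */
static bool place_top_down(uint64_t ram_start, uint64_t ram_end,
			   uint64_t memsz, uint64_t align, uint64_t *start)
{
	uint64_t candidate;

	assert(start != NULL);
	assert(memsz > 0 && ram_end >= ram_start);
	assert(align != 0 && (align & (align - 1)) == 0);	/* power of two */

	if (ram_end - ram_start < memsz - 1)
		return false;		/* range too small */

	/* Highest possible start, then align it down. */
	candidate = (ram_end - (memsz - 1)) & ~(align - 1);
	if (candidate < ram_start)
		return false;

	*start = candidate;
	return true;
}
```

With System RAM spanning 0x00100000-0x763f5fff and memsz 0x24be000, this
lands on 0x73000000, so the segment runs straight across 0x74dd2000. If the
RAM resource instead ended at 0x74dd1fff, a kernel with memsz 0x24b7000
would be placed at 0x72000000 and stay below the ESRT page.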

What it would need to look like for kexec to leave ESRT alone, I guess, is:

00000000-00000fff : Reserved
00001000-0009efff : System RAM
0009f000-000fffff : Reserved
  

Re: kexec_file overwrites reserved EFI ESRT memory

2019-12-02 Thread Dave Young
Add more cc
On 12/02/19 at 04:58pm, Dave Young wrote:
> On 11/29/19 at 04:27pm, Michael Weiser wrote:
> > Hello Dave,
> > 
> > On Mon, Nov 25, 2019 at 01:52:01PM +0800, Dave Young wrote:
> > 
> > > > > Fundamentally when deciding where to place a new kernel kexec (either
> > > > > user space or the in kernel kexec_file implementation) needs to be 
> > > > > able
> > > > > to ask the question which memory ares are reserved.
> > [...]
> > > > > So my question is why doesn't the ESRT reservation wind up in
> > > > > /proc/iomem?
> > > > 
> > > > My guess is that the focus was that some EFI structures need to be kept
> > > > around accross the life cycle of *one* running kernel and
> > > > memblock_reserve() was enough for that. Marking them so they survive
> > > > kexecing another kernel might just never have cropped up thus far. Ard
> > > > or Matt would know.
> > > Can you check your un-reserved memory, if your memory falls into EFI
> > > BOOT* then in X86 you can use something like below if it is not covered:
> > 
> > > void __init efi_esrt_init(void)
> > > {
> > > ...
> > > >   pr_info("Reserving ESRT space from %pa to %pa.\n", &esrt_data, &end);
> > > >   if (md.type == EFI_BOOT_SERVICES_DATA)
> > > >           efi_mem_reserve(esrt_data, esrt_data_size);
> > > ...
> > > }
> > 
> > Please bear with me if I'm a bit slow on the uptake here: On my machine,
> > the esrt module reports at boot:
> > 
> > [0.001244] esrt: Reserving ESRT space from 0x74dd2f98 to 
> > 0x74dd2fd0.
> > 
> > This area is of type "Boot Data" (== BOOT_SERVICES_DATA) which makes the
> > code you quote reserve it using memblock_reserve() shown by
> > memblock=debug:
> > 
> > [0.001246] memblock_reserve: [0x74dd2f98-0x74dd2fcf] 
> > efi_mem_reserve+0x1d/0x2b
> > 
> > It also calls into arch/x86/platform/efi/quirks.c:efi_arch_mem_reserve()
> > which tags it as EFI_MEMORY_RUNTIME while the surrounding ones aren't
> > as shown by efi=debug:
> > 
> > [0.178111] efi: mem10: [Boot Data  |   |  |  |  |  |  |  |  |   
> > |WB|WT|WC|UC] range=[0x74dd3000-0x75becfff] (14MB)
> > [0.178113] efi: mem11: [Boot Data  |RUN|  |  |  |  |  |  |  |   
> > |WB|WT|WC|UC] range=[0x74dd2000-0x74dd2fff] (0MB)
> > [0.178114] efi: mem12: [Boot Data  |   |  |  |  |  |  |  |  |   
> > |WB|WT|WC|UC] range=[0x6d635000-0x74dd1fff] (119MB)
> > 
> > This prevents arch/x86/platform/efi/quirks.c:efi_free_boot_services()
> > from calling __memblock_free_late() on it. And indeed, memblock=debug does
> > not report this area as being free'd while the surrounding ones are:
> > 
> > [0.178369] __memblock_free_late: 
> > [0x74dd3000-0x75becfff] efi_free_boot_services+0x126/0x1f8
> > [0.178658] __memblock_free_late: 
> > [0x6d635000-0x74dd1fff] efi_free_boot_services+0x126/0x1f8
> > 
> > The esrt area does not show up in /proc/iomem though:
> > 
> > 0010-763f5fff : System RAM
> >   6200-62a00d80 : Kernel code
> >   62c0-62f15fff : Kernel rodata
> >   6300-630ea8bf : Kernel data
> >   63fed000-641f : Kernel bss
> >   6500-6aff : Crash kernel
> > 
> > And thus kexec loads the new kernel right over that area as shown when
> > enabling -DDEBUG on kexec_file.c (0x74dd3000 being inbetween 0x7300
> > and 0x7300+0x24be000 = 0x754be000):
> > 
> > [  650.007695] kexec_file: Loading segment 0: buf=0x3a9c84d6 
> > bufsz=0x5000 mem=0x98000 memsz=0x6000
> > [  650.007699] kexec_file: Loading segment 1: buf=0x17b2b9e6 
> > bufsz=0x1240 mem=0x96000 memsz=0x2000
> > [  650.007703] kexec_file: Loading segment 2: buf=0xfdf72ba2 
> > bufsz=0x1150888 mem=0x7300 memsz=0x24be000
> > 
> > ... because it looks for any memory hole large enough in iomem resources
> > tagged as System RAM, which 0x74dd2000-0x74dd2fff would then need to be
> > excluded from on my system.
> > 
> > Looking some more at efi_arch_mem_reserve() I see that it also registers
> > the area with efi.memmap and installs it using efi_memmap_install().
> > which seems to call memremap(MEMREMAP_WB) on it. From my understanding
> > of the comments in the source of memremap(), MEMREMAP_WB does specifically
> > *not* reserve that memory in any way.
> > 
> > > Unfortunately I noticed there are different requirements/ways for
> > > different types of "reserved" memory.  But that is another topic..
> > 
> > I tried to reserve the area with something like this:
> > 
> > > diff --git a/arch/x86/platform/efi/quirks.c b/arch/x86/platform/efi/quirks.c
> > index 4de244683a7e..b86a5df027a2 100644
> > --- a/arch/x86/platform/efi/quirks.c
> > +++ b/arch/x86/platform/efi/quirks.c
> > @@ -249,6 +249,7 @@ void __init efi_arch_mem_reserve(phys_addr_t addr, u64 
> > size)
> > efi_memory_desc_t md;
> > int num_entries;
> > void *new;
> > +   struct resource *res;
> >  
> > if (efi_mem_desc_lookup(addr, ) ||
> > 

Re: kexec_file overwrites reserved EFI ESRT memory

2019-12-02 Thread Dave Young
On 11/29/19 at 04:27pm, Michael Weiser wrote:
> Hello Dave,
> 
> On Mon, Nov 25, 2019 at 01:52:01PM +0800, Dave Young wrote:
> 
> > > > Fundamentally when deciding where to place a new kernel kexec (either
> > > > user space or the in kernel kexec_file implementation) needs to be able
> > > > to ask the question which memory areas are reserved.
> [...]
> > > > So my question is why doesn't the ESRT reservation wind up in
> > > > /proc/iomem?
> > > 
> > > My guess is that the focus was that some EFI structures need to be kept
> > > around across the life cycle of *one* running kernel and
> > > memblock_reserve() was enough for that. Marking them so they survive
> > > kexecing another kernel might just never have cropped up thus far. Ard
> > > or Matt would know.
> > Can you check your un-reserved memory, if your memory falls into EFI
> > BOOT* then in X86 you can use something like below if it is not covered:
> 
> > void __init efi_esrt_init(void)
> > {
> > ...
> > 	pr_info("Reserving ESRT space from %pa to %pa.\n", &esrt_data, &end);
> > 	if (md.type == EFI_BOOT_SERVICES_DATA)
> > 		efi_mem_reserve(esrt_data, esrt_data_size);
> > ...
> > }
> 
> Please bear with me if I'm a bit slow on the uptake here: On my machine,
> the esrt module reports at boot:
> 
> [0.001244] esrt: Reserving ESRT space from 0x74dd2f98 to 0x74dd2fd0.
> 
> This area is of type "Boot Data" (== BOOT_SERVICES_DATA) which makes the
> code you quote reserve it using memblock_reserve() shown by
> memblock=debug:
> 
> [0.001246] memblock_reserve: [0x74dd2f98-0x74dd2fcf] efi_mem_reserve+0x1d/0x2b
> 
> It also calls into arch/x86/platform/efi/quirks.c:efi_arch_mem_reserve()
> which tags it as EFI_MEMORY_RUNTIME while the surrounding ones aren't
> as shown by efi=debug:
> 
> [0.178111] efi: mem10: [Boot Data  |   |  |  |  |  |  |  |  |   |WB|WT|WC|UC] range=[0x74dd3000-0x75becfff] (14MB)
> [0.178113] efi: mem11: [Boot Data  |RUN|  |  |  |  |  |  |  |   |WB|WT|WC|UC] range=[0x74dd2000-0x74dd2fff] (0MB)
> [0.178114] efi: mem12: [Boot Data  |   |  |  |  |  |  |  |  |   |WB|WT|WC|UC] range=[0x6d635000-0x74dd1fff] (119MB)
> 
> This prevents arch/x86/platform/efi/quirks.c:efi_free_boot_services()
> from calling __memblock_free_late() on it. And indeed, memblock=debug does
> not report this area as being free'd while the surrounding ones are:
> 
> [0.178369] __memblock_free_late: [0x74dd3000-0x75becfff] efi_free_boot_services+0x126/0x1f8
> [0.178658] __memblock_free_late: [0x6d635000-0x74dd1fff] efi_free_boot_services+0x126/0x1f8
> 
> The esrt area does not show up in /proc/iomem though:
> 
> 0010-763f5fff : System RAM
>   6200-62a00d80 : Kernel code
>   62c0-62f15fff : Kernel rodata
>   6300-630ea8bf : Kernel data
>   63fed000-641f : Kernel bss
>   6500-6aff : Crash kernel
> 
> And thus kexec loads the new kernel right over that area as shown when
> enabling -DDEBUG on kexec_file.c (0x74dd3000 being in between 0x73000000
> and 0x73000000+0x24be000 = 0x754be000):
> 
> [  650.007695] kexec_file: Loading segment 0: buf=0x3a9c84d6 bufsz=0x5000 mem=0x98000 memsz=0x6000
> [  650.007699] kexec_file: Loading segment 1: buf=0x17b2b9e6 bufsz=0x1240 mem=0x96000 memsz=0x2000
> [  650.007703] kexec_file: Loading segment 2: buf=0xfdf72ba2 bufsz=0x1150888 mem=0x73000000 memsz=0x24be000
> 
> ... because it looks for any memory hole large enough in iomem resources
> tagged as System RAM, which 0x74dd2000-0x74dd2fff would then need to be
> excluded from on my system.
> 
> Looking some more at efi_arch_mem_reserve() I see that it also registers
> the area with efi.memmap and installs it using efi_memmap_install(),
> which seems to call memremap(MEMREMAP_WB) on it. From my understanding
> of the comments in the source of memremap(), MEMREMAP_WB does specifically
> *not* reserve that memory in any way.
> 
> > Unfortunately I noticed there are different requirements/ways for
> > different types of "reserved" memory.  But that is another topic..
> 
> I tried to reserve the area with something like this:
> 
> diff --git a/arch/x86/platform/efi/quirks.c b/arch/x86/platform/efi/quirks.c
> index 4de244683a7e..b86a5df027a2 100644
> --- a/arch/x86/platform/efi/quirks.c
> +++ b/arch/x86/platform/efi/quirks.c
> @@ -249,6 +249,7 @@ void __init efi_arch_mem_reserve(phys_addr_t addr, u64 size)
> efi_memory_desc_t md;
> int num_entries;
> void *new;
> +   struct resource *res;
>  
> if (efi_mem_desc_lookup(addr, &md) ||
> md.type != EFI_BOOT_SERVICES_DATA) {
> @@ -294,6 +295,21 @@ void __init efi_arch_mem_reserve(phys_addr_t addr, u64 size)
> early_memunmap(new, new_size);
>  
> efi_memmap_install(new_phys, num_entries);
> +
> +   res = memblock_alloc(sizeof(*res), 

Re: kexec_file overwrites reserved EFI ESRT memory

2019-11-29 Thread Michael Weiser
Hello Dave,

On Mon, Nov 25, 2019 at 01:52:01PM +0800, Dave Young wrote:

> > > Fundamentally, when deciding where to place a new kernel, kexec (either
> > > user space or the in-kernel kexec_file implementation) needs to be able
> > > to ask which memory areas are reserved.
[...]
> > > So my question is why doesn't the ESRT reservation wind up in
> > > /proc/iomem?
> > 
> > My guess is that the focus was that some EFI structures need to be kept
> > > around across the life cycle of *one* running kernel and
> > memblock_reserve() was enough for that. Marking them so they survive
> > kexecing another kernel might just never have cropped up thus far. Ard
> > or Matt would know.
> Can you check your un-reserved memory, if your memory falls into EFI
> BOOT* then in X86 you can use something like below if it is not covered:

> void __init efi_esrt_init(void)
> {
> ...
>   pr_info("Reserving ESRT space from %pa to %pa.\n", &esrt_data, &end);
>   if (md.type == EFI_BOOT_SERVICES_DATA)
>   efi_mem_reserve(esrt_data, esrt_data_size);
> ...
> }

Please bear with me if I'm a bit slow on the uptake here: On my machine,
the esrt module reports at boot:

[0.001244] esrt: Reserving ESRT space from 0x74dd2f98 to 0x74dd2fd0.

This area is of type "Boot Data" (== BOOT_SERVICES_DATA) which makes the
code you quote reserve it using memblock_reserve() shown by
memblock=debug:

[0.001246] memblock_reserve: [0x74dd2f98-0x74dd2fcf] efi_mem_reserve+0x1d/0x2b

It also calls into arch/x86/platform/efi/quirks.c:efi_arch_mem_reserve()
which tags it as EFI_MEMORY_RUNTIME while the surrounding ones aren't
as shown by efi=debug:

[0.178111] efi: mem10: [Boot Data  |   |  |  |  |  |  |  |  |   |WB|WT|WC|UC] range=[0x74dd3000-0x75becfff] (14MB)
[0.178113] efi: mem11: [Boot Data  |RUN|  |  |  |  |  |  |  |   |WB|WT|WC|UC] range=[0x74dd2000-0x74dd2fff] (0MB)
[0.178114] efi: mem12: [Boot Data  |   |  |  |  |  |  |  |  |   |WB|WT|WC|UC] range=[0x6d635000-0x74dd1fff] (119MB)

This prevents arch/x86/platform/efi/quirks.c:efi_free_boot_services()
from calling __memblock_free_late() on it. And indeed, memblock=debug does
not report this area as being free'd while the surrounding ones are:

[0.178369] __memblock_free_late: [0x74dd3000-0x75becfff] efi_free_boot_services+0x126/0x1f8
[0.178658] __memblock_free_late: [0x6d635000-0x74dd1fff] efi_free_boot_services+0x126/0x1f8
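
To make the effect of the RUNTIME tag concrete, the decision can be modelled
in a few lines of standalone C. This is only a sketch and not the code in
quirks.c: struct mem_desc and would_be_freed() are made-up names for
illustration, while the three constants carry the values given in the UEFI
specification.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Values as defined by the UEFI specification. */
#define EFI_BOOT_SERVICES_CODE 3
#define EFI_BOOT_SERVICES_DATA 4
#define EFI_MEMORY_RUNTIME     0x8000000000000000ULL

/* Hypothetical, trimmed-down stand-in for an EFI memory descriptor. */
struct mem_desc {
	uint32_t type;
	uint64_t start;		/* first byte of the region */
	uint64_t end;		/* last byte of the region (inclusive) */
	uint64_t attribute;
};

/*
 * Returns 1 if a region would be handed back to the allocator once boot
 * services are no longer needed, 0 if it is kept. Illustrative only.
 */
int would_be_freed(const struct mem_desc *md)
{
	assert(md != NULL);

	if (md->type != EFI_BOOT_SERVICES_CODE &&
	    md->type != EFI_BOOT_SERVICES_DATA)
		return 0;	/* not boot services memory at all */

	if (md->attribute & EFI_MEMORY_RUNTIME)
		return 0;	/* tagged to be kept, e.g. by a reservation */

	return 1;
}
```

Fed with mem11 from the efi=debug output (RUN set), the function returns 0,
and with mem10 or mem12 it returns 1, which matches what memblock=debug shows.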

The esrt area does not show up in /proc/iomem though:

0010-763f5fff : System RAM
  6200-62a00d80 : Kernel code
  62c0-62f15fff : Kernel rodata
  6300-630ea8bf : Kernel data
  63fed000-641f : Kernel bss
  6500-6aff : Crash kernel

And thus kexec loads the new kernel right over that area as shown when
enabling -DDEBUG on kexec_file.c (0x74dd3000 being in between 0x73000000
and 0x73000000+0x24be000 = 0x754be000):

[  650.007695] kexec_file: Loading segment 0: buf=0x3a9c84d6 bufsz=0x5000 mem=0x98000 memsz=0x6000
[  650.007699] kexec_file: Loading segment 1: buf=0x17b2b9e6 bufsz=0x1240 mem=0x96000 memsz=0x2000
[  650.007703] kexec_file: Loading segment 2: buf=0xfdf72ba2 bufsz=0x1150888 mem=0x73000000 memsz=0x24be000

... because it looks for any memory hole large enough in iomem resources
tagged as System RAM, which 0x74dd2000-0x74dd2fff would then need to be
excluded from on my system.
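
The overlap itself is easy to verify by hand or with a few lines of
standalone C. The sketch below is not taken from kexec_file.c; struct range
and ranges_overlap() are hypothetical names. It also assumes that the load
address of segment 2 is 0x73000000, which is what the end address 0x754be000
quoted above implies for a memsz of 0x24be000.

```c
#include <assert.h>
#include <stdint.h>

/* Half-open range [start, start + size). Hypothetical helper type. */
struct range {
	uint64_t start;
	uint64_t size;
};

/* Returns 1 if the two ranges share at least one byte, 0 otherwise. */
int ranges_overlap(struct range a, struct range b)
{
	assert(a.size > 0 && b.size > 0);

	return a.start < b.start + b.size && b.start < a.start + a.size;
}
```

With segment 2 as { 0x73000000, 0x24be000 } and the ESRT page as
{ 0x74dd2000, 0x1000 } this reports an overlap, while segments 0 and 1 at
0x98000 and 0x96000 stay clear of it.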

Looking some more at efi_arch_mem_reserve() I see that it also registers
the area with efi.memmap and installs it using efi_memmap_install(),
which seems to call memremap(MEMREMAP_WB) on it. From my understanding
of the comments in the source of memremap(), MEMREMAP_WB does specifically
*not* reserve that memory in any way.

> Unfortunately I noticed there are different requirements/ways for
> different types of "reserved" memory.  But that is another topic..

I tried to reserve the area with something like this:

diff --git a/arch/x86/platform/efi/quirks.c b/arch/x86/platform/efi/quirks.c
index 4de244683a7e..b86a5df027a2 100644
--- a/arch/x86/platform/efi/quirks.c
+++ b/arch/x86/platform/efi/quirks.c
@@ -249,6 +249,7 @@ void __init efi_arch_mem_reserve(phys_addr_t addr, u64 size)
efi_memory_desc_t md;
int num_entries;
void *new;
+   struct resource *res;
 
if (efi_mem_desc_lookup(addr, &md) ||
md.type != EFI_BOOT_SERVICES_DATA) {
@@ -294,6 +295,21 @@ void __init efi_arch_mem_reserve(phys_addr_t addr, u64 size)
early_memunmap(new, new_size);
 
efi_memmap_install(new_phys, num_entries);
+
+   res = memblock_alloc(sizeof(*res), SMP_CACHE_BYTES);
+   if (!res) {
+   pr_err("Failed to allocate EFI io resource allocator for "
+   "0x%llx:0x%llx", mr.range.start, mr.range.end);
+   return;
+   }
+
+   res->start  = mr.range.start;
+   

Re: kexec_file overwrites reserved EFI ESRT memory

2019-11-26 Thread Eric W. Biederman
Michael Weiser  writes:

> Hello Eric,
> Hello Ard,
>
> on my machine, kexec_file loads the normal (not crash) kernel image
> right across the EFI ESRT reserved memory range:
>
> esrt: Reserving ESRT space from 0x74dd6f98 to 0x74dd6fd0.
> [...]
> kexec_file: kernel signature verification successful.
> kexec_file: Loading segment 0: buf=0xe99b31ad bufsz=0x5000 mem=0x91000 memsz=0x6000
> kexec_file: Loading segment 1: buf=0xe45cdeb8 bufsz=0x1240 mem=0x8f000 memsz=0x2000
> kexec_file: Loading segment 2: buf=0x096e6de9 bufsz=0x1133888 mem=0x7300 memsz=0x249a000
>
> This causes the following message by the kexec'd kernel:
>
> esrt: Unsupported ESRT version 2904149718861218184.
>
> (The image is rather large at 18MiB as it has a built-in initrd.)

When did x86_64 get support for ARCH_KEEP_MEMBLOCK?  I can't find it
anywhere.

My recollection is that on x86 the definitive specification of what is
reserved and what is not is the struct resource tree (aka /proc/iomem).
Some other architectures apparently use something else for this, namely
the memblock implementation.

Fundamentally, when deciding where to place a new kernel, kexec (either
user space or the in-kernel kexec_file implementation) needs to be able
to ask which memory areas are reserved.  What the buddy
allocator does is unimportant as kexec copies memory from all over
the place and places it in the destined memory addresses at the
time of the kexec operation.
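
As a rough illustration of that question, placement boils down to finding a
hole in a System RAM range that avoids every reserved range. The standalone
C sketch below is not how kexec or kexec_file implement it; struct res and
find_hole() are hypothetical names, and the search is a plain bottom-up scan
without alignment handling.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Inclusive range, like the entries printed in /proc/iomem. */
struct res {
	uint64_t start;
	uint64_t end;
};

/*
 * Find the lowest address in `ram` where `size` bytes fit without touching
 * any of the `n` reserved ranges. `reserved` must be sorted by start address
 * and must not contain overlapping entries. Returns 0 and stores the address
 * in *out on success, -1 if nothing fits.
 */
int find_hole(struct res ram, const struct res *reserved, size_t n,
	      uint64_t size, uint64_t *out)
{
	uint64_t cur = ram.start;
	size_t i;

	assert(size > 0 && out != NULL);

	for (i = 0; i < n; i++) {
		if (reserved[i].end < cur || reserved[i].start > ram.end)
			continue;	/* outside the window still searched */
		if (reserved[i].start > cur &&
		    reserved[i].start - cur >= size) {
			*out = cur;	/* gap before this reservation fits */
			return 0;
		}
		cur = reserved[i].end + 1;	/* resume after the reservation */
	}

	if (cur <= ram.end && ram.end - cur + 1 >= size) {
		*out = cur;
		return 0;
	}
	return -1;
}
```

The point of the sketch is only that the answer depends entirely on the list
of reserved ranges handed in: a reservation that is missing from that list,
as the ESRT one is from /proc/iomem here, cannot be avoided.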

So my question is why doesn't the ESRT reservation wind up in
/proc/iomem?

Are you dealing with an embedded port that is being clever?

Or is there some subtle breakage now that x86 has memblock support that
/proc/iomem is no longer being properly maintained?

Eric

> Poking at the involved code a bit (as a layman) I found that the EFI
> code reserves the memory range using memblock_reserve() which is by all
> appearances correctly handed over to the buddy allocator as
> in-use/reserved. kexec_file on the other hand by default looks at iomem
> regions of type System RAM using walk_system_ram_res() and does not seem
> to have that particular information available to consider. (As may have
> become clear from this explanation I'm still somewhat fuzzy (to put it
> mildly) on the relationship of memblock, buddy and slab allocator and how
> (if at all) kexec_file interacts with them to a.) find available memory
> regions for the new kernel to load to and b.) tell them where it
> loaded the new kernel to so they don't use it any more.)
>
> As is to be expected, activating CONFIG_ARCH_KEEP_MEMBLOCK makes
> kexec_file use the preserved memblock structures and indeed end up using
> totally different memory regions and gets rid of the message:
>
> kexec_file: kernel signature verification successful.
> kexec_file: Loading segment 0: buf=0x2dea71f8 bufsz=0x5000 mem=0x47df8e000 memsz=0x6000
> kexec_file: Loading segment 1: buf=0x0686ff17 bufsz=0x1240 mem=0x47df8c000 memsz=0x2000
> kexec_file: Loading segment 2: buf=0xfc444e67 bufsz=0x1133888 mem=0x46900 memsz=0x2497000
>
> This is with 5.3.11 mainline and linux-next 5.4.0-rc8-next-20191122.
>
> I'm not actually trying to use ESRT for anything at this point but want
> to stop the boot message from messing up silent boot and suspect that
> this could potentially happen to other, more important EFI memory
> regions as well.
>
> I'm willing to chase this further but at this point I'm wondering
> whether it's the EFI code not reserving this memory area with enough
> emphasis (as iomem?) or kexec_file not checking usability of
> candidate memory regions rigorously enough (based on what other
> criteria?).
>
> Are there maybe any upcoming patches or subsystem-specific kernel trees
> I should try?
>
> Please let me know what other information may be helpful or if I should
> open a bug on bugzilla.kernel.org.
>
> Boot messages on normal boot:
> Linux version 5.3.11-gentoo (m@n) (gcc version 9.2.0 (Gentoo 9.2.0-r2 p3)) 
> #29 SMP Thu Nov 21 20:40:28 CET 2019
> Command line: 
> x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
> x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
> x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
> x86/fpu: Supporting XSAVE feature 0x008: 'MPX bounds registers'
> x86/fpu: Supporting XSAVE feature 0x010: 'MPX CSR'
> x86/fpu: xstate_offset[2]:  576, xstate_sizes[2]:  256
> x86/fpu: xstate_offset[3]:  832, xstate_sizes[3]:   64
> x86/fpu: xstate_offset[4]:  896, xstate_sizes[4]:   64
> x86/fpu: Enabled xstate features 0x1f, context size is 960 bytes, using 
> 'compacted' format.
> BIOS-provided physical RAM map:
> BIOS-e820: [mem 0x-0x0009efff] usable
> BIOS-e820: [mem 0x0009f000-0x000f] reserved
> BIOS-e820: [mem 0x0010-0x763fafff] usable
> BIOS-e820: [mem 0x763fb000-0x79979fff] reserved
> BIOS-e820: [mem 

Re: kexec_file overwrites reserved EFI ESRT memory

2019-11-24 Thread Dave Young
On 11/22/19 at 10:07pm, Michael Weiser wrote:
> Hi Eric,
> 
> On Fri, Nov 22, 2019 at 02:00:22PM -0600, Eric W. Biederman wrote:
> 
> > > esrt: Unsupported ESRT version 2904149718861218184.
> > >
> > > (The image is rather large at 18MiB as it has a built-in initrd.)
> > When did x86_64 get support for ARCH_KEEP_MEMBLOCK?  I can't find it
> > anywhere.
> 
> No, it hasn't. I temporarily hacked that in to see if it'd change
> anything and it did. Sorry to not be more clear about that.
> 
> > Fundamentally, when deciding where to place a new kernel, kexec (either
> > user space or the in-kernel kexec_file implementation) needs to be able
> > to ask which memory areas are reserved.
> > What the buddy
> > allocator does is unimportant as kexec copies memory from all over
> > the place and places it in the destined memory addresses at the
> > time of the kexec operation.
> 
> > So my question is why doesn't the ESRT reservation wind up in
> > /proc/iomem?
> 
> My guess is that the focus was that some EFI structures need to be kept
> around across the life cycle of *one* running kernel and
> memblock_reserve() was enough for that. Marking them so they survive
> kexecing another kernel might just never have cropped up thus far. Ard
> or Matt would know.

Can you check your un-reserved memory, if your memory falls into EFI
BOOT* then in X86 you can use something like below if it is not covered:

void __init efi_esrt_init(void)
{
...
pr_info("Reserving ESRT space from %pa to %pa.\n", &esrt_data, &end);
if (md.type == EFI_BOOT_SERVICES_DATA)
efi_mem_reserve(esrt_data, esrt_data_size);
...
}

Unfortunately I noticed there are different requirements/ways for
different types of "reserved" memory.  But that is another topic..

Thanks
Dave 


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: kexec_file overwrites reserved EFI ESRT memory

2019-11-22 Thread Michael Weiser
Hi Eric,

On Fri, Nov 22, 2019 at 02:00:22PM -0600, Eric W. Biederman wrote:

> > esrt: Unsupported ESRT version 2904149718861218184.
> >
> > (The image is rather large at 18MiB as it has a built-in initrd.)
> When did x86_64 get support for ARCH_KEEP_MEMBLOCK?  I can't find it
> anywhere.

No, it hasn't. I temporarily hacked that in to see if it'd change
anything and it did. Sorry to not be more clear about that.

> Fundamentally, when deciding where to place a new kernel, kexec (either
> user space or the in-kernel kexec_file implementation) needs to be able
> to ask which memory areas are reserved.
> What the buddy
> allocator does is unimportant as kexec copies memory from all over
> the place and places it in the destined memory addresses at the
> time of the kexec operation.

> So my question is why doesn't the ESRT reservation wind up in
> /proc/iomem?

My guess is that the focus was that some EFI structures need to be kept
around across the life cycle of *one* running kernel and
memblock_reserve() was enough for that. Marking them so they survive
kexecing another kernel might just never have cropped up thus far. Ard
or Matt would know.

> Are you dealing with an embedded port that is being clever?

I'm not an expert but think it's rather the opposite: It's just a memory
area provided by EFI containing some potentially interesting information
about the EFI firmware structure itself. The aim is to aid firmware
upgrades. This information needs to survive kexec so the user would be
able to use that information (e.g. for upgrades) after a kexec.

So apart from leaving that memory untouched, I guess it could also be
copied over to a staging area by kexec explicitly to be preserved across
the kexec. Or it could be blanked out in such a way that the esrt driver
would not find it after kexec and just be unavailable, if it's decided
that you should only use data about a firmware for upgrades that you
really just used to boot. I guess a bigger question could be asked
whether it would actually be useful and safe for esrt to be available
after kexec.

> Or is there some subtle breakage now that x86 has memblock support that
> /proc/iomem is no longer being properly maintained?

Uuuh, let me backpedal very hard here: x86 has not gained memblock
preserve support. That was just me mucking about. Sorry.
-- 
Thanks,
Michael
