Re: [PATCH v7 1/3] efi/x86: Fix EFI memory map corruption with kexec

2024-06-03 Thread Dave Young
On Mon, 3 Jun 2024 at 23:33, Mike Rapoport  wrote:
>
> On Mon, Jun 03, 2024 at 04:46:39PM +0200, Borislav Petkov wrote:
> > On Mon, Jun 03, 2024 at 09:01:49AM -0500, Kalra, Ashish wrote:
> > > If we skip efi_arch_mem_reserve() (which should probably be skipped anyway
> > > for the kexec case), then for kexec boot, EFI memmap is memremapped at the
> > > same virtual address as the first kernel and not at the allocated memblock
> > > address.
> >
> > Are you saying that we should simply do
> >
> > diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
> > index fdf07dd6f459..410cb0743289 100644
> > --- a/drivers/firmware/efi/efi.c
> > +++ b/drivers/firmware/efi/efi.c
> > @@ -577,6 +577,9 @@ void __init efi_mem_reserve(phys_addr_t addr, u64 size)
> >  	if (WARN_ON_ONCE(efi_enabled(EFI_PARAVIRT)))
> >  		return;
> >
> > +	if (kexec_in_progress)
> > +		return;
> > +

kexec_in_progress only indicates that the kernel is on the reboot (kexec)
code path. But efi_mem_reserve() is only called at boot time, so checking
kexec_in_progress is meaningless here:
current_kernel_is_booted_via_kexec != is_rebooting_with_kexec
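
For illustration only, a minimal user-space model of that distinction.
The struct and helpers are hypothetical, not kernel APIs. The assumption
it encodes: kexec_in_progress is a global of the running kernel that is
set on its own kexec reboot path (in kernel_kexec()), so the kernel that
kexec starts boots with its own copy still false, and a boot-time check
of it never fires.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical model, not kernel code: each kernel instance has its own
 * globals, so a flag set by the old kernel does not carry over into the
 * kernel that kexec starts.
 */
struct kernel_state {
	bool kexec_in_progress;  /* set on THIS kernel's kexec reboot path */
	bool booted_via_kexec;   /* how THIS kernel was started */
};

/* A freshly booted kernel starts with kexec_in_progress == false. */
static struct kernel_state boot_kernel(bool via_kexec)
{
	struct kernel_state k = { .kexec_in_progress = false,
				  .booted_via_kexec = via_kexec };
	return k;
}

/* The old kernel sets its own flag, then jumps into the new one. */
static struct kernel_state kexec_reboot(struct kernel_state *old)
{
	old->kexec_in_progress = true;
	return boot_kernel(true);
}

/* What a boot-time "if (kexec_in_progress) return;" would test. */
static bool boot_time_check_fires(const struct kernel_state *k)
{
	return k->kexec_in_progress;
}
```

In the second kernel booted_via_kexec is true while kexec_in_progress is
false, so the proposed early return in efi_mem_reserve() would never be
taken.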

The code change below in the patch looks good to me, but I'm not sure
what caused the memory corruption; it is indeed worth some more digging,
maybe it is SEV/SNP related.
+	if (md.attribute & EFI_MEMORY_RUNTIME)
+		return;
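
A rough sketch of what that early return buys, assuming that an
EFI_MEMORY_RUNTIME attribute on the descriptor means an earlier
reservation (for example by the first kernel before kexec) already
handled the range. That reading is an assumption; the root cause of the
corruption is still open. The struct and helper are hypothetical
simplifications; only the EFI_MEMORY_RUNTIME bit (bit 63) comes from the
UEFI specification.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* From the UEFI specification: EFI_MEMORY_RUNTIME is attribute bit 63. */
#define EFI_MEMORY_RUNTIME (UINT64_C(1) << 63)

/* Hypothetical, trimmed-down memory descriptor. */
struct mem_desc {
	uint64_t phys_addr;
	uint64_t num_pages;
	uint64_t attribute;
};

/*
 * Hypothetical model of the reserve path: returns true when it would go
 * on to carve a new runtime entry out of the memory map, false when the
 * proposed early return is taken.
 */
static bool reserve_would_update_memmap(struct mem_desc *md)
{
	if (md->attribute & EFI_MEMORY_RUNTIME)
		return false;                   /* already runtime: leave memmap alone */

	md->attribute |= EFI_MEMORY_RUNTIME;    /* stands in for the real memmap update */
	return true;
}
```

The second call on the same descriptor becomes a no-op instead of
splitting the memory map again.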

Thanks
Dave


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH v3 0/7] Support kdump with LUKS encryption by reusing LUKS volume keys

2024-05-30 Thread Dave Young
On Tue, 21 May 2024 at 11:20, Baoquan He  wrote:
>
> On 05/21/24 at 09:43am, Coiby Xu wrote:
> > On Mon, May 20, 2024 at 02:18:09PM +0800, Baoquan He wrote:
> > > Please don't add dm-de...@redhat.com to the public list because it's an
> > > internal mailing list or alias. And I got an error when replying.
> >
> > Thanks for the reminder! Actually it's a public mailing list and you
> > can find all emails publicly listed on [1]. I did some research and it
> > seems we should email to dm-de...@lists.linux.dev instead. Quoting [2],
>
> I always got this when replying to your patch. Maybe you registered
> before.
>
> msmtp: recipient address dm-de...@redhat.com not accepted by the server
> msmtp: server message: 550 5.1.1 : Recipient address rejected: User unknown in local recipient table

Hi Baoquan and Coiby, the lists on listman.redhat.com have all been
migrated elsewhere. For dm-devel, according to MAINTAINERS, the address
is dm-de...@lists.linux.dev. I think the old address should no longer
be used; otherwise errors like the one above are to be expected.

>
>
> > > To post a message to all the list members (who were subscribed with
> > > mail delivery enabled as of 2023.10.20), send email to
> > > dm-de...@lists.linux.dev.  You can no longer subscribe to
> > > dm-de...@redhat.com, to subscribe to dm-de...@lists.linux.dev please
> > > send email to dm-devel+subscr...@lists.linux.dev.
> >
> > [1] https://lore.kernel.org/dm-devel/
> > [2] https://listman.redhat.com/mailman/listinfo/dm-devel
> >
> > --
> > Best regards,
> > Coiby
> >
>




Re: [PATCH 0/3] Resolve problems with kexec identity mapping

2024-05-22 Thread Dave Young
Cc kexec list as well.

On Thu, 23 May 2024 at 10:52, Dave Young  wrote:
>
> Add Tao in the cc list.
>
> On Tue, 21 May 2024 at 02:37, Steve Wahl  wrote:
> >
> > Although there was a previous fix to avoid early kernel access to the
> > EFI config table on Intel systems, the problem can still exist on AMD
> > systems that support SEV (Secure Encrypted Virtualization).  The
> > command line option "nogbpages" brings this bug to the surface.  And
> > this is what caused the regression with my earlier patch that
> > attempted to reduce the use of gbpages.  This patch series fixes that
> > problem and restores my earlier patch.
> >
> > The following 2 commits caused the EFI config table, and the CC_BLOB
> > entry in that table, to be accessed when enabling SEV at kernel
> > startup.
> >
> > commit ec1c66af3a30 ("x86/compressed/64: Detect/setup SEV/SME features
> >   earlier during boot")
> > commit c01fce9cef84 ("x86/compressed: Add SEV-SNP feature
> >   detection/setup")
> >
> > These accesses happen before the new kernel establishes its own
> > identity map, and before establishing a routine to handle page faults.
> > But the areas referenced are not explicitly added to the kexec
> > identity map.
> >
> > This goes unnoticed when these areas happen to be placed close enough
> > to other areas that are explicitly added to the identity map, but
> > that is not always the case.
> >
> > Under certain conditions, for example Intel Atom processors that don't
> > support 1GB pages, it was found that these areas don't end up mapped,
> > and the SEV initialization code causes an unrecoverable page fault,
> > and the kexec fails.
> >
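A small model of why the problem stays hidden with 1 GB pages but shows
up with "nogbpages". It assumes the identity map rounds each explicitly
requested range out to whole pages (1 GB, or 2 MB PMD-level pages when
gbpages are not used); the helper and the addresses are hypothetical,
chosen only to show a table that sits near, but outside, a requested
range.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SZ_2M (UINT64_C(1) << 21)
#define SZ_1G (UINT64_C(1) << 30)

struct range { uint64_t start, end; };  /* [start, end), end > start */

/*
 * Hypothetical model: each explicitly requested range is mapped with
 * pages of size page_size, rounded out to page boundaries. Returns true
 * if addr falls inside one of the pages mapped that way, even when addr
 * itself was never requested.
 */
static bool addr_is_mapped(const struct range *r, size_t n,
			   uint64_t page_size, uint64_t addr)
{
	uint64_t page = addr / page_size;

	for (size_t i = 0; i < n; i++) {
		uint64_t first = r[i].start / page_size;
		uint64_t last = (r[i].end - 1) / page_size;

		if (page >= first && page <= last)
			return true;
	}
	return false;
}
```

With 1 GB pages a table at 0x7f400000 shares the gigabyte of a requested
range at 0x7e000000 and gets mapped by accident; with 2 MB pages it does
not, which is the kind of miss that ends in the page fault described
above.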
> > Tao Liu had offered a patch to put the config table into the kexec
> > identity map to avoid this problem:
> >
> > https://lore.kernel.org/all/20230601072043.24439-1-l...@redhat.com/
> >
> > But the community chose instead to avoid referencing this memory on
> > non-AMD systems where the problem was reported.
> >
> > commit bee6cf1a80b5 ("x86/sev: Do not try to parse for the CC blob
> >   on non-AMD hardware")
> >
> > I later wanted to make a different change to kexec identity map
> > creation, and had this patch accepted:
> >
> > commit d794734c9bbf ("x86/mm/ident_map: Use gbpages only where full GB page should be mapped.")
> >
> > but it quickly needed to be reverted because of problems on AMD systems.
> >
> > The reported regression problems on AMD systems were due to the above
> > mentioned references to the EFI config table.  In fact, on the same
> > systems, the "nogbpages" command line option breaks kexec as well.
> >
> > So I resubmit Tao Liu's original patch that maps the EFI config
> > table, add an additional patch by me that ensures that the CC blob is
> > also mapped (if present), and also resubmit my earlier patch to use
> > gbpages only when a full GB of space is requested to be mapped.
> >
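A sketch of the "use gbpages only when a full GB is requested" policy as
described here. It is a simplification written for illustration, not the
code from arch/x86/mm/ident_map.c, and plan_mapping() is a hypothetical
name. Each 1 GB-aligned slot touched by [start, end) gets a 1 GB page
only if the request covers the whole slot; otherwise just the requested
part is mapped with 2 MB pages.

```c
#include <assert.h>
#include <stdint.h>

#define SZ_2M (UINT64_C(1) << 21)
#define SZ_1G (UINT64_C(1) << 30)

struct page_count { uint64_t gb_pages, mb2_pages; };

/*
 * Hypothetical planner: walk [start, end) one 1 GB slot at a time and
 * count the pages a "full GB only" policy would use.
 */
static struct page_count plan_mapping(uint64_t start, uint64_t end)
{
	struct page_count c = { 0, 0 };
	uint64_t addr = start;

	while (addr < end) {
		uint64_t slot = addr & ~(SZ_1G - 1);    /* 1 GB-aligned slot base */
		uint64_t next = slot + SZ_1G;
		uint64_t stop = next < end ? next : end;

		if (addr == slot && stop == next) {
			c.gb_pages++;                   /* whole GB requested */
		} else {
			/* partial slot: round out to 2 MB boundaries only */
			uint64_t first = addr & ~(SZ_2M - 1);
			uint64_t last = (stop + SZ_2M - 1) & ~(SZ_2M - 1);

			c.mb2_pages += (last - first) / SZ_2M;
		}
		addr = stop;
	}
	return c;
}
```

For example, a range starting 2 MB below the 1 GB boundary and ending
2 MB past the 2 GB boundary needs one 1 GB page plus two 2 MB pages,
instead of three 1 GB pages that would map memory nobody asked for.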
> > I do not advocate for removing the earlier, non-AMD fix.  With kexec,
> > two different kernel versions can be in play, and the earlier fix
> > still covers non-AMD systems when the kexec'd-from kernel doesn't have
> > these patches applied.
> >
> > All three of the people who reported regressions with my earlier patch
> > have retested with this patch series and found it to work where my
> > single patch previously did not.  With current kernels, all fail to
> > kexec when "nogbpages" is on the command line, but all succeed with
> > "nogbpages" after the series is applied.
> >
> > Tao Liu (1):
> >   x86/kexec: Add EFI config table identity mapping for kexec kernel
> >
> > Steve Wahl (2):
> >   x86/kexec: Add EFI Confidential Computing blob to kexec identity mapping.
> >   x86/mm/ident_map: Use gbpages only where full GB page should be mapped.
> >
> >  arch/x86/kernel/machine_kexec_64.c | 82 --
> >  arch/x86/mm/ident_map.c            | 23 +++--
> >  2 files changed, 95 insertions(+), 10 deletions(-)
> >
> > --
> > 2.26.2
> >




Re: [PATCH 0/7] Kexec-tools: Improve RISC-V port

2024-05-17 Thread Dave Young
On Wed, 11 Oct 2023 at 13:24, Song Shuai  wrote:
>
>
>
> 在 2023/9/20 19:56, Simon Horman 写道:
> > On Fri, Sep 15, 2023 at 11:50:06AM +0800, Song Shuai wrote:
> >> Hi,
> >>
> >> This series is created to improve RISC-V port of kexec-tools,
> >> and is based on the horms/kexec-tools:build-test-riscv-v2 branch.
> >
> > In my mind the big question is how to move RISC-V support
> > from that branch, to being merged into main.
> >
> > IIRC there were some issues that needed to be addressed.
> > Perhaps they are all addressed by this series, and with
> > some appropriate squashing we can move forwards with a series
> > based on main?
>
> Hi, Simon and Nick:
>
> I squashed the first four patches into a "RISC-V: Some fixes for riscv
> port" patch and then took horms/main as the base to collect the 2
> patches from the horms/build-test-riscv-v2 branch together with this
> series. Here are the GitHub link and all commits for RISC-V.
>
> https://github.com/sugarfillet/kexec-tools/commits/main_rv
>
> 5dc133e RISC-V: Support loading Image binary file
> b042f6d RISC-V: Separate elf_riscv_find_pbase out
> 8f344c7 RISC-V: Enable kexec_file_load syscall
> 7d4b982 RISC-V: Some fixes for riscv port
> 3205c1c local: RISC-V: distribute purgatory/riscv/Makefile
> 54f9daf RISC-V: Add support for riscv kexec/kdump on kexec-tools
>
> Since I didn't find the issues/fixes Nick mentioned in these
> commits, I prefer to merge them into horms/main and let more kexec/kdump
> users help improve/fix up the RISC-V port.

Hi, I noticed another PR for Fedora kexec-tools:
https://src.fedoraproject.org/rpms/kexec-tools/pull-request/24

It is bad to take it as Fedora-only; I would suggest posting all the
refreshed patches together here again for review.

If there are not enough reviewers for them, my other suggestion is to
drop the kexec_load support code for the time being and only enable
the kexec_file_load support code in kexec-tools. I assume the kernel
commit below completed the kexec_file_load kernel piece of work.
Then it will be easier to review and get something working at least.
commit 6261586e0c91db14c34f894f4bc48f2300cff1d4
Author: Liao Chang 
Date:   Fri Apr 8 18:09:11 2022 +0800

RISC-V: Add kexec_file support

Thanks
Dave




Re: [PATCH v6 1/3] efi/x86: Fix EFI memory map corruption with kexec

2024-05-09 Thread Dave Young
On Thu, 9 May 2024 at 17:56, Ruirui Yang  wrote:
>
> On Fri, Apr 26, 2024 at 04:33:48PM +, Ashish Kalra wrote:
> > From: Ashish Kalra 
> >
> > With SNP guest kexec, the following EFI memmap corruption is observed:
> >
> > [0.00] efi: EFI v2.7 by EDK II
> > [0.00] efi: SMBIOS=0x7e33f000 SMBIOS 3.0=0x7e33d000 ACPI=0x7e57e000 ACPI 2.0=0x7e57e014 MEMATTR=0x7cc3c018 Unaccepted=0x7c09e018
> > [0.00] efi: [Firmware Bug]: Invalid EFI memory map entries:
> > [0.00] efi: mem03: [type=269370880|attr=0x0e42100e42180e41] range=[0x0486200e41038c18-0x200e898a0eee713ac17] (invalid)
> > [0.00] efi: mem04: [type=12336|attr=0x0e410686300e4105] range=[0x100e42000176-0x8c290f26248d200e175] (invalid)
> > [0.00] efi: mem06: [type=1124304408|attr=0x30b40028] range=[0x0e51300e45280e77-0xb44ed2142f460c1e76] (invalid)
> > [0.00] efi: mem08: [type=68|attr=0x300e540583280e41] range=[0x011a3cd8-0x486200e54b38c0bcd7] (invalid)
> > [0.00] efi: mem10: [type=1107529240|attr=0x0e42280e41300e41] range=[0x300e41058c280e42-0x38010ae54c5c328ee41] (invalid)
> > [0.00] efi: mem11: [type=189335566|attr=0x048d200e42038e18] range=[0x318c0048-0xe42029228ce4200047] (invalid)
> > [0.00] efi: mem12: [type=239142534|attr=0x00240b4b] range=[0x0e41380e0a7d700e-0x80f26238f22bfe500d] (invalid)
> > [0.00] efi: mem14: [type=239207055|attr=0x0e41300e43380e0a] range=[0x8c280e42048d200e-0xc70b028f2f27cc0a00d] (invalid)
> > [0.00] efi: mem15: [type=239210510|attr=0x00080e660b47080e] range=[0x324c001c-0xa78028634ce490001b] (invalid)
> > [0.00] efi: mem16: [type=4294848528|attr=0x32940014] range=[0x0e410286100e4100-0x80f252036a218f20ff] (invalid)
> > [0.00] efi: mem19: [type=2250772033|attr=0x42180e42200e4328] range=[0x41280e0ab9020683-0xe0e538c28b39e62682] (invalid)
> > [0.00] efi: mem20: [type=16|   |  |  |  |  |  |  |  |  |   |WB|  |WC|  ] range=[0x00084438-0x44340090333c437] (invalid)
> > [0.00] efi: mem22: [Reserved|attr=0x00c14420] range=[0x44243398-0x1033a04240003f397] (invalid)
> > [0.00] efi: mem23: [type=1141080856|attr=0x080e41100e43180e] range=[0x280e66300e4b280e-0x440dc5ee7141f4c080d] (invalid)
> > [0.00] efi: mem25: [Reserved|attr=0x000a44a0] range=[0x44a43428-0x1034304a400013427] (invalid)
> > [0.00] efi: mem28: [type=16|   |  |  |  |  |  |  |  |  |   |WB|  |WC|  ] range=[0x000a4488-0x448400b034bc487] (invalid)
> > [0.00] efi: mem30: [Reserved|attr=0x000a4470] range=[0x44743518-0x10352047400013517] (invalid)
> > [0.00] efi: mem33: [type=16|   |  |  |  |  |  |  |  |  |   |WB|  |WC|  ] range=[0x000a4458-0x445400b035ac457] (invalid)
> > [0.00] efi: mem35: [type=269372416|attr=0x0e42100e42180e41] range=[0x0486200e44038c18-0x200e8b8a0eee823ac17] (invalid)
> > [0.00] efi: mem37: [type=2351435330|attr=0x0e42100e42180e42] range=[0x470783380e410686-0x2002b2a041c2141e685] (invalid)
> > [0.00] efi: mem38: [type=1093668417|attr=0x100e42000270] range=[0x42100e42180e4220-0xfff366a4e421b78c21f] (invalid)
> > [0.00] efi: mem39: [type=76357646|attr=0x180e42200e42280e] range=[0x0e410686300e4105-0x4130f251a0710ae5104] (invalid)
> > [0.00] efi: mem40: [type=940444268|attr=0x0e42200e42280e41] range=[0x180e42200e42280e-0x300fc71c300b4f2480d] (invalid)
> > [0.00] efi: mem41: [MMIO|attr=0x8c280e42048d200e] range=[0x47943728-0x42138e0c87820292727] (invalid)
> > [0.00] efi: mem42: [type=1191674680|attr=0x004c000b] range=[0x300e41380e0a0246-0x470b0f26238f22b8245] (invalid)
> > [0.00] efi: mem43: [type=2010|attr=0x0301f00e4d078338] range=[0x45038e180e42028f-0xe4556bf118f282528e] (invalid)
> > [0.00] efi: mem44: [type=1109921345|attr=0x300e446c] range=[0x44080e42100e4218-0xfff39254e42138ac217] (invalid)
> > ...
> >
> > This EFI memmap corruption happens with the efi_arch_mem_reserve()
> > invocation in the kexec boot case.
> >
> > ( efi_arch_mem_reserve() is invoked with the following call-stack: )
> >
> > [0.310010]  efi_arch_mem_reserve+0xb1/0x220
> > [0.311382]  efi_mem_reserve+0x36/0x60
> > [0.311973]  efi_bgrt_init+0x17d/0x1a0
> > [0.313265]  acpi_parse_bgrt+0x12/0x20
> > [0.313858]  acpi_table_parse+0x77/0xd0
> > [0.314463]  acpi_boot_init+0x362/0x630
> > [0.315069]  setup_arch+0xa88/0xf80
> > [0.315629]  start_kernel+0x68/0xa90
> > [0.316194]  x86_64_start_reservations+0x1c/0x30
> > [0.316921]  x86_64_start_kernel+0xbf/0x110
> > [0.317582]  common_startup_64+0x13e/0x141
> >
> > efi_arch_mem_reserve() calls efi_memmap_alloc() to allocate memory for
> > EFI memory map and due 

Re: [PATCH v2] kexec: fix the unexpected kexec_dprintk() macro

2024-04-14 Thread Dave Young
On Sun, 14 Apr 2024 at 10:54, Baoquan He  wrote:
>
> Hi Dave,
>
> On 04/12/24 at 03:28pm, Dave Young wrote:
> > On Tue, 9 Apr 2024 at 12:23, Baoquan He  wrote:
> > >
> > > Jiri reported that the current kexec_dprintk() always prints out
> > > debugging messages whenever kexec/kdump loading is triggered. That is
> > > not wanted. The debugging message is supposed to be printed out when
> > > 'kexec -s -d' is specified for kexec/kdump loading.
> > >
> > > After investigating, the reason is the current kexec_dprintk() takes
> > > printk(KERN_INFO) or printk(KERN_DEBUG) depending on whether '-d' is
> > > specified. However, distros usually have a default log level like below:
> > >
> > >  [~]# cat /proc/sys/kernel/printk
> > >  7   4  1   7
> > >
> > > So, even though '-d' is not specified, printk(KERN_DEBUG) also always
> > > prints out. I thought printk(KERN_DEBUG) was equivalent to pr_debug(),
> > > but it's not.
> > >
> > > Fix it by switching to pr_info() instead, which works as expected.
> >
> > Could you also update the kernel/crash_core.c and
> > kernel/crash_reserve.c to include the filename prefix?
> > #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
>
> When I added pr_fmt() to kernel/crash_reserve.c and tested code, the
> printed boot log about crashkernel reservation is changed as below:
>
> [  +0.00] crash_reserve: crashkernel reserved: 0x7d00 - 0x9500 (384 MB)
>
> When I looked around, I noticed all other lines around don't have the
> module name printed out. Seems it's not appropriate to add one for
> crashkernel alone. And the kexec_dprintk() doesn't exist in
> kernel/crash_reserve.c. Furthermore, the kexec_dprintk() is added to
> enable debugging printing for kexec_file_load when loading kexec/kdump
> kernel. This crashkernel reservation may not be related. Combining all
> of these, I would suggest not adding pr_fmt() for kernel/crash_reserve.c for
> now, let's add pr_fmt() for kernel/crash_core.c, what do you think?

Hi Baoquan, I'm fine with it.

But adding pr_fmt is generally good, since all the pr_* calls then get a
prefix instead of "crashkernel:" being added manually in every pr_warn
etc. There are indeed a few messages without any prefix in this file.
But that can be done separately; it does not need to be done here
together with the debug print change.
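
For reference, a user-space sketch of how a pr_fmt() prefix reaches every
message: pr_fmt() is expanded inside the pr_* macros, and string-literal
concatenation glues the prefix onto each format string at compile time.
KBUILD_MODNAME is normally passed in by Kbuild; here it is hard-coded as
an assumption, and pr_info_buf() is a hypothetical stand-in that formats
into a buffer instead of the kernel log.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Assumption: Kbuild would pass -DKBUILD_MODNAME='"crash_core"'. */
#define KBUILD_MODNAME "crash_core"

/* Same shape as the line suggested above. */
#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt

/*
 * Hypothetical stand-in for pr_info(): the kernel version passes
 * KERN_INFO pr_fmt(fmt) to printk(); this one formats into a buffer.
 * ##__VA_ARGS__ is the GNU extension the kernel also relies on.
 */
#define pr_info_buf(buf, len, fmt, ...) \
	snprintf((buf), (len), pr_fmt(fmt), ##__VA_ARGS__)
```

Because the prefix is pasted into the literal, every pr_* call in the
file picks it up without touching the individual messages.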

Thanks
Dave

>
> >
> > >
> > > Fixes: cbc2fe9d9cb2 ("kexec_file: add kexec_file flag to control debug printing")
> > > Signed-off-by: Baoquan He 
> > > Reported-by: Jiri Slaby 
> > > Closes: https://lore.kernel.org/all/4c775fca-5def-4a2d-8437-7130b0272...@kernel.org
> > > ---
> > > v1->v2:
> > > - Change to use pr_info() only when "kexec -s -d" is specified. With
> > >   this change, the debugging messages for "kexec -c -d" of kexec_load
> > >   will be missed. We'll see if we need to add them for kexec_load too,
> > >   if someone explicitly requests it.
> > >
> > >  include/linux/kexec.h | 6 ++
> > >  1 file changed, 2 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/include/linux/kexec.h b/include/linux/kexec.h
> > > index 060835bb82d5..f31bd304df45 100644
> > > --- a/include/linux/kexec.h
> > > +++ b/include/linux/kexec.h
> > > @@ -461,10 +461,8 @@ static inline void arch_kexec_pre_free_pages(void *vaddr, unsigned int pages) {
> > >
> > >  extern bool kexec_file_dbg_print;
> > >
> > > -#define kexec_dprintk(fmt, ...)\
> > > -   printk("%s" fmt,\
> > > -  kexec_file_dbg_print ? KERN_INFO : KERN_DEBUG,   \
> > > -  ##__VA_ARGS__)
> > > +#define kexec_dprintk(fmt, arg...) \
> > > +do { if (kexec_file_dbg_print) pr_info(fmt, ##arg); } while (0)
> > >
> > >  #else /* !CONFIG_KEXEC_CORE */
> > >  struct pt_regs;
> > > --
> > > 2.41.0
> > >
> > >
> >
>




Re: [PATCH v2] kexec: fix the unexpected kexec_dprintk() macro

2024-04-12 Thread Dave Young
On Tue, 9 Apr 2024 at 12:23, Baoquan He  wrote:
>
> Jiri reported that the current kexec_dprintk() always prints out
> debugging messages whenever kexec/kdump loading is triggered. That is
> not wanted. The debugging message is supposed to be printed out when
> 'kexec -s -d' is specified for kexec/kdump loading.
>
> After investigating, the reason is the current kexec_dprintk() takes
> printk(KERN_INFO) or printk(KERN_DEBUG) depending on whether '-d' is
> specified. However, distros usually have a default log level like below:
>
>  [~]# cat /proc/sys/kernel/printk
>  7   4  1   7
>
> So, even though '-d' is not specified, printk(KERN_DEBUG) also always
> prints out. I thought printk(KERN_DEBUG) was equivalent to pr_debug(),
> but it's not.
>
> Fix it by switching to pr_info() instead, which works as expected.

Could you also update the kernel/crash_core.c and
kernel/crash_reserve.c to include the filename prefix?
#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt

Otherwise:
Reviewed-by: Dave Young 

>
> Fixes: cbc2fe9d9cb2 ("kexec_file: add kexec_file flag to control debug printing")
> Signed-off-by: Baoquan He 
> Reported-by: Jiri Slaby 
> Closes: https://lore.kernel.org/all/4c775fca-5def-4a2d-8437-7130b0272...@kernel.org
> ---
> v1->v2:
> - Change to use pr_info() only when "kexec -s -d" is specified. With
>   this change, the debugging messages for "kexec -c -d" of kexec_load
>   will be missed. We'll see if we need to add them for kexec_load too,
>   if someone explicitly requests it.
>
>  include/linux/kexec.h | 6 ++
>  1 file changed, 2 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/kexec.h b/include/linux/kexec.h
> index 060835bb82d5..f31bd304df45 100644
> --- a/include/linux/kexec.h
> +++ b/include/linux/kexec.h
> @@ -461,10 +461,8 @@ static inline void arch_kexec_pre_free_pages(void *vaddr, unsigned int pages) {
>
>  extern bool kexec_file_dbg_print;
>
> -#define kexec_dprintk(fmt, ...)\
> -   printk("%s" fmt,\
> -  kexec_file_dbg_print ? KERN_INFO : KERN_DEBUG,   \
> -  ##__VA_ARGS__)
> +#define kexec_dprintk(fmt, arg...) \
> +do { if (kexec_file_dbg_print) pr_info(fmt, ##arg); } while (0)
>
>  #else /* !CONFIG_KEXEC_CORE */
>  struct pt_regs;
> --
> 2.41.0
>
>
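A user-space model of the behaviour change. Apart from the
kexec_file_dbg_print name, everything here is a hypothetical stand-in.
The old macro always handed the message to printk() and only switched
the level; printk() keeps every level in the log buffer, so the message
always showed up in dmesg. The new macro emits nothing unless debug
printing was requested.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-ins: in the kernel the flag is set by "kexec -s -d". */
static bool kexec_file_dbg_print;
static int records_logged;      /* messages that reach the log buffer */

/* Models printk(): every level is recorded, only its tag differs. */
static void log_record(const char *level, const char *msg)
{
	(void)level;
	(void)msg;
	records_logged++;
}

/* Old behaviour: always logs, only the level changes. */
#define old_kexec_dprintk(msg) \
	log_record(kexec_file_dbg_print ? "INFO" : "DEBUG", (msg))

/* New behaviour: logs only when debug printing was requested. */
#define new_kexec_dprintk(msg) \
	do { if (kexec_file_dbg_print) log_record("INFO", (msg)); } while (0)
```

The do { } while (0) wrapper mirrors the patch and keeps the macro safe
to use as a single statement, for example in an unbraced if/else.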




Re: Question about Address Range Validation in Crash Kernel Allocation

2024-03-22 Thread Dave Young
Hi,

On Fri, 22 Mar 2024 at 09:16, Baoquan He  wrote:
>
> On 03/21/24 at 08:37pm, Li Huafei wrote:
> >
> >
> > On 2024/3/21 18:06, Dave Young wrote:
> > > Hi,
> > >
> > > On Thu, 21 Mar 2024 at 17:49, Li Huafei  wrote:
> > >>
> > >> Hi Baoquan,
> > >>
> > >> On 2024/3/21 17:17, chenhaixiang (A) wrote:
> > >>>
> > >>>>> I'm sorry for the delay. Here are some details from the boot log and /proc/iomem:
> > >>>>> The Boot log:
> > >>>>> [0.00] Linux version 6.8.0 (root@localhost.localdomain) (gcc (GCC) 10.3.1, GNU ld (GNU Binutils) 2.37) #3 SMP PREEMPT_DYNAMIC Wed Mar 20 11:46:11 UTC 2024
> > >>>>> [0.00] Command line: BOOT_IMAGE=/vmlinuz-6.8.0 root=/dev/mapper/root ro crashkernel=512M resume=/dev/mapper/swap rd.lvm.lv=root rd.lvm.lv=swap crash_kexec_post_notifiers softlockup_panic=1 reserve_kbox_mem=16M fsck.mode=auto fsck.repair=yes panic=3 nmi_watchdog=1 quiet rd.shell=0 memblock=debug efi=debug console=ttyS0,115200n8 console=tty0
> > >>>> ..snip...
> > >>>>> [0.022622] memblock_phys_alloc_range: 536870912 bytes align=0x100 from=0x max_addr=0x0001 reserve_crashkernel_generic+0x7c/0x220
> > >>>>> [0.022628] memblock_phys_alloc_range: 536870912 bytes align=0x100 from=0x0001 max_addr=0x4000 reserve_crashkernel_generic+0x7c/0x220
> > >>>>> [0.022632] memblock_reserve: [0x00c01f00-0x00c03eff] memblock_alloc_range_nid+0xee/0x170
> > >>>>> [0.022634] memblock_phys_alloc_range: 268435456 bytes align=0x100 from=0x max_addr=0x0001 reserve_crashkernel_generic+0x11d/0x220
> > >>>>> [0.022638] memblock_reserve: [0x4900-0x58ff] memblock_alloc_range_nid+0xee/0x170
> > >>>>> [0.022640] crashkernel low memory reserved: 0x4900 - 0x5900 (256 MB)
> > >>>>> [0.022641] crashkernel reserved: 0x00c01f00 - 0x00c03f00 (512 MB)
> > >>>>
> > >>>> Here, crashkernel,low is reserved in region:  [0x4900 - 0x5900] (256 MB)
> > >>>>   crashkernel,high is reserved in region: [0x00c01f00 - 0x00c03f00] (512 MB) ..
> > >>>>> [0.029839] memblock_reserve: [0x00c03740-0x00c03f7f] memblock_alloc_range_nid+0xee/0x170
> > >>>>> [0.029843] e820: update [mem 0x53cbd000-0x53cc] usable ==> reserved
> > >>>>> [0.029861] TSC deadline timer available
> > >>>>
> > >>>> Then here, region [0x53cbd000-0x53cc] is reserved in e820, as the
> > >>>> "usable ==> reserved" line above shows. This should be the step that
> > >>>> prevents the earlier reserved crashkernel,low from being added to the
> > >>>> iomem tree. I am not sure what triggered the e820 update.
> > >>
> > >> We added dump_stack() printing in efi_mem_reserve() and found that
> > >> [0x53cbd000-0x53cc] was reserved by BGRT:
> > >>
> > >>   [0.032259] e820: update [mem 0x53cbd000-0x53cc] usable ==> reserved
> > >>   [0.032262] CPU: 0 PID: 0 Comm: swapper Not tainted 5.10.0-60.18.0.50.h820.eulerosv2r11.x86_64 #7
> > >>   [0.032263] Hardware name: Huawei 2288H V5/BC11SPSCB0, BIOS 8.25 08/30/2022
> > >>   [0.032264] Call Trace:
> > >>   [0.032265]  ? dump_stack+0x57/0x6e
> > >>   [0.032267]  ? bg

Re: Question about Address Range Validation in Crash Kernel Allocation

2024-03-22 Thread Dave Young
On Thu, 21 Mar 2024 at 20:37, Li Huafei  wrote:
>
>
>
> On 2024/3/21 18:06, Dave Young wrote:
> > Hi,
> >
> > On Thu, 21 Mar 2024 at 17:49, Li Huafei  wrote:
> >>
> >> Hi Baoquan,
> >>
> >> On 2024/3/21 17:17, chenhaixiang (A) wrote:
> >>>
> >>>>> I'm sorry for the delay. Here are some details from the boot log and /proc/iomem:
> >>>>> The Boot log:
> >>>>> [0.00] Linux version 6.8.0 (root@localhost.localdomain) (gcc (GCC) 10.3.1, GNU ld (GNU Binutils) 2.37) #3 SMP PREEMPT_DYNAMIC Wed Mar 20 11:46:11 UTC 2024
> >>>>> [0.00] Command line: BOOT_IMAGE=/vmlinuz-6.8.0 root=/dev/mapper/root ro crashkernel=512M resume=/dev/mapper/swap rd.lvm.lv=root rd.lvm.lv=swap crash_kexec_post_notifiers softlockup_panic=1 reserve_kbox_mem=16M fsck.mode=auto fsck.repair=yes panic=3 nmi_watchdog=1 quiet rd.shell=0 memblock=debug efi=debug console=ttyS0,115200n8 console=tty0
> >>>> ..snip...
> >>>>> [0.022622] memblock_phys_alloc_range: 536870912 bytes align=0x100 from=0x max_addr=0x0001 reserve_crashkernel_generic+0x7c/0x220
> >>>>> [0.022628] memblock_phys_alloc_range: 536870912 bytes align=0x100 from=0x0001 max_addr=0x4000 reserve_crashkernel_generic+0x7c/0x220
> >>>>> [0.022632] memblock_reserve: [0x00c01f00-0x00c03eff] memblock_alloc_range_nid+0xee/0x170
> >>>>> [0.022634] memblock_phys_alloc_range: 268435456 bytes align=0x100 from=0x max_addr=0x0001 reserve_crashkernel_generic+0x11d/0x220
> >>>>> [0.022638] memblock_reserve: [0x4900-0x58ff] memblock_alloc_range_nid+0xee/0x170
> >>>>> [0.022640] crashkernel low memory reserved: 0x4900 - 0x5900 (256 MB)
> >>>>> [0.022641] crashkernel reserved: 0x00c01f00 - 0x00c03f00 (512 MB)
> >>>>
> >>>> Here, crashkernel,low is reserved in region:  [0x4900 - 0x5900] (256 MB)
> >>>>   crashkernel,high is reserved in region: [0x00c01f00 - 0x00c03f00] (512 MB) ..
> >>>>> [0.029839] memblock_reserve: [0x00c03740-0x00c03f7f] memblock_alloc_range_nid+0xee/0x170
> >>>>> [0.029843] e820: update [mem 0x53cbd000-0x53cc] usable ==> reserved
> >>>>> [0.029861] TSC deadline timer available
> >>>>
> >>>> Then here, region [0x53cbd000-0x53cc] is reserved in e820, as the
> >>>> "usable ==> reserved" line above shows. This should be the step that
> >>>> prevents the earlier reserved crashkernel,low from being added to the
> >>>> iomem tree. I am not sure what triggered the e820 update.
> >>
> >> We added dump_stack() printing in efi_mem_reserve() and found that
> >> [0x53cbd000-0x53cc] was reserved by BGRT:
> >>
> >>   [0.032259] e820: update [mem 0x53cbd000-0x53cc] usable ==> reserved
> >>   [0.032262] CPU: 0 PID: 0 Comm: swapper Not tainted 5.10.0-60.18.0.50.h820.eulerosv2r11.x86_64 #7
> >>   [0.032263] Hardware name: Huawei 2288H V5/BC11SPSCB0, BIOS 8.25 08/30/2022
> >>   [0.032264] Call Trace:
> >>   [0.032265]  ? dump_stack+0x57/0x6e
> >>   [0.032267]  ? bgrt_init+0xc2/0xc2
> >>   [0.032268]  ? __e820__range_update+0x7a/0x1d6
> >>   [0.032270]  ? bgrt_init+0xc2/0xc2
> >>   [0.032272]  ? bgrt_init+0xc2/0xc2
> >>   [0.032274]  ? efi_arch_mem_reserve+0x1a3/0x1d0
> >>   [0.032276]  ? efi_mem_reserve+0x2d/0x42
> >>   [0.032278]  ? acpi_parse_bgrt+0xa/0x11
> >>   [0.032279]  ? acpi_table_parse+0x86/0xbc
> >>   [0.032281]  ? acpi_boot_init+0x79/0xad
> >>   [0.032282]  ? setup_arch+0x835/0x954

Re: [PATCH V2] x86/kexec: do not update E820 kexec table for setup_data

2024-03-21 Thread Dave Young
> > Your tree is missing this recent commit:
> > 7fd817c906503b6813ea3b41f5fdf4192449a707 ("x86/e820: Don't
> > reserve SETUP_RNG_SEED in e820").
> >
> > Wouldn't this fix [/paper over] your problem as well? I.e., isn't
> > SETUP_RNG_SEED the setup_data item that's causing your problem?
>
> Thanks for catching this, I will rebase and repost.
>
> But it does not "fix" the problem as my problem is related to the other
> setup_data range, I think it is SETUP_PCI (not 100% sure, but it is
> certainly not RNG_SEED)
>

The webmail reply broke the lines randomly, sorry for that. I have
resent a rebased version. I also confirmed that in my case it was
SETUP_PCI that caused the issue. Note that this SETUP_PCI comes from the
previous physical boot: the old kernel reserved it in the kexec e820
table, and it is not the RNG_SEED that was passed in by kexec. I believe
the RNG_SEED region could cause issues only on the 3rd+ boot with a
kernel without the commit you mentioned.

It is a little tricky and probably not obvious to understand.

Thanks
Dave




[PATCH V3] x86/kexec: do not update E820 kexec table for setup_data

2024-03-21 Thread Dave Young
crashkernel reservation failed on a Thinkpad t440s laptop recently. 
Actually the memblock reservation succeeded, but later insert_resource()
failed.

Test steps:
kexec load -> /* make sure to add the crashkernel param, e.g. crashkernel=160M */
kexec reboot ->
dmesg|grep "crashkernel reserved";
a crashkernel memory range like the one below is reserved successfully:
0xd000 - 0xda00
But there is no such "Crash kernel" region in /proc/iomem

The background story is as follows:

Currently the E820 code reserves setup_data regions for both the current
kernel and the kexec kernel, and it inserts them into the resources list.
Before the kexec kernel reboots nobody passes the old setup_data, and
kexec only passes fresh SETUP_EFI/SETUP_IMA/SETUP_RNG_SEED if needed.
Thus the old setup_data memory is not used at all.

Because the old kernel updates the kexec e820 table as well, the kexec
kernel sees them as E820_TYPE_RESERVED_KERN regions, and later the old
setup_data regions are inserted into the resources list in the kexec
kernel by e820__reserve_resources().

Note that because no setup_data is passed in for those old regions, they
are not reserved early (by early_reserve_memory()), so the crashkernel
memblock reservation treats them as usable memory and may reserve a
crashkernel region that overlaps the old setup_data regions. This is
exactly the bug I noticed here: kdump's insert_resource() failed because
e820__reserve_resources() had already added the overlapping chunks to
/proc/iomem.
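
A simplified model of that conflict, written for illustration and not
taken from the kernel's resource code: resources may nest, but a new
range that only partially overlaps an existing entry cannot be inserted.
The ranges used with it are hypothetical; think of two adjacent entries
left behind by old setup_data and a crashkernel range straddling their
boundary.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Inclusive [start, end], the convention /proc/iomem entries use. */
struct res { uint64_t start, end; };

static bool overlaps(const struct res *a, const struct res *b)
{
	return a->start <= b->end && b->start <= a->end;
}

static bool contains(const struct res *outer, const struct res *inner)
{
	return outer->start <= inner->start && inner->end <= outer->end;
}

/*
 * Hypothetical insert check over a flat list of existing resources:
 * nesting either way is accepted, a partial overlap is a conflict.
 */
static bool can_insert(const struct res *list, size_t n,
		       const struct res *new_res)
{
	for (size_t i = 0; i < n; i++) {
		if (!overlaps(&list[i], new_res))
			continue;
		if (contains(&list[i], new_res) || contains(new_res, &list[i]))
			continue;
		return false;
	}
	return true;
}
```

A range fully inside one existing entry is accepted, while one that
crosses the boundary between two entries is rejected, which is the shape
of the insert_resource() failure described above.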

Finally, looking at the code, the old setup_data regions are not used
at all, as no setup_data is passed in by the kexec boot loader. Although
something like SETUP_PCI could be needed, kexec should pass that info
as new setup_data so that the kexec kernel can take care of it. That
should be handled in separate patches if needed.

Thus drop the useless buggy code here.

Signed-off-by: Dave Young 
---
V3: Rebase to latest mainline [Jiri Bohac]
V2: changelog grammar fixes [suggestions from Huang Kai]
 arch/x86/kernel/e820.c |   17 +
 1 file changed, 1 insertion(+), 16 deletions(-)

Index: linux/arch/x86/kernel/e820.c
===
--- linux.orig/arch/x86/kernel/e820.c
+++ linux/arch/x86/kernel/e820.c
@@ -1016,17 +1016,6 @@ void __init e820__reserve_setup_data(voi
 
 		e820__range_update(pa_data, sizeof(*data)+data->len, E820_TYPE_RAM, E820_TYPE_RESERVED_KERN);
 
-		/*
-		 * SETUP_EFI, SETUP_IMA and SETUP_RNG_SEED are supplied by
-		 * kexec and do not need to be reserved.
-		 */
-		if (data->type != SETUP_EFI &&
-		    data->type != SETUP_IMA &&
-		    data->type != SETUP_RNG_SEED)
-			e820__range_update_kexec(pa_data,
-						 sizeof(*data) + data->len,
-						 E820_TYPE_RAM, E820_TYPE_RESERVED_KERN);
-
 		if (data->type == SETUP_INDIRECT) {
 			len += data->len;
 			early_memunmap(data, sizeof(*data));
@@ -1038,12 +1027,9 @@ void __init e820__reserve_setup_data(voi
 
 			indirect = (struct setup_indirect *)data->data;
 
-			if (indirect->type != SETUP_INDIRECT) {
+			if (indirect->type != SETUP_INDIRECT)
 				e820__range_update(indirect->addr, indirect->len,
 						   E820_TYPE_RAM, E820_TYPE_RESERVED_KERN);
-				e820__range_update_kexec(indirect->addr, indirect->len,
-							 E820_TYPE_RAM, E820_TYPE_RESERVED_KERN);
-			}
 		}
 
 		pa_data = pa_next;
@@ -1051,7 +1037,6 @@ void __init e820__reserve_setup_data(voi
 	}
 
 	e820__update_table(e820_table);
-	e820__update_table(e820_table_kexec);
 
 	pr_info("extended physical RAM map:\n");
 	e820__print_table("reserve setup_data");
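The failure described in this changelog boils down to an interval conflict: once a stale setup_data chunk is already listed in /proc/iomem, a "Crash kernel" range that straddles its boundary cannot be inserted. A minimal userspace sketch of just that rule follows. `struct range_ent`, `conflicts()` and `find_conflict()` are hypothetical names, and the model flattens the kernel's resource tree into a single list of siblings, so it is not `insert_resource()` itself; it only mirrors the "partial overlap is a conflict, nesting is fine" behaviour.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flat stand-in for top-level /proc/iomem entries: inclusive [start, end]. */
struct range_ent {
	uint64_t start;
	uint64_t end;
	const char *name;
};

static bool ranges_overlap(const struct range_ent *a, const struct range_ent *b)
{
	return a->start <= b->end && b->start <= a->end;
}

/*
 * Nesting in either direction is acceptable; a range that straddles the
 * boundary of an existing entry is a conflict ("partial overlap").
 */
static bool conflicts(const struct range_ent *a, const struct range_ent *b)
{
	bool a_in_b = a->start >= b->start && a->end <= b->end;
	bool b_in_a = b->start >= a->start && b->end <= a->end;

	return ranges_overlap(a, b) && !a_in_b && !b_in_a;
}

/* Return the first entry that @cand conflicts with, or NULL if it could be inserted. */
static const struct range_ent *find_conflict(const struct range_ent *tab, size_t n,
					     const struct range_ent *cand)
{
	for (size_t i = 0; i < n; i++)
		if (conflicts(cand, &tab[i]))
			return &tab[i];
	return NULL;
}
```

With a table of {System RAM, old setup_data chunk, System RAM}, a crash range lying entirely inside one RAM entry is accepted, while one that crosses into the old setup_data chunk is rejected, which is the symptom seen on the t440s.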




Re: [PATCH V2] x86/kexec: do not update E820 kexec table for setup_data

2024-03-21 Thread Dave Young
Hi Jiri,

On Thu, 21 Mar 2024 at 18:32, Jiri Bohac  wrote:
>
> Hi,
>
> On Thu, Mar 21, 2024 at 05:23:20PM +0800, Dave Young wrote:
> > crashkernel reservation failed on a Thinkpad t440s laptop recently.
> > Actually the memblock reservation succeeded, but later insert_resource()
> > failed.
> >
> > Test steps:
> > kexec load -> /* make sure add crashkernel param eg. crashkernel=160M */
> > kexec reboot ->
> > dmesg|grep "crashkernel reserved";
> > crashkernel memory range like below reserved successfully:
> > 0xd000 - 0xda00
> > But no such "Crash kernel" region in /proc/iomem
> >
> > The background story is like below:
> >
> > Currently E820 code reserves setup_data regions for both the current
> > kernel and the kexec kernel, and it inserts them into the resources list.
> > Before the kexec kernel reboots nobody passes the old setup_data, and
> > kexec only passes fresh SETUP_EFI and SETUP_IMA if needed.  Thus the old
> > setup data memory is not used at all.
> >
> > Due to old kernel updates the kexec e820 table as well so kexec kernel
> > sees them as E820_TYPE_RESERVED_KERN regions, and later the old setup_data
> > regions are inserted into resources list in the kexec kernel by
> > e820__reserve_resources().
> >
> > Note, due to no setup_data is passed in for those old regions they are not
> > early reserved (by function early_reserve_memory), and the crashkernel
> > memblock reservation will just treat them as usable memory and it could
> > reserve the crashkernel region which overlaps with the old setup_data
> > regions. And just like the bug I noticed here, kdump insert_resource
> > failed because e820__reserve_resources has added the overlapped chunks
> > in /proc/iomem already.
>
> wouldn't this be caused by
> 4a693ce65b186fddc1a73621bd6f941e6e3eca21 ("kdump: defer the
> insertion of crashkernel resources")?
>
> Before that the crashkernel resources were inserted from
> arch_reserve_crashkernel() which is called before
> e820__reserve_resources().

I think reverting the commit you mentioned can paper over this issue, but
it is not the root cause.  Yes, arch_reserve_crashkernel() can succeed, but
e820 then still tries to reserve the setup_data region that overlaps with
the crashkernel region for another purpose.

>
> The semantics of E820_TYPE_RESERVED_KERN wrt kexec quite
> inconsistent. It's treated as E820_TYPE_RAM by
> e820__memblock_setup() and e820_type_to_iomem_type().
>
> The problem we're seeing here is the result of the former.
> e820__memblock_setup() will add the E820_TYPE_RESERVED_KERN
> region to the memblock, merging with the neighbouring memblocks,
> allowing crashkernel region to span across the originally
> reserved area.
>
> e820_type_to_iomem_type() treating E820_TYPE_RESERVED_KERN as
> E820_TYPE_RAM will make the E820_TYPE_RESERVED_KERN appear as
> system ram in /proc/iomem. If the old kexec_load (not
> kexec_file_load) syscall is used, the userspace kexec utility
> will construct the e820 table based on the contents of
> /proc/iomem and the kexec kernel will see the
> E820_TYPE_RESERVED_KERN range as E820_TYPE_RAM.  When
> kexec_file_load is used the E820_TYPE_RESERVED_KERN type is
> propagated to the kexec kernel by bzImage64_load() /
> setup_e820_entries().

This is true, but it does not matter for the kexec kernel: those regions
are only reserved for the 1st kernel and are not meaningful to the kexec
kernel, so using them as system RAM is fine in the 2nd (kexec) kernel.
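To make the E820_TYPE_RESERVED_KERN point concrete, here is a small model that assumes only what is stated in this exchange: RESERVED_KERN is handled like RAM both when building memblock and when naming the /proc/iomem entry. The enum and helper names are hypothetical, and this does not reproduce `e820__memblock_setup()` or `e820_type_to_iomem_type()`; it only shows why a RESERVED_KERN chunk merges with its RAM neighbours into one larger block that a crashkernel reservation can then span.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for E820_TYPE_RAM, E820_TYPE_RESERVED, E820_TYPE_RESERVED_KERN. */
enum model_e820_type { M_RAM, M_RESERVED, M_RESERVED_KERN };

struct model_e820_entry {
	uint64_t addr;
	uint64_t size;
	enum model_e820_type type;
};

/* As described in the thread: RESERVED_KERN is handled like RAM. */
static bool treated_as_ram(enum model_e820_type t)
{
	return t == M_RAM || t == M_RESERVED_KERN;
}

static const char *iomem_name(enum model_e820_type t)
{
	return treated_as_ram(t) ? "System RAM" : "Reserved";
}

/* Longest run of address-contiguous RAM-like entries, i.e. what would merge into one block. */
static uint64_t longest_ram_run(const struct model_e820_entry *e, size_t n)
{
	uint64_t best = 0, cur = 0, next_addr = 0;

	for (size_t i = 0; i < n; i++) {
		if (!treated_as_ram(e[i].type)) {
			cur = 0;
			continue;
		}
		if (cur && e[i].addr == next_addr)
			cur += e[i].size;
		else
			cur = e[i].size;
		next_addr = e[i].addr + e[i].size;
		if (cur > best)
			best = cur;
	}
	return best;
}
```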

>
>
> > Index: linux/arch/x86/kernel/e820.c
> > ===
> > --- linux.orig/arch/x86/kernel/e820.c
> > +++ linux/arch/x86/kernel/e820.c
> > @@ -1015,16 +1015,6 @@ void __init e820__reserve_setup_data(voi
> >  		pa_next = data->next;
> >
> >  		e820__range_update(pa_data, sizeof(*data)+data->len, E820_TYPE_RAM, E820_TYPE_RESERVED_KERN);
> > -
> > -		/*
> > -		 * SETUP_EFI and SETUP_IMA are supplied by kexec and do not need
> > -		 * to be reserved.
> > -		 */
> > -		if (data->type != SETUP_EFI && data->type != SETUP_IMA)
> > -			e820__range_update_kexec(pa_data,
> > -						 sizeof(*data) + data->len,
> > -						 E820_TYPE_RAM, E820_TYPE_RESERVED_KERN);
> > -
>
> Your tree is missing this recent commit:
> 7fd817c906503b6813ea3b41f5fdf4192449a

Re: Question about Address Range Validation in Crash Kernel Allocation

2024-03-21 Thread Dave Young
Hi,

On Thu, 21 Mar 2024 at 17:49, Li Huafei  wrote:
>
> Hi Baoquan,
>
> On 2024/3/21 17:17, chenhaixiang (A) wrote:
> >
> >>> I'm sorry for the delay. Here are some details from the boot log and
> >> /proc/iomem:
> >>> The Boot log:
> >>> [0.00] Linux version 6.8.0 (root@localhost.localdomain) (gcc (GCC)
> >> 10.3.1, GNU ld (GNU Binutils) 2.37) #3 SMP PREEMPT_DYNAMIC Wed Mar 20
> >> 11:46:11 UTC 2024
> >>> [0.00] Command line: BOOT_IMAGE=/vmlinuz-6.8.0
> >> root=/dev/mapper/root ro crashkernel=512M resume=/dev/mapper/swap
> >> rd.lvm.lv=root rd.lvm.lv=swap crash_kexec_post_notifiers softlockup_panic=1
> >> reserve_kbox_mem=16M fsck.mode=auto fsck.repair=yes panic=3
> >> nmi_watchdog=1 quiet rd.shell=0 memblock=debug efi=debug
> >> console=ttyS0,115200n8 console=tty0
> >> ..snip...
> >>> [0.022622] memblock_phys_alloc_range: 536870912 bytes align=0x100
> >> from=0x max_addr=0x0001
> >> reserve_crashkernel_generic+0x7c/0x220
> >>> [0.022628] memblock_phys_alloc_range: 536870912 bytes align=0x100
> >> from=0x0001 max_addr=0x4000
> >> reserve_crashkernel_generic+0x7c/0x220
> >>> [0.022632] memblock_reserve: [0x00c01f00-0x00c03eff]
> >> memblock_alloc_range_nid+0xee/0x170
> >>> [0.022634] memblock_phys_alloc_range: 268435456 bytes align=0x100
> >> from=0x max_addr=0x0001
> >> reserve_crashkernel_generic+0x11d/0x220
> >>> [0.022638] memblock_reserve: [0x4900-0x58ff]
> >> memblock_alloc_range_nid+0xee/0x170
> >>> [0.022640] crashkernel low memory reserved: 0x4900 - 0x5900
> >> (256 MB)
> >>> [0.022641] crashkernel reserved: 0x00c01f00 -
> >> 0x00c03f00 (512 MB)
> >>
> >> Here, crashkernel,low is reserved in region:  [0x4900 - 0x5900] 
> >> (256
> >> MB)
> >>   crashkernel,high is reserved in region: [0x00c01f00 -
> >> 0x00c03f00] (512 MB) ..
> >>> [0.029839] memblock_reserve: [0x00c03740-0x00c03f7f]
> >> memblock_alloc_range_nid+0xee/0x170
> >>> [0.029843] e820: update [mem 0x53cbd000-0x53cc] usable ==>
> >> reserved
> >>> [0.029861] TSC deadline timer available
> >>
> >> Then here, region [0x53cbd000-0x53cc] is reserved in e820, and print 
> >> abvoe
> >> "usable ==> reserved". This should be the step which prevents earlier 
> >> reserved
> >> crashkernel,low from being added to iomem tree. I am not sure what 
> >> triggered
> >> the e820 update.
>
> We added dump_stack () printing in efi_mem_reserve () and found that
> [0x53cbd000-0x53cc] was reserved by BGRT:
>
>   [0.032259] e820: update [mem 0x53cbd000-0x53cc] usable ==>
> reserved
>   [0.032262] CPU: 0 PID: 0 Comm: swapper Not tainted
> 5.10.0-60.18.0.50.h820.eulerosv2r11.x86_64 #7
>   [0.032263] Hardware name: Huawei 2288H V5/BC11SPSCB0, BIOS 8.25
> 08/30/2022
>   [0.032264] Call Trace:
>   [0.032265]  ? dump_stack+0x57/0x6e
>   [0.032267]  ? bgrt_init+0xc2/0xc2
>   [0.032268]  ? __e820__range_update+0x7a/0x1d6
>   [0.032270]  ? bgrt_init+0xc2/0xc2
>   [0.032272]  ? bgrt_init+0xc2/0xc2
>   [0.032274]  ? efi_arch_mem_reserve+0x1a3/0x1d0
>   [0.032276]  ? efi_mem_reserve+0x2d/0x42
>   [0.032278]  ? acpi_parse_bgrt+0xa/0x11
>   [0.032279]  ? acpi_table_parse+0x86/0xbc
>   [0.032281]  ? acpi_boot_init+0x79/0xad
>   [0.032282]  ? setup_arch+0x835/0x954
>   [0.032284]  ? start_kernel+0x5d/0x455
>   [0.032286]  ? secondary_startup_64_no_verify+0xc2/0xcb
>
> efi_reserve_boot_services() has reserved memory of type
> EFI_BOOT_SERVICES_CODE & EFI_BOOT_SERVICES_DATA  before crashkernel.
> efi_bgrt_init() assumes that EFI_BOOT_SERVICES_DATA is not reserved by
> other modules. Then, the e820_table is directly updated, and the BGRT
> memory is reserved.
>
> However, memblock_is_region_reserved() in efi_reserve_boot_services()
> returns true when the ranges only overlap.
>
>  already_reserved = memblock_is_region_reserved(start, size);

Do you mean efi_reserve_boot_services() is supposed to reserve the BGRT
memory, but does not because the region overlaps with some other reserved
region?  If so, can you debug and find which exact memblock reserved region
overlaps with the BGRT?
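The distinction raised here is between "overlaps some reserved range" and "is fully covered by a reserved range". The sketch below contrasts the two predicates on plain arrays. `struct mrange`, `any_overlap()` and `inside_one()` are hypothetical and only model the semantics described in the quoted analysis, not memblock's internals; `inside_one()` also ignores the case of a region covered by several adjacent reservations.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct mrange {
	uint64_t base;
	uint64_t size; /* covers [base, base + size) */
};

static bool mr_overlaps(const struct mrange *a, const struct mrange *b)
{
	return a->base < b->base + b->size && b->base < a->base + a->size;
}

/* "Already reserved" if any byte overlaps a reserved range: the semantics described above. */
static bool any_overlap(const struct mrange *r, const struct mrange *resv, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (mr_overlaps(r, &resv[i]))
			return true;
	return false;
}

/* Stricter check: @r lies entirely inside a single reserved range. */
static bool inside_one(const struct mrange *r, const struct mrange *resv, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (r->base >= resv[i].base &&
		    r->base + r->size <= resv[i].base + resv[i].size)
			return true;
	return false;
}
```

A boot services region that only partially overlaps an earlier reservation (for example a crashkernel range) is reported as already reserved by the first predicate, even though most of it is not covered.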

BTW, the previous email threads are weird and are not threading correctly,
which makes it hard to find information.

>
>  /*
>   * Because the following memblock_reserve() is paired
>   * with memblock_free_late() for this region in
>   * efi_free_boot_services(), we must be extremely
>   * careful not to reserve, and subsequently free,
>   * critical regions of memory (like the kernel image) or
>   * those regions that somebody else has already
>   * reserved.
>   *
>   * A good example of a critical region that must not be
>   * freed is page zero (first 4Kb of memory), which may
>   * contain boot services 

[PATCH V2] x86/kexec: do not update E820 kexec table for setup_data

2024-03-21 Thread Dave Young
crashkernel reservation failed on a Thinkpad t440s laptop recently. 
Actually the memblock reservation succeeded, but later insert_resource()
failed.

Test steps:
kexec load -> /* make sure to add a crashkernel param, e.g. crashkernel=160M */
kexec reboot ->
dmesg|grep "crashkernel reserved";
a crashkernel memory range like the one below is reported as reserved successfully:
0xd000 - 0xda00
but there is no such "Crash kernel" region in /proc/iomem.

The background story is like below:

Currently the E820 code reserves setup_data regions for both the current
kernel and the kexec kernel, and it inserts them into the resources list.
When rebooting into the kexec kernel nobody passes the old setup_data;
kexec only passes fresh SETUP_EFI and SETUP_IMA if needed. Thus the old
setup_data memory is not used at all.

Because the old kernel updates the kexec e820 table as well, the kexec
kernel sees those regions as E820_TYPE_RESERVED_KERN, and later the old
setup_data regions are inserted into the resources list in the kexec
kernel by e820__reserve_resources().

Note: because no setup_data is passed in for those old regions, they are
not early reserved (by early_reserve_memory()), so the crashkernel memblock
reservation treats them as usable memory and may reserve a crashkernel
region that overlaps the old setup_data regions. That is exactly the bug
noticed here: the kdump (crashkernel) insert_resource() call failed because
e820__reserve_resources() had already added the overlapping chunks to
/proc/iomem.

Finally, looking at the code, the old setup_data regions are not used
at all, as no setup_data is passed in by the kexec boot loader. Although
something like SETUP_PCI could be needed, kexec should pass that
information as new setup_data so that the kexec kernel can take care of
it. That should be handled in separate patches if needed.

Thus drop the useless buggy code here.

Signed-off-by: Dave Young 
---
V2: changelog grammar fixes [suggestions from Huang Kai]
 arch/x86/kernel/e820.c |   16 +---
 1 file changed, 1 insertion(+), 15 deletions(-)

Index: linux/arch/x86/kernel/e820.c
===
--- linux.orig/arch/x86/kernel/e820.c
+++ linux/arch/x86/kernel/e820.c
@@ -1015,16 +1015,6 @@ void __init e820__reserve_setup_data(voi
 		pa_next = data->next;
 
 		e820__range_update(pa_data, sizeof(*data)+data->len, E820_TYPE_RAM, E820_TYPE_RESERVED_KERN);
-
-		/*
-		 * SETUP_EFI and SETUP_IMA are supplied by kexec and do not need
-		 * to be reserved.
-		 */
-		if (data->type != SETUP_EFI && data->type != SETUP_IMA)
-			e820__range_update_kexec(pa_data,
-						 sizeof(*data) + data->len,
-						 E820_TYPE_RAM, E820_TYPE_RESERVED_KERN);
-
 		if (data->type == SETUP_INDIRECT) {
 			len += data->len;
 			early_memunmap(data, sizeof(*data));
@@ -1036,12 +1026,9 @@ void __init e820__reserve_setup_data(voi
 
 			indirect = (struct setup_indirect *)data->data;
 
-			if (indirect->type != SETUP_INDIRECT) {
+			if (indirect->type != SETUP_INDIRECT)
 				e820__range_update(indirect->addr, indirect->len,
 						   E820_TYPE_RAM, E820_TYPE_RESERVED_KERN);
-				e820__range_update_kexec(indirect->addr, indirect->len,
-							 E820_TYPE_RAM, E820_TYPE_RESERVED_KERN);
-			}
 		}
 
 		pa_data = pa_next;
@@ -1049,7 +1036,6 @@ void __init e820__reserve_setup_data(voi
 	}
 
 	e820__update_table(e820_table);
-	e820__update_table(e820_table_kexec);
 
 	pr_info("extended physical RAM map:\n");
 	e820__print_table("reserve setup_data");
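The test steps in this changelog end with checking /proc/iomem for a "Crash kernel" entry. A minimal C sketch of that check follows, assuming the usual `start-end : name` line format with leading indentation for nested entries; `parse_iomem_line()`, `is_crash_kernel_line()` and `count_crash_kernel()` are hypothetical helpers. It should be run as root, since on recent kernels unprivileged readers of /proc/iomem typically see zeroed addresses.

```c
#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Parse one /proc/iomem line of the form "<indent>start-end : name". */
static bool parse_iomem_line(const char *line, uint64_t *start, uint64_t *end,
			     const char **name)
{
	int consumed = 0;

	if (sscanf(line, " %" SCNx64 "-%" SCNx64 " : %n", start, end, &consumed) != 2 ||
	    consumed == 0)
		return false;
	*name = line + consumed;
	return true;
}

/* True if @line is a "Crash kernel" entry; its bounds are stored in @start/@end. */
static bool is_crash_kernel_line(const char *line, uint64_t *start, uint64_t *end)
{
	static const char want[] = "Crash kernel";
	const size_t len = sizeof(want) - 1;
	const char *name;

	if (!parse_iomem_line(line, start, end, &name))
		return false;
	return strncmp(name, want, len) == 0 &&
	       (name[len] == '\0' || name[len] == '\n');
}

/* Count "Crash kernel" entries in an already opened /proc/iomem stream. */
int count_crash_kernel(FILE *fp)
{
	char buf[256];
	uint64_t s, e;
	int found = 0;

	while (fgets(buf, sizeof(buf), fp))
		if (is_crash_kernel_line(buf, &s, &e))
			found++;
	return found;
}
```

For example, pass the stream from `fopen("/proc/iomem", "r")` to `count_crash_kernel()`; a result of 0 after dmesg reported a successful "crashkernel reserved" line reproduces the bug.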




Re: [PATCH] x86/kexec: do not update E820 kexec table for setup_data

2024-03-20 Thread Dave Young
Hi,

On Thu, 21 Mar 2024 at 05:56, Huang, Kai  wrote:
>
> Hi Dave,
>
> Some nitpicking in changelog:

Will fix the grammar issues.  Thanks for your review!

>
> On 5/03/2024 2:32 pm, Dave Young wrote:
> > crashkernel reservation failed on a Thinkpad t440s laptop recently,
>
> ',' -> '.' to make it as a standalone sentence.
>
> > Actually the memblock reservation succeeded, but later insert_resource()
> > failed.
> >
> > Test step:
> > kexec load ->
> >   kexec reboot ->
> >   check the crashkernel memory
> >   dmesg|grep "crashkernel reserved"; saw reserved suceeeded:
>
> "suceeeded" -> "succeeded".
>
> >   0xd000 - 0xda00
> >   grep Crash /proc/iomem: got nothing
>
> And somehow I found it's not easy to read.  :-)
>
> >
> > The background story is like below:
>
> Better to have an blank line to make text more breathable.
>
> > Currently E820 code reserves setup_data regions for both the current kernel
> > and the kexec kernel, and it will also insert them into resources list.
>
> "will insert" -> "inserts".
>
> > Before the kexec kernel reboot nobody passes the old setup_data, kexec only
>
>   ^ "reboots" ^ and
>
> > passes SETUP_EFI and SETUP_IMA if needed.  Thus the old setup data memory
> > are not used at all. But due to old kernel updated the kexec e820 table
>
>^ is  ^ updates
>
> > as well so kexec kernel see them as E820_TYPE_RESERVED_KERN regions, later
>
> "so kexec kernel" -> ", the kexec kernel"
>
> "see" -> "sees"
>
> ", later" -> ", and later"
>
> > the old setup_data regions will be inserted into resources list in kexec
>
> "will be" -> "are"
>
> > kernel by e820__reserve_resources().
> >
> > Note, due to no setup_data passed in for those old regions they are not
>
> ^ is
>
> > early reserved (by function early_reserve_memory), crashkernel memblock
>
>^ and the
>
> > reservation will just regard them as usable memory and it could reserve
>
> "regard" -> "treat"
>
> > reserve crashkernel region overlaps with the old setup_data regions.
>
> double "reserve".
>
> "crashkernel region" -> "the crashkernel region"
>
> "overlaps" -> "which overlaps"
>
> >
> > Just like the bug I noticed here, kdump insert_resource failed because
> > e820__reserve_resources added the overlapped chunks in /proc/iomem already.
>
>   ^ has added
> >
> > Finally, looking at the code, the old setup_data regions are not used
> > at all as no setup_data passed in by the kexec boot loader. Although
>
>   ^ is passed
>
> > something like SETUP_PCI etc could be needed, kexec should pass
> > the info as setup_data so that kexec kernel can take care of them.
> > This should be taken care of in other separate patches if needed.
> >
> > Thus drop the useless buggy code here.
> >
> > Signed-off-by: Dave Young 
> > ---
> >   arch/x86/kernel/e820.c |   16 +---
> >   1 file changed, 1 insertion(+), 15 deletions(-)
> >
> > Index: linux/arch/x86/kernel/e820.c
> > ===
> > --- linux.orig/arch/x86/kernel/e820.c
> > +++ linux/arch/x86/kernel/e820.c
> > @@ -1015,16 +1015,6 @@ void __init e820__reserve_setup_data(voi
> >  		pa_next = data->next;
> >
> >  		e820__range_update(pa_data, sizeof(*data)+data->len, E820_TYPE_RAM, E820_TYPE_RESERVED_KERN);
> > -
> > -		/*
> > -		 * SETUP_EFI and SETUP_IMA are supplied by kexec and do not need
> > -		 * to be reserved.
> > -		 */
> > -		if (data->type != SETUP_EFI && data->type != SETUP_IMA)
> > -			e820__range_update_kexec(pa_data,
> > -						 sizeof(*data) + data->len,
> > -						 E820_TYPE_RAM, E820_TYPE_RESERVED_KERN);
> > -
> >  		if (data->type == SETUP_INDIRECT) {
> >  			len += dat

Re: [PATCH] x86/kexec: do not update E820 kexec table for setup_data

2024-03-18 Thread Dave Young
On Tue, 5 Mar 2024 at 09:32, Dave Young  wrote:
>
> crashkernel reservation failed on a Thinkpad t440s laptop recently,
> Actually the memblock reservation succeeded, but later insert_resource()
> failed.
>
> Test step:
> kexec load ->
> kexec reboot ->
> check the crashkernel memory
> dmesg|grep "crashkernel reserved"; saw reserved suceeeded:
> 0xd000 - 0xda00
> grep Crash /proc/iomem: got nothing
>
> The background story is like below:
> Currently E820 code reserves setup_data regions for both the current kernel
> and the kexec kernel, and it will also insert them into resources list.
> Before the kexec kernel reboot nobody passes the old setup_data, kexec only
> passes SETUP_EFI and SETUP_IMA if needed.  Thus the old setup data memory
> are not used at all. But due to old kernel updated the kexec e820 table
> as well so kexec kernel see them as E820_TYPE_RESERVED_KERN regions, later
> the old setup_data regions will be inserted into resources list in kexec
> kernel by e820__reserve_resources().
>
> Note, due to no setup_data passed in for those old regions they are not
> early reserved (by function early_reserve_memory), crashkernel memblock
> reservation will just regard them as usable memory and it could reserve
> reserve crashkernel region overlaps with the old setup_data regions.
>
> Just like the bug I noticed here, kdump insert_resource failed because
> e820__reserve_resources added the overlapped chunks in /proc/iomem already.
>
> Finally, looking at the code, the old setup_data regions are not used
> at all as no setup_data passed in by the kexec boot loader. Although
> something like SETUP_PCI etc could be needed, kexec should pass
> the info as setup_data so that kexec kernel can take care of them.
> This should be taken care of in other separate patches if needed.
>
> Thus drop the useless buggy code here.
>
> Signed-off-by: Dave Young 
> ---
>  arch/x86/kernel/e820.c |   16 +---
>  1 file changed, 1 insertion(+), 15 deletions(-)
>
> Index: linux/arch/x86/kernel/e820.c
> ===
> --- linux.orig/arch/x86/kernel/e820.c
> +++ linux/arch/x86/kernel/e820.c
> @@ -1015,16 +1015,6 @@ void __init e820__reserve_setup_data(voi
>  		pa_next = data->next;
>
>  		e820__range_update(pa_data, sizeof(*data)+data->len, E820_TYPE_RAM, E820_TYPE_RESERVED_KERN);
> -
> -		/*
> -		 * SETUP_EFI and SETUP_IMA are supplied by kexec and do not need
> -		 * to be reserved.
> -		 */
> -		if (data->type != SETUP_EFI && data->type != SETUP_IMA)
> -			e820__range_update_kexec(pa_data,
> -						 sizeof(*data) + data->len,
> -						 E820_TYPE_RAM, E820_TYPE_RESERVED_KERN);
> -
>  		if (data->type == SETUP_INDIRECT) {
>  			len += data->len;
>  			early_memunmap(data, sizeof(*data));
> @@ -1036,12 +1026,9 @@ void __init e820__reserve_setup_data(voi
>
>  			indirect = (struct setup_indirect *)data->data;
>
> -			if (indirect->type != SETUP_INDIRECT) {
> +			if (indirect->type != SETUP_INDIRECT)
>  				e820__range_update(indirect->addr, indirect->len,
>  						   E820_TYPE_RAM, E820_TYPE_RESERVED_KERN);
> -				e820__range_update_kexec(indirect->addr, indirect->len,
> -							 E820_TYPE_RAM, E820_TYPE_RESERVED_KERN);
> -			}
>  		}
>
>  		pa_data = pa_next;
> @@ -1049,7 +1036,6 @@ void __init e820__reserve_setup_data(voi
>  	}
>
>  	e820__update_table(e820_table);
> -	e820__update_table(e820_table_kexec);
>
>  	pr_info("extended physical RAM map:\n");
>  	e820__print_table("reserve setup_data");
>

Kindly ping for review.

Thanks!
Dave




Re: [PATCH v2 1/3] efi/x86: skip efi_arch_mem_reserve() in case of kexec.

2024-03-18 Thread Dave Young
Hi,

Added Ard in cc.

On 03/18/24 at 07:02am, Ashish Kalra wrote:
> From: Ashish Kalra 
> 
> For kexec use case, need to use and stick to the EFI memmap passed
> from the first kernel via boot-params/setup data, hence,
> skip efi_arch_mem_reserve() during kexec.
> 
> Additionally during SNP guest kexec testing discovered that EFI memmap
> is corrupted during chained kexec. kexec_enter_virtual_mode() during
> late init will remap the efi_memmap physical pages allocated in
> efi_arch_mem_reserve() via memboot & then subsequently cause random
> EFI memmap corruption once memblock is freed/teared-down.
> 
> Signed-off-by: Ashish Kalra 
> ---
>  arch/x86/platform/efi/quirks.c | 10 ++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/arch/x86/platform/efi/quirks.c b/arch/x86/platform/efi/quirks.c
> index f0cc00032751..d4562d074371 100644
> --- a/arch/x86/platform/efi/quirks.c
> +++ b/arch/x86/platform/efi/quirks.c
> @@ -258,6 +258,16 @@ void __init efi_arch_mem_reserve(phys_addr_t addr, u64 size)
>  	int num_entries;
>  	void *new;
>  
> +	/*
> +	 * For kexec use case, we need to use the EFI memmap passed from the first
> +	 * kernel via setup data, so we need to skip this.
> +	 * Additionally kexec_enter_virtual_mode() during late init will remap
> +	 * the efi_memmap physical pages allocated here via memboot & then
> +	 * subsequently cause random EFI memmap corruption once memblock is freed.

Can you elaborate a bit on the corruption? Is it reproducible without
SNP?

> +  */
> + if (efi_setup)
> + return;
> +

How about checking the md attribute instead of checking efi_setup? I
personally feel it is a bit better, something like below:

diff --git a/arch/x86/platform/efi/quirks.c b/arch/x86/platform/efi/quirks.c
index f0cc00032751..699332b075bb 100644
--- a/arch/x86/platform/efi/quirks.c
+++ b/arch/x86/platform/efi/quirks.c
@@ -255,15 +255,24 @@ void __init efi_arch_mem_reserve(phys_addr_t addr, u64 size)
 	struct efi_memory_map_data data = { 0 };
 	struct efi_mem_range mr;
 	efi_memory_desc_t md;
-	int num_entries;
+	int num_entries, ret;
 	void *new;
 
-	if (efi_mem_desc_lookup(addr, &md) ||
-	    md.type != EFI_BOOT_SERVICES_DATA) {
+	ret = efi_mem_desc_lookup(addr, &md);
+	if (ret) {
 		pr_err("Failed to lookup EFI memory descriptor for %pa\n", &addr);
 		return;
 	}
 
+	if (md.type != EFI_BOOT_SERVICES_DATA) {
+		pr_err("Skip reserving non EFI Boot Service Data memory for %pa\n", &addr);
+		return;
+	}
+
+	/* Kexec copied the efi memmap from the 1st kernel, thus skip the case. */
+	if (md.attribute & EFI_MEMORY_RUNTIME)
+		return;
+
 	if (addr + size > md.phys_addr + (md.num_pages << EFI_PAGE_SHIFT)) {
 		pr_err("Region spans EFI memory descriptors, %pa\n", &addr);
 		return;
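The order of checks in the suggested diff can be modelled in isolation. The sketch below assumes `efi_mem_desc_lookup()` has already succeeded and only reproduces the decisions that follow it. `struct model_md`, `decide()` and the `MODEL_` constants are hypothetical; their values follow the UEFI definitions of EfiBootServicesData (4), EFI_MEMORY_RUNTIME (bit 63) and 4 KiB EFI pages. Per the comment in the diff, a descriptor that already carries EFI_MEMORY_RUNTIME is taken to mean the memmap was copied from the 1st kernel.

```c
#include <stdint.h>

/* Values as defined by the UEFI specification. */
#define MODEL_EFI_BOOT_SERVICES_DATA 4u
#define MODEL_EFI_MEMORY_RUNTIME     (1ULL << 63)
#define MODEL_EFI_PAGE_SHIFT         12

/* Hypothetical subset of efi_memory_desc_t. */
struct model_md {
	uint32_t type;
	uint64_t phys_addr;
	uint64_t num_pages;
	uint64_t attribute;
};

enum reserve_decision {
	RESERVE_SKIP_NOT_BS_DATA,
	RESERVE_SKIP_RUNTIME,
	RESERVE_SKIP_SPANS,
	RESERVE_DO_IT,
};

/* Same order as the suggested diff, starting after a successful lookup. */
static enum reserve_decision decide(const struct model_md *md, uint64_t addr, uint64_t size)
{
	if (md->type != MODEL_EFI_BOOT_SERVICES_DATA)
		return RESERVE_SKIP_NOT_BS_DATA;
	/* Already runtime: treated as the memmap inherited from the 1st kernel. */
	if (md->attribute & MODEL_EFI_MEMORY_RUNTIME)
		return RESERVE_SKIP_RUNTIME;
	if (addr + size > md->phys_addr + (md->num_pages << MODEL_EFI_PAGE_SHIFT))
		return RESERVE_SKIP_SPANS;
	return RESERVE_DO_IT;
}
```

Putting the attribute check before the span check means a kexec'd kernel bails out early without printing the "Region spans" error.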




Re: [REGRESSION] kexec does firmware reboot in kernel v6.7.6

2024-03-14 Thread Dave Young
On Thu, 14 Mar 2024 at 00:18, Steve Wahl  wrote:
>
> On Wed, Mar 13, 2024 at 07:16:23AM -0500, Eric W. Biederman wrote:
> >
> > Kexec happens on identity mapped page tables.
> >
> > The files of interest are machine_kexec_64.c and relocate_kernel_64.S
> >
> > I suspect either the building of the identity mappged page table in
> > machine_kexec_prepare, or the switching to the page table in
> > identity_mapped in relocate_kernel_64.S is where something goes wrong.
> >
> > Probably in kernel_ident_mapping_init as that code is directly used
> > to build the identity mapped page tables.
> >
> > Hmm.
> >
> > Your change is commit d794734c9bbf ("x86/mm/ident_map: Use gbpages only
> > where full GB page should be mapped.")
>
> Yeah, sorry, I accidentally used the stable cherry-pick commit id that
> Pavin Joseph found with his bisect results.
>
> > Given the simplicity of that change itself my guess is that somewhere in
> > the first 1Gb there are pages that needed to be mapped like the idt at 0
> > that are not getting mapped.
>
> ...
>
> > It might be worth setting up early printk on some of these systems
> > and seeing if the failure is in early boot up of the new kernel (that is
> > using kexec supplied identity mapped pages) rather than in kexec per-se.
> >
> > But that is just my guess at the moment.
>
> Thanks for the input.  I was thinking in terms of running out of
> memory somewhere because we're using more page table entries than we
> used to.  But you've got me thinking that maybe some necessary region
> is not explicitly requested to be placed in the identity map, but is
> by luck included in the rounding errors when we use gbpages.

Yes, it is possible. Here is an example case:
http://lists.infradead.org/pipermail/kexec/2023-June/027301.html
The final change there was to avoid doing AMD-specific things on Intel
platforms, but the mapping code itself is still not fixed in a good way.
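The "luck" in the rounding is easy to see with a little arithmetic: mapping a requested range with 1 GiB pages rounds it out to 1 GiB boundaries, which can pull in low memory (such as page 0) that was never requested, whereas 2 MiB pages stay much closer to the request. The sketch only illustrates that arithmetic; `round_out()` and `covers()` are hypothetical helpers, not `kernel_ident_mapping_init()`.

```c
#include <stdbool.h>
#include <stdint.h>

#define SZ_2M (1ULL << 21)
#define SZ_1G (1ULL << 30)

struct span {
	uint64_t start;
	uint64_t end; /* exclusive */
};

/* Round a requested range out to whole pages of size @pg (a power of two). */
static struct span round_out(struct span req, uint64_t pg)
{
	struct span s = { req.start & ~(pg - 1), (req.end + pg - 1) & ~(pg - 1) };

	return s;
}

static bool covers(struct span s, uint64_t addr)
{
	return addr >= s.start && addr < s.end;
}
```

For a request of [32 MiB, 48 MiB), 1 GiB pages map [0, 1 GiB) and therefore also cover address 0, while 2 MiB pages map exactly the request and do not.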

>
> At any rate, since I am still unable to reproduce this for myself, I
> am going to contact Pavin Joseph off-list and see if he's willing to
> do a few debugging kernel steps for me and send me the results, to see
> if I can get this figured out.  (I believe trimming the CC list and/or
> going private is usually frowned upon for the LKML, but I think this
> is appropriate as it only adds noise for the rest.  Let me know if I'm
> wrong.)
>
> Thank you.
>
> --> Steve Wahl
>
> --
> Steve Wahl, Hewlett Packard Enterprise
>




[PATCH] x86/kexec: do not update E820 kexec table for setup_data

2024-03-04 Thread Dave Young
crashkernel reservation failed on a Thinkpad t440s laptop recently,
Actually the memblock reservation succeeded, but later insert_resource()
failed.

Test step:
kexec load ->
kexec reboot -> 
check the crashkernel memory
dmesg|grep "crashkernel reserved"; saw reserved suceeeded:
0xd000 - 0xda00
grep Crash /proc/iomem: got nothing 

The background story is like below:
Currently E820 code reserves setup_data regions for both the current kernel
and the kexec kernel, and it will also insert them into resources list.
Before the kexec kernel reboot nobody passes the old setup_data, kexec only
passes SETUP_EFI and SETUP_IMA if needed.  Thus the old setup data memory
are not used at all. But due to old kernel updated the kexec e820 table
as well so kexec kernel see them as E820_TYPE_RESERVED_KERN regions, later
the old setup_data regions will be inserted into resources list in kexec
kernel by e820__reserve_resources().

Note, due to no setup_data passed in for those old regions they are not
early reserved (by function early_reserve_memory), crashkernel memblock
reservation will just regard them as usable memory and it could reserve
reserve crashkernel region overlaps with the old setup_data regions.

Just like the bug I noticed here, kdump insert_resource failed because
e820__reserve_resources added the overlapped chunks in /proc/iomem already.

Finally, looking at the code, the old setup_data regions are not used
at all as no setup_data passed in by the kexec boot loader. Although
something like SETUP_PCI etc could be needed, kexec should pass
the info as setup_data so that kexec kernel can take care of them.
This should be taken care of in other separate patches if needed.

Thus drop the useless buggy code here.

Signed-off-by: Dave Young 
---
 arch/x86/kernel/e820.c |   16 +---
 1 file changed, 1 insertion(+), 15 deletions(-)

Index: linux/arch/x86/kernel/e820.c
===
--- linux.orig/arch/x86/kernel/e820.c
+++ linux/arch/x86/kernel/e820.c
@@ -1015,16 +1015,6 @@ void __init e820__reserve_setup_data(voi
 		pa_next = data->next;
 
 		e820__range_update(pa_data, sizeof(*data)+data->len, E820_TYPE_RAM, E820_TYPE_RESERVED_KERN);
-
-		/*
-		 * SETUP_EFI and SETUP_IMA are supplied by kexec and do not need
-		 * to be reserved.
-		 */
-		if (data->type != SETUP_EFI && data->type != SETUP_IMA)
-			e820__range_update_kexec(pa_data,
-						 sizeof(*data) + data->len,
-						 E820_TYPE_RAM, E820_TYPE_RESERVED_KERN);
-
 		if (data->type == SETUP_INDIRECT) {
 			len += data->len;
 			early_memunmap(data, sizeof(*data));
@@ -1036,12 +1026,9 @@ void __init e820__reserve_setup_data(voi
 
 			indirect = (struct setup_indirect *)data->data;
 
-			if (indirect->type != SETUP_INDIRECT) {
+			if (indirect->type != SETUP_INDIRECT)
 				e820__range_update(indirect->addr, indirect->len,
 						   E820_TYPE_RAM, E820_TYPE_RESERVED_KERN);
-				e820__range_update_kexec(indirect->addr, indirect->len,
-							 E820_TYPE_RAM, E820_TYPE_RESERVED_KERN);
-			}
 		}
 
 		pa_data = pa_next;
@@ -1049,7 +1036,6 @@ void __init e820__reserve_setup_data(voi
 	}
 
 	e820__update_table(e820_table);
-	e820__update_table(e820_table_kexec);
 
 	pr_info("extended physical RAM map:\n");
 	e820__print_table("reserve setup_data");
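For reference, the loop left after this patch walks the setup_data list and marks each node (header plus payload), plus the target of each SETUP_INDIRECT node, as E820_TYPE_RESERVED_KERN in e820_table only. A userspace model of that walk follows, assuming pointers stand in for the physical `next` addresses the kernel maps with `early_memremap()`. All `model_` names are hypothetical; the 16-byte header reflects struct setup_data's `next`, `type` and `len` fields, and `MODEL_SETUP_INDIRECT` mirrors SETUP_INDIRECT.

```c
#include <stddef.h>
#include <stdint.h>

#define MODEL_SETUP_INDIRECT (1u << 31)
#define MODEL_HDR_SIZE 16u /* u64 next + u32 type + u32 len */

struct model_indirect {
	uint32_t type;
	uint64_t len;
	uint64_t addr;
};

/* Hypothetical userspace node standing in for struct setup_data. */
struct model_setup_data {
	const struct model_setup_data *next;
	uint64_t pa;                           /* pretend physical address of the node */
	uint32_t type;
	uint32_t len;                          /* payload length, header excluded */
	const struct model_indirect *indirect; /* payload of SETUP_INDIRECT nodes */
};

struct model_range {
	uint64_t start;
	uint64_t len;
};

/*
 * Collect the ranges the patched loop marks E820_TYPE_RESERVED_KERN.
 * After the patch they go into e820_table only, not e820_table_kexec.
 */
static size_t collect_ranges(const struct model_setup_data *d,
			     struct model_range *out, size_t max)
{
	size_t n = 0;

	for (; d && n < max; d = d->next) {
		out[n++] = (struct model_range){ d->pa, MODEL_HDR_SIZE + d->len };
		if (d->type == MODEL_SETUP_INDIRECT && d->indirect &&
		    d->indirect->type != MODEL_SETUP_INDIRECT && n < max)
			out[n++] = (struct model_range){ d->indirect->addr, d->indirect->len };
	}
	return n;
}
```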




Re: panic context: was: Re: [PATCH printk v2 04/11] printk: nbcon: Provide functions to mark atomic write sections

2023-10-16 Thread Dave Young
[Added more people in cc]

On 10/08/23 at 12:19pm, John Ogness wrote:
> Hi Petr,
> 
> On 2023-10-06, Petr Mladek  wrote:
> >> During the demo at LPC2022 we had the situation that there was a large
> >> backlog when a WARN was hit. With current mainline the first line of the
> >> WARN is put into the ringbuffer and then the entire backlog is flushed
> >> before storing the rest of the WARN into the ringbuffer. At the time it
> >> was obvious that we should finish storing the WARN message and then
> >> start flushing the backlog.
> >
> > This talks about the "emergency" context (WARN/OOPS/watchdog).
> > The system might be in big troubles but it would still try to continue.
> >
> > Do we really want to defer the flush also for panic() context?
> 
> We can start flushing right after the backtrace is in the
> ringbuffer. But flushing the backlog _before_ putting the backtrace into
> the ringbuffer was not desired because if there is a large backlog, the
> machine may not survive to get the backtrace out. And in that case it
> won't even be in the ringbuffer to be used by other debugging
> tools.
> 
> > I ask because I was not on LPC 2022 in person and I do not remember
> > all details.
> 
> The LPC2022 demo/talk was recorded:
> 
> https://www.youtube.com/watch?v=TVhNcKQvzxI
> 
> At 55:55 is where the situation occurred and triggered the conversation,
> ultimately leading to this new feature.
> 
> You may also want to reread my summary:
> 
> https://lore.kernel.org/lkml/875yheqh6v@jogness.linutronix.de
> 
> as well as Linus' follow-up message:
> 
> https://lore.kernel.org/lkml/CAHk-=wieXPMGEm7E=Sz2utzZdW1d=9hjbwgyaaaipxnmxr0...@mail.gmail.com
> 
> > But it is tricky in panic(), see 8th patch at
> > https://lore.kernel.org/r/20230919230856.661435-9-john.ogn...@linutronix.de
> >
> >+ nbcon_atomic_exit() is called only in one code path.
> 
> Correct. When panic() is complete and the machine goes into its infinite
> loop. This is also the point where it will attempt an unsafe flush, if
> it could not get the messages out yet.
> 
> >+ nbcon_atomic_flush_all() is used in other paths. It looks like
> >  a "Whack a mole" game to me.
> 
> Several different outputs occur during panic(). The flush is everywhere
> where something significant has been put into the ringbuffer and now it
> would be OK to flush it.
> 
> >+ messages are never emitted by printk kthread either because
> >  CPUs are stopped or the kthread is not allowed to get the lock
> 
> Correct.
> 
> > I see only one positive of the explicit flush. The consoles would
> > not delay crash_exec() and the crash dump might be closer to
> > the point where panic() was called.
> 
> It's only about getting the critical messages into the ringbuffer before
> flushing. And since various things can go wrong during the many actions
> within panic(), it makes sense to flush in between those actions.
> 
> > Otherwise I see only negatives => IMHO, we want to flush atomic
> > consoles synchronously from printk() in panic().
> >
> > Does anyone really want explicit flushes in panic()?
> 
> So far you are the only one speaking against it. I expect as time goes
> on it will get even more complex as it becomes tunable (also something
> we talked about during the demo).

Flushing consoles in the panic kexec case does not sound good, but I do
not have a deep understanding of the atomic printk series. I have added
the kexec list and reviewers in cc.
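A toy model of the ordering John describes may help here. A message is stored completely in an in-memory ring buffer first and is only flushed to a console afterwards, so it stays available to kdump/crash tools even if the flush never completes. This is a minimal C sketch under that assumption only: `struct toy_ringbuf`, `rb_store()` and `rb_flush()` are hypothetical names and do not reflect the real printk ringbuffer or nbcon code.

```c
#include <stdio.h>

#define RB_SLOTS   8
#define RB_MSG_LEN 64

/* Hypothetical toy ring buffer; not the kernel's printk ringbuffer. */
struct toy_ringbuf {
	char msg[RB_SLOTS][RB_MSG_LEN];
	unsigned long next_seq;    /* sequence number of the next stored message */
	unsigned long flushed_seq; /* first sequence number not yet printed */
};

/* Store a message; when the buffer wraps, the oldest slot is overwritten. */
static void rb_store(struct toy_ringbuf *rb, const char *text)
{
	snprintf(rb->msg[rb->next_seq % RB_SLOTS], RB_MSG_LEN, "%s", text);
	rb->next_seq++;

	/* Unprinted messages that were overwritten are skipped (lost). */
	if (rb->next_seq - rb->flushed_seq > RB_SLOTS)
		rb->flushed_seq = rb->next_seq - RB_SLOTS;
}

/* Print all stored but not yet printed messages; return how many. */
static unsigned long rb_flush(struct toy_ringbuf *rb, FILE *console)
{
	unsigned long printed = 0;

	while (rb->flushed_seq < rb->next_seq) {
		fprintf(console, "%s\n", rb->msg[rb->flushed_seq % RB_SLOTS]);
		rb->flushed_seq++;
		printed++;
	}
	return printed;
}
```

In this model a panic path would call `rb_store()` for every backtrace line and `rb_flush()` once afterwards; flushing a large backlog before storing is what risks the backtrace never reaching memory at all.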

> 
> John
> 

Thanks
Dave


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH 1/1] kexec: provide a memfd_create() wrapper if not present in libc

2023-09-27 Thread Dave Young
On Sun, 24 Sept 2023 at 00:47, Julien Olivain  wrote:
>
> Commit 714fa115 "kexec/arm64: Simplify the code for zImage" introduced
> a use of the memfd_create() system call, included in version
> kexec-tools v2.0.27.
>
> This system call was introduced in kernel commit [1], first included
> in kernel v3.17 (released on 2014-10-05).
>
> The memfd_create() glibc wrapper function was added much later in
> commit [2], first included in glibc version 2.27 (released on
> 2018-02-01).
>
> This direct use of memfd_create() introduced a requirement on
> kernel >= 3.17 and glibc >= 2.27.
>
> There are old toolchains, like [3] for example (which ships gcc 7.3.1,
> glibc 2.25 and includes kernel v4.10 headers), that can still be used
> to build newer kernels. Even if such toolchains can be seen as
> outdated, they are still claimed as supported by recent kernels.
> For example, Kernel v6.5.5 has a requirement on gcc version 5.1 and
> greater. See [4].
>
> Moreover, kexec-tools <= 2.0.26 could be compiled using recent
> toolchains with alternative libc (e.g. uclibc-ng, musl) which are not
> providing the memfd_create() wrapper.
>
> When compiling kexec-tools v2.0.27 with a toolchain not providing the
> memfd_create() syscall wrapper, the compilation fails with the message:
>
> kexec/kexec.c: In function 'copybuf_memfd':
> kexec/kexec.c:645:7: warning: implicit declaration of function 'memfd_create'; did you mean 'SYS_memfd_create'? [-Wimplicit-function-declaration]
>   fd = memfd_create("kernel", MFD_ALLOW_SEALING);
>        ^~~~
>        SYS_memfd_create
> kexec/kexec.c:645:30: error: 'MFD_ALLOW_SEALING' undeclared (first use in this function); did you mean '_PC_ALLOC_SIZE_MIN'?
>   fd = memfd_create("kernel", MFD_ALLOW_SEALING);
>                               ^
>                               _PC_ALLOC_SIZE_MIN
>
> In order to let kexec-tools compile in a wider range of configurations,
> this commit adds a memfd_create() function check in autoconf configure
> script, and adds a system call wrapper which will be used if the
> function is not available. With this commit, the environment
> requirement is relaxed to only kernel >= v3.17.
>
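One plausible shape of such a fallback, as a sketch only (the actual patch may differ). `HAVE_MEMFD_CREATE` is assumed to be what an autoconf check like `AC_CHECK_FUNCS([memfd_create])` defines, and the `MFD_ALLOW_SEALING` value is taken from the kernel UAPI header <linux/memfd.h>.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <unistd.h>

#ifndef MFD_ALLOW_SEALING
#define MFD_ALLOW_SEALING 0x0002U /* value from <linux/memfd.h> */
#endif

/*
 * HAVE_MEMFD_CREATE is assumed to come from an autoconf check such as
 * AC_CHECK_FUNCS([memfd_create]); it is left undefined when libc lacks
 * the wrapper.
 */
#ifndef HAVE_MEMFD_CREATE
/* Fallback: invoke the system call directly (requires kernel >= 3.17). */
static int memfd_create(const char *name, unsigned int flags)
{
	return (int)syscall(SYS_memfd_create, name, flags);
}
#endif
```

If configure finds the libc wrapper, the fallback is compiled out and the libc declaration from <sys/mman.h> is used instead, so the only remaining requirement is the kernel system call itself.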
> Note: this issue was found in kexec-tools integration in Buildroot [5]
> using the command "utils/test-pkg -a -p kexec", which tests many
> toolchain/arch combinations.

I guess the test was done on a non-x86 arch. When I tried to build with
old versions, I got another failure due to the missing getrandom(). I
only did a quick build test with the getrandom code commented out, and
the build passed with your patch.

Thanks
Dave




Re: [PATCH 0/2] Sign the Image which is zboot's payload

2023-09-21 Thread Dave Young
Hi Jan,

On Fri, 22 Sept 2023 at 13:19, Jan Hendrik Farr  wrote:
>
> Hi Pingfan!
>
> On 21 21:37:01, Pingfan Liu wrote:
> > From: Pingfan Liu 
> >
>
> > For security boot, the vmlinuz.efi will be signed so UEFI boot loader
> > can check against it. But at present, there is no signature for kexec
> > file load, this series makes a signature on the zboot's payload -- Image
> > before it is compressed. As a result, the kexec-tools parses and
> > decompresses the Image.gz to get the Image, which has signature and can
> > be checked against during kexec file load
>
> I missed some of the earlier discussion about this zboot kexec support.
> So just let me know if I'm missing something here. You were exploring
> these two options in getting this supported:
>
> 1. Making kexec_file_load do all the work.
>
> This option makes the signature verification easy. kexec_file_load
> checks the signature on the pe file and then extracts it and does the
> kexec.
>
> This is similar to how I'm approaching UKI support in [1].
>
> 2. Extract in userspace and pass decompressed kernel to kexec_file_load
>
> This options requires the decompressed kernel to have a valid signature on
> it. That's why this patch adds the ability to add that signature to the
> kernel contained inside the zboot image.
>
> This option would not make sense for UKI support as it would not
> validate the signature with respect to the initrd and cmdline that it
> contains.

Another possibility for the cmdline could be the bootconfig facility,
which was introduced for boot-time tracing (see
Documentation/admin-guide/bootconfig.rst). Since the bootconfig data is
appended to the initrd, the initrd and the cmdline could then be signed
together. Has this been discussed before for UKI?
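Based on our reading of Documentation/admin-guide/bootconfig.rst, the bootconfig text is appended to the end of the initrd and followed by a small footer: a little-endian 32-bit size, a little-endian 32-bit checksum (a plain byte sum), and the 12-byte magic "#BOOTCONFIG\n". This is why signing the initrd would also cover a command line placed there. The sketch below only builds that footer; `bootconfig_footer()` is a hypothetical helper, the 4-byte alignment padding described in the document is omitted, and in practice the in-tree tools/bootconfig utility does this job.

```c
#include <stdint.h>
#include <string.h>

#define BOOTCONFIG_MAGIC      "#BOOTCONFIG\n"
#define BOOTCONFIG_MAGIC_LEN  12
#define BOOTCONFIG_FOOTER_LEN (4 + 4 + BOOTCONFIG_MAGIC_LEN)

/* Store a 32-bit value in little-endian byte order. */
static void put_le32(uint8_t *p, uint32_t v)
{
	p[0] = v & 0xff;
	p[1] = (v >> 8) & 0xff;
	p[2] = (v >> 16) & 0xff;
	p[3] = (v >> 24) & 0xff;
}

/*
 * Build the footer that follows bootconfig data appended to an initrd:
 * [size (le32)][checksum (le32)]["#BOOTCONFIG\n"].
 * The checksum is a 32-bit sum of the data bytes; padding is omitted.
 */
static void bootconfig_footer(const uint8_t *data, uint32_t size,
			      uint8_t footer[BOOTCONFIG_FOOTER_LEN])
{
	uint32_t sum = 0;

	for (uint32_t i = 0; i < size; i++)
		sum += data[i];

	put_le32(footer, size);
	put_le32(footer + 4, sum);
	memcpy(footer + 8, BOOTCONFIG_MAGIC, BOOTCONFIG_MAGIC_LEN);
}
```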

Thanks
Dave



Re: [PATCH] kexec: change locking mechanism to a mutex

2023-09-21 Thread Dave Young
[Cced Valentin Schneider as he added the trylocks]

On Fri, 22 Sept 2023 at 06:04, Eric DeVolder  wrote:
>
> Scaled up testing has revealed that the kexec_trylock()
> implementation leads to failures within the crash hotplug
> infrastructure due to the inability to acquire the lock,
> specifically the message:
>
>  crash hp: kexec_trylock() failed, elfcorehdr may be inaccurate
>
> When hotplug events occur, the crash hotplug infrastructure first
> attempts to obtain the lock via the kexec_trylock(). However, the
> implementation either acquires the lock, or fails and returns; there
> is no waiting on the lock. Here is the comment/explanation from
> kernel/kexec_internal.h:kexec_trylock():
>
>  * Whatever is used to serialize accesses to the kexec_crash_image needs to be
>  * NMI safe, as __crash_kexec() can happen during nmi_panic(), so here we use a
>  * "simple" atomic variable that is acquired with a cmpxchg().
>
> While this can in theory happen for either CPU or memory hotplug,
> the problem is most likely to occur with memory hotplug.
>
> When memory is hot plugged, the memory is converted into smaller
> 128MiB memblocks (typically). As each memblock is processed, a
> kernel thread and a udev event thread are created. The udev thread
> tries for the lock via the reading of the sysfs node
> /sys/devices/system/memory/crash_hotplug node, and the kernel
> worker thread tries for the lock upon entering the crash hotplug
> infrastructure.
>
> These threads then compete for the kexec lock.
>
> For example, a 1GiB DIMM is converted into 8 memblocks, each
> spawning two threads for a total of 16 threads that create a small
> "swarm" all trying to acquire the lock. The larger the DIMM, the
> more the memblocks and the larger the swarm.
>
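The size of the swarm is simple arithmetic: with one kernel worker and one udev thread per memblock, contenders = 2 × (DIMM size / memblock size). A tiny sketch, assuming the typical 128 MiB memblock size quoted above (the real value is architecture dependent); `kexec_lock_contenders()` is a hypothetical helper.

```c
#include <stdint.h>

/*
 * Hypothetical helper: number of threads competing for the kexec lock when
 * one DIMM is hot-plugged, assuming one kernel worker and one udev thread
 * per memblock as described above.
 */
static uint64_t kexec_lock_contenders(uint64_t dimm_bytes,
				      uint64_t memblock_bytes)
{
	uint64_t memblocks = dimm_bytes / memblock_bytes;

	return 2 * memblocks;
}
```

For a 1 GiB DIMM this gives 2 × 8 = 16, as in the example; a 4 GiB DIMM already gives 64.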
> At the root of the problem is the atomic lock behind kexec_trylock();
> it works well for low lock traffic; ie loading/unloading a capture
> kernel, things that happen basically once. But with the introduction
> of crash hotplug, the traffic through the lock increases significantly,
> and more importantly in bursts occurring at roughly the same time. Thus
> there is a need to wait on the lock.
>
> A possible workaround is to simply retry the lock, say up to N times.
> There is, of course, the problem of determining a value of N that works for
> all implementations, and for all the other call sites of kexec_trylock().
> Not ideal.
>
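A user-space C11 sketch of the two behaviours discussed so far: a lock built on "a simple atomic variable that is acquired with a cmpxchg()" (per the kexec_internal.h comment), which either takes the lock or returns immediately and never waits, and the bounded-retry workaround whose N has no universally right value. All names are hypothetical; this models the semantics only, not the kernel implementation.

```c
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>

/* 0 = unlocked, 1 = locked; a user-space analogue of the kexec atomic lock. */
static atomic_int toy_kexec_lock;

/* Take the lock if it is free; never waits, so it is safe to call anywhere. */
static bool toy_kexec_trylock(void)
{
	int expected = 0;

	return atomic_compare_exchange_strong_explicit(&toy_kexec_lock,
						       &expected, 1,
						       memory_order_acquire,
						       memory_order_relaxed);
}

static void toy_kexec_unlock(void)
{
	atomic_store_explicit(&toy_kexec_lock, 0, memory_order_release);
}

/* The "retry up to N times" workaround: still fails if the lock stays busy. */
static bool toy_kexec_trylock_retry(int max_tries)
{
	for (int i = 0; i < max_tries; i++) {
		if (toy_kexec_trylock())
			return true;
		sched_yield(); /* give the current holder a chance to finish */
	}
	return false;
}
```

The retry loop only shifts the problem: with a large enough swarm, any fixed `max_tries` can still be exhausted while other threads hold the lock.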
> The design decision to use the atomic lock is described in the comment
> from kexec_internal.h, cited above. However, examining the code of
> __crash_kexec():
>
> if (kexec_trylock()) {
> if (kexec_crash_image) {
> ...
> }
> kexec_unlock();
> }
>
> reveals that the use of kexec_trylock() here is actually a "best effort"
> due to the atomic lock.  This atomic lock, prior to crash hotplug,
> would almost always be assured (another kexec syscall could hold the lock
> and prevent this, but that is about it).
>
> So at the point where the capture kernel would be invoked, if the lock
> is not obtained, then kdump doesn't occur.
>
> It is possible to instead use a mutex with proper waiting, and utilize
> mutex_trylock() as the "best effort" in __crash_kexec(). The use of a
> mutex then avoids all the lock acquisition problems that were revealed
> by the crash hotplug activity.
>
> Convert the atomic lock to a mutex.
>
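A user-space sketch of the proposed split, using POSIX mutexes as a stand-in for the kernel mutex: hotplug-style callers block until the lock is free, while the crash path keeps a non-blocking, best-effort trylock. The names are hypothetical. This model deliberately ignores execution-context rules; whether a mutex trylock is acceptable on the NMI path of __crash_kexec() is exactly the concern the original atomic lock addressed (see the kexec_internal.h comment quoted above).

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical analogue of the proposed mutex-based kexec lock. */
static pthread_mutex_t toy_kexec_mutex = PTHREAD_MUTEX_INITIALIZER;
static bool toy_crash_image_loaded = true;

/* Hotplug-style caller: waits for the lock instead of giving up. */
static void toy_handle_hotplug_event(int *updates)
{
	pthread_mutex_lock(&toy_kexec_mutex);
	if (toy_crash_image_loaded)
		(*updates)++; /* stands in for rebuilding the elfcorehdr */
	pthread_mutex_unlock(&toy_kexec_mutex);
}

/* Crash-style caller: best effort only, never blocks. */
static bool toy_crash_kexec(void)
{
	if (pthread_mutex_trylock(&toy_kexec_mutex) != 0)
		return false; /* lock busy: no kdump in this model */

	bool would_jump = toy_crash_image_loaded;

	pthread_mutex_unlock(&toy_kexec_mutex);
	return would_jump;
}
```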
> Signed-off-by: Eric DeVolder 
> ---
>  kernel/crash_core.c | 10 ++
>  kernel/kexec.c  |  3 +--
>  kernel/kexec_core.c | 13 +
>  kernel/kexec_file.c |  3 +--
>  kernel/kexec_internal.h | 12 +++-
>  5 files changed, 12 insertions(+), 29 deletions(-)
>
> diff --git a/kernel/crash_core.c b/kernel/crash_core.c
> index 03a7932cde0a..9a8378fbdafa 100644
> --- a/kernel/crash_core.c
> +++ b/kernel/crash_core.c
> @@ -749,10 +749,7 @@ int crash_check_update_elfcorehdr(void)
> int rc = 0;
>
> /* Obtain lock while reading crash information */
> -   if (!kexec_trylock()) {
> -   pr_info("kexec_trylock() failed, elfcorehdr may be inaccurate\n");
> -   return 0;
> -   }
> +   kexec_lock();
> if (kexec_crash_image) {
> if (kexec_crash_image->file_mode)
> rc = 1;
> @@ -784,10 +781,7 @@ static void crash_handle_hotplug_event(unsigned int hp_action, unsigned int cpu)
> struct kimage *image;
>
> /* Obtain lock while changing crash information */
> -   if (!kexec_trylock()) {
> -   pr_info("kexec_trylock() failed, elfcorehdr may be inaccurate\n");
> -   return;
> -   }
> +   kexec_lock();
>
> /* Check kdump is not loaded */
> if (!kexec_crash_image)
> diff --git a/kernel/kexec.c b/kernel/kexec.c
> index 107f355eac10..a2f687900bb5 100644
> --- a/kernel/kexec.c
> +++ b/kernel/kexec.c
> @@ -96,8 +96,7 @@ static int do_kexec_load(unsigned long entry, 

Re: [PATCH v2 0/2] x86/kexec: UKI Support

2023-09-20 Thread Dave Young
On Wed, 20 Sept 2023 at 20:18, Dave Young  wrote:
>
> On Wed, 20 Sept 2023 at 20:07, Dave Young  wrote:
> >
> > On Wed, 20 Sept 2023 at 18:50, Ard Biesheuvel  wrote:
> > >
> > > On Wed, 20 Sept 2023 at 08:40, Dave Young  wrote:
> > > >
> > > > On Wed, 20 Sept 2023 at 15:43, Dave Young  wrote:
> > > > >
> > > > > > > In the end the only benefit this series brings is to extend the
> > > > > > > signature checking on the whole UKI except of just the kernel 
> > > > > > > image.
> > > > > > > Everything else can also be done in user space. Compared to the
> > > > > > > problems described above this is a very small gain for me.
> > > > > >
> > > > > > Correct. That is the benefit of pulling the UKI apart in the
> > > > > > kernel. However having to sign the kernel inside the UKI defeats
> > > > > > the whole point.
> > > > >
> > > > >
> > > > > Pingfan added the zboot load support in kexec-tools, I know that he is
> > > > > trying to sign the zboot image and the inside kernel twice. So
> > > > > probably there are some common areas which can be discussed.
> > > > > Added Ard and Pingfan in cc.
> > > > > http://lists.infradead.org/pipermail/kexec/2023-August/027674.html
> > > > >
> > > >
> > > > Here is another thread of the initial try in kernel with a few more
> > > > options eg. some fake efi service helpers.
> > > > https://lore.kernel.org/linux-arm-kernel/zbvksis+dfnqa...@piliu.users.ipa.redhat.com/T/#m42abb0ad3c10126b8b3bfae8a596deb707d6f76e
> > > >
> > >
> >
> > Ard, thanks for the comments.
> >
> > > Currently, UKI's external interface is defined in terms of EFI
> > > services, i.e., it is an executable PE/COFF binary that encapsulates
> > > all the logic that performs the unpacking of the individual sections,
> > > and loads the kernel as a PE/COFF binary as well (i.e., via
> > > LoadImage/StartImage)
> > >
> > > As soon as we add support to Linux to unpack a UKI and boot the
> > > encapsulated kernel using a boot protocol other than EFI, we are
> > > painting ourselves into a corner, severely limiting the freedom of the
> > > UKI effort to make changes to the interfaces that were implementation
> > > details up to this point.
> >
> > Agreed, it seems UKI is more flexible and complex than the zboot,
> > we do need to carefully think about a better solution.
> >
> > >
> > > It also means that UKI handling in kexec will need to be taught about
> > > every individual architecture again, which is something we are trying
> > > to avoid with EFI support in general. Breaking the abstraction like
> > > this lets the cat out of the bag, and will add yet another variation
> > > of kexec that we will need to support and maintain forever.
> > >
> > > So the only way to do this properly and portably is to implement the
> > > minimal set of EFI boot services [0] that Linux actually needs to run
> > > its EFI stub (which is mostly identical to the set that UKI relies on
> > > afaict), and expose them to the kexec image as it is being loaded.
> > > This is not as bad as it sounds - I have some Rust code that could be
> > > used as an inspiration [1] and which could be reused and shared
> > > between architectures.
> >
> > Great!
> >
> > >
> > > This would also reduce/remove the need for a purgatory: loading a EFI
> > > binary in this way would run it up to the point were it calls
> > > ExitBootServices(), and the actual kexec would invoke the image as if
> > > it was returning from ExitBootServices().
> > >
> > > The only fundamental problem here is the need to allocate large chunks
> > > of physical memory, which would need some kind of CMA support, I
> > > imagine?
> >
> > Hmm, I thought that your idea is to write the efi stub code in "purgatory"
> > so kexec can jump to it while rebooting then it will be able to access the
> > whole usable memory, but it seems you want an efi app run under linux
> > and somehow provide services to kexec?  My EFI knowledge is incomplete
> > and outdated,  If my understanding of your proposal is true how can it keep
> > running after switching to the new kernel stub?
>
> Oops,  please ignore the quick reply and questioins, I apparen

Re: [PATCH v2 0/2] x86/kexec: UKI Support

2023-09-20 Thread Dave Young
On Wed, 20 Sept 2023 at 20:07, Dave Young  wrote:
>
> On Wed, 20 Sept 2023 at 18:50, Ard Biesheuvel  wrote:
> >
> > On Wed, 20 Sept 2023 at 08:40, Dave Young  wrote:
> > >
> > > On Wed, 20 Sept 2023 at 15:43, Dave Young  wrote:
> > > >
> > > > > > In the end the only benefit this series brings is to extend the
> > > > > > signature checking on the whole UKI except of just the kernel image.
> > > > > > Everything else can also be done in user space. Compared to the
> > > > > > problems described above this is a very small gain for me.
> > > > >
> > > > > Correct. That is the benefit of pulling the UKI apart in the
> > > > > kernel. However having to sign the kernel inside the UKI defeats
> > > > > the whole point.
> > > >
> > > >
> > > > Pingfan added the zboot load support in kexec-tools, I know that he is
> > > > trying to sign the zboot image and the inside kernel twice. So
> > > > probably there are some common areas which can be discussed.
> > > > Added Ard and Pingfan in cc.
> > > > http://lists.infradead.org/pipermail/kexec/2023-August/027674.html
> > > >
> > >
> > > Here is another thread of the initial try in kernel with a few more
> > > options eg. some fake efi service helpers.
> > > https://lore.kernel.org/linux-arm-kernel/zbvksis+dfnqa...@piliu.users.ipa.redhat.com/T/#m42abb0ad3c10126b8b3bfae8a596deb707d6f76e
> > >
> >
>
> Ard, thanks for the comments.
>
> > Currently, UKI's external interface is defined in terms of EFI
> > services, i.e., it is an executable PE/COFF binary that encapsulates
> > all the logic that performs the unpacking of the individual sections,
> > and loads the kernel as a PE/COFF binary as well (i.e., via
> > LoadImage/StartImage)
> >
> > As soon as we add support to Linux to unpack a UKI and boot the
> > encapsulated kernel using a boot protocol other than EFI, we are
> > painting ourselves into a corner, severely limiting the freedom of the
> > UKI effort to make changes to the interfaces that were implementation
> > details up to this point.
>
> Agreed, it seems UKI is more flexible and complex than the zboot,
> we do need to carefully think about a better solution.
>
> >
> > It also means that UKI handling in kexec will need to be taught about
> > every individual architecture again, which is something we are trying
> > to avoid with EFI support in general. Breaking the abstraction like
> > this lets the cat out of the bag, and will add yet another variation
> > of kexec that we will need to support and maintain forever.
> >
> > So the only way to do this properly and portably is to implement the
> > minimal set of EFI boot services [0] that Linux actually needs to run
> > its EFI stub (which is mostly identical to the set that UKI relies on
> > afaict), and expose them to the kexec image as it is being loaded.
> > This is not as bad as it sounds - I have some Rust code that could be
> > used as an inspiration [1] and which could be reused and shared
> > between architectures.
>
> Great!
>
> >
> > This would also reduce/remove the need for a purgatory: loading a EFI
> > binary in this way would run it up to the point were it calls
> > ExitBootServices(), and the actual kexec would invoke the image as if
> > it was returning from ExitBootServices().
> >
> > The only fundamental problem here is the need to allocate large chunks
> > of physical memory, which would need some kind of CMA support, I
> > imagine?
>
> Hmm, I thought that your idea is to write the efi stub code in "purgatory"
> so kexec can jump to it while rebooting then it will be able to access the
> whole usable memory, but it seems you want an efi app run under linux
> and somehow provide services to kexec?  My EFI knowledge is incomplete
> and outdated,  If my understanding of your proposal is true how can it keep
> running after switching to the new kernel stub?

Oops, please ignore my quick reply and questions; I apparently forgot
that this is the kexec loading phase rather than the rebooting phase.
Yes, as you said, CMA might be the only choice for that proposal.

>
> >
> > Maybe we should do a BoF at LPC to discuss this further?
>
> It does deserve more discussion, unfortunately I will not be able to join LPC,
> Philipp Rudo (cced) planned attend the conf, so I think you guys can
> discuss together with
> other people interested. I think I will watch the recordings or
> joining virtually if possible.
>
> >
> > [0] this is not as bad as it sounds: beyond a protocol database, a
> > heap allocator and a memory map, there is actually very little needed
> > to boot Linux via the EFI stub (although UKI needs
> > LoadImage/StartImage as well)
> >
> > [1] https://github.com/ardbiesheuvel/efilite
> >
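To make footnote [0] a bit more concrete, here is a toy C sketch of its first ingredient: a protocol database that maps GUIDs to interface pointers, in the spirit of the UEFI InstallProtocolInterface()/LocateProtocol() boot services. It is not taken from efilite, and the names and signatures are hypothetical simplifications of the UEFI ones.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same field layout as EFI_GUID: 32-bit, two 16-bit fields, then 8 bytes. */
typedef struct {
	uint32_t data1;
	uint16_t data2;
	uint16_t data3;
	uint8_t data4[8];
} toy_guid_t;

#define TOY_MAX_PROTOCOLS 16

/* Hypothetical minimal protocol database: GUID -> interface pointer. */
static struct {
	toy_guid_t guid;
	void *interface;
} toy_protocols[TOY_MAX_PROTOCOLS];
static size_t toy_nr_protocols;

/* Register @interface under @guid; returns -1 when the table is full. */
static int toy_install_protocol(const toy_guid_t *guid, void *interface)
{
	if (toy_nr_protocols == TOY_MAX_PROTOCOLS)
		return -1;
	toy_protocols[toy_nr_protocols].guid = *guid;
	toy_protocols[toy_nr_protocols].interface = interface;
	toy_nr_protocols++;
	return 0;
}

/* Return the first interface registered for @guid, or NULL. */
static void *toy_locate_protocol(const toy_guid_t *guid)
{
	for (size_t i = 0; i < toy_nr_protocols; i++)
		if (memcmp(&toy_protocols[i].guid, guid, sizeof(*guid)) == 0)
			return toy_protocols[i].interface;
	return NULL;
}
```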




Re: [PATCH v2 0/2] x86/kexec: UKI Support

2023-09-20 Thread Dave Young
On Wed, 20 Sept 2023 at 18:50, Ard Biesheuvel  wrote:
>
> On Wed, 20 Sept 2023 at 08:40, Dave Young  wrote:
> >
> > On Wed, 20 Sept 2023 at 15:43, Dave Young  wrote:
> > >
> > > > > In the end the only benefit this series brings is to extend the
> > > > > signature checking on the whole UKI except of just the kernel image.
> > > > > Everything else can also be done in user space. Compared to the
> > > > > problems described above this is a very small gain for me.
> > > >
> > > > Correct. That is the benefit of pulling the UKI apart in the
> > > > kernel. However having to sign the kernel inside the UKI defeats
> > > > the whole point.
> > >
> > >
> > > Pingfan added the zboot load support in kexec-tools, I know that he is
> > > trying to sign the zboot image and the inside kernel twice. So
> > > probably there are some common areas which can be discussed.
> > > Added Ard and Pingfan in cc.
> > > http://lists.infradead.org/pipermail/kexec/2023-August/027674.html
> > >
> >
> > Here is another thread of the initial try in kernel with a few more
> > options eg. some fake efi service helpers.
> > https://lore.kernel.org/linux-arm-kernel/zbvksis+dfnqa...@piliu.users.ipa.redhat.com/T/#m42abb0ad3c10126b8b3bfae8a596deb707d6f76e
> >
>

Ard, thanks for the comments.

> Currently, UKI's external interface is defined in terms of EFI
> services, i.e., it is an executable PE/COFF binary that encapsulates
> all the logic that performs the unpacking of the individual sections,
> and loads the kernel as a PE/COFF binary as well (i.e., via
> LoadImage/StartImage)
>
> As soon as we add support to Linux to unpack a UKI and boot the
> encapsulated kernel using a boot protocol other than EFI, we are
> painting ourselves into a corner, severely limiting the freedom of the
> UKI effort to make changes to the interfaces that were implementation
> details up to this point.

Agreed. UKI seems more flexible and complex than zboot, so we do need
to think carefully about a better solution.

>
> It also means that UKI handling in kexec will need to be taught about
> every individual architecture again, which is something we are trying
> to avoid with EFI support in general. Breaking the abstraction like
> this lets the cat out of the bag, and will add yet another variation
> of kexec that we will need to support and maintain forever.
>
> So the only way to do this properly and portably is to implement the
> minimal set of EFI boot services [0] that Linux actually needs to run
> its EFI stub (which is mostly identical to the set that UKI relies on
> afaict), and expose them to the kexec image as it is being loaded.
> This is not as bad as it sounds - I have some Rust code that could be
> used as an inspiration [1] and which could be reused and shared
> between architectures.

Great!

>
> This would also reduce/remove the need for a purgatory: loading a EFI
> binary in this way would run it up to the point were it calls
> ExitBootServices(), and the actual kexec would invoke the image as if
> it was returning from ExitBootServices().
>
> The only fundamental problem here is the need to allocate large chunks
> of physical memory, which would need some kind of CMA support, I
> imagine?

Hmm, I thought your idea was to put the EFI stub code in "purgatory", so
that kexec can jump to it while rebooting and it can then access the
whole usable memory. But it seems you want an EFI app to run under Linux
and somehow provide services to kexec? My EFI knowledge is incomplete
and outdated. If my understanding of your proposal is right, how can it
keep running after switching to the new kernel stub?

>
> Maybe we should do a BoF at LPC to discuss this further?

It does deserve more discussion. Unfortunately I will not be able to
join LPC, but Philipp Rudo (cced) plans to attend the conference, so you
can discuss it there together with other interested people. I will
watch the recordings, or join virtually if possible.

>
> [0] this is not as bad as it sounds: beyond a protocol database, a
> heap allocator and a memory map, there is actually very little needed
> to boot Linux via the EFI stub (although UKI needs
> LoadImage/StartImage as well)
>
> [1] https://github.com/ardbiesheuvel/efilite
>




Re: [PATCH v2 0/2] x86/kexec: UKI Support

2023-09-20 Thread Dave Young
On Wed, 20 Sept 2023 at 15:43, Dave Young  wrote:
>
> > > In the end the only benefit this series brings is to extend the
> > > signature checking on the whole UKI except of just the kernel image.
> > > Everything else can also be done in user space. Compared to the
> > > problems described above this is a very small gain for me.
> >
> > Correct. That is the benefit of pulling the UKI apart in the
> > kernel. However having to sign the kernel inside the UKI defeats
> > the whole point.
>
>
> Pingfan added the zboot load support in kexec-tools, I know that he is
> trying to sign the zboot image and the inside kernel twice. So
> probably there are some common areas which can be discussed.
> Added Ard and Pingfan in cc.
> http://lists.infradead.org/pipermail/kexec/2023-August/027674.html
>

Here is another thread about the initial in-kernel attempt, with a few
more options, e.g. some fake EFI service helpers:
https://lore.kernel.org/linux-arm-kernel/zbvksis+dfnqa...@piliu.users.ipa.redhat.com/T/#m42abb0ad3c10126b8b3bfae8a596deb707d6f76e

>
> Thanks
> Dave




Re: [PATCH v2 0/2] x86/kexec: UKI Support

2023-09-20 Thread Dave Young
> > In the end the only benefit this series brings is to extend the
> > signature checking on the whole UKI except of just the kernel image.
> > Everything else can also be done in user space. Compared to the
> > problems described above this is a very small gain for me.
>
> Correct. That is the benefit of pulling the UKI apart in the
> kernel. However having to sign the kernel inside the UKI defeats
> the whole point.


Pingfan added zboot load support to kexec-tools, and I know that he is
trying to sign both the zboot image and the kernel inside it. So there
are probably some common areas that can be discussed.
Added Ard and Pingfan in cc.
http://lists.infradead.org/pipermail/kexec/2023-August/027674.html


Thanks
Dave




[PATCH v1 2/2] zboot: add loongarch kexec_load support

2023-09-14 Thread Dave Young
Copy the arm64 code and adapt it for loongarch so that kexec -c can load
a zboot image.

Note: the zboot image must be probed first, otherwise the pei-loongarch
file type will be used.
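Why the order matters, as the note implies: the file types are tried in table order and the first probe that accepts the image wins, and a zboot image is also a valid PE ("MZ") image. A simplified C sketch of that first-match selection; all names are hypothetical, and the "zimg" tag at offset 4 is an assumed stand-in for the real zboot header checks.

```c
#include <stddef.h>
#include <string.h>

/* A probe returns 0 if it recognizes the image and -1 otherwise. */
typedef int (*toy_probe_fn)(const char *buf, size_t len);

struct toy_file_type {
	const char *name;
	toy_probe_fn probe;
};

/* Simplified generic PE probe: any image starting with "MZ" matches. */
static int toy_probe_pei(const char *buf, size_t len)
{
	return (len >= 2 && buf[0] == 'M' && buf[1] == 'Z') ? 0 : -1;
}

/* Simplified zboot probe: "MZ" plus an assumed "zimg" tag at offset 4. */
static int toy_probe_pez(const char *buf, size_t len)
{
	if (toy_probe_pei(buf, len) != 0 || len < 8)
		return -1;
	return memcmp(buf + 4, "zimg", 4) == 0 ? 0 : -1;
}

/* Return the first table entry whose probe accepts the image, or NULL. */
static const struct toy_file_type *
toy_pick_file_type(const struct toy_file_type *types, size_t n,
		   const char *buf, size_t len)
{
	for (size_t i = 0; i < n; i++)
		if (types[i].probe(buf, len) == 0)
			return &types[i];
	return NULL;
}
```

With the pez entry listed before the pei entry, a zboot image selects pez while a plain PE image still falls through to pei; with the order reversed, the zboot image would be claimed by pei.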

Signed-off-by: Dave Young 
---
 kexec/arch/loongarch/Makefile  |  1 +
 kexec/arch/loongarch/image-header.h|  1 +
 kexec/arch/loongarch/kexec-loongarch.c |  1 +
 kexec/arch/loongarch/kexec-loongarch.h |  4 +
 kexec/arch/loongarch/kexec-pez-loongarch.c | 90 ++
 5 files changed, 97 insertions(+)
 create mode 100644 kexec/arch/loongarch/kexec-pez-loongarch.c

diff --git a/kexec/arch/loongarch/Makefile b/kexec/arch/loongarch/Makefile
index 3b33b9693287..cee7e569a2a2 100644
--- a/kexec/arch/loongarch/Makefile
+++ b/kexec/arch/loongarch/Makefile
@@ -6,6 +6,7 @@ loongarch_KEXEC_SRCS += kexec/arch/loongarch/kexec-elf-loongarch.c
 loongarch_KEXEC_SRCS += kexec/arch/loongarch/kexec-pei-loongarch.c
 loongarch_KEXEC_SRCS += kexec/arch/loongarch/kexec-elf-rel-loongarch.c
 loongarch_KEXEC_SRCS += kexec/arch/loongarch/crashdump-loongarch.c
+loongarch_KEXEC_SRCS += kexec/arch/loongarch/kexec-pez-loongarch.c
 
 loongarch_MEM_REGIONS = kexec/mem_regions.c
 
diff --git a/kexec/arch/loongarch/image-header.h b/kexec/arch/loongarch/image-header.h
index 3b7576552685..223d81f77d9f 100644
--- a/kexec/arch/loongarch/image-header.h
+++ b/kexec/arch/loongarch/image-header.h
@@ -33,6 +33,7 @@ struct loongarch_image_header {
 };
 
 static const uint8_t loongarch_image_pe_sig[2] = {'M', 'Z'};
+static const uint8_t loongarch_pe_machtype[6] = {'P','E', 0x0, 0x0, 0x64, 0x62};
 
 /**
  * loongarch_header_check_pe_sig - Helper to check the loongarch image header.
diff --git a/kexec/arch/loongarch/kexec-loongarch.c b/kexec/arch/loongarch/kexec-loongarch.c
index f47c99861674..62ff8fd1aeb7 100644
--- a/kexec/arch/loongarch/kexec-loongarch.c
+++ b/kexec/arch/loongarch/kexec-loongarch.c
@@ -165,6 +165,7 @@ int get_memory_ranges(struct memory_range **range, int *ranges,
 
 struct file_type file_type[] = {
{"elf-loongarch", elf_loongarch_probe, elf_loongarch_load, elf_loongarch_usage},
+   {"pez-loongarch", pez_loongarch_probe, pez_loongarch_load, pez_loongarch_usage},
{"pei-loongarch", pei_loongarch_probe, pei_loongarch_load, pei_loongarch_usage},
 };
 int file_types = sizeof(file_type) / sizeof(file_type[0]);
diff --git a/kexec/arch/loongarch/kexec-loongarch.h b/kexec/arch/loongarch/kexec-loongarch.h
index 5120a26fd513..2c7624f2fd3a 100644
--- a/kexec/arch/loongarch/kexec-loongarch.h
+++ b/kexec/arch/loongarch/kexec-loongarch.h
@@ -27,6 +27,10 @@ int pei_loongarch_probe(const char *buf, off_t len);
 int pei_loongarch_load(int argc, char **argv, const char *buf, off_t len,
struct kexec_info *info);
 void pei_loongarch_usage(void);
+int pez_loongarch_probe(const char *kernel_buf, off_t kernel_size);
+int pez_loongarch_load(int argc, char **argv, const char *buf, off_t len,
+  struct kexec_info *info);
+void pez_loongarch_usage(void);
 
 int loongarch_process_image_header(const struct loongarch_image_header *h);
 
diff --git a/kexec/arch/loongarch/kexec-pez-loongarch.c b/kexec/arch/loongarch/kexec-pez-loongarch.c
new file mode 100644
index ..942a47c0eade
--- /dev/null
+++ b/kexec/arch/loongarch/kexec-pez-loongarch.c
@@ -0,0 +1,90 @@
+/*
+ * LoongArch PE compressed Image (vmlinuz, ZBOOT) support.
+ * Based on arm64 code
+ */
+
+#define _GNU_SOURCE
+#include 
+#include 
+#include 
+#include "kexec.h"
+#include "kexec-loongarch.h"
+#include 
+#include "arch/options.h"
+
+static int kernel_fd = -1;
+static off_t decompressed_size;
+
+/* Returns:
+ * -1 : in case of error/invalid format (not a valid PE+compressed ZBOOT format.
+ */
+int pez_loongarch_probe(const char *kernel_buf, off_t kernel_size)
+{
+   int ret = -1;
+   const struct loongarch_image_header *h;
+   char *buf;
+   off_t buf_sz;
+
+   buf = (char *)kernel_buf;
+   buf_sz = kernel_size;
+   if (!buf)
+   return -1;
+   h = (const struct loongarch_image_header *)buf;
+
+   dbgprintf("%s: PROBE.\n", __func__);
+   if (buf_sz < sizeof(struct loongarch_image_header)) {
+   dbgprintf("%s: Not large enough to be a PE image.\n", __func__);
+   return -1;
+   }
+   if (!loongarch_header_check_pe_sig(h)) {
+   dbgprintf("%s: Not an PE image.\n", __func__);
+   return -1;
+   }
+
+   if (buf_sz < sizeof(struct loongarch_image_header) + h->pe_header) {
+   dbgprintf("%s: PE image offset larger than image.\n", __func__);
+   return -1;
+   }
+
+   if (memcmp(&buf[h->pe_header],
+  loongarch_pe_machtype, sizeof(loongarch_pe_machtype))) {
+   dbgprintf("%s: PE header doesn't match machine type.\n", __func__);
+
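The `loongarch_pe_machtype` bytes added to image-header.h are the PE signature "PE\0\0" followed by the little-endian COFF Machine value 0x6264, which the PE/COFF specification assigns to IMAGE_FILE_MACHINE_LOONGARCH64. A standalone sketch of that comparison, which also checks that all six bytes lie inside the buffer; `toy_pe_machine_is()` is hypothetical, and its `pe_off` argument plays the role of `h->pe_header`.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MACHINE_LOONGARCH64 0x6264 /* IMAGE_FILE_MACHINE_LOONGARCH64 */

/*
 * Check that the bytes at @pe_off hold the "PE\0\0" signature followed by
 * the little-endian COFF Machine field equal to @machine.
 */
static bool toy_pe_machine_is(const uint8_t *buf, size_t len, uint32_t pe_off,
			      uint16_t machine)
{
	if (pe_off > len || len - pe_off < 6)
		return false; /* signature + machine field would not fit */

	const uint8_t *p = buf + pe_off;

	if (p[0] != 'P' || p[1] != 'E' || p[2] != 0 || p[3] != 0)
		return false;
	return (uint16_t)(p[4] | (p[5] << 8)) == machine;
}
```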

[PATCH v1 0/2] zboot: enable kexec_load for zboot kernel images

2023-09-14 Thread Dave Young
The current kexec-tools only supports kexec_file_load for zboot kernel
images on arm64.

This series tweaks the code a bit so that kexec_load can also load zboot
images on both arm64 and loongarch.

V1 changes:
- dup the kernel_fd so that kexec_file_load still works, since slurp_fd
will close it.
- code cleanup.

Dave Young (2):
  zboot: enable arm64 kexec_load for zboot image
  zboot: add loongarch kexec_load support

 include/kexec-pe-zboot.h   |  3 +-
 kexec/arch/arm64/kexec-vmlinuz-arm64.c | 26 ++-
 kexec/arch/loongarch/Makefile  |  1 +
 kexec/arch/loongarch/image-header.h|  1 +
 kexec/arch/loongarch/kexec-loongarch.c |  1 +
 kexec/arch/loongarch/kexec-loongarch.h |  4 +
 kexec/arch/loongarch/kexec-pez-loongarch.c | 90 ++
 kexec/kexec-pe-zboot.c |  4 +-
 kexec/kexec.c  |  2 +-
 kexec/kexec.h  |  1 +
 10 files changed, 127 insertions(+), 6 deletions(-)
 create mode 100644 kexec/arch/loongarch/kexec-pez-loongarch.c

-- 
2.37.2




[PATCH v1 1/2] zboot: enable arm64 kexec_load for zboot image

2023-09-14 Thread Dave Young
The kexec_file_load support for zboot kernel images already decompresses
the vmlinuz, so the kexec_load path can simply read the decompressed kernel
fd into a new buffer and use that buffer directly.
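Reading the decompressed kernel back through its fd needs a bounded read loop plus a check that the whole image arrived (the nread != decompressed_size test in the patch). A sketch assuming plain POSIX read(); read_up_to() is a hypothetical helper, not the real slurp_fd().

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: read up to `size` bytes from fd into a new buffer.
 * *nread reports how many bytes were actually read; a short count means
 * the file ended early. Returns NULL on bad size, allocation or read
 * error. The caller owns (and must free) the returned buffer.
 */
static char *read_up_to(int fd, off_t size, off_t *nread)
{
	char *buf;
	off_t done = 0;

	if (size <= 0)
		return NULL;
	buf = malloc((size_t)size);
	if (!buf)
		return NULL;

	while (done < size) {
		ssize_t r = read(fd, buf + done, (size_t)(size - done));

		if (r < 0) {
			if (errno == EINTR)
				continue;	/* interrupted, retry */
			free(buf);
			return NULL;
		}
		if (r == 0)		/* EOF before size bytes */
			break;
		done += r;
	}
	*nread = done;
	return buf;
}
```

A caller would then mirror the patch and treat `!kbuf || nread != decompressed_size` as an error.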

Signed-off-by: Dave Young 
---
 include/kexec-pe-zboot.h   |  3 ++-
 kexec/arch/arm64/kexec-vmlinuz-arm64.c | 26 +++---
 kexec/kexec-pe-zboot.c |  4 +++-
 kexec/kexec.c  |  2 +-
 kexec/kexec.h  |  1 +
 5 files changed, 30 insertions(+), 6 deletions(-)

diff --git a/include/kexec-pe-zboot.h b/include/kexec-pe-zboot.h
index e2e0448a81f2..374916cbe883 100644
--- a/include/kexec-pe-zboot.h
+++ b/include/kexec-pe-zboot.h
@@ -11,5 +11,6 @@ struct linux_pe_zboot_header {
uint32_t compress_type;
 };
 
-int pez_prepare(const char *crude_buf, off_t buf_sz, int *kernel_fd);
+int pez_prepare(const char *crude_buf, off_t buf_sz, int *kernel_fd,
+   off_t *kernel_size);
 #endif
diff --git a/kexec/arch/arm64/kexec-vmlinuz-arm64.c b/kexec/arch/arm64/kexec-vmlinuz-arm64.c
index c0ee47c8f50a..e291a34c97ad 100644
--- a/kexec/arch/arm64/kexec-vmlinuz-arm64.c
+++ b/kexec/arch/arm64/kexec-vmlinuz-arm64.c
@@ -34,6 +34,7 @@
 #include "arch/options.h"
 
 static int kernel_fd = -1;
+static off_t decompressed_size;
 
 /* Returns:
 * -1 : in case of error/invalid format (not a valid PE+compressed ZBOOT format.
@@ -72,7 +73,7 @@ int pez_arm64_probe(const char *kernel_buf, off_t kernel_size)
return -1;
}
 
-   ret = pez_prepare(buf, buf_sz, &kernel_fd);
+   ret = pez_prepare(buf, buf_sz, &kernel_fd, &decompressed_size);
 
if (!ret) {
/* validate the arm64 specific header */
@@ -98,8 +99,27 @@ bad_header:
 int pez_arm64_load(int argc, char **argv, const char *buf, off_t len,
struct kexec_info *info)
 {
-   info->kernel_fd = kernel_fd;
-   return image_arm64_load(argc, argv, buf, len, info);
+   if (kernel_fd > 0 && decompressed_size > 0) {
+   char *kbuf;
+   off_t nread;
+   int fd;
+
+   info->kernel_fd = kernel_fd;
+   fd = dup(kernel_fd);
+   if (fd < 0) {
+   dbgprintf("%s: dup fd failed.\n", __func__);
+   return -1;
+   }
+   kbuf = slurp_fd(fd, NULL, decompressed_size, &nread);
+   if (!kbuf || nread != decompressed_size) {
+   dbgprintf("%s: slurp_fd failed.\n", __func__);
+   return -1;
+   }
+   return image_arm64_load(argc, argv, kbuf, decompressed_size, info);
+   }
+
+   dbgprintf("%s: wrong kernel file descriptor.\n", __func__);
+   return -1;
 }
 
 void pez_arm64_usage(void)
diff --git a/kexec/kexec-pe-zboot.c b/kexec/kexec-pe-zboot.c
index 2f2e052b76c5..3abd17d9fe59 100644
--- a/kexec/kexec-pe-zboot.c
+++ b/kexec/kexec-pe-zboot.c
@@ -37,7 +37,8 @@
  *
  * crude_buf: the content, which is read from the kernel file without any processing
  */
-int pez_prepare(const char *crude_buf, off_t buf_sz, int *kernel_fd)
+int pez_prepare(const char *crude_buf, off_t buf_sz, int *kernel_fd,
+   off_t *kernel_size)
 {
int ret = -1;
int fd = 0;
@@ -110,6 +111,7 @@ int pez_prepare(const char *crude_buf, off_t buf_sz, int *kernel_fd)
goto fail_bad_header;
}
 
+   *kernel_size = decompressed_size;
dbgprintf("%s: done\n", __func__);
 
ret = 0;
diff --git a/kexec/kexec.c b/kexec/kexec.c
index c3b182e254e0..1edbd349c86d 100644
--- a/kexec/kexec.c
+++ b/kexec/kexec.c
@@ -489,7 +489,7 @@ static int add_backup_segments(struct kexec_info *info,
return 0;
 }
 
-static char *slurp_fd(int fd, const char *filename, off_t size, off_t *nread)
+char *slurp_fd(int fd, const char *filename, off_t size, off_t *nread)
 {
char *buf;
off_t progress;
diff --git a/kexec/kexec.h b/kexec/kexec.h
index ed3b499a80f2..093338969c57 100644
--- a/kexec/kexec.h
+++ b/kexec/kexec.h
@@ -267,6 +267,7 @@ extern void die(const char *fmt, ...)
__attribute__ ((format (printf, 1, 2)));
 extern void *xmalloc(size_t size);
 extern void *xrealloc(void *ptr, size_t size);
+extern char *slurp_fd(int fd, const char *filename, off_t size, off_t *nread);
 extern char *slurp_file(const char *filename, off_t *r_size);
 extern char *slurp_file_mmap(const char *filename, off_t *r_size);
 extern char *slurp_file_len(const char *filename, off_t size, off_t *nread);
-- 
2.37.2




Re: [PATCH 1/2] zboot: enable arm64 kexec_load for zboot image

2023-09-13 Thread Dave Young
Hi,

On Wed, 13 Sept 2023 at 08:54, Pingfan Liu  wrote:
>
> On Mon, Sep 11, 2023 at 6:37 PM Dave Young  wrote:
> >
> > The kexec_file_load support for zboot kernel images already decompresses
> > the vmlinuz, so the kexec_load path can simply read the decompressed kernel
> > fd into a new buffer and use that buffer directly.
> >
> > Signed-off-by: Dave Young 
> > ---
> >  include/kexec-pe-zboot.h   |  3 ++-
> >  kexec/arch/arm64/kexec-vmlinuz-arm64.c | 20 ++--
> >  kexec/kexec-pe-zboot.c |  4 +++-
> >  kexec/kexec.c  |  2 +-
> >  kexec/kexec.h  |  1 +
> >  5 files changed, 25 insertions(+), 5 deletions(-)
> >
> > diff --git a/include/kexec-pe-zboot.h b/include/kexec-pe-zboot.h
> > index e2e0448a81f2..374916cbe883 100644
> > --- a/include/kexec-pe-zboot.h
> > +++ b/include/kexec-pe-zboot.h
> > @@ -11,5 +11,6 @@ struct linux_pe_zboot_header {
> > uint32_t compress_type;
> >  };
> >
> > -int pez_prepare(const char *crude_buf, off_t buf_sz, int *kernel_fd);
> > +int pez_prepare(const char *crude_buf, off_t buf_sz, int *kernel_fd,
> > +   off_t *kernel_size);
> >  #endif
> > diff --git a/kexec/arch/arm64/kexec-vmlinuz-arm64.c b/kexec/arch/arm64/kexec-vmlinuz-arm64.c
> > index c0ee47c8f50a..8f378d8fa6d0 100644
> > --- a/kexec/arch/arm64/kexec-vmlinuz-arm64.c
> > +++ b/kexec/arch/arm64/kexec-vmlinuz-arm64.c
> > @@ -34,6 +34,7 @@
> >  #include "arch/options.h"
> >
> >  static int kernel_fd = -1;
> > +static off_t decompressed_size;
> >
> >  /* Returns:
> >   * -1 : in case of error/invalid format (not a valid PE+compressed ZBOOT format.
> > @@ -72,7 +73,7 @@ int pez_arm64_probe(const char *kernel_buf, off_t kernel_size)
> > return -1;
> > }
> >
> > -   ret = pez_prepare(buf, buf_sz, &kernel_fd);
> > +   ret = pez_prepare(buf, buf_sz, &kernel_fd, &decompressed_size);
> >
> > if (!ret) {
> > /* validate the arm64 specific header */
> > @@ -98,8 +99,23 @@ bad_header:
> >  int pez_arm64_load(int argc, char **argv, const char *buf, off_t len,
> > struct kexec_info *info)
> >  {
> > +   char *kbuf;
> > +
> > info->kernel_fd = kernel_fd;
> > -   return image_arm64_load(argc, argv, buf, len, info);
> > +   if (kernel_fd > 0 && decompressed_size > 0) {
> > +   off_t nread;
> > +
> > +   kbuf = slurp_fd(kernel_fd, NULL, decompressed_size, &nread);
> > +   if (!kbuf || nread != decompressed_size) {

In another test today I found that this breaks kexec_file_load, because
slurp_fd() closes kernel_fd after reading it into the buffer. I will send
another version soon and also clean up this function a bit.

Thanks for reviewing.

> > +   dbgprintf("%s: failed.\n", __func__);
> > +   return -1;
> > +   }
> > +   } else {
> > +   dbgprintf("%s: wrong file descriptor.\n", __func__);
> > +   return -1;
> > +   }
> > +
> > +   return image_arm64_load(argc, argv, kbuf, decompressed_size, info);
> >  }
> >
> >  void pez_arm64_usage(void)
> > diff --git a/kexec/kexec-pe-zboot.c b/kexec/kexec-pe-zboot.c
> > index 2f2e052b76c5..3abd17d9fe59 100644
> > --- a/kexec/kexec-pe-zboot.c
> > +++ b/kexec/kexec-pe-zboot.c
> > @@ -37,7 +37,8 @@
> >   *
> >   * crude_buf: the content, which is read from the kernel file without any processing
> >   */
> > -int pez_prepare(const char *crude_buf, off_t buf_sz, int *kernel_fd)
> > +int pez_prepare(const char *crude_buf, off_t buf_sz, int *kernel_fd,
> > +   off_t *kernel_size)
> >  {
> > int ret = -1;
> > int fd = 0;
> > @@ -110,6 +111,7 @@ int pez_prepare(const char *crude_buf, off_t buf_sz, int *kernel_fd)
> > goto fail_bad_header;
> > }
> >
> > +   *kernel_size = decompressed_size;
> > dbgprintf("%s: done\n", __func__);
> >
> > ret = 0;
> > diff --git a/kexec/kexec.c b/kexec/kexec.c
> > index c3b182e254e0..1edbd349c86d 100644
> > --- a/kexec/kexec.c
> > +++ b/kexec/kexec.c
> > @@ -489,7 +489,7 @@ static int add_backup_segments(struct kexec_info *info,
> > return 0;
> >  }
> >
> > -static char *slurp_

[PATCH 1/2] zboot: enable arm64 kexec_load for zboot image

2023-09-11 Thread Dave Young
The kexec_file_load support for zboot kernel images already decompresses
the vmlinuz, so the kexec_load path can simply read the decompressed kernel
fd into a new buffer and use that buffer directly.

Signed-off-by: Dave Young 
---
 include/kexec-pe-zboot.h   |  3 ++-
 kexec/arch/arm64/kexec-vmlinuz-arm64.c | 20 ++--
 kexec/kexec-pe-zboot.c |  4 +++-
 kexec/kexec.c  |  2 +-
 kexec/kexec.h  |  1 +
 5 files changed, 25 insertions(+), 5 deletions(-)

diff --git a/include/kexec-pe-zboot.h b/include/kexec-pe-zboot.h
index e2e0448a81f2..374916cbe883 100644
--- a/include/kexec-pe-zboot.h
+++ b/include/kexec-pe-zboot.h
@@ -11,5 +11,6 @@ struct linux_pe_zboot_header {
uint32_t compress_type;
 };
 
-int pez_prepare(const char *crude_buf, off_t buf_sz, int *kernel_fd);
+int pez_prepare(const char *crude_buf, off_t buf_sz, int *kernel_fd,
+   off_t *kernel_size);
 #endif
diff --git a/kexec/arch/arm64/kexec-vmlinuz-arm64.c b/kexec/arch/arm64/kexec-vmlinuz-arm64.c
index c0ee47c8f50a..8f378d8fa6d0 100644
--- a/kexec/arch/arm64/kexec-vmlinuz-arm64.c
+++ b/kexec/arch/arm64/kexec-vmlinuz-arm64.c
@@ -34,6 +34,7 @@
 #include "arch/options.h"
 
 static int kernel_fd = -1;
+static off_t decompressed_size;
 
 /* Returns:
 * -1 : in case of error/invalid format (not a valid PE+compressed ZBOOT format.
@@ -72,7 +73,7 @@ int pez_arm64_probe(const char *kernel_buf, off_t kernel_size)
return -1;
}
 
-   ret = pez_prepare(buf, buf_sz, &kernel_fd);
+   ret = pez_prepare(buf, buf_sz, &kernel_fd, &decompressed_size);
 
if (!ret) {
/* validate the arm64 specific header */
@@ -98,8 +99,23 @@ bad_header:
 int pez_arm64_load(int argc, char **argv, const char *buf, off_t len,
struct kexec_info *info)
 {
+   char *kbuf;
+
info->kernel_fd = kernel_fd;
-   return image_arm64_load(argc, argv, buf, len, info);
+   if (kernel_fd > 0 && decompressed_size > 0) {
+   off_t nread;
+
+   kbuf = slurp_fd(kernel_fd, NULL, decompressed_size, &nread);
+   if (!kbuf || nread != decompressed_size) {
+   dbgprintf("%s: failed.\n", __func__);
+   return -1;
+   }
+   } else {
+   dbgprintf("%s: wrong file descriptor.\n", __func__);
+   return -1;
+   }
+
+   return image_arm64_load(argc, argv, kbuf, decompressed_size, info);
 }
 
 void pez_arm64_usage(void)
diff --git a/kexec/kexec-pe-zboot.c b/kexec/kexec-pe-zboot.c
index 2f2e052b76c5..3abd17d9fe59 100644
--- a/kexec/kexec-pe-zboot.c
+++ b/kexec/kexec-pe-zboot.c
@@ -37,7 +37,8 @@
  *
  * crude_buf: the content, which is read from the kernel file without any processing
  */
-int pez_prepare(const char *crude_buf, off_t buf_sz, int *kernel_fd)
+int pez_prepare(const char *crude_buf, off_t buf_sz, int *kernel_fd,
+   off_t *kernel_size)
 {
int ret = -1;
int fd = 0;
@@ -110,6 +111,7 @@ int pez_prepare(const char *crude_buf, off_t buf_sz, int *kernel_fd)
goto fail_bad_header;
}
 
+   *kernel_size = decompressed_size;
dbgprintf("%s: done\n", __func__);
 
ret = 0;
diff --git a/kexec/kexec.c b/kexec/kexec.c
index c3b182e254e0..1edbd349c86d 100644
--- a/kexec/kexec.c
+++ b/kexec/kexec.c
@@ -489,7 +489,7 @@ static int add_backup_segments(struct kexec_info *info,
return 0;
 }
 
-static char *slurp_fd(int fd, const char *filename, off_t size, off_t *nread)
+char *slurp_fd(int fd, const char *filename, off_t size, off_t *nread)
 {
char *buf;
off_t progress;
diff --git a/kexec/kexec.h b/kexec/kexec.h
index ed3b499a80f2..093338969c57 100644
--- a/kexec/kexec.h
+++ b/kexec/kexec.h
@@ -267,6 +267,7 @@ extern void die(const char *fmt, ...)
__attribute__ ((format (printf, 1, 2)));
 extern void *xmalloc(size_t size);
 extern void *xrealloc(void *ptr, size_t size);
+extern char *slurp_fd(int fd, const char *filename, off_t size, off_t *nread);
 extern char *slurp_file(const char *filename, off_t *r_size);
 extern char *slurp_file_mmap(const char *filename, off_t *r_size);
 extern char *slurp_file_len(const char *filename, off_t size, off_t *nread);
-- 
2.37.2




[PATCH 2/2] zboot: add loongarch kexec_load support

2023-09-11 Thread Dave Young
From: "dyo...@redhat.com" 

Copy the arm64 code and adapt it for LoongArch so that kexec -c can load
a zboot image.
Note: the zboot image must be probed first, otherwise the pei-loongarch
file type will match it.
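The probe order matters because the file_type[] table is scanned in order and the first probe that accepts the buffer wins; a zboot image is itself a PE file, so a generic PE probe listed earlier would claim it. The simplified model below illustrates this. The struct, both probe functions, and the "zimg" check at offset 4 are hypothetical stand-ins, not the kexec-tools definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified model of a file_type[] entry: a name plus a probe callback. */
struct file_type_model {
	const char *name;
	int (*probe)(const char *buf, size_t len);	/* >= 0 means "mine" */
};

/* Hypothetical zboot probe: "MZ" magic plus a "zimg" tag at offset 4. */
static int probe_zboot(const char *buf, size_t len)
{
	return (len >= 8 && memcmp(buf, "MZ", 2) == 0 &&
		memcmp(buf + 4, "zimg", 4) == 0) ? 0 : -1;
}

/* Hypothetical generic PE probe: accepts anything with the "MZ" magic. */
static int probe_pe(const char *buf, size_t len)
{
	return (len >= 2 && memcmp(buf, "MZ", 2) == 0) ? 0 : -1;
}

/* Return the name of the first entry whose probe accepts the buffer. */
static const char *pick_type(const struct file_type_model *types, size_t n,
			     const char *buf, size_t len)
{
	for (size_t i = 0; i < n; i++)
		if (types[i].probe(buf, len) >= 0)
			return types[i].name;
	return NULL;
}
```

With the zboot entry first, a zboot buffer resolves to it; swap the two entries and the same buffer is taken by the generic PE entry.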

Signed-off-by: Dave Young 
---
 kexec/arch/loongarch/Makefile  |  1 +
 kexec/arch/loongarch/image-header.h|  1 +
 kexec/arch/loongarch/kexec-loongarch.c |  1 +
 kexec/arch/loongarch/kexec-loongarch.h |  4 +
 kexec/arch/loongarch/kexec-pez-loongarch.c | 88 ++
 5 files changed, 95 insertions(+)
 create mode 100644 kexec/arch/loongarch/kexec-pez-loongarch.c

diff --git a/kexec/arch/loongarch/Makefile b/kexec/arch/loongarch/Makefile
index 3b33b9693287..cee7e569a2a2 100644
--- a/kexec/arch/loongarch/Makefile
+++ b/kexec/arch/loongarch/Makefile
@@ -6,6 +6,7 @@ loongarch_KEXEC_SRCS += kexec/arch/loongarch/kexec-elf-loongarch.c
 loongarch_KEXEC_SRCS += kexec/arch/loongarch/kexec-pei-loongarch.c
 loongarch_KEXEC_SRCS += kexec/arch/loongarch/kexec-elf-rel-loongarch.c
 loongarch_KEXEC_SRCS += kexec/arch/loongarch/crashdump-loongarch.c
+loongarch_KEXEC_SRCS += kexec/arch/loongarch/kexec-pez-loongarch.c
 
 loongarch_MEM_REGIONS = kexec/mem_regions.c
 
diff --git a/kexec/arch/loongarch/image-header.h b/kexec/arch/loongarch/image-header.h
index 3b7576552685..223d81f77d9f 100644
--- a/kexec/arch/loongarch/image-header.h
+++ b/kexec/arch/loongarch/image-header.h
@@ -33,6 +33,7 @@ struct loongarch_image_header {
 };
 
 static const uint8_t loongarch_image_pe_sig[2] = {'M', 'Z'};
+static const uint8_t loongarch_pe_machtype[6] = {'P','E', 0x0, 0x0, 0x64, 0x62};
 
 /**
  * loongarch_header_check_pe_sig - Helper to check the loongarch image header.
diff --git a/kexec/arch/loongarch/kexec-loongarch.c b/kexec/arch/loongarch/kexec-loongarch.c
index f47c99861674..62ff8fd1aeb7 100644
--- a/kexec/arch/loongarch/kexec-loongarch.c
+++ b/kexec/arch/loongarch/kexec-loongarch.c
@@ -165,6 +165,7 @@ int get_memory_ranges(struct memory_range **range, int *ranges,
 
 struct file_type file_type[] = {
{"elf-loongarch", elf_loongarch_probe, elf_loongarch_load, elf_loongarch_usage},
+   {"pez-loongarch", pez_loongarch_probe, pez_loongarch_load, pez_loongarch_usage},
{"pei-loongarch", pei_loongarch_probe, pei_loongarch_load, pei_loongarch_usage},
 };
 int file_types = sizeof(file_type) / sizeof(file_type[0]);
diff --git a/kexec/arch/loongarch/kexec-loongarch.h b/kexec/arch/loongarch/kexec-loongarch.h
index 5120a26fd513..2c7624f2fd3a 100644
--- a/kexec/arch/loongarch/kexec-loongarch.h
+++ b/kexec/arch/loongarch/kexec-loongarch.h
@@ -27,6 +27,10 @@ int pei_loongarch_probe(const char *buf, off_t len);
 int pei_loongarch_load(int argc, char **argv, const char *buf, off_t len,
struct kexec_info *info);
 void pei_loongarch_usage(void);
+int pez_loongarch_probe(const char *kernel_buf, off_t kernel_size);
+int pez_loongarch_load(int argc, char **argv, const char *buf, off_t len,
+  struct kexec_info *info);
+void pez_loongarch_usage(void);
 
 int loongarch_process_image_header(const struct loongarch_image_header *h);
 
diff --git a/kexec/arch/loongarch/kexec-pez-loongarch.c b/kexec/arch/loongarch/kexec-pez-loongarch.c
new file mode 100644
index ..6d94a405d54a
--- /dev/null
+++ b/kexec/arch/loongarch/kexec-pez-loongarch.c
@@ -0,0 +1,88 @@
+/*
+ * LoongArch PE compressed Image (vmlinuz, ZBOOT) support.
+ * Based on arm64 code
+ */
+
+#define _GNU_SOURCE
+#include 
+#include 
+#include 
+#include "kexec.h"
+#include "kexec-loongarch.h"
+#include 
+#include "arch/options.h"
+
+static int kernel_fd = -1;
+static off_t decompressed_size;
+
+/* Returns:
+ * -1 : in case of error/invalid format (not a valid PE+compressed ZBOOT format.
+ */
+int pez_loongarch_probe(const char *kernel_buf, off_t kernel_size)
+{
+   int ret = -1;
+   const struct loongarch_image_header *h;
+   char *buf;
+   off_t buf_sz;
+
+   buf = (char *)kernel_buf;
+   buf_sz = kernel_size;
+   if (!buf)
+   return -1;
+   h = (const struct loongarch_image_header *)buf;
+
+   dbgprintf("%s: PROBE.\n", __func__);
+   if (buf_sz < sizeof(struct loongarch_image_header)) {
+   dbgprintf("%s: Not large enough to be a PE image.\n", __func__);
+   return -1;
+   }
+   if (!loongarch_header_check_pe_sig(h)) {
+   dbgprintf("%s: Not a PE image.\n", __func__);
+   return -1;
+   }
+
+   if (buf_sz < sizeof(struct loongarch_image_header) + h->pe_header) {
+   dbgprintf("%s: PE image offset larger than image.\n", __func__);
+   return -1;
+   }
+
+   if (memcmp(&buf[h->pe_header],
+  loongarch_pe_machtype, sizeof(loongarch_pe_machtype))) {
+   dbgprintf("%s: PE header doesn't match machine type.\n&q

Re: kexec reboot failed due to commit 75d090fd167ac

2023-09-11 Thread Dave Young
Adding the kexec list to Cc.

On Sat, 9 Sept 2023 at 19:34, Kirill A. Shutemov
 wrote:
>
> On Fri, Sep 08, 2023 at 06:17:53PM +0200, Ard Biesheuvel wrote:
> > On Fri, Sep 8, 2023 at 5:58 PM Kees Cook  wrote:
> > >
> > > On Fri, Sep 08, 2023 at 03:32:33PM +0300, Kirill A. Shutemov wrote:
> > > > On Fri, Sep 08, 2023 at 02:02:30PM +0800, Aaron Lu wrote:
> > > > > On Thu, Sep 07, 2023 at 04:14:09PM +0300, Kirill A. Shutemov wrote:
> > > > > > On Tue, Aug 29, 2023 at 10:04:51PM +0800, Aaron Lu wrote:
> > > > > > > > Could you show dmesg of the first kernel before kexec?
> > > > > > >
> > > > > > > Attached.
> > > > > > >
> > > > > > > BTW, kexec is invoked like this:
> > > > > > > kver=6.4.0-rc5-9-g75d090fd167a
> > > > > > > kdir=$HOME/kernels/$kver
> > > > > > > sudo kexec -l $kdir/vmlinuz-$kver --initrd=$kdir/initramfs-$kver.img --append="root=UUID=4381321e-e01e-455a-9d46-5e8c4c5b2d02 ro net.ifnames=0 acpi_rsdp=0x728e8014 no_hash_pointers sched_verbose selinux=0"
> > > > > >
> > > > > > I don't understand why it happens.
> > > > > >
> > > > > > Could you check if this patch changes anything:
> > > > > >
> > > > > > diff --git a/arch/x86/boot/compressed/misc.c b/arch/x86/boot/compressed/misc.c
> > > > > > index 94b7abcf624b..172c476ff6f3 100644
> > > > > > --- a/arch/x86/boot/compressed/misc.c
> > > > > > +++ b/arch/x86/boot/compressed/misc.c
> > > > > > @@ -456,10 +456,12 @@ asmlinkage __visible void *extract_kernel(void *rmode, memptr heap,
> > > > > >
> > > > > >   debug_putstr("\nDecompressing Linux... ");
> > > > > >
> > > > > > +#if 0
> > > > > >   if (init_unaccepted_memory()) {
> > > > > >   debug_putstr("Accepting memory... ");
> > > > > >   accept_memory(__pa(output), __pa(output) + needed_size);
> > > > > >   }
> > > > > > +#endif
> > > > > >
> > > > > >   __decompress(input_data, input_len, NULL, NULL, output, output_len,
> > > > > >   NULL, error);
> > > > > > --
> > > > >
> > > > > It solved the problem.
> > > >
> > > > Looks like increasing BOOT_INIT_PGT_SIZE fixes the issue. I don't yet
> > > > understand why and how unaccepted memory is involved. I will look more
> > > > into it.
> > > >
> > > > Enabling CONFIG_RANDOMIZE_BASE also makes the issue go away.
> > >
> > > Is this perhaps just luck? I.e. does it ever break on, say, 1000 boot
> > > attempts? (i.e. maybe some position is bad and KASLR happens to usually
> > > avoid it?)
>
> Yes, it can be luck.
>
> > > > Kees, maybe you have a clue?
> > >
> > > The only thing I can think of is that something isn't being counted
> > > correctly due to the size of code, and it just happens that this commit
> > > makes the code large enough to exceed some set of mappings?
> > >
> > > >
> > > > diff --git a/arch/x86/include/asm/boot.h b/arch/x86/include/asm/boot.h
> > > > index 9191280d9ea3..26ccce41d781 100644
> > > > --- a/arch/x86/include/asm/boot.h
> > > > +++ b/arch/x86/include/asm/boot.h
> > > > @@ -40,7 +40,7 @@
> > > >  #ifdef CONFIG_X86_64
> > > >  # define BOOT_STACK_SIZE 0x4000
> > > >
> > > > -# define BOOT_INIT_PGT_SIZE  (6*4096)
> > > > +# define BOOT_INIT_PGT_SIZE  (7*4096)
> > >
> > > That's why this might be working, for example? How large is the boot
> > > image before/after the commit, etc?
> > >
> >
> > Not sure why these changes would make a difference here, but choking
> > on accept_memory() on a non-TDX suggests that init_unaccepted_memory()
> > is poking into unmapped memory before it even decides that the
> > unaccepted memory does not exist.
> >
> > init_unaccepted_memory() has
> >
> > ret = efi_get_conf_table(boot_params, _table_pa, _table_len);
> > if (ret) {
> > warn("EFI config table not found.");
> > return false;
> > }
> >
> > which looks for  tuples in an array pointed to by the
> > EFI system table, and if either of those is not mapped, things can be
> > expected to explode.
> >
> > The only odd thing there is that this code is invoked after setting up
> > the 'demand paging' logic in the decompressor.
> >
> > If you haven't yet, could you please retry the kexec boot with
> > earlyprintk=tty?
>
> early console in extract_kernel
> input_data: 0x00807eb433a8
> input_len: 0x00d26271
> output: 0x00807b00
> output_len: 0x04800c10
> kernel_total_size: 0x03e28000
> needed_size: 0x04a0
> trampoline_32bit: 0x0009d000
>
> Decompressing Linux... out of pgt_buf in arch/x86/boot/compressed/ident_map_64.c!?
> pages->pgt_buf_offset: 0x6000
> pages->pgt_buf_size: 0x6000
>
>
> Error: kernel_ident_mapping_init() failed
>
> It crashes on #PF due to stbl->nr_tables dereference in
> efi_get_conf_table() called from init_unaccepted_memory().
>
> I don't see anything special about stbl location: 0x775d6018.
>
> One other bit of information: disabling 5-level paging also 

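In the report above, pgt_buf_offset has reached pgt_buf_size at 0x6000, which is exactly BOOT_INIT_PGT_SIZE = 6 * 4096 bytes: all six preallocated page-table pages are used up. That fits Kirill's observation that raising it to 7*4096 makes the failure go away. The sketch below is a simplified model of a fixed-size page-table pool handed out one page at a time; struct pgt_pool and pgt_alloc_page() are hypothetical names, not the decompressor's actual code.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096UL

/* Simplified model of a fixed page-table buffer used as a bump allocator. */
struct pgt_pool {
	unsigned char *buf;
	unsigned long offset;	/* bytes already handed out */
	unsigned long size;	/* total bytes, e.g. 6 * 4096 = 0x6000 */
};

/* Hand out the next page, or NULL once the pool is exhausted. */
static void *pgt_alloc_page(struct pgt_pool *pool)
{
	void *page;

	/* Exhausted: offset has reached size (0x6000 == 0x6000 in the log). */
	if (pool->offset >= pool->size)
		return NULL;

	page = pool->buf + pool->offset;
	pool->offset += MODEL_PAGE_SIZE;
	return page;
}
```

Any mapping request that needs a seventh page fails in this model, whichever address it targets.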
Re: [PATCHv7 2/5] kexec: Introduce a member kernel_fd in kexec_info

2023-08-10 Thread Dave Young
Hi,

On Thu, 3 Aug 2023 at 10:42, Pingfan Liu  wrote:
>
> Utilize the image load interface to export the kernel fd, which points
> to the uncompressed kernel and will be passed to kexec_file_load.
>
> The credit goes to the Dave Young, who contributes the original code.
>
> Signed-off-by: Pingfan Liu 
> Co-authored-by: Dave Young 

Signed-off-by: Dave Young 

> To: kexec@lists.infradead.org
> Cc: ho...@verge.net.au
> Cc: a...@kernel.org
> Cc: jeremy.lin...@arm.com
> ---
>  kexec/kexec.c | 8 
>  kexec/kexec.h | 1 +
>  2 files changed, 9 insertions(+)
>
> diff --git a/kexec/kexec.c b/kexec/kexec.c
> index d132eb5..c3b182e 100644
> --- a/kexec/kexec.c
> +++ b/kexec/kexec.c
> @@ -1292,6 +1292,7 @@ static int do_kexec_file_load(int fileind, int argc, char **argv,
> info.kexec_flags = flags;
>
> info.file_mode = 1;
> +   info.kernel_fd = -1;
> info.initrd_fd = -1;
>
> if (!is_kexec_file_load_implemented())
> @@ -1337,6 +1338,13 @@ static int do_kexec_file_load(int fileind, int argc, char **argv,
> return ret;
> }
>
> +   /*
> +   * image type specific load function detects the capsule kernel type
> +   * and creates another fd for file load, for example for the zboot kernel.
> +   */
> +   if (info.kernel_fd != -1)
> +   kernel_fd = info.kernel_fd;
> +
> /*
>  * If there is no initramfs, set KEXEC_FILE_NO_INITRAMFS flag so that
>  * kernel does not return error with negative initrd_fd.
> diff --git a/kexec/kexec.h b/kexec/kexec.h
> index 0d820ad..ed3b499 100644
> --- a/kexec/kexec.h
> +++ b/kexec/kexec.h
> @@ -164,6 +164,7 @@ struct kexec_info {
> unsigned long file_mode :1;
>
> /* Filled by kernel image processing code */
> +   int kernel_fd;
> int initrd_fd;
> char *command_line;
> int command_line_len;
> --
> 2.31.1
>
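The patch above initializes info.kernel_fd to -1 and only replaces the opened kernel fd when an image-specific load function has set a different one. A tiny model of that hand-off, using hypothetical names (struct kexec_info_model, load_capsule(), pick_kernel_fd()) in place of the real kexec-tools code:

```c
#include <assert.h>

/* Trimmed-down model of the one kexec_info field involved (hypothetical). */
struct kexec_info_model {
	int kernel_fd;	/* -1 unless an image loader supplies its own fd */
};

/* Hypothetical capsule (e.g. zboot) loader handing back the fd of the
 * decompressed kernel. */
static int load_capsule(struct kexec_info_model *info, int decompressed_fd)
{
	info->kernel_fd = decompressed_fd;
	return 0;
}

/* Keep the originally opened fd unless the loader set a different one. */
static int pick_kernel_fd(int opened_fd, const struct kexec_info_model *info)
{
	return info->kernel_fd != -1 ? info->kernel_fd : opened_fd;
}
```

Initializing the field to -1 is what lets plain Image loaders, which never touch it, keep the original fd.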




Re: [PATCHv7 1/5] kexec/arm64: Simplify the code for zImage

2023-08-10 Thread Dave Young
On Thu, 10 Aug 2023 at 20:11, Simon Horman  wrote:
>
> On Mon, Aug 07, 2023 at 09:26:49PM +0800, Dave Young wrote:
> > On Mon, 7 Aug 2023 at 21:23, Simon Horman  wrote:
> > >
> > > On Thu, Aug 03, 2023 at 10:41:48AM +0800, Pingfan Liu wrote:
> > > > Inside zimage_probe(), it uncompresses the kernel and performs some
> > > > check, similar to image_probe(). Taking a close look, the uncompressing
> > > > has already executed before the image probe is called. What is missing
> > > > here is to provide a fd, pointing to an uncompressed kernel image.
> > > >
> > > > This patch creates a memfd based on the result produced by
> > > > slurp_decompress_file(), and finally simplify the logical of the probe
> > > > for aarch64.
> > > >
> > > > The credit goes to the Dave Young, who contributes the original code.
> > > >
> > > > Signed-off-by: Pingfan Liu 
> > > > Co-authored-by: Dave Young 
> > >
> > > Dave, could I get a Signed-off-by line from you?
> > > Likewise for patch 2/5.
> >
> > Hi Simon, sounds good to me, thanks!
>
> Thanks, could you respond to the two patches with the following?
>
> Signed-off by: <>

Sure, please feel free to add:

Signed-off-by: Dave Young 

>
> >
> > >
> > > I think simply replying to the relevant emails should be sufficient.
> > >
> > >
> >
>




Re: [PATCHv7 1/5] kexec/arm64: Simplify the code for zImage

2023-08-07 Thread Dave Young
On Mon, 7 Aug 2023 at 21:23, Simon Horman  wrote:
>
> On Thu, Aug 03, 2023 at 10:41:48AM +0800, Pingfan Liu wrote:
> > Inside zimage_probe(), it uncompresses the kernel and performs some
> > check, similar to image_probe(). Taking a close look, the uncompressing
> > has already executed before the image probe is called. What is missing
> > here is to provide a fd, pointing to an uncompressed kernel image.
> >
> > This patch creates a memfd based on the result produced by
> > slurp_decompress_file(), and finally simplify the logical of the probe
> > for aarch64.
> >
> > The credit goes to the Dave Young, who contributes the original code.
> >
> > Signed-off-by: Pingfan Liu 
> > Co-authored-by: Dave Young 
>
> Dave, could I get a Signed-off-by line from you?
> Likewise for patch 2/5.

Hi Simon, sounds good to me, thanks!

>
> I think simply replying to the relevant emails should be sufficient.
>
>




Re: [PATCHv5 0/8] arm64: zboot support

2023-07-20 Thread Dave Young
On Thu, 20 Jul 2023 at 16:59, Pingfan Liu  wrote:
>
> On Thu, Jul 20, 2023 at 3:27 PM Dave Young  wrote:
> >
> > Hi Pingfan,
> >
> > On Thu, 20 Jul 2023 at 10:05, Pingfan Liu  wrote:
> > >
> > > Hi Dave,
> > >
> > > Thanks for your insight. Please see the comments inline below.
> > >
> > > On Wed, Jul 19, 2023 at 11:00 AM Dave Young  wrote:
> > > >
> > > > Hi Pingfan, Simon,
> > > >
> > > > On 07/17/23 at 09:07pm, Pingfan Liu wrote:
> > > > > As more complicated capsule kernel format occurs like zboot, where the
> > > > > compressed kernel is stored as a payload. The straight forward
> > > > > decompression can not meet the demand.
> > > > >
> > > > > As the first step, on aarch64, reading in the kernel file in a probe
> > > > > method and decide how to unfold the content by the method itself.
> > > > >
> > > > > This series introduce a new image probe interface probe2(), which
> > > > > returns three factors: kernel buffer, kernel size and kernel fd through
> > > > > a struct parsed_info.
> > > > > -1. the parsed kernel_buf should be returned so that it can be used by
> > > > > the image load method later.
> > > > > -2. the final fd passed to sys_kexec_file_load, since aarch64 kernel can
> > > > > only work with Image format, the outer payload should be stripped and a
> > > > > temporary file of Image should be created.
> > > >
> > > > I took a look at the Image.gz file load code, the current code can be
> > > > simplified with passing a fd directly instead of creating temp files via
> > > > memfd_create with the already decompressed kernel_buf.
> > > >
> > > > The current file load is like below:
> > > >
> > > > do_kexec_file_load():
> > > >   1. slurp_decompress_file
> > > >   2. probe
> > > >   3. load
> > > >   4. kexec_file_load
> > > >
> > > > In step 1, the Image.gz has been decompressed to kernel_buf, so just
> > > > create a virtual memfd copy to it, then save the virtual fd for step 4
> > > > use.
> > > >
> > > > Otherwise in step 2 it is some sanity checking, step 3 is setting
> > > > something else eg. initrd_fd, cmdline. With the changes below Image and
> > > > Image.gz will share same code. I think you can add the zboot
> > > > detection/checking code in the Image probe, load functions, with a new
> > > > info->kernel_fd, you can decompress the zboot kernel_buf and save to
> > > > another virtual memfd, and set to the info->kernel_fd.  Then in step 4
> > > > the kexec_file_load can just use it.
> > > >
> > >
> > > This only results in a minor change in the interface, not like mine. I
> > > prefer this method.
> > >
> > > > The kernel_buf itself is only used for sanity checking of the formats,
> > > > kernel only needs a file fd, so I think it should be fine and easier
> > > > than the original ways.
> > > >
> > >
> > > Overall, this new method minimally affects the function interface, in
> > > addition to the code simplification.
> > > I will try to take it in the next version.
> > >
> > Ok, great.  Btw, it would be ideal if some of the pez_prepare work can
> > be done early before the slurp_decompress_file in
> > do_kexec_file_load(), then it could be just arch independent :)
> >
>
> I assume that you suggest the following big picture:
> if trying slurp_decompress_file()
> else trying pez_prepare()
> for (i = 0; i < file_types; i++) {
>   if (file_type[i].probe(kernel_buf, kernel_size) >= 0)
>}
>
> But there is some check before pez_prepare() in pez_arm64_probe(), I
> am not sure whether they can be skipped or not.

OK, got it. If the arm64 zimage header is arch-specific, that check should
still stay in the arch code.

Thanks
Dave
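The flow discussed above (decompress into kernel_buf, probe, load, then kexec_file_load on an fd) needs a way to turn an in-memory buffer back into a file descriptor. A minimal sketch using Linux memfd_create(2), assuming glibc 2.27 or newer; buf_to_memfd() is a hypothetical helper, not the code that was eventually merged.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: copy an already-decompressed kernel buffer into an
 * anonymous in-memory file and return its fd, or -1 on error.
 * memfd_create() is Linux-specific.
 */
static int buf_to_memfd(const char *name, const char *buf, size_t len)
{
	size_t done = 0;
	int fd = memfd_create(name, MFD_CLOEXEC);

	if (fd < 0)
		return -1;
	while (done < len) {
		ssize_t w = write(fd, buf + done, len - done);

		if (w < 0) {
			close(fd);
			return -1;
		}
		done += (size_t)w;
	}
	/* Rewind so a later sequential read() starts at the beginning. */
	if (lseek(fd, 0, SEEK_SET) != 0) {
		close(fd);
		return -1;
	}
	return fd;
}
```

The resulting fd can be stored (for example in a kernel_fd field) and handed to the file-based load step, while the buffer itself remains available for format sanity checks.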




Re: [PATCHv5 0/8] arm64: zboot support

2023-07-20 Thread Dave Young
On Thu, 20 Jul 2023 at 18:08, Dave Young  wrote:
>
> On Thu, 20 Jul 2023 at 16:59, Pingfan Liu  wrote:
> >
> > On Thu, Jul 20, 2023 at 3:27 PM Dave Young  wrote:
> > >
> > > Hi Pingfan,
> > >
> > > On Thu, 20 Jul 2023 at 10:05, Pingfan Liu  wrote:
> > > >
> > > > Hi Dave,
> > > >
> > > > Thanks for your insight. Please see the comments inline below.
> > > >
> > > > On Wed, Jul 19, 2023 at 11:00 AM Dave Young  wrote:
> > > > >
> > > > > Hi Pingfan, Simon,
> > > > >
> > > > > On 07/17/23 at 09:07pm, Pingfan Liu wrote:
> > > > > > As more complicated capsule kernel format occurs like zboot, where the
> > > > > > compressed kernel is stored as a payload. The straight forward
> > > > > > decompression can not meet the demand.
> > > > > >
> > > > > > As the first step, on aarch64, reading in the kernel file in a probe
> > > > > > method and decide how to unfold the content by the method itself.
> > > > > >
> > > > > > This series introduce a new image probe interface probe2(), which
> > > > > > returns three factors: kernel buffer, kernel size and kernel fd through
> > > > > > a struct parsed_info.
> > > > > > -1. the parsed kernel_buf should be returned so that it can be used by
> > > > > > the image load method later.
> > > > > > -2. the final fd passed to sys_kexec_file_load, since aarch64 kernel can
> > > > > > only work with Image format, the outer payload should be stripped and a
> > > > > > temporary file of Image should be created.
> > > > >
> > > > > I took a look at the Image.gz file load code, the current code can be
> > > > > simplified with passing a fd directly instead of creating temp files via
> > > > > memfd_create with the already decompressed kernel_buf.
> > > > >
> > > > > The current file load is like below:
> > > > >
> > > > > do_kexec_file_load():
> > > > >   1. slurp_decompress_file
> > > > >   2. probe
> > > > >   3. load
> > > > >   4. kexec_file_load
> > > > >
> > > > > In step 1, the Image.gz has been decompressed to kernel_buf, so just
> > > > > create a virtual memfd copy to it, then save the virtual fd for step 4
> > > > > use.
> > > > >
> > > > > Otherwise in step 2 it is some sanity checking, step 3 is setting
> > > > > something else eg. initrd_fd, cmdline. With the changes below Image and
> > > > > Image.gz will share same code. I think you can add the zboot
> > > > > detection/checking code in the Image probe, load functions, with a new
> > > > > info->kernel_fd, you can decompress the zboot kernel_buf and save to
> > > > > another virtual memfd, and set to the info->kernel_fd.  Then in step 4
> > > > > the kexec_file_load can just use it.
> > > > >
> > > >
> > > > This only results in a minor change in the interface, not like mine. I
> > > > prefer this method.
> > > >
> > > > > The kernel_buf itself is only used for sanity checking of the formats,
> > > > > kernel only needs a file fd, so I think it should be fine and easier
> > > > > than the original ways.
> > > > >
> > > >
> > > > Overall, this new method minimally affects the function interface, in
> > > > addition to the code simplification.
> > > > I will try to take it in the next version.
> > > >
> > > Ok, great.  Btw, it would be ideal if some of the pez_prepare work can
> > > be done early before the slurp_decompress_file in
> > > do_kexec_file_load(), then it could be just arch independent :)
> > >
> >
> > I assume that you suggest the following big picture:
> > if trying slurp_decompress_file()
> > else trying pez_prepare()
> > for (i = 0; i < file_types; i++) {
> >   if (file_type[i].probe(kernel_buf, kernel_size) >= 0)
> >}
> >
> > But there is some check before pez_prepare() in pez_arm64_probe(), I
> > am not sure whether they can be skipped or not.
>
> Ok, got it, if the zimage header of arm64 is arch specific , it should

s/zimage/zboot

> still go to the arch part.
>
> Thanks
> Dave
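For illustration, the probe loop discussed above might look roughly like this in C if the zboot check sits next to the arm64 Image probe in the arch code. This is a simplified sketch, not kexec-tools code: `struct fmt_probe`, `fmt_table[]` and `probe_kernel()` are hypothetical stand-ins for the `file_type[]` table. The Image check relies on the arm64 boot header magic "ARM\x64" at offset 0x38; the zboot check assumes the header layout of the kernel's EFI zboot stub ("MZ" at offset 0, the ASCII tag "zimg" at offset 4).

```c
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical, simplified stand-in for a file_type[] entry. */
struct fmt_probe {
	const char *name;
	int (*probe)(const char *buf, off_t len);	/* >= 0 means "mine" */
};

/* arm64 Image: the boot header carries the magic "ARM\x64" at 0x38. */
static int image_probe(const char *buf, off_t len)
{
	static const char magic[4] = { 'A', 'R', 'M', 0x64 };

	if (len < 0x38 + 4)
		return -1;
	return memcmp(buf + 0x38, magic, sizeof(magic)) == 0 ? 0 : -1;
}

/* EFI zboot wrapper: "MZ" at 0 and "zimg" at 4 (assumed layout). */
static int zboot_probe(const char *buf, off_t len)
{
	if (len < 8)
		return -1;
	return (memcmp(buf, "MZ", 2) == 0 &&
		memcmp(buf + 4, "zimg", 4) == 0) ? 0 : -1;
}

/* The wrapper format is tried first, then the plain Image. */
static const struct fmt_probe fmt_table[] = {
	{ "zboot", zboot_probe },
	{ "Image", image_probe },
};

/* Return the index of the first format whose probe accepts the buffer,
 * or -1 if none does. */
static int probe_kernel(const char *buf, off_t len)
{
	for (size_t i = 0; i < sizeof(fmt_table) / sizeof(fmt_table[0]); i++)
		if (fmt_table[i].probe(buf, len) >= 0)
			return (int)i;
	return -1;
}
```

Keeping the zboot entry in the arch table is what makes the check arch-specific; moving it into generic code would only require that the header format itself is not arm64-only.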




Re: [PATCHv5 0/8] arm64: zboot support

2023-07-20 Thread Dave Young
Hi Pingfan,

On Thu, 20 Jul 2023 at 10:05, Pingfan Liu  wrote:
>
> Hi Dave,
>
> Thanks for your insight. Please see the comments inline below.
>
> On Wed, Jul 19, 2023 at 11:00 AM Dave Young  wrote:
> >
> > Hi Pingfan, Simon,
> >
> > On 07/17/23 at 09:07pm, Pingfan Liu wrote:
> > > As more complicated capsule kernel format occurs like zboot, where the
> > > compressed kernel is stored as a payload. The straight forward
> > > decompression can not meet the demand.
> > >
> > > As the first step, on aarch64, reading in the kernel file in a probe
> > > method and decide how to unfold the content by the method itself.
> > >
> > > This series introduce a new image probe interface probe2(), which
> > > returns three factors: kernel buffer, kernel size and kernel fd through
> > > a struct parsed_info.
> > > -1. the parsed kernel_buf should be returned so that it can be used by
> > > the image load method later.
> > > -2. the final fd passed to sys_kexec_file_load, since aarch64 kernel can
> > > only work with Image format, the outer payload should be stripped and a
> > > temporary file of Image should be created.
> >
> > I took a look at the Image.gz file load code, the current code can be
> > simplified with passing a fd directly instead of creating temp files via
> > memfd_create with the already decompressed kernel_buf.
> >
> > The current file load is like below:
> >
> > do_kexec_file_load():
> >   1. slurp_decompress_file
> >   2. probe
> >   3. load
> >   4. kexec_file_load
> >
> > In step 1, the Image.gz has been decompressed to kernel_buf, so just
> > create a virtual memfd copy to it, then save the virtual fd for step 4
> > use.
> >
> > Otherwise in step 2 it is some sanity checking, step 3 is setting
> > something else eg. initrd_fd, cmdline. With the changes below Image and
> > Image.gz will share same code. I think you can add the zboot
> > detection/checking code in the Image probe, load functions, with a new
> > info->kernel_fd, you can decompress the zboot kernel_buf and save to
> > another virtual memfd, and set to the info->kernel_fd.  Then in step 4
> > the kexec_file_load can just use it.
> >
>
> This only results in a minor change in the interface, not like mine. I
> prefer this method.
>
> > The kernel_buf itself is only used for sanity checking of the formats,
> > kernel only needs a file fd, so I think it should be fine and easier
> > than the original ways.
> >
>
> Overall, this new method minimally affects the function interface, in
> addition to the code simplification.
> I will try to take it in the next version.
>
OK, great. By the way, it would be ideal if some of the pez_prepare work
could be done earlier, before slurp_decompress_file() in
do_kexec_file_load(); then it could be arch-independent :)

Thanks
Dave




Re: [PATCHv5 0/8] arm64: zboot support

2023-07-19 Thread Dave Young
Hi Pingfan, Simon,

On 07/17/23 at 09:07pm, Pingfan Liu wrote:
> As more complicated capsule kernel format occurs like zboot, where the
> compressed kernel is stored as a payload. The straight forward
> decompression can not meet the demand.
> 
> As the first step, on aarch64, reading in the kernel file in a probe
> method and decide how to unfold the content by the method itself.
> 
> This series introduce a new image probe interface probe2(), which
> returns three factors: kernel buffer, kernel size and kernel fd through
> a struct parsed_info.
> -1. the parsed kernel_buf should be returned so that it can be used by
> the image load method later.
> -2. the final fd passed to sys_kexec_file_load, since aarch64 kernel can
> only work with Image format, the outer payload should be stripped and a
> temporary file of Image should be created.

I took a look at the Image.gz file load code. It can be simplified by
passing an fd directly, created with memfd_create() from the already
decompressed kernel_buf, instead of creating temporary files.

The current file load is like below:

do_kexec_file_load():
  1. slurp_decompress_file
  2. probe
  3. load
  4. kexec_file_load

In step 1, Image.gz has already been decompressed into kernel_buf, so we
can just create a memfd, copy kernel_buf into it, and save that fd for
use in step 4.

Step 2 does some sanity checking, and step 3 sets up other things, e.g.
initrd_fd and the cmdline. With the changes below, Image and Image.gz
share the same code. I think you can add the zboot detection/checking
code to the Image probe and load functions: with a new info->kernel_fd,
you can decompress the zboot kernel_buf, save it to another memfd, and
store that fd in info->kernel_fd. Then kexec_file_load in step 4 can
just use it.

kernel_buf itself is only used for sanity-checking the format; the
kernel only needs a file fd, so I think this should work and be simpler
than the original approach.
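A minimal sketch of the memfd step, assuming Linux with glibc 2.27 or later for memfd_create(). The helper name kernel_buf_to_memfd() is hypothetical; the fd it returns is what would be passed as the kernel fd to kexec_file_load() in step 4.

```c
#define _GNU_SOURCE
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Copy an already decompressed kernel image into an anonymous,
 * memory-backed file and return its fd, or -1 on error. */
static int kernel_buf_to_memfd(const char *buf, size_t len)
{
	int fd = memfd_create("kernel", MFD_CLOEXEC);
	size_t done = 0;

	if (fd < 0)
		return -1;
	/* write() may be partial, so loop until everything is copied. */
	while (done < len) {
		ssize_t n = write(fd, buf + done, len - done);

		if (n < 0) {
			close(fd);
			return -1;
		}
		done += (size_t)n;
	}
	/* Rewind so a later read() in userspace starts at offset 0. */
	if (lseek(fd, 0, SEEK_SET) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}
```

Nothing touches the filesystem, so no temporary file has to be created or cleaned up.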

Thoughts?

---
 kexec/arch/arm64/Makefile |3 
 kexec/arch/arm64/kexec-arm64.c|1 
 kexec/arch/arm64/kexec-arm64.h|6 
 kexec/arch/arm64/kexec-image-arm64.c  |2 
 kexec/arch/arm64/kexec-zImage-arm64.c |  226 --
 kexec/kexec.c |   55 +---
 kexec/kexec.h |1 
 7 files changed, 39 insertions(+), 255 deletions(-)

Index: kexec-tools/kexec/arch/arm64/Makefile
===
--- kexec-tools.orig/kexec/arch/arm64/Makefile
+++ kexec-tools/kexec/arch/arm64/Makefile
@@ -15,8 +15,7 @@ arm64_KEXEC_SRCS += \
kexec/arch/arm64/kexec-arm64.c \
kexec/arch/arm64/kexec-elf-arm64.c \
kexec/arch/arm64/kexec-uImage-arm64.c \
-   kexec/arch/arm64/kexec-image-arm64.c \
-   kexec/arch/arm64/kexec-zImage-arm64.c
+   kexec/arch/arm64/kexec-image-arm64.c
 
 arm64_UIMAGE = kexec/kexec-uImage.c
 
Index: kexec-tools/kexec/arch/arm64/kexec-arm64.c
===
--- kexec-tools.orig/kexec/arch/arm64/kexec-arm64.c
+++ kexec-tools/kexec/arch/arm64/kexec-arm64.c
@@ -74,7 +74,6 @@ struct file_type file_type[] = {
{"vmlinux", elf_arm64_probe, elf_arm64_load, elf_arm64_usage},
{"Image", image_arm64_probe, image_arm64_load, image_arm64_usage},
{"uImage", uImage_arm64_probe, uImage_arm64_load, uImage_arm64_usage},
-   {"zImage", zImage_arm64_probe, zImage_arm64_load, zImage_arm64_usage},
 };
 
 int file_types = sizeof(file_type) / sizeof(file_type[0]);
Index: kexec-tools/kexec/arch/arm64/kexec-arm64.h
===
--- kexec-tools.orig/kexec/arch/arm64/kexec-arm64.h
+++ kexec-tools/kexec/arch/arm64/kexec-arm64.h
@@ -44,12 +44,6 @@ int uImage_arm64_load(int argc, char **a
  struct kexec_info *info);
 void uImage_arm64_usage(void);
 
-int zImage_arm64_probe(const char *kernel_buf, off_t kernel_size);
-int zImage_arm64_load(int argc, char **argv, const char *kernel_buf,
-   off_t kernel_size, struct kexec_info *info);
-void zImage_arm64_usage(void);
-
-
 extern off_t initrd_base;
 extern off_t initrd_size;
 
Index: kexec-tools/kexec/arch/arm64/kexec-image-arm64.c
===
--- kexec-tools.orig/kexec/arch/arm64/kexec-image-arm64.c
+++ kexec-tools/kexec/arch/arm64/kexec-image-arm64.c
@@ -114,6 +114,6 @@ exit:
 void image_arm64_usage(void)
 {
printf(
-" An ARM64 binary image, uncompressed, big or little endian.\n"
+" An ARM64 binary image, compressed or not, big or little endian.\n"
 " Typically an Image file.\n\n");
 }
Index: kexec-tools/kexec/arch/arm64/kexec-zImage-arm64.c
===
--- kexec-tools.orig/kexec/arch/arm64/kexec-zImage-arm64.c
+++ /dev/null
@@ -1,226 +0,0 @@
-/*
- * ARM64 kexec zImage (Image.gz) support.
- *
- * 

Re: [RFC PATCH 0/4] kdump: add generic functions to simplify crashkernel crashkernel in architecture

2023-07-07 Thread Dave Young
On 06/19/23 at 01:59pm, Baoquan He wrote:
> In the current arm64, crashkernel=,high support has been finished after
> several rounds of posting and careful reviewing. The code in arm64 which
> parses crashkernel kernel parameters firstly, then reserve memory can be
> a good example for other ARCH to refer to.
> 
> Whereas in x86_64, the code mixing crashkernel parameter parsing and
> memory reserving is twisted, and looks messy. Refactoring the code to
> make it more readable maintainable is necessary.
> 
> Here, try to abstract the crashkernel parameter parsing code into a
> generic function parse_crashkernel_generic(), and the crashkernel memory
> reserving code into a generic function reserve_crashkernel_generic().
> Then, in ARCH which crashkernel=,high support is needed, a simple
> arch_reserve_crashkernel() can be added to call above two generic
> functions. This can remove the duplicated implmentation code in each
> ARCH, like arm64, x86_64.

Hi Baoquan, having both parse_crashkernel_common() and
parse_crashkernel_generic() is confusing to me. Thanks for the effort,
though.

I'm not sure whether it will be easy, but ideally I think the parse
function could be arch-independent: something like a general function
parse_crashkernel() that returns all the necessary crashkernel
information for the arch code to use. For example, it could return
something like the pseudo structure below (just a concept, it may need
more thought):

struct crashkernel_range {
  size;
  range;
  struct list_head list;
};

struct crashkernel {
  struct crashkernel_range *range_list;
  union {
    offset;
    low_high;
  };
};

The arch code can then just take the parsed crashkernel data and check
the details; if it does not support low and high reservation, it can
simply ignore that option.

Thoughts?
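As a rough illustration of that idea, here is a hypothetical generic parser that fills in one structure and leaves policy to the arch code. It handles only the size, size@offset, size,high and size,low forms; the range-list form (start-end:size,...) from the concept above is left out for brevity. All names are made up for this sketch and are not existing kernel interfaces.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical result of a generic crashkernel= parse. */
enum ck_kind { CK_PLAIN, CK_OFFSET, CK_HIGH, CK_LOW };

struct crashkernel_info {
	enum ck_kind kind;
	unsigned long long size;
	unsigned long long offset;	/* meaningful only for CK_OFFSET */
};

/* Parse a number with an optional K/M/G suffix, similar in spirit to
 * the kernel's memparse(); returns a pointer past the parsed text. */
static const char *parse_size(const char *s, unsigned long long *out)
{
	char *end;
	unsigned long long v = strtoull(s, &end, 0);

	switch (*end) {
	case 'G': case 'g': v <<= 10; /* fall through */
	case 'M': case 'm': v <<= 10; /* fall through */
	case 'K': case 'k': v <<= 10; end++; break;
	default: break;
	}
	*out = v;
	return end;
}

/* Returns 0 on success, -1 on malformed input. */
static int parse_crashkernel_generic(const char *arg, struct crashkernel_info *ck)
{
	const char *p = parse_size(arg, &ck->size);

	ck->kind = CK_PLAIN;
	ck->offset = 0;
	if (p == arg || ck->size == 0)
		return -1;
	if (*p == '@') {
		const char *q = parse_size(p + 1, &ck->offset);

		if (q == p + 1 || *q != '\0')
			return -1;
		ck->kind = CK_OFFSET;
		return 0;
	}
	if (strcmp(p, ",high") == 0)
		ck->kind = CK_HIGH;
	else if (strcmp(p, ",low") == 0)
		ck->kind = CK_LOW;
	else if (*p != '\0')
		return -1;
	return 0;
}

/* Arch-side policy: without ,high/,low support, this sketch simply
 * treats the request as a plain reservation of the same size. */
static struct crashkernel_info arch_filter(struct crashkernel_info ck, bool supports_high)
{
	if (!supports_high && (ck.kind == CK_HIGH || ck.kind == CK_LOW))
		ck.kind = CK_PLAIN;
	return ck;
}
```

With this split, "1G,high" parses identically on every arch, and only arch_filter() decides whether the ,high part is honoured.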

> 
> I only change the arm64 and x86_64 implementation to make use of the
> generic functions to simplify code. Risc-v can be done very easily refer
> to the steps in arm64 and x86_64. I leave this to Jiahao or other risc-v
> developer since Jiahao have posted a patchset to add crashkernel=,high
> support to risc-v.
> 
> This patchset is based on the latest linus's tree, and on top of below
> patch:
> 
> arm64: kdump: simplify the reservation behaviour of crashkernel=,high
>   https://git.kernel.org/arm64/c/6c4dcaddbd36
> 
> 
> Baoquan He (4):
>   kdump: rename parse_crashkernel() to parse_crashkernel_common()
>   kdump: add generic functions to parse crashkernel and do reservation
>   arm64: kdump: use generic interfaces to simplify crashkernel
> reservation code
>   x86: kdump: use generic interfaces to simplify crashkernel reservation
> code
> 
>  arch/arm/kernel/setup.c  |   4 +-
>  arch/arm64/Kconfig   |   3 +
>  arch/arm64/include/asm/kexec.h   |   8 ++
>  arch/arm64/mm/init.c | 141 ++--
>  arch/ia64/kernel/setup.c |   4 +-
>  arch/loongarch/kernel/setup.c|   3 +-
>  arch/mips/cavium-octeon/setup.c  |   2 +-
>  arch/mips/kernel/setup.c |   4 +-
>  arch/powerpc/kernel/fadump.c |   5 +-
>  arch/powerpc/kexec/core.c|   4 +-
>  arch/powerpc/mm/nohash/kaslr_booke.c |   4 +-
>  arch/riscv/mm/init.c |   5 +-
>  arch/s390/kernel/setup.c |   4 +-
>  arch/sh/kernel/machine_kexec.c   |   5 +-
>  arch/x86/Kconfig |   3 +
>  arch/x86/include/asm/kexec.h |  32 ++
>  arch/x86/kernel/setup.c  | 141 +++-
>  include/linux/crash_core.h   |  33 +-
>  kernel/crash_core.c  | 158 +--
>  19 files changed, 274 insertions(+), 289 deletions(-)
> 
> -- 
> 2.34.1
> 
> 

Thanks
Dave




Re: [PATCH 00/10] ima: measure events between kexec load and execute

2023-07-07 Thread Dave Young
[Add Eric in cc]

On Tue, 4 Jul 2023 at 05:58, Tushar Sugandhi
 wrote:
>
> The current Kernel behavior is IMA measurements snapshot is taken at
> kexec 'load' and not at kexec 'execute'.  IMA log is then carried
> over to the new Kernel after kexec 'execute'.
>
> Some devices can be configured to call kexec 'load' first, and followed
> by kexec 'execute' after some time. (as opposed to calling 'load' and
> 'execute' in one single kexec command).  In such scenario, if new IMA
> measurements are added between kexec 'load' and kexec 'execute', the
> TPM PCRs are extended with the IMA events between 'load' and 'execute';
> but those IMA events are not carried over to the new kernel after kexec
> soft reboot.  This results in mismatch between TPM PCR quotes and the
> actual IMA measurements list after the device boots into the new kexec
> image.  This mismatch results in the remote attestation failing for that
> device.
>
> This patch series proposes a solution to solve this problem by allocating
> the necessary buffer at kexec 'load' time, and populating the buffer
> with the IMA measurements at kexec 'execute' time.
>
> The solution includes:
>  - addition of new functionality to allocate a buffer to hold IMA
>measurements at kexec 'load',
>
>  - ima functionality to suspend and resume measurements as needed during
>buffer copy at kexec 'execute',
>
>  - ima functionality for mapping the measurement list from the current
>Kernel to the subsequent one,
>
>  - necessary changes to the kexec_file_load syscall, enabling it to call
>the ima functions
>
>  - registering a reboot notifier which gets called during kexec 'execute',
>
>  - and removal of deprecated functions.
>
> The modifications proposed in this series ensure the integrity of the ima
> measurements is preserved across kexec soft reboots, thus significantly
> improving the security of the Kernel post kexec soft reboots.
>
> There were previous attempts to fix this issue [1], [2], [3].  But they
> were not merged into the mainline Kernel.
>
> We took inspiration from the past work [1] and [2] while working on this
> patch series.
>
> References:
> ---
>
> [1] [PATHC v2 5/9] ima: on soft reboot, save the measurement list
> https://lore.kernel.org/lkml/1472596811-9596-6-git-send-email-zo...@linux.vnet.ibm.com/
>
> [2] PATCH v2 4/6] kexec_file: Add mechanism to update kexec segments.
> https://lkml.org/lkml/2016/8/16/577
>
> [3] [PATCH 1/6] kexec_file: Add buffer hand-over support
> https://lore.kernel.org/linuxppc-dev/1466473476-10104-6-git-send-email-bauer...@linux.vnet.ibm.com/T/
>
> Tushar Sugandhi (10):
>   ima: implement function to allocate buffer at kexec load
>   ima: implement function to populate buffer at kexec execute
>   ima: allocate buffer at kexec load to hold ima measurements
>   ima: implement functions to suspend and resume measurements
>   kexec: implement functions to map and unmap segment to kimage
>   ima: update buffer at kexec execute with ima measurements
>   ima: remove function ima_dump_measurement_list
>   ima: implement and register a reboot notifier function to update kexec
> buffer
>   ima: suspend measurements while the kexec buffer is being copied
>   kexec: update kexec_file_load syscall to call ima_kexec_post_load
>
>  include/linux/ima.h|   3 +
>  include/linux/kexec.h  |  13 ++
>  kernel/kexec_core.c|  72 +-
>  kernel/kexec_file.c|   7 +
>  kernel/kexec_internal.h|   1 +
>  security/integrity/ima/ima.h   |   4 +
>  security/integrity/ima/ima_kexec.c | 211 +++--
>  security/integrity/ima/ima_queue.c |  32 +
>  8 files changed, 295 insertions(+), 48 deletions(-)
>
> --
> 2.25.1
>
>
>
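One way to picture the load/execute split described in the cover letter: the hand-over buffer is sized when the image is loaded, and filled, with measurements suspended, when kexec actually executes. This is a toy userspace model with hypothetical names (struct meas_log, handover_fill_at_execute(), a caller-chosen headroom); it is not the IMA code from the series, and the real sizing and locking are more involved.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of a growing measurement log. */
struct meas_log {
	char data[4096];
	size_t len;
	bool suspended;		/* while true, new events are rejected */
};

/* Hypothetical hand-over buffer reserved at kexec 'load' time. */
struct handover_buf {
	char *mem;
	size_t size;		/* capacity fixed at load time */
	size_t used;
};

static bool log_append(struct meas_log *log, const char *ev, size_t n)
{
	if (log->suspended || log->len + n > sizeof(log->data))
		return false;
	memcpy(log->data + log->len, ev, n);
	log->len += n;
	return true;
}

/* At 'load': reserve room for the current log plus headroom, because
 * more events may be measured before 'execute'. */
static size_t handover_size_at_load(const struct meas_log *log, size_t headroom)
{
	return log->len + headroom;
}

/* At 'execute': stop measuring, then copy the log as it is now. */
static int handover_fill_at_execute(struct meas_log *log, struct handover_buf *hb)
{
	log->suspended = true;
	if (log->len > hb->size)
		return -1;	/* the log outgrew the reserved headroom */
	memcpy(hb->mem, log->data, log->len);
	hb->used = log->len;
	return 0;
}
```

Events measured between 'load' and 'execute' end up in the copy, which is exactly what the snapshot-at-load approach misses.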




Re: [PATCH v2] x86/kexec: Add EFI config table identity mapping for kexec kernel

2023-07-06 Thread Dave Young
On 07/05/23 at 07:33pm, Borislav Petkov wrote:
> On Thu, Jun 01, 2023 at 03:20:44PM +0800, Tao Liu wrote:
> > A kexec kernel bootup hang is observed on Intel Atom cpu due to unmapped
> 
> s/cpu/CPU/g
> 
> > EFI config table.
> > 
> > Currently EFI system table is identity-mapped for the kexec kernel, but EFI
> > config table is not mapped explicitly:
> 
> Why does the EFI config table *need* to be mapped explicitly?
> 
> > commit 6bbeb276b71f ("x86/kexec: Add the EFI system tables and ACPI
> >   tables to the ident map")
> > 
> > Later in the following 2 commits, EFI config table will be accessed when
> > enabling sev at kernel startup.
> 
> What does SEV have to do with an Intel problem?

I'm also curious; let's cc the author of the commits mentioned below.

> 
> > This may result in a page fault due to EFI
> > config table's unmapped address. Since the page fault occurs at an early
> > stage, it is unrecoverable and kernel hangs.
> > 
> > commit ec1c66af3a30 ("x86/compressed/64: Detect/setup SEV/SME features
> >   earlier during boot")
> > commit c01fce9cef84 ("x86/compressed: Add SEV-SNP feature
> >   detection/setup")
> > 
> > In addition, the issue doesn't appear on all systems, because the kexec
> > kernel uses Page Size Extension (PSE) for identity mapping. In most cases,
> > EFI config table can end up to be mapped into due to 1 GB page size.
> > However if nogbpages is set, or cpu doesn't support pdpe1gb feature
> > (e.g Intel Atom x6425RE cpu), EFI config table may not be mapped into
> > due to 2 MB page size, thus a page fault hang is more likely to happen.
> 
> This doesn't answer my question above.
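A small sketch of the page-size argument in the changelog: when only the page containing one table is identity-mapped, a second table becomes reachable by accident only if it falls into the same page. This models only the arithmetic, not kernel_ident_mapping_init() itself, and the addresses in the test are made up.

```c
#include <stdbool.h>
#include <stdint.h>

#define SZ_2M (1ULL << 21)
#define SZ_1G (1ULL << 30)

/* Round an address down to the start of its (power-of-two) page. */
static uint64_t page_base(uint64_t addr, uint64_t page_size)
{
	return addr & ~(page_size - 1);
}

/* If only the page holding 'mapped' is in the identity map, 'other'
 * is covered as a side effect iff both share that page. */
static bool covered_by_same_page(uint64_t mapped, uint64_t other, uint64_t page_size)
{
	return page_base(mapped, page_size) == page_base(other, page_size);
}
```

With a system table at 0x7f9f0000 and a config table at 0x7fa20000, the two share a 1 GiB page but not a 2 MiB page, which is consistent with the hang showing up only with nogbpages or on CPUs without pdpe1gb.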
> 
> > This patch will make sure the EFI config table is always mapped.
> 
> Avoid having "This patch" or "This commit" in the commit message. It is
> tautologically useless.
> 
> Also, do
> 
> $ git grep 'This patch' Documentation/process
> 
> for more details.
> 
> 
> > 
> > Signed-off-by: Tao Liu 
> > ---
> > Changes in v2:
> > - Rephrase the change log based on Baoquan's suggestion.
> > - Rename map_efi_sys_cfg_tab() to map_efi_tables().
> > - Link to v1: https://lore.kernel.org/kexec/20230525094914.23420-1-l...@redhat.com/
> > ---
> >  arch/x86/kernel/machine_kexec_64.c | 35 ++
> >  1 file changed, 31 insertions(+), 4 deletions(-)
> > 
> > diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
> > index 1a3e2c05a8a5..664aefa6e896 100644
> > --- a/arch/x86/kernel/machine_kexec_64.c
> > +++ b/arch/x86/kernel/machine_kexec_64.c
> > @@ -28,6 +28,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  
> >  #ifdef CONFIG_ACPI
> >  /*
> > @@ -86,10 +87,12 @@ const struct kexec_file_ops * const kexec_file_loaders[] = {
> >  #endif
> >  
> >  static int
> > -map_efi_systab(struct x86_mapping_info *info, pgd_t *level4p)
> > +map_efi_tables(struct x86_mapping_info *info, pgd_t *level4p)
> >  {
> >  #ifdef CONFIG_EFI
> > unsigned long mstart, mend;
> > +   void *kaddr;
> > +   int ret;
> >  
> > if (!efi_enabled(EFI_BOOT))
> > return 0;
> > @@ -105,6 +108,30 @@ map_efi_systab(struct x86_mapping_info *info, pgd_t *level4p)
> > if (!mstart)
> > return 0;
> >  
> > +   ret = kernel_ident_mapping_init(info, level4p, mstart, mend);
> > +   if (ret)
> > +   return ret;
> > +
> > +   kaddr = memremap(mstart, mend - mstart, MEMREMAP_WB);
> > +   if (!kaddr) {
> > +   pr_err("Could not map UEFI system table\n");
> > +   return -ENOMEM;
> > +   }
> > +
> > +   mstart = efi_config_table;
> 
> Yeah, about this, did you see efi_reuse_config() and the comment above
> it especially?
> 
> Or is it that the EFI in that box wants the config table mapped 1:1 and
> accesses it during boot/kexec?
> 
> In any case, this is all cloudy without a proper root cause.

efi_reuse_config() patches the SMBIOS table address in the EFI init path
during kexec kernel bootup, to work around some nasty firmware behavior.

It seems the SEV code searches for the table with EFI_CC_BLOB_GUID. In
theory that is safe, since it does not access the SMBIOS table here. But
to be safe, it would be better to test on an AMD SEV guest and check that
the EFI_CC_BLOB table address is untouched after the first kernel boots.
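To make the point about the SEV code concrete: a lookup by GUID only walks the configuration table entries and returns the matching vendor table address, without dereferencing unrelated tables such as SMBIOS. A minimal sketch with simplified, hypothetical types; the GUIDs in the test are placeholders, and the real EFI_CC_BLOB_GUID value is deliberately not reproduced here.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Byte-for-byte stand-in for a 16-byte EFI GUID. */
typedef struct { uint8_t b[16]; } efi_guid_sketch_t;

/* Shape of a 64-bit configuration table entry: a GUID followed by the
 * physical address of the vendor table it identifies. */
struct cfg_entry {
	efi_guid_sketch_t guid;
	uint64_t table;
};

/* Linear search by GUID; returns the table address, or 0 if absent. */
static uint64_t find_cfg_table(const struct cfg_entry *tab, size_t n,
			       const efi_guid_sketch_t *want)
{
	for (size_t i = 0; i < n; i++)
		if (memcmp(&tab[i].guid, want, sizeof(*want)) == 0)
			return tab[i].table;
	return 0;
}
```

Only the entry array itself has to be mapped for such a search, which is why the config table (but not every table it points to) matters for this hang.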

> 
> Also, I'd like for Ard to have a look at this too.
> 
> Thx.
> 
> -- 
> Regards/Gruss,
> Boris.
> 
> https://people.kernel.org/tglx/notes-about-netiquette
> 




Re: [PATCH 0/6] arm64: make kexec_file able to load zboot image

2023-03-21 Thread Dave Young
On Fri, 10 Mar 2023 at 12:18, Pingfan Liu  wrote:
>
> On Tue, Mar 07, 2023 at 04:08:55PM +0800, Pingfan Liu wrote:
> > Hi Ard,
> >
> > Thanks for sharing your idea. Please see the comment.
> >
> > On Mon, Mar 06, 2023 at 09:08:03AM +0100, Ard Biesheuvel wrote:
> > > (cc Mark)
> > >
> > > Hello Pingfan,
> > >
> > > Thanks for working on this.
> > >
> > > On Mon, 6 Mar 2023 at 04:03, Pingfan Liu  wrote:
> > > >
> > > > After introducing zboot image, kexec_file can not load and jump to the
> > > > new style image. Hence it demands a method to load the new kernel.
> > > >
> > > > The crux of the problem lies in when and how to decompress the Image.gz.
> > > > There are three possible courses to take: -1. in user space, but hard to
> > > > achieve due to the signature verification inside the kernel.
> > >
> > > That depends. The EFI zboot image encapsulates another PE/COFF image,
> > > which could be signed as well.
> > >
> > > So there are at least three other options here:
> > > - sign the encapsulated image with the same key as the zboot image
> > > - sign the encapsulated image with a key that is only valid for kexec boot
> > > - sign the encapsulated image with an ephemeral key that is only valid
> > > for a kexec'ing an image that was produced by the same kernel build
> > >
> > > >  -2. at the
> > > > boot time, let the efi_zboot_entry() handles it, which means a simulated
> > > > EFI service should be provided to that entry, especially about how to be
> > > > aware of the memory layout.
> > >
> > > This is actually an idea I intend to explore: with the EFI runtime
> > > services regions mapped 1:1, it wouldn't be too hard to implement a
> > > minimal environment that can run the zboot image under the previous
> >
> > The idea of the minimal environment lools amazing. After digging
> > more deeply into it, I think it means to implement most of the function
> > members in efi_boot_services, besides that, some UEFI protocols due to
> > the reference of efi_call_proto(). So a clear boundary between zboot and
> > its dependent EFI service is demanded before the work.
> >
>
> Looking deeper into it. This approach may be splitted into the following
> chunks:
> -1. Estimation the memory demanded by the decompression of zboot, which
> roughly includes the size of Image, the size of the emulated service and
> the stack used by zboot. Finally we need a kexec_add_buffer() for this
> range.
>
> -2. The emulated EFI services and some initial data such as the physical
> address of dtb, the usable memory start address and size should be set
> by kexec_purgatory_get_set_symbol()
>
> -3. Set up an identity mapping of the usable memory by zboot, prepare
> stack and turn on MMU at the last point just before 'br efi_zboot_entry'
> in relocate_kernel.S, which means relocate_kernel.S should support two
> kinds of payload.
>
> -4. For efi_zboot_entry(), if jumping from kexec, limit its requirement
> to only a few boot services: e.g. allocate_pages, allocate_pool. So the
> emulated services can be deduced.
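Chunk 1 above, sizing the single region handed to kexec_add_buffer(), is essentially page-rounded addition. A hypothetical sketch: the 4 KiB page size and the three inputs are assumptions, and real code would probably also need to honour any alignment requirements of the Image itself.

```c
#include <stdint.h>

#define PAGE_SZ 4096ULL

/* Round x up to a multiple of a; a must be a power of two. */
static uint64_t align_up(uint64_t x, uint64_t a)
{
	return (x + a - 1) & ~(a - 1);
}

/* Hypothetical estimate: decompressed Image + emulated EFI services +
 * stack for efi_zboot_entry(), each rounded up to whole pages. */
static uint64_t zboot_region_size(uint64_t image, uint64_t services, uint64_t stack)
{
	return align_up(image, PAGE_SZ) + align_up(services, PAGE_SZ) +
	       align_up(stack, PAGE_SZ);
}
```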

Hi Pingfan,

I'm not sure how hard it will be, although Ard thinks it could be
doable. If it is not easy, I suspect it is not worth the effort.

For your current series, my suggestion is to move most of the code
into the generic code path in kernel/kexec_file.c and keep the arch
code to a minimum, so that other arches can avoid redundant code in
the future.

Otherwise, a fallback solution could be to sign both the zboot image
and the inner kernel image with the same key, like below:
1. In distro kernels, sign the kernel twice with the same key (the
kernel image and the zboot image).
2. Introduce a Kconfig option in mainline to sign the kernel image with
an ephemeral key, the same way kernel modules are signed. Distros can
disable the option. (This way kexec can only load the same kernel, so
it is not useful if people want to load older/newer kernels.)
3. Patch kexec-tools to decompress the zboot image and load the inner
kernel image.
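For option 3, the first step in kexec-tools would be locating the compressed payload inside the zboot file. A sketch under stated assumptions: it takes the header layout from the kernel's EFI zboot stub ("MZ" at 0, "zimg" at 4, then the payload offset and size as little-endian u32 values at 8 and 12), and zboot_payload() is a hypothetical helper. Decompression itself is left out.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Read a little-endian 32-bit value without alignment assumptions. */
static uint32_t get_le32(const unsigned char *p)
{
	return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
	       (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

/* Find the compressed payload in a zboot image (assumed layout).
 * Returns 0 and fills *off / *len on success, -1 otherwise. */
static int zboot_payload(const unsigned char *img, size_t img_len,
			 size_t *off, size_t *len)
{
	uint32_t o, l;

	if (img_len < 16 || memcmp(img, "MZ", 2) != 0 ||
	    memcmp(img + 4, "zimg", 4) != 0)
		return -1;
	o = get_le32(img + 8);
	l = get_le32(img + 12);
	/* The payload must lie entirely inside the file. */
	if (o > img_len || l > img_len - o)
		return -1;
	*off = o;
	*len = l;
	return 0;
}
```

The bounds check matters because the header comes from an untrusted file; the decompressed result would then go through the normal Image probe and, if signed, the usual signature check.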


>
> > > kernel up to the point where it call ExitBootServices(), after which
> > > kexec() would take over.
> > >
> >
> > IIUC, after kexec switches to efi_zboot_entry(), it will not return,
> > right?
> >
>
> I have this assumption because letting the control path switch between
> kernel and non-kernel code is not a good idea.
>
>
> Thanks,
>
> Pingfan
>
> > > >  -3. in kernel space, during the file load
> > > > of the zboot image. At that point, the kernel masters the whole memory
> > > > information, and easily allocates a suitable memory for the decompressed
> > > > kernel image. (I think this is similar to what grub does today).
> > > >
> > >
> > > GRUB just calls LoadImage(), and the decompression code runs in the EFI context.
> > >
> >
> > Ah, thanks for the correcting. I had made an wrong assumption of grub
> > based on [1], from which, I thought that grub is the case "For
> > compatibility with non-EFI loaders, the payload can be decompressed and
> > executed by the loader as well, provided that the loader 

Re: Bug: kexec on Lenovo ThinkPad T480 disables EFI mode

2022-11-07 Thread Dave Young
On Mon, 7 Nov 2022 at 15:55, Ard Biesheuvel  wrote:
>
> On Mon, 7 Nov 2022 at 08:40, Dave Young  wrote:
> >
> > On Mon, 7 Nov 2022 at 15:36, Dave Young  wrote:
> > >
> > > Hi Ard,
> > >
> > > On Mon, 7 Nov 2022 at 15:30, Ard Biesheuvel  wrote:
> > > >
> > > > On Mon, 7 Nov 2022 at 07:55, Dave Young  wrote:
> > > > >
> > > > > Hi,
> > > > >
> > > > > On Sat, 5 Nov 2022 at 22:16,  wrote:
> > > > > >
> > > > > > On 2022-11-05 05:49, Dave Young wrote:
> > > > > > > Baoquan, thanks for cc me.
> > > > > > >
> > > > > > > On Sat, 5 Nov 2022 at 11:10, Baoquan He  wrote:
> > > > > > >>
> > > > > > >> Add Dave to CC
> > > > > > >>
> > > > > > >> On 10/28/22 at 01:02pm, n...@tfwno.gf wrote:
> > > > > > >> > Greetings,
> > > > > > >> >
> > > > > > >> > I've been hitting a bug on my Lenovo ThinkPad T480 where kexecing will
> > > > > > >> > cause EFI mode (if that's the right term for it) to be unconditionally
> > > > > > >> > disabled, even when not using the --noefi option to kexec.
> > > > > > >> >
> > > > > > >> > What I mean by "EFI mode" being disabled, more than just EFI runtime
> > > > > > >> > services, is that basically nothing about the system's EFI is visible
> > > > > > >> > post-kexec. Normally you have a message like this in dmesg when the
> > > > > > >> > system is booted in EFI mode:
> > > > > > >> >
> > > > > > >> > [0.00] efi: EFI v2.70 by EDK II
> > > > > > >> > [0.00] efi: SMBIOS=0x7f98a000 ACPI=0x7fb7e000 ACPI 2.0=0x7fb7e014
> > > > > > >> > MEMATTR=0x7ec63018
> > > > > > >> > (obviously not the real firmware of the machine I'm talking about, but I
> > > > > > >> > can also send that if it would be of any help)
> > > > > > >> >
> > > > > > >> > No such message pops up in my dmesg as a result of this bug, & this
> > > > > > >> > causes some fallout like being unable to find the system's DMI
> > > > > > >> > information:
> > > > > > >> >
> > > > > > >> > <6>[0.00] DMI not present or invalid.
> > > > > > >> >
> > > > > > >> > The efivarfs module also fails to load with -ENODEV.
> > > > > > >> >
> > > > > > >> > I've tried also booting with efi=runtime explicitly but it doesn't
> > > > > > >> > change anything. The kernel still does not print the name of the EFI
> > > > > > >> > firmware, DMI is still missing, & efivarfs still fails to load.
> > > > > > >> >
> > > > > > >> > I've been using the kexec_load syscall for all these tests, if it's
> > > > > > >> > important.
> > > > > > >> >
> > > > > > >> > Also, to make it very clear, all this only ever happens post-kexec. When
> > > > > > >> > booting straight from UEFI (with the EFI stub), all the aforementioned
> > > > > > >> > stuff that fails works perfectly fine (i.e. name of firmware is printed,
> > > > > > >> > DMI is properly found, & efivarfs loads & mounts just fine).
> > > > > > >> >
> > > > > > >> > This is reproducible with a vanilla 6.1-rc2 kernel. I've been trying to
> > > >

Re: Bug: kexec on Lenovo ThinkPad T480 disables EFI mode

2022-11-06 Thread Dave Young
On Mon, 7 Nov 2022 at 15:36, Dave Young  wrote:
>
> Hi Ard,
>
> On Mon, 7 Nov 2022 at 15:30, Ard Biesheuvel  wrote:
> >
> > On Mon, 7 Nov 2022 at 07:55, Dave Young  wrote:
> > >
> > > Hi,
> > >
> > > On Sat, 5 Nov 2022 at 22:16,  wrote:
> > > >
> > > > On 2022-11-05 05:49, Dave Young wrote:
> > > > > Baoquan, thanks for cc me.
> > > > >
> > > > > On Sat, 5 Nov 2022 at 11:10, Baoquan He  wrote:
> > > > >>
> > > > >> Add Dave to CC
> > > > >>
> > > > >> On 10/28/22 at 01:02pm, n...@tfwno.gf wrote:
> > > > >> > Greetings,
> > > > >> >
> > > > >> > I've been hitting a bug on my Lenovo ThinkPad T480 where kexecing 
> > > > >> > will
> > > > >> > cause EFI mode (if that's the right term for it) to be 
> > > > >> > unconditionally
> > > > >> > disabled, even when not using the --noefi option to kexec.
> > > > >> >
> > > > >> > What I mean by "EFI mode" being disabled, more than just EFI 
> > > > >> > runtime
> > > > >> > services, is that basically nothing about the system's EFI is 
> > > > >> > visible
> > > > >> > post-kexec. Normally you have a message like this in dmesg when the
> > > > >> > system is booted in EFI mode:
> > > > >> >
> > > > >> > [0.00] efi: EFI v2.70 by EDK II
> > > > >> > [0.00] efi: SMBIOS=0x7f98a000 ACPI=0x7fb7e000 ACPI 
> > > > >> > 2.0=0x7fb7e014
> > > > >> > MEMATTR=0x7ec63018
> > > > >> > (obviously not the real firmware of the machine I'm talking about, 
> > > > >> > but I
> > > > >> > can also send that if it would be of any help)
> > > > >> >
> > > > >> > No such message pops up in my dmesg as a result of this bug, & this
> > > > >> > causes some fallout like being unable to find the system's DMI
> > > > >> > information:
> > > > >> >
> > > > >> > <6>[0.00] DMI not present or invalid.
> > > > >> >
> > > > >> > The efivarfs module also fails to load with -ENODEV.
> > > > >> >
> > > > >> > I've tried also booting with efi=runtime explicitly but it doesn't
> > > > > >> > change anything. The kernel still does not print the name of the EFI
> > > > > >> > firmware, DMI is still missing, & efivarfs still fails to load.
> > > > >> >
> > > > >> > I've been using the kexec_load syscall for all these tests, if it's
> > > > >> > important.
> > > > >> >
> > > > > >> > Also, to make it very clear, all this only ever happens post-kexec. When
> > > > > >> > booting straight from UEFI (with the EFI stub), all the aforementioned
> > > > > >> > stuff that fails works perfectly fine (i.e. name of firmware is printed,
> > > > > >> > DMI is properly found, & efivarfs loads & mounts just fine).
> > > > >> >
> > > > > >> > This is reproducible with a vanilla 6.1-rc2 kernel. I've been trying to
> > > > > >> > bisect it, but it seems like it goes pretty far back. I've got vanilla
> > > > > >> > mainline kernel builds dating back to 5.17 that have the exact same
> > > > > >> > issue. It might be worth noting that during this testing, I made sure
> > > > > >> > the version of the kernel being kexeced & the kernel kexecing were the
> > > > > >> > same version. It may not have been a problem in older kernels, but that
> > > > > >> > would be difficult to test for me (a pretty important driver for this
> > > > > >> > machine was only merged during v5.17-rc4). So it may not have been a
> > > > > >> > regression & just a hidden prob

Re: Bug: kexec on Lenovo ThinkPad T480 disables EFI mode

2022-11-06 Thread Dave Young
Hi Ard,

On Mon, 7 Nov 2022 at 15:30, Ard Biesheuvel  wrote:
>
> On Mon, 7 Nov 2022 at 07:55, Dave Young  wrote:
> >
> > Hi,
> >
> > On Sat, 5 Nov 2022 at 22:16,  wrote:
> > >
> > > On 2022-11-05 05:49, Dave Young wrote:
> > > > Baoquan, thanks for cc me.
> > > >
> > > > On Sat, 5 Nov 2022 at 11:10, Baoquan He  wrote:
> > > >>
> > > >> Add Dave to CC
> > > >>
> > > >> On 10/28/22 at 01:02pm, n...@tfwno.gf wrote:
> > > >> > Greetings,
> > > >> >
> > > >> > I've been hitting a bug on my Lenovo ThinkPad T480 where kexecing will
> > > >> > cause EFI mode (if that's the right term for it) to be unconditionally
> > > >> > disabled, even when not using the --noefi option to kexec.
> > > >> >
> > > >> > What I mean by "EFI mode" being disabled, more than just EFI runtime
> > > >> > services, is that basically nothing about the system's EFI is visible
> > > >> > post-kexec. Normally you have a message like this in dmesg when the
> > > >> > system is booted in EFI mode:
> > > >> >
> > > >> > [0.00] efi: EFI v2.70 by EDK II
> > > >> > [0.00] efi: SMBIOS=0x7f98a000 ACPI=0x7fb7e000 ACPI 2.0=0x7fb7e014
> > > >> > MEMATTR=0x7ec63018
> > > >> > (obviously not the real firmware of the machine I'm talking about, but I
> > > >> > can also send that if it would be of any help)
> > > >> >
> > > >> > No such message pops up in my dmesg as a result of this bug, & this
> > > >> > causes some fallout like being unable to find the system's DMI
> > > >> > information:
> > > >> >
> > > >> > <6>[0.00] DMI not present or invalid.
> > > >> >
> > > >> > The efivarfs module also fails to load with -ENODEV.
> > > >> >
> > > >> > I've tried also booting with efi=runtime explicitly but it doesn't
> > > >> > change anything. The kernel still does not print the name of the EFI
> > > >> > firmware, DMI is still missing, & efivarfs still fails to load.
> > > >> >
> > > >> > I've been using the kexec_load syscall for all these tests, if it's
> > > >> > important.
> > > >> >
> > > >> > Also, to make it very clear, all this only ever happens post-kexec. When
> > > >> > booting straight from UEFI (with the EFI stub), all the aforementioned
> > > >> > stuff that fails works perfectly fine (i.e. name of firmware is printed,
> > > >> > DMI is properly found, & efivarfs loads & mounts just fine).
> > > >> >
> > > >> > This is reproducible with a vanilla 6.1-rc2 kernel. I've been trying to
> > > >> > bisect it, but it seems like it goes pretty far back. I've got vanilla
> > > >> > mainline kernel builds dating back to 5.17 that have the exact same
> > > >> > issue. It might be worth noting that during this testing, I made sure
> > > >> > the version of the kernel being kexeced & the kernel kexecing were the
> > > >> > same version. It may not have been a problem in older kernels, but that
> > > >> > would be difficult to test for me (a pretty important driver for this
> > > >> > machine was only merged during v5.17-rc4). So it may not have been a
> > > >> > regression & just a hidden problem since time immemorial.
> > > >> >
> > > >> > I am willing to test any patches I may get to further debug or fix
> > > >> > this issue, preferably based on the current state of torvalds/linux.git.
> > > >> > I can build & test kernels quite a few times per day.
> > > >> >
> > > >> > I can also send any important materials (kernel config, dmesg, firmware
> > > >> > information, so on & so forth) on request. I'll also just mention I'm
&

Re: Bug: kexec on Lenovo ThinkPad T480 disables EFI mode

2022-11-06 Thread Dave Young
Hi,

On Sat, 5 Nov 2022 at 22:16,  wrote:
>
> On 2022-11-05 05:49, Dave Young wrote:
> > Baoquan, thanks for cc me.
> >
> > On Sat, 5 Nov 2022 at 11:10, Baoquan He  wrote:
> >>
> >> Add Dave to CC
> >>
> >> On 10/28/22 at 01:02pm, n...@tfwno.gf wrote:
> >> > Greetings,
> >> >
> >> > I've been hitting a bug on my Lenovo ThinkPad T480 where kexecing will
> >> > cause EFI mode (if that's the right term for it) to be unconditionally
> >> > disabled, even when not using the --noefi option to kexec.
> >> >
> >> > What I mean by "EFI mode" being disabled, more than just EFI runtime
> >> > services, is that basically nothing about the system's EFI is visible
> >> > post-kexec. Normally you have a message like this in dmesg when the
> >> > system is booted in EFI mode:
> >> >
> >> > [0.00] efi: EFI v2.70 by EDK II
> >> > [0.00] efi: SMBIOS=0x7f98a000 ACPI=0x7fb7e000 ACPI 2.0=0x7fb7e014
> >> > MEMATTR=0x7ec63018
> >> > (obviously not the real firmware of the machine I'm talking about, but I
> >> > can also send that if it would be of any help)
> >> >
> >> > No such message pops up in my dmesg as a result of this bug, & this
> >> > causes some fallout like being unable to find the system's DMI
> >> > information:
> >> >
> >> > <6>[0.00] DMI not present or invalid.
> >> >
> >> > The efivarfs module also fails to load with -ENODEV.
> >> >
> >> > I've tried also booting with efi=runtime explicitly but it doesn't
> >> > change anything. The kernel still does not print the name of the EFI
> >> > firmware, DMI is still missing, & efivarfs still fails to load.
> >> >
> >> > I've been using the kexec_load syscall for all these tests, if it's
> >> > important.
> >> >
> >> > Also, to make it very clear, all this only ever happens post-kexec. When
> >> > booting straight from UEFI (with the EFI stub), all the aforementioned
> >> > stuff that fails works perfectly fine (i.e. name of firmware is printed,
> >> > DMI is properly found, & efivarfs loads & mounts just fine).
> >> >
> >> > This is reproducible with a vanilla 6.1-rc2 kernel. I've been trying to
> >> > bisect it, but it seems like it goes pretty far back. I've got vanilla
> >> > mainline kernel builds dating back to 5.17 that have the exact same
> >> > issue. It might be worth noting that during this testing, I made sure
> >> > the version of the kernel being kexeced & the kernel kexecing were the
> >> > same version. It may not have been a problem in older kernels, but that
> >> > would be difficult to test for me (a pretty important driver for this
> >> > machine was only merged during v5.17-rc4). So it may not have been a
> >> > regression & just a hidden problem since time immemorial.
> >> >
> >> > I am willing to test any patches I may get to further debug or fix
> >> > this issue, preferably based on the current state of torvalds/linux.git.
> >> > I can build & test kernels quite a few times per day.
> >> >
> >> > I can also send any important materials (kernel config, dmesg, firmware
> >> > information, so on & so forth) on request. I'll also just mention I'm
> >> > using kexec-tools 2.0.24 upfront, if it matters.
> >
> > Can you check the efi runtime in sysfs:
> > ls /sys/firmware/efi/runtime-map/
> >
> > If nothing then maybe you did not enable CONFIG_EFI_RUNTIME_MAP=y, it
> > is needed for kexec UEFI boot on x86_64.
>
> Oh my, it really is that simple.
>
> Indeed, enabling this in the pre-kexec kernel fixes it all up. I had
> blindly disabled it in my quest to downsize the pre-kexec kernel to
> reduce boot time (it only runs a bootloader). In hindsight, the firmware
> drivers section is not really a good section to tweak on a whim.
>
> I'm terribly sorry to have taken your time to "fix" this "bug". But I
> must ask, is there any reason why this is a visible config option, or at
> least not gated behind CONFIG_EXPERT? drivers/firmware/efi/runtime-map.c
> is pretty tiny, & considering it depends on CONFIG_KEXEC_CORE, one
> probably wants to have kexec work properly if they can even enable it.

Glad to hear it works after the .config tweak. I cannot recall any
reason for that, though.

Since it sits in the EFI code path, let's see what Ard thinks about
your proposal.
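The fix here was simply re-enabling CONFIG_EFI_RUNTIME_MAP in the pre-kexec kernel. A minimal sketch, assuming the usual .config text format ("CONFIG_FOO=y" or "# CONFIG_FOO is not set"), of how a build script could catch that option being dropped before deployment; the helper kconfig_is_enabled() is hypothetical:

```c
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper: return true if the kernel .config text in `config`
 * sets the bool option `name` (e.g. "CONFIG_EFI_RUNTIME_MAP") to y.
 * "# CONFIG_FOO is not set", or no line at all, yields false.
 */
bool kconfig_is_enabled(const char *config, const char *name)
{
	size_t name_len = strlen(name);
	const char *line = config;

	while (line && *line) {
		/* Match "NAME=" exactly so CONFIG_FOO never matches CONFIG_FOO_BAR. */
		if (strncmp(line, name, name_len) == 0 && line[name_len] == '=')
			return line[name_len + 1] == 'y' &&
			       (line[name_len + 2] == '\n' ||
				line[name_len + 2] == '\0');
		line = strchr(line, '\n');
		if (line)
			line++;	/* move to the start of the next line */
	}
	return false;
}
```

Read the .config into a string and call kconfig_is_enabled(text, "CONFIG_EFI_RUNTIME_MAP"); requiring the '=' right after the name keeps CONFIG_EFI from matching CONFIG_EFI_RUNTIME_MAP.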

Thanks
Dave




Re: Bug: kexec on Lenovo ThinkPad T480 disables EFI mode

2022-11-04 Thread Dave Young
Baoquan, thanks for cc me.

On Sat, 5 Nov 2022 at 11:10, Baoquan He  wrote:
>
> Add Dave to CC
>
> On 10/28/22 at 01:02pm, n...@tfwno.gf wrote:
> > Greetings,
> >
> > I've been hitting a bug on my Lenovo ThinkPad T480 where kexecing will
> > cause EFI mode (if that's the right term for it) to be unconditionally
> > disabled, even when not using the --noefi option to kexec.
> >
> > What I mean by "EFI mode" being disabled, more than just EFI runtime
> > services, is that basically nothing about the system's EFI is visible
> > post-kexec. Normally you have a message like this in dmesg when the
> > system is booted in EFI mode:
> >
> > [0.00] efi: EFI v2.70 by EDK II
> > [0.00] efi: SMBIOS=0x7f98a000 ACPI=0x7fb7e000 ACPI 2.0=0x7fb7e014
> > MEMATTR=0x7ec63018
> > (obviously not the real firmware of the machine I'm talking about, but I
> > can also send that if it would be of any help)
> >
> > No such message pops up in my dmesg as a result of this bug, & this
> > causes some fallout like being unable to find the system's DMI
> > information:
> >
> > <6>[0.00] DMI not present or invalid.
> >
> > The efivarfs module also fails to load with -ENODEV.
> >
> > I've tried also booting with efi=runtime explicitly but it doesn't
> > change anything. The kernel still does not print the name of the EFI
> > firmware, DMI is still missing, & efivarfs still fails to load.
> >
> > I've been using the kexec_load syscall for all these tests, if it's
> > important.
> >
> > Also, to make it very clear, all this only ever happens post-kexec. When
> > booting straight from UEFI (with the EFI stub), all the aforementioned
> > stuff that fails works perfectly fine (i.e. name of firmware is printed,
> > DMI is properly found, & efivarfs loads & mounts just fine).
> >
> > This is reproducible with a vanilla 6.1-rc2 kernel. I've been trying to
> > bisect it, but it seems like it goes pretty far back. I've got vanilla
> > mainline kernel builds dating back to 5.17 that have the exact same
> > issue. It might be worth noting that during this testing, I made sure
> > the version of the kernel being kexeced & the kernel kexecing were the
> > same version. It may not have been a problem in older kernels, but that
> > would be difficult to test for me (a pretty important driver for this
> > machine was only merged during v5.17-rc4). So it may not have been a
> > regression & just a hidden problem since time immemorial.
> >
> > I am willing to test any patches I may get to further debug or fix
> > this issue, preferably based on the current state of torvalds/linux.git.
> > I can build & test kernels quite a few times per day.
> >
> > I can also send any important materials (kernel config, dmesg, firmware
> > information, so on & so forth) on request. I'll also just mention I'm
> > using kexec-tools 2.0.24 upfront, if it matters.

Can you check the EFI runtime map in sysfs:
ls /sys/firmware/efi/runtime-map/

If it is empty or missing, maybe you did not enable
CONFIG_EFI_RUNTIME_MAP=y; it is needed for kexec UEFI boot on x86_64.

Otherwise, you can add a debug printf to the EFI error path in
kexec-tools to see what is wrong: see the function setup_efi_data() in
kexec/arch/i386/x86-linux-setup.c.

And if it still does not work, please post your kernel config. I can
give it a try, although I do not have a T480 now.
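The sysfs check above can also be done from a program. A minimal sketch that counts the entries under a directory such as /sys/firmware/efi/runtime-map; the helper count_dir_entries() is hypothetical, and a result of -1 or 0 is only a hint that the map was not exported, not a full diagnosis:

```c
#include <dirent.h>
#include <string.h>

/*
 * Hypothetical helper: count the entries in `path`, ignoring "." and "..".
 * Returns -1 if the directory cannot be opened, for instance because the
 * kernel was built without CONFIG_EFI_RUNTIME_MAP.
 */
int count_dir_entries(const char *path)
{
	DIR *dir = opendir(path);
	struct dirent *ent;
	int count = 0;

	if (!dir)
		return -1;
	while ((ent = readdir(dir)) != NULL) {
		if (strcmp(ent->d_name, ".") == 0 ||
		    strcmp(ent->d_name, "..") == 0)
			continue;
		count++;
	}
	closedir(dir);
	return count;
}
```

Calling count_dir_entries("/sys/firmware/efi/runtime-map") on the pre-kexec kernel gives the same answer as the ls command, in a form a script or test can check.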


> >
> > Regards,
> >
> >
>




Re: [PATCH RFC 1/2] coding-style.rst: document BUG() and WARN() rules ("do not crash the kernel")

2022-08-28 Thread Dave Young
Hi David,

On Sat, 27 Aug 2022 at 01:02, David Hildenbrand  wrote:
>
> On 26.08.22 03:43, Dave Young wrote:
> > Hi David,
> >
> > [Added more people in cc]
> >
>
> Hi Dave,
>
> thanks for your input!

You are welcome :)

>
> [...]
>
> >> Side note: especially with kdump() I feel like we might see much more
> >> widespread use of panic_on_warn to be able to actually extract debug
> >> information in a controlled manner -- for example on enterprise distros.
> >> ... which would then make these systems more likely to crash, because
> >> there is no way to distinguish a rather harmless warning from a severe
> >> warning :/ . But let's see if some kdump() folks will share their
> >> opinion as reply to the cover letter.
> >
> > I can understand the intention of this patch, and I totally agree that
> > BUG() should be used carefully, this is a good proposal if we can
> > clearly define the standard about when to use BUG().  But I do have
>
> Essentially, the general rule from Linus is "absolutely no new BUG_ON()
> calls ever" -- but I think the consensus in that thread was that there
> are corner cases when it comes to unavoidable data corruption/security
> issues. And these are rare cases, not the usual case where we'd have
> used BUG_ON()/VM_BUG_ON().

Yes, probably. (I say "probably" because those cases are sometimes
hidden and not clear.)

>
> > some worries,  I think this standard is different for different sub
> > components, it is not clear to me at least,  so this may introduce an
> > unstable running kernel and cause troubles (eg. data corruption) with
> > a WARN instead of a BUG. Probably it would be better to say "Do not
> > WARN lightly, and do not hesitate to use BUG if it is really needed"?
>
>
> Well, I don't make the rules, I document them and share them for general
> awareness/comments :) Documenting this is valuable, because there seem
> to be quite some different opinions floating around in the community --
> and I've been learning different rules from different people over the years.

Understood.

>
> >
> > About "panic_on_warn", it will depend on the admin/end user to set it,
> > it is not a good idea for distribution to set it. It seems we are
> > leaving it to end users to take the risk of a kernel panic even with
> > all kernel WARN even if it is sometimes not necessary.
>
> My question would be what we could add/improve to keep systems with
> kdump armed running as expected for end users, that is most probably:
>
> 1) don't crash on harmless WARN() that can just be reported and the
>    machine will continue running mostly fine without real issues.
> 2) crash on severe issues (previously BUG) such that we can properly
>    capture a system dump via kdump. Then restart the machine.
>
> Of course, once one would run into 2), one could try reproducing with
> "panic_on_warn" to get a reasonable system dump. But I guess that's not
> what enterprise customers expect.
>

Sometimes the bug cannot easily be reproduced again, so there seems to
be no easy and good way to use it.

>
> One wild idea (in the cover letter) was to add something new that can be
> configured by user space and that expresses that something is more
> severe than just some warning that can be recovered easily. But it can
> eventually be recovered to keep the system running to some degree. But
> still, it's configurable if we want to trigger a panic or let the system
> run.
>
> John mentioned PANIC_ON().
>

I would vote for PANIC_ON(). It sounds like a good idea because the
effect of BUG_ON() is not obvious, whereas PANIC_ON() alerts the code
author that it will cause a kernel panic, so one will be more careful
before using it.
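PANIC_ON() does not exist in the kernel; it is only a proposal in this thread. A userspace sketch of the idea, assuming a pluggable panic handler (panic_hook and default_panic() are hypothetical) so the behaviour can be observed without actually aborting:

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical hook called when a PANIC_ON() condition fires. */
static void (*panic_hook)(const char *file, int line, const char *cond);

/* Fallback when no hook is installed: report and abort the process. */
static void default_panic(const char *file, int line, const char *cond)
{
	fprintf(stderr, "panic at %s:%d: %s\n", file, line, cond);
	abort();
}

/*
 * Hypothetical PANIC_ON(): unlike BUG_ON(), the name states what happens
 * when `cond` is true. The condition is evaluated exactly once.
 */
#define PANIC_ON(cond)							\
	do {								\
		if (cond)						\
			(panic_hook ? panic_hook : default_panic)(	\
				__FILE__, __LINE__, #cond);		\
	} while (0)
```

Installing a recording function as panic_hook, as a test harness might, shows the macro firing only when its condition is true.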

>
> What would be your expectation for kdump users under which conditions we
> want to trigger kdump and when not?
>
> Regarding panic_on_warn, how often do e.g., RHEL users observe warnings
> that we're not able to catch during testing, such that "panic_on_warn"
> would be a real no-go?

Well, I'm not sure how to answer these questions. When to panic should
be decided by kernel developers rather than kdump users, but I think
the panic behaviour does affect the support team. I added Stephen, who
is on the RH support team; maybe he can give some input.

BTW, I vaguely remember that Prarit introduced panic_on_warn; let's see
if he has any comments here.

Thanks
Dave



>
> --
> Thanks,
>
> David / dhildenb
>




Re: [PATCH RFC 1/2] coding-style.rst: document BUG() and WARN() rules ("do not crash the kernel")

2022-08-25 Thread Dave Young
Hi David,

[Added more people in cc]

On Thu, 25 Aug 2022 at 20:13, David Hildenbrand  wrote:
>
> On 24.08.22 23:59, John Hubbard wrote:
> > On 8/24/22 09:30, David Hildenbrand wrote:
> > >> diff --git a/Documentation/process/coding-style.rst b/Documentation/process/coding-style.rst
> >> index 03eb53fd029a..a6d81ff578fe 100644
> >> --- a/Documentation/process/coding-style.rst
> >> +++ b/Documentation/process/coding-style.rst
> >> @@ -1186,6 +1186,33 @@ expression used.  For instance:
> >>  #endif /* CONFIG_SOMETHING */
> >>
> >
> > I like the idea of adding this documentation, and this is the right
> > place. Naturally, if one likes something, one must immediately change
> > it. :) Therefore, here is an alternative writeup that I think captures
> > what you and the email threads were saying.
> >
> > How's this sound?
>
> Much better, thanks! :)
>
> >
> > diff --git a/Documentation/process/coding-style.rst b/Documentation/process/coding-style.rst
> > index 03eb53fd029a..32df0d503388 100644
> > --- a/Documentation/process/coding-style.rst
> > +++ b/Documentation/process/coding-style.rst
> > @@ -1185,6 +1185,53 @@ expression used.  For instance:
> > ...
> > #endif /* CONFIG_SOMETHING */
> >
> > +22) Do not crash the kernel
> > +---------------------------
> > +
> > +Use WARN() rather than BUG()
> > +****************************
> > +
> > +Do not add new code that uses any of the BUG() variants, such as BUG(),
> > +BUG_ON(), or VM_BUG_ON(). Instead, use a WARN*() variant, preferably
> > +WARN_ON_ONCE(), and possibly with recovery code. Recovery code is not required
> > +if there is no reasonable way to at least partially recover.
>
> I'll tend to keep in this section:
>
> "Unavoidable data corruption / security issues might be a very rare
> exception to this rule and need good justification."
>
> Because there are rare exceptions, and I'd much rather document the
> clear exception to this rule.
>
> > +
> > +Use WARN_ON_ONCE() rather than WARN() or WARN_ON()
> > +**************************************************
> > +
> > +WARN_ON_ONCE() is generally preferred over WARN() or WARN_ON(), because it is
> > +common for a given warning condition, if it occurs at all, to occur multiple
> > +times. (For example, once per file, or once per struct page.) This can fill up
>
> I'll drop the "For example" part. I feel like this doesn't really need
> an example -- most probably we've all been there already when the kernel
> log was flooded :)
>
> > +and wrap the kernel log, and can even slow the system enough that the excessive
> > +logging turns into its own, additional problem.
> > +
> > +Do not WARN lightly
> > +*******************
> > +
> > +WARN*() is intended for unexpected, this-should-never-happen situations. WARN*()
> > +macros are not to be used for anything that is expected to happen during normal
> > +operation. These are not pre- or post-condition asserts, for example. Again:
> > +WARN*() must not be used for a condition that is expected to trigger easily, for
> > +example, by user space actions. pr_warn_once() is a possible alternative, if you
> > +need to notify the user of a problem.
> > +
> > +Do not worry about panic_on_warn users
> > +**************************************
> > +
> > +A few more words about panic_on_warn: Remember that ``panic_on_warn`` is an
> > +available kernel option, and that many users set this option. This is why there
> > +is a "Do not WARN lightly" writeup, above. However, the existence of
> > +panic_on_warn users is not a valid reason to avoid the judicious use WARN*().
> > +That is because, whoever enables panic_on_warn has explicitly asked the kernel
> > +to crash if a WARN*() fires, and such users must be prepared to deal with the
> > +consequences of a system that is somewhat more likely to crash.
>
> Side note: especially with kdump() I feel like we might see much more
> widespread use of panic_on_warn to be able to actually extract debug
> information in a controlled manner -- for example on enterprise distros.
> ... which would then make these systems more likely to crash, because
> there is no way to distinguish a rather harmless warning from a severe
> warning :/ . But let's see if some kdump() folks will share their
> opinion as reply to the cover letter.

I can understand the intention of this patch, and I totally agree that
BUG() should be used carefully. This is a good proposal if we can
clearly define the standard for when to use BUG(). But I do have
some worries: I think this standard differs between subcomponents, and
it is not clear to me at least, so this may leave the kernel running in
an unstable state and cause trouble (e.g. data corruption) when a WARN
is used instead of a BUG. Perhaps it would be better to say "Do not
WARN lightly, and do not hesitate to use BUG if it is really needed"?
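The proposal under discussion prefers WARN_ON_ONCE() plus recovery code over BUG_ON(). A userspace sketch of those once-per-call-site semantics; WARN_ON_ONCE_SKETCH(), warnings_printed and sum_nonnegative() are hypothetical names, this is not the kernel's implementation, and it relies on the GCC/Clang statement-expression extension:

```c
#include <stdbool.h>
#include <stdio.h>

/* Number of warnings actually printed, to make the rate limiting visible. */
static unsigned int warnings_printed;

/*
 * Prints only the first time `cond` is true at this call site, but always
 * evaluates to the truth value of `cond` so the caller can still recover.
 */
#define WARN_ON_ONCE_SKETCH(cond) ({					\
	static bool wo_warned;						\
	bool wo_ret = !!(cond);						\
	if (wo_ret && !wo_warned) {					\
		wo_warned = true;					\
		warnings_printed++;					\
		fprintf(stderr, "WARNING at %s:%d: %s\n",		\
			__FILE__, __LINE__, #cond);			\
	}								\
	wo_ret;								\
})

/* Example caller: warn once, then recover by skipping the bad value. */
static int sum_nonnegative(const int *vals, int n)
{
	int sum = 0;

	for (int i = 0; i < n; i++) {
		if (WARN_ON_ONCE_SKETCH(vals[i] < 0))
			continue;	/* recovery: ignore the unexpected value */
		sum += vals[i];
	}
	return sum;
}
```

The caller keeps running after a bad value and only the first occurrence at the call site is logged, which addresses the log-flooding concern in the patch; whether continuing is actually safe is the judgement call Dave raises above.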

About "panic_on_warn", it will depend on the admin/end user to set it,
it 

Re: [Crash-utility] [PATCH 0/5] Fixups to work with crash tool

2022-07-22 Thread Dave Young
Hi,

On Sun, 17 Jul 2022 at 18:13, Xianting Tian
 wrote:
>
> I ever sent the patch 1,2 in the link:
> https://patchwork.kernel.org/project/linux-riscv/patch/20220708073150.352830-2-xianting.t...@linux.alibaba.com/
> https://patchwork.kernel.org/project/linux-riscv/patch/20220708073150.352830-3-xianting.t...@linux.alibaba.com/
> And patch 3,4 in the link:
> https://patchwork.kernel.org/project/linux-riscv/patch/20220714113300.367854-2-xianting.t...@linux.alibaba.com/
> https://patchwork.kernel.org/project/linux-riscv/patch/20220714113300.367854-3-xianting.t...@linux.alibaba.com/
>
> This patch series just put these patches together, and with a new patch 5.
> these five patches are the fixups for kexec, vmcore and improvements
> for vmcoreinfo and memory layout dump.
>
> The main changes in the five patchs as below,
> Patch 1: Add a fast call path of crash_kexec() as other Arch(x86, arm64) do.
> Patch 2: use __smp_processor_id() instead of smp_processor_id() to cleanup
>  the console prints.
> Patch 3: Add VM layout, va bits, ram base to vmcoreinfo, which can simplify
>  the development of crash tool as ARM64 already did
>  (arch/arm64/kernel/crash_core.c).
> Patch 4: Add modules to virtual kernel memory layout dump.
> Patch 5: Fixup to get correct kernel mode PC for vmcore
>
> With these 5 patches(patch 3 is must), crash tool can work well to analyze
> a vmcore. The patches for crash tool for RISCV64 is in the link:
> https://lore.kernel.org/linux-riscv/20220717042929.370022-1-xianting.t...@linux.alibaba.com/
>
> Xianting Tian (5):
>   RISC-V: Fixup fast call of crash_kexec()
>   RISC-V: use __smp_processor_id() instead of smp_processor_id()
>   RISC-V: Add arch_crash_save_vmcoreinfo support

VMCOREINFO changes need to be documented in
Documentation/admin-guide/kdump/vmcoreinfo.rst.

Apart from that, I suggest always cc'ing the kexec mailing list (added
in cc) for kexec/kdump patches.
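VMCOREINFO is exported as plain-text KEY=VALUE lines, and dump analysis tools such as crash and makedumpfile parse values like the VA layout out of it. A minimal lookup sketch; vmcoreinfo_lookup() is hypothetical, and the keys used in the test input are illustrative rather than the exact RISC-V names added by the patch:

```c
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper: find `key` in VMCOREINFO-style text made of
 * newline-separated "KEY=VALUE" lines. Copies the value into `out`
 * (truncated to outlen - 1 bytes) and returns true if the key exists.
 */
bool vmcoreinfo_lookup(const char *text, const char *key,
		       char *out, size_t outlen)
{
	size_t key_len = strlen(key);
	const char *line = text;

	while (line && *line) {
		const char *eol = strchr(line, '\n');
		size_t line_len = eol ? (size_t)(eol - line) : strlen(line);

		/* The key must be followed directly by '='. */
		if (line_len > key_len && strncmp(line, key, key_len) == 0 &&
		    line[key_len] == '=') {
			size_t val_len = line_len - key_len - 1;

			if (outlen == 0)
				return false;
			if (val_len >= outlen)
				val_len = outlen - 1;
			memcpy(out, line + key_len + 1, val_len);
			out[val_len] = '\0';
			return true;
		}
		line = eol ? eol + 1 : NULL;
	}
	return false;
}
```

A key such as NUMBER(VA_BITS) is looked up exactly like PAGESIZE, since only the text before '=' has to match.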


>   riscv: Add modules to virtual kernel memory layout dump
>   RISC-V: Fixup getting correct current pc
>
>  arch/riscv/kernel/Makefile  |  1 +
>  arch/riscv/kernel/crash_core.c  | 29 +
>  arch/riscv/kernel/crash_save_regs.S |  2 +-
>  arch/riscv/kernel/machine_kexec.c   |  2 +-
>  arch/riscv/kernel/traps.c   |  4 
>  arch/riscv/mm/init.c|  4 
>  6 files changed, 40 insertions(+), 2 deletions(-)
>  create mode 100644 arch/riscv/kernel/crash_core.c
>
> --
> 2.17.1
>
> --
> Crash-utility mailing list
> crash-util...@redhat.com
> https://listman.redhat.com/mailman/listinfo/crash-utility
> Contribution Guidelines: https://github.com/crash-utility/crash/wiki
>

Thanks
Dave




Re: [PATCHv7 11/14] x86: Disable kexec if system has unaccepted memory

2022-07-04 Thread Dave Young
On Wed, 29 Jun 2022 at 08:59, Kirill A. Shutemov
 wrote:
>
> On Tue, Jun 28, 2022 at 05:10:56PM -0700, Dave Hansen wrote:
> > On 6/28/22 16:51, Kirill A. Shutemov wrote:
> > > On Fri, Jun 24, 2022 at 05:00:05AM +0300, Kirill A. Shutemov wrote:
> > >>> If there is some deep and fundamental why this can not be supported
> > >>> then it probably makes sense to put some code in the arch_kexec_load
> > >>> hook that verifies that deep and fundamental reason is present.
> > ...
> > > +   /*
> > > +    * TODO: Information on memory acceptance status has to be communicated
> > > +    * between kernel.
> > > +    */
> >
> > So, the deep and fundamental reason is... drum roll... you haven't
> > gotten around to implementing bitmap passing yet?!?!?   I have the
> > feeling that wasn't what Eric was looking for.
>
> The deep fundamental reason is that everything cannot be implemented and
> upstreamed at once.

If the only remaining piece is passing the bitmap to the kexec kernel,
then since you have already reserved the bitmap memory, I guess it is
straightforward to set the kexec kernel's bootparams.unaccepted_memory
to the old value. I am not sure whether there are problems when the
decompression code accepts memory again, though.

For the in-kernel kexec_file_load path, refer to the function
setup_boot_parameters() in arch/x86/kernel/kexec-bzimage64.c; for the
kexec-tools kexec_load path, refer to setup_linux_system_parameters()
in kexec/arch/i386/x86-linux-setup.c.
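To illustrate why the bitmap has to reach the kexec kernel: a hypothetical model where one bit per unit of memory records whether that unit is still unaccepted. The 2 MiB unit size, the 64-unit limit and all names are assumptions for illustration, not the layout used by the patch series, and callers are assumed to pass unit-aligned ranges:

```c
#include <stdbool.h>
#include <stdint.h>

#define UNIT_SIZE	(2ULL << 20)	/* assumed 2 MiB acceptance granularity */
#define NR_UNITS	64		/* small fixed size for the sketch */

/* Hypothetical bitmap: a set bit means the unit is still unaccepted. */
struct unaccepted_bitmap {
	uint64_t bits;	/* NR_UNITS == 64, so one word is enough */
};

/* Clear the bits covering [start, end): those units are now accepted. */
static void mark_accepted(struct unaccepted_bitmap *bm,
			  uint64_t start, uint64_t end)
{
	for (uint64_t unit = start / UNIT_SIZE;
	     unit < NR_UNITS && unit * UNIT_SIZE < end; unit++)
		bm->bits &= ~(1ULL << unit);
}

/* True if any unit overlapping [start, end) still needs to be accepted. */
static bool range_needs_accept(const struct unaccepted_bitmap *bm,
			       uint64_t start, uint64_t end)
{
	for (uint64_t unit = start / UNIT_SIZE;
	     unit < NR_UNITS && unit * UNIT_SIZE < end; unit++)
		if (bm->bits & (1ULL << unit))
			return true;
	return false;
}
```

In this model, marking an already accepted range again leaves the bitmap unchanged; whether re-accepting memory is harmless at the platform level is exactly the open question above and is not modelled here.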

Thanks
Dave




Re: [PATCH 0/4] Printbufs & shrinker OOM reporting

2022-04-29 Thread Dave Young
Hi Kent,
On Fri, 22 Apr 2022 at 07:56, Kent Overstreet  wrote:
>
> Debugging OOMs has been one of my sources of frustration, so this patch series
> is an attempt to do something about it.
>
> The first patch in the series is something I've been slowly evolving in bcachefs
> for years: simple heap allocated strings meant for appending to and building up
> structured log/error messages. They make it easy and straightforward to write
> pretty-printers for everything, which in turn makes good logging and error
> messages something that just happens naturally.
>
> We want it here because that means the reporting I'm adding to shrinkers can be
> used by both OOM reporting, and for the sysfs (or is it debugfs now) interface
> that Roman is adding.
>

I added the kexec list to cc. It seems like a nice enhancement to OOM
reporting.
I suspect kdump tooling needs changes to retrieve the kmsg log from a
vmcore; could you confirm? For example makedumpfile, crash, and
kexec-tools (its vmcore-dmesg tool).
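The cover letter describes printbufs as heap-allocated strings for appending to and building up structured messages. A minimal sketch of that general idea in plain C; struct strbuf and strbuf_printf() are hypothetical and are not the lib/printbuf API from the series:

```c
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical heap-allocated, append-only string buffer. */
struct strbuf {
	char *buf;
	size_t len;	/* bytes used, excluding the NUL terminator */
	size_t size;	/* bytes allocated */
};

/* Append formatted text, growing the buffer as needed. Returns 0 or -1. */
int strbuf_printf(struct strbuf *sb, const char *fmt, ...)
{
	va_list args, args2;
	int n;

	va_start(args, fmt);
	va_copy(args2, args);
	n = vsnprintf(NULL, 0, fmt, args);	/* measure the output first */
	va_end(args);
	if (n < 0) {
		va_end(args2);
		return -1;
	}
	if (sb->len + (size_t)n + 1 > sb->size) {
		size_t new_size = sb->size ? sb->size : 64;
		char *p;

		while (new_size < sb->len + (size_t)n + 1)
			new_size *= 2;
		p = realloc(sb->buf, new_size);
		if (!p) {
			va_end(args2);
			return -1;
		}
		sb->buf = p;
		sb->size = new_size;
	}
	vsnprintf(sb->buf + sb->len, sb->size - sb->len, fmt, args2);
	va_end(args2);
	sb->len += (size_t)n;
	return 0;
}
```

Measuring with vsnprintf(NULL, 0, ...) and then formatting into the grown buffer keeps each append a single call for the user, which is what makes pretty-printers cheap to write.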


> This patch series also:
>  - adds OOM reporting on shrinkers, reporting on top 10 shrinkers (in sorted
>    order!)
>  - changes slab reporting to be always-on, also reporting top 10 slabs in sorted
>    order
>  - starts centralizing OOM reporting in mm/show_mem.c
>
> The last patch in the series is only a demonstration of how to implement the
> shrinker .to_text() method, since bcachefs isn't upstream yet.
>
> Kent Overstreet (4):
>   lib/printbuf: New data structure for heap-allocated strings
>   mm: Add a .to_text() method for shrinkers
>   mm: Centralize & improve oom reporting in show_mem.c
>   bcachefs: shrinker.to_text() methods
>
>  fs/bcachefs/btree_cache.c |  18 ++-
>  fs/bcachefs/btree_key_cache.c |  18 ++-
>  include/linux/printbuf.h  | 140 ++
>  include/linux/shrinker.h  |   5 +
>  lib/Makefile  |   4 +-
>  lib/printbuf.c| 271 ++
>  mm/Makefile   |   2 +-
>  mm/oom_kill.c |  23 ---
>  {lib => mm}/show_mem.c|  14 ++
>  mm/slab.h |   6 +-
>  mm/slab_common.c  |  53 ++-
>  mm/vmscan.c   |  75 ++
>  12 files changed, 587 insertions(+), 42 deletions(-)
>  create mode 100644 include/linux/printbuf.h
>  create mode 100644 lib/printbuf.c
>  rename {lib => mm}/show_mem.c (78%)
>
> --
> 2.35.2
>

Thanks
Dave




Re: [ANNOUNCE] kexec-tools v2.0.24 preparation

2022-03-29 Thread Dave Young
On 03/29/22 at 06:16pm, Baoquan He wrote:
> On 03/29/22 at 11:17am, Simon Horman wrote:
> > On Sat, Mar 26, 2022 at 02:55:43PM +0800, RuiRui Yang wrote:
> > > Hi Simon,
> > > 
> > > Recently RH CKI detected a kexec/kdump failure with kexec_load being
> > > used,  the 2nd kernel can not boot, just reset to bios/firmware
> > > instead.
> > > 
> > > This only happens with gcc 12.x compiled kexec-tools,  add
> > > "-fno-tree-vectorize" to purgatory/Makefile CFLAGS make the bug
> > > dissappear, but I hope people can help to verify and confirm it does
> > > fix the problem.   It would be better to wait a few days so that we
> > > can fix it in the coming release, what do you think?
> > > 
> > > Cced Baoquan who is handling the issue..
> > > 
> > > Thanks!
> > 
> > Hi RuiRui,
> > 
> > thanks for letting us know.
> > 
> > I'm happy with delaying the release for a short time to accommodate
> > resolving this problem. Please let us know when there is some progress.
> 
> Thanks for quick response, Simon. I have sent a patch to fix this,
> please check the patch v2 with subject 'purgatory: do not enable
> vectorization automatically for purgatory compiling'.
> 
> I tested on machine where the issue can stably reproduced, and can
> confirm adding -fno-tree-vectorize to compiling options of purgatory can
> fix the issue.
> 
> Dave, please check if it is expected from you.
> 

Hi Baoquan, it works for me. Thanks for posting it.

Simon, I used another email/name, Ruirui Yang. Both emails are fine,
but I usually use "Dave Young" in the community; just FYI in case this
causes some confusion :)

Thanks
Dave




Re: [PATCH v19 02/13] x86/setup: Use parse_crashkernel_high_low() to simplify code

2021-12-29 Thread Dave Young
On 12/29/21 at 11:03am, Borislav Petkov wrote:
> On Wed, Dec 29, 2021 at 03:27:48PM +0800, Dave Young wrote:
> > So I think you can unify the parse_crashkernel* in x86 first with just
> > one function.  And leave the further improvements to later work. But
> > let's see how Boris think about this.
> 
> Well, I think this all unnecessary work. Why?
> 
> If the goal is to support crashkernel...high,low on arm64, then you
> should simply *copy* the functionality on arm64 and be done with it.
> 
> Unification is done by looking at code which is duplicated across
> architectures and which has been untouched for a while now, i.e., no
> new or arch-specific changes are going to it so a unification can be
> as simple as trivially switching the architectures to call a generic
> function.
> 
> What this does is carve out the "generic" parts and then try not to
> break existing usage.
> 
> Which is a total waste of energy and resources. And it is casting that
> functionality in stone so that when x86 wants to change something there,
> it should do it in a way not to break arm64. And I fail to see the
> advantage of all that. Code sharing ain't it.
> 
> So what it should do is simply copy the necessary code to arm64.
> Unifications can always be done later, when the dust settles.

I think I agree with you that the better way is to make improvements
that let arches do things logically better.  I can live with the way I
suggested, although it is not the best.  But I think Leizhen needs a clear
direction about how to do it, and that is very clear now.  Let's see how
he will handle this.

> 
> IMNSVHO.
> 
> -- 
> Regards/Gruss,
> Boris.
> 
> https://people.kernel.org/tglx/notes-about-netiquette
> 




Re: [PATCH v19 02/13] x86/setup: Use parse_crashkernel_high_low() to simplify code

2021-12-29 Thread Dave Young
On 12/29/21 at 11:11am, Borislav Petkov wrote:
> On Wed, Dec 29, 2021 at 03:45:12PM +0800, Dave Young wrote:
> > BTW, I would suggest waiting for reviewers to respond (e.g. at least one
> > week, or more due to the holidays) before sending another version.
> > 
> > Do not worry about missing 5.17.  Take it easy: if it misses, let's just
> > live with it and continue to work on the future improvements.  I think
> > one reason this issue has taken so long is that it was discussed some
> > time ago with no follow-up, and later people needed to warm up again.
> > Just keep it warm and continue to engage in the improvements; do not
> > hurry for a specific mainline release.
> 
> Can you tell this to *all* patch submitters please?

I appreciate your further explanation below of the situation.  I do not
see how I can tell this to *all* submitters, but I already do, and will
keep trying to, as far as I can.  For maintainers and patch submitters, it
would help if both parties showed sympathy for each other; some gentle
reminders will help people understand each other, especially newcomers.

> 
> I can't count the times where people simply hurry to send the new
> revision just to get it in the next kernel, and make silly mistakes
> while doing so. Or not think things straight and misdesign it all.
> 
> And what this causes is the opposite of what they wanna achieve - pissed
> maintainers and ignored threads.
> 
> And they all *know* that the next kernel is around the corner. So why
> the hell does it even matter when?
> 
> What most submitters fail to realize is, the moment your code hits
> upstream, it becomes the maintainers' problem and submitters can relax.
> 
> But maintainers get to deal with this code forever. So after a while
> maintainers learn that they either accept ready code and it all just
> works or they make the mistake to take half-baked crap in and then they
> themselves get to clean it up and fix it.
> 
> So maintainers learn quickly to push back.
> 
> But it is annoying and it would help immensely if submitters would
> consider this and stop hurrying the code in but try to do a *good* job
> first, design-wise and code-wise by thinking hard about what they're
> trying to do.
> 
> Yeah, things could be a lot simpler and easier - it only takes a little
> bit of effort...
> 
> -- 
> Regards/Gruss,
> Boris.
> 
> https://people.kernel.org/tglx/notes-about-netiquette
> 

Thanks
Dave




Re: [PATCH v19 02/13] x86/setup: Use parse_crashkernel_high_low() to simplify code

2021-12-28 Thread Dave Young
On 12/29/21 at 03:27pm, Dave Young wrote:
> On 12/29/21 at 10:27am, Leizhen (ThunderTown) wrote:
> > 
> > 
> > On 2021/12/29 0:13, Borislav Petkov wrote:
> > > On Tue, Dec 28, 2021 at 09:26:01PM +0800, Zhen Lei wrote:
> > >> Use parse_crashkernel_high_low() to bring the parsing of
> > >> "crashkernel=X,high" and the parsing of "crashkernel=Y,low" together, they
> > >> are strongly dependent, make code logic clear and more readable.
> > >>
> > >> Suggested-by: Borislav Petkov 
> > > 
> > > Yeah, doesn't look like something I suggested...
> > > 
> > >> @@ -474,10 +472,9 @@ static void __init reserve_crashkernel(void)
> > >>  /* crashkernel=XM */
> > >>  ret = parse_crashkernel(boot_command_line, total_mem, &crash_size, &crash_base);
> > >>  if (ret != 0 || crash_size <= 0) {
> > >> -/* crashkernel=X,high */
> > >> -ret = parse_crashkernel_high(boot_command_line, total_mem,
> > >> - &crash_size, &crash_base);
> > >> -if (ret != 0 || crash_size <= 0)
> > >> +/* crashkernel=X,high and possible crashkernel=Y,low */
> > >> +ret = parse_crashkernel_high_low(boot_command_line, _size, _size);
> > > 
> > > So this calls parse_crashkernel() and when that one fails, it calls this
> > > new weird parse high/low helper you added.
> > > 
> > > But then all three end up in the same __parse_crashkernel() worker
> > > function which seems to do the actual parsing.
> > > 
> > > What I suggested and what would be real clean is if the arches would
> > > simply call a *single* 
> > > 
> > >   parse_crashkernel()
> > > 
> > > function and when that one returns, *all* crashkernel= options would
> > > have been parsed properly, low, high, middle crashkernel, whatever...
> > > and the caller would know what crash kernel needs to be allocated.
> > > 
> > > Then each arch can do its memory allocations and checks based on that
> > > parsed data and decide to allocate or bail.
> > 
> > However, only x86 currently supports "crashkernel=X,high" and "crashkernel=Y,low", and arm64
> > will also support it. It is not supported on other architectures. So changing parse_crashkernel()
> > is not appropriate unless a new function is introduced. But naming this new function isn't easy,
> > and the name parse_crashkernel_in_order() that I've named before doesn't seem to be good.
> > Of course, we can also consider changing parse_crashkernel() to another name, then use
> > parse_crashkernel() to parse all possible "crashkernel=" options in order, but this will cause
> > other architectures to change as well.
> 
> Hi, I did not follow all the discussions, but if the only difference is
> the low -> high fallback, I think you can add another argument to
> parse_crashkernel(..., *fallback_high) and make some changes in
> __parse_crashkernel() before it returns.  But since there are too
> many arguments, you may need a wrapper struct, e.g. crashkernel_param.
> 
> If you do not want to touch other arches, another function may be
> needed, something like parse_crashkernel_fallback() for x86 and arm64.
> 
> But I may not have all the context, please ignore this if it is not the
> case.  I agree that calling parse_crashkernel* in the
> reserve_crashkernel functions does not look good, though.
> 
> OTOH there is a bunch of other logic in the param parsing code,
> e.g. determining the final size and offset.  To split the logic out, more
> things need to be done: first, the parsing function just gets all the
> inputs from the cmdline string, e.g. an array of struct crashkernel_param
> with mem_range, expected size, expected offset, high or low, and does
> nothing else.  Then these parsed inputs are passed to another function
> that determines the final crash_size and crash_base; that part can be
> arch-specific.
> 
> So I think you can unify the parse_crashkernel* calls in x86 first with
> just one function, and leave the further improvements to later work.  But
> let's see what Boris thinks about this.
> 

BTW, I would suggest waiting for reviewers to respond (e.g. at least one
week, or more due to the holidays) before sending another version.

Do not worry about missing 5.17.  Take it easy: if it misses, let's just
live with it and continue to work on the future improvements.  I think
one reason this issue has taken so long is that it was discussed some
time ago with no follow-up, and later people needed to warm up again.
Just keep it warm and continue to engage in the improvements; do not
hurry for a specific mainline release.

Thanks
Dave




Re: [PATCH v19 02/13] x86/setup: Use parse_crashkernel_high_low() to simplify code

2021-12-28 Thread Dave Young
On 12/29/21 at 10:27am, Leizhen (ThunderTown) wrote:
> 
> 
> On 2021/12/29 0:13, Borislav Petkov wrote:
> > On Tue, Dec 28, 2021 at 09:26:01PM +0800, Zhen Lei wrote:
> >> Use parse_crashkernel_high_low() to bring the parsing of
> >> "crashkernel=X,high" and the parsing of "crashkernel=Y,low" together, they
> >> are strongly dependent, make code logic clear and more readable.
> >>
> >> Suggested-by: Borislav Petkov 
> > 
> > Yeah, doesn't look like something I suggested...
> > 
> >> @@ -474,10 +472,9 @@ static void __init reserve_crashkernel(void)
> >>/* crashkernel=XM */
> >>ret = parse_crashkernel(boot_command_line, total_mem, &crash_size, &crash_base);
> >>if (ret != 0 || crash_size <= 0) {
> >> -  /* crashkernel=X,high */
> >> -  ret = parse_crashkernel_high(boot_command_line, total_mem,
> >> -   &crash_size, &crash_base);
> >> -  if (ret != 0 || crash_size <= 0)
> >> +  /* crashkernel=X,high and possible crashkernel=Y,low */
> >> +  ret = parse_crashkernel_high_low(boot_command_line, _size, _size);
> > 
> > So this calls parse_crashkernel() and when that one fails, it calls this
> > new weird parse high/low helper you added.
> > 
> > But then all three end up in the same __parse_crashkernel() worker
> > function which seems to do the actual parsing.
> > 
> > What I suggested and what would be real clean is if the arches would
> > simply call a *single* 
> > 
> > parse_crashkernel()
> > 
> > function and when that one returns, *all* crashkernel= options would
> > have been parsed properly, low, high, middle crashkernel, whatever...
> > and the caller would know what crash kernel needs to be allocated.
> > 
> > Then each arch can do its memory allocations and checks based on that
> > parsed data and decide to allocate or bail.
> 
> However, only x86 currently supports "crashkernel=X,high" and "crashkernel=Y,low", and arm64
> will also support it. It is not supported on other architectures. So changing parse_crashkernel()
> is not appropriate unless a new function is introduced. But naming this new function isn't easy,
> and the name parse_crashkernel_in_order() that I've named before doesn't seem to be good.
> Of course, we can also consider changing parse_crashkernel() to another name, then use
> parse_crashkernel() to parse all possible "crashkernel=" options in order, but this will cause
> other architectures to change as well.

Hi, I did not follow all the discussions, but if the only difference is
the low -> high fallback, I think you can add another argument to
parse_crashkernel(..., *fallback_high) and make some changes in
__parse_crashkernel() before it returns.  But since there are too
many arguments, you may need a wrapper struct, e.g. crashkernel_param.
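A minimal userspace sketch of the fallback idea, assuming a hypothetical
struct crashkernel_param wrapper, a demo_parse_crashkernel() helper with a
fallback_high flag, and a simplified demo_memparse() in place of the
kernel's memparse().  None of these are existing kernel APIs, and the
offset (@) and memory-range forms are not handled:

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical wrapper so the parser does not need a long argument list. */
struct crashkernel_param {
	unsigned long long size;     /* X from crashkernel=X or crashkernel=X,high */
	unsigned long long low_size; /* Y from crashkernel=Y,low, 0 if absent */
	bool high;                   /* true when the ,high form was chosen */
};

/* Simplified stand-in for memparse(): a number with an optional K/M/G suffix. */
static unsigned long long demo_memparse(const char *s, char **end)
{
	static const char units[] = "KMG";
	unsigned long long v = strtoull(s, end, 0);
	const char *u = **end ? strchr(units, toupper((unsigned char)**end)) : NULL;

	if (u) {
		v <<= 10 * (u - units + 1);	/* K: 2^10, M: 2^20, G: 2^30 */
		(*end)++;
	}
	return v;
}

/*
 * A plain crashkernel=X wins; otherwise, only if @fallback_high is set,
 * use crashkernel=X,high plus an optional crashkernel=Y,low.
 * Returns 0 on success, -1 if nothing usable was found.
 */
static int demo_parse_crashkernel(const char *cmdline, bool fallback_high,
				  struct crashkernel_param *p)
{
	unsigned long long plain = 0, high = 0, low = 0;
	const char *s = cmdline;

	memset(p, 0, sizeof(*p));
	while ((s = strstr(s, "crashkernel=")) != NULL) {
		char *end;
		unsigned long long v;

		if (s != cmdline && s[-1] != ' ') {	/* not at a token start */
			s++;
			continue;
		}
		v = demo_memparse(s + strlen("crashkernel="), &end);
		if (strncmp(end, ",high", 5) == 0)
			high = v;
		else if (strncmp(end, ",low", 4) == 0)
			low = v;
		else if (*end == ' ' || *end == '\0')
			plain = v;
		s = end;
	}

	if (plain) {
		p->size = plain;
		return 0;
	}
	if (fallback_high && high) {
		p->size = high;
		p->low_size = low;
		p->high = true;
		return 0;
	}
	return -1;
}
```

With fallback_high set, "crashkernel=1G,high crashkernel=128M,low" gives a
1G high reservation plus 128M low; callers that pass false never look at
the ,high and ,low forms, so other arches would be unaffected.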

If you do not want to touch other arches, another function may be
needed, something like parse_crashkernel_fallback() for x86 and arm64.

But I may not have all the context, please ignore this if it is not the
case.  I agree that calling parse_crashkernel* in the
reserve_crashkernel functions does not look good, though.

OTOH there is a bunch of other logic in the param parsing code,
e.g. determining the final size and offset.  To split the logic out, more
things need to be done: first, the parsing function just gets all the
inputs from the cmdline string, e.g. an array of struct crashkernel_param
with mem_range, expected size, expected offset, high or low, and does
nothing else.  Then these parsed inputs are passed to another function
that determines the final crash_size and crash_base; that part can be
arch-specific.
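A rough userspace model of that split, under stated assumptions:
collect_crashkernel() only records what each crashkernel= token says, and
resolve_crashkernel() stands in for the arch-specific policy that picks
the final size and base.  All names here, and the default_low parameter,
are hypothetical, and the mem_range syntax is left out to keep it short:

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

enum ck_kind { CK_PLAIN, CK_HIGH, CK_LOW };

/* Stage-1 record: only what the command line asked for. */
struct crashkernel_param {
	enum ck_kind kind;
	unsigned long long size;
	unsigned long long offset;	/* from "@offset", 0 if absent */
};

/* Stage-2 result: the final decision an architecture would act on. */
struct crash_reservation {
	unsigned long long size, base, low_size;
	bool high;
};

/* Simplified stand-in for memparse(): a number with an optional K/M/G suffix. */
static unsigned long long demo_memparse(const char *s, char **end)
{
	static const char units[] = "KMG";
	unsigned long long v = strtoull(s, end, 0);
	const char *u = **end ? strchr(units, toupper((unsigned char)**end)) : NULL;

	if (u) {
		v <<= 10 * (u - units + 1);	/* K: 2^10, M: 2^20, G: 2^30 */
		(*end)++;
	}
	return v;
}

/* Stage 1: record every crashkernel= token, making no policy decisions. */
static size_t collect_crashkernel(const char *cmdline,
				  struct crashkernel_param *out, size_t max)
{
	size_t n = 0;
	const char *s = cmdline;

	while (n < max && (s = strstr(s, "crashkernel=")) != NULL) {
		struct crashkernel_param *p = &out[n];
		char *end;

		if (s != cmdline && s[-1] != ' ') {	/* not at a token start */
			s++;
			continue;
		}
		p->kind = CK_PLAIN;
		p->offset = 0;
		p->size = demo_memparse(s + strlen("crashkernel="), &end);
		if (*end == '@')
			p->offset = demo_memparse(end + 1, &end);
		else if (strncmp(end, ",high", 5) == 0)
			p->kind = CK_HIGH;
		else if (strncmp(end, ",low", 4) == 0)
			p->kind = CK_LOW;
		n++;
		s = end;
	}
	return n;
}

/* Stage 2: policy.  Later tokens of the same kind override earlier ones. */
static bool resolve_crashkernel(const struct crashkernel_param *in, size_t n,
				unsigned long long default_low,
				struct crash_reservation *r)
{
	const struct crashkernel_param *plain = NULL, *high = NULL, *low = NULL;

	for (size_t i = 0; i < n; i++) {
		if (in[i].kind == CK_PLAIN)
			plain = &in[i];
		else if (in[i].kind == CK_HIGH)
			high = &in[i];
		else
			low = &in[i];
	}

	memset(r, 0, sizeof(*r));
	if (plain) {		/* crashkernel=X[@Y] takes priority */
		r->size = plain->size;
		r->base = plain->offset;
		return true;
	}
	if (high) {		/* ,high with an optional ,low, else a default */
		r->size = high->size;
		r->low_size = low ? low->size : default_low;
		r->high = true;
		return true;
	}
	return false;
}
```

The point of the split is that an architecture could swap in its own
resolve step while the parser stays shared and policy-free.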

So I think you can unify the parse_crashkernel* calls in x86 first with
just one function, and leave the further improvements to later work.  But
let's see what Boris thinks about this.

> 
> > 
> > So it is getting there but it needs more surgery...
> > 
> > Thx.
> > 
> 
> -- 
> Regards,
>   Zhen Lei
> 

Thanks
Dave




Re: [PATCH 3/3] panic: Allow printing extra panic information on kdump

2021-12-26 Thread Dave Young
On 12/25/21 at 04:21pm, Guilherme G. Piccoli wrote:
> On 23/12/2021 22:35, Dave Young wrote:
> > Hi Guilherme,
> > [...]
> > If it is only the doc update, I think it is fine to do it in a separate
> > follow-up patch.
> > 
> > About your first option in the patch log: there is the
> > crash_kexec_post_notifiers kernel param, which can be used to run the
> > panic notifiers before booting into the kdump kernel.  Another way:
> > you could probably try to turn panic_print into a panic notifier.  Has
> > this been discussed before?
> > 
> 
> Hey Dave, thanks for the suggestion. I've considered that but didn't
> like the idea. My reasoning was: allowing post notifiers on kdump will
> highly compromise the reliability, whereas the panic_print is a solo
> option, and not very invasive.
> 
> To mix it with all panic notifiers would just increase a lot the risk of
> a kdump failure. Put in other words: if I'm a kdump user and in order to
> have this panic_print setting I'd also need to enable post notifiers,
> certainly I'll not use the feature, 'cause I don't wanna risk kdump too
> much.

Hi Guilherme, yes, I have the same concern.  But there could be more
things like panic_print in the future, and it looks odd to keep adding
kernel cmdline params for them.

> 
> One other option I've considered however, and I'd appreciate your
> opinion here, would be a new option on crash_kexec_post_notifiers that
> allows the users to select *which few notifiers* they want to enable.
> Currently it's all or nothing, and this approach is too heavy/risky I
> believe. Allowing customization on which post notifiers the user wants
> would be much better and in this case, having a post notifier for
> panic_print makes a lot of sense. What do you think?

It is definitely a good idea; I'd be more than glad to see this if you
would like to work on it!
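A small userspace sketch of selecting post notifiers by name, assuming a
hypothetical comma-separated list (e.g. "panic_print,other_driver") given
to crash_kexec_post_notifiers.  struct notifier_block has no name field,
so the named table, callbacks and matching helper below are illustrative
only:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

struct named_notifier {
	const char *name;
	void (*cb)(void);
};

/* True if @name appears as a whole item in the comma-separated @list. */
static bool in_list(const char *list, const char *name)
{
	size_t len = strlen(name);
	const char *p = list;

	if (len == 0)
		return false;
	while ((p = strstr(p, name)) != NULL) {
		bool starts = (p == list || p[-1] == ',');
		bool ends = (p[len] == ',' || p[len] == '\0');

		if (starts && ends)
			return true;
		p += len;
	}
	return false;
}

/* Run only the selected notifiers, in table order; return how many ran. */
static int run_selected(const struct named_notifier *tbl, size_t n,
			const char *selected)
{
	int ran = 0;

	for (size_t i = 0; i < n; i++) {
		if (in_list(selected, tbl[i].name)) {
			tbl[i].cb();
			ran++;
		}
	}
	return ran;
}

/* Demo callbacks that only count invocations. */
static int print_calls, other_calls;
static void demo_panic_print(void) { print_calls++; }
static void demo_other(void) { other_calls++; }

static const struct named_notifier demo_table[] = {
	{ "panic_print", demo_panic_print },
	{ "other_driver", demo_other },
};
```

Whole-item matching matters here: "panic_print_x" must not enable
"panic_print", otherwise users could opt in to a notifier by accident.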

> 
> Thanks!
> 

Thanks
Dave




Re: [PATCH 3/3] panic: Allow printing extra panic information on kdump

2021-12-23 Thread Dave Young
Hi Guilherme,
On 12/22/21 at 09:34am, Guilherme G. Piccoli wrote:
> On 22/12/2021 08:45, Dave Young wrote:
> > Hi Guilherme,
> > 
> > Thanks for your patch.  Could you add the kexec list for any follow-up
> > patches?  This could change kdump behavior, so let's see if there are any
> > comments from the kexec list.
> > 
> > Kudos to the lore+lei tool, which let me catch this as it came into
> > Andrew's tree :)
> 
> Hi Dave, I'm really sorry for not adding the kexec list, I forgot. But I
> will do next time for sure, my apologies. And thanks for taking a look
> after you noticed that on lore, I appreciate your feedback!

Thanks!

> 
> > [...]
> > People may enable the kdump crashkernel and panic_print together
> > without being aware that the extra panic printing could make kdump
> > less reliable (in theory).  So at least a few words in
> > kernel-parameters.txt would help.
> >  
> 
> That makes sense, I'll improve that in a follow-up patch, how about
> that? Indeed it's a good idea to let people be sure that panic_print
> might affect kdump reliability, although I consider the risk to be
> pretty low. And I'll loop the kexec list for sure!

If it is only the doc update, I think it is fine to do it in a separate
follow-up patch.

About your first option in the patch log: there is the
crash_kexec_post_notifiers kernel param, which can be used to run the
panic notifiers before booting into the kdump kernel.  Another way:
you could probably try to turn panic_print into a panic notifier.  Has
this been discussed before?
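For reference, panic notifiers run from a priority-ordered chain
(registered with atomic_notifier_chain_register() on panic_notifier_list).
Below is a userspace model of that ordering, loosely following the
insertion rule in the kernel's notifier_chain_register(); the demo_* types
and the priority given to the panic_print callback are assumptions for
illustration, and NOTIFY_STOP handling is omitted:

```c
#include <assert.h>
#include <stddef.h>

/* Userspace model of a chain entry (not the kernel's notifier_block). */
struct demo_notifier {
	int (*call)(struct demo_notifier *nb, unsigned long event, void *data);
	struct demo_notifier *next;
	int priority;
};

/* Keep the chain sorted by descending priority; ties keep registration order. */
static void demo_chain_register(struct demo_notifier **head,
				struct demo_notifier *n)
{
	while (*head && n->priority <= (*head)->priority)
		head = &(*head)->next;
	n->next = *head;
	*head = n;
}

/* Call every entry in chain order (stopping on NOTIFY_STOP is not modelled). */
static void demo_chain_call(struct demo_notifier *head, unsigned long event,
			    void *data)
{
	for (; head; head = head->next)
		head->call(head, event, data);
}

/* Record the order callbacks ran in: 'P' for panic_print, 'O' for others. */
static char order[8];
static size_t order_len;

static void record(char c)
{
	if (order_len < sizeof(order))
		order[order_len++] = c;
}

/* Stand-in for a notifier that would call panic_print_sys_info(). */
static int demo_panic_print_cb(struct demo_notifier *nb, unsigned long event,
			       void *data)
{
	(void)nb; (void)event; (void)data;
	record('P');
	return 0;
}

static int demo_other_cb(struct demo_notifier *nb, unsigned long event,
			 void *data)
{
	(void)nb; (void)event; (void)data;
	record('O');
	return 0;
}
```

Registering the panic_print entry with a higher priority than the others
makes it run first even if it is registered last.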

> 
> Cheers,
> 
> 
> Guilherme

Thanks
Dave




Re: [PATCH 3/3] panic: Allow printing extra panic information on kdump

2021-12-22 Thread Dave Young
Hi Guilherme,

Thanks for your patch.  Could you add the kexec list for any follow-up
patches?  This could change kdump behavior, so let's see if there are any
comments from the kexec list.

Kudos to the lore+lei tool, which let me catch this as it came into
Andrew's tree :)
On 11/09/21 at 05:28pm, Guilherme G. Piccoli wrote:
> Currently we have the "panic_print" parameter/sysctl to allow some extra
> information to be printed in a panic event. On the other hand, the kdump
> mechanism allows to kexec a new kernel to collect a memory dump for the
> running kernel in case of panic.
> Right now these options are incompatible: the user either sets the kdump
> or makes use of "panic_print". The code path of "panic_print" isn't
> reached when kdump is configured.
> 
> There are situations though in which this would be interesting: for
> example, in systems that are very memory constrained, a handcrafted
> tiny kernel/initrd for kdump might be used in order to only collect the
> dmesg in kdump kernel. Even more common, systems with no disk space for
> the full (compressed) memory dump might very well rely in this
> functionality too, dumping only the dmesg with the additional information
> provided by "panic_print".
> 
> So, this is what the patch does: allows both functionality to co-exist;
> if "panic_print" is set and the system performs a kdump, the extra
> information is printed on dmesg before the kexec. Some notes about the
> design choices here:
> 
> (a) We could have introduced a sysctl or an extra bit on "panic_print"
> to allow enabling the co-existence of kdump and "panic_print", but seems
> that would be over-engineering; we have 3 cases, let's check how this
> patch change things:
> 
> - if the user have kdump set and not "panic_print", nothing changes;
> - if the user have "panic_print" set and not kdump, nothing changes;
> - if both are enabled, now we print the extra information before kdump,
> which is exactly the goal of the patch (and should be the goal of the
> user, since they enabled both options).

People may enable the kdump crashkernel and panic_print together
without being aware that the extra panic printing could make kdump
less reliable (in theory).  So at least a few words in
kernel-parameters.txt would help.
 
> 
> (b) We assume that the code path won't return from __crash_kexec()
> so we didn't guard against double execution of panic_print_sys_info().
> 
> Signed-off-by: Guilherme G. Piccoli 
> ---
>  kernel/panic.c | 7 +++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/kernel/panic.c b/kernel/panic.c
> index 5da71fa4e5f1..439dbf93b406 100644
> --- a/kernel/panic.c
> +++ b/kernel/panic.c
> @@ -243,6 +243,13 @@ void panic(const char *fmt, ...)
>*/
>   kgdb_panic(buf);
>  
> + /*
> +  * If we have a kdump kernel loaded, give a chance to panic_print
> +  * show some extra information on kernel log if it was set...
> +  */
> + if (kexec_crash_loaded())
> + panic_print_sys_info();
> +
>   /*
>* If we have crashed and we have a crash kernel loaded let it handle
>* everything else.
> -- 
> 2.33.1
> 
> 

Thanks
Dave




[PATCH] MAINTAINERS: update kdump maintainers

2021-11-22 Thread Dave Young
Remove myself from the kdump maintainers, as I do not have enough time to
maintain it now. I can still review patches on demand, though.

Signed-off-by: Dave Young 
---
 MAINTAINERS |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-x86/MAINTAINERS
===
--- linux-x86.orig/MAINTAINERS
+++ linux-x86/MAINTAINERS
@@ -10122,9 +10122,9 @@ F:  lib/Kconfig.kcsan
 F: scripts/Makefile.kcsan
 
 KDUMP
-M: Dave Young 
 M: Baoquan He 
 R: Vivek Goyal 
+R: Dave Young 
 L: kexec@lists.infradead.org
 S: Maintained
 W: http://lse.sourceforge.net/kdump/




Re: [PATCH 0/3] x86/kexec: fix memory leak of elf header buffer

2021-11-01 Thread Dave Young
Hi Baoquan,

On 10/29/21 at 03:24pm, Baoquan He wrote:
> The memory leak is reported by kmemleak detector, has been existing
> for a very long time. It could cause much memory loss on large machine
> with huge memory hotplug which will trigger kdump kernel reloading
> many times, with kexec_file_load interface.
> 
> And in patch 2, 3, clean up is done to remove unnecessary elf header
> buffer freeing and unneeded arch_kexec_kernel_image_load().
> 
> Baoquan He (3):
>   x86/kexec: fix memory leak of elf header buffer
>   x86/kexec: remove incorrect elf header buffer freeing
>   kexec_file: clean up arch_kexec_kernel_image_load
> 
>  arch/x86/kernel/machine_kexec_64.c | 23 +--
>  include/linux/kexec.h  |  1 -
>  kernel/kexec_file.c|  9 ++---
>  3 files changed, 11 insertions(+), 22 deletions(-)
> 
> -- 
> 2.17.2
> 

Acked-by: Dave Young 

nitpick: the first two patches could be merged together, but I'm also fine
if they stay as two patches.

Thanks
Dave




Re: [PATCH 1/2] kexec, KEYS: make the code in bzImage64_verify_sig public

2021-09-30 Thread Dave Young
On 09/30/21 at 08:49pm, Dave Young wrote:
> Hi Coiby,
> On 09/27/21 at 08:50am, Coiby Xu wrote:
> > From: Coiby Xu 
> > 
> > The code in bzImage64_verify_sig could make use of system keyrings including
> > .builtin_trusted_keys, .secondary_trusted_keys and .platform keyring to verify
> > signed kernel image as PE file. Move it to a public function.
> > 
> > Signed-off-by: Coiby Xu 
> > ---
> >  arch/x86/kernel/kexec-bzimage64.c | 13 +
> >  include/linux/kexec.h |  3 +++
> >  kernel/kexec_file.c   | 15 +++
> >  3 files changed, 19 insertions(+), 12 deletions(-)
> > 
> > diff --git a/arch/x86/kernel/kexec-bzimage64.c b/arch/x86/kernel/kexec-bzimage64.c
> > index 170d0fd68b1f..4136dd3be5a9 100644
> > --- a/arch/x86/kernel/kexec-bzimage64.c
> > +++ b/arch/x86/kernel/kexec-bzimage64.c
> > @@ -17,7 +17,6 @@
> >  #include 
> >  #include 
> >  #include 
> > -#include 
> >  
> >  #include 
> >  #include 
> > @@ -531,17 +530,7 @@ static int bzImage64_cleanup(void *loader_data)
> >  #ifdef CONFIG_KEXEC_BZIMAGE_VERIFY_SIG
> >  static int bzImage64_verify_sig(const char *kernel, unsigned long kernel_len)
> >  {
> > -   int ret;
> > -
> > -   ret = verify_pefile_signature(kernel, kernel_len,
> > - VERIFY_USE_SECONDARY_KEYRING,
> > - VERIFYING_KEXEC_PE_SIGNATURE);
> > -   if (ret == -ENOKEY && IS_ENABLED(CONFIG_INTEGRITY_PLATFORM_KEYRING)) {
> > -   ret = verify_pefile_signature(kernel, kernel_len,
> > - VERIFY_USE_PLATFORM_KEYRING,
> > - VERIFYING_KEXEC_PE_SIGNATURE);
> > -   }
> > -   return ret;
> > +   return arch_kexec_kernel_verify_pe_sig(kernel, kernel_len);
> >  }
> >  #endif
> >  
> > diff --git a/include/linux/kexec.h b/include/linux/kexec.h
> > index 0c994ae37729..d45f32336dbe 100644
> > --- a/include/linux/kexec.h
> > +++ b/include/linux/kexec.h
> > @@ -19,6 +19,7 @@
> >  #include 
> >  
> >  #include 
> > +#include 
> >  
> >  #ifdef CONFIG_KEXEC_CORE
> >  #include 
> > @@ -199,6 +200,8 @@ int arch_kimage_file_post_load_cleanup(struct kimage *image);
> >  #ifdef CONFIG_KEXEC_SIG
> >  int arch_kexec_kernel_verify_sig(struct kimage *image, void *buf,
> >  unsigned long buf_len);
> > +int arch_kexec_kernel_verify_pe_sig(const char *kernel,
> > +   unsigned long kernel_len);
> >  #endif
> >  int arch_kexec_locate_mem_hole(struct kexec_buf *kbuf);
> >  
> > diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
> > index 33400ff051a8..85ed6984ad8f 100644
> > --- a/kernel/kexec_file.c
> > +++ b/kernel/kexec_file.c
> > @@ -106,6 +106,21 @@ int __weak arch_kexec_kernel_verify_sig(struct kimage *image, void *buf,
> >  {
> > return kexec_image_verify_sig_default(image, buf, buf_len);
> >  }
> > +
> > +int arch_kexec_kernel_verify_pe_sig(const char *kernel, unsigned long kernel_len)
> > +{
> > +   int ret;
> > +
> > +   ret = verify_pefile_signature(kernel, kernel_len,
> > + VERIFY_USE_SECONDARY_KEYRING,
> > + VERIFYING_KEXEC_PE_SIGNATURE);
> > +   if (ret == -ENOKEY && IS_ENABLED(CONFIG_INTEGRITY_PLATFORM_KEYRING)) {
> > +   ret = verify_pefile_signature(kernel, kernel_len,
> > + VERIFY_USE_PLATFORM_KEYRING,
> > + VERIFYING_KEXEC_PE_SIGNATURE);
> > +   }
> > +   return ret;
> > +}
> 
> Since the function is moved as generic code, the kconfig option
> CONFIG_KEXEC_BZIMAGE_VERIFY_SIG can be removed.
> 
> Instead a CONFIG_KEXEC_PEFILE_VERIFY_SIG can be added so that it does
> not need to be compiled for only platform which support UEFI pefile

To fix my garbled sentence above: I mean it should only be compiled for x86_64 and arm64.

> signature verification.  And the related arch kexec_file kconfig can
> just select it.
> 
> Coiby, can you try the above?
> 
> >  #endif
> >  
> >  /*
> > -- 
> > 2.33.0
> > 
> 
> Thanks
> Dave
> 




Re: [PATCH 1/2] kexec, KEYS: make the code in bzImage64_verify_sig public

2021-09-30 Thread Dave Young
Hi Coiby,
On 09/27/21 at 08:50am, Coiby Xu wrote:
> From: Coiby Xu 
> 
> The code in bzImage64_verify_sig could make use of system keyrings including
> .builtin_trusted_keys, .secondary_trusted_keys and .platform keyring to verify
> signed kernel image as PE file. Move it to a public function.
> 
> Signed-off-by: Coiby Xu 
> ---
>  arch/x86/kernel/kexec-bzimage64.c | 13 +
>  include/linux/kexec.h |  3 +++
>  kernel/kexec_file.c   | 15 +++
>  3 files changed, 19 insertions(+), 12 deletions(-)
> 
> diff --git a/arch/x86/kernel/kexec-bzimage64.c b/arch/x86/kernel/kexec-bzimage64.c
> index 170d0fd68b1f..4136dd3be5a9 100644
> --- a/arch/x86/kernel/kexec-bzimage64.c
> +++ b/arch/x86/kernel/kexec-bzimage64.c
> @@ -17,7 +17,6 @@
>  #include 
>  #include 
>  #include 
> -#include 
>  
>  #include 
>  #include 
> @@ -531,17 +530,7 @@ static int bzImage64_cleanup(void *loader_data)
>  #ifdef CONFIG_KEXEC_BZIMAGE_VERIFY_SIG
>  static int bzImage64_verify_sig(const char *kernel, unsigned long kernel_len)
>  {
> - int ret;
> -
> - ret = verify_pefile_signature(kernel, kernel_len,
> -   VERIFY_USE_SECONDARY_KEYRING,
> -   VERIFYING_KEXEC_PE_SIGNATURE);
> - if (ret == -ENOKEY && IS_ENABLED(CONFIG_INTEGRITY_PLATFORM_KEYRING)) {
> - ret = verify_pefile_signature(kernel, kernel_len,
> -   VERIFY_USE_PLATFORM_KEYRING,
> -   VERIFYING_KEXEC_PE_SIGNATURE);
> - }
> - return ret;
> + return arch_kexec_kernel_verify_pe_sig(kernel, kernel_len);
>  }
>  #endif
>  
> diff --git a/include/linux/kexec.h b/include/linux/kexec.h
> index 0c994ae37729..d45f32336dbe 100644
> --- a/include/linux/kexec.h
> +++ b/include/linux/kexec.h
> @@ -19,6 +19,7 @@
>  #include 
>  
>  #include 
> +#include 
>  
>  #ifdef CONFIG_KEXEC_CORE
>  #include 
> @@ -199,6 +200,8 @@ int arch_kimage_file_post_load_cleanup(struct kimage *image);
>  #ifdef CONFIG_KEXEC_SIG
>  int arch_kexec_kernel_verify_sig(struct kimage *image, void *buf,
>unsigned long buf_len);
> +int arch_kexec_kernel_verify_pe_sig(const char *kernel,
> + unsigned long kernel_len);
>  #endif
>  int arch_kexec_locate_mem_hole(struct kexec_buf *kbuf);
>  
> diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
> index 33400ff051a8..85ed6984ad8f 100644
> --- a/kernel/kexec_file.c
> +++ b/kernel/kexec_file.c
> @@ -106,6 +106,21 @@ int __weak arch_kexec_kernel_verify_sig(struct kimage *image, void *buf,
>  {
>   return kexec_image_verify_sig_default(image, buf, buf_len);
>  }
> +
> +int arch_kexec_kernel_verify_pe_sig(const char *kernel, unsigned long kernel_len)
> +{
> + int ret;
> +
> + ret = verify_pefile_signature(kernel, kernel_len,
> +   VERIFY_USE_SECONDARY_KEYRING,
> +   VERIFYING_KEXEC_PE_SIGNATURE);
> + if (ret == -ENOKEY && IS_ENABLED(CONFIG_INTEGRITY_PLATFORM_KEYRING)) {
> + ret = verify_pefile_signature(kernel, kernel_len,
> +   VERIFY_USE_PLATFORM_KEYRING,
> +   VERIFYING_KEXEC_PE_SIGNATURE);
> + }
> + return ret;
> +}

Since the function has been moved into generic code, the kconfig option
CONFIG_KEXEC_BZIMAGE_VERIFY_SIG can be removed.

Instead, a CONFIG_KEXEC_PEFILE_VERIFY_SIG option can be added so that the
code is only compiled on platforms which support UEFI PE file signature
verification.  The related arch kexec_file kconfig options can then just
select it.

Coiby, can you try the above?
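Whatever the new Kconfig symbol ends up being called, the code it would
gate is the two-step keyring lookup moved by this patch.  A userspace
model of that control flow, assuming a hypothetical demo_verify() in place
of verify_pefile_signature() and a bool in place of
IS_ENABLED(CONFIG_INTEGRITY_PLATFORM_KEYRING):

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#ifndef ENOKEY
#define ENOKEY 126	/* Linux value, for C libraries that lack it */
#endif

/* Stand-ins for the keyrings a kexec PE signature may be checked against. */
enum demo_keyring { DEMO_SECONDARY_KEYRING, DEMO_PLATFORM_KEYRING };

/* Hypothetical verifier: succeeds only if the signing key is in @ring. */
static int demo_verify(enum demo_keyring signer, enum demo_keyring ring)
{
	return signer == ring ? 0 : -ENOKEY;
}

/*
 * Same shape as the helper in the patch: try the secondary keyring first
 * and, only on -ENOKEY, retry with the platform keyring when it is built in.
 */
static int demo_verify_pe_sig(enum demo_keyring signer,
			      bool platform_keyring_enabled)
{
	int ret = demo_verify(signer, DEMO_SECONDARY_KEYRING);

	if (ret == -ENOKEY && platform_keyring_enabled)
		ret = demo_verify(signer, DEMO_PLATFORM_KEYRING);
	return ret;
}
```

Only -ENOKEY triggers the retry, so a signature that is present but bad
is not silently re-checked against another keyring.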

>  #endif
>  
>  /*
> -- 
> 2.33.0
> 

Thanks
Dave




Re: [PATCH linux-next] include:crash_dump: fix boolreturn.cocci warnings

2021-08-26 Thread Dave Young
On 08/23/21 at 11:01pm, CGEL wrote:
> From: Jing Yangyang 
> 
> ./include/linux/crash_dump.h:98:50-51:WARNING: return of 0/1 in
> function 'is_kdump_kernel' with return type bool
> 
> Return statements in functions returning bool should use true/false
> instead of 1/0.
> 
> Generated by: scripts/coccinelle/misc/boolreturn.cocci
> 
> Reported-by: Zeal Robot 
> Signed-off-by: Jing Yangyang 
> ---
>  include/linux/crash_dump.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/include/linux/crash_dump.h b/include/linux/crash_dump.h
> index a5192b7..f92ebfe 100644
> --- a/include/linux/crash_dump.h
> +++ b/include/linux/crash_dump.h
> @@ -95,7 +95,7 @@ static inline void vmcore_unusable(void)
>  extern void unregister_oldmem_pfn_is_ram(void);
>  
>  #else /* !CONFIG_CRASH_DUMP */
> -static inline bool is_kdump_kernel(void) { return 0; }
> +static inline bool is_kdump_kernel(void) { return false; }
>  #endif /* CONFIG_CRASH_DUMP */
>  
>  /* Device Dump information to be filled by drivers */
> -- 
> 1.8.3.1
> 
> 

Acked-by: Dave Young 
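The pattern being fixed, sketched in userspace: with crash dump support
built in, the kernel's is_kdump_kernel() compares elfcorehdr_addr against
ELFCORE_ADDR_MAX, and without it a stub returns a bool literal.
DEMO_CRASH_DUMP is a hypothetical stand-in for CONFIG_CRASH_DUMP and is
left undefined here, so the stub is what gets compiled:

```c
#include <assert.h>
#include <stdbool.h>

#ifdef DEMO_CRASH_DUMP
#define DEMO_ELFCORE_ADDR_MAX (-1ULL)
/* In the real kernel this is set from the elfcorehdr= boot parameter. */
static unsigned long long demo_elfcorehdr_addr = DEMO_ELFCORE_ADDR_MAX;

static inline bool is_kdump_kernel(void)
{
	return demo_elfcorehdr_addr != DEMO_ELFCORE_ADDR_MAX;
}
#else
/* Stub: return false rather than 0, as the patch does. */
static inline bool is_kdump_kernel(void) { return false; }
#endif
```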

Thanks
Dave




Re: Panic on ppc64le using kernel 5.13.0-rc3

2021-06-15 Thread Dave Young
[re-add kexec/petitboot list]
On 06/15/21 at 11:20am, Dave Young wrote:
> Hi Bruno,
> 
> [Cc'ed kexec and petitboot list]
> On 06/14/21 at 10:03am, Bruno Goncalves wrote:
> > On Mon, Jun 14, 2021 at 7:47 AM Bruno Goncalves  wrote:
> > >
> > > On Fri, Jun 11, 2021 at 11:49 PM Rasmus Villemoes
> > >  wrote:
> > > >
> > > > On 11/06/2021 17.06, Bruno Goncalves wrote:
> > > > > On Fri, Jun 11, 2021 at 9:13 AM Rasmus Villemoes
> > > > >  wrote:
> > > > >>
> > > > >> On 10/06/2021 17.14, Bruno Goncalves wrote:
> > > > >>> On Thu, Jun 10, 2021 at 3:02 PM Rasmus Villemoes
> > > > >>>  wrote:
> > > > >>>>
> > > > >>>> On 10/06/2021 13.47, Bruno Goncalves wrote:
> > > > >>>>> Hello,
> > > > >>>>>
> > > > >>>>> We've observed in some cases kernel panic when trying to boot on
> > > > >>>>> ppc64le using a kernel based on 5.13.0-rc3. We are not sure if it
> > > > >>>>> could be related to patch
> > > > >>>>> https://lore.kernel.org/lkml/20210313212528.2956377-2-li...@rasmusvillemoes.dk/
> > > > >>>>>
> > > > >>>>
> > > > >>>> Thanks for the report. It's possible, but I'll need some help from you
> > > > >>>> to get more info.
> > > > >>>>
> > > > >>>> First, can you send me the .config?
> > > > >>>
> > > > >>> The .config is on
> > > > >>> https://s3.us-east-1.amazonaws.com/arr-cki-prod-datawarehouse-public/datawarehouse-public/2021/06/09/317881801/build_ppc64le_redhat:1332368174/kernel-block-ppc64le-d3f02e52f5548006f04358d407bbb7fe51255c41.config
> > > > >>
> > > > >> Thanks.
> > > > >>
> > > > >>>>
> > > > >>>>>
> > > > >>>>> [1.516075] wait_for_initramfs() called before rootfs_initcalls
> > > > >>>>
> > > > >>>> This is likely because you have CONFIG_UEVENT_HELPER_PATH set to 
> > > > >>>> some
> > > > >>>> non-empty path (/sbin/hotplug perhaps). This did get reported once 
> > > > >>>> before:
> > > > >>>>
> > > > >>>
> > > > >>> CONFIG_UEVENT_HELPER_PATH is not set. In the .config we have "#
> > > > >>> CONFIG_UEVENT_HELPER is not set"
> > > > >>
> > > > >> OK. Then I assume some quite early initcall does a request_module() 
> > > > >> or
> > > > >> request_firmware() (or similar). I don't think this matters - that 
> > > > >> call
> > > > >> would be done before the initramfs was unpacked with or without my
> > > > >> patch, so it won't find anything in the empty rootfs. It's just my 
> > > > >> patch
> > > > >> added a note. But just to figure out where that triggers, can you do
> > > > >>
> > > > >> -   pr_warn_once("wait_for_initramfs() called before
> > > > >> rootfs_initcalls\n");
> > > > >> +   WARN_ONCE(1, "wait_for_initramfs() called before
> > > > >> rootfs_initcalls\n");
> > > > >>
> > > > >> in init/initramfs.c.
> > > > >>
> > > > >
> > > > > I've managed to reproduce the panic with the patch.
> > > > >
> > > > > [1.498654] NIP [c00137d4] wait_for_initramfs+0x94/0xa4
> > > > > [1.498661] LR [c00137d0] wait_for_initramfs+0x90/0xa4
> > > > > [1.498668] Call Trace:
> > > > > [1.498671] [c00027debd60] [c00137d0]
> > > > > wait_for_initramfs+0x90/0xa4 (unreliable)
> > > > > [1.498680] [c00027debdc0] [c0172fc8]
> > > > > call_usermodehelper_exec_async+0x178/0x2c0
> > > > > [1.498691] [c00027debe10] [c000d6ec]
> > > > > ret_from_kernel_thread+0x5c/0x70
> > > >
> > > > Thanks, but unfortunately (and I should have known better) that doesn't
> > > > tell us who actually initated that call_usermodehelper - it's m

Re: [PATCH v1 0/2] firmware: dmi_scan: Make it work in kexec'ed kernel

2021-06-11 Thread Dave Young
> Probably it is doable to have kexec on 32bit efi working
> without runtime service support, that means no need the trick of fixed
> mapping.
> 
> If I can restore my vm to boot 32bit efi on this weekend then I may provide 
> some draft
> patches for test.

Unfortunately I failed to set up a 32-bit EFI guest, so here are some
untested draft patches; please give them a try.

= kernel draft patch ==

---
 arch/x86/boot/header.S |2 +-
 arch/x86/platform/efi/efi.c|6 +++---
 arch/x86/platform/efi/quirks.c |3 ---
 3 files changed, 4 insertions(+), 7 deletions(-)

--- linux-x86.orig/arch/x86/boot/header.S
+++ linux-x86/arch/x86/boot/header.S
@@ -416,7 +416,7 @@ xloadflags:
 # define XLF23 0
 #endif
 
-#if defined(CONFIG_X86_64) && defined(CONFIG_EFI) && defined(CONFIG_KEXEC_CORE)
+#if defined(CONFIG_EFI) && defined(CONFIG_KEXEC_CORE)
 # define XLF4 XLF_EFI_KEXEC
 #else
 # define XLF4 0
--- linux-x86.orig/arch/x86/platform/efi/efi.c
+++ linux-x86/arch/x86/platform/efi/efi.c
@@ -710,10 +710,10 @@ static void __init kexec_enter_virtual_m
unsigned int num_pages;
 
/*
-* We don't do virtual mode, since we don't do runtime services, on
-* non-native EFI.
+* We don't do virtual mode, since we don't do runtime services
+* on non-native or IA32 EFI.
 */
-   if (efi_is_mixed()) {
+   if (!efi_enabled(EFI_64BIT)) {
efi_memmap_unmap();
clear_bit(EFI_RUNTIME_SERVICES, );
return;
--- linux-x86.orig/arch/x86/platform/efi/quirks.c
+++ linux-x86/arch/x86/platform/efi/quirks.c
@@ -524,9 +524,6 @@ int __init efi_reuse_config(u64 tables,
if (!efi_setup)
return 0;
 
-   if (!efi_enabled(EFI_64BIT))
-   return 0;
-
data = early_memremap(efi_setup, sizeof(*data));
if (!data) {
ret = -ENOMEM;


= kexec-tools draft patch =

---
 kexec/arch/i386/kexec-bzImage.c |5 +
 1 file changed, 5 insertions(+)

--- kexec-tools.orig/kexec/arch/i386/kexec-bzImage.c
+++ kexec-tools/kexec/arch/i386/kexec-bzImage.c
@@ -83,6 +83,11 @@ int bzImage_probe(const char *buf, off_t
if (probe_debug) {
fprintf(stderr, "It's a bzImage\n");
}
+
+#define XLF_EFI_KEXEC   (1 << 4)
+   if ((header->xloadflags & XLF_EFI_KEXEC) == XLF_EFI_KEXEC)
+   bzImage_support_efi_boot = 1;
+
return 0;
 }
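
The kexec-tools hunk above keys EFI boot support off bit 4 (XLF_EFI_KEXEC) of the image's xloadflags. As a standalone sketch of that test, the C below reads the field straight from a bzImage buffer, assuming the setup-header offsets from the x86 boot protocol document (signature "HdrS" at 0x202, protocol version at 0x206, xloadflags at 0x236, present from protocol 2.12). The function names are hypothetical; kexec-tools itself reads the field through its own parsed header struct, so this is not its code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Setup-header offsets, assumed from the x86 boot protocol document. */
#define HDR_MAGIC_OFF       0x202   /* "HdrS" signature */
#define HDR_VERSION_OFF     0x206   /* protocol version, u16 little-endian */
#define HDR_XLOADFLAGS_OFF  0x236   /* xloadflags, u16 LE, protocol >= 2.12 */
#define XLF_EFI_KEXEC       (1 << 4)

static uint16_t read_le16(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

/* True if the bzImage in buf advertises XLF_EFI_KEXEC (hypothetical helper). */
static bool bzimage_supports_efi_kexec(const uint8_t *buf, size_t len)
{
    assert(buf != NULL);
    if (len < HDR_XLOADFLAGS_OFF + 2)
        return false;                   /* too short to hold xloadflags */
    if (memcmp(buf + HDR_MAGIC_OFF, "HdrS", 4) != 0)
        return false;                   /* no setup-header signature */
    if (read_le16(buf + HDR_VERSION_OFF) < 0x020c)
        return false;                   /* xloadflags only exists from 2.12 */
    return (read_le16(buf + HDR_XLOADFLAGS_OFF) & XLF_EFI_KEXEC) != 0;
}
```

With the header.S change in the kernel draft patch, a 32-bit EFI build with CONFIG_KEXEC_CORE would also set this bit, which is what lets the kexec-tools side enable EFI boot for it.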
 




Re: [PATCH v3] Documentation: kdump: update kdump guide

2021-06-09 Thread Dave Young
   (this is the "rescue" case)
> -2) if the RAM size is between 512M and 2G (exclusive), then reserve 64M
> -3) if the RAM size is larger than 2G, then reserve 128M
> +   While the "crashkernel=size[@offset]" syntax is sufficient for most
> +   configurations, sometimes it's handy to have the reserved memory dependent
> +   on the value of System RAM -- that's mostly for distributors that 
> pre-setup
> +   the kernel command line to avoid a unbootable system after some memory has
> +   been removed from the machine.
>  
> +   The syntax is::
>  
> +   crashkernel=:[,:,...][@offset]
> +   range=start-[end]
>  
> -Boot into System Kernel
> -===
> +   For example::
> +
> +   crashkernel=512M-2G:64M,2G-:128M
> +
> +   This would mean:
>  
> +   1) if the RAM is smaller than 512M, then don't reserve anything
> +  (this is the "rescue" case)
> +   2) if the RAM size is between 512M and 2G (exclusive), then reserve 
> 64M
> +   3) if the RAM size is larger than 2G, then reserve 128M
> +
> +3) crashkernel=size,high and crashkernel=size,low
> +
> +   If memory above 4G is preferred, crashkernel=size,high can be used to
> +   fulfill that. With it, physical memory is allowed to be allocated from 
> top,
> +   so could be above 4G if system has more than 4G RAM installed. Otherwise,
> +   memory region will be allocated below 4G if available.
> +
> +   When crashkernel=X,high is passed, kernel could allocate physical memory
> +   region above 4G, low memory under 4G is needed in this case. There are
> +   three ways to get low memory:
> +
> +  1) Kernel will allocate at least 256M memory below 4G automatically
> + if crashkernel=Y,low is not specified.
> +  2) Let user specify low memory size instead.
> +  3) Specified value 0 will disable low memory allocation::
> +
> +crashkernel=0,low
> +
> +Boot into System Kernel
> +---
>  1) Update the boot loader (such as grub, yaboot, or lilo) configuration
> files as necessary.
>  
> -2) Boot the system kernel with the boot parameter "crashkernel=Y@X",
> -   where Y specifies how much memory to reserve for the dump-capture kernel
> -   and X specifies the beginning of this reserved memory. For example,
> -   "crashkernel=64M@16M" tells the system kernel to reserve 64 MB of memory
> -   starting at physical address 0x0100 (16MB) for the dump-capture 
> kernel.
> +2) Boot the system kernel with the boot parameter "crashkernel=Y@X".
>  
> -   On x86 and x86_64, use "crashkernel=64M@16M".
> +   On x86 and x86_64, use "crashkernel=Y[@X]". Most of the time, the
> +   start address 'X' is not necessary, kernel will search a suitable
> +   area. Unless an explicit start address is expected.
>  
> On ppc64, use "crashkernel=128M@32M".
>  
> @@ -331,8 +375,8 @@ of dump-capture kernel. Following is the summary.
>  
>  For i386 and x86_64:
>  
> - - Use vmlinux if kernel is not relocatable.
>   - Use bzImage/vmlinuz if kernel is relocatable.
> + - Use vmlinux if kernel is not relocatable.
>  
>  For ppc64:
>  
> @@ -392,7 +436,7 @@ loading dump-capture kernel.
>  
>  For i386, x86_64 and ia64:
>  
> - "1 irqpoll maxcpus=1 reset_devices"
> + "1 irqpoll nr_cpus=1 reset_devices"
>  
>  For ppc64:
>  
> @@ -400,7 +444,7 @@ For ppc64:
>  
>  For s390x:
>  
> - "1 maxcpus=1 cgroup_disable=memory"
> + "1 nr_cpus=1 cgroup_disable=memory"
>  
>  For arm:
>  
> @@ -408,7 +452,7 @@ For arm:
>  
>  For arm64:
>  
> - "1 maxcpus=1 reset_devices"
> + "1 nr_cpus=1 reset_devices"
>  
>  Notes on loading the dump-capture kernel:
>  
> @@ -488,6 +532,10 @@ the following command::
>  
> cp /proc/vmcore 
>  
> +You can also use makedumpfile utility to write out the dump file
> +with specified options to filter out unwanted contents, e.g::
> +
> +   makedumpfile -l --message-level 1 -d 31 /proc/vmcore 
>  
>  Analysis
>  
> @@ -535,8 +583,7 @@ This will cause a kdump to occur at the 
> add_taint()->panic() call.
>  Contact
>  ===
>  
> -- Vivek Goyal (vgo...@redhat.com)
> -- Maneesh Soni (mane...@in.ibm.com)
> +- kexec@lists.infradead.org
>  
>  GDB macros
>  ==
> -- 
> 2.17.2
> 
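
The range syntax and the ,high/,low rules described in the patch above can be modelled in a few lines of C. This is a minimal userspace sketch, not the kernel's parser (that is parse_crashkernel_mem() and related helpers); the function and struct names are hypothetical. It assumes a range matches when start <= RAM < end, as the "(exclusive)" wording suggests, so exactly 2G falls in the "2G-" range. The 256M low-memory default is the floor quoted in the text; the kernel may reserve more, and fallback behaviour is ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Parse "512M", "2G", ... into bytes and advance *p past the suffix. */
static uint64_t parse_size(const char **p)
{
    char *end;
    uint64_t v = strtoull(*p, &end, 10);

    switch (*end) {
    case 'G': v <<= 10; /* fall through */
    case 'M': v <<= 10; /* fall through */
    case 'K': v <<= 10; end++; break;
    default: break;
    }
    *p = end;
    return v;
}

/*
 * Range form: "start-[end]:size[,start-[end]:size...]".
 * An empty end is unbounded. Returns 0 when no range matches (nothing
 * is reserved). Parsing stops at anything else, e.g. an "@offset".
 */
static uint64_t crashkernel_size(const char *spec, uint64_t system_ram)
{
    const char *p = spec;

    assert(spec != NULL);
    while (*p) {
        uint64_t start = parse_size(&p), end = UINT64_MAX, size;

        if (*p++ != '-')
            return 0;               /* malformed range */
        if (*p != ':')
            end = parse_size(&p);
        if (*p++ != ':')
            return 0;               /* missing ":size" */
        size = parse_size(&p);
        if (system_ram >= start && system_ram < end)
            return size;
        if (*p != ',')
            break;
        p++;
    }
    return 0;
}

/* Parsed crashkernel=X,high / crashkernel=Y,low options (hypothetical). */
struct crash_opts {
    bool high;          /* crashkernel=X,high was given */
    bool low_given;     /* crashkernel=Y,low was given */
    uint64_t low_size;  /* Y; crashkernel=0,low disables the low region */
};

/* Size of the extra below-4G region that accompanies a ,high reservation. */
static uint64_t crash_low_size(const struct crash_opts *o)
{
    assert(o != NULL);
    if (!o->high)
        return 0;               /* no separate low region */
    if (o->low_given)
        return o->low_size;     /* user-specified, possibly 0 */
    return 256ULL << 20;        /* floor quoted in the text */
}
```

For example, with crashkernel=512M-2G:64M,2G-:128M a 1G machine gets 64M, a 16G machine gets 128M, and a 256M machine gets nothing (the "rescue" case).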

Acked-by: Dave Young 

Thanks
Dave
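
The "-d 31" passed to makedumpfile in the patch is a bitmask: each of five bits excludes one page type, and 31 sets all of them, which is why only kernel data is left. A small decoder, assuming the bit meanings documented in makedumpfile(8); the helper name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Page type excluded by each -d bit, in bit order (assumed from makedumpfile(8)). */
static const char *const dump_level_bits[] = {
    "zero pages",                  /* 1 */
    "cache pages without private", /* 2 */
    "cache pages with private",    /* 4 */
    "user process data pages",     /* 8 */
    "free pages",                  /* 16 */
};

#define NBITS (sizeof(dump_level_bits) / sizeof(dump_level_bits[0]))

/* Print the page types excluded at 'level'; returns how many were listed. */
static int describe_dump_level(unsigned int level, FILE *out)
{
    int n = 0;

    assert(level <= 31);
    for (size_t i = 0; i < NBITS; i++) {
        if (level & (1u << i)) {
            fprintf(out, "excluded: %s\n", dump_level_bits[i]);
            n++;
        }
    }
    return n;
}
```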




Re: [PATCH v1 0/2] firmware: dmi_scan: Make it work in kexec'ed kernel

2021-06-09 Thread Dave Young
On 06/08/21 at 03:38pm, Andy Shevchenko wrote:
> On Tue, Jun 8, 2021 at 3:29 PM Dave Young  wrote:
> > On 06/07/21 at 08:18pm, Andy Shevchenko wrote:
> > > On Mon, Jun 07, 2021 at 07:22:21PM +0300, Andy Shevchenko wrote:
> > > > On Sat, Jun 05, 2021 at 03:51:05PM +0800, Dave Young wrote:
> > > > > On 06/02/21 at 11:53am, Andy Shevchenko wrote:
> > > > > > On Wed, Jun 02, 2021 at 11:42:14AM +0300, Andy Shevchenko wrote:
> > > > > > > On Fri, Dec 02, 2016 at 09:54:14PM +0200, Andy Shevchenko wrote:
> > > > > > > > Until now DMI information is lost when kexec'ing. Fix this in 
> > > > > > > > the same way as
> > > > > > > > it has been done for ACPI RSDP.
> > > > > > > >
> > > > > > > > Series has been tested on Galileo Gen2 where DMI is used by 
> > > > > > > > drivers, in
> > > > > > > > particular the default I2C host speed is choosen based on DMI 
> > > > > > > > system
> > > > > > > > information and now gets it correct.
> > > > > > >
> > > > > > > Still nothing happens for a while and problem still exists.
> > > > > > > Can we do something about it, please?
> > > > >
> > > > > Seems I totally missed this thread. Old emails lost.
> > > >
> > > > You can always access to it via lore :-)
> > > > https://lore.kernel.org/linux-efi/20161217105721.gb6...@dhcp-128-65.nay.redhat.com/T/#u
> >
> > Thanks.  Hmm, this is for 32bit efi.  kexec efi boot support was only
> > added for 64bit. So if 32bit dmidecode does not work I'm not surprise.
> >
> > > >
> > > > (Okay, it's not full, but contains main parts anyway)
> > > >
> > > >
> > > > > The question Ard asked is to confirm if the firmware converted the
> > > > > SMBIOS3 addr to a virtual address after exit boot service. I do not
> > > > > remember some easy way to check it due to lost the context of the 
> > > > > code.
> > > > > But you can try to check it via dmesg|grep SMBIOS both in normal boot
> > > > > and kexeced boot log.  And then compare if those addresses are
> > > > > identical.
> > > > >
> > > > > If the SMBIOS3 addr in kexec kernel is different then it should have
> > > > > been modified by firmware. Then we need patch kernel and kexec-tools 
> > > > > to
> > > > > support it.
> > > > >
> > > > > You can try below patch to see if it works:
> > > >
> > > > So, AFAIU I have to apply patch to kexec tools for the fist kernel + 
> > > > userspace
> > > > and apply kernel patch for the second kernel? Or it's all for the first 
> > > > one?
> > > >
> > > > > apply a kexec-tools patch to kexec-tools if you do not use kexec -s
> > > > > (kexec_file_load):
> > > >
> > > > Here is how we are using it:
> > > > https://github.com/andy-shev/buildroot/blob/intel/board/intel/common/netboot/udhcpc-script.sh#L54
> > >
> > > Okay, thanks for the patches. I have applied them to both kernels, so the 
> > > first
> > > one and second one are the same and kexec tools have a patch provided in 
> > > the
> > > user space of the both kernels (only first one in use though).
> > >
> > > Before applying your patch, I have reverted my hacks (as per this series).
> > >
> > > Result is:
> > >
> > > # uname -a
> > > Linux buildroot 5.13.0-rc5+ #1 SMP Mon Jun 7 19:49:40 EEST 2021 i586 
> > > GNU/Linux
> > > # dmidecode
> > > # dmidecode 3.3
> > > Scanning /dev/mem for entry point.
> > > # No SMBIOS nor DMI entry point found, sorry.
> > >
> > > I.o.w. it does NOT fix the issue. My patches do (with a hint from user 
> > > space).
> >
> > As I said, since it is 32bit efi, so your test results are expected,
> > also no need to check the kernel log about SMBIOS3 address changed or
> > not.
> 
> So, what shall I do? It's already 5 years passed without any progress
> while my patches definitely help here.
> Should I rebase and resubmit?

It is probably doable to get kexec working on 32-bit EFI without runtime
service support, which means there is no need for the fixed-mapping trick.

If I can restore my VM to boot 32-bit EFI this weekend, I may provide some
draft patches for testing.

> 
> -- 
> With Best Regards,
> Andy Shevchenko
> 

Thanks
Dave




Re: [PATCH v1 0/2] firmware: dmi_scan: Make it work in kexec'ed kernel

2021-06-08 Thread Dave Young
On 06/07/21 at 08:18pm, Andy Shevchenko wrote:
> On Mon, Jun 07, 2021 at 07:22:21PM +0300, Andy Shevchenko wrote:
> > On Sat, Jun 05, 2021 at 03:51:05PM +0800, Dave Young wrote:
> > > On 06/02/21 at 11:53am, Andy Shevchenko wrote:
> > > > On Wed, Jun 02, 2021 at 11:42:14AM +0300, Andy Shevchenko wrote:
> > > > > On Fri, Dec 02, 2016 at 09:54:14PM +0200, Andy Shevchenko wrote:
> > > > > > Until now DMI information is lost when kexec'ing. Fix this in the 
> > > > > > same way as
> > > > > > it has been done for ACPI RSDP.
> > > > > > 
> > > > > > Series has been tested on Galileo Gen2 where DMI is used by 
> > > > > > drivers, in
> > > > > > particular the default I2C host speed is choosen based on DMI system
> > > > > > information and now gets it correct.
> > > > > 
> > > > > Still nothing happens for a while and problem still exists.
> > > > > Can we do something about it, please?
> > > 
> > > Seems I totally missed this thread. Old emails lost.
> > 
> > You can always access to it via lore :-)
> > https://lore.kernel.org/linux-efi/20161217105721.gb6...@dhcp-128-65.nay.redhat.com/T/#u

Thanks. Hmm, this is for 32-bit EFI. kexec EFI boot support was only
added for 64-bit, so I'm not surprised if dmidecode does not work on 32-bit.

> > 
> > (Okay, it's not full, but contains main parts anyway)
> > 
> > 
> > > The question Ard asked is to confirm if the firmware converted the
> > > SMBIOS3 addr to a virtual address after exit boot service. I do not
> > > remember some easy way to check it due to lost the context of the code.
> > > But you can try to check it via dmesg|grep SMBIOS both in normal boot
> > > and kexeced boot log.  And then compare if those addresses are
> > > identical.
> > > 
> > > If the SMBIOS3 addr in kexec kernel is different then it should have
> > > been modified by firmware. Then we need patch kernel and kexec-tools to
> > > support it.
> > > 
> > > You can try below patch to see if it works:
> > 
> > So, AFAIU I have to apply patch to kexec tools for the fist kernel + 
> > userspace
> > and apply kernel patch for the second kernel? Or it's all for the first one?
> > 
> > > apply a kexec-tools patch to kexec-tools if you do not use kexec -s
> > > (kexec_file_load):
> > 
> > Here is how we are using it:
> > https://github.com/andy-shev/buildroot/blob/intel/board/intel/common/netboot/udhcpc-script.sh#L54
> 
> Okay, thanks for the patches. I have applied them to both kernels, so the 
> first
> one and second one are the same and kexec tools have a patch provided in the
> user space of the both kernels (only first one in use though).
> 
> Before applying your patch, I have reverted my hacks (as per this series).
> 
> Result is:
> 
> # uname -a
> Linux buildroot 5.13.0-rc5+ #1 SMP Mon Jun 7 19:49:40 EEST 2021 i586 GNU/Linux
> # dmidecode
> # dmidecode 3.3
> Scanning /dev/mem for entry point.
> # No SMBIOS nor DMI entry point found, sorry.
> 
> I.o.w. it does NOT fix the issue. My patches do (with a hint from user space).

As I said, since this is 32-bit EFI, your test results are expected, and
there is no need to check the kernel log for whether the SMBIOS3 address
changed.

> 
> -- 
> With Best Regards,
> Andy Shevchenko
> 
> 

Thanks
Dave




Re: [PATCH v1 0/2] firmware: dmi_scan: Make it work in kexec'ed kernel

2021-06-05 Thread Dave Young
Hi,
On 06/02/21 at 11:53am, Andy Shevchenko wrote:
> +Cc: Ard
> 
> On Wed, Jun 02, 2021 at 11:42:14AM +0300, Andy Shevchenko wrote:
> > On Fri, Dec 02, 2016 at 09:54:14PM +0200, Andy Shevchenko wrote:
> > > Until now DMI information is lost when kexec'ing. Fix this in the same 
> > > way as
> > > it has been done for ACPI RSDP.
> > > 
> > > Series has been tested on Galileo Gen2 where DMI is used by drivers, in
> > > particular the default I2C host speed is choosen based on DMI system
> > > information and now gets it correct.
> > 
> > Still nothing happens for a while and problem still exists.
> > Can we do something about it, please?

It seems I totally missed this thread; the old emails are lost.

The question Ard asked is whether the firmware converted the SMBIOS3
address to a virtual address after ExitBootServices(). I do not remember
an easy way to check that, since I have lost the context of the code,
but you can run dmesg | grep SMBIOS in both the normal boot and the
kexec'ed boot and compare whether the addresses are identical.

If the SMBIOS3 address in the kexec'ed kernel is different, then it has
presumably been modified by the firmware, and we need to patch the kernel
and kexec-tools to support it.

You can try the patches below to see if they work.

Apply this kexec-tools patch if you do not use kexec -s
(kexec_file_load):
--- kexec-tools.orig/kexec/arch/i386/x86-linux-setup.c
+++ kexec-tools/kexec/arch/i386/x86-linux-setup.c
@@ -533,7 +533,8 @@ struct efi_setup_data {
uint64_t runtime;
uint64_t tables;
uint64_t smbios;
-   uint64_t reserved[8];
+   uint64_t smbios3;
+   uint64_t reserved[7];
 };
 
 struct setup_data {
@@ -580,6 +581,8 @@ static int get_efi_values(struct efi_set
 
ret = get_efi_value("/sys/firmware/efi/systab", "SMBIOS=0x",
>smbios);
+   ret |= get_efi_value("/sys/firmware/efi/systab", "SMBIOS3=0x",
+   >smbios3);
ret |= get_efi_value("/sys/firmware/efi/fw_vendor", "0x",
 >fw_vendor);
ret |= get_efi_value("/sys/firmware/efi/runtime", "0x",

=
Kernel patch:

--- linux-x86.orig/arch/x86/include/asm/efi.h
+++ linux-x86/arch/x86/include/asm/efi.h
@@ -167,7 +167,8 @@ struct efi_setup_data {
u64 __unused;
u64 tables;
u64 smbios;
-   u64 reserved[8];
+   u64 smbios3;
+   u64 reserved[7];
 };
 
 extern u64 efi_setup;
--- linux-x86.orig/arch/x86/kernel/kexec-bzimage64.c
+++ linux-x86/arch/x86/kernel/kexec-bzimage64.c
@@ -144,6 +144,7 @@ prepare_add_efi_setup_data(struct boot_p
esd->fw_vendor = efi_fw_vendor;
esd->tables = efi_config_table;
esd->smbios = efi.smbios;
+   esd->smbios3 = efi.smbios3;
 
sd->type = SETUP_EFI;
sd->len = sizeof(struct efi_setup_data);
--- linux-x86.orig/arch/x86/platform/efi/quirks.c
+++ linux-x86/arch/x86/platform/efi/quirks.c
@@ -497,8 +497,8 @@ void __init efi_free_boot_services(void)
  * their physical addresses therefore we pass them via setup_data and
  * correct those entries to their respective physical addresses here.
  *
- * Currently only handles smbios which is necessary for some firmware
- * implementation.
+ * Currently only handles smbios and smbios3 which is necessary for
+ * some firmware implementation.
  */
 int __init efi_reuse_config(u64 tables, int nr_tables)
 {
@@ -521,7 +521,7 @@ int __init efi_reuse_config(u64 tables,
goto out;
}
 
-   if (!data->smbios)
+   if (!data->smbios  && !data->smbios3)
goto out_memremap;
 
sz = sizeof(efi_config_table_64_t);
@@ -538,8 +538,10 @@ int __init efi_reuse_config(u64 tables,
 
guid = ((efi_config_table_64_t *)p)->guid;
 
-   if (!efi_guidcmp(guid, SMBIOS_TABLE_GUID))
+   if (!efi_guidcmp(guid, SMBIOS_TABLE_GUID) && data->smbios)
((efi_config_table_64_t *)p)->table = data->smbios;
+   else if (!efi_guidcmp(guid, SMBIOS3_TABLE_GUID) && 
data->smbios3)
+   ((efi_config_table_64_t *)p)->table = data->smbios3;
p += sz;
}
early_memunmap(tablep, nr_tables * sz);
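
The quirks.c hunk rewrites the SMBIOS and SMBIOS3 entries of the EFI configuration table with the physical addresses handed over in setup_data, leaving an entry alone when its address is zero. A userspace sketch of that loop follows. The GUIDs are placeholders, not the real SMBIOS_TABLE_GUID/SMBIOS3_TABLE_GUID values, the type and function names are hypothetical, and the kernel version works on early_memremap()'d firmware memory rather than an array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* 16-byte GUID, standing in for the kernel's efi_guid_t. */
struct guid16 { uint8_t b[16]; };

/* Same shape as efi_config_table_64_t: a GUID plus a table address. */
struct config_table {
    struct guid16 guid;
    uint64_t table;
};

/* Placeholder GUIDs for illustration only; NOT the real SMBIOS GUIDs. */
static const struct guid16 SMBIOS_GUID  = { { 0x01 } };
static const struct guid16 SMBIOS3_GUID = { { 0x03 } };

static int guid_eq(const struct guid16 *a, const struct guid16 *b)
{
    return memcmp(a, b, sizeof(*a)) == 0;
}

/*
 * Overwrite SMBIOS/SMBIOS3 entries with the addresses passed via
 * setup_data. A zero address means "not provided", so that entry is
 * left untouched. Returns the number of entries patched.
 */
static int fixup_smbios(struct config_table *t, size_t n,
                        uint64_t smbios, uint64_t smbios3)
{
    int patched = 0;

    assert(t != NULL || n == 0);
    if (!smbios && !smbios3)
        return 0;               /* nothing to correct */
    for (size_t i = 0; i < n; i++) {
        if (guid_eq(&t[i].guid, &SMBIOS_GUID) && smbios) {
            t[i].table = smbios;
            patched++;
        } else if (guid_eq(&t[i].guid, &SMBIOS3_GUID) && smbios3) {
            t[i].table = smbios3;
            patched++;
        }
    }
    return patched;
}
```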


Thanks
Dave
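
Both patches replace reserved[8] with smbios3 plus reserved[7], so the setup_data record keeps its size and every existing field keeps its offset; an older reader just sees the new field as reserved[0]. The compile-time checks below illustrate this. They use the kexec-tools field names and assume a leading fw_vendor field that the hunk context does not show; the struct names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Layout before the patch (leading fw_vendor assumed). */
struct efi_setup_data_old {
    uint64_t fw_vendor;
    uint64_t runtime;
    uint64_t tables;
    uint64_t smbios;
    uint64_t reserved[8];
};

/* Layout after the patch: smbios3 takes the first reserved slot. */
struct efi_setup_data_new {
    uint64_t fw_vendor;
    uint64_t runtime;
    uint64_t tables;
    uint64_t smbios;
    uint64_t smbios3;
    uint64_t reserved[7];
};

/* The record size and existing offsets must not change. */
static_assert(sizeof(struct efi_setup_data_old) ==
              sizeof(struct efi_setup_data_new), "record size changed");
static_assert(offsetof(struct efi_setup_data_old, smbios) ==
              offsetof(struct efi_setup_data_new, smbios), "smbios moved");
static_assert(offsetof(struct efi_setup_data_new, smbios3) ==
              offsetof(struct efi_setup_data_old, reserved),
              "smbios3 is not the old reserved[0]");
```

The new kernel code treats smbios3 == 0 as "not provided", which presumes the writer zero-fills the reserved area; that is an assumption about callers not shown here.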




Re: [PATCH v2] Documentation: kdump: update kdump guide

2021-06-03 Thread Dave Young
Hi Baoquan,

Spell checking found a few issues; please see the comments inline.
Otherwise it looks good to me:

Acked-by: Dave Young 

On 06/03/21 at 12:30pm, Baoquan He wrote:
> Some parts of the guide are aged, hence need be updated.
> 
> 1) The backup area of the 1st 640K on X86_64 has been removed
>by below commits, update the description accordingly.
> 
>commit 7c321eb2b843 ("x86/kdump: Remove the backup region handling")
>commit 6f599d84231f ("x86/kdump: Always reserve the low 1M when the 
> crashkernel option is specified")
> 
> 2) Sort out the descripiton of "crashkernel syntax" part.
> 
> 3) And some other minor cleanups.
> 
> Signed-off-by: Baoquan He 
> ---
> v1->v2:
>  Update the obsolete descriptions about SMP and RELOCATABLE according
>  to Dave's comment.
> 
>  Documentation/admin-guide/kdump/kdump.rst | 165 ++
>  1 file changed, 106 insertions(+), 59 deletions(-)
> 
> diff --git a/Documentation/admin-guide/kdump/kdump.rst 
> b/Documentation/admin-guide/kdump/kdump.rst
> index 75a9dd98e76e..f83bf7bac503 100644
> --- a/Documentation/admin-guide/kdump/kdump.rst
> +++ b/Documentation/admin-guide/kdump/kdump.rst
> @@ -2,7 +2,7 @@
>  Documentation for Kdump - The kexec-based Crash Dumping Solution
>  
>  
> -This document includes overview, setup and installation, and analysis
> +This document includes overview, setup, installation, and analysis
>  information.
>  
>  Overview
> @@ -13,9 +13,9 @@ dump of the system kernel's memory needs to be taken (for 
> example, when
>  the system panics). The system kernel's memory image is preserved across
>  the reboot and is accessible to the dump-capture kernel.
>  
> -You can use common commands, such as cp and scp, to copy the
> -memory image to a dump file on the local disk, or across the network to
> -a remote system.
> +You can use common commands, such as cp, scp or makedumpfile to copy
> +the memory image to a dump file on the local disk, or across the network
> +to a remote system.
>  
>  Kdump and kexec are currently supported on the x86, x86_64, ppc64, ia64,
>  s390x, arm and arm64 architectures.
> @@ -27,12 +27,14 @@ The kexec -p command loads the dump-capture kernel into 
> this reserved
>  memory.
>  
>  On x86 machines, the first 640 KB of physical memory is needed to boot,

s/to boot/for boot

> -regardless of where the kernel loads. Therefore, kexec backs up this
> -region just before rebooting into the dump-capture kernel.
> +regardless of where the kernel loads. For simpler handling, the whole
> +low 1M is reserved to avoid any later kernel or device driver writing
> +data into this area. Like this, the low 1M can be reused as system RAM
> +by kdump kernel without extra handling.
>  
> -Similarly on PPC64 machines first 32KB of physical memory is needed for
> -booting regardless of where the kernel is loaded and to support 64K page
> -size kexec backs up the first 64KB memory.
> +On PPC64 machines first 32KB of physical memory is needed for booting
> +regardless of where the kernel is loaded and to support 64K page size
> +kexec backs up the first 64KB memory.
>  
>  For s390x, when kdump is triggered, the crashkernel region is exchanged
>  with the region [0, crashkernel region size] and then the kdump kernel
> @@ -46,14 +48,14 @@ passed to the dump-capture kernel through the elfcorehdr= 
> boot
>  parameter. Optionally the size of the ELF header can also be passed
>  when using the elfcorehdr=[size[KMG]@]offset[KMG] syntax.
>  
> -
>  With the dump-capture kernel, you can access the memory image through
>  /proc/vmcore. This exports the dump as an ELF-format file that you can
> -write out using file copy commands such as cp or scp. Further, you can
> -use analysis tools such as the GNU Debugger (GDB) and the Crash tool to
> -debug the dump file. This method ensures that the dump pages are correctly
> -ordered.
> -
> +write out using file copy commands such as cp or scp. You can also use
> +makedumpfile utility to analyze and write out filtered contents with
> +options, e.g with '-d 31' it will only write out kernel data. Further,
> +you can use analysis tools such as the GNU Debugger (GDB) and the Crash
> +tool to debug the dump file. This method ensures that the dump pages are
> +correctly ordered.
>  
>  Setup and Installation
>  ==
> @@ -125,9 +127,18 @@ dump-capture kernels for enabling kdump support.
>  System kernel config options
>  
>  
> -1) Enable "kexec system call" in "Processor type and features."::
> +1) Enable "kexec system call&qu

Re: [PATCH] Documentation: kdump: update kdump guide

2021-05-26 Thread Dave Young
Hi Baoquan,
On 05/26/21 at 03:11pm, Baoquan He wrote:
> On 05/25/21 at 07:41pm, Dave Young wrote:
> > Hi Baoquan,
> > > @@ -180,7 +191,7 @@ Dump-capture kernel config options (Arch Dependent, 
> > > i386 and x86_64)
> > >  
> > >   CONFIG_SMP=n
> > >  
> > > -   (If CONFIG_SMP=y, then specify maxcpus=1 on the kernel command line
> > > +   (If CONFIG_SMP=y, then specify nr_cpus=1 on the kernel command line
> > > when loading the dump-capture kernel, see section "Load the 
> > > Dump-capture
> > > Kernel".)
> > 
> > This part should be obsolete?  Since for X86_64 we can enable smp boot
> > with disable_cpu_apicid=X set (see the Notes on loading the dump-capture
> > kernel part)  So I think no need to disable CONFIG_SMP at all.  The
> > current RHEL use of nr_cpus=1 is just to save 2nd kernel memory use.
> 
> Keeping them because they are not wrong. Talking about default config,
> currently we only care about x86_64 mostly, not sure if we should remove
> i386 part too. Anyway, I am fine to remove them and the below
> relocatable thing.

I also agree it is not wrong :)  But I personally think the doc should
target the most common use cases. If CONFIG_SMP=n is not common, then we
may just describe the default CONFIG_SMP=y case and add some words for
the exception cases,

for example:
Specify nr_cpus=1 blabla
Note: if CONFIG_SMP is not set then nr_cpus=1 is not needed ...


> 
> > 
> > Ditto for the text for other arches, not sure if they need update
> > though, see if other maintainers can provide inputs..
> > 
> > 
> > Otherwise for the CONFIG_RELOCATABLE related part,  it may be better to
> > update as well? 
> > ''' quote:
> > 3) If one wants to build and use a relocatable kernel,
> >Enable "Build a relocatable kernel" support under "Processor type and
> >features"::
> > 
> > CONFIG_RELOCATABLE=y
> > 
> > 4) Use a suitable value for "Physical address where the kernel is
> >loaded" (under "Processor type and features"). This only appears when
> >"kernel crash dumps" is enabled. A suitable value depends upon
> >whether kernel is relocatable or not.
> > 
> >If you are using a relocatable kernel use CONFIG_PHYSICAL_START=0x10
> >This will compile the kernel for physical address 1MB, but given the fact
> >kernel is relocatable, it can be run from any physical address hence
> >kexec boot loader will load it in memory region reserved for dump-capture
> >kernel.
> > 
> >Otherwise it should be the start of memory region reserved for
> >second kernel using boot parameter "crashkernel=Y@X". Here X is
> >start of memory region reserved for dump-capture kernel.
> >Generally X is 16MB (0x100). So you can set
> >CONFIG_PHYSICAL_START=0x100
> > ''' end quote
> > 
> > Since relocatable kernel is used by default now so we may just not describe 
> > it as "If one
> > want to build with it =y", I feel it should be a corner case instead of
> > the default use case.   Maybe HPA, Vivek, Eric can provide more opinions 
> > since
> > they may know more about the background.  
> > 
> > >  
> ...  
> > > -Boot into System Kernel
> > > -===
> > > +   crashkernel=512M-2G:64M,2G-:128M
> > >  
> > > +   This would mean:
> > > +
> > > +   1) if the RAM is smaller than 512M, then don't reserve anything
> > > +  (this is the "rescue" case)
> > > +   2) if the RAM size is between 512M and 2G (exclusive), then 
> > > reserve 64M
> > > +   3) if the RAM size is larger than 2G, then reserve 128M
> > > +
> > > +3) crashkernel=size,high and crashkernel=size,low
> > > +
> > > +   If memory above 4G is preferred, crashkernel=size,high can be used to
> > > +   fulfill that. With it, physical memory is allowed to allocate from 
> > > top,
> > > +   so could be above 4G if system has more than 4G RAM installed. 
> > > Otherwise,
> > > +   memory region will be allocated below 4G if available.
> > > +
> > > +   When crashkernel=X,high is passed, kernel could allocate physical 
> > > memory
> > > +   region above 4G, low memory under 4G is needed in this case. There are
> > > +   three ways to get low memory:
> > > +
> > > +  1) Kernel will allocate at least 256M memory below 4G automatically
&

Re: [PATCH] Documentation: kdump: update kdump guide

2021-05-25 Thread Dave Young
Hi Baoquan,

Thanks for the update!  Since we are updating it, I added the arch
maintainers to see if they have any comments about the architectures
part.

I added a few comments inline, but I still want more input from other
people :)
On 05/20/21 at 06:37pm, Baoquan He wrote:
> Some parts of the guide are aged, hence need be updated.
> 
> 1) The backup area of the 1st 640K on X86_64 has been removed
>by below commits, update the description accordingly.
> 
>commit 7c321eb2b843 ("x86/kdump: Remove the backup region handling")
>commit 6f599d84231f ("x86/kdump: Always reserve the low 1M when the 
> crashkernel option is specified")
> 
> 2) Sort out the descripiton of "crashkernel syntax" part.
> 
> 3) And some other minor cleanups.
> 
> Signed-off-by: Baoquan He 
> ---
>  Documentation/admin-guide/kdump/kdump.rst | 150 ++
>  1 file changed, 97 insertions(+), 53 deletions(-)
> 
> diff --git a/Documentation/admin-guide/kdump/kdump.rst 
> b/Documentation/admin-guide/kdump/kdump.rst
> index 75a9dd98e76e..6d0dcf5b5e1f 100644
> --- a/Documentation/admin-guide/kdump/kdump.rst
> +++ b/Documentation/admin-guide/kdump/kdump.rst
> @@ -2,7 +2,7 @@
>  Documentation for Kdump - The kexec-based Crash Dumping Solution
>  
>  
> -This document includes overview, setup and installation, and analysis
> +This document includes overview, setup, installation, and analysis
>  information.
>  
>  Overview
> @@ -13,12 +13,12 @@ dump of the system kernel's memory needs to be taken (for 
> example, when
>  the system panics). The system kernel's memory image is preserved across
>  the reboot and is accessible to the dump-capture kernel.
>  
> -You can use common commands, such as cp and scp, to copy the
> -memory image to a dump file on the local disk, or across the network to
> -a remote system.
> +You can use common commands, such as cp, scp or makedumpfile to copy
> +the memory image to a dump file on the local disk, or across the network
> +to a remote system.
>  
> -Kdump and kexec are currently supported on the x86, x86_64, ppc64, ia64,
> -s390x, arm and arm64 architectures.
> +Kdump and kexec are currently supported on the x86/64, ppc64, ia64,
> +s390x, arm/64 architectures.
>  
>  When the system kernel boots, it reserves a small section of memory for
>  the dump-capture kernel. This ensures that ongoing Direct Memory Access
> @@ -27,12 +27,14 @@ The kexec -p command loads the dump-capture kernel into 
> this reserved
>  memory.
>  
>  On x86 machines, the first 640 KB of physical memory is needed to boot,
> -regardless of where the kernel loads. Therefore, kexec backs up this
> -region just before rebooting into the dump-capture kernel.
> +regardless of where the kernel loads. For simpler handling, the whole
> +low 1M is reserved to avoid any later kernel or device driver writing
> +data into this area. Like this, the low 1M can be reused as system RAM
> +by kdump kernel without extra handling.
>  
> -Similarly on PPC64 machines first 32KB of physical memory is needed for
> -booting regardless of where the kernel is loaded and to support 64K page
> -size kexec backs up the first 64KB memory.
> +On PPC64 machines first 32KB of physical memory is needed for booting
> +regardless of where the kernel is loaded and to support 64K page size
> +kexec backs up the first 64KB memory.
>  
>  For s390x, when kdump is triggered, the crashkernel region is exchanged
>  with the region [0, crashkernel region size] and then the kdump kernel
> @@ -46,14 +48,14 @@ passed to the dump-capture kernel through the elfcorehdr= boot
>  parameter. Optionally the size of the ELF header can also be passed
>  when using the elfcorehdr=[size[KMG]@]offset[KMG] syntax.
>  
> -
>  With the dump-capture kernel, you can access the memory image through
>  /proc/vmcore. This exports the dump as an ELF-format file that you can
> -write out using file copy commands such as cp or scp. Further, you can
> -use analysis tools such as the GNU Debugger (GDB) and the Crash tool to
> -debug the dump file. This method ensures that the dump pages are correctly
> -ordered.
> -
> +write out using file copy commands such as cp or scp. You can also use
> +makedumpfile utility to analyze and write out filtered contents with
> +options, e.g with '-d 31' it will only write out kernel data. Further,
> +you can use analysis tools such as the GNU Debugger (GDB) and the Crash
> +tool to debug the dump file. This method ensures that the dump pages are
> +correctly ordered.
>  
>  Setup and Installation
>  ==
> @@ -111,7 +113,7 @@ There are two possible methods of using Kdump.
>  2) Or use the system kernel binary itself as dump-capture kernel and there is
> no need to build a separate dump-capture kernel. This is possible
> only with the architectures which support a relocatable kernel. As
> -   of today, i386, x86_64, ppc64, ia64, arm and arm64 architectures support

Re: i386 kexec-tools for x86_64 kdump kernels

2021-05-19 Thread Dave Young
Hi Kevin,
On 05/17/21 at 09:40pm, Kevin Mitchell wrote:
> Hi,
> 
> As a space-saving strategy for our embedded boot environment, we use an i386
> kexec binary to load our x86_64 kdump kernel from an x86_64 system kernel.
> This worked great up until linux-5.2, which included the commit
> 
> 9ca5c8e632ce ("x86/kdump: Have crashkernel=X reserve under 4G by
> default")
> 
> Sure enough, according to /proc/iomem, the "Crash kernel" area went from
> starting at 0x3400 to 0x7b00, which is above the 896M
> limit. Unfortunately, since i386 kexec seems to use
> kexec/arch/i386/kexec-bzImage.c even to load an x86_64 kernel, the
> DEFAULT_BZIMAGE_ADDR_MAX = 0x37FF 896M limit is still enforced when loading
> the panic kernel:
> 
> # kexec32 --load-panic bzImage64
> Could not find a free area of memory of 0x8000 bytes...
> locate_hole failed
> 
> I can work around this by patching kexec-tools to raise that limit to
> DEFAULT_BZIMAGE_ADDR_MAX = 0x which allows loading the x86_64 kdump
> bzImage. This does in fact kexec fine from that position if I trigger a panic.
> 
> However, this doesn't appear to be a general solution since the 896M does still
> apply if either of the kernels is i386. In that case, attempting to kexec from
> the higher address will just hang with no console output. In this case, it
> probably is better to continue to fail to load the kdump image rather than 
> wait
> until the panic to find out something is wrong.

I'm not sure whether you can detect the kernel type and special-case
this in kexec-tools, e.g. if the second kernel is a 64-bit kernel, just
bump the address max; otherwise keep the original logic. If this is
doable, it would be a good approach IMO.
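A minimal sketch of that check, written against the x86 boot protocol setup header rather than the real kexec-tools code: it requires the "HdrS" magic and protocol 2.12 or later, then tests the XLF_KERNEL_64 bit in xloadflags (offset 0x236) before relaxing the limit. The helper names and both limits are hypothetical: the low limit is derived from the 896M figure above, and the 4G limit assumes the crashkernel region sits below 4G. As Kevin notes, the 896M limit still matters if the first kernel is i386, which this sketch does not check.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Offsets from the x86 Linux boot protocol setup header. */
#define HDR_MAGIC_OFF		0x202	/* "HdrS" */
#define HDR_VERSION_OFF		0x206	/* boot protocol version, little endian */
#define HDR_XLOADFLAGS_OFF	0x236	/* xloadflags, protocol 2.12+ */
#define XLF_KERNEL_64		(1u << 0)

/* Hypothetical limits: 896M - 1 for the old logic, 4G - 1 for 64-bit kernels. */
#define ADDR_MAX_LOW		((896ULL << 20) - 1)
#define ADDR_MAX_64BIT		((4ULL << 30) - 1)

static uint16_t get_le16(const unsigned char *p)
{
	return (uint16_t)(p[0] | (p[1] << 8));
}

/* Return 1 if the bzImage advertises the 64-bit entry point, else 0. */
int bzimage_is_64bit(const unsigned char *img, size_t len)
{
	if (len < HDR_XLOADFLAGS_OFF + 2)
		return 0;
	if (memcmp(img + HDR_MAGIC_OFF, "HdrS", 4) != 0)
		return 0;
	if (get_le16(img + HDR_VERSION_OFF) < 0x020c)
		return 0;	/* xloadflags does not exist before 2.12 */
	return (get_le16(img + HDR_XLOADFLAGS_OFF) & XLF_KERNEL_64) != 0;
}

/* Keep the original limit unless the second kernel is 64-bit. */
unsigned long long bzimage_addr_max(const unsigned char *img, size_t len)
{
	return bzimage_is_64bit(img, len) ? ADDR_MAX_64BIT : ADDR_MAX_LOW;
}
```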

See if Eric, Baoquan and other x86 people have more ideas.

> 
> Fortunately, while 9ca5c8e632ce allows an i386 kernel to reserve a "Crash
> kernel" region > 896M, it doesn't actually do that by default - I have to force
> it to go there with crashkernel=@. I am not sure if this is just a fluke or if
> there is something actually ensuring it defaults to a working
> location. Nevertheless, it appears the restriction removed by this commit is
> still required by i386 kernels. Its enforcement has just moved to userspace.
> 
> So it seems that the largest fallout of the commit is restricted to the
> admittedly niche combination linux-x86_64 -> kexec-i386 -> linux-x86_64(kdump),
> which no longer works out of the box without pinning the crashkernel address or
> patching kexec.
> 
> Is this just something we need to live with or is it worth looking into how to
> better support this combination?

This is a case I missed, but I would not consider it a common use
case. It would be better to leave the kernel as is and either fix this
in kexec-tools or just use the workaround.

> 
> Thanks,
> Kevin
> 

Thanks
Dave




Re: [patch 48/91] kernel/crash_core: add crashkernel=auto for vmcore creation

2021-05-18 Thread Dave Young
[Adding the kexec list; people interested in the earlier replies can find them in the linux-mm archive]
On 05/18/21 at 10:51am, David Hildenbrand wrote:
> On 18.05.21 10:49, Baoquan He wrote:
> > On 05/17/21 at 10:22am, David Hildenbrand wrote:
> > > On 12.05.21 16:51, Baoquan He wrote:
> > > > On 05/11/21 at 07:07pm, David Hildenbrand wrote:
> > > > > > > If the way adding default value into kernel config is disliked,
> > > > > > > this a) option looks good. We can get value with x% of system RAM, but
> > > > > > > clamp it with CRASH_KERNEL_MIN/MAX. The CRASH_KERNEL_MIN/MAX may need be
> > > > > > > defined with a default value for different ARCHes. It's very close to
> > > > > > > our current implementation, and handling 'auto' in kernel.
> > > > > > > 
> > > > > > > And kernel config provided so that people can tune the MIN/MAX value,
> > > > > > > but no need to post patch to do the tuning each time if have to?
> > > > > > Maybe I'm missing something, but the whole point is to avoid kernel
> > > > > > configuration option at all. If the crashkernel=auto works good for 99% of
> > > > > > the cases, there is no need to provide build time configuration along with
> > > > > > it. There are plenty of ways users can control crashkernel reservations
> > > > > > with the existing 2-4 (depending on architecture) command line options.
> > > > > > 
> > > > > > Simply hard coding a reasonable defaults (e.g.
> > > > > > "1G-64G:128M,64G-1T:256M,1T-:512M"), and using these defaults when
> > > > > > crashkernel=auto is set would cover the same 99% of users you referred to.
> > > > > 
> > > > > Right, and we can easily allocate a bit more as a safety net temporarily
> > > > > when we can actually shrink the area later.
> > > > > 
> > > > > > 
> > > > > > If we can resize the reservation later during boot this will also address
> > > > > > David's concern about the wasted memory.
> > > > > > 
> > > > > 
> > > > > Yes.
> > > > > 
> > > > > > You mentioned that amount of memory that is required for crash kernel
> > > > > > reservation depends on the devices present on the system. Is it possible to
> > > > > > detect how much memory is required at late stages of boot?
> > > > > 
> > > > > Here is my thinking:
> > > > > 
> > > > > There seems to be some kind of formula we can roughly use to come up with
> > > > > the final crashkernel size. Baoquan for sure knows all the dirty details, I
> > > > > assume it's roughly "core kernel + drivers + user space".
> > > > > 
> > > > > In the kernel, we can only come up with "core kernel + drivers" expecting
> > > > > that we will run
> > > > > 
> > > > > a) roughly the same kernel
> > > > > b) with roughly the same drivers
> > > > 
> > > > As replied to Mike, kernel size is undecided for different kernel with
> > > > different configs. We can define a default minimal size to cover kernel
> > > > and driver on systems with not many devices, but hardcoding the size
> > > > into upstream is not helpful. If the size is big, users will be asked to
> > > > check and shrink always. If the size is too small, a new value need be
> > > > got and added to cmdline and reboot.
> > > > 
> > > 
> > > Hi Baoquan, Kairui, Dave,
> > > 
> > > so IIUC now, our "old" kernel cannot actually tell us any reliable
> > > "crashkernel area size" because
> > > 
> > > a) it has no idea with which cmdline parameters the crashkernel will be
> > > started with, and these can have a big impact.
> > > b) it has no idea which driver will be loaded in the crashkernel.
> > > c) It has no idea what will be running in the crashkernel user space.
> > > 
> > > 
> > > AFAIKS, best we can do without further information is, therefore, use some
> > > heuristic to a) allocate some memory early during boot in the kernel and b)
> > > later refine our allocation, triggered by user space (-> shrink the
> > > crashkernel area).
> > > 
> > > I dislike calling a) "auto". It provides a default based on some heuristic
> > > (boot memory size), and that default might be very unfortunate in some
> > > scenarios (-> waste memory).
> > > 
> > > While we could discuss calling the current approach ( a)
> > > )"crashkernel=default", whereby the default is encoded at compile time as
> > > determined by a distributor, I still still quite don't like it because it
> > > feels like this is not necessary. We have a way to pass something like that
> > > via the cmdline, so it's just a matter of properly using that feature from
> > > user space.
> > > 
> > > 
> > > AFAIKS, all you want is most probably a more dynamic way to construct a
> > > kernel cmdline, with some properties specific to a kernel.
> > > 
> > > Let's assume the following:
> > > 
> > > a) When a distributor ships a kernel, he also ships some kind of defaults
> > > 

Re: [PATCH 1/2] firmware/efi: Tell memblock about EFI reservations

2021-05-12 Thread Dave Young
On 05/03/21 at 11:56am, Moritz Fischer wrote:
> Marc,
> 
> On Thu, Apr 29, 2021 at 02:35:32PM +0100, Marc Zyngier wrote:
> > kexec_load_file() relies on the memblock infrastructure to avoid
> > stamping over regions of memory that are essential to the survival
> > of the system.
> > 
> > However, nobody seems to agree how to flag these regions as reserved,
> > and (for example) EFI only publishes its reservations in /proc/iomem
> > for the benefit of the traditional, userspace based kexec tool.
> > 
> > On arm64 platforms with GICv3, this can result in the payload being
> > placed at the location of the LPI tables. Shock, horror!
> > 
> > Let's augment the EFI reservation code with a memblock_reserve() call,
> > protecting our dear tables from the secondary kernel invasion.
> > 
> > At some point, someone will have to go and figure out a way to unify
> > these multiple reservation trees, because sprinkling random reservation
> > calls is only a temporary workaround.
> > 
> 
> Feel free to add (and/or):
> 
> Reported-by: Moritz Fischer 
> Tested-by: Moritz Fischer 
> > Signed-off-by: Marc Zyngier 
> > ---
> >  drivers/firmware/efi/efi.c | 23 ++-
> >  1 file changed, 22 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
> > index 4b7ee3fa9224..026b02f5f7d8 100644
> > --- a/drivers/firmware/efi/efi.c
> > +++ b/drivers/firmware/efi/efi.c
> > @@ -896,11 +896,25 @@ static int __init efi_memreserve_map_root(void)
> >  static int efi_mem_reserve_iomem(phys_addr_t addr, u64 size)
> >  {
> > struct resource *res, *parent;
> > +   int ret;
> >  
> > res = kzalloc(sizeof(struct resource), GFP_ATOMIC);
> > if (!res)
> > return -ENOMEM;
> >  
> > +   /*
> > +* Given that efi_mem_reserve_iomem() can be called at any
> > +* time, only call memblock_reserve() if the architecture
> > +* keeps the infrastructure around.
> > +*/
> > +   if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK)) {
> > +   ret = memblock_reserve(addr, size);
> > +   if (ret) {
> > +   kfree(res);
> > +   return ret;
> > +   }
> > +   }
> > +

If you go with memblock, wouldn't it be better to handle it separately
from the iomem reservation?

> > res->name   = "reserved";
> > res->flags  = IORESOURCE_MEM;
> > res->start  = addr;
> > @@ -908,7 +922,14 @@ static int efi_mem_reserve_iomem(phys_addr_t addr, u64 size)
> >  
> > /* we expect a conflict with a 'System RAM' region */
> > parent = request_resource_conflict(_resource, res);
> > -   return parent ? request_resource(parent, res) : 0;
> > +   ret = parent ? request_resource(parent, res) : 0;
> > +   if (ret) {
> > +   kfree(res);
> > +   if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK))
> > +   memblock_free(addr, size);
> > +   }
> > +
> > +   return ret;

It looks odd to free the memblock reservation when request_resource()
fails; aren't the two unrelated?

> >  }
> >  
> >  int __ref efi_mem_reserve_persistent(phys_addr_t addr, u64 size)
> > -- 
> > 2.29.2
> > 
> > 
> > ___
> > linux-arm-kernel mailing list
> > linux-arm-ker...@lists.infradead.org
> > http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
> 
> Thanks,
> Moritz
> 
> 
> 

Thanks
Dave




Re: [PATCH 0/2] arm64: kexec_file_load vs memory reservations

2021-05-12 Thread Dave Young
Hi Marc,
On 05/12/21 at 07:04pm, Marc Zyngier wrote:
> + Dave Young, which I accidentally missed in my initial post
> 
> On Thu, 29 Apr 2021 14:35:31 +0100,
> Marc Zyngier  wrote:
> > 
> > It recently became apparent that using kexec with kexec_file_load() on
> > arm64 is pretty similar to playing Russian roulette.
> > 
> > Depending on the amount of memory, the HW supported and the firmware
> > interface used, your secondary kernel may overwrite critical memory
> > regions without which the secondary kernel cannot boot (the GICv3 LPI
> > tables being a prime example of such reserved regions).
> > 
> > It turns out that there is at least two ways for reserved memory
> > regions to be described to kexec: /proc/iomem for the userspace
> > implementation, and memblock.reserved for kexec_file. And of course,
> > our LPI tables are only reserved using the resource tree, leading to
> > the aforementioned stamping. Similar things could happen with ACPI
> > tables as well.
> > 
> > On my 24xA53 system artificially limited to 256MB of RAM (yes, it
> > boots with that little memory), trying to kexec a secondary kernel
> > failed every times. I can only presume that this was mostly tested
> > using kdump, which preserves the entire kernel memory range.
> > 
> > This small series aims at triggering a discussion on what are the
> > expectations for kexec_file, and whether we should unify the two
> > reservation mechanisms.
> > 
> > And in the meantime, it gets things going...
> > 
> > Marc Zyngier (2):
> >   firmware/efi: Tell memblock about EFI reservations
> >   ACPI: arm64: Reserve the ACPI tables in memblock
> > 
> >  arch/arm64/kernel/acpi.c   |  1 +
> >  drivers/firmware/efi/efi.c | 23 ++-
> >  2 files changed, 23 insertions(+), 1 deletion(-)
> 
> Any comment on this?
> 
> I've separately started working on using the resource tree to slice
> and dice the memblocks that are candidate for kexec_file_load(), but
> I'd like some consensus on whether this is the right way to address
> the issue.
> 
> Without something like this, kexec_file_load() is not usable on arm64,
> so I'm pretty eager to see the back of this bug.

The arm64 memory reservation is tricky, and I do not think I understand
it correctly. There were a lot of discussions about it previously; Ard
and AKASHI should know more, so let's see if they can provide comments.

For the problem you see, another way is to add an arch weak function,
like powerpc's arch_kexec_locate_mem_hole(), and walk the resource tree
for kexec_file_load as well. But I might be wrong, since I did not
follow the arm64-specific history.
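A rough sketch of the core of such a search, in plain C rather than kernel code: System RAM and reserved regions are modelled as sorted arrays of inclusive ranges standing in for the resource tree, and find_hole() is a hypothetical helper, not the signature of arch_kexec_locate_mem_hole(). It returns the lowest aligned address whose whole span lies in RAM and overlaps no reserved region.

```c
#include <stddef.h>
#include <stdint.h>

struct range {
	uint64_t start;	/* inclusive */
	uint64_t end;	/* inclusive */
};

/* 'align' must be a power of two. */
static uint64_t align_up(uint64_t x, uint64_t align)
{
	return (x + align - 1) & ~(align - 1);
}

/*
 * Find the lowest aligned addr such that [addr, addr + size - 1] fits in
 * one RAM range and overlaps no reserved range. Returns 0 and stores the
 * address in *out on success, -1 if nothing fits.
 */
int find_hole(const struct range *ram, size_t nr_ram,
	      const struct range *resv, size_t nr_resv,
	      uint64_t size, uint64_t align, uint64_t *out)
{
	for (size_t i = 0; i < nr_ram; i++) {
		uint64_t addr = align_up(ram[i].start, align);

		while (addr <= ram[i].end && ram[i].end - addr + 1 >= size) {
			uint64_t last = addr + size - 1;
			size_t j;

			for (j = 0; j < nr_resv; j++)
				if (resv[j].start <= last && resv[j].end >= addr)
					break;	/* overlaps a reserved region */
			if (j == nr_resv) {
				*out = addr;
				return 0;
			}
			/* Retry just past the conflicting reservation. */
			addr = align_up(resv[j].end + 1, align);
		}
	}
	return -1;
}
```

Skipping straight past the conflicting reservation keeps the walk linear in the number of conflicts instead of probing every aligned address.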

> 
> Thanks,
> 
>   M.
> 
> -- 
> Without deviation from the norm, progress is not possible.
> 

Thanks
Dave




Re: x86/crash: fix crash_setup_memmap_entries() out-of-bounds access

2021-04-16 Thread Dave Young
On 04/16/21 at 01:28pm, Mike Galbraith wrote:
> On Fri, 2021-04-16 at 19:07 +0800, Dave Young wrote:
> >
> > > We're excluding two ranges, allocate the scratch space we need to do that.
> >
> > I think 1 range should be fine, have you tested 1?
> 
> Have now, and vzalloc(struct_size(cmem, ranges, 1)) worked just fine.

OK, thanks for your quick response. Care to resend and cc the x86 list
and Andrew?

Andrew usually takes core kexec/kdump fixes; x86 fixes usually go
through the x86 maintainers.

Thanks
Dave




Re: x86/crash: fix crash_setup_memmap_entries() out-of-bounds access

2021-04-16 Thread Dave Young
Hi Mike,

Thanks for the patch! I suggest always cc'ing the kexec list for
kexec/kdump patches.
On 04/15/21 at 07:56pm, Mike Galbraith wrote:
> x86/crash: fix crash_setup_memmap_entries() KASAN vmalloc-out-of-bounds gripe
> 
> [   15.428011] BUG: KASAN: vmalloc-out-of-bounds in crash_setup_memmap_entries+0x17e/0x3a0
> [   15.428018] Write of size 8 at addr c9426008 by task kexec/1187
> 
> (gdb) list *crash_setup_memmap_entries+0x17e
> 0x8107cafe is in crash_setup_memmap_entries (arch/x86/kernel/crash.c:322).
> 317  unsigned long long mend)
> 318 {
> 319 unsigned long start, end;
> 320
> 321 cmem->ranges[0].start = mstart;
> 322 cmem->ranges[0].end = mend;
> 323 cmem->nr_ranges = 1;
> 324
> 325 /* Exclude elf header region */
> 326 start = image->arch.elf_load_addr;
> (gdb)
> 
> We're excluding two ranges, allocate the scratch space we need to do that.

I think 1 range should be fine; have you tested 1?

The code only excludes the ELF header space, which is loaded first,
before anything else, so I assume it sits right at the start of the
crashkernel resource region. Thus [a b], after excluding the start
part, becomes [c b]. But I have not read this code for a long time, so
I may need to double check.

Anyway, 2 would be good, since the code is obscure and we could easily
miss this in the future. Let's see what other people think.
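To illustrate the [a b] -> [c b] reasoning, here is a simplified model of the exclusion step; it is not the kernel's implementation, and the names are hypothetical. Cutting a region off the start of a range leaves one range, while cutting one out of the middle leaves two, which is when a second slot would be needed.

```c
#include <stddef.h>
#include <stdint.h>

struct mem_range {
	uint64_t start;	/* inclusive */
	uint64_t end;	/* inclusive */
};

/*
 * Remove [start, end] from the single range *in and write what is left
 * to out[], which must have room for two entries. Returns the number of
 * ranges written (0, 1 or 2).
 */
size_t exclude_range(const struct mem_range *in, uint64_t start,
		     uint64_t end, struct mem_range out[2])
{
	size_t n = 0;

	if (end < in->start || start > in->end) {
		out[n++] = *in;	/* no overlap: range unchanged */
		return n;
	}
	if (start > in->start)	/* part left of the hole survives */
		out[n++] = (struct mem_range){ in->start, start - 1 };
	if (end < in->end)	/* part right of the hole survives */
		out[n++] = (struct mem_range){ end + 1, in->end };
	return n;
}
```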

> 
> Signed-off-by: Mike Galbraith 
> ---
>  arch/x86/kernel/crash.c |2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> --- a/arch/x86/kernel/crash.c
> +++ b/arch/x86/kernel/crash.c
> @@ -337,7 +337,7 @@ int crash_setup_memmap_entries(struct ki
>   struct crash_memmap_data cmd;
>   struct crash_mem *cmem;
> 
> - cmem = vzalloc(sizeof(struct crash_mem));
> + cmem = vzalloc(sizeof(struct crash_mem)+(2*sizeof(struct crash_mem_range)));

Thanks for the patch; can you try the following instead?
vzalloc(struct_size(cmem, ranges, 2));
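For reference, a userspace approximation of what struct_size(cmem, ranges, 2) computes: the size of the header plus two trailing array elements, with overflow checking. The struct layout below only mirrors the shape of struct crash_mem (a header followed by a flexible array of ranges) and is assumed here; crash_mem_size() and alloc_crash_mem() are hypothetical stand-ins for struct_size() and vzalloc(), and __builtin_*_overflow assumes GCC or Clang.

```c
#include <stdint.h>
#include <stdlib.h>

struct crash_mem_range {
	uint64_t start, end;
};

struct crash_mem {
	unsigned int max_nr_ranges;
	unsigned int nr_ranges;
	struct crash_mem_range ranges[];	/* flexible array member */
};

/*
 * Header size plus n trailing ranges, saturating to SIZE_MAX on overflow
 * so that the allocation fails instead of being silently too small.
 */
static size_t crash_mem_size(size_t n)
{
	size_t bytes;

	if (__builtin_mul_overflow(n, sizeof(struct crash_mem_range), &bytes) ||
	    __builtin_add_overflow(bytes, sizeof(struct crash_mem), &bytes))
		return SIZE_MAX;
	return bytes;
}

/* Zeroed allocation with room for n ranges; records the capacity. */
struct crash_mem *alloc_crash_mem(unsigned int n)
{
	struct crash_mem *cmem = calloc(1, crash_mem_size(n));

	if (cmem)
		cmem->max_nr_ranges = n;
	return cmem;
}
```

The plain sizeof(struct crash_mem) in the original code leaves no room at all for ranges[0], which matches the KASAN report above.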


>   if (!cmem)
>   return -ENOMEM;
> 
> 

Thanks
Dave




Re: [PATCH v4 1/1] kernel/crash_core: Add crashkernel=auto for vmcore creation

2021-02-25 Thread Dave Young
On 02/23/21 at 09:41am, Saeed Mirzamohammadi wrote:
> This adds crashkernel=auto feature to configure reserved memory for
> vmcore creation. CONFIG_CRASH_AUTO_STR is defined to be set for
> different kernel distributions and different archs based on their
> needs.
> 
> Signed-off-by: Saeed Mirzamohammadi 
> Signed-off-by: John Donnelly 
> Tested-by: John Donnelly 
> ---
>  Documentation/admin-guide/kdump/kdump.rst |  3 ++-
>  .../admin-guide/kernel-parameters.txt |  6 ++
>  arch/Kconfig  | 20 +++
>  kernel/crash_core.c   |  7 +++
>  4 files changed, 35 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/admin-guide/kdump/kdump.rst b/Documentation/admin-guide/kdump/kdump.rst
> index 75a9dd98e76e..ae030111e22a 100644
> --- a/Documentation/admin-guide/kdump/kdump.rst
> +++ b/Documentation/admin-guide/kdump/kdump.rst
> @@ -285,7 +285,8 @@ This would mean:
>  2) if the RAM size is between 512M and 2G (exclusive), then reserve 64M
>  3) if the RAM size is larger than 2G, then reserve 128M
>  
> -
> +Or you can use crashkernel=auto to choose the crash kernel memory size
> +based on the recommended configuration set for each arch.
>  
>  Boot into System Kernel
>  ===
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index 9e3cdb271d06..a5deda5c85fe 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -747,6 +747,12 @@
>   a memory unit (amount[KMG]). See also
>   Documentation/admin-guide/kdump/kdump.rst for an example.
>  
> + crashkernel=auto
> + [KNL] This parameter will set the reserved memory for
> + the crash kernel based on the value of the CRASH_AUTO_STR
> + that is the best effort estimation for each arch. See also
> + arch/Kconfig for further details.
> +
>   crashkernel=size[KMG],high
>   [KNL, X86-64] range could be above 4G. Allow kernel
>   to allocate physical memory region from top, so could
> diff --git a/arch/Kconfig b/arch/Kconfig
> index 24862d15f3a3..23d047548772 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -14,6 +14,26 @@ menu "General architecture-dependent options"
>  config CRASH_CORE
>   bool
>  
> +config CRASH_AUTO_STR
> + string "Memory reserved for crash kernel"
> + depends on CRASH_CORE
> + default "1G-64G:128M,64G-1T:256M,1T-:512M"
> + help
> +   This configures the reserved memory dependent
> +   on the value of System RAM. The syntax is:
> +   crashkernel=:[,:,...][@offset]
> +   range=start-[end]
> +
> +   For example:
> +   crashkernel=512M-2G:64M,2G-:128M
> +
> +   This would mean:
> +
> +   1) if the RAM is smaller than 512M, then don't reserve anything
> +  (this is the "rescue" case)
> +   2) if the RAM size is between 512M and 2G (exclusive), then reserve 64M
> +   3) if the RAM size is larger than 2G, then reserve 128M
> +
>  config KEXEC_CORE
>   select CRASH_CORE
>   bool
> diff --git a/kernel/crash_core.c b/kernel/crash_core.c
> index 825284baaf46..90f9e4bb6704 100644
> --- a/kernel/crash_core.c
> +++ b/kernel/crash_core.c
> @@ -7,6 +7,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include 
>  #include 
> @@ -250,6 +251,12 @@ static int __init __parse_crashkernel(char *cmdline,
>   if (suffix)
>   return parse_crashkernel_suffix(ck_cmdline, crash_size,
>   suffix);
> +#ifdef CONFIG_CRASH_AUTO_STR
> + if (strncmp(ck_cmdline, "auto", 4) == 0) {
> + ck_cmdline = CONFIG_CRASH_AUTO_STR;
> + pr_info("Using crashkernel=auto, the size chosen is a best effort estimation.\n");
> + }
> +#endif
>   /*
>* if the commandline contains a ':', then that's the extended
>* syntax -- if not, it must be the classic syntax
> -- 
> 2.27.0
> 


Acked-by: Dave Young 

Thanks
Dave




Re: [PATCH v3 1/1] kernel/crash_core: Add crashkernel=auto for vmcore creation

2021-02-17 Thread Dave Young
On 02/17/21 at 02:42pm, Vivek Goyal wrote:
> On Wed, Feb 17, 2021 at 02:26:53PM -0500, Steven Rostedt wrote:
> > On Wed, 17 Feb 2021 12:40:43 -0600
> > john.p.donne...@oracle.com wrote:
> > 
> > > Hello.
> > > 
> > > Ping.
> > > 
> > > Can we get this reviewed and staged ?
> > > 
> > > Thank you.
> > 
> > Andrew,
> > 
> > Seems you are the only one pushing patches in for kexec/crash. Is this
> > maintained by anyone?
> 
> Dave Young and Baoquan He still maintain kexec/kdump stuff, AFAIK. I
> don't get time to look into this stuff now a days. 

Vivek, no problem; both Baoquan and I were on holiday leave previously.

I'm fine with the change.
This patch benefits distributions and people who want to deploy a lot of
machines. It is a good start, and we can continue to improve the estimation
later.

Thanks
Dave 




Re: [PATCH v3 1/1] kernel/crash_core: Add crashkernel=auto for vmcore creation

2021-02-17 Thread Dave Young
On 02/11/21 at 10:08am, Saeed Mirzamohammadi wrote:
> This adds crashkernel=auto feature to configure reserved memory for
> vmcore creation. CONFIG_CRASH_AUTO_STR is defined to be set for
> different kernel distributions and different archs based on their
> needs.
> 
> Signed-off-by: Saeed Mirzamohammadi 
> Signed-off-by: John Donnelly 
> Tested-by: John Donnelly 
> ---
>  Documentation/admin-guide/kdump/kdump.rst |  3 ++-
>  .../admin-guide/kernel-parameters.txt |  6 +
>  arch/Kconfig  | 24 +++
>  kernel/crash_core.c   |  7 ++
>  4 files changed, 39 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/admin-guide/kdump/kdump.rst b/Documentation/admin-guide/kdump/kdump.rst
> index 2da65fef2a1c..e55cdc404c6b 100644
> --- a/Documentation/admin-guide/kdump/kdump.rst
> +++ b/Documentation/admin-guide/kdump/kdump.rst
> @@ -285,7 +285,8 @@ This would mean:
>  2) if the RAM size is between 512M and 2G (exclusive), then reserve 64M
>  3) if the RAM size is larger than 2G, then reserve 128M
>  
> -
> +Or you can use crashkernel=auto to choose the crash kernel memory size
> +based on the recommended configuration set for each arch.
>  
>  Boot into System Kernel
>  ===
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index 7d4e523646c3..aa2099465458 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -736,6 +736,12 @@
>   a memory unit (amount[KMG]). See also
>   Documentation/admin-guide/kdump/kdump.rst for an example.
>  
> + crashkernel=auto
> + [KNL] This parameter will set the reserved memory for
> + the crash kernel based on the value of the CRASH_AUTO_STR
> + that is the best effort estimation for each arch. See also
> + arch/Kconfig for further details.
> +
>   crashkernel=size[KMG],high
>   [KNL, X86-64] range could be above 4G. Allow kernel
>   to allocate physical memory region from top, so could
> diff --git a/arch/Kconfig b/arch/Kconfig
> index af14a567b493..f87c88ffa2f8 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -14,6 +14,30 @@ menu "General architecture-dependent options"
>  config CRASH_CORE
>   bool
>  
> +if CRASH_CORE
> +
> +config CRASH_AUTO_STR
> + string "Memory reserved for crash kernel"
> + depends on CRASH_CORE
> + default "1G-64G:128M,64G-1T:256M,1T-:512M"
> + help
> +   This configures the reserved memory dependent
> +   on the value of System RAM. The syntax is:
> +   crashkernel=:[,:,...][@offset]
> +   range=start-[end]
> +
> +   For example:
> +   crashkernel=512M-2G:64M,2G-:128M
> +
> +   This would mean:
> +
> +   1) if the RAM is smaller than 512M, then don't reserve anything
> +  (this is the "rescue" case)
> +   2) if the RAM size is between 512M and 2G (exclusive), then reserve 64M
> +   3) if the RAM size is larger than 2G, then reserve 128M
> +
> +endif # CRASH_CORE
> +
>  config KEXEC_CORE
>   select CRASH_CORE
>   bool
> diff --git a/kernel/crash_core.c b/kernel/crash_core.c
> index 106e4500fd53..ab0a2b4b1ffa 100644
> --- a/kernel/crash_core.c
> +++ b/kernel/crash_core.c
> @@ -7,6 +7,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include 
>  #include 
> @@ -250,6 +251,12 @@ static int __init __parse_crashkernel(char *cmdline,
>   if (suffix)
>   return parse_crashkernel_suffix(ck_cmdline, crash_size,
>   suffix);
> +#ifdef CONFIG_CRASH_AUTO_STR
> + if (strncmp(ck_cmdline, "auto", 4) == 0) {
> + ck_cmdline = CONFIG_CRASH_AUTO_STR;
> + pr_info("Using crashkernel=auto, the size chosen is a best effort estimation.\n");
> + }
> +#endif
>   /*
>* if the commandline contains a ':', then that's the extended
>* syntax -- if not, it must be the classic syntax
> -- 
> 2.27.0
> 

Acked-by: Dave Young 

Thanks
Dave




Re: [PATCH 1/1] kexec-tools: fix build on pre 4.4 kernels

2021-02-07 Thread Dave Young
Adding Kairui to Cc since he contributed that part.
On 02/05/21 at 09:15am, Federico Pellegrin wrote:
> kexec build will fail on older kernels (pre 4.4) as the define
> VIDEO_CAPABILITY_64BIT_BASE was not present at that time.
> 
> This patch adds it, as per linux/include/uapi/linux/screen_info.h,
> if not present.
> 
> Signed-off-by: Federico Pellegrin 
> ---
>  kexec/arch/i386/x86-linux-setup.c | 4 
>  1 file changed, 4 insertions(+)
> 
> diff --git a/kexec/arch/i386/x86-linux-setup.c b/kexec/arch/i386/x86-linux-setup.c
> index 76e1185..ab54a4a 100644
> --- a/kexec/arch/i386/x86-linux-setup.c
> +++ b/kexec/arch/i386/x86-linux-setup.c
> @@ -37,6 +37,10 @@
>  #include "x86-linux-setup.h"
>  #include "../../kexec/kexec-syscall.h"
>  
> +#ifndef VIDEO_CAPABILITY_64BIT_BASE
> +#define VIDEO_CAPABILITY_64BIT_BASE (1 << 1) /* Frame buffer base is 64-bit */
> +#endif
> +
>  void init_linux_parameters(struct x86_linux_param_header *real_mode)
>  {
>   /* Fill in the values that are usually provided by the kernel. */
> -- 
> 2.26.2
> 




Re: [PATCH v2 1/1] kernel/crash_core: Add crashkernel=auto for vmcore creation

2021-02-06 Thread Dave Young
Hi Saeed,
On 02/03/21 at 04:43pm, Saeed Mirzamohammadi wrote:
> This adds crashkernel=auto feature to configure reserved memory for
> vmcore creation. CONFIG_CRASH_AUTO_STR is defined to be set for
> different kernel distributions and different archs based on their
> needs.
> 
> Signed-off-by: Saeed Mirzamohammadi 
> Signed-off-by: John Donnelly 
> Tested-by: John Donnelly 
> ---
>  Documentation/admin-guide/kdump/kdump.rst |  5 +
>  arch/Kconfig  | 24 +++
>  kernel/crash_core.c   |  7 +++
>  3 files changed, 36 insertions(+)
> 
> diff --git a/Documentation/admin-guide/kdump/kdump.rst b/Documentation/admin-guide/kdump/kdump.rst
> index 75a9dd98e76e..f95a2af64f59 100644
> --- a/Documentation/admin-guide/kdump/kdump.rst
> +++ b/Documentation/admin-guide/kdump/kdump.rst
> @@ -285,7 +285,12 @@ This would mean:
>  2) if the RAM size is between 512M and 2G (exclusive), then reserve 64M
>  3) if the RAM size is larger than 2G, then reserve 128M
>  
> +Or you can use crashkernel=auto if you have enough memory. The threshold
> +is 1G on x86_64 and arm64. If your system memory is less than the threshold,
> +crashkernel=auto will not reserve memory. The size changes according to
> +the system memory size like below:
>  
> +x86_64/arm64: 1G-64G:128M,64G-1T:256M,1T-:512M

This part should be updated, since the default value is not arch
dependent in this patch.

The format of the auto string is documented well in this part of
kernel-parameters.txt:
crashkernel=range1:size1[,range2:size2,...][@offset]

crashkernel=auto should also be documented in kernel-parameters.txt.
There is no need to explain the threshold etc. again; just referring to
the "crashkernel=range1:size1[,range2:size2,...][@offset]" format would
be fine.
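A small sketch of how such a spec maps a RAM size to a reservation, following the rules in the Kconfig help text quoted above (start inclusive, end exclusive, an open end means no upper limit). crashkernel_size_for() and SPEC_ERROR are hypothetical names and this is not the kernel's parser; the [@offset] part is left out. The suffix handling is modelled on memparse().

```c
#include <stdint.h>
#include <stdlib.h>

#define SPEC_ERROR UINT64_MAX	/* hypothetical "malformed spec" marker */

/* Number with optional K/M/G/T suffix, modelled on the kernel's memparse(). */
static uint64_t parse_size(const char *s, char **end)
{
	uint64_t v = strtoull(s, end, 0);

	switch (**end) {
	case 'T': case 't':
		v <<= 10;
		/* fall through */
	case 'G': case 'g':
		v <<= 10;
		/* fall through */
	case 'M': case 'm':
		v <<= 10;
		/* fall through */
	case 'K': case 'k':
		v <<= 10;
		(*end)++;	/* consume the suffix */
		break;
	default:
		break;
	}
	return v;
}

/*
 * Pick the reservation for system_ram from "start-[end]:size[,...]".
 * Returns 0 if no range matches, SPEC_ERROR if the spec is malformed.
 */
uint64_t crashkernel_size_for(const char *spec, uint64_t system_ram)
{
	const char *p = spec;

	while (*p) {
		char *q;
		uint64_t start, end = UINT64_MAX, size;

		start = parse_size(p, &q);
		if (q == p || *q != '-')
			return SPEC_ERROR;
		p = q + 1;
		if (*p != ':') {	/* explicit end of range */
			end = parse_size(p, &q);
			if (q == p || end <= start)
				return SPEC_ERROR;
			p = q;
		}
		if (*p != ':')
			return SPEC_ERROR;
		size = parse_size(p + 1, &q);
		if (q == p + 1)
			return SPEC_ERROR;
		if (system_ram >= start && system_ram < end)
			return size;
		p = q;
		if (*p == ',')
			p++;
		else if (*p != '\0')
			return SPEC_ERROR;
	}
	return 0;
}
```

With the proposed default "1G-64G:128M,64G-1T:256M,1T-:512M", a 16G machine would get 128M, a 64G machine 256M, and a machine below 1G nothing.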

>  
>  Boot into System Kernel
>  ===
> diff --git a/arch/Kconfig b/arch/Kconfig
> index 56b6ccc0e32d..a772eb397d73 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -14,6 +14,30 @@ menu "General architecture-dependent options"
>  config CRASH_CORE
>   bool
>  
> +if CRASH_CORE
> +
> +config CRASH_AUTO_STR
> + string "Memory reserved for crash kernel"
> + depends on CRASH_CORE
> + default "1G-64G:128M,64G-1T:256M,1T-:512M"
> + help
> +   This configures the reserved memory dependent
> +   on the value of System RAM. The syntax is:
> +   crashkernel=<range1>:<size1>[,<range2>:<size2>,...][@offset]
> +   range=start-[end]
> +
> +   For example:
> +   crashkernel=512M-2G:64M,2G-:128M
> +
> +   This would mean:
> +
> +   1) if the RAM is smaller than 512M, then don't reserve anything
> +  (this is the "rescue" case)
> +   2) if the RAM size is between 512M and 2G (exclusive), then reserve 64M
> +   3) if the RAM size is larger than 2G, then reserve 128M
> +
> +endif # CRASH_CORE
> +
>  config KEXEC_CORE
>   select CRASH_CORE
>   bool
> diff --git a/kernel/crash_core.c b/kernel/crash_core.c
> index 106e4500fd53..ab0a2b4b1ffa 100644
> --- a/kernel/crash_core.c
> +++ b/kernel/crash_core.c
> @@ -7,6 +7,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include 
>  #include 
> @@ -250,6 +251,12 @@ static int __init __parse_crashkernel(char *cmdline,
>   if (suffix)
>   return parse_crashkernel_suffix(ck_cmdline, crash_size,
>   suffix);
> +#ifdef CONFIG_CRASH_AUTO_STR
> + if (strncmp(ck_cmdline, "auto", 4) == 0) {
> + ck_cmdline = CONFIG_CRASH_AUTO_STR;
> + pr_info("Using crashkernel=auto, the size chosen is a best effort estimation.\n");
> + }
> +#endif
>   /*
>* if the commandline contains a ':', then that's the extended
>* syntax -- if not, it must be the classic syntax
> -- 
> 2.27.0
> 
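The dispatch added by the quoted hunk can be modelled in userspace as below: `auto` is replaced by the configured string before the existing check for a ':' chooses between the extended and classic syntax. This is a sketch only; `CRASH_AUTO_STR_DEFAULT`, `ck_resolve()` and `enum ck_syntax` are hypothetical names, and the colon-before-first-space test is assumed to mirror the kernel's check (the string being parsed is the rest of the command line, so a ':' belonging to a later parameter should not count).

```c
#include <assert.h>
#include <string.h>

/* Stand-in for CONFIG_CRASH_AUTO_STR; the value is the patch's Kconfig default. */
#define CRASH_AUTO_STR_DEFAULT "1G-64G:128M,64G-1T:256M,1T-:512M"

enum ck_syntax { CK_CLASSIC, CK_EXTENDED };

/* Hypothetical helper: resolve "auto", then classify the value.
 * arg points at the text after "crashkernel=", i.e. the rest of the
 * command line, so only a ':' before the first space counts. */
static const char *ck_resolve(const char *arg, enum ck_syntax *kind)
{
	const char *colon, *space;

	assert(arg != NULL && kind != NULL);
	/* Same prefix test as the patch: strncmp(ck_cmdline, "auto", 4). */
	if (strncmp(arg, "auto", 4) == 0)
		arg = CRASH_AUTO_STR_DEFAULT;

	colon = strchr(arg, ':');
	space = strchr(arg, ' ');
	*kind = (colon && (!space || colon < space)) ? CK_EXTENDED : CK_CLASSIC;
	return arg;
}
```

Because the test is a 4-character prefix match, a value such as `autofoo` would also be treated as `auto`; whether a stricter comparison is wanted is a judgment call for the patch.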

Thanks
Dave


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH 1/1] kernel/crash_core.c - Add crashkernel=auto for x86 and ARM

2021-01-22 Thread Dave Young
On 01/23/21 at 11:51am, Dave Young wrote:
> Hi Saeed,
> On 01/22/21 at 05:14pm, Saeed Mirzamohammadi wrote:
> > Hi,
> > 
> > > On Jan 21, 2021, at 7:12 PM, Dave Young  wrote:
> > > 
> > > On 01/22/21 at 09:22am, Dave Young wrote:
> > >> Hi John,
> > >> 
> > >> On 01/21/21 at 09:32am, john.p.donne...@oracle.com wrote:
> > >>> On 11/22/20 9:47 PM, Dave Young wrote:
> > >>>> Hi Guilherme,
> > >>>> On 11/22/20 at 12:32pm, Guilherme Piccoli wrote:
> > >>>>> Hi Dave and Kairui, thanks for your responses! OK, if that makes sense
> > >>>>> to you I'm fine with it. I'd just recommend to test recent kernels in
> > >>>>> multiple distros with the minimum "range" to see if 64M is enough for
> > >>>>> crashkernel, maybe we'd need to bump that.
> > >>>> 
> > >>>> Given the different kernel configs and the different userspace
> > >>>> initramfs setups, it is hard to get a uniform value for all distributions,
> > >>>> but we can have an interface/kconfig-option for them to provide a value,
> > >>>> like this patch is doing. And it could be improved later, as Kairui said,
> > >>>> by adding extra values for some known kernels, and probably with more
> > >>>> improvements if doable.
> > >>>> 
> > >>>> Thanks
> > >>>> Dave
> > >>>> 
> > >>> 
> > >>> Hi.
> > >>> 
> > >>> Are we going to move forward with implementing this for X86 and Arm ?
> > >>> 
> > >>> If other platform maintainers want to include this CONFIG option in their
> > >>> configuration settings they have a starting point.
> > >> 
> > >> I would expect this to become arch independent.
> > > 
> > > To clarify a bit: it can be a general config option under arch/Kconfig,
> > > with the code placed in the generic, arch-independent part.
> > 
> > Does this mean that we need to add the option to def_configs in all archs as well?
> > 
> 
> I don't think we need to add it to the defconfigs; would something like this just work?
> 
> BTW, it should depend on CRASH_CORE instead of CRASH_DUMP, since the logic for
> parsing crashkernel= is in kernel/crash_core.c.
> 
> diff --git a/arch/Kconfig b/arch/Kconfig
> index af14a567b493..fa6efeb52dc5 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -14,6 +14,11 @@ menu "General architecture-dependent options"
>  config CRASH_CORE
>   bool
>  
> +config CRASH_AUTO_STR
> + depends on CRASH_CORE
> + string "Memory reserved for crash kernel"
> + default "1G-:128M"

People who do not need kdump do not want to see this default value, so it
would be better to add another bool Kconfig option as a switch that
defaults to off.

> + ... help text [snip] ...
> +
>  config KEXEC_CORE
>   select CRASH_CORE
>   bool
> 
> [...]
> 
> > Thanks,
> > Saeed
> > 
> > > 
> > >> 
> > >> Saeed, Kairui, would any of you like to update the patch?
> > >> 
> > >>> 
> > >>> Thank you,
> > >>> 
> > >>> John.
> > >>> 
> > >>> ( I am not currently on many of the included dist lists  in this email, so
> > >>> hopefully key contributors are included in this exchange )
> > >>> 
> > >> 
> > >> Thanks
> > >> Dave
> > 
> 
> Thanks
> Dave




Re: [PATCH 1/1] kernel/crash_core.c - Add crashkernel=auto for x86 and ARM

2021-01-22 Thread Dave Young
Hi Saeed,
On 01/22/21 at 05:14pm, Saeed Mirzamohammadi wrote:
> Hi,
> 
> > On Jan 21, 2021, at 7:12 PM, Dave Young  wrote:
> > 
> > On 01/22/21 at 09:22am, Dave Young wrote:
> >> Hi John,
> >> 
> >> On 01/21/21 at 09:32am, john.p.donne...@oracle.com wrote:
> >>> On 11/22/20 9:47 PM, Dave Young wrote:
> >>>> Hi Guilherme,
> >>>> On 11/22/20 at 12:32pm, Guilherme Piccoli wrote:
> >>>>> Hi Dave and Kairui, thanks for your responses! OK, if that makes sense
> >>>>> to you I'm fine with it. I'd just recommend to test recent kernels in
> >>>>> multiple distros with the minimum "range" to see if 64M is enough for
> >>>>> crashkernel, maybe we'd need to bump that.
> >>>> 
> >>>> Given the different kernel configs and the different userspace
> >>>> initramfs setups, it is hard to get a uniform value for all distributions,
> >>>> but we can have an interface/kconfig-option for them to provide a value,
> >>>> like this patch is doing. And it could be improved later, as Kairui said,
> >>>> by adding extra values for some known kernels, and probably with more
> >>>> improvements if doable.
> >>>> 
> >>>> Thanks
> >>>> Dave
> >>>> 
> >>> 
> >>> Hi.
> >>> 
> >>> Are we going to move forward with implementing this for X86 and Arm ?
> >>> 
> >>> If other platform maintainers want to include this CONFIG option in their
> >>> configuration settings they have a starting point.
> >> 
> >> I would expect this to become arch independent.
> > 
> > To clarify a bit: it can be a general config option under arch/Kconfig,
> > with the code placed in the generic, arch-independent part.
> 
> Does this mean that we need to add the option to def_configs in all archs as well?
> 

I don't think we need to add it to the defconfigs; would something like this just work?

BTW, it should depend on CRASH_CORE instead of CRASH_DUMP, since the logic for
parsing crashkernel= is in kernel/crash_core.c.

diff --git a/arch/Kconfig b/arch/Kconfig
index af14a567b493..fa6efeb52dc5 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -14,6 +14,11 @@ menu "General architecture-dependent options"
 config CRASH_CORE
bool
 
+config CRASH_AUTO_STR
+   depends on CRASH_CORE
+   string "Memory reserved for crash kernel"
+   default "1G-:128M"
+   ... help text [snip] ...
+
 config KEXEC_CORE
select CRASH_CORE
bool

[...]

> Thanks,
> Saeed
> 
> > 
> >> 
> >> Saeed, Kairui, would any of you like to update the patch?
> >> 
> >>> 
> >>> Thank you,
> >>> 
> >>> John.
> >>> 
> >>> ( I am not currently on many of the included dist lists  in this email, so
> >>> hopefully key contributors are included in this exchange )
> >>> 
> >> 
> >> Thanks
> >> Dave
> 

Thanks
Dave




Re: [PATCH 1/1] kernel/crash_core.c - Add crashkernel=auto for x86 and ARM

2021-01-21 Thread Dave Young
On 01/22/21 at 09:22am, Dave Young wrote:
> Hi John,
> 
> On 01/21/21 at 09:32am, john.p.donne...@oracle.com wrote:
> > On 11/22/20 9:47 PM, Dave Young wrote:
> > > Hi Guilherme,
> > > On 11/22/20 at 12:32pm, Guilherme Piccoli wrote:
> > > > Hi Dave and Kairui, thanks for your responses! OK, if that makes sense
> > > > to you I'm fine with it. I'd just recommend to test recent kernels in
> > > > multiple distros with the minimum "range" to see if 64M is enough for
> > > > crashkernel, maybe we'd need to bump that.
> > > 
> > > Given the different kernel configs and the different userspace
> > > initramfs setups, it is hard to get a uniform value for all distributions,
> > > but we can have an interface/kconfig-option for them to provide a value,
> > > like this patch is doing. And it could be improved later, as Kairui said,
> > > by adding extra values for some known kernels, and probably with more
> > > improvements if doable.
> > > 
> > > Thanks
> > > Dave
> > > 
> > 
> > Hi.
> > 
> > Are we going to move forward with implementing this for X86 and Arm ?
> > 
> > If other platform maintainers want to include this CONFIG option in their
> > configuration settings they have a starting point.
> 
> I would expect this to become arch independent.

To clarify a bit: it can be a general config option under arch/Kconfig,
with the code placed in the generic, arch-independent part.

> 
> Saeed, Kairui, would any of you like to update the patch?
> 
> > 
> > Thank you,
> > 
> > John.
> > 
> > ( I am not currently on many of the included dist lists  in this email, so
> > hopefully key contributors are included in this exchange )
> > 
> 
> Thanks
> Dave




Re: [PATCH 1/1] kernel/crash_core.c - Add crashkernel=auto for x86 and ARM

2021-01-21 Thread Dave Young
Hi John,

On 01/21/21 at 09:32am, john.p.donne...@oracle.com wrote:
> On 11/22/20 9:47 PM, Dave Young wrote:
> > Hi Guilherme,
> > On 11/22/20 at 12:32pm, Guilherme Piccoli wrote:
> > > Hi Dave and Kairui, thanks for your responses! OK, if that makes sense
> > > to you I'm fine with it. I'd just recommend to test recent kernels in
> > > multiple distros with the minimum "range" to see if 64M is enough for
> > > crashkernel, maybe we'd need to bump that.
> > 
> > Given the different kernel configs and the different userspace
> > initramfs setups, it is hard to get a uniform value for all distributions,
> > but we can have an interface/kconfig-option for them to provide a value,
> > like this patch is doing. And it could be improved later, as Kairui said,
> > by adding extra values for some known kernels, and probably with more
> > improvements if doable.
> > 
> > Thanks
> > Dave
> > 
> 
> Hi.
> 
> Are we going to move forward with implementing this for X86 and Arm ?
> 
> If other platform maintainers want to include this CONFIG option in their
> configuration settings they have a starting point.

I would expect this to become arch independent.

Saeed, Kairui, would any of you like to update the patch?

> 
> Thank you,
> 
> John.
> 
> ( I am not currently on many of the included dist lists  in this email, so
> hopefully key contributors are included in this exchange )
> 

Thanks
Dave




Re: [PATCH] kexec: Fix error code in kexec_calculate_store_digests()

2020-12-09 Thread Dave Young
On 12/08/20 at 10:55pm, Dan Carpenter wrote:
> Return -ENOMEM on allocation failure instead of returning success.
> 
> Fixes: a43cac0d9dc2 ("kexec: split kexec_file syscall code to kexec_file.c")
> Signed-off-by: Dan Carpenter 
> ---
>  kernel/kexec_file.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
> index b02086d70492..9570f380a825 100644
> --- a/kernel/kexec_file.c
> +++ b/kernel/kexec_file.c
> @@ -735,8 +735,10 @@ static int kexec_calculate_store_digests(struct kimage *image)
>  
>   sha_region_sz = KEXEC_SEGMENT_MAX * sizeof(struct kexec_sha_region);
>   sha_regions = vzalloc(sha_region_sz);
> - if (!sha_regions)
> + if (!sha_regions) {
> + ret = -ENOMEM;
>   goto out_free_desc;
> + }
>  
>   desc->tfm   = tfm;
>  
> -- 
> 2.29.2
> 
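The bug class fixed here (jumping to a shared cleanup label while `ret` still holds a success value) can be reproduced in a small userspace sketch. `compute_digests()` is a hypothetical stand-in for `kexec_calculate_store_digests()`, `malloc()`/`calloc()` stand in for the kernel allocators, and the `fail_second` flag simulates the second allocation failing.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for the patched function: two allocations share
 * one cleanup label, so every failure path must set ret before jumping,
 * or the caller sees success. */
static int compute_digests(int fail_second)
{
	int ret = 0;
	void *desc, *regions;

	desc = malloc(64);
	if (!desc)
		return -ENOMEM;

	/* fail_second simulates vzalloc() returning NULL */
	regions = fail_second ? NULL : calloc(16, 32);
	if (!regions) {
		ret = -ENOMEM;	/* the assignment the fix adds; without it, 0 is returned */
		goto out_free_desc;
	}

	/* ... digest work that may also set ret ... */

	free(regions);
out_free_desc:
	free(desc);
	return ret;
}
```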

Good catch, thanks!

Acked-by: Dave Young 

Thanks
Dave




Re: [PATCH 1/1] kernel/crash_core.c - Add crashkernel=auto for x86 and ARM

2020-11-22 Thread Dave Young
Hi Guilherme,
On 11/22/20 at 12:32pm, Guilherme Piccoli wrote:
> Hi Dave and Kairui, thanks for your responses! OK, if that makes sense
> to you I'm fine with it. I'd just recommend to test recent kernels in
> multiple distros with the minimum "range" to see if 64M is enough for
> crashkernel, maybe we'd need to bump that.

Given the different kernel configs and the different userspace
initramfs setups, it is hard to get a uniform value for all distributions,
but we can have an interface/kconfig-option for them to provide a value,
like this patch is doing. And it could be improved later, as Kairui said,
by adding extra values for some known kernels, and probably with more
improvements if doable.

Thanks
Dave




Re: [PATCH 1/1] kernel/crash_core.c - Add crashkernel=auto for x86 and ARM

2020-11-19 Thread Dave Young
Hi Guilherme,
On 11/19/20 at 06:56pm, Guilherme Piccoli wrote:
> Hi Saeed, thanks for your patch/idea! Comments inline, below.
> 
> On Wed, Nov 18, 2020 at 8:29 PM Saeed Mirzamohammadi
>  wrote:
> >
> > This adds crashkernel=auto feature to configure reserved memory for
> > vmcore creation to both x86 and ARM platforms based on the total memory
> > size.
> >
> > Cc: sta...@vger.kernel.org
> > Signed-off-by: John Donnelly 
> > Signed-off-by: Saeed Mirzamohammadi 
> > ---
> >  Documentation/admin-guide/kdump/kdump.rst |  5 +
> >  arch/arm64/Kconfig| 26 ++-
> >  arch/arm64/configs/defconfig  |  1 +
> >  arch/x86/Kconfig  | 26 ++-
> >  arch/x86/configs/x86_64_defconfig |  1 +
> >  kernel/crash_core.c   | 20 +++--
> >  6 files changed, 75 insertions(+), 4 deletions(-)
> >
> > diff --git a/Documentation/admin-guide/kdump/kdump.rst b/Documentation/admin-guide/kdump/kdump.rst
> > index 75a9dd98e76e..f95a2af64f59 100644
> > --- a/Documentation/admin-guide/kdump/kdump.rst
> > +++ b/Documentation/admin-guide/kdump/kdump.rst
> > @@ -285,7 +285,12 @@ This would mean:
> >  2) if the RAM size is between 512M and 2G (exclusive), then reserve 64M
> >  3) if the RAM size is larger than 2G, then reserve 128M
> >
> > +Or you can use crashkernel=auto if you have enough memory. The threshold
> > +is 1G on x86_64 and arm64. If your system memory is less than the threshold,
> > +crashkernel=auto will not reserve memory. The size changes according to
> > +the system memory size like below:
> >
> > +x86_64/arm64: 1G-64G:128M,64G-1T:256M,1T-:512M
> 
> As mentioned in the thread, this was tried before and never got merged
> - I'm not sure of all the reasons, but I speculate that a stronger
> reason is that it'd likely fail in many cases. I've seen cases of 256G

Yes, there were a few attempts; the last time, I tried to set a default
value, and I do not think people were strongly against it. We have been
using crashkernel=auto in Red Hat for a long time, and it does work for
most of the usual cases, as Saeed said in the patch. But I think all of us
agree that it is not possible to satisfy all use cases. Anyway, I also
think this is good to have.

> servers that require crashkernel=600M (or more), due to the amount of
> devices. Also, the minimum nowadays would likely be 96M or more - I'm
> looping Cascardo and Dann (Debian/Ubuntu maintainers of kdump stuff)
> so they maybe can jump in with even more examples/considerations.

Another reason people have different feelings about the memory
requirement is that distributions currently handle kdump differently,
especially the userspace part. Kairui did a lot of work in dracut to
reduce the memory requirements, for example only adding the kernel
modules required for dumping to the 2nd kernel's initramfs. We also have
a lot of other tweaks for dracut to use "hostonly" mode, e.g. hostonly
multipath configurations will just bring up the necessary paths instead
of creating all of the multipath devices.

> 
> What we've been trying to do in Ubuntu/Debian is using an estimator
> approach [0] - this is purely userspace and tries to infer the amount
> of necessary memory a kdump minimal[1] kernel would take. I'm not
> -1'ing your approach totally, but I think a bit more consideration is
> needed in the ranges, at least accounting the number of devices of the
> machine or something like that.

There is definitely room to improve it in the future, but I think this is
a good start and a simple enough proposal for the time being :)

> 
> Cheers,
> 
> 
> Guilherme
> 
> [0] https://salsa.debian.org/debian/makedumpfile/-/merge_requests/7
> [1] Minimal as having a reduced initrd + "shrinking" parameters (like
> nr_cpus=1).
> 

Thanks
Dave




Re: [PATCH] Only allow to set crash_kexec_post_notifiers on boot time

2020-09-26 Thread Dave Young
Hi,

On 09/25/20 at 10:56am, Konrad Rzeszutek Wilk wrote:
> On Fri, Sep 25, 2020 at 11:05:58AM +0800, Dave Young wrote:
> > Hi,
> > 
> > On 09/24/20 at 01:16pm, boris.ostrov...@oracle.com wrote:
> > > 
> > > On 9/24/20 12:43 PM, Michael Kelley wrote:
> > > > From: Eric W. Biederman  Sent: Thursday, September 24, 2020 9:26 AM
> > > >> Michael Kelley  writes:
> > > >>
> > > >>>>> Added the Hyper-V people and the people who created the param; the commit
> > > >>>>> is below. I also want to remove it if possible; let's see what people
> > > >>>>> think, but at the very least we should disable the automatic setting in
> > > >>>>> both systemd and the kernel:
> > > >>> Hyper-V uses a notifier to inform the host system that a Linux VM has
> > > >>> panic'ed.  Informing the host is particularly important in a public cloud
> > > >>> such as Azure so that the cloud software can alert the customer, and can
> > > >>> track cloud-wide reliability statistics.   Whether a kdump is taken is controlled
> > > >>> entirely by the customer and how he configures the VM, and we want
> > > >>> the host to be informed either way.
> > > >> Why?
> > > >>
> > > >> Why does the host care?
> > > >> Especially if the VM continues executing into a kdump kernel?
> > > > The host itself doesn't care.  But the host is a convenient out-of-band
> > > > channel for recording that a panic has occurred and to collect basic data
> > > > about the panic.  This out-of-band channel is then used to notify the end
> > > > customer that his VM has panic'ed.  Sure, the customer should be running
> > > > his own monitoring software, but customers don't always do what they
> > > > should.  Equally important, the out-of-band channel allows the cloud
> > > > infrastructure software to notice trends, such as that the rate of Linux
> > > > panics has increased, and that perhaps there is a cloud problem that
> > > > should be investigated.
> > > 
> > > 
> > > In many cases (especially in cloud environment) your dump device is 
> > > remote (e.g. iscsi) and kdump sometimes (often?) gets stuck because of 
> > > connectivity issues (which could be cause of the panic in the first 
> > > place). So it is quite desirable to inform the infrastructure that the VM 
> > > is on its way out without waiting for kdump to complete.
> > 
> > That can probably be done in the kdump kernel if it is really needed, i.e.
> > informing the host that a panic happened and a kdump kernel is running.
> 
> If kdump kernel gets to that point. Sometimes (sadly) it ends up being
> misconfigured and it chokes up - and hence having multiple ways to emit
> the crash information before running kdump kernel is a life-saver.

If it is done in the kdump kernel's boot phase, before PID 1 comes up, then
things should be good enough, specifically for KVM/Hyper-V guests.

> 
> > 
> > But I still think setting crash_kexec_post_notifiers by default is bad.
> 
> Because of the way it is run today I presume? If there was some
> safe/unsafe policy that should work right? I would think that the
> safe ones that work properly all the time are:
> 
>  - HyperV CRASH_MSRs,
>  - KVM PVPANIC_[PANIC,CRASHLOAD] push button knob,
>  - pstore EFI variables
>  - Dumping in memory,
> 
> And then some that depend on firmware version (aka BIOS, and vendor) are:
>  - ACPI ERST,
> 
> And then the unsafe:
>  - s390, PowerPC (I don't actually know what they are but that
> was Dave's primary motivator).

As I said, we have also got reports of kdump kernel hangs on Hyper-V with
crash_kexec_post_notifiers enabled.

EFI pstore also depends on EFI runtime services, which live in firmware,
and we cannot ensure they work well after a panic has happened. Ditto for
other pstore backends: we prefer not to run them before kdump. But as I
said, I'm not saying they are not useful; people can use them by their
own choice.

As for the virtual machine panic events, maybe it is OK to add some other
hooks instead of the notifiers. But frankly, I still feel it is better to
do it in the kdump kernel boot path, since kdump works well for virt in
our experience.

> 
> > 
> > > 
> > > 
> > > >
> > > >> Further like I have mentioned every time something like this has come up
> > > >> a call on the kexec on panic code path should be a direct call (That can
> > > >> be audited) not something hidden in a notifier call chain (which can not).
> > > >>
> > > 
> > > We btw already have a direct call from panic() to kmsg_dump() which is 
> > > indirectly controlled by crash_kexec_post_notifiers, and it would also be 
> > > preferable to be able to call it before kdump as well.
> > 
> > Right, that is the same thing we are talking about.
> > 
> > Thanks
> > Dave
> > 
> 

Thanks
Dave




Re: [PATCH] Only allow to set crash_kexec_post_notifiers on boot time

2020-09-24 Thread Dave Young
Hi,

On 09/24/20 at 01:16pm, boris.ostrov...@oracle.com wrote:
> 
> On 9/24/20 12:43 PM, Michael Kelley wrote:
> > From: Eric W. Biederman  Sent: Thursday, September 24, 2020 9:26 AM
> >> Michael Kelley  writes:
> >>
> > Added the Hyper-V people and the people who created the param; the commit
> > is below. I also want to remove it if possible; let's see what people
> > think, but at the very least we should disable the automatic setting in
> > both systemd and the kernel:
> >>> Hyper-V uses a notifier to inform the host system that a Linux VM has
> >>> panic'ed.  Informing the host is particularly important in a public cloud
> >>> such as Azure so that the cloud software can alert the customer, and can
> >>> track cloud-wide reliability statistics.   Whether a kdump is taken is controlled
> >>> entirely by the customer and how he configures the VM, and we want
> >>> the host to be informed either way.
> >> Why?
> >>
> >> Why does the host care?
> >> Especially if the VM continues executing into a kdump kernel?
> > The host itself doesn't care.  But the host is a convenient out-of-band
> > channel for recording that a panic has occurred and to collect basic data
> > about the panic.  This out-of-band channel is then used to notify the end
> > customer that his VM has panic'ed.  Sure, the customer should be running
> > his own monitoring software, but customers don't always do what they
> > should.  Equally important, the out-of-band channel allows the cloud
> > infrastructure software to notice trends, such as that the rate of Linux
> > panics has increased, and that perhaps there is a cloud problem that
> > should be investigated.
> 
> 
> In many cases (especially in cloud environment) your dump device is remote 
> (e.g. iscsi) and kdump sometimes (often?) gets stuck because of connectivity 
> issues (which could be cause of the panic in the first place). So it is quite 
> desirable to inform the infrastructure that the VM is on its way out without 
> waiting for kdump to complete.

That can probably be done in the kdump kernel if it is really needed, i.e.
informing the host that a panic happened and a kdump kernel is running.

But I still think setting crash_kexec_post_notifiers by default is bad.

> 
> 
> >
> >> Further like I have mentioned every time something like this has come up
> >> a call on the kexec on panic code path should be a direct call (That can
> >> be audited) not something hidden in a notifier call chain (which can not).
> >>
> 
> We btw already have a direct call from panic() to kmsg_dump() which is 
> indirectly controlled by crash_kexec_post_notifiers, and it would also be 
> preferable to be able to call it before kdump as well.

Right, that is the same thing we are talking about.

Thanks
Dave




Re: [PATCH] Only allow to set crash_kexec_post_notifiers on boot time

2020-09-22 Thread Dave Young
+ more people who may care about this param 
On 09/21/20 at 08:45pm, Eric W. Biederman wrote:
> Konrad Rzeszutek Wilk  writes:
> 
> > On Fri, Sep 18, 2020 at 05:47:43PM -0700, Andrew Morton wrote:
> >> On Fri, 18 Sep 2020 11:25:46 +0800 Dave Young  wrote:
> >> 
> >> > crash_kexec_post_notifiers enables running various panic notifiers
> >> > before the kdump kernel boots. This increases the risk of kdump failure.
> >> > It is well documented in kernel-parameters.txt. We do not suggest that
> >> > people enable it together with kdump unless they are really sure.
> >> > Nor is it suggested to enable it by default in distributions, where
> >> > users may not be aware of it.
> >> > 
> >> > But unfortunately it is enabled by default in systemd, see below
> >> > discussions in a systemd report, we can not convince systemd to change
> >> > it:
> >> > https://github.com/systemd/systemd/issues/16661
> >> > 
> >> > Actually, we have got reports of kdump kernel hangs in both the s390x
> >> > and powerpcle cases caused by the systemd change; some x86 cases may
> >> > also be caused by the same thing (although that is in Hyper-V code
> >> > instead of systemd, and needs to be addressed separately).
> >
> > Perhaps it may be better to fix the issues on s390x and PowerPC as well?
> >
> >> > 
> >> > Thus, to avoid the automatic enablement, just remove the write
> >> > permission of the param in sysfs.
> >> > 
> >> 
> >> Well.  I don't think this is at all a desirable way of resolving a
> >> disagreement with the systemd developers
> >> 
> >> At the above github address I'm seeing "ryncsn added a commit to
> >> ryncsn/systemd that referenced this issue 9 days ago", "pstore: don't
> >> enable crash_kexec_post_notifiers by default".  So didn't that address
> >> the issue?
> >
> > It does in systemd, but there is a strong interest in making this on
> > by default.
> 
> There is also a strong interest in removing this code entirely from the
> kernel.

Added the Hyper-V people and the people who created the param; the commit
is below. I also want to remove it if possible; let's see what people
think, but at the very least we should disable the automatic setting in
both systemd and the kernel:

commit f06e5153f4ae2e2f3b0300f0e260e40cb7fefd45
Author: Masami Hiramatsu 
Date:   Fri Jun 6 14:37:07 2014 -0700

    kernel/panic.c: add "crash_kexec_post_notifiers" option for kdump after panic_notifers

Add a "crash_kexec_post_notifiers" boot option to run kdump after
running panic_notifiers and dump kmsg.  This can help rare situations
where kdump fails because of unstable crashed kernel or hardware failure
(memory corruption on critical data/code), or the 2nd kernel is already
broken by the 1st kernel (it's a broken behavior, but who can guarantee
that the "crashed" kernel works correctly?).

Usage: add "crash_kexec_post_notifiers" to kernel boot option.

Note that this actually increases risks of the failure of kdump.  This
option should be set only if you worry about the rare case of kdump
failure rather than increasing the chance of success.
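To make the ordering this option changes concrete, here is a simplified userspace model of the relevant part of `panic()`. It assumes, as in kernels of this era, that without the option `__crash_kexec()` runs before the panic notifiers and `kmsg_dump()`, and with it runs after them. `model_panic()` and the step names are hypothetical; SMP stop handling, panic timeouts and everything else in the real function are omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_STEPS 8

struct panic_trace {
	const char *steps[MAX_STEPS];
	int n;
};

static void record(struct panic_trace *t, const char *step)
{
	assert(t->n < MAX_STEPS);
	t->steps[t->n++] = step;
}

/* Simplified model of panic() ordering. Returns true if control passed
 * to the crash kernel; in the real kernel that call does not return,
 * so nothing after it runs. */
static bool model_panic(bool post_notifiers, bool kdump_loaded,
			struct panic_trace *t)
{
	t->n = 0;
	if (!post_notifiers) {
		record(t, "crash_kexec");
		if (kdump_loaded)
			return true;
	}
	record(t, "panic_notifiers");
	record(t, "kmsg_dump");
	if (post_notifiers) {
		record(t, "crash_kexec");
		if (kdump_loaded)
			return true;
	}
	record(t, "reboot_or_halt");	/* no crash kernel loaded */
	return false;
}
```

With the option off and a crash kernel loaded, the trace contains only `crash_kexec`: nothing in the notifier chain runs before the crash kernel takes over, which is the property argued for in this thread.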

> 
> This failure is a case in point.
> 
> I think I am at my I told you so point.  This is what all of the testing
> over all the years has said.  Leaving functionality to the peculiarities
> of firmware when you don't have to, and can actually control what is
> going on doesn't work.
> 
> Eric
> 
> 

Thanks
Dave




Re: [PATCH] Only allow to set crash_kexec_post_notifiers on boot time

2020-09-22 Thread Dave Young
On 09/21/20 at 04:18pm, Konrad Rzeszutek Wilk wrote:
> On Fri, Sep 18, 2020 at 05:47:43PM -0700, Andrew Morton wrote:
> > On Fri, 18 Sep 2020 11:25:46 +0800 Dave Young  wrote:
> > 
> > > crash_kexec_post_notifiers enables running various panic notifiers
> > > before the kdump kernel boots. This increases the risk of kdump failure.
> > > It is well documented in kernel-parameters.txt. We do not suggest that
> > > people enable it together with kdump unless they are really sure.
> > > Nor is it suggested to enable it by default in distributions, where
> > > users may not be aware of it.
> > > 
> > > But unfortunately it is enabled by default in systemd, see below
> > > discussions in a systemd report, we can not convince systemd to change
> > > it:
> > > https://github.com/systemd/systemd/issues/16661
> > > 
> > > Actually, we have got reports of kdump kernel hangs in both the s390x
> > > and powerpcle cases caused by the systemd change; some x86 cases may
> > > also be caused by the same thing (although that is in Hyper-V code
> > > instead of systemd, and needs to be addressed separately).
> 
> Perhaps it may be better to fix the issues on s390x and PowerPC as well?
> 
> > > 
> > > Thus, to avoid the automatic enablement, just remove the write
> > > permission of the param in sysfs.
> > > 
> > 
> > Well.  I don't think this is at all a desirable way of resolving a
> > disagreement with the systemd developers
> > 
> > At the above github address I'm seeing "ryncsn added a commit to
> > ryncsn/systemd that referenced this issue 9 days ago", "pstore: don't
> > enable crash_kexec_post_notifiers by default".  So didn't that address
> > the issue?
> 
> It does in systemd, but there is a strong interest in making this on by default.

I understand there could be such interest, but we have to keep in mind
that any extra work done after a system crash can make kdump unreliable.

I do not object to people using pstore, but I do object to enabling the
notifiers by default.

BTW, crash notifiers are not limited to pstore; there are quite a lot of
other pieces, like the LED trigger, etc.

Thanks
Dave

