Re: [PATCH v2 1/2] KVM: Use syscore_ops instead of reboot_notifier to hook restart/shutdown

2023-12-12 Thread Gowans, James
On Tue, 2023-12-12 at 10:50 +0200, James Gowans wrote:
> > 
> > In any event I believe the bug with respect to kexec was introduced in
> > commit 6f389a8f1dd2 ("PM / reboot: call syscore_shutdown() after
> > disable_nonboot_cpus()").  That is where syscore_shutdown was removed
> > from kernel_restart_prepare().
> > 
> > At this point it looks like someone just needs to add the missing
> > syscore_shutdown call into kernel_kexec() right after
> > migrate_to_reboot_cpu() is called.
> 
> Seems good and I'm happy to do that; one thing we need to check first:
> are all CPUs online at that point? The commit message for
> 6f389a8f1dd2 ("PM / reboot: call syscore_shutdown() after disable_nonboot_cpus()")
> speaks about: "one CPU on-line and interrupts disabled" when
> syscore_shutdown is called. KVM's syscore shutdown hook does:
> 
> on_each_cpu(hardware_disable_nolock, NULL, 1);
> 
> ... so that smells to me like it wants all the CPUs to be online at
> kvm_shutdown point.
> 
> It's not clear to me:
> 
> 1. Does hardware_disable_nolock actually need to be done on *every* CPU
> or would the offlined ones be fine to ignore because they will be reset
> and the VMXE bit will be cleared that way? With cooperative CPU handover
> we probably do indeed want to do this on every CPU and not depend on
> resetting.
> 
> 2. Are CPUs actually offline at this point? When that commit was
> authored there used to be a call to hardware_disable_nolock() but that's
> not there anymore.

I've sent out a patch:
https://lore.kernel.org/kexec/20231213064004.2419447-1-jgow...@amazon.com/T/#u

Let's continue the discussion there.

JG
___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


[PATCH] kexec: do syscore_shutdown() in kernel_kexec

2023-12-12 Thread James Gowans
syscore_shutdown() runs driver and module callbacks to get the system
into a state where it can be correctly shut down. In commit
6f389a8f1dd2 ("PM / reboot: call syscore_shutdown() after disable_nonboot_cpus()"),
syscore_shutdown() was removed from kernel_restart_prepare() and hence
got (incorrectly?) removed from the kexec flow. This was innocuous until
commit 6735150b6997 ("KVM: Use syscore_ops instead of reboot_notifier to hook restart/shutdown")
changed the way that KVM registered its shutdown callbacks, switching from
reboot notifiers to syscore_ops.shutdown. As syscore_shutdown() is
missing from kexec, KVM's shutdown hook is not run and virtualisation is
left enabled on the boot CPU, which results in triple faults when
switching to the new kernel on Intel x86 VT-x with VMXE enabled.

Fix this by adding syscore_shutdown() to the kexec sequence. In terms of
where to add it, it is being added after migrating the kexec task to the
boot CPU, but before APs are shut down. It is not totally clear if this
is the best place: commit 6f389a8f1dd2 ("PM / reboot: call syscore_shutdown() after disable_nonboot_cpus()")
states that "syscore_ops operations should be carried with one
CPU on-line and interrupts disabled." APs are only offlined later in
machine_shutdown(), so this syscore_shutdown() is being run while APs
are still online. This seems to be the correct place as it matches where
syscore_shutdown() is run in the reboot and halt flows - they also run
it before APs are shut down. The assumption is that the message of
commit 6f389a8f1dd2 is no longer valid.

KVM has been discussed here as it is what broke loudly by not having
syscore_shutdown() in kexec, but this change impacts more than just KVM:
all drivers/modules which register a syscore_ops.shutdown callback will
now have it invoked in the kexec flow. Looking at some of them, like x86
MCE, it is probably more correct to also shut these down during kexec.
Maintainers of all drivers which use syscore_ops.shutdown are added on
CC for visibility. They are:

arch/powerpc/platforms/cell/spu_base.c  .shutdown = spu_shutdown,
arch/x86/kernel/cpu/mce/core.c  .shutdown = mce_syscore_shutdown,
arch/x86/kernel/i8259.c .shutdown = i8259A_shutdown,
drivers/irqchip/irq-i8259.c .shutdown = i8259A_shutdown,
drivers/irqchip/irq-sun6i-r.c   .shutdown = sun6i_r_intc_shutdown,
drivers/leds/trigger/ledtrig-cpu.c  .shutdown = ledtrig_cpu_syscore_shutdown,
drivers/power/reset/sc27xx-poweroff.c   .shutdown = sc27xx_poweroff_shutdown,
kernel/irq/generic-chip.c   .shutdown = irq_gc_shutdown,
virt/kvm/kvm_main.c .shutdown = kvm_shutdown,

This has been tested by doing a kexec on x86_64 and aarch64.

Fixes: 6735150b6997 ("KVM: Use syscore_ops instead of reboot_notifier to hook restart/shutdown")

Signed-off-by: James Gowans 
Cc: Eric Biederman 
Cc: Paolo Bonzini 
Cc: Sean Christopherson 
Cc: Marc Zyngier 
Cc: Arnd Bergmann 
Cc: Tony Luck 
Cc: Borislav Petkov 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Chen-Yu Tsai 
Cc: Jernej Skrabec 
Cc: Samuel Holland 
Cc: Pavel Machek 
Cc: Sebastian Reichel 
Cc: Orson Zhai 
Cc: Alexander Graf 
Cc: Jan H. Schoenherr 
---
 kernel/kexec_core.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
index be5642a4ec49..b926c4db8a91 100644
--- a/kernel/kexec_core.c
+++ b/kernel/kexec_core.c
@@ -1254,6 +1254,7 @@ int kernel_kexec(void)
kexec_in_progress = true;
kernel_restart_prepare("kexec reboot");
migrate_to_reboot_cpu();
+   syscore_shutdown();
 
/*
 * migrate_to_reboot_cpu() disables CPU hotplug assuming that
-- 
2.34.1




[PATCH v4 6/7] kexec_file, power: print out debugging message if required

2023-12-12 Thread Baoquan He
Then, when specifying '-d' for the kexec_file_load interface, the loaded
locations of kernel/initrd/cmdline etc. can be printed out to help
debugging.

Here, replace pr_debug() with the newly added kexec_dprintk() in the
kexec_file loading code.

Signed-off-by: Baoquan He 
---
 arch/powerpc/kexec/elf_64.c   |  8 
 arch/powerpc/kexec/file_load_64.c | 18 +-
 2 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/arch/powerpc/kexec/elf_64.c b/arch/powerpc/kexec/elf_64.c
index eeb258002d1e..904016cf89ea 100644
--- a/arch/powerpc/kexec/elf_64.c
+++ b/arch/powerpc/kexec/elf_64.c
@@ -59,7 +59,7 @@ static void *elf64_load(struct kimage *image, char 
*kernel_buf,
if (ret)
goto out;
 
-   pr_debug("Loaded the kernel at 0x%lx\n", kernel_load_addr);
+   kexec_dprintk("Loaded the kernel at 0x%lx\n", kernel_load_addr);
 
ret = kexec_load_purgatory(image, );
if (ret) {
@@ -67,7 +67,7 @@ static void *elf64_load(struct kimage *image, char 
*kernel_buf,
goto out;
}
 
-   pr_debug("Loaded purgatory at 0x%lx\n", pbuf.mem);
+   kexec_dprintk("Loaded purgatory at 0x%lx\n", pbuf.mem);
 
/* Load additional segments needed for panic kernel */
if (image->type == KEXEC_TYPE_CRASH) {
@@ -99,7 +99,7 @@ static void *elf64_load(struct kimage *image, char 
*kernel_buf,
goto out;
initrd_load_addr = kbuf.mem;
 
-   pr_debug("Loaded initrd at 0x%lx\n", initrd_load_addr);
+   kexec_dprintk("Loaded initrd at 0x%lx\n", initrd_load_addr);
}
 
fdt = of_kexec_alloc_and_setup_fdt(image, initrd_load_addr,
@@ -132,7 +132,7 @@ static void *elf64_load(struct kimage *image, char 
*kernel_buf,
 
fdt_load_addr = kbuf.mem;
 
-   pr_debug("Loaded device tree at 0x%lx\n", fdt_load_addr);
+   kexec_dprintk("Loaded device tree at 0x%lx\n", fdt_load_addr);
 
slave_code = elf_info.buffer + elf_info.proghdrs[0].p_offset;
ret = setup_purgatory_ppc64(image, slave_code, fdt, kernel_load_addr,
diff --git a/arch/powerpc/kexec/file_load_64.c 
b/arch/powerpc/kexec/file_load_64.c
index 961a6dd67365..5b4c5cb23354 100644
--- a/arch/powerpc/kexec/file_load_64.c
+++ b/arch/powerpc/kexec/file_load_64.c
@@ -577,7 +577,7 @@ static int add_usable_mem_property(void *fdt, struct 
device_node *dn,
   NODE_PATH_LEN, dn);
return -EOVERFLOW;
}
-   pr_debug("Memory node path: %s\n", path);
+   kexec_dprintk("Memory node path: %s\n", path);
 
/* Now that we know the path, find its offset in kdump kernel's fdt */
node = fdt_path_offset(fdt, path);
@@ -590,8 +590,8 @@ static int add_usable_mem_property(void *fdt, struct 
device_node *dn,
/* Get the address & size cells */
n_mem_addr_cells = of_n_addr_cells(dn);
n_mem_size_cells = of_n_size_cells(dn);
-   pr_debug("address cells: %d, size cells: %d\n", n_mem_addr_cells,
-n_mem_size_cells);
+   kexec_dprintk("address cells: %d, size cells: %d\n", n_mem_addr_cells,
+ n_mem_size_cells);
 
um_info->idx  = 0;
if (!check_realloc_usable_mem(um_info, 2)) {
@@ -664,7 +664,7 @@ static int update_usable_mem_fdt(void *fdt, struct 
crash_mem *usable_mem)
 
node = fdt_path_offset(fdt, "/ibm,dynamic-reconfiguration-memory");
if (node == -FDT_ERR_NOTFOUND)
-   pr_debug("No dynamic reconfiguration memory found\n");
+   kexec_dprintk("No dynamic reconfiguration memory found\n");
else if (node < 0) {
pr_err("Malformed device tree: error reading 
/ibm,dynamic-reconfiguration-memory.\n");
return -EINVAL;
@@ -776,8 +776,8 @@ static void update_backup_region_phdr(struct kimage *image, 
Elf64_Ehdr *ehdr)
for (i = 0; i < ehdr->e_phnum; i++) {
if (phdr->p_paddr == BACKUP_SRC_START) {
phdr->p_offset = image->arch.backup_start;
-   pr_debug("Backup region offset updated to 0x%lx\n",
-image->arch.backup_start);
+   kexec_dprintk("Backup region offset updated to 0x%lx\n",
+ image->arch.backup_start);
return;
}
}
@@ -850,7 +850,7 @@ int load_crashdump_segments_ppc64(struct kimage *image,
pr_err("Failed to load backup segment\n");
return ret;
}
-   pr_debug("Loaded the backup region at 0x%lx\n", kbuf->mem);
+   kexec_dprintk("Loaded the backup region at 0x%lx\n", kbuf->mem);
 
/* Load elfcorehdr segment - to export crashing kernel's vmcore */
ret = load_elfcorehdr_segment(image, kbuf);
@@ -858,8 +858,8 @@ int load_crashdump_segments_ppc64(struct kimage *image,
pr_err("Failed to load elfcorehdr segment\n");
return 

[PATCH v4 5/7] kexec_file, riscv: print out debugging message if required

2023-12-12 Thread Baoquan He
Then, when specifying '-d' for the kexec_file_load interface, the loaded
locations of kernel/initrd/cmdline etc. can be printed out to help
debugging.

Here, replace pr_debug() with the newly added kexec_dprintk() in the
kexec_file loading code.

Also replace pr_notice() with kexec_dprintk() in elf_kexec_load(): the
loaded locations of the purgatory and device tree are only printed for
debugging, so it doesn't make sense to always print them.

Also remove kexec_image_info(), since its content is already printed by
the generic code.

Signed-off-by: Baoquan He 
---
 arch/riscv/kernel/elf_kexec.c | 11 ++-
 arch/riscv/kernel/machine_kexec.c | 26 --
 2 files changed, 6 insertions(+), 31 deletions(-)

diff --git a/arch/riscv/kernel/elf_kexec.c b/arch/riscv/kernel/elf_kexec.c
index e60fbd8660c4..5bd1ec3341fe 100644
--- a/arch/riscv/kernel/elf_kexec.c
+++ b/arch/riscv/kernel/elf_kexec.c
@@ -216,7 +216,6 @@ static void *elf_kexec_load(struct kimage *image, char 
*kernel_buf,
if (ret)
goto out;
kernel_start = image->start;
-   pr_notice("The entry point of kernel at 0x%lx\n", image->start);
 
/* Add the kernel binary to the image */
ret = riscv_kexec_elf_load(image, , _info,
@@ -252,8 +251,8 @@ static void *elf_kexec_load(struct kimage *image, char 
*kernel_buf,
image->elf_load_addr = kbuf.mem;
image->elf_headers_sz = headers_sz;
 
-   pr_debug("Loaded elf core header at 0x%lx bufsz=0x%lx 
memsz=0x%lx\n",
-image->elf_load_addr, kbuf.bufsz, kbuf.memsz);
+   kexec_dprintk("Loaded elf core header at 0x%lx bufsz=0x%lx 
memsz=0x%lx\n",
+ image->elf_load_addr, kbuf.bufsz, kbuf.memsz);
 
/* Setup cmdline for kdump kernel case */
modified_cmdline = setup_kdump_cmdline(image, cmdline,
@@ -275,6 +274,8 @@ static void *elf_kexec_load(struct kimage *image, char 
*kernel_buf,
pr_err("Error loading purgatory ret=%d\n", ret);
goto out;
}
+   kexec_dprintk("Loaded purgatory at 0x%lx\n", kbuf.mem);
+
ret = kexec_purgatory_get_set_symbol(image, "riscv_kernel_entry",
 _start,
 sizeof(kernel_start), 0);
@@ -293,7 +294,7 @@ static void *elf_kexec_load(struct kimage *image, char 
*kernel_buf,
if (ret)
goto out;
initrd_pbase = kbuf.mem;
-   pr_notice("Loaded initrd at 0x%lx\n", initrd_pbase);
+   kexec_dprintk("Loaded initrd at 0x%lx\n", initrd_pbase);
}
 
/* Add the DTB to the image */
@@ -318,7 +319,7 @@ static void *elf_kexec_load(struct kimage *image, char 
*kernel_buf,
}
/* Cache the fdt buffer address for memory cleanup */
image->arch.fdt = fdt;
-   pr_notice("Loaded device tree at 0x%lx\n", kbuf.mem);
+   kexec_dprintk("Loaded device tree at 0x%lx\n", kbuf.mem);
goto out;
 
 out_free_fdt:
diff --git a/arch/riscv/kernel/machine_kexec.c 
b/arch/riscv/kernel/machine_kexec.c
index 2d139b724bc8..ed9cad20c039 100644
--- a/arch/riscv/kernel/machine_kexec.c
+++ b/arch/riscv/kernel/machine_kexec.c
@@ -18,30 +18,6 @@
 #include 
 #include 
 
-/*
- * kexec_image_info - Print received image details
- */
-static void
-kexec_image_info(const struct kimage *image)
-{
-   unsigned long i;
-
-   pr_debug("Kexec image info:\n");
-   pr_debug("\ttype:%d\n", image->type);
-   pr_debug("\tstart:   %lx\n", image->start);
-   pr_debug("\thead:%lx\n", image->head);
-   pr_debug("\tnr_segments: %lu\n", image->nr_segments);
-
-   for (i = 0; i < image->nr_segments; i++) {
-   pr_debug("\tsegment[%lu]: %016lx - %016lx", i,
-   image->segment[i].mem,
-   image->segment[i].mem + image->segment[i].memsz);
-   pr_debug("\t\t0x%lx bytes, %lu pages\n",
-   (unsigned long) image->segment[i].memsz,
-   (unsigned long) image->segment[i].memsz /  PAGE_SIZE);
-   }
-}
-
 /*
  * machine_kexec_prepare - Initialize kexec
  *
@@ -60,8 +36,6 @@ machine_kexec_prepare(struct kimage *image)
unsigned int control_code_buffer_sz = 0;
int i = 0;
 
-   kexec_image_info(image);
-
/* Find the Flattened Device Tree and save its physical address */
for (i = 0; i < image->nr_segments; i++) {
if (image->segment[i].memsz <= sizeof(fdt))
-- 
2.41.0




[PATCH v4 7/7] kexec_file, parisc: print out debugging message if required

2023-12-12 Thread Baoquan He
Then, when specifying '-d' for the kexec_file_load interface, the loaded
locations of kernel/initrd/cmdline etc. can be printed out to help
debugging.

Here, replace pr_debug() with the newly added kexec_dprintk() in the
kexec_file loading code.

Signed-off-by: Baoquan He 
---
 arch/parisc/kernel/kexec_file.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/parisc/kernel/kexec_file.c b/arch/parisc/kernel/kexec_file.c
index 8c534204f0fd..3fc82130b6c3 100644
--- a/arch/parisc/kernel/kexec_file.c
+++ b/arch/parisc/kernel/kexec_file.c
@@ -38,8 +38,8 @@ static void *elf_load(struct kimage *image, char *kernel_buf,
for (i = 0; i < image->nr_segments; i++)
image->segment[i].mem = __pa(image->segment[i].mem);
 
-   pr_debug("Loaded the kernel at 0x%lx, entry at 0x%lx\n",
-kernel_load_addr, image->start);
+   kexec_dprintk("Loaded the kernel at 0x%lx, entry at 0x%lx\n",
+ kernel_load_addr, image->start);
 
if (initrd != NULL) {
kbuf.buffer = initrd;
@@ -51,7 +51,7 @@ static void *elf_load(struct kimage *image, char *kernel_buf,
if (ret)
goto out;
 
-   pr_debug("Loaded initrd at 0x%lx\n", kbuf.mem);
+   kexec_dprintk("Loaded initrd at 0x%lx\n", kbuf.mem);
image->arch.initrd_start = kbuf.mem;
image->arch.initrd_end = kbuf.mem + initrd_len;
}
@@ -68,7 +68,7 @@ static void *elf_load(struct kimage *image, char *kernel_buf,
if (ret)
goto out;
 
-   pr_debug("Loaded cmdline at 0x%lx\n", kbuf.mem);
+   kexec_dprintk("Loaded cmdline at 0x%lx\n", kbuf.mem);
image->arch.cmdline = kbuf.mem;
}
 out:
-- 
2.41.0




[PATCH v4 4/7] kexec_file, arm64: print out debugging message if required

2023-12-12 Thread Baoquan He
Then, when specifying '-d' for the kexec_file_load interface, the loaded
locations of kernel/initrd/cmdline etc. can be printed out to help
debugging.

Here, replace pr_debug() with the newly added kexec_dprintk() in the
kexec_file loading code.

Also remove the kimage->segment[] printing, because the generic code
already prints it.

Signed-off-by: Baoquan He 
---
 arch/arm64/kernel/kexec_image.c|  6 +++---
 arch/arm64/kernel/machine_kexec.c  | 26 ++
 arch/arm64/kernel/machine_kexec_file.c | 12 ++--
 3 files changed, 15 insertions(+), 29 deletions(-)

diff --git a/arch/arm64/kernel/kexec_image.c b/arch/arm64/kernel/kexec_image.c
index 636be6715155..532d72ea42ee 100644
--- a/arch/arm64/kernel/kexec_image.c
+++ b/arch/arm64/kernel/kexec_image.c
@@ -122,9 +122,9 @@ static void *image_load(struct kimage *image,
kernel_segment->memsz -= text_offset;
image->start = kernel_segment->mem;
 
-   pr_debug("Loaded kernel at 0x%lx bufsz=0x%lx memsz=0x%lx\n",
-   kernel_segment->mem, kbuf.bufsz,
-   kernel_segment->memsz);
+   kexec_dprintk("Loaded kernel at 0x%lx bufsz=0x%lx memsz=0x%lx\n",
+ kernel_segment->mem, kbuf.bufsz,
+ kernel_segment->memsz);
 
return NULL;
 }
diff --git a/arch/arm64/kernel/machine_kexec.c 
b/arch/arm64/kernel/machine_kexec.c
index 078910db77a4..b38aae5b488d 100644
--- a/arch/arm64/kernel/machine_kexec.c
+++ b/arch/arm64/kernel/machine_kexec.c
@@ -32,26 +32,12 @@
 static void _kexec_image_info(const char *func, int line,
const struct kimage *kimage)
 {
-   unsigned long i;
-
-   pr_debug("%s:%d:\n", func, line);
-   pr_debug("  kexec kimage info:\n");
-   pr_debug("type:%d\n", kimage->type);
-   pr_debug("start:   %lx\n", kimage->start);
-   pr_debug("head:%lx\n", kimage->head);
-   pr_debug("nr_segments: %lu\n", kimage->nr_segments);
-   pr_debug("dtb_mem: %pa\n", >arch.dtb_mem);
-   pr_debug("kern_reloc: %pa\n", >arch.kern_reloc);
-   pr_debug("el2_vectors: %pa\n", >arch.el2_vectors);
-
-   for (i = 0; i < kimage->nr_segments; i++) {
-   pr_debug("  segment[%lu]: %016lx - %016lx, 0x%lx bytes, %lu 
pages\n",
-   i,
-   kimage->segment[i].mem,
-   kimage->segment[i].mem + kimage->segment[i].memsz,
-   kimage->segment[i].memsz,
-   kimage->segment[i].memsz /  PAGE_SIZE);
-   }
+   kexec_dprintk("%s:%d:\n", func, line);
+   kexec_dprintk("  kexec kimage info:\n");
+   kexec_dprintk("type:%d\n", kimage->type);
+   kexec_dprintk("head:%lx\n", kimage->head);
+   kexec_dprintk("kern_reloc: %pa\n", >arch.kern_reloc);
+   kexec_dprintk("el2_vectors: %pa\n", >arch.el2_vectors);
 }
 
 void machine_kexec_cleanup(struct kimage *kimage)
diff --git a/arch/arm64/kernel/machine_kexec_file.c 
b/arch/arm64/kernel/machine_kexec_file.c
index a11a6e14ba89..0e017358f4ba 100644
--- a/arch/arm64/kernel/machine_kexec_file.c
+++ b/arch/arm64/kernel/machine_kexec_file.c
@@ -127,8 +127,8 @@ int load_other_segments(struct kimage *image,
image->elf_load_addr = kbuf.mem;
image->elf_headers_sz = headers_sz;
 
-   pr_debug("Loaded elf core header at 0x%lx bufsz=0x%lx 
memsz=0x%lx\n",
-image->elf_load_addr, kbuf.bufsz, kbuf.memsz);
+   kexec_dprintk("Loaded elf core header at 0x%lx bufsz=0x%lx 
memsz=0x%lx\n",
+ image->elf_load_addr, kbuf.bufsz, kbuf.memsz);
}
 
/* load initrd */
@@ -148,8 +148,8 @@ int load_other_segments(struct kimage *image,
goto out_err;
initrd_load_addr = kbuf.mem;
 
-   pr_debug("Loaded initrd at 0x%lx bufsz=0x%lx memsz=0x%lx\n",
-   initrd_load_addr, kbuf.bufsz, kbuf.memsz);
+   kexec_dprintk("Loaded initrd at 0x%lx bufsz=0x%lx 
memsz=0x%lx\n",
+ initrd_load_addr, kbuf.bufsz, kbuf.memsz);
}
 
/* load dtb */
@@ -179,8 +179,8 @@ int load_other_segments(struct kimage *image,
image->arch.dtb = dtb;
image->arch.dtb_mem = kbuf.mem;
 
-   pr_debug("Loaded dtb at 0x%lx bufsz=0x%lx memsz=0x%lx\n",
-   kbuf.mem, kbuf.bufsz, kbuf.memsz);
+   kexec_dprintk("Loaded dtb at 0x%lx bufsz=0x%lx memsz=0x%lx\n",
+ kbuf.mem, kbuf.bufsz, kbuf.memsz);
 
return 0;
 
-- 
2.41.0




[PATCH v4 3/7] kexec_file, x86: print out debugging message if required

2023-12-12 Thread Baoquan He
Then, when specifying '-d' for the kexec_file_load interface, the loaded
locations of kernel/initrd/cmdline etc. can be printed out to help
debugging.

Here, replace pr_debug() with the newly added kexec_dprintk() in the
kexec_file loading code.

Also print out the e820 memmap passed to the second kernel, just as the
kexec_load interface does.

Signed-off-by: Baoquan He 
---
 arch/x86/kernel/crash.c   |  4 ++--
 arch/x86/kernel/kexec-bzimage64.c | 23 ++-
 2 files changed, 16 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
index c92d88680dbf..1715e5f06a59 100644
--- a/arch/x86/kernel/crash.c
+++ b/arch/x86/kernel/crash.c
@@ -386,8 +386,8 @@ int crash_load_segments(struct kimage *image)
if (ret)
return ret;
image->elf_load_addr = kbuf.mem;
-   pr_debug("Loaded ELF headers at 0x%lx bufsz=0x%lx memsz=0x%lx\n",
-image->elf_load_addr, kbuf.bufsz, kbuf.memsz);
+   kexec_dprintk("Loaded ELF headers at 0x%lx bufsz=0x%lx memsz=0x%lx\n",
+ image->elf_load_addr, kbuf.bufsz, kbuf.memsz);
 
return ret;
 }
diff --git a/arch/x86/kernel/kexec-bzimage64.c 
b/arch/x86/kernel/kexec-bzimage64.c
index a61c12c01270..e9ae0eac6bf9 100644
--- a/arch/x86/kernel/kexec-bzimage64.c
+++ b/arch/x86/kernel/kexec-bzimage64.c
@@ -82,7 +82,7 @@ static int setup_cmdline(struct kimage *image, struct 
boot_params *params,
 
cmdline_ptr[cmdline_len - 1] = '\0';
 
-   pr_debug("Final command line is: %s\n", cmdline_ptr);
+   kexec_dprintk("Final command line is: %s\n", cmdline_ptr);
cmdline_ptr_phys = bootparams_load_addr + cmdline_offset;
cmdline_low_32 = cmdline_ptr_phys & 0xUL;
cmdline_ext_32 = cmdline_ptr_phys >> 32;
@@ -272,7 +272,12 @@ setup_boot_parameters(struct kimage *image, struct 
boot_params *params,
 
nr_e820_entries = params->e820_entries;
 
+   kexec_dprintk("E820 memmap:\n");
for (i = 0; i < nr_e820_entries; i++) {
+   kexec_dprintk("%016llx-%016llx (%d)\n",
+ params->e820_table[i].addr,
+ params->e820_table[i].addr + 
params->e820_table[i].size - 1,
+ params->e820_table[i].type);
if (params->e820_table[i].type != E820_TYPE_RAM)
continue;
start = params->e820_table[i].addr;
@@ -424,7 +429,7 @@ static void *bzImage64_load(struct kimage *image, char 
*kernel,
 * command line. Make sure it does not overflow
 */
if (cmdline_len + MAX_ELFCOREHDR_STR_LEN > header->cmdline_size) {
-   pr_debug("Appending elfcorehdr= to command line exceeds 
maximum allowed length\n");
+   kexec_dprintk("Appending elfcorehdr= to command line 
exceeds maximum allowed length\n");
return ERR_PTR(-EINVAL);
}
 
@@ -445,7 +450,7 @@ static void *bzImage64_load(struct kimage *image, char 
*kernel,
return ERR_PTR(ret);
}
 
-   pr_debug("Loaded purgatory at 0x%lx\n", pbuf.mem);
+   kexec_dprintk("Loaded purgatory at 0x%lx\n", pbuf.mem);
 
 
/*
@@ -490,8 +495,8 @@ static void *bzImage64_load(struct kimage *image, char 
*kernel,
if (ret)
goto out_free_params;
bootparam_load_addr = kbuf.mem;
-   pr_debug("Loaded boot_param, command line and misc at 0x%lx bufsz=0x%lx 
memsz=0x%lx\n",
-bootparam_load_addr, kbuf.bufsz, kbuf.bufsz);
+   kexec_dprintk("Loaded boot_param, command line and misc at 0x%lx 
bufsz=0x%lx memsz=0x%lx\n",
+ bootparam_load_addr, kbuf.bufsz, kbuf.bufsz);
 
/* Load kernel */
kbuf.buffer = kernel + kern16_size;
@@ -505,8 +510,8 @@ static void *bzImage64_load(struct kimage *image, char 
*kernel,
goto out_free_params;
kernel_load_addr = kbuf.mem;
 
-   pr_debug("Loaded 64bit kernel at 0x%lx bufsz=0x%lx memsz=0x%lx\n",
-kernel_load_addr, kbuf.bufsz, kbuf.memsz);
+   kexec_dprintk("Loaded 64bit kernel at 0x%lx bufsz=0x%lx memsz=0x%lx\n",
+ kernel_load_addr, kbuf.bufsz, kbuf.memsz);
 
/* Load initrd high */
if (initrd) {
@@ -520,8 +525,8 @@ static void *bzImage64_load(struct kimage *image, char 
*kernel,
goto out_free_params;
initrd_load_addr = kbuf.mem;
 
-   pr_debug("Loaded initrd at 0x%lx bufsz=0x%lx memsz=0x%lx\n",
-   initrd_load_addr, initrd_len, initrd_len);
+   kexec_dprintk("Loaded initrd at 0x%lx bufsz=0x%lx 
memsz=0x%lx\n",
+ initrd_load_addr, initrd_len, initrd_len);
 
setup_initrd(params, initrd_load_addr, initrd_len);
}
-- 
2.41.0



[PATCH v4 0/7] kexec_file: print out debugging message if required

2023-12-12 Thread Baoquan He
Currently, specifying '-d' on the kexec command prints a lot of debugging
information about kexec/kdump loading with the kexec_load interface.

However, kexec_file_load prints nothing even though '-d' is specified.
This makes it very inconvenient to debug or analyze the kexec/kdump
loading when something goes wrong with kexec/kdump itself, or when a
developer wants to check the kexec/kdump loading.

In this patchset, a kexec_file flag, KEXEC_FILE_DEBUG, is added and
checked in the code. If it is passed in, the debugging messages of the
kexec_file code are printed out and can be seen on the console and in
dmesg. Otherwise, the debugging messages are only emitted via pr_debug()
as before.

Note:

=
1) The code in the kexec-tools utility also needs to be changed to
support passing KEXEC_FILE_DEBUG to the kernel when 'kexec -s -d' is
specified. The patch link is here:
=
[PATCH] kexec_file: add kexec_file flag to support debug printing
http://lists.infradead.org/pipermail/kexec/2023-November/028505.html

2) s390 also has kexec_file code, but I am not sure what debugging
information is necessary there, so I leave that to the s390 developers.

Test:


Testing was done on x86_64 and arm64 in v1. For v4, it was tested on
x86_64 again, where the printed messages look like below:
--
kexec measurement buffer for the loaded kernel at 0x207fffe000.
Loaded purgatory at 0x207fff9000
Loaded boot_param, command line and misc at 0x207fff3000 bufsz=0x1180 
memsz=0x1180
Loaded 64bit kernel at 0x207c00 bufsz=0xc88200 memsz=0x3c4a000
Loaded initrd at 0x2079e79000 bufsz=0x2186280 memsz=0x2186280
Final command line is: 
root=/dev/mapper/fedora_intel--knightslanding--lb--02-root ro
rd.lvm.lv=fedora_intel-knightslanding-lb-02/root console=ttyS0,115200N81 
crashkernel=256M
E820 memmap:
-0009a3ff (1)
0009a400-0009 (2)
000e-000f (2)
0010-6ff83fff (1)
6ff84000-7ac50fff (2)
..
00207fff6150-00207fff615f (128)
00207fff6160-00207fff714f (1)
00207fff7150-00207fff715f (128)
00207fff7160-00207fff814f (1)
00207fff8150-00207fff815f (128)
00207fff8160-00207fff (1)
nr_segments = 5
segment[0]: buf=0x4e5ece74 bufsz=0x211 mem=0x207fffe000 memsz=0x1000
segment[1]: buf=0x9e871498 bufsz=0x4000 mem=0x207fff9000 memsz=0x5000
segment[2]: buf=0xd879f1fe bufsz=0x1180 mem=0x207fff3000 memsz=0x2000
segment[3]: buf=0x1101cd86 bufsz=0xc88200 mem=0x207c00 
memsz=0x3c4a000
segment[4]: buf=0xc6e38ac7 bufsz=0x2186280 mem=0x2079e79000 
memsz=0x2187000
kexec_file_load: type:0, start:0x207fff91a0 head:0x109e004002 flags:0x8
---

History:
=
v3->v4:
- Add an explanation of why kexec_dprintk() needs to be introduced to
  replace pr_debug() in the log of each arch patch, suggested by Conor.
- Mention why pr_notice() needs to be replaced with kexec_dprintk() in
  elf_kexec_load() on risc-v, suggested by Conor.
- Change messages that had been broken into two lines back to one line,
  pointed out by Joe.
v2->v3:
- Adjust the indentation of continuation lines to the open parenthesis
  for all kexec_dprintk() call sites. Thanks to Joe for pointing this out.
- Fix the LKP report that macro kexec_dprintk() is invalid when
  CONFIG_KEXEC=Y, CONFIG_KEXEC_FILE=n, CONFIG_CRASH_DUMP=y.

v1->v2:
- Take the new format of kexec_dprintk() suggested by Joe which can
  reduce kernel text size.
- Fix building error of patch 2 in kernel/crash_core.c reported by LKP.
- Fix building warning on arm64 in patch 4 reported by LKP.

Baoquan He (7):
  kexec_file: add kexec_file flag to control debug printing
  kexec_file: print out debugging message if required
  kexec_file, x86: print out debugging message if required
  kexec_file, arm64: print out debugging message if required
  kexec_file, riscv: print out debugging message if required
  kexec_file, power: print out debugging message if required
  kexec_file, parisc: print out debugging message if required

 arch/arm64/kernel/kexec_image.c|  6 +++---
 arch/arm64/kernel/machine_kexec.c  | 26 ++
 arch/arm64/kernel/machine_kexec_file.c | 12 ++--
 arch/parisc/kernel/kexec_file.c|  8 
 arch/powerpc/kexec/elf_64.c|  8 
 arch/powerpc/kexec/file_load_64.c  | 18 +-
 arch/riscv/kernel/elf_kexec.c  | 11 ++-
 arch/riscv/kernel/machine_kexec.c  | 26 --
 arch/x86/kernel/crash.c|  4 ++--
 arch/x86/kernel/kexec-bzimage64.c  | 23 ++-
 include/linux/kexec.h  |  9 -
 include/uapi/linux/kexec.h |  1 +
 kernel/crash_core.c|  8 +---
 kernel/kexec_core.c|  2 ++
 kernel/kexec_file.c| 14 

[PATCH v4 2/7] kexec_file: print out debugging message if required

2023-12-12 Thread Baoquan He
Then, when specifying '-d' for the kexec_file_load interface, the loaded
locations of kernel/initrd/cmdline etc. can be printed out to help
debugging.

Here, replace pr_debug() with the newly added kexec_dprintk() in the
kexec_file loading code.

Also print out the type/start/head of the kimage, and the flags, to
help debugging.

Signed-off-by: Baoquan He 
---
 kernel/crash_core.c|  8 +---
 kernel/kexec_file.c| 11 ---
 security/integrity/ima/ima_kexec.c |  4 ++--
 3 files changed, 15 insertions(+), 8 deletions(-)

diff --git a/kernel/crash_core.c b/kernel/crash_core.c
index efe87d501c8c..380d0d3acc7b 100644
--- a/kernel/crash_core.c
+++ b/kernel/crash_core.c
@@ -551,9 +551,11 @@ int crash_prepare_elf64_headers(struct crash_mem *mem, int 
need_kernel_map,
phdr->p_filesz = phdr->p_memsz = mend - mstart + 1;
phdr->p_align = 0;
ehdr->e_phnum++;
-   pr_debug("Crash PT_LOAD ELF header. phdr=%p vaddr=0x%llx, 
paddr=0x%llx, sz=0x%llx e_phnum=%d p_offset=0x%llx\n",
-   phdr, phdr->p_vaddr, phdr->p_paddr, phdr->p_filesz,
-   ehdr->e_phnum, phdr->p_offset);
+#ifdef CONFIG_KEXEC_FILE
+   kexec_dprintk("Crash PT_LOAD ELF header. phdr=%p vaddr=0x%llx, 
paddr=0x%llx, sz=0x%llx e_phnum=%d p_offset=0x%llx\n",
+ phdr, phdr->p_vaddr, phdr->p_paddr, 
phdr->p_filesz,
+ ehdr->e_phnum, phdr->p_offset);
+#endif
phdr++;
}
 
diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
index aca5dac74044..76de1ac7c424 100644
--- a/kernel/kexec_file.c
+++ b/kernel/kexec_file.c
@@ -204,6 +204,8 @@ kimage_file_prepare_segments(struct kimage *image, int 
kernel_fd, int initrd_fd,
if (ret < 0)
return ret;
image->kernel_buf_len = ret;
+   kexec_dprintk("kernel: %p kernel_size: %#lx\n",
+ image->kernel_buf, image->kernel_buf_len);
 
/* Call arch image probe handlers */
ret = arch_kexec_kernel_image_probe(image, image->kernel_buf,
@@ -387,13 +389,14 @@ SYSCALL_DEFINE5(kexec_file_load, int, kernel_fd, int, 
initrd_fd,
if (ret)
goto out;
 
+   kexec_dprintk("nr_segments = %lu\n", image->nr_segments);
for (i = 0; i < image->nr_segments; i++) {
struct kexec_segment *ksegment;
 
ksegment = >segment[i];
-   pr_debug("Loading segment %d: buf=0x%p bufsz=0x%zx mem=0x%lx 
memsz=0x%zx\n",
-i, ksegment->buf, ksegment->bufsz, ksegment->mem,
-ksegment->memsz);
+   kexec_dprintk("segment[%d]: buf=0x%p bufsz=0x%zx mem=0x%lx 
memsz=0x%zx\n",
+ i, ksegment->buf, ksegment->bufsz, ksegment->mem,
+ ksegment->memsz);
 
ret = kimage_load_segment(image, >segment[i]);
if (ret)
@@ -406,6 +409,8 @@ SYSCALL_DEFINE5(kexec_file_load, int, kernel_fd, int, 
initrd_fd,
if (ret)
goto out;
 
+   kexec_dprintk("kexec_file_load: type:%u, start:0x%lx head:0x%lx 
flags:0x%lx\n",
+ image->type, image->start, image->head, flags);
/*
 * Free up any temporary buffers allocated which are not needed
 * after image has been loaded
diff --git a/security/integrity/ima/ima_kexec.c 
b/security/integrity/ima/ima_kexec.c
index ad133fe120db..dadc1d138118 100644
--- a/security/integrity/ima/ima_kexec.c
+++ b/security/integrity/ima/ima_kexec.c
@@ -129,8 +129,8 @@ void ima_add_kexec_buffer(struct kimage *image)
image->ima_buffer_size = kexec_segment_size;
image->ima_buffer = kexec_buffer;
 
-   pr_debug("kexec measurement buffer for the loaded kernel at 0x%lx.\n",
-kbuf.mem);
+   kexec_dprintk("kexec measurement buffer for the loaded kernel at 0x%lx.\n",
+ kbuf.mem);
 }
 #endif /* IMA_KEXEC */
 
-- 
2.41.0


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


[PATCH v4 1/7] kexec_file: add kexec_file flag to control debug printing

2023-12-12 Thread Baoquan He
When specifying 'kexec -c -d', the kexec_load interface prints loading
information, e.g. the regions where kernel/initrd/purgatory/cmdline
are put, the memmap passed to the 2nd kernel as system RAM ranges,
and all contents of struct kexec_segment, etc. These are very
helpful for analyzing or locating what went wrong when kexec/kdump
itself fails. The debug printing for the kexec_load interface is done
in the user space utility kexec-tools.

With the kexec_file_load interface, however, 'kexec -s -d' prints
nothing, because the kexec_file code is mostly implemented in kernel
space and the debug printing functionality is missing. This is
inconvenient when debugging kexec/kdump loading and jumping with the
kexec_file_load interface.

Now add a KEXEC_FILE_DEBUG flag to the kexec_file flags to control the
debug message printing, and add the global variable kexec_file_dbg_print
and the macro kexec_dprintk() to facilitate the printing.

This is a preparation; later, kexec_dprintk() will be used to replace the
existing pr_debug() calls. Once 'kexec -s -d' is specified, it will print
out kexec/kdump loading information. If '-d' is not specified, it falls
back to pr_debug() behavior.

Signed-off-by: Baoquan He 
---
 include/linux/kexec.h  | 9 -
 include/uapi/linux/kexec.h | 1 +
 kernel/kexec_core.c| 2 ++
 kernel/kexec_file.c| 3 +++
 4 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index 8227455192b7..400cb6c02176 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -403,7 +403,7 @@ bool kexec_load_permitted(int kexec_image_type);
 
 /* List of defined/legal kexec file flags */
 #define KEXEC_FILE_FLAGS   (KEXEC_FILE_UNLOAD | KEXEC_FILE_ON_CRASH | \
-KEXEC_FILE_NO_INITRAMFS)
+KEXEC_FILE_NO_INITRAMFS | KEXEC_FILE_DEBUG)
 
 /* flag to track if kexec reboot is in progress */
 extern bool kexec_in_progress;
@@ -500,6 +500,13 @@ static inline int crash_hotplug_memory_support(void) { return 0; }
 static inline unsigned int crash_get_elfcorehdr_size(void) { return 0; }
 #endif
 
+extern bool kexec_file_dbg_print;
+
+#define kexec_dprintk(fmt, ...)\
+   printk("%s" fmt,\
+  kexec_file_dbg_print ? KERN_INFO : KERN_DEBUG,   \
+  ##__VA_ARGS__)
+
 #else /* !CONFIG_KEXEC_CORE */
 struct pt_regs;
 struct task_struct;
diff --git a/include/uapi/linux/kexec.h b/include/uapi/linux/kexec.h
index 01766dd839b0..c17bb096ea68 100644
--- a/include/uapi/linux/kexec.h
+++ b/include/uapi/linux/kexec.h
@@ -25,6 +25,7 @@
 #define KEXEC_FILE_UNLOAD  0x0001
 #define KEXEC_FILE_ON_CRASH0x0002
 #define KEXEC_FILE_NO_INITRAMFS0x0004
+#define KEXEC_FILE_DEBUG   0x0008
 
 /* These values match the ELF architecture values.
  * Unless there is a good reason that should continue to be the case.
diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
index be5642a4ec49..f70bf3a7885c 100644
--- a/kernel/kexec_core.c
+++ b/kernel/kexec_core.c
@@ -52,6 +52,8 @@ atomic_t __kexec_lock = ATOMIC_INIT(0);
 /* Flag to indicate we are going to kexec a new kernel */
 bool kexec_in_progress = false;
 
+bool kexec_file_dbg_print;
+
 int kexec_should_crash(struct task_struct *p)
 {
/*
diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
index f9a419cd22d4..aca5dac74044 100644
--- a/kernel/kexec_file.c
+++ b/kernel/kexec_file.c
@@ -123,6 +123,8 @@ void kimage_file_post_load_cleanup(struct kimage *image)
 */
kfree(image->image_loader_data);
image->image_loader_data = NULL;
+
+   kexec_file_dbg_print = false;
 }
 
 #ifdef CONFIG_KEXEC_SIG
@@ -278,6 +280,7 @@ kimage_file_alloc_init(struct kimage **rimage, int kernel_fd,
if (!image)
return -ENOMEM;
 
+   kexec_file_dbg_print = !!(flags & KEXEC_FILE_DEBUG);
image->file_mode = 1;
 
if (kexec_on_panic) {
-- 
2.41.0




Re: [PATCH] kexec: avoid out of bounds in crash_exclude_mem_range()

2023-12-12 Thread Baoquan He
On 11/30/23 at 09:20pm, fuqiang wang wrote:
> 
> On 2023/11/30 15:44, Baoquan He wrote:
> > On 11/27/23 at 10:56am, fuqiang wang wrote:
> > > When the split happened, judge whether mem->nr_ranges is equal to
> > > mem->max_nr_ranges. If it is true, return -ENOMEM.
> > > 
> > > The advantage of doing this is that it can avoid array bounds caused by
> > > some bugs. E.g., Before commit 4831be702b95 ("arm64/kexec: Fix missing
> > > extra range for crashkres_low."), reserve both high and low memories for
> > > the crashkernel may cause out of bounds.
> > > 
> > > On the other hand, move this code before the split to ensure that the
> > > array will not be changed when return error.
> > If going out of the array boundary is possible, it means the loading
> > failed, whether the overflow actually happened or not. I don't see how
> > this code change makes sense. Am I missing anything?
> > 
> > Thanks
> > Baoquan
> > 
> Hi baoquan,
> 
> In some configurations, an out-of-bounds write may not cause
> crash_exclude_mem_range() to return an error, so the load will succeed.
> 
> E.g.
> There is a cmem before execute crash_exclude_mem_range():
> 
>   cmem = {
>     max_nr_ranges = 3
>     nr_ranges = 2
>     ranges = {
>    {start = 1,  end = 1000}
>    {start = 1001,    end = 2000}
>     }
>   }
> 
> After executing twice crash_exclude_mem_range() with the start/end params
> 100/200, 300/400 respectively, the cmem will be:
> 
>   cmem = {
>     max_nr_ranges = 3
>     nr_ranges = 4    <== nr_ranges > max_nr_ranges
>     ranges = {
>   {start = 1,   end = 99  }
>   {start = 201, end = 299 }
>   {start = 401, end = 1000}
>   {start = 1001,    end = 2000}  <== OUT OF BOUNDS
>     }
>   }
> 
> When an out-of-bounds write occurs during the second execution, the
> function will not return an error.
> 
> Additionally, when the function returns an error, it means the load
> failed, so it may seem meaningless to keep the original data unchanged.
> But in my opinion, this makes the function more rigorous and more
> versatile. (However, I am not sure if it is self-defeating, and I hope
> to receive more suggestions.)

Sorry for late reply.

I checked the code again; there seem to be cases where out-of-bounds
writes can very possibly occur. We may need to enlarge the cmem array to
avoid the risk.

In the draft code below, we need to add another slot to exclude the low 1M
area when preparing the elfcorehdr. And to exclude the elf header region
from the crash kernel region, we need to create the cmem with 2 slots.

With these changes, we can completely avoid out-of-bounds occurrences.
What do you think?

diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
index 1715e5f06a59..21facabcf699 100644
--- a/arch/x86/kernel/crash.c
+++ b/arch/x86/kernel/crash.c
@@ -147,10 +147,10 @@ static struct crash_mem *fill_up_crash_elf_data(void)
return NULL;
 
/*
-* Exclusion of crash region and/or crashk_low_res may cause
-* another range split. So add extra two slots here.
+* Exclusion of low 1M, crash region and/or crashk_low_res may
+* cause another range split. So add extra three slots here.
 */
-   nr_ranges += 2;
+   nr_ranges += 3;
cmem = vzalloc(struct_size(cmem, ranges, nr_ranges));
if (!cmem)
return NULL;
@@ -282,7 +282,7 @@ int crash_setup_memmap_entries(struct kimage *image, struct boot_params *params)
struct crash_memmap_data cmd;
struct crash_mem *cmem;
 
-   cmem = vzalloc(struct_size(cmem, ranges, 1));
+   cmem = vzalloc(struct_size(cmem, ranges, 2));
if (!cmem)
return -ENOMEM;
 




Re: [PATCH v3 5/7] kexec_file, ricv: print out debugging message if required

2023-12-12 Thread Baoquan He
On 12/07/23 at 07:22am, Baoquan He wrote:
> On 12/06/23 at 04:54pm, Conor Dooley wrote:
> > On Wed, Dec 06, 2023 at 11:37:52PM +0800, Baoquan He wrote:
> > > On 12/04/23 at 04:14pm, Conor Dooley wrote:
> > > > On Mon, Dec 04, 2023 at 11:38:05PM +0800, Baoquan He wrote:
> > > > > On 12/01/23 at 10:38am, Conor Dooley wrote:
> > > > > > On Thu, Nov 30, 2023 at 10:39:53AM +0800, Baoquan He wrote:
> > > > > > 
> > > > > > $subject has a typo in the arch bit :)
> > > > > 
> > > > > Indeed, will fix if need report. Thanks for careful checking.
> > > > > 
> > > > > > 
> > > > > > > Replace pr_debug() with the newly added kexec_dprintk() in 
> > > > > > > kexec_file
> > > > > > > loading related codes.
> > > > > > 
> > > > > > Commit messages should be understandable in isolation, but this only
> > > > > > explains (part of) what is obvious in the diff. Why is this change
> > > > > > being made?
> > > > > 
> > > > > The purpose has been described in detail in the cover letter and
> > > > > patch 1 log. Andrew has picked these patches into his tree and
> > > > > folded the cover letter log into the relevant commit for people's
> > > > > later reference. All seven patches will be present in mainline
> > > > > together. This is the common way when posting a patch series?
> > > > > Please let me know if I misunderstand anything.
> > > > 
> > > > Each patch having a commit message that explains why a change is being
> > > > made is the expectation. It is especially useful to explain the why
> > > > here, since it is not just a mechanical conversion of pr_debug()s as the
> > > > commit message suggests.
> > > 
> > > Sounds reasonable. I rephrase the patch 3 log as below, do you think
> > > it's OK to you?
> > 
> > Yes, but with one comment.
> > 
> > > 
> > > I will also adjust patch logs on other ARCH once this one is done.
> > > Thanks.
> > > 
> > > =
Subject: [PATCH v3 5/7] kexec_file, ricv: print out debugging message if required
> > > 
> > > Then, when specifying '-d' for the kexec_file_load interface, the loaded
> > > locations of kernel/initrd/cmdline etc. can be printed out to help debugging.
> > > 
> > > Here replace pr_debug() with the newly added kexec_dprintk() in kexec_file
> > > loading related codes.
> > > 
> > 
> > > And also replace pr_notice() with kexec_dprintk() in elf_kexec_load()
> > > because it's make sense to always print out loaded location of purgatory
>   ~
> > > and device tree even though users don't expect the message.
> 
> Fixed typo:
> ==
> 
> And also replace pr_notice() with kexec_dprintk() in elf_kexec_load()
> because it doesn't make sense to always print out loaded location of
> purgatory and device tree even though users don't expect the message.

I will post v4 to include these suggested changes, please add comments
if there's any concern. Thanks for reviewing.

> 
> > 
> > This seems to contradict what you said in your earlier mail, about
> > moving these from notice to debug. I think you missed a negation in your
> > new version of the commit message. What you said in response to me seems
> > like a more complete explanation anyway:
> 
> Ah, I made a mistake when typing; this printing is only for debugging,
> so always printing it out is not suggested.
> 
> > always printing out the loaded location of purgatory and
> > device tree doesn't make sense. It will be confusing when users
> > see these even when they do normal kexec/kdump loading.
> > 
> > Thanks,
> > Conor.
> > 
> > > And also remove kexec_image_info() because the content has been printed
> > > out in generic code.
> > > 
> > > 
> > > 
> > > > 
> > > > > > 
> > > > > > > 
> > > > > > > And also remove kexec_image_info() because the content has been 
> > > > > > > printed
> > > > > > > out in generic code.
> > > > > > > 
> > > > > > > Signed-off-by: Baoquan He 
> > > > > > > ---
> > > > > > >  arch/riscv/kernel/elf_kexec.c | 11 ++-
> > > > > > >  arch/riscv/kernel/machine_kexec.c | 26 --
> > > > > > >  2 files changed, 6 insertions(+), 31 deletions(-)
> > > > > > > 
> > > > > > > diff --git a/arch/riscv/kernel/elf_kexec.c b/arch/riscv/kernel/elf_kexec.c
> > > > > > > index e60fbd8660c4..5bd1ec3341fe 100644
> > > > > > > --- a/arch/riscv/kernel/elf_kexec.c
> > > > > > > +++ b/arch/riscv/kernel/elf_kexec.c
> > > > > > > @@ -216,7 +216,6 @@ static void *elf_kexec_load(struct kimage *image, char *kernel_buf,
> > > > > > >   if (ret)
> > > > > > >   goto out;
> > > > > > >   kernel_start = image->start;
> > > > > > > - pr_notice("The entry point of kernel at 0x%lx\n", image->start);
> > > > > > >  
> > > > > > >   /* Add the kernel binary to the image */
> > > > > > >   ret = riscv_kexec_elf_load(image, , _info,
> > > > > > > @@ -252,8 +251,8 @@ static void *elf_kexec_load(struct kimage *image, char *kernel_buf,
> > > > > > >

Re: [PATCH] kexec: Use ALIGN macro instead of open-coding it

2023-12-12 Thread Baoquan He
On 12/12/23 at 10:27pm, Yuntao Wang wrote:
> Use ALIGN macro instead of open-coding it to improve code readability.
> 
> Signed-off-by: Yuntao Wang 
> ---
>  kernel/kexec_core.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)

LGTM,

Acked-by: Baoquan He 

> 
> diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
> index be5642a4ec49..0113436e4a3a 100644
> --- a/kernel/kexec_core.c
> +++ b/kernel/kexec_core.c
> @@ -430,7 +430,7 @@ static struct page *kimage_alloc_crash_control_pages(struct kimage *image,
>  
>   pages = NULL;
>   size = (1 << order) << PAGE_SHIFT;
> - hole_start = (image->control_page + (size - 1)) & ~(size - 1);
> + hole_start = ALIGN(image->control_page, size);
>   hole_end   = hole_start + size - 1;
>   while (hole_end <= crashk_res.end) {
>   unsigned long i;
> @@ -447,7 +447,7 @@ static struct page *kimage_alloc_crash_control_pages(struct kimage *image,
>   mend   = mstart + image->segment[i].memsz - 1;
>   if ((hole_end >= mstart) && (hole_start <= mend)) {
>   /* Advance the hole to the end of the segment */
> - hole_start = (mend + (size - 1)) & ~(size - 1);
> + hole_start = ALIGN(mend, size);
>   hole_end   = hole_start + size - 1;
>   break;
>   }
> -- 
> 2.43.0
> 
> 
> 




Re: [PATCH 09/15] tracing: Introduce names for events

2023-12-12 Thread Steven Rostedt
On Wed, 13 Dec 2023 00:04:46 +
Alexander Graf  wrote:

> With KHO (Kexec HandOver), we want to preserve trace buffers. To parse
> them, we need to ensure that all trace events that exist in the logs are
> identical to the ones we parse as. That means we need to match the
> events before and after kexec.
> 
> As a first step towards that, let's give every event a unique name. That
> way we can clearly identify the event before and after kexec and restore
> its ID post-kexec.
> 
> Signed-off-by: Alexander Graf 
> ---
>  include/linux/trace_events.h |  1 +
>  include/trace/trace_events.h |  2 ++
>  kernel/trace/blktrace.c  |  1 +
>  kernel/trace/trace_branch.c  |  1 +
>  kernel/trace/trace_events.c  |  3 +++
>  kernel/trace/trace_functions_graph.c |  4 +++-
>  kernel/trace/trace_output.c  | 13 +
>  kernel/trace/trace_probe.c   |  3 +++
>  kernel/trace/trace_syscalls.c| 29 
>  9 files changed, 56 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
> index d68ff9b1247f..7670224aa92d 100644
> --- a/include/linux/trace_events.h
> +++ b/include/linux/trace_events.h
> @@ -149,6 +149,7 @@ struct trace_event {
>   struct hlist_node   node;
>   int type;
>   struct trace_event_functions*funcs;
> + const char  *name;
>  };

OK, this is a hard no. We definitely need to find a different way to do
this. I'm trying hard to lower the footprint of tracing, and this just
added 8 bytes to every event on a 64 bit machine.

On my box I have 1953 events, and they are constantly growing. This just
added 15,624 bytes of tracing overhead to that machine.

That may not sound like much, but as this is only for this feature, it just
added 15K to the overhead for the majority of users.

I'm not sure how easy it is to make this a config option that takes away
that field when not set. But I would need that at a minimum.

-- Steve




Re: [PATCH 08/15] tracing: Introduce names for ring buffers

2023-12-12 Thread Steven Rostedt
On Wed, 13 Dec 2023 01:35:16 +0100
Alexander Graf  wrote:

> > The trace_array is the structure that represents each tracing instance. And
> > it already has a name field. And if you can get the associated ring buffer
> > from that too.
> >
> > struct trace_array *tr;
> >
> >  tr->array_buffer.buffer
> >
> >  tr->name
> >
> > When you do: mkdir /sys/kernel/tracing/instance/foo
> >
> > You create a new trace_array instance where tr->name = "foo" and allocates
> > the buffer for it as well.  
> 
> The name in the ring buffer is pretty much just a copy of the trace 
> array name. I use it to reconstruct which buffer we're actually 
> referring to inside __ring_buffer_alloc().

No, I rather not tie the ring buffer to the trace_array.

> 
> I'm all ears for alternative suggestions. I suppose we could pass tr as 
> argument to ring_buffer_alloc() instead of the name?

I'll have to spend some time (that I don't currently have :-( ) on looking
at this more. I really don't like the copying of the name into the ring
buffer allocation, as it may be an unneeded burden to maintain, not to
mention the duplicate field.

-- Steve



Re: [PATCH 08/15] tracing: Introduce names for ring buffers

2023-12-12 Thread Alexander Graf

Hi Steve,

On 13.12.23 01:15, Steven Rostedt wrote:
> On Wed, 13 Dec 2023 00:04:45 +
> Alexander Graf  wrote:
>
>> With KHO (Kexec HandOver), we want to preserve trace buffers across
>> kexec. To carry over their state between kernels, the kernel needs a
>> common handle for them that exists on both sides. As handle we introduce
>> names for ring buffers. In a follow-up patch, the kernel can then use
>> these names to recover buffer contents for specific ring buffers.
>
> Is there a way to use the trace_array name instead?
>
> The trace_array is the structure that represents each tracing instance. And
> it already has a name field. And if you can get the associated ring buffer
> from that too.
>
> struct trace_array *tr;
>
>  tr->array_buffer.buffer
>
>  tr->name
>
> When you do: mkdir /sys/kernel/tracing/instance/foo
>
> You create a new trace_array instance where tr->name = "foo" and allocates
> the buffer for it as well.

The name in the ring buffer is pretty much just a copy of the trace
array name. I use it to reconstruct which buffer we're actually
referring to inside __ring_buffer_alloc().

I'm all ears for alternative suggestions. I suppose we could pass tr as
argument to ring_buffer_alloc() instead of the name?



Alex




Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879




Re: [PATCH 08/15] tracing: Introduce names for ring buffers

2023-12-12 Thread Steven Rostedt
On Wed, 13 Dec 2023 00:04:45 +
Alexander Graf  wrote:

> With KHO (Kexec HandOver), we want to preserve trace buffers across
> kexec. To carry over their state between kernels, the kernel needs a
> common handle for them that exists on both sides. As handle we introduce
> names for ring buffers. In a follow-up patch, the kernel can then use
> these names to recover buffer contents for specific ring buffers.
> 

Is there a way to use the trace_array name instead?

The trace_array is the structure that represents each tracing instance. And
it already has a name field. And if you can get the associated ring buffer
from that too.

struct trace_array *tr;

tr->array_buffer.buffer

tr->name

When you do: mkdir /sys/kernel/tracing/instance/foo

You create a new trace_array instance where tr->name = "foo" and allocates
the buffer for it as well.

-- Steve



[PATCH 15/15] tracing: Add config option for kexec handover

2023-12-12 Thread Alexander Graf
Now that all bits are in place to allow ftrace to pass its trace data
into the next kernel on kexec, let's give users a kconfig option to
enable the functionality.

Signed-off-by: Alexander Graf 
---
 kernel/trace/Kconfig | 13 +
 1 file changed, 13 insertions(+)

diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index 61c541c36596..af83ee755b9e 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -1169,6 +1169,19 @@ config HIST_TRIGGERS_DEBUG
 
   If unsure, say N.
 
+config FTRACE_KHO
+   bool "Ftrace Kexec handover support"
+   depends on KEXEC_KHO
+   help
+  Enable support for ftrace to pass metadata across kexec so the new
+ kernel continues to use the previous kernel's trace buffers.
+
+ This can be useful when debugging kexec performance or correctness
+ issues: The new kernel can dump the old kernel's trace buffer which
+ contains all events until reboot.
+
+ If unsure, say N.
+
 source "kernel/trace/rv/Kconfig"
 
 endif # FTRACE
-- 
2.40.1










[PATCH 13/15] tracing: Add kho serialization of trace events

2023-12-12 Thread Alexander Graf
Events, and thus their parsing handles, in ftrace have dynamic IDs that
are assigned whenever the event is added to the system. If we want to
parse trace events after kexec, we need to link event IDs back to the
original trace event that existed before we kexec'ed.

There are broadly 2 paths we could take for that:

  1) Save the full event description across KHO, restore it after kexec,
     merge identical trace events into a single identifier.
  2) Recover the IDs of post-kexec added events so they get the same
     ID after kexec that they had before kexec.

This patch implements the second option. It's simpler and thus less
intrusive. However, it means we cannot fully parse affected events
when the kernel removes or modifies trace events across a KHO kexec.

Signed-off-by: Alexander Graf 
---
 kernel/trace/trace.c|  1 +
 kernel/trace/trace_output.c | 28 
 kernel/trace/trace_output.h |  1 +
 3 files changed, 30 insertions(+)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 71c249cc5b43..26edfd2a85fd 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -10639,6 +10639,7 @@ static int trace_kho_notifier(struct notifier_block *self,
err |= fdt_begin_node(fdt, "ftrace");
err |= fdt_property(fdt, "compatible", compatible, sizeof(compatible));
err |= trace_kho_write_trace_array(fdt, &global_trace);
+   err |= trace_kho_write_events(fdt);
err |= fdt_end_node(fdt);
 
if (!err) {
diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
index f3677e0da795..113de40c616f 100644
--- a/kernel/trace/trace_output.c
+++ b/kernel/trace/trace_output.c
@@ -12,6 +12,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "trace_output.h"
 
@@ -669,6 +670,33 @@ int trace_print_lat_context(struct trace_iterator *iter)
return !trace_seq_has_overflowed(s);
 }
 
+int trace_kho_write_events(void *fdt)
+{
+#ifdef CONFIG_FTRACE_KHO
+   const char compatible[] = "ftrace,events-v1";
+   const char *name = "events";
+   struct trace_event *event;
+   unsigned key;
+   int err = 0;
+
+   err |= fdt_begin_node(fdt, name);
+   err |= fdt_property(fdt, "compatible", compatible, sizeof(compatible));
+
+   for (key = 0; key < EVENT_HASHSIZE; key++) {
hlist_for_each_entry(event, &event_hash[key], node)
err |= fdt_property(fdt, event->name, &event->type,
sizeof(event->type));
+   }
+
+   err |= fdt_end_node(fdt);
+
+   return err;
+#else
+   return 0;
+#endif
+}
+
+
 /**
  * ftrace_find_event - find a registered event
  * @type: the type of event to look for
diff --git a/kernel/trace/trace_output.h b/kernel/trace/trace_output.h
index dca40f1f1da4..36dc7963269e 100644
--- a/kernel/trace/trace_output.h
+++ b/kernel/trace/trace_output.h
@@ -25,6 +25,7 @@ extern enum print_line_t print_event_fields(struct trace_iterator *iter,
 extern void trace_event_read_lock(void);
 extern void trace_event_read_unlock(void);
 extern struct trace_event *ftrace_find_event(int type);
+extern int trace_kho_write_events(void *fdt);
 
 extern enum print_line_t trace_nop_print(struct trace_iterator *iter,
 int flags, struct trace_event *event);
-- 
2.40.1










[PATCH 14/15] tracing: Recover trace events from kexec handover

2023-12-12 Thread Alexander Graf
This patch implements all logic necessary to match a new trace event
that we add against preserved trace events from kho. If we find a match,
we give the new trace event the old event's identifier. That way, trace
read-outs are able to make sense of buffer contents again because the
parsing code for events looks at the same identifiers.

Signed-off-by: Alexander Graf 
---
 kernel/trace/trace_output.c | 65 -
 1 file changed, 64 insertions(+), 1 deletion(-)

diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
index 113de40c616f..d2e2a6346322 100644
--- a/kernel/trace/trace_output.c
+++ b/kernel/trace/trace_output.c
@@ -749,6 +749,67 @@ void trace_event_read_unlock(void)
up_read(&trace_event_sem);
 }
 
+/**
+ * trace_kho_fill_event_type - restore event type info from KHO
+ * @event: the event type to enumerate
+ *
+ * Event types are semi-dynamically generated. To ensure that
+ * their identifiers match before and after kexec with KHO,
+ * let's match up unique name identifiers and fill in the
+ * respective ID information if we booted with KHO.
+ */
+static bool trace_kho_fill_event_type(struct trace_event *event)
+{
+#ifdef CONFIG_FTRACE_KHO
+   const char *path = "/ftrace/events";
+   void *fdt = kho_get_fdt();
+   int err, len, off, id;
+   const void *p;
+
+   if (!fdt)
+   return false;
+
+   if (WARN_ON(!event->name))
+   return false;
+
+   pr_debug("Trying to revive event '%s'", event->name);
+
+   off = fdt_path_offset(fdt, path);
+   if (off < 0) {
+   pr_debug("Could not find '%s' in DT", path);
+   return false;
+   }
+
+   err = fdt_node_check_compatible(fdt, off, "ftrace,events-v1");
+   if (err) {
+   pr_warn("Node '%s' has invalid compatible", path);
+   return false;
+   }
+
+   p = fdt_getprop(fdt, off, event->name, &len);
+   if (!p) {
+   pr_warn("Event '%s' not found", event->name);
+   return false;
+   }
+
+   if (len != sizeof(event->type)) {
+   pr_warn("Event '%s' has invalid length", event->name);
+   return false;
+   }
+
+   id = *(const u32 *)p;
+
+   /* Mark ID as in use */
+   if (ida_alloc_range(&trace_event_ida, id, id, GFP_KERNEL) != id)
+   return false;
+
+   event->type = id;
+   return true;
+#endif
+
+   return false;
+}
+
 /**
  * register_trace_event - register output for an event type
  * @event: the event type to register
@@ -777,7 +838,9 @@ int register_trace_event(struct trace_event *event)
if (WARN_ON(!event->funcs))
goto out;
 
-   if (!event->type) {
+   if (trace_kho_fill_event_type(event)) {
+   pr_debug("Recovered '%s' as id=%d", event->name, event->type);
+   } else if (!event->type) {
event->type = alloc_trace_event_type();
if (!event->type)
goto out;
-- 
2.40.1










[PATCH 12/15] tracing: Recover trace buffers from kexec handover

2023-12-12 Thread Alexander Graf
When kexec handover is in place, we now know the location of all
previous buffers for ftrace rings. With this patch applied, ftrace
reassembles any new trace buffer that carries the same name as a
previous one with the same data pages that the previous buffer had.

That way, a buffer that we had in place before kexec becomes readable
after kexec again as soon as it gets initialized with the same name.

Signed-off-by: Alexander Graf 
---
 kernel/trace/ring_buffer.c | 173 -
 1 file changed, 171 insertions(+), 2 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 691d1236eeb1..f3d07cb90762 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -575,6 +575,28 @@ struct ring_buffer_iter {
int missed_events;
 };
 
+struct trace_kho_cpu {
+   const struct kho_mem *mem;
+   uint32_t nr_mems;
+};
+
+#ifdef CONFIG_FTRACE_KHO
+static int trace_kho_replace_buffers(struct ring_buffer_per_cpu *cpu_buffer,
+struct trace_kho_cpu *kho);
+static int trace_kho_read_cpu(const char *name, int cpu, struct trace_kho_cpu *kho);
+#else
+static int trace_kho_replace_buffers(struct ring_buffer_per_cpu *cpu_buffer,
+struct trace_kho_cpu *kho)
+{
+   return -EINVAL;
+}
+
+static int trace_kho_read_cpu(const char *name, int cpu, struct trace_kho_cpu *kho)
+{
+   return -EINVAL;
+}
+#endif
+
 #ifdef RB_TIME_32
 
 /*
@@ -1807,10 +1829,12 @@ struct trace_buffer *__ring_buffer_alloc(const char *name,
unsigned long size, unsigned flags,
struct lock_class_key *key)
 {
+   int cpu = raw_smp_processor_id();
+   struct trace_kho_cpu kho = {};
struct trace_buffer *buffer;
+   bool use_kho = false;
long nr_pages;
int bsize;
-   int cpu;
int ret;
 
/* keep it in its own cache line */
@@ -1823,6 +1847,12 @@ struct trace_buffer *__ring_buffer_alloc(const char *name,
goto fail_free_buffer;
 
nr_pages = DIV_ROUND_UP(size, BUF_PAGE_SIZE);
+   if (!trace_kho_read_cpu(name, cpu, &kho) && kho.nr_mems > 4) {
+   nr_pages = kho.nr_mems / 2;
+   use_kho = true;
+   pr_debug("Using kho for buffer '%s' on CPU [%03d]", name, cpu);
+   }
+
buffer->flags = flags;
buffer->clock = trace_clock_local;
buffer->reader_lock_key = key;
@@ -1843,12 +1873,14 @@ struct trace_buffer *__ring_buffer_alloc(const char *name,
if (!buffer->buffers)
goto fail_free_cpumask;
 
-   cpu = raw_smp_processor_id();
cpumask_set_cpu(cpu, buffer->cpumask);
buffer->buffers[cpu] = rb_allocate_cpu_buffer(buffer, nr_pages, cpu);
if (!buffer->buffers[cpu])
goto fail_free_buffers;
 
+   if (use_kho && trace_kho_replace_buffers(buffer->buffers[cpu], &kho))
+   pr_warn("Could not revive all previous trace data");
+
ret = cpuhp_state_add_instance(CPUHP_TRACE_RB_PREPARE, &buffer->node);
if (ret < 0)
goto fail_free_buffers;
@@ -5886,7 +5918,9 @@ EXPORT_SYMBOL_GPL(ring_buffer_read_page);
  */
 int trace_rb_cpu_prepare(unsigned int cpu, struct hlist_node *node)
 {
+   struct trace_kho_cpu kho = {};
struct trace_buffer *buffer;
+   bool use_kho = false;
long nr_pages_same;
int cpu_i;
unsigned long nr_pages;
@@ -5910,6 +5944,12 @@ int trace_rb_cpu_prepare(unsigned int cpu, struct hlist_node *node)
/* allocate minimum pages, user can later expand it */
if (!nr_pages_same)
nr_pages = 2;
+
+   if (!trace_kho_read_cpu(buffer->name, cpu, &kho) && kho.nr_mems > 4) {
+   nr_pages = kho.nr_mems / 2;
+   use_kho = true;
+   }
+
buffer->buffers[cpu] =
rb_allocate_cpu_buffer(buffer, nr_pages, cpu);
if (!buffer->buffers[cpu]) {
@@ -5917,12 +5957,141 @@ int trace_rb_cpu_prepare(unsigned int cpu, struct hlist_node *node)
 cpu);
return -ENOMEM;
}
+
+   if (use_kho && trace_kho_replace_buffers(buffer->buffers[cpu], &kho))
+   pr_warn("Could not revive all previous trace data");
+
smp_wmb();
cpumask_set_cpu(cpu, buffer->cpumask);
return 0;
 }
 
 #ifdef CONFIG_FTRACE_KHO
+static int trace_kho_replace_buffers(struct ring_buffer_per_cpu *cpu_buffer,
+struct trace_kho_cpu *kho)
+{
+   bool first_loop = true;
+   struct list_head *tmp;
+   int err = 0;
+   int i = 0;
+
+   if (kho->nr_mems != cpu_buffer->nr_pages * 2)
+   return -EINVAL;
+
+   for (tmp = rb_list_head(cpu_buffer->pages);
+tmp != rb_list_head(cpu_buffer->pages) || first_loop;
+tmp = rb_list_head(tmp->next), first_loop 

[PATCH 11/15] tracing: Add kho serialization of trace buffers

2023-12-12 Thread Alexander Graf
When we do a kexec handover, we want to preserve the previous kernel's
ftrace data in the new kernel. At the point when we write out the
handover data, ftrace may still be running and recording new events,
and we want to capture all of those too.

To allow the new kernel to revive all trace data up to reboot, we store
all locations of trace buffers as well as their linked list metadata. We
can then later reuse the linked list to reconstruct the head pointer.

This patch implements the write-out logic for trace buffers.

Signed-off-by: Alexander Graf 
---
 include/linux/ring_buffer.h |  2 +
 kernel/trace/ring_buffer.c  | 89 +
 kernel/trace/trace.c| 16 +++
 3 files changed, 107 insertions(+)

diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
index f34538f97c75..049565677ef8 100644
--- a/include/linux/ring_buffer.h
+++ b/include/linux/ring_buffer.h
@@ -212,4 +212,6 @@ int trace_rb_cpu_prepare(unsigned int cpu, struct 
hlist_node *node);
 #define trace_rb_cpu_prepare   NULL
 #endif
 
+int trace_kho_write_trace_buffer(void *fdt, struct trace_buffer *buffer);
+
 #endif /* _LINUX_RING_BUFFER_H */
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index eaaf823ddedb..691d1236eeb1 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -20,6 +20,7 @@
 #include 
 #include 
 #include 
+#include <linux/kexec.h>
 #include 
 #include 
 #include 
@@ -5921,6 +5922,94 @@ int trace_rb_cpu_prepare(unsigned int cpu, struct 
hlist_node *node)
return 0;
 }
 
+#ifdef CONFIG_FTRACE_KHO
+static int trace_kho_write_cpu(void *fdt, struct trace_buffer *buffer, int cpu)
+{
+   int i = 0;
+   int err = 0;
+   struct list_head *tmp;
+   const char compatible[] = "ftrace,cpu-v1";
+   char name[] = "cpu";
+   int nr_pages;
+   struct ring_buffer_per_cpu *cpu_buffer;
+   bool first_loop = true;
+   struct kho_mem *mem;
+   uint64_t mem_len;
+
+   if (!cpumask_test_cpu(cpu, buffer->cpumask))
+   return 0;
+
+   cpu_buffer = buffer->buffers[cpu];
+
+   nr_pages = cpu_buffer->nr_pages;
+   mem_len = sizeof(*mem) * nr_pages * 2;
+   mem = vmalloc(mem_len);
+
+   snprintf(name, sizeof(name), "cpu%x", cpu);
+
+   err |= fdt_begin_node(fdt, name);
+   err |= fdt_property(fdt, "compatible", compatible, sizeof(compatible));
+   err |= fdt_property(fdt, "cpu", &cpu, sizeof(cpu));
+
+   for (tmp = rb_list_head(cpu_buffer->pages);
+tmp != rb_list_head(cpu_buffer->pages) || first_loop;
+tmp = rb_list_head(tmp->next), first_loop = false) {
+   struct buffer_page *bpage = (struct buffer_page *)tmp;
+
+   /* Ring is larger than it should be? */
+   if (i >= (nr_pages * 2)) {
+   pr_err("ftrace ring has more pages than nr_pages (%d / %d)", i, nr_pages);
+   err = -EINVAL;
+   break;
+   }
+
+   /* First describe the bpage */
+   mem[i++] = (struct kho_mem) {
+   .addr = __pa(bpage),
+   .len = sizeof(*bpage)
+   };
+
+   /* Then the data page */
+   mem[i++] = (struct kho_mem) {
+   .addr = __pa(bpage->page),
+   .len = PAGE_SIZE
+   };
+   }
+
+   err |= fdt_property(fdt, "mem", mem, mem_len);
+   err |= fdt_end_node(fdt);
+
+   vfree(mem);
+   return err;
+}
+
+int trace_kho_write_trace_buffer(void *fdt, struct trace_buffer *buffer)
+{
+   const char compatible[] = "ftrace,buffer-v1";
+   char name[] = "buffer";
+   int err;
+   int i;
+
+   err = fdt_begin_node(fdt, name);
+   if (err)
+   return err;
+
+   fdt_property(fdt, "compatible", compatible, sizeof(compatible));
+
+   for (i = 0; i < buffer->cpus; i++) {
+   err = trace_kho_write_cpu(fdt, buffer, i);
+   if (err)
+   return err;
+   }
+
+   err =  fdt_end_node(fdt);
+   if (err)
+   return err;
+
+   return 0;
+}
+#endif
+
 #ifdef CONFIG_RING_BUFFER_STARTUP_TEST
 /*
  * This is a basic integrity check of the ring buffer.
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 3e7f61cf773e..71c249cc5b43 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -10597,6 +10597,21 @@ void __init early_trace_init(void)
 }
 
 #ifdef CONFIG_FTRACE_KHO
+static int trace_kho_write_trace_array(void *fdt, struct trace_array *tr)
+{
+   const char *name = tr->name ? tr->name : "global_trace";
+   const char compatible[] = "ftrace,array-v1";
+   int err = 0;
+
+   err |= fdt_begin_node(fdt, name);
+   err |= fdt_property(fdt, "compatible", compatible, sizeof(compatible));
+   err |= fdt_property(fdt, "trace_flags", &tr->trace_flags, sizeof(tr->trace_flags));
+   err |= 

[PATCH 10/15] tracing: Introduce kho serialization

2023-12-12 Thread Alexander Graf
We want to be able to transfer ftrace state from one kernel to the next.
To start off with, let's establish all the boilerplate to get a write
hook when KHO wants to serialize, and fill out basic data.

Follow-up patches will fill in serialization of ring buffers and events.

Signed-off-by: Alexander Graf 
---
 kernel/trace/trace.c | 52 
 1 file changed, 52 insertions(+)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 7700ca1be2a5..3e7f61cf773e 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -32,6 +32,7 @@
 #include 
 #include 
 #include 
+#include <linux/kexec.h>
 #include 
 #include 
 #include 
@@ -866,6 +867,10 @@ static struct tracer   *trace_types 
__read_mostly;
  */
 DEFINE_MUTEX(trace_types_lock);
 
+#ifdef CONFIG_FTRACE_KHO
+static bool trace_in_kho;
+#endif
+
 /*
  * serialize the access of the ring buffer
  *
@@ -10591,12 +10596,59 @@ void __init early_trace_init(void)
init_events();
 }
 
+#ifdef CONFIG_FTRACE_KHO
+static int trace_kho_notifier(struct notifier_block *self,
+ unsigned long cmd,
+ void *v)
+{
+   const char compatible[] = "ftrace-v1";
+   void *fdt = v;
+   int err = 0;
+
+   switch (cmd) {
+   case KEXEC_KHO_ABORT:
+   if (trace_in_kho)
+   mutex_unlock(&trace_types_lock);
+   trace_in_kho = false;
+   return NOTIFY_DONE;
+   case KEXEC_KHO_DUMP:
+   /* Handled below */
+   break;
+   default:
+   return NOTIFY_BAD;
+   }
+
+   if (unlikely(tracing_disabled))
+   return NOTIFY_DONE;
+
+   err |= fdt_begin_node(fdt, "ftrace");
+   err |= fdt_property(fdt, "compatible", compatible, sizeof(compatible));
+   err |= fdt_end_node(fdt);
+
+   if (!err) {
+   /* Hold all future allocations */
+   mutex_lock(&trace_types_lock);
+   trace_in_kho = true;
+   }
+
+   return err ? NOTIFY_BAD : NOTIFY_DONE;
+}
+
+static struct notifier_block trace_kho_nb = {
+   .notifier_call = trace_kho_notifier,
+};
+#endif
+
 void __init trace_init(void)
 {
trace_event_init();
 
if (boot_instance_index)
enable_instances();
+
+#ifdef CONFIG_FTRACE_KHO
+   register_kho_notifier(&trace_kho_nb);
+#endif
 }
 
 __init static void clear_boot_tracer(void)
-- 
2.40.1




Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879






[PATCH 09/15] tracing: Introduce names for events

2023-12-12 Thread Alexander Graf
With KHO (Kexec HandOver), we want to preserve trace buffers. To parse
them, we need to ensure that all trace events that exist in the logs are
identical to the ones we parse them as. That means we need to match the
events before and after kexec.

As a first step towards that, let's give every event a unique name. That
way we can clearly identify the event before and after kexec and restore
its ID post-kexec.

Signed-off-by: Alexander Graf 
---
 include/linux/trace_events.h |  1 +
 include/trace/trace_events.h |  2 ++
 kernel/trace/blktrace.c  |  1 +
 kernel/trace/trace_branch.c  |  1 +
 kernel/trace/trace_events.c  |  3 +++
 kernel/trace/trace_functions_graph.c |  4 +++-
 kernel/trace/trace_output.c  | 13 +
 kernel/trace/trace_probe.c   |  3 +++
 kernel/trace/trace_syscalls.c| 29 
 9 files changed, 56 insertions(+), 1 deletion(-)

diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
index d68ff9b1247f..7670224aa92d 100644
--- a/include/linux/trace_events.h
+++ b/include/linux/trace_events.h
@@ -149,6 +149,7 @@ struct trace_event {
struct hlist_node   node;
int type;
struct trace_event_functions*funcs;
+   const char  *name;
 };
 
 extern int register_trace_event(struct trace_event *event);
diff --git a/include/trace/trace_events.h b/include/trace/trace_events.h
index c2f9cabf154d..bb4e6a33eef9 100644
--- a/include/trace/trace_events.h
+++ b/include/trace/trace_events.h
@@ -443,6 +443,7 @@ static struct trace_event_call __used event_##call = {  
\
.tp = &__tracepoint_##call, \
},  \
.event.funcs= _event_type_funcs_##template,   \
+   .event.name = __stringify(call),\
.print_fmt  = print_fmt_##template, \
.flags  = TRACE_EVENT_FL_TRACEPOINT,\
 }; \
@@ -460,6 +461,7 @@ static struct trace_event_call __used event_##call = {  
\
.tp = &__tracepoint_##call, \
},  \
.event.funcs= _event_type_funcs_##call,   \
+   .event.name = __stringify(template),\
.print_fmt  = print_fmt_##call, \
.flags  = TRACE_EVENT_FL_TRACEPOINT,\
 }; \
diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
index d5d94510afd3..7f86fd41b38e 100644
--- a/kernel/trace/blktrace.c
+++ b/kernel/trace/blktrace.c
@@ -1584,6 +1584,7 @@ static struct trace_event_functions trace_blk_event_funcs 
= {
 static struct trace_event trace_blk_event = {
.type   = TRACE_BLK,
.funcs  = &trace_blk_event_funcs,
+   .name   = "blk",
 };
 
 static int __init init_blk_tracer(void)
diff --git a/kernel/trace/trace_branch.c b/kernel/trace/trace_branch.c
index e47fdb4c92fb..3372070f2e85 100644
--- a/kernel/trace/trace_branch.c
+++ b/kernel/trace/trace_branch.c
@@ -168,6 +168,7 @@ static struct trace_event_functions trace_branch_funcs = {
 static struct trace_event trace_branch_event = {
.type   = TRACE_BRANCH,
.funcs  = &trace_branch_funcs,
+   .name   = "branch",
 };
 
 static struct tracer branch_trace __read_mostly =
diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
index f29e815ca5b2..4f5d37f96a17 100644
--- a/kernel/trace/trace_events.c
+++ b/kernel/trace/trace_events.c
@@ -2658,6 +2658,9 @@ static int event_init(struct trace_event_call *call)
if (WARN_ON(!name))
return -EINVAL;
 
+   if (!call->event.name)
+   call->event.name = name;
+
if (call->class->raw_init) {
ret = call->class->raw_init(call);
if (ret < 0 && ret != -ENOSYS)
diff --git a/kernel/trace/trace_functions_graph.c b/kernel/trace/trace_functions_graph.c
index c35fbaab2a47..088dfd4a1a56 100644
--- a/kernel/trace/trace_functions_graph.c
+++ b/kernel/trace/trace_functions_graph.c
@@ -1342,11 +1342,13 @@ static struct trace_event_functions graph_functions = {
 static struct trace_event graph_trace_entry_event = {
.type   = TRACE_GRAPH_ENT,
.funcs  = &graph_functions,
+   .name   = "graph_ent",
 };
 
 static struct trace_event graph_trace_ret_event = {
.type   = TRACE_GRAPH_RET,
-   .funcs  = &graph_functions
+   .funcs  = &graph_functions,
+   .name   = "graph_ret",
 };
 
 static 

[PATCH 07/15] x86: Add KHO support

2023-12-12 Thread Alexander Graf
We now have all bits in place to support KHO kexecs. This patch adds
awareness of KHO in the kexec file as well as boot path for x86 and
adds the respective kconfig option to the architecture so that it can
use KHO successfully.

In addition, it enlightens the decompression code with KHO so that its
KASLR location finder only considers memory regions that are not already
occupied by KHO memory.

Signed-off-by: Alexander Graf 
---
 arch/x86/Kconfig  | 12 ++
 arch/x86/boot/compressed/kaslr.c  | 55 +++
 arch/x86/include/uapi/asm/bootparam.h | 15 +++-
 arch/x86/kernel/e820.c|  9 +
 arch/x86/kernel/kexec-bzimage64.c | 39 +++
 arch/x86/kernel/setup.c   | 46 ++
 arch/x86/mm/init_32.c |  7 
 arch/x86/mm/init_64.c |  7 
 8 files changed, 189 insertions(+), 1 deletion(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 3762f41bb092..849e6ddc5d94 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2103,6 +2103,18 @@ config ARCH_SUPPORTS_CRASH_HOTPLUG
 config ARCH_HAS_GENERIC_CRASHKERNEL_RESERVATION
def_bool CRASH_CORE
 
+config KEXEC_KHO
+   bool "kexec handover"
+   depends on KEXEC
+   select MEMBLOCK_SCRATCH
+   select LIBFDT
+   select CMA
+   help
+ Allow kexec to hand over state across kernels by generating and
+ passing additional metadata to the target kernel. This is useful
+ to keep data or state alive across the kexec. For this to work,
+ both source and target kernels need to have this option enabled.
+
 config PHYSICAL_START
hex "Physical address where the kernel is loaded" if (EXPERT || CRASH_DUMP)
default "0x100"
diff --git a/arch/x86/boot/compressed/kaslr.c b/arch/x86/boot/compressed/kaslr.c
index dec961c6d16a..93ea292e4c18 100644
--- a/arch/x86/boot/compressed/kaslr.c
+++ b/arch/x86/boot/compressed/kaslr.c
@@ -29,6 +29,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -472,6 +473,60 @@ static bool mem_avoid_overlap(struct mem_vector *img,
}
}
 
+#ifdef CONFIG_KEXEC_KHO
+   if (ptr->type == SETUP_KEXEC_KHO) {
+   struct kho_data *kho = (struct kho_data *)ptr->data;
+   struct kho_mem *mems = (void *)kho->mem_cache_addr;
+   int nr_mems = kho->mem_cache_size / sizeof(*mems);
+   int i;
+
+   /* Avoid the mem cache */
+   avoid = (struct mem_vector) {
+   .start = kho->mem_cache_addr,
+   .size = kho->mem_cache_size,
+   };
+
+   if (mem_overlaps(img, &avoid) && (avoid.start < earliest)) {
+   *overlap = avoid;
+   earliest = overlap->start;
+   is_overlapping = true;
+   }
+
+   /* And the KHO DT */
+   avoid = (struct mem_vector) {
+   .start = kho->dt_addr,
+   .size = kho->dt_size,
+   };
+
+   if (mem_overlaps(img, &avoid) && (avoid.start < earliest)) {
+   *overlap = avoid;
+   earliest = overlap->start;
+   is_overlapping = true;
+   }
+
+   /* As well as any other KHO memory reservations */
+   for (i = 0; i < nr_mems; i++) {
+   avoid = (struct mem_vector) {
+   .start = mems[i].addr,
+   .size = mems[i].len,
+   };
+
+   /*
+* This mem starts after our current break.
+* The array is sorted, so we're done.
+*/
+   if (avoid.start >= earliest)
+   break;
+
+   if (mem_overlaps(img, &avoid)) {
+   *overlap = avoid;
+   earliest = overlap->start;
+   is_overlapping = true;
+   }
+   }
+   }
+#endif
+
ptr = (struct setup_data *)(unsigned long)ptr->next;
}
 
diff --git a/arch/x86/include/uapi/asm/bootparam.h b/arch/x86/include/uapi/asm/bootparam.h
index 01d19fc22346..013af38a9673 100644
--- a/arch/x86/include/uapi/asm/bootparam.h
+++ b/arch/x86/include/uapi/asm/bootparam.h
@@ -13,7 +13,8 @@
 #define SETUP_CC_BLOB  7
 #define SETUP_IMA   

[PATCH 08/15] tracing: Introduce names for ring buffers

2023-12-12 Thread Alexander Graf
With KHO (Kexec HandOver), we want to preserve trace buffers across
kexec. To carry over their state between kernels, the kernel needs a
common handle for them that exists on both sides. As handle we introduce
names for ring buffers. In a follow-up patch, the kernel can then use
these names to recover buffer contents for specific ring buffers.

Signed-off-by: Alexander Graf 
---
 include/linux/ring_buffer.h | 7 ---
 kernel/trace/ring_buffer.c  | 5 -
 kernel/trace/trace.c| 7 ---
 3 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
index 782e14f62201..f34538f97c75 100644
--- a/include/linux/ring_buffer.h
+++ b/include/linux/ring_buffer.h
@@ -85,17 +85,18 @@ void ring_buffer_discard_commit(struct trace_buffer *buffer,
  * size is in bytes for each per CPU buffer.
  */
 struct trace_buffer *
-__ring_buffer_alloc(unsigned long size, unsigned flags, struct lock_class_key *key);
+__ring_buffer_alloc(const char *name, unsigned long size, unsigned flags,
+   struct lock_class_key *key);
 
 /*
  * Because the ring buffer is generic, if other users of the ring buffer get
  * traced by ftrace, it can produce lockdep warnings. We need to keep each
  * ring buffer's lock class separate.
  */
-#define ring_buffer_alloc(size, flags) \
+#define ring_buffer_alloc(name, size, flags)   \
 ({ \
static struct lock_class_key __key; \
-   __ring_buffer_alloc((size), (flags), &__key);   \
+   __ring_buffer_alloc((name), (size), (flags), &__key);   \
 })
 
 int ring_buffer_wait(struct trace_buffer *buffer, int cpu, int full);
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 43cc47d7faaf..eaaf823ddedb 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -557,6 +557,7 @@ struct trace_buffer {
 
struct rb_irq_work  irq_work;
booltime_stamp_abs;
+   const char  *name;
 };
 
 struct ring_buffer_iter {
@@ -1801,7 +1802,8 @@ static void rb_free_cpu_buffer(struct ring_buffer_per_cpu 
*cpu_buffer)
  * when the buffer wraps. If this flag is not set, the buffer will
  * drop data when the tail hits the head.
  */
-struct trace_buffer *__ring_buffer_alloc(unsigned long size, unsigned flags,
+struct trace_buffer *__ring_buffer_alloc(const char *name,
+   unsigned long size, unsigned flags,
struct lock_class_key *key)
 {
struct trace_buffer *buffer;
@@ -1823,6 +1825,7 @@ struct trace_buffer *__ring_buffer_alloc(unsigned long 
size, unsigned flags,
buffer->flags = flags;
buffer->clock = trace_clock_local;
buffer->reader_lock_key = key;
+   buffer->name = name;
 
init_irq_work(&buffer->irq_work.work, rb_wake_up_waiters);
init_waitqueue_head(&buffer->irq_work.waiters);
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 9aebf904ff97..7700ca1be2a5 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -9384,7 +9384,8 @@ allocate_trace_buffer(struct trace_array *tr, struct 
array_buffer *buf, int size
 
buf->tr = tr;
 
-   buf->buffer = ring_buffer_alloc(size, rb_flags);
+   buf->buffer = ring_buffer_alloc(tr->name ? tr->name : "global_trace",
+   size, rb_flags);
if (!buf->buffer)
return -ENOMEM;
 
@@ -9421,7 +9422,7 @@ static int allocate_trace_buffers(struct trace_array *tr, 
int size)
return ret;
 
 #ifdef CONFIG_TRACER_MAX_TRACE
-   ret = allocate_trace_buffer(tr, &tr->max_buffer,
+   ret = allocate_trace_buffer(NULL, >max_buffer,
allocate_snapshot ? size : 1);
if (MEM_FAIL(ret, "Failed to allocate trace buffer\n")) {
free_trace_buffer(>array_buffer);
@@ -10473,7 +10474,7 @@ __init static int tracer_alloc_buffers(void)
goto out_free_cpumask;
/* Used for event triggers */
ret = -ENOMEM;
-   temp_buffer = ring_buffer_alloc(PAGE_SIZE, RB_FL_OVERWRITE);
+   temp_buffer = ring_buffer_alloc("temp_buffer", PAGE_SIZE, RB_FL_OVERWRITE);
if (!temp_buffer)
goto out_rm_hp_state;
 
-- 
2.40.1










[PATCH 06/15] arm64: Add KHO support

2023-12-12 Thread Alexander Graf
We now have all bits in place to support KHO kexecs. This patch adds
awareness of KHO in the kexec file as well as boot path for arm64 and
adds the respective kconfig option to the architecture so that it can
use KHO successfully.

Signed-off-by: Alexander Graf 
---
 arch/arm64/Kconfig| 12 
 arch/arm64/kernel/setup.c |  2 ++
 arch/arm64/mm/init.c  |  8 
 drivers/of/fdt.c  | 41 +++
 drivers/of/kexec.c| 36 ++
 5 files changed, 99 insertions(+)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 7b071a00425d..1ba338ce7598 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -1501,6 +1501,18 @@ config ARCH_SUPPORTS_CRASH_DUMP
 config ARCH_HAS_GENERIC_CRASHKERNEL_RESERVATION
def_bool CRASH_CORE
 
+config KEXEC_KHO
+   bool "kexec handover"
+   depends on KEXEC
+   select MEMBLOCK_SCRATCH
+   select LIBFDT
+   select CMA
+   help
+ Allow kexec to hand over state across kernels by generating and
+ passing additional metadata to the target kernel. This is useful
+ to keep data or state alive across the kexec. For this to work,
+ both source and target kernels need to have this option enabled.
+
 config TRANS_TABLE
def_bool y
depends on HIBERNATION || KEXEC_CORE
diff --git a/arch/arm64/kernel/setup.c b/arch/arm64/kernel/setup.c
index 417a8a86b2db..8035b673d96d 100644
--- a/arch/arm64/kernel/setup.c
+++ b/arch/arm64/kernel/setup.c
@@ -346,6 +346,8 @@ void __init __no_sanitize_address setup_arch(char 
**cmdline_p)
 
paging_init();
 
+   kho_reserve_mem();
+
acpi_table_upgrade();
 
/* Parse the ACPI tables for possible boot-time configuration */
diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 74c1db8ce271..254d82f3383a 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -358,6 +358,8 @@ void __init bootmem_init(void)
 */
arch_reserve_crashkernel();
 
+   kho_reserve();
+
memblock_dump_all();
 }
 
@@ -386,6 +388,12 @@ void __init mem_init(void)
/* this will put all unused low memory onto the freelists */
memblock_free_all();
 
+   /*
+* Now that all KHO pages are marked as reserved, let's flip them back
+* to normal pages with accurate refcount.
+*/
+   kho_populate_refcount();
+
/*
 * Check boundaries twice: Some fundamental inconsistencies can be
 * detected at build time already.
diff --git a/drivers/of/fdt.c b/drivers/of/fdt.c
index bf502ba8da95..af95139351ed 100644
--- a/drivers/of/fdt.c
+++ b/drivers/of/fdt.c
@@ -1006,6 +1006,44 @@ void __init 
early_init_dt_check_for_usable_mem_range(void)
memblock_add(rgn[i].base, rgn[i].size);
 }
 
+/**
+ * early_init_dt_check_kho - Decode info required for kexec handover from DT
+ */
+void __init early_init_dt_check_kho(void)
+{
+#ifdef CONFIG_KEXEC_KHO
+   unsigned long node = chosen_node_offset;
+   u64 kho_start, scratch_start, scratch_size, mem_start, mem_size;
+   const __be32 *p;
+   int l;
+
+   if ((long)node < 0)
+   return;
+
+   p = of_get_flat_dt_prop(node, "linux,kho-dt", &l);
+   if (l != (dt_root_addr_cells + dt_root_size_cells) * sizeof(__be32))
+   return;
+
+   kho_start = dt_mem_next_cell(dt_root_addr_cells, &p);
+
+   p = of_get_flat_dt_prop(node, "linux,kho-scratch", &l);
+   if (l != (dt_root_addr_cells + dt_root_size_cells) * sizeof(__be32))
+   return;
+
+   scratch_start = dt_mem_next_cell(dt_root_addr_cells, &p);
+   scratch_size = dt_mem_next_cell(dt_root_addr_cells, &p);
+
+   p = of_get_flat_dt_prop(node, "linux,kho-mem", &l);
+   if (l != (dt_root_addr_cells + dt_root_size_cells) * sizeof(__be32))
+   return;
+
+   mem_start = dt_mem_next_cell(dt_root_addr_cells, &p);
+   mem_size = dt_mem_next_cell(dt_root_addr_cells, &p);
+
+   kho_populate(kho_start, scratch_start, scratch_size, mem_start, mem_size);
+#endif
+}
+
 #ifdef CONFIG_SERIAL_EARLYCON
 
 int __init early_init_dt_scan_chosen_stdout(void)
@@ -1304,6 +1342,9 @@ void __init early_init_dt_scan_nodes(void)
 
/* Handle linux,usable-memory-range property */
early_init_dt_check_for_usable_mem_range();
+
+   /* Handle kexec handover */
+   early_init_dt_check_kho();
 }
 
 bool __init early_init_dt_scan(void *params)
diff --git a/drivers/of/kexec.c b/drivers/of/kexec.c
index 68278340cecf..a612e6bb8c75 100644
--- a/drivers/of/kexec.c
+++ b/drivers/of/kexec.c
@@ -264,6 +264,37 @@ static inline int setup_ima_buffer(const struct kimage 
*image, void *fdt,
 }
 #endif /* CONFIG_IMA_KEXEC */
 
+static int kho_add_chosen(const struct kimage *image, void *fdt, int chosen_node)
+{
+   int ret = 0;
+
+#ifdef CONFIG_KEXEC_KHO
+   if (!image->kho.dt.buffer || !image->kho.mem_cache.buffer)
+

[PATCH 05/15] kexec: Add KHO support to kexec file loads

2023-12-12 Thread Alexander Graf
Kexec has two modes: a user space driven mode and a kernel driven mode.
For the kernel driven mode, kernel code determines the physical
addresses of all target buffers that the payload gets copied into.

With KHO, we can only safely copy payloads into the "scratch area".
Teach the kexec file loader about it, so it only allocates for that
area. In addition, enlighten it with support to ask the KHO subsystem
for its respective payloads to copy into target memory. Also teach the
KHO subsystem how to fill the images for file loads.

Signed-off-by: Alexander Graf 
---
 include/linux/kexec.h  |   9 ++
 kernel/kexec_file.c|  41 
 kernel/kexec_kho_out.c | 210 +
 3 files changed, 260 insertions(+)

diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index a3c4fee6f86a..c8859a2ca872 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -362,6 +362,13 @@ struct kimage {
size_t ima_buffer_size;
 #endif
 
+#ifdef CONFIG_KEXEC_KHO
+   struct {
+   struct kexec_buf dt;
+   struct kexec_buf mem_cache;
+   } kho;
+#endif
+
/* Core ELF header buffer */
void *elf_headers;
unsigned long elf_headers_sz;
@@ -543,6 +550,7 @@ static inline bool is_kho_boot(void)
 
 /* egest handover metadata */
 void kho_reserve(void);
+int kho_fill_kimage(struct kimage *image);
 int register_kho_notifier(struct notifier_block *nb);
 int unregister_kho_notifier(struct notifier_block *nb);
 bool kho_is_active(void);
@@ -558,6 +566,7 @@ static inline void *kho_get_fdt(void) { return NULL; }
 
 /* egest handover metadata */
 static inline void kho_reserve(void) { }
+static inline int kho_fill_kimage(struct kimage *image) { return 0; }
static inline int register_kho_notifier(struct notifier_block *nb) { return -EINVAL; }
static inline int unregister_kho_notifier(struct notifier_block *nb) { return -EINVAL; }
 static inline bool kho_is_active(void) { return false; }
diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
index f9a419cd22d4..d895d0a49bd9 100644
--- a/kernel/kexec_file.c
+++ b/kernel/kexec_file.c
@@ -113,6 +113,13 @@ void kimage_file_post_load_cleanup(struct kimage *image)
image->ima_buffer = NULL;
 #endif /* CONFIG_IMA_KEXEC */
 
+#ifdef CONFIG_KEXEC_KHO
+   kvfree(image->kho.mem_cache.buffer);
+   image->kho.mem_cache = (struct kexec_buf) {};
+   kvfree(image->kho.dt.buffer);
+   image->kho.dt = (struct kexec_buf) {};
+#endif
+
/* See if architecture has anything to cleanup post load */
arch_kimage_file_post_load_cleanup(image);
 
@@ -249,6 +256,11 @@ kimage_file_prepare_segments(struct kimage *image, int 
kernel_fd, int initrd_fd,
/* IMA needs to pass the measurement list to the next kernel. */
ima_add_kexec_buffer(image);
 
+   /* If KHO is active, add its images to the list */
+   ret = kho_fill_kimage(image);
+   if (ret)
+   goto out;
+
/* Call image load handler */
ldata = kexec_image_load_default(image);
 
@@ -518,6 +530,24 @@ static int locate_mem_hole_callback(struct resource *res, 
void *arg)
return locate_mem_hole_bottom_up(start, end, kbuf);
 }
 
+#ifdef CONFIG_KEXEC_KHO
+static int kexec_walk_kho_scratch(struct kexec_buf *kbuf,
+ int (*func)(struct resource *, void *))
+{
+   int ret = 0;
+
+   struct resource res = {
+   .start = kho_scratch_phys,
+   .end = kho_scratch_phys + kho_scratch_len,
+   };
+
+   /* Try to fit the kimage into our KHO scratch region */
+   ret = func(&res, kbuf);
+
+   return ret;
+}
+#endif
+
 #ifdef CONFIG_ARCH_KEEP_MEMBLOCK
 static int kexec_walk_memblock(struct kexec_buf *kbuf,
   int (*func)(struct resource *, void *))
@@ -612,6 +642,17 @@ int kexec_locate_mem_hole(struct kexec_buf *kbuf)
if (kbuf->mem != KEXEC_BUF_MEM_UNKNOWN)
return 0;
 
+#ifdef CONFIG_KEXEC_KHO
+   /*
+* If KHO is active, only use KHO scratch memory. All other memory
+* could potentially be handed over.
+*/
+   if (kho_is_active() && kbuf->image->type != KEXEC_TYPE_CRASH) {
+   ret = kexec_walk_kho_scratch(kbuf, locate_mem_hole_callback);
+   return ret == 1 ? 0 : -EADDRNOTAVAIL;
+   }
+#endif
+
if (!IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK))
ret = kexec_walk_resources(kbuf, locate_mem_hole_callback);
else
diff --git a/kernel/kexec_kho_out.c b/kernel/kexec_kho_out.c
index e6184bde5c10..24ced6c3013f 100644
--- a/kernel/kexec_kho_out.c
+++ b/kernel/kexec_kho_out.c
@@ -50,6 +50,216 @@ int unregister_kho_notifier(struct notifier_block *nb)
 }
 EXPORT_SYMBOL_GPL(unregister_kho_notifier);
 
+static int kho_mem_cache_add(void *fdt, struct kho_mem *mem_cache, int size,
+struct kho_mem *new_mem)
+{
+   int entries = size / sizeof(*mem_cache);
+   

[PATCH 04/15] kexec: Add KHO parsing support

2023-12-12 Thread Alexander Graf
When we have a KHO kexec, we get a device tree, mem cache and scratch
region to populate the state of the system. Provide helper functions
that allow architecture code to easily handle memory reservations based
on them and give device drivers visibility into the KHO DT and memory
reservations so they can recover their own state.

Signed-off-by: Alexander Graf 
---
 Documentation/ABI/testing/sysfs-firmware-kho |   9 +
 MAINTAINERS  |   1 +
 include/linux/kexec.h|  23 ++
 kernel/Makefile  |   1 +
 kernel/kexec_kho_in.c| 298 +++
 5 files changed, 332 insertions(+)
 create mode 100644 Documentation/ABI/testing/sysfs-firmware-kho
 create mode 100644 kernel/kexec_kho_in.c

diff --git a/Documentation/ABI/testing/sysfs-firmware-kho b/Documentation/ABI/testing/sysfs-firmware-kho
new file mode 100644
index ..e4ed2cb7c810
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-firmware-kho
@@ -0,0 +1,9 @@
+What:  /sys/firmware/kho/dt
+Date:  December 2023
+Contact:   Alexander Graf 
+Description:
+   When the kernel was booted with Kexec HandOver (KHO),
+   the device tree that carries metadata about the previous
+   kernel's state is in this file. This file may disappear
+   once all consumers of it have finished interpreting their
+   metadata.
diff --git a/MAINTAINERS b/MAINTAINERS
index 4ebf7c5fd424..ec92a0dd628d 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -11769,6 +11769,7 @@ M:  Eric Biederman 
 L: kexec@lists.infradead.org
 S: Maintained
 W: http://kernel.org/pub/linux/utils/kernel/kexec/
+F: Documentation/ABI/testing/sysfs-firmware-kho
 F: Documentation/ABI/testing/sysfs-kernel-kho
 F: include/linux/kexec.h
 F: include/uapi/linux/kexec.h
diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index db2597e5550d..a3c4fee6f86a 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -528,12 +528,35 @@ enum kho_event {
 extern phys_addr_t kho_scratch_phys;
 extern phys_addr_t kho_scratch_len;
 
+/* ingest handover metadata */
+void kho_reserve_mem(void);
+void kho_populate(phys_addr_t dt_phys, phys_addr_t scratch_phys, u64 scratch_len,
+ phys_addr_t mem_phys, u64 mem_len);
+void kho_populate_refcount(void);
+void *kho_get_fdt(void);
+void kho_return_mem(const struct kho_mem *mem);
+void *kho_claim_mem(const struct kho_mem *mem);
+static inline bool is_kho_boot(void)
+{
+   return !!kho_scratch_phys;
+}
+
 /* egest handover metadata */
 void kho_reserve(void);
 int register_kho_notifier(struct notifier_block *nb);
 int unregister_kho_notifier(struct notifier_block *nb);
 bool kho_is_active(void);
 #else
+/* ingest handover metadata */
+static inline void kho_reserve_mem(void) { }
+static inline bool is_kho_boot(void) { return false; }
+static inline void kho_populate(phys_addr_t dt_phys, phys_addr_t scratch_phys,
+   u64 scratch_len, phys_addr_t mem_phys,
+   u64 mem_len) { }
+static inline void kho_populate_refcount(void) { }
+static inline void *kho_get_fdt(void) { return NULL; }
+
+/* egest handover metadata */
 static inline void kho_reserve(void) { }
static inline int register_kho_notifier(struct notifier_block *nb) { return -EINVAL; }
static inline int unregister_kho_notifier(struct notifier_block *nb) { return -EINVAL; }
diff --git a/kernel/Makefile b/kernel/Makefile
index a6bd31e22c09..7c3065e40c75 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -73,6 +73,7 @@ obj-$(CONFIG_KEXEC_CORE) += kexec_core.o
 obj-$(CONFIG_KEXEC) += kexec.o
 obj-$(CONFIG_KEXEC_FILE) += kexec_file.o
 obj-$(CONFIG_KEXEC_ELF) += kexec_elf.o
+obj-$(CONFIG_KEXEC_KHO) += kexec_kho_in.o
 obj-$(CONFIG_KEXEC_KHO) += kexec_kho_out.o
 obj-$(CONFIG_BACKTRACE_SELF_TEST) += backtracetest.o
 obj-$(CONFIG_COMPAT) += compat.o
diff --git a/kernel/kexec_kho_in.c b/kernel/kexec_kho_in.c
new file mode 100644
index ..12ec54fc537a
--- /dev/null
+++ b/kernel/kexec_kho_in.c
@@ -0,0 +1,298 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * kexec_kho_in.c - kexec handover code to ingest metadata.
+ * Copyright (C) 2023 Alexander Graf 
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+/* The kho dt during runtime */
+static void *fdt;
+
+/* Globals to hand over phys/len from early to runtime */
+static phys_addr_t handover_phys __initdata;
+static u32 handover_len __initdata;
+
+static phys_addr_t mem_phys __initdata;
+static u32 mem_len __initdata;
+
+phys_addr_t kho_scratch_phys;
+phys_addr_t kho_scratch_len;
+
+void *kho_get_fdt(void)
+{
+   return fdt;
+}
+EXPORT_SYMBOL_GPL(kho_get_fdt);
+
+/**
+ * kho_populate_refcount - Scan the DT for any memory ranges. Increase the
+ * affected pages' refcount by 1 for each.
+ */
+__init void 

[PATCH 03/15] kexec: Add Kexec HandOver (KHO) generation helpers

2023-12-12 Thread Alexander Graf
This patch adds the core infrastructure to generate Kexec HandOver
metadata. Kexec HandOver is a mechanism that allows Linux to preserve
state - arbitrary properties as well as memory locations - across kexec.

It does so using 3 concepts:

  1) Device Tree - Every KHO kexec carries a KHO-specific flattened
     device tree blob that describes the state of the system. Device
     drivers can register with KHO to serialize their state before kexec.

  2) Mem cache - A memblock-like structure that contains full page
     ranges of reservations. These cannot be part of the architectural
     reservations, because they differ on every kexec.

  3) Scratch Region - A CMA region that we allocate in the first kernel.
     CMA gives us the guarantee that no handover pages land in that
     region, because handover pages must be at a static physical memory
     location. We use this region as the place to load future kexec
     images into, so that they won't collide with any handover data.

Signed-off-by: Alexander Graf 
---
 Documentation/ABI/testing/sysfs-kernel-kho|  53 +++
 .../admin-guide/kernel-parameters.txt |  10 +
 MAINTAINERS   |   1 +
 include/linux/kexec.h |  24 ++
 include/uapi/linux/kexec.h|   6 +
 kernel/Makefile   |   1 +
 kernel/kexec_kho_out.c| 316 ++
 7 files changed, 411 insertions(+)
 create mode 100644 Documentation/ABI/testing/sysfs-kernel-kho
 create mode 100644 kernel/kexec_kho_out.c

diff --git a/Documentation/ABI/testing/sysfs-kernel-kho b/Documentation/ABI/testing/sysfs-kernel-kho
new file mode 100644
index ..f69e7b81a337
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-kernel-kho
@@ -0,0 +1,53 @@
+What:  /sys/kernel/kho/active
+Date:  December 2023
+Contact:   Alexander Graf 
+Description:
+   Kexec HandOver (KHO) allows Linux to transition the state of
+   compatible drivers into the next kexec'ed kernel. To do so,
+   device drivers will serialize their current state into a DT.
+   While the state is serialized, they are unable to perform
+   any modifications to state that was serialized, such as
+   handed over memory allocations.
+
	   When this file contains "1", the system is in the transition
	   state. When it contains "0", it is not. To switch between the
	   two states, echo the respective number into this file.
+
+What:  /sys/kernel/kho/dt_max
+Date:  December 2023
+Contact:   Alexander Graf 
+Description:
+   KHO needs to allocate a buffer for the DT that gets
+   generated before it knows the final size. By default, it
+   will allocate 10 MiB for it. You can write to this file
+   to modify the size of that allocation.
+
+What:  /sys/kernel/kho/scratch_len
+Date:  December 2023
+Contact:   Alexander Graf 
+Description:
+   To support continuous KHO kexecs, we need to reserve a
+   physically contiguous memory region that will always stay
+   available for future kexec allocations. This file describes
+   the length of that memory region. Kexec user space tooling
+   can use this to determine where it should place its payload
+   images.
+
+What:  /sys/kernel/kho/scratch_phys
+Date:  December 2023
+Contact:   Alexander Graf 
+Description:
+   To support continuous KHO kexecs, we need to reserve a
+   physically contiguous memory region that will always stay
+   available for future kexec allocations. This file describes
+   the physical location of that memory region. Kexec user space
+   tooling can use this to determine where it should place its
+   payload images.
+
+What:  /sys/kernel/kho/dt
+Date:  December 2023
+Contact:   Alexander Graf 
+Description:
+   When KHO is active, the kernel exposes the generated DT that
+   carries its current KHO state in this file. Kexec user space
+   tooling can use this as input file for the KHO payload image.
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 51575cd31741..efeef075617e 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2504,6 +2504,16 @@
kgdbwait[KGDB] Stop kernel execution and enter the
kernel debugger at the earliest opportunity.
 
+   kho_scratch=n[KMG]  [KEXEC] Sets the size of the KHO scratch
+   region. The KHO scratch region is a physically contiguous
+   memory range that can only be used for non-kernel
+   

[PATCH 00/15] kexec: Allow preservation of ftrace buffers

2023-12-12 Thread Alexander Graf
Kexec today considers itself purely a boot loader: When we enter the new
kernel, any state the previous kernel left behind is irrelevant and the
new kernel reinitializes the system.

However, there are use cases where this mode of operation is not what we
actually want. In virtualization hosts for example, we want to use kexec
to update the host kernel while virtual machine memory stays untouched.
When we add device assignment to the mix, we also need to ensure that
IOMMU and VFIO states are untouched. If we add PCIe peer to peer DMA, we
need to do the same for the PCI subsystem. If we want to kexec while an
SEV-SNP enabled virtual machine is running, we need to preserve the VM
context pages and physical memory. See James' and my Linux Plumbers
Conference 2023 presentation for details:

  https://lpc.events/event/17/contributions/1485/

To start us on the journey to support all the use cases above, this
patch implements basic infrastructure to allow hand over of kernel state
across kexec (Kexec HandOver, aka KHO). As example target, we use ftrace:
With this patch set applied, you can read ftrace records from the
pre-kexec environment in your post-kexec one. This creates a very powerful
debugging and performance analysis tool for kexec. It's also slightly
easier to reason about than full blown VFIO state preservation.

== Alternatives ==

There are alternative approaches to (parts of) the problems above:

  * Memory Pools [1] - preallocated persistent memory region + allocator
  * PRMEM [2] - resizable persistent memory regions with fixed metadata
pointer on the kernel command line + allocator
  * Pkernfs [3] - preallocated file system for in-kernel data with fixed
  address location on the kernel command line
  * PKRAM [4] - handover of user space pages using a fixed metadata page
specified via command line

All of the approaches above fundamentally have the same problem: They
require the administrator to explicitly carve out a physical memory
location because they have no mechanism outside of the kernel command
line to pass data (including memory reservations) between kexec'ing
kernels.

KHO provides that base foundation. We will determine later whether we
still need any of the approaches above for fast bulk memory handover of,
for example, IOMMU page tables. But IMHO they would all be users of KHO,
with KHO providing the foundational primitive to pass metadata and bulk
memory reservations as well as provide easy versioning for data.

== Documentation ==

If people are happy with the approach in this patch set, I will write up
conclusive documentation including schemas for the metadata as part of its
next iteration. For now, here's a rudimentary overview:

We introduce a metadata file that the kernels pass between each other. How
they pass it is architecture specific. The file's format is a Flattened
Device Tree (fdt) which has a generator and parser already included in
Linux. When the root user enables KHO through /sys/kernel/kho/active, the
kernel invokes callbacks to every driver that supports KHO to serialize
its state. When the actual kexec happens, the fdt is part of the image
set that we boot into. In addition, we keep a "scratch region" available
for kexec: A physically contiguous memory region that is guaranteed to
not have any memory that KHO would preserve.  The new kernel bootstraps
itself using the scratch region and marks all handed-over memory as in
use. When drivers that support KHO initialize, they introspect the fdt
and recover their state from it. This includes memory reservations, where
the driver can either discard or claim reservations.

== Limitations ==

I have currently only implemented file-based kexec. The kernel interfaces
in the patch set are already in place to support user space kexec as well,
but I have not implemented it yet.

== How to Use ==

To use the code, please boot the kernel with the "kho_scratch=" command
line parameter set: "kho_scratch=512M". KHO requires a scratch region.

Make sure to fill ftrace with contents that you want to observe after
kexec.  Then, before you invoke file based "kexec -l", activate KHO:

  # echo 1 > /sys/kernel/kho/active
  # kexec -l Image --initrd=initrd -s
  # kexec -e

The new kernel will boot up and contain the previous kernel's trace
buffers in /sys/kernel/debug/tracing/trace.



Alex

[1] https://lore.kernel.org/all/169645773092.11424.7258549771090599226.stgit@skinsburskii./
[2] https://lore.kernel.org/all/20231016233215.13090-1-madve...@linux.microsoft.com/
[3] https://lpc.events/event/17/contributions/1485/attachments/1296/2650/jgowans-preserving-across-kexec.pdf
[4] https://lore.kernel.org/kexec/1682554137-13938-1-git-send-email-anthony.yzn...@oracle.com/


Alexander Graf (15):
  mm,memblock: Add support for scratch memory
  memblock: Declare scratch memory as CMA
  kexec: Add Kexec HandOver (KHO) generation helpers
  kexec: Add KHO parsing support
  kexec: Add KHO support to kexec 

[PATCH 02/15] memblock: Declare scratch memory as CMA

2023-12-12 Thread Alexander Graf
When we finish populating our memory, we don't want to lose the scratch
region as memory we can use for useful data. To do that, we mark it as
CMA memory. That means that any allocation within it only happens with
movable memory which we can then happily discard for the next kexec.

That way we don't lose the scratch region's memory anymore for
allocations after boot.

Signed-off-by: Alexander Graf 
---
 mm/memblock.c | 30 ++
 1 file changed, 26 insertions(+), 4 deletions(-)

diff --git a/mm/memblock.c b/mm/memblock.c
index e89e6c8f9d75..44741424dab7 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -1100,10 +1101,6 @@ static bool should_skip_region(struct memblock_type *type,
if ((flags & MEMBLOCK_SCRATCH) && !memblock_is_scratch(m))
return true;
 
-   /* Leave scratch memory alone after scratch-only phase */
-   if (!(flags & MEMBLOCK_SCRATCH) && memblock_is_scratch(m))
-   return true;
-
return false;
 }
 
@@ -2153,6 +2150,20 @@ static void __init __free_pages_memory(unsigned long start, unsigned long end)
}
 }
 
+static void reserve_scratch_mem(phys_addr_t start, phys_addr_t end)
+{
+#ifdef CONFIG_MEMBLOCK_SCRATCH
+   ulong start_pfn = pageblock_start_pfn(PFN_DOWN(start));
+   ulong end_pfn = pageblock_align(PFN_UP(end));
+   ulong pfn;
+
+   for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
+   /* Mark as CMA to prevent kernel allocations in it */
+   set_pageblock_migratetype(pfn_to_page(pfn), MIGRATE_CMA);
+   }
+#endif
+}
+
 static unsigned long __init __free_memory_core(phys_addr_t start,
 phys_addr_t end)
 {
@@ -2214,6 +2225,17 @@ static unsigned long __init free_low_memory_core_early(void)
 
memmap_init_reserved_pages();
 
+#ifdef CONFIG_MEMBLOCK_SCRATCH
+   /*
+* Mark scratch mem as CMA before we return it. That way we ensure that
+* no kernel allocations happen on it. That means we can reuse it as
+* scratch memory again later.
+*/
+   __for_each_mem_range(i, &memblock.memory, NULL, NUMA_NO_NODE,
+MEMBLOCK_SCRATCH, &start, &end, NULL)
+   reserve_scratch_mem(start, end);
+#endif
+
/*
 * We need to use NUMA_NO_NODE instead of NODE_DATA(0)->node_id
 *  because in some case like Node0 doesn't have RAM installed
-- 
2.40.1




Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879






[PATCH 01/15] mm,memblock: Add support for scratch memory

2023-12-12 Thread Alexander Graf
With KHO (Kexec HandOver), we need a way to ensure that the new kernel
does not allocate memory on top of any memory regions that the previous
kernel was handing over. But to know where those are, we need to include
them in the reserved memblocks array which may not be big enough to hold
all allocations. To resize the array, we need to allocate memory. That
brings us into a catch-22 situation.

The solution to that is the scratch region: a safe region to operate in.
KHO provides a "scratch region" as part of its metadata. This scratch
region is a single, contiguous memory block that we know does not
contain any KHO allocations. We can allocate exclusively from there until
kernel initialization reaches the point where we know about all the
KHO memory reservations. We introduce a new memblock_set_scratch_only()
function that allows KHO to indicate that any memblock allocation must
happen from the scratch region.

Later, we may want to perform another KHO kexec. For that, we reuse the
same scratch region. To ensure that no data destined for handover gets
allocated inside that scratch region, we flip the semantics of the
scratch region with memblock_clear_scratch_only(): after that call, no
allocations may happen from scratch memblock regions. We will lift that
restriction in the next patch.

Signed-off-by: Alexander Graf 
---
 include/linux/memblock.h | 19 +
 mm/Kconfig   |  4 +++
 mm/memblock.c| 61 +++-
 3 files changed, 83 insertions(+), 1 deletion(-)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index ae3bde302f70..14043f5b696f 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -42,6 +42,10 @@ extern unsigned long long max_possible_pfn;
  * kernel resource tree.
  * @MEMBLOCK_RSRV_NOINIT: memory region for which struct pages are
  * not initialized (only for reserved regions).
+ * @MEMBLOCK_SCRATCH: memory region that kexec can pass to the next kernel in
+ * handover mode. During early boot, we do not know about all memory
+ * reservations yet, so we get scratch memory from the previous kernel that we
+ * know is good to use. It is the only memory that allocations may happen from
+ * in this phase.
  */
 enum memblock_flags {
MEMBLOCK_NONE   = 0x0,  /* No special request */
@@ -50,6 +54,7 @@ enum memblock_flags {
MEMBLOCK_NOMAP  = 0x4,  /* don't add to kernel direct mapping */
MEMBLOCK_DRIVER_MANAGED = 0x8,  /* always detected via a driver */
MEMBLOCK_RSRV_NOINIT= 0x10, /* don't initialize struct pages */
+   MEMBLOCK_SCRATCH= 0x20, /* scratch memory for kexec handover */
 };
 
 /**
@@ -129,6 +134,8 @@ int memblock_mark_mirror(phys_addr_t base, phys_addr_t size);
 int memblock_mark_nomap(phys_addr_t base, phys_addr_t size);
 int memblock_clear_nomap(phys_addr_t base, phys_addr_t size);
 int memblock_reserved_mark_noinit(phys_addr_t base, phys_addr_t size);
+int memblock_mark_scratch(phys_addr_t base, phys_addr_t size);
+int memblock_clear_scratch(phys_addr_t base, phys_addr_t size);
 
 void memblock_free_all(void);
 void memblock_free(void *ptr, size_t size);
@@ -273,6 +280,11 @@ static inline bool memblock_is_driver_managed(struct memblock_region *m)
return m->flags & MEMBLOCK_DRIVER_MANAGED;
 }
 
+static inline bool memblock_is_scratch(struct memblock_region *m)
+{
+   return m->flags & MEMBLOCK_SCRATCH;
+}
+
 int memblock_search_pfn_nid(unsigned long pfn, unsigned long *start_pfn,
unsigned long  *end_pfn);
 void __next_mem_pfn_range(int *idx, int nid, unsigned long *out_start_pfn,
@@ -610,5 +622,12 @@ static inline void early_memtest(phys_addr_t start, phys_addr_t end) { }
 static inline void memtest_report_meminfo(struct seq_file *m) { }
 #endif
 
+#ifdef CONFIG_MEMBLOCK_SCRATCH
+void memblock_set_scratch_only(void);
+void memblock_clear_scratch_only(void);
+#else
+static inline void memblock_set_scratch_only(void) { }
+static inline void memblock_clear_scratch_only(void) { }
+#endif
 
 #endif /* _LINUX_MEMBLOCK_H */
diff --git a/mm/Kconfig b/mm/Kconfig
index 89971a894b60..36f5e7d95195 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -513,6 +513,10 @@ config ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP
 config HAVE_MEMBLOCK_PHYS_MAP
bool
 
+# Enable memblock support for scratch memory which is needed for KHO
+config MEMBLOCK_SCRATCH
+   bool
+
 config HAVE_FAST_GUP
depends on MMU
bool
diff --git a/mm/memblock.c b/mm/memblock.c
index 5a88d6d24d79..e89e6c8f9d75 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -106,6 +106,13 @@ unsigned long min_low_pfn;
 unsigned long max_pfn;
 unsigned long long max_possible_pfn;
 
+#ifdef CONFIG_MEMBLOCK_SCRATCH
+/* When set to true, only allocate from MEMBLOCK_SCRATCH ranges */
+static bool scratch_only;
+#else
+#define scratch_only false
+#endif
+
 static struct memblock_region 

[PATCH] kexec: Use ALIGN macro instead of open-coding it

2023-12-12 Thread Yuntao Wang
Use ALIGN macro instead of open-coding it to improve code readability.

Signed-off-by: Yuntao Wang 
---
 kernel/kexec_core.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
index be5642a4ec49..0113436e4a3a 100644
--- a/kernel/kexec_core.c
+++ b/kernel/kexec_core.c
@@ -430,7 +430,7 @@ static struct page *kimage_alloc_crash_control_pages(struct kimage *image,
 
pages = NULL;
size = (1 << order) << PAGE_SHIFT;
-   hole_start = (image->control_page + (size - 1)) & ~(size - 1);
+   hole_start = ALIGN(image->control_page, size);
hole_end   = hole_start + size - 1;
while (hole_end <= crashk_res.end) {
unsigned long i;
@@ -447,7 +447,7 @@ static struct page *kimage_alloc_crash_control_pages(struct kimage *image,
mend   = mstart + image->segment[i].memsz - 1;
if ((hole_end >= mstart) && (hole_start <= mend)) {
/* Advance the hole to the end of the segment */
-   hole_start = (mend + (size - 1)) & ~(size - 1);
+   hole_start = ALIGN(mend, size);
hole_end   = hole_start + size - 1;
break;
}
-- 
2.43.0


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH v2 1/2] KVM: Use syscore_ops instead of reboot_notifier to hook restart/shutdown

2023-12-12 Thread Paolo Bonzini
On Tue, Dec 12, 2023 at 9:51 AM Gowans, James  wrote:
> 1. Does hardware_disable_nolock actually need to be done on *every* CPU
> or would the offlined ones be fine to ignore because they will be reset
> and the VMXE bit will be cleared that way? With cooperative CPU handover
> we probably do indeed want to do this on every CPU and not depend on
> resetting.

Offlined and onlined CPUs are handled via the CPU hotplug state machine,
which calls into kvm_online_cpu and kvm_offline_cpu.

Paolo




Re: [PATCH v2 1/2] KVM: Use syscore_ops instead of reboot_notifier to hook restart/shutdown

2023-12-12 Thread Gowans, James
On Mon, 2023-12-11 at 17:50 -0600, Eric W. Biederman wrote:
> "Gowans, James"  writes:
> 
> > On Mon, 2023-12-11 at 09:54 +0200, James Gowans wrote:
> > > > 
> > > > What problem are you running into with your rebase that worked with
> > > > reboot notifiers that is not working with syscore_shutdown?
> > > 
> > > Prior to this commit [1] which changed KVM from reboot notifiers to
> > > syscore_ops, KVM's reboot notifier shutdown callback was invoked on
> > > kexec via kernel_restart_prepare.
> > > 
> > > After this commit, KVM is not being shut down because currently the
> > > kexec flow does not call syscore_shutdown.
> > 
> > I think I missed what you're asking here; you're asking for a reproducer
> > for the specific failure?
> > 
> > 1. Launch a QEMU VM with -enable-kvm flag
> > 
> > 2. Do an immediate (-f flag) kexec:
> > kexec -f --reuse-cmdline ./bzImage
> > 
> > Somewhere after doing the RET to new kernel in the relocate_kernel asm
> > function the new kernel starts triple faulting; I can't exactly figure
> > out where but I think it has to do with the new kernel trying to modify
> > CR3 while the VMXE bit is still set in CR4 causing the triple fault.
> > 
> > If KVM has been shut down via the shutdown callback, or alternatively if
> > the QEMU process has actually been killed first (by not doing a -f exec)
> > then the VMXE bit is clear and the kexec goes smoothly.
> > 
> > So, TL;DR: kexec -f use to work with a KVM VM active, now it goes into a
> > triple fault crash.
> 
> You mentioned a rebase, so I thought you were backporting kernel patches.
> By rebase do you mean you porting your userspace to a newer kernel?

I've been working on some patches, and when I rebased my work-in-progress
patches onto latest master, kexec stopped working when KVM VMs exist.
Originally the WIP patches were based on an older stable version.

> 
> In any event I believe the bug with respect to kexec was introduced in
> commit 6f389a8f1dd2 ("PM / reboot: call syscore_shutdown() after
> disable_nonboot_cpus()").  That is where syscore_shutdown was removed
> from kernel_restart_prepare().
> 
> At this point it looks like someone just needs to add the missing
> syscore_shutdown call into kernel_kexec() right after
> migrate_to_reboot_cpu() is called.

Seems good and I'm happy to do that; one thing we need to check first:
are all CPUs online at that point? The commit message for
6f389a8f1dd2 ("PM / reboot: call syscore_shutdown() after 
disable_nonboot_cpus()")
speaks about: "one CPU on-line and interrupts disabled" when
syscore_shutdown is called. KVM's syscore shutdown hook does:

on_each_cpu(hardware_disable_nolock, NULL, 1);

... so that smells to me like it wants all the CPUs to be online at
kvm_shutdown point.

It's not clear to me:

1. Does hardware_disable_nolock actually need to be done on *every* CPU
or would the offlined ones be fine to ignore because they will be reset
and the VMXE bit will be cleared that way? With cooperative CPU handover
we probably do indeed want to do this on every CPU and not depend on
resetting.

2. Are CPUs actually offline at this point? When that commit was
authored there used to be a call to hardware_disable_nolock() but that's
not there anymore.

> 
> That said I am not seeing the reboot notifiers being called on the kexec
> path either so your issue with kvm might be deeper.

Previously it was called via:

kernel_kexec
  kernel_restart_prepare
blocking_notifier_call_chain(_notifier_list, SYS_RESTART, cmd);
  kvm_shutdown

JG