Re: [PATCH] kexec: remove the 2GB size limit on initrd file
On Wed, Sep 2, 2020 at 11:41 AM Robi Buranyi wrote: > > > > On Wed, Sep 2, 2020 at 11:18 AM Bhupesh Sharma wrote: >> >> Hi Robi, >> >> On Wed, Sep 2, 2020 at 1:05 PM Robi Buranyi wrote: >> > >> > Enable loading initrd files exceeding the INT_MAX size. Remove the >> > INT_MAX limit completely, and let any initrd load if it fits in the >> > memory. >> > >> > Signed-off-by: Robi Buranyi >> > --- >> > kernel/kexec_file.c | 4 ++-- >> > 1 file changed, 2 insertions(+), 2 deletions(-) >> > >> > diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c >> > index ca40bef75a61..659a9d165198 100644 >> > --- a/kernel/kexec_file.c >> > +++ b/kernel/kexec_file.c >> > @@ -222,7 +222,7 @@ kimage_file_prepare_segments(struct kimage *image, int >> > kernel_fd, int initrd_fd, >> > loff_t size; >> > >> > ret = kernel_read_file_from_fd(kernel_fd, &image->kernel_buf, >> > - &size, INT_MAX, >> > READING_KEXEC_IMAGE); >> > + &size, 0, READING_KEXEC_IMAGE); >> > if (ret) >> > return ret; >> > image->kernel_buf_len = size; >> > @@ -242,7 +242,7 @@ kimage_file_prepare_segments(struct kimage *image, int >> > kernel_fd, int initrd_fd, >> > /* It is possible that there no initramfs is being loaded */ >> > if (!(flags & KEXEC_FILE_NO_INITRAMFS)) { >> > ret = kernel_read_file_from_fd(initrd_fd, >> > &image->initrd_buf, >> > - &size, INT_MAX, >> > + &size, 0, >> >READING_KEXEC_INITRAMFS); >> > if (ret) >> > goto out; >> > -- >> > 2.28.0.402.g5ffc5be6b7-goog >> >> Can you share some background about this fix? For example why is it >> needed or what is failing at your end? >> I think a 2GB initramfs is a good enough size to accommodate while >> loading it via kexec_file_load(). Eventually the initramfs to be >> loaded is passed via user-space as a command line argument to the >> kexec_file_load() syscall, so we should be careful about the file >> sizes we might be loading here. >> >> I am just trying to understand what initramfs size limits you are >> working with from a kexec_file_load() p-o-v. 
>> >> Thanks, >> Bhupesh >> > > Hello Bhupesh, > > Sure, and thank you very much for the question. I will do my best to answer. > > I am experimenting with netbooting larger images of various distros. I am > looking for standalone configurations where I would like to have the system > booted without dependencies on NFS/CIFS shares and just have 'everything' on > ramdisk. In the netboot configuration I'm using, it's a u-root bootloader > (https://github.com/u-root/u-root) kexec-ing into this netboot image. The > machine HW configuration I have is typically several hundreds of GB RAM with > 100Gb+ network interface(s). > To sum up: I am netbooting (kexec-ing) from a 'tiny' 32MB linux bootloader > image into a huge 2GB+ linux image. > > My problem with the 2GB limit in the current kexec is that, in my opinion, > it's a somewhat arbitrary selection that no longer has anything to do with > architectural limitations. If a user wants to use a bigger image > (and has the necessary memory to load the image) - the kernel should not > limit it in my opinion. I was looking at other invocations of > kernel_read_file(...) - which this kernel_read_file_from_fd comes down to - > and there are also other use cases with no limit specified. I see your point > about being specified as a command line argument, but if, let's say, you > have a system with several terabytes of DDR, what's the point in limiting the > user at 2GB? We could increase the limit to 8GB for now (and that would > certainly be sufficient for a while) - but then we would possibly have the same > discussion in a few years again. > > Does that answer your question? > > Thanks, Robi
Re: [PATCH] kexec: remove the 2GB size limit on initrd file
Hi Robi, On Wed, Sep 2, 2020 at 1:05 PM Robi Buranyi wrote: > > Enable loading initrd files exceeding the INT_MAX size. Remove the > INT_MAX limit completely, and let any initrd load if it fits in the > memory. > > Signed-off-by: Robi Buranyi > --- > kernel/kexec_file.c | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c > index ca40bef75a61..659a9d165198 100644 > --- a/kernel/kexec_file.c > +++ b/kernel/kexec_file.c > @@ -222,7 +222,7 @@ kimage_file_prepare_segments(struct kimage *image, int > kernel_fd, int initrd_fd, > loff_t size; > > ret = kernel_read_file_from_fd(kernel_fd, &image->kernel_buf, > - &size, INT_MAX, READING_KEXEC_IMAGE); > + &size, 0, READING_KEXEC_IMAGE); > if (ret) > return ret; > image->kernel_buf_len = size; > @@ -242,7 +242,7 @@ kimage_file_prepare_segments(struct kimage *image, int > kernel_fd, int initrd_fd, > /* It is possible that there no initramfs is being loaded */ > if (!(flags & KEXEC_FILE_NO_INITRAMFS)) { > ret = kernel_read_file_from_fd(initrd_fd, &image->initrd_buf, > - &size, INT_MAX, > + &size, 0, >READING_KEXEC_INITRAMFS); > if (ret) > goto out; > -- > 2.28.0.402.g5ffc5be6b7-goog Can you share some background about this fix? For example why is it needed or what is failing at your end? I think a 2GB initramfs is a good enough size to accommodate while loading it via kexec_file_load(). Eventually the initramfs to be loaded is passed via user-space as a command line argument to the kexec_file_load() syscall, so we should be careful about the file sizes we might be loading here. I am just trying to understand what initramfs size limits you are working with from a kexec_file_load() p-o-v. Thanks, Bhupesh ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH v11 5/5] kdump: update Documentation about crashkernel
On Sat, Aug 01, 2020 at 09:08:56PM +0800, Chen Zhou wrote: > diff --git a/Documentation/admin-guide/kdump/kdump.rst > b/Documentation/admin-guide/kdump/kdump.rst > index 2da65fef2a1c..4b58f97351d5 100644 > --- a/Documentation/admin-guide/kdump/kdump.rst > +++ b/Documentation/admin-guide/kdump/kdump.rst > @@ -299,7 +299,15 @@ Boot into System Kernel > "crashkernel=64M@16M" tells the system kernel to reserve 64 MB of memory > starting at physical address 0x01000000 (16MB) for the dump-capture > kernel. > > - On x86 and x86_64, use "crashkernel=64M@16M". > + On x86 use "crashkernel=64M@16M". > + > + On x86_64, use "crashkernel=X" to select a region under 4G first, and > + fall back to reserving a region above 4G. > + We can also use "crashkernel=X,high" to select a region above 4G, which > + also tries to allocate at least 256M below 4G automatically and > + "crashkernel=Y,low" can be used to allocate specified size low memory. > + Use "crashkernel=Y@X" if you really have to reserve memory from specified > + start address X. > > On ppc64, use "crashkernel=128M@32M". > > @@ -316,8 +324,15 @@ Boot into System Kernel > kernel will automatically locate the crash kernel image within the > first 512MB of RAM if X is not given. > > - On arm64, use "crashkernel=Y[@X]". Note that the start address of > - the kernel, X if explicitly specified, must be aligned to 2MiB (0x200000). > + On arm64, use "crashkernel=X" to try low allocation in ZONE_DMA, and > + fall back to high allocation if it fails. And go for high allocation > + directly if the required size is too large. If crash_base is outside I wouldn't mention crash_base in the admin guide. That's an implementation detail really and admins are not supposed to read the source code to make sense of the documentation. ZONE_DMA is also a kernel internal, so you'd need to define what it is for arm64. At least the DMA and DMA32 zones are printed during kernel boot. > + ZONE_DMA, try to allocate at least 256M in ZONE_DMA automatically. 
> + "crashkernel=Y,low" can be used to allocate specified size low memory. > + For non-RPi4 platforms, change ZONE_DMA mentioned above to ZONE_DMA32. > + Use "crashkernel=Y@X" if you really have to reserve memory from > + specified start address X. Note that the start address of the kernel, > + X if explicitly specified, must be aligned to 2MiB (0x200000). -- Catalin
Re: [PATCH v11 3/5] arm64: kdump: reimplement crashkernel=X
On Sat, Aug 01, 2020 at 09:08:54PM +0800, Chen Zhou wrote: > There are the following issues in arm64 kdump: > 1. We use crashkernel=X to reserve crashkernel below 4G, which > will fail when there is not enough low memory. > 2. If reserving crashkernel above 4G, in this case, the crash dump > kernel will fail to boot because there is no low memory available > for allocation. > 3. Since commit 1a8e1cef7603 ("arm64: use both ZONE_DMA and ZONE_DMA32"), > if the memory reserved for the crash dump kernel falls in ZONE_DMA32, > devices in the crash dump kernel that need ZONE_DMA will fail > to allocate. > > To solve these issues, change the behavior of crashkernel=X. > crashkernel=X tries low allocation in ZONE_DMA, and falls back to > high allocation if it fails. > > If the required size X is too large and leads to very little free memory > in ZONE_DMA after low allocation, the system may not work normally. > So add a threshold and go for high allocation directly if the required > size is too large. The threshold is set to half of > the low memory. > > If crash_base is outside ZONE_DMA, try to allocate at least 256M in > ZONE_DMA automatically. "crashkernel=Y,low" can be used to allocate > specified size low memory. Except for the threshold to keep ZONE_DMA memory, reserve_crashkernel() looks very close to the x86 version. Shall we try to make this generic as well? In the first instance, you could avoid the threshold check if it takes an explicit ",high" option. -- Catalin
Re: [PATCH v11 5/5] kdump: update Documentation about crashkernel
On Tue, Aug 18, 2020 at 03:07:04PM +0800, chenzhou wrote: > diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c > index a8e34d97a894..4df18c7ea438 100644 > --- a/arch/arm64/mm/init.c > +++ b/arch/arm64/mm/init.c > @@ -147,7 +147,7 @@ static void __init reserve_crashkernel(void) > } > memblock_reserve(crash_base, crash_size); > > - if (crash_base >= CRASH_ADDR_LOW_MAX) { > + if (memstart_addr < CRASH_ADDR_LOW_MAX && crash_base >= > CRASH_ADDR_LOW_MAX) { > const char *rename = "Crash kernel (low)"; Since CRASH_ADDR_LOW_MAX is defined as arm64_dma32_phys_limit and such limit is always greater than memstart_addr, this additional check doesn't do anything. See my other reply on how ZONE_DMA32 is created on arm64. -- Catalin
Re: [PATCH] arm: kdump: Add DT properties to crash dump kernel's DTB
On Wed, Sep 02, 2020 at 05:48:00PM +0200, Geert Uytterhoeven wrote: > On Wed, Sep 2, 2020 at 5:41 PM Geert Uytterhoeven > wrote: > > Pass the following properties to the crash dump kernel, to provide a > > modern DT interface between kexec and the crash dump kernel: > > > > - linux,elfcorehdr: ELF core header segment, similar to the > > "elfcorehdr=" kernel parameter. > > - linux,usable-memory-range: Usable memory reserved for the crash dump > > kernel. > > This makes the memory reservation explicit, so Linux no longer needs > > to mask the program counter, and rely on the "mem=" kernel parameter > > to obtain the start and size of usable memory. > > > > For backwards compatibility, the "elfcorehdr=" and "mem=" kernel > > parameters are still appended to the kernel command line. > > > > Loosely based on the ARM64 version by Akashi Takahiro. > > > > Signed-off-by: Geert Uytterhoeven > > The counterpart patch for linux is "[PATCH] ARM: Parse kdump DT > properties" > (https://lore.kernel.org/r/20200902154538.6807-1-geert+rene...@glider.be) Thanks Geert, I think it would be best to hold off on the user-space patch until the kernel patch is accepted.
Re: [PATCH v2] kexec/kexec.c: Add missing close() call
On Wed, Aug 26, 2020 at 09:53:34AM -0600, Khalid Aziz wrote: > On 8/25/20 6:51 PM, Youling Tang wrote: > > Add missing close() call. > > > > Signed-off-by: Youling Tang > > --- > > kexec/kexec.c | 3 +++ > > 1 file changed, 3 insertions(+) > > > > diff --git a/kexec/kexec.c b/kexec/kexec.c > > index a62b362..bb88caa 100644 > > --- a/kexec/kexec.c > > +++ b/kexec/kexec.c > > @@ -585,6 +585,7 @@ static char *slurp_file_generic(const char *filename, > > off_t *r_size, > > die("Read on %s ended before stat said it should\n", filename); > > > > *r_size = size; > > + close(fd); > > return buf; > > } > > > > @@ -1257,12 +1258,14 @@ static int do_kexec_file_load(int fileind, int > > argc, char **argv, > > if (i == file_types) { > > fprintf(stderr, "Cannot determine the file type " "of %s\n", > > kernel); > > + close(kernel_fd); > > return EFAILED; > > } > > > > ret = file_type[i].load(argc, argv, kernel_buf, kernel_size, &info); > > if (ret < 0) { > > fprintf(stderr, "Cannot load %s\n", kernel); > > + close(kernel_fd); > > return ret; > > } > > > > > > This looks good to me. > > Reviewed-by: Khalid Aziz Thanks, applied.
Re: [PATCH] AUTHORS: Update email address for Khalid
On Tue, Aug 25, 2020 at 08:29:57AM -0600, Khalid Aziz wrote: > New email address for Khalid Aziz. > > Signed-off-by: Khalid Aziz Thanks Khalid, applied.
Re: [PATCH] kexec-tools: Fix resource leaks in kexec.c
On Tue, Aug 25, 2020 at 04:43:23PM +0800, Youling Tang wrote: > Add free() call when handling some error returns. > > Signed-off-by: Youling Tang Is there a value in cleaning up like this immediately before exiting?
Re: [PATCH] arm: kdump: Add DT properties to crash dump kernel's DTB
On Wed, Sep 2, 2020 at 5:41 PM Geert Uytterhoeven wrote: > Pass the following properties to the crash dump kernel, to provide a > modern DT interface between kexec and the crash dump kernel: > > - linux,elfcorehdr: ELF core header segment, similar to the > "elfcorehdr=" kernel parameter. > - linux,usable-memory-range: Usable memory reserved for the crash dump > kernel. > This makes the memory reservation explicit, so Linux no longer needs > to mask the program counter, and rely on the "mem=" kernel parameter > to obtain the start and size of usable memory. > > For backwards compatibility, the "elfcorehdr=" and "mem=" kernel > parameters are still appended to the kernel command line. > > Loosely based on the ARM64 version by Akashi Takahiro. > > Signed-off-by: Geert Uytterhoeven The counterpart patch for linux is "[PATCH] ARM: Parse kdump DT properties" (https://lore.kernel.org/r/20200902154538.6807-1-geert+rene...@glider.be) Gr{oetje,eeting}s, Geert -- Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org In personal conversations with technical people, I call myself a hacker. But when I'm talking to journalists I just say "programmer" or something like that. -- Linus Torvalds
[PATCH] ARM: Parse kdump DT properties
Parse the following DT properties in the crash dump kernel, to provide a modern interface between kexec and the crash dump kernel: - linux,elfcorehdr: ELF core header segment, similar to the "elfcorehdr=" kernel parameter. - linux,usable-memory-range: Usable memory reserved for the crash dump kernel. This makes the memory reservation explicit. If present, Linux no longer needs to mask the program counter, and rely on the "mem=" kernel parameter to obtain the start and size of usable memory. For backwards compatibility, the traditional method to derive the start of memory is still used if "linux,usable-memory-range" is absent, and the "elfcorehdr=" and "mem=" kernel parameters are still parsed. Loosely based on the ARM64 version by Akashi Takahiro. Signed-off-by: Geert Uytterhoeven --- This depends on "[PATCH v9] ARM: boot: Validate start of physical memory against DTB" (https://lore.kernel.org/r/20200902153606.13652-1-geert+rene...@glider.be) The counterpart patch for kexec-tools is "PATCH] arm: kdump: Add DT properties to crash dump kernel's DTB" (https://lore.kernel.org/r/20200902154129.6358-1-geert+rene...@glider.be) Documentation/devicetree/bindings/chosen.txt | 4 +- arch/arm/boot/compressed/fdt_get_mem_start.c | 18 +++- arch/arm/mm/init.c | 90 3 files changed, 107 insertions(+), 5 deletions(-) diff --git a/Documentation/devicetree/bindings/chosen.txt b/Documentation/devicetree/bindings/chosen.txt index 45e79172a646c537..ba75c58413667760 100644 --- a/Documentation/devicetree/bindings/chosen.txt +++ b/Documentation/devicetree/bindings/chosen.txt @@ -79,7 +79,7 @@ a different secondary CPU release mechanism) linux,usable-memory-range - -This property (arm64 only) holds a base address and size, describing a +This property (arm and arm64 only) holds a base address and size, describing a limited region in which memory may be considered available for use by the kernel. Memory outside of this range is not available for use. 
@@ -106,7 +106,7 @@ respectively, of the root node. linux,elfcorehdr -This property (currently used only on arm64) holds the memory range, +This property (currently used only on arm and arm64) holds the memory range, the address and the size, of the elf core header which mainly describes the panicked kernel's memory layout as PT_LOAD segments of elf format. e.g. diff --git a/arch/arm/boot/compressed/fdt_get_mem_start.c b/arch/arm/boot/compressed/fdt_get_mem_start.c index fd501ec3c14b4ae4..344fee8bfc7ab438 100644 --- a/arch/arm/boot/compressed/fdt_get_mem_start.c +++ b/arch/arm/boot/compressed/fdt_get_mem_start.c @@ -40,7 +40,7 @@ static uint32_t get_size(const void *fdt, const char *name) unsigned long fdt_get_mem_start(const void *fdt) { uint32_t addr_size, size_size, mem_start, masked_pc; - const __be32 *memory; + const __be32 *usable, *memory; if (!fdt) return -1; @@ -52,8 +52,16 @@ unsigned long fdt_get_mem_start(const void *fdt) addr_size = get_size(fdt, "#address-cells"); size_size = get_size(fdt, "#size-cells"); - /* Find the first memory node */ - memory = get_prop(fdt, "/memory", "reg", addr_size + size_size); + /* +* Find start of memory, either from "linux,usable-memory-range" (for +* crashkernel), or from the first memory node +*/ + usable = get_prop(fdt, "/chosen", "linux,usable-memory-range", + addr_size + size_size); + if (usable) + memory = usable; + else + memory = get_prop(fdt, "/memory", "reg", addr_size + size_size); if (!memory) return -1; @@ -62,6 +70,10 @@ unsigned long fdt_get_mem_start(const void *fdt) /* Must be a multiple of 16 MiB for phys/virt patching */ mem_start = round_up(mem_start, SZ_16M); + /* "linux,usable-memory-range" is considered authoritative */ + if (usable) + return mem_start; + /* * Traditionally, the start of memory is obtained by masking the * program counter. 
Prefer that method, unless it would yield an diff --git a/arch/arm/mm/init.c b/arch/arm/mm/init.c index 000c1b48e9734f4e..407fd847dbfc9280 100644 --- a/arch/arm/mm/init.c +++ b/arch/arm/mm/init.c @@ -4,6 +4,7 @@ * * Copyright (C) 1995-2005 Russell King */ +#include #include #include #include @@ -210,8 +211,95 @@ void check_cpu_icache_size(int cpuid) } #endif +#ifdef CONFIG_OF_EARLY_FLATTREE +static int __init early_init_dt_scan_usablemem(unsigned long node, +const char *uname, int depth, void *data) +{ +struct memblock_region *usablemem = data; +const __be32 *reg; +int len; + +if (depth != 1 || strcmp(uname, "chosen") != 0) +return 0; + +reg = of_get_flat_dt_prop(node, "linux,usable-memory
[PATCH] arm: kdump: Add DT properties to crash dump kernel's DTB
Pass the following properties to the crash dump kernel, to provide a modern DT interface between kexec and the crash dump kernel: - linux,elfcorehdr: ELF core header segment, similar to the "elfcorehdr=" kernel parameter. - linux,usable-memory-range: Usable memory reserved for the crash dump kernel. This makes the memory reservation explicit, so Linux no longer needs to mask the program counter, and rely on the "mem=" kernel parameter to obtain the start and size of usable memory. For backwards compatibility, the "elfcorehdr=" and "mem=" kernel parameters are still appended to the kernel command line. Loosely based on the ARM64 version by Akashi Takahiro. Signed-off-by: Geert Uytterhoeven --- kexec/arch/arm/crashdump-arm.c| 10 ++- kexec/arch/arm/crashdump-arm.h| 2 + kexec/arch/arm/kexec-zImage-arm.c | 137 ++ 3 files changed, 147 insertions(+), 2 deletions(-) diff --git a/kexec/arch/arm/crashdump-arm.c b/kexec/arch/arm/crashdump-arm.c index daa478868b4976d7..1ec1826c24b52ed9 100644 --- a/kexec/arch/arm/crashdump-arm.c +++ b/kexec/arch/arm/crashdump-arm.c @@ -54,7 +54,7 @@ struct memory_ranges usablemem_rgns = { }; /* The boot-time physical memory range reserved for crashkernel region */ -static struct memory_range crash_kernel_mem; +struct memory_range crash_kernel_mem; /* reserved regions */ #define CRASH_MAX_RESERVED_RANGES 2 @@ -64,6 +64,8 @@ static struct memory_ranges crash_reserved_rgns = { .ranges = crash_reserved_ranges, }; +struct memory_range elfcorehdr_mem; + static struct crash_elf_info elf_info = { .class = ELFCLASS32, .data = ELFDATANATIVE, @@ -307,7 +309,11 @@ int load_crashdump_segments(struct kexec_info *info, char *mod_cmdline) crash_kernel_mem.start, crash_kernel_mem.end, -1, 0); - dbgprintf("elfcorehdr: %#lx\n", elfcorehdr); + elfcorehdr_mem.start = elfcorehdr; + elfcorehdr_mem.end = elfcorehdr + bufsz - 1; + + dbgprintf("elfcorehdr 0x%llx-0x%llx\n", elfcorehdr_mem.start, + elfcorehdr_mem.end); cmdline_add_elfcorehdr(mod_cmdline, elfcorehdr); /* 
diff --git a/kexec/arch/arm/crashdump-arm.h b/kexec/arch/arm/crashdump-arm.h index 6e87b13c4b8c86fe..bbdf8bf9de58eb5d 100644 --- a/kexec/arch/arm/crashdump-arm.h +++ b/kexec/arch/arm/crashdump-arm.h @@ -13,6 +13,8 @@ extern "C" { extern struct memory_ranges usablemem_rgns; +extern struct memory_range crash_kernel_mem; +extern struct memory_range elfcorehdr_mem; struct kexec_info; diff --git a/kexec/arch/arm/kexec-zImage-arm.c b/kexec/arch/arm/kexec-zImage-arm.c index 925a9be120aa401a..e056f79124240e40 100644 --- a/kexec/arch/arm/kexec-zImage-arm.c +++ b/kexec/arch/arm/kexec-zImage-arm.c @@ -4,6 +4,7 @@ */ #define _GNU_SOURCE #define _XOPEN_SOURCE +#include #include #include #include @@ -374,6 +375,103 @@ static const struct zimage_tag *find_extension_tag(const char *buf, off_t len, return NULL; } +static int get_cells_size(void *fdt, uint32_t *address_cells, + uint32_t *size_cells) +{ + int nodeoffset; + const uint32_t *prop = NULL; + int prop_len; + + /* default values */ + *address_cells = 1; + *size_cells = 1; + + /* under root node */ + nodeoffset = fdt_path_offset(fdt, "/"); + if (nodeoffset < 0) + return -1; + + prop = fdt_getprop(fdt, nodeoffset, "#address-cells", &prop_len); + if (prop) { + if (prop_len != sizeof(*prop)) + return -1; + + *address_cells = fdt32_to_cpu(*prop); + } + + prop = fdt_getprop(fdt, nodeoffset, "#size-cells", &prop_len); + if (prop) { + if (prop_len != sizeof(*prop)) + return -1; + + *size_cells = fdt32_to_cpu(*prop); + } + + dbgprintf("%s: #address-cells:%d #size-cells:%d\n", __func__, + *address_cells, *size_cells); + return 0; +} + +static bool cells_size_fitted(uint32_t address_cells, uint32_t size_cells, + struct memory_range *range) +{ + dbgprintf("%s: %llx-%llx\n", __func__, range->start, range->end); + + /* if *_cells >= 2, cells can hold 64-bit values anyway */ + if ((address_cells == 1) && (range->start >= (1ULL << 32))) + return false; + + if ((size_cells == 1) && + ((range->end - range->start + 1) >= (1ULL << 32))) + 
return false; + + return true; +} + +static void fill_property(void *buf, uint64_t val, uint32_t cells) +{ + uint32_t val32; + int i; + + if (cells == 1) { + val32 = cpu_to_fdt32((uint32_t)val); + memcpy(buf, &val32, sizeof(uint32_t)); + } else { + for (i = 0; +
Re: [PATCH next v3 7/8] printk: reimplement log_cont using record extension
On Mon 2020-08-31 03:16:57, John Ogness wrote: > Use the record extending feature of the ringbuffer to implement > continuous messages. This preserves the existing continuous message > behavior. > > Signed-off-by: John Ogness > --- > kernel/printk/printk.c | 98 +- > 1 file changed, 20 insertions(+), 78 deletions(-) It looks simple from the printk() side ;-) Reviewed-by: Petr Mladek Best Regards, Petr
Re: state names: vas: Re: [PATCH next v3 6/8] printk: ringbuffer: add finalization/extension support
On Wed 2020-09-02 13:26:14, John Ogness wrote: > On 2020-09-02, Petr Mladek wrote: > >> +static struct prb_desc *desc_reopen_last(struct prb_desc_ring *desc_ring, > >> + u32 caller_id, unsigned long *id_out) > >> +{ > >> + unsigned long prev_state_val; > >> + enum desc_state d_state; > >> + struct prb_desc desc; > >> + struct prb_desc *d; > >> + unsigned long id; > >> + > >> + id = atomic_long_read(&desc_ring->head_id); > >> + > >> + /* > >> + * To minimize unnecessarily reopening a descriptor, first check the > >> + * descriptor is in the correct state and has a matching caller ID. > >> + */ > >> + d_state = desc_read(desc_ring, id, &desc); > >> + if (d_state != desc_reserved || > >> + !(atomic_long_read(&desc.state_var) & DESC_COMMIT_MASK) || > > > > First, define 5 desc_states, something like: > > > > enum desc_state { > > desc_miss = -1, /* ID mismatch */ > > desc_modified = 0x0, /* reserved, being modified by writer */ > > I prefer the "desc_reserved" name. It may or may not have been modified yet. Yeah, "desc_reserved" sounds better. I probably just wanted to free my fantasy from the current code ;-) > > desc_committed = 0x1, /* committed by writer, could get reopened */ > > desc_finalized = 0x2, /* committed, can no longer get modified */ > > desc_reusable = 0x3, /* free, not yet used by any writer */ > > }; > > > > Second, only 4 variants of the 3 state bits are currently used. > > It means that two bits are enough and they might use exactly > > the above names: > > > > I mean to do something like: > > > > #define DESC_SV_BITS (sizeof(unsigned long) * 8) > > #define DESC_SV(desc_state) ((unsigned long)desc_state << (DESC_SV_BITS - > > 2)) > > #define DESC_ST(state_val) ((unsigned long)state_val >> (DESC_SV_BITS - 2)) > > This makes sense and will get us back the bit we lost because of > finalization. Yup. Which is good especially on 32-bit architectures. 
> I am wondering if VMCOREINFO should include a DESC_FLAGS_MASK so that > crash tools could at least successfully iterate the ID's, even if they > didn't know what all the flag values mean (in the case that more bits > are added later). Good point. I am just not sure whether they should try to read all IDs or should refuse to read anything when a new bit is added. Well, I really hope that we will not need new states anytime soon. It would need a really strong reason. I personally can't think of any use case. pr_cont() was special because it was on the writer side. All other steps of the printk rework are on the reader side. I believe that we are getting close with all the ring buffer code. And I have a good feeling about it. Best Regards, Petr
misc: was: [PATCH next v3 6/8] printk: ringbuffer: add finalization/extension support
On Mon 2020-08-31 03:16:56, John Ogness wrote: > Add support for extending the newest data block. For this, introduce > a new finalization state flag (DESC_FINAL_MASK) that denotes when a > descriptor may not be extended, i.e. is finalized. > > Three new memory barrier pairs are introduced. Two of them are not > significant because they are alternate path memory barriers that > exactly correspond to existing memory barriers. > > But the third (_prb_commit:B / desc_reserve:D) is new and guarantees > that descriptors will always be finalized, either because a > descriptor setting DESC_COMMIT_MASK sees that there is a newer > descriptor and so finalizes itself or because a new descriptor being > reserved sees that the previous descriptor has DESC_COMMIT_MASK set > and finalizes that descriptor. Just for the record: all the barriers look good to me. There are always full barriers when we change DESC_COMMIT_MASK, which guarantees consistency of the descriptor and data rings. The only relaxed modification is adding DESC_FINAL_MASK. It is OK because it can be done only after DESC_COMMIT_MASK has been set using a full barrier, which means that all changes to the rings have already been made visible before. > --- a/kernel/printk/printk_ringbuffer.c > +++ b/kernel/printk/printk_ringbuffer.c > + > +/** > + * prb_reserve_in_last() - Re-reserve and extend the space in the ringbuffer > + * used by the newest record. > + * > + * @e: The entry structure to setup. > + * @rb:The ringbuffer to re-reserve and extend data in. > + * @r: The record structure to allocate buffers for. > + * @caller_id: The caller ID of the caller (reserving writer). > + * > + * This is the public function available to writers to re-reserve and extend > + * data. > + * > + * The writer specifies the text size to extend (not the new total size) by > + * setting the @text_buf_size field of @r. Dictionaries cannot be extended so Better might be to say that extending dictionaries is not supported. 
> + * @dict_buf_size of @r should be set to 0. To ensure proper initialization > of > + * @r, prb_rec_init_wr() should be used. > + * > + * This function will fail if @caller_id does not match the caller ID of the > + * newest record. In that case the caller must reserve new data using > + * prb_reserve(). > + * > + * Context: Any context. Disables local interrupts on success. > + * Return: true if text data could be extended, otherwise false. > + * > + * On success: > + * > + * - @r->text_buf points to the beginning of the entire text buffer. > + * > + * - @r->text_buf_len is set to the new total size of the buffer. s/buf_len/buf_size/ > + * - @r->dict_buf and @r->dict_buf_len are cleared because extending > + * the dict buffer is not supported. Same here. > + * > + * - @r->info is not touched so that @r->info->text_len could be used > + * to append the text. > + * > + * - prb_record_text_space() can be used on @e to query the new > + * actually used space. > + * > + * Important: All @r->info fields will already be set with the current values > + *for the record. I.e. @r->info->text_len will be less than > + *@text_buf_size and @r->info->dict_len may be set, even though > + *@dict_buf_size is 0. Writers can use @r->info->text_len to know > + *where concatenation begins and writers should update > + *@r->info->text_len after concatenating. > + */ > +bool prb_reserve_in_last(struct prb_reserved_entry *e, struct > printk_ringbuffer *rb, > + struct printk_record *r, u32 caller_id) > +{ > + unsigned int data_size; > + struct prb_desc *d; > + unsigned long id; > + > + local_irq_save(e->irqflags); > + > + /* Transition the newest descriptor back to the reserved state. 
*/ > + d = desc_reopen_last(&rb->desc_ring, caller_id, &id); > + if (!d) { > + local_irq_restore(e->irqflags); > + goto fail_reopen; > + } > + > + /* Now the writer has exclusive access: LMM(prb_reserve_in_last:A) */ > + > + /* > + * Set the @e fields here so that prb_commit() can be used if > + * anything fails from now on. > + */ > + e->rb = rb; > + e->id = id; > + > + /* > + * desc_reopen_last() checked the caller_id, but there was no > + * exclusive access at that point. The descriptor may have > + * changed since then. > + */ > + if (caller_id != d->info.caller_id) > + goto fail; > + > + if (BLK_DATALESS(&d->text_blk_lpos)) { We do not clear d->info when reopening the descriptor. I would at least add here a consistency check similar to the one below: if (WARN_ON_ONCE(d->info.text_len)) { pr_warn_once("wrong data size (%u, expecting 0) for data\n", d->info.text_len); d->info.text_len = 0; } > +
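The reset Petr suggests can be modeled as a tiny standalone sketch. The struct and function names here are trimmed-down stand-ins for `d->info` and the proposed check, not the kernel types:

```c
#include <stdio.h>

/* Minimal model of the suggested consistency check: d->info is not
 * cleared when a descriptor is reopened, so a dataless block may still
 * carry a stale text_len that has to be reset before appending.
 * "info_model" and "check_dataless_text_len" are illustrative names. */

struct info_model {
	unsigned int text_len;
};

static void check_dataless_text_len(struct info_model *info)
{
	if (info->text_len) {
		fprintf(stderr,
			"wrong data size (%u, expecting 0) for data\n",
			info->text_len);
		info->text_len = 0;
	}
}
```

With the check in place, a stale value is reported once and cleared, so the subsequent append starts from offset 0 as expected.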
Re: state names: was: Re: [PATCH next v3 6/8] printk: ringbuffer: add finalization/extension support
On 2020-09-02, Petr Mladek wrote: >> +static struct prb_desc *desc_reopen_last(struct prb_desc_ring *desc_ring, >> + u32 caller_id, unsigned long *id_out) >> +{ >> +unsigned long prev_state_val; >> +enum desc_state d_state; >> +struct prb_desc desc; >> +struct prb_desc *d; >> +unsigned long id; >> + >> +id = atomic_long_read(&desc_ring->head_id); >> + >> +/* >> + * To minimize unnecessarily reopening a descriptor, first check the >> + * descriptor is in the correct state and has a matching caller ID. >> + */ >> +d_state = desc_read(desc_ring, id, &desc); >> +if (d_state != desc_reserved || >> +!(atomic_long_read(&desc.state_var) & DESC_COMMIT_MASK) || > > This looks like a hack. And a similar extra check of the bit is needed > also in desc_read(), see > https://lore.kernel.org/r/878sdvq8kd@jogness.linutronix.de Agreed. > I have actually been getting less and less happy with the inconsistency > between names of the bits and states. > > ... > > First, define 5 desc_states, something like: > > enum desc_state { > desc_miss = -1, /* ID mismatch */ > desc_modified = 0x0, /* reserved, being modified by writer */ I prefer the "desc_reserved" name. It may or may not have been modified yet. > desc_committed = 0x1, /* committed by writer, could get reopened */ > desc_finalized = 0x2, /* committed, can no longer get modified */ > desc_reusable = 0x3, /* free, not yet used by any writer */ > }; > > Second, only 4 variants of the 3 state bits are currently used. > It means that two bits are enough and they might use exactly > the above names: > > I mean to do something like: > > #define DESC_SV_BITS (sizeof(unsigned long) * 8) > #define DESC_SV(desc_state) ((unsigned long)desc_state << (DESC_SV_BITS - > 2)) > #define DESC_ST(state_val) ((unsigned long)state_val >> (DESC_SV_BITS - 2)) This makes sense and will get us back the bit we lost because of finalization. > I am sorry that I did not come up with this earlier. I know how > painful it is to rework bigger patchsets. 
But it affects format > of the ring buffer, so we should do it early. Agreed. I am wondering if VMCOREINFO should include a DESC_FLAGS_MASK so that crash tools could at least successfully iterate the IDs, even if they didn't know what all the flag values mean (in the case that more bits are added later). > PS: I am still in the middle of the review. It looks good so far. I wanted to > send this early and separately because it is a bigger change. Thanks for the heads up. John Ogness ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
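A sketch of what an exported mask would buy a crash tool: the descriptor ID can be recovered from a raw state value without understanding the individual flag bits. The bit layout here (flags in the top two bits) and the helper name are assumptions for illustration, not the kernel's definitions:

```c
/* Hypothetical DESC_FLAGS_MASK as it might appear in VMCOREINFO: the
 * top two bits of state_var hold flag/state bits, everything below is
 * the descriptor ID. A crash tool can strip the mask and iterate IDs
 * even if new flag bits are added later. */

#define SV_BITS         (sizeof(unsigned long) * 8)
#define DESC_FLAGS_MASK (3UL << (SV_BITS - 2))

static unsigned long desc_id_from_state(unsigned long state_val)
{
	return state_val & ~DESC_FLAGS_MASK;
}
```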
Re: [PATCH next v3 6/8] printk: ringbuffer: add finalization/extension support
On Mon 2020-08-31 03:16:56, John Ogness wrote: > Add support for extending the newest data block. For this, introduce > a new finalization state flag (DESC_FINAL_MASK) that denotes when a > descriptor may not be extended, i.e. is finalized. > > --- a/kernel/printk/printk_ringbuffer.c > +++ b/kernel/printk/printk_ringbuffer.c > +/* > + * Try to resize an existing data block associated with the descriptor > + * specified by @id. If the resized datablock should become wrapped, it > + * copies the old data to the new data block. > + * > + * Fail if this is not the last allocated data block or if there is not > + * enough space or it is not possible make enough space. > + * > + * Return a pointer to the beginning of the entire data buffer or NULL on > + * failure. > + */ > +static char *data_realloc(struct printk_ringbuffer *rb, > + struct prb_data_ring *data_ring, unsigned int size, > + struct prb_data_blk_lpos *blk_lpos, unsigned long id) > +{ > + struct prb_data_block *blk; > + unsigned long head_lpos; > + unsigned long next_lpos; > + bool wrapped; > + > + /* Reallocation only works if @blk_lpos is the newest data block. */ > + head_lpos = atomic_long_read(&data_ring->head_lpos); > + if (head_lpos != blk_lpos->next) > + return NULL; > + > + /* Keep track if @blk_lpos was a wrapping data block. */ > + wrapped = (DATA_WRAPS(data_ring, blk_lpos->begin) != > DATA_WRAPS(data_ring, blk_lpos->next)); > + > + size = to_blk_size(size); > + > + next_lpos = get_next_lpos(data_ring, blk_lpos->begin, size); > + > + /* If the data block does not increase, there is nothing to do. */ > + if (next_lpos == head_lpos) { > + blk = to_block(data_ring, blk_lpos->begin); > + return &blk->data[0]; > + } We might get here even when the data are shrunk, but the code below is not fully ready for this. > + if (!data_push_tail(rb, data_ring, next_lpos - DATA_SIZE(data_ring))) > + return NULL; > + > + /* The memory barrier involvement is the same as data_alloc:A. 
*/ > + if (!atomic_long_try_cmpxchg(&data_ring->head_lpos, &head_lpos, > + next_lpos)) { /* LMM(data_realloc:A) */ > + return NULL; > + } > + > + blk = to_block(data_ring, blk_lpos->begin); > + > + if (DATA_WRAPS(data_ring, blk_lpos->begin) != DATA_WRAPS(data_ring, > next_lpos)) { > + struct prb_data_block *old_blk = blk; > + > + /* Wrapping data blocks store their data at the beginning. */ > + blk = to_block(data_ring, 0); > + > + /* > + * Store the ID on the wrapped block for consistency. > + * The printk_ringbuffer does not actually use it. > + */ > + blk->id = id; > + > + if (!wrapped) { > + /* > + * Since the allocated space is now in the newly > + * created wrapping data block, copy the content > + * from the old data block. > + */ > + memcpy(&blk->data[0], &old_blk->data[0], > +(blk_lpos->next - blk_lpos->begin) - > sizeof(blk->id)); It took me quite some time to check whether this code is correct or not. First, I wondered whether the size was correctly calculated. It is, because the original block was not wrapped, so lpos->next - lpos->begin defines the real data buffer size. Second, I wondered whether the target block might be smaller than the original (the above check allows shrinking). It can't be smaller, because then the new block wouldn't be wrapped either. Sigh, it is a bit tricky. And there is a 3rd possibility that is not handled. The original block might be wrapped but the new shrunken one might no longer be wrapped. Then we would need to copy the data the other way. I know that this function is not currently used for shrinking. But I would prefer to be on the safe side. Either make the copying generic, e.g. by calculating the real data size using the code from get_data(). Or simply completely refuse shrinking by the above check. Best Regards, Petr > + } > + } > + > + blk_lpos->next = next_lpos; > + > + return &blk->data[0]; > +} > + > /* Return the number of bytes used by a data block. 
*/ > static unsigned int space_used(struct prb_data_ring *data_ring, > struct prb_data_blk_lpos *blk_lpos)
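Petr's size reasoning above can be checked with a small model of the lpos arithmetic. This mirrors the logic of get_data() as described in the mails; the ring size, helper names, and ID-header size below are illustrative assumptions, not the kernel's constants:

```c
/* Model of the data-block size rules: for a non-wrapping block the
 * payload is (next - begin) minus the ID header; for a wrapping block
 * the payload sits at the start of the new wrap, so only the part of
 * @next past the wrap boundary counts. */

#define RING_SIZE_BITS 8
#define RING_SIZE      (1UL << RING_SIZE_BITS)
#define ID_SIZE        sizeof(unsigned long)

static unsigned long data_wraps(unsigned long lpos)
{
	return lpos >> RING_SIZE_BITS;
}

static unsigned long data_index(unsigned long lpos)
{
	return lpos & (RING_SIZE - 1);
}

/* Payload bytes of the block [begin, next), mirroring get_data(). */
static unsigned long real_data_size(unsigned long begin, unsigned long next)
{
	if (data_wraps(begin) == data_wraps(next))
		return (next - begin) - ID_SIZE;	/* non-wrapping */
	return data_index(next) - ID_SIZE;		/* wrapping */
}
```

This is exactly why `lpos->next - lpos->begin` is only the real buffer size for a non-wrapped block, and why a shrink that un-wraps a block would need the copy to go the other way.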
state names: was: Re: [PATCH next v3 6/8] printk: ringbuffer: add finalization/extension support
On Mon 2020-08-31 03:16:56, John Ogness wrote: > Add support for extending the newest data block. For this, introduce > a new finalization state flag (DESC_FINAL_MASK) that denotes when a > descriptor may not be extended, i.e. is finalized. > --- a/kernel/printk/printk_ringbuffer.c > +++ b/kernel/printk/printk_ringbuffer.c > @@ -49,14 +49,16 @@ > * Descriptors have three states: > * > * reserved > - * A writer is modifying the record. > + * A writer is modifying the record. Internally represented as either "0" > + * or "DESC_COMMIT_MASK". We should explain the difference between the two values. It might be enough to add something like: See "Descriptor Finalization" section for more details." > @@ -79,6 +81,25 @@ > * committed or reusable queried state. This makes it possible that a valid > * sequence number of the tail is always available. > * > + * Descriptor Finalization > + * ~~~ > + * When a writer calls the commit function prb_commit(), the record may still > + * continue to be in the reserved queried state. In order for that record to > + * enter into the committed queried state, that record also must be > finalized. > + * A record can be finalized by three different scenarios: > + * > + * 1) A writer can finalize its record immediately by calling > + * prb_final_commit() instead of prb_commit(). > + * > + * 2) When a new record is reserved and the previous record has been > + * committed via prb_commit(), that previous record is finalized. > + * > + * 3) When a record is committed via prb_commit() and a newer record > + * already exists, the record being committed is finalized. > + * > + * Until a record is finalized (represented by "DESC_FINAL_MASK"), a writer > + * may "reopen" that record and extend it with more data. > + * > * Data Rings > * ~~ > * The two data rings (text and dictionary) function identically. They exist [...] > +/* > + * Attempt to remove the commit flag so that the record can be modified by a > + * writer again. 
This is only possible if the descriptor is not yet > finalized. > + * > + * Note that on success, the queried state did not change. A non-finalized > + * record (even with the commit flag set) is in the reserved queried state. > + */ > +static struct prb_desc *desc_reopen_last(struct prb_desc_ring *desc_ring, > + u32 caller_id, unsigned long *id_out) > +{ > + unsigned long prev_state_val; > + enum desc_state d_state; > + struct prb_desc desc; > + struct prb_desc *d; > + unsigned long id; > + > + id = atomic_long_read(&desc_ring->head_id); > + > + /* > + * To minimize unnecessarily reopening a descriptor, first check the > + * descriptor is in the correct state and has a matching caller ID. > + */ > + d_state = desc_read(desc_ring, id, &desc); > + if (d_state != desc_reserved || > + !(atomic_long_read(&desc.state_var) & DESC_COMMIT_MASK) || This looks like a hack. And a similar extra check of the bit is needed also in desc_read(), see https://lore.kernel.org/r/878sdvq8kd@jogness.linutronix.de I have actually been getting less and less happy with the inconsistency between names of the bits and states. Sigh, you will hate me because this would mean a bigger change. IMHO, it would be much cleaner and help with long-term maintainability if we do the following two changes: First, define 5 desc_states, something like: enum desc_state { desc_miss = -1, /* ID mismatch */ desc_modified = 0x0, /* reserved, being modified by writer */ desc_committed = 0x1, /* committed by writer, could get reopened */ desc_finalized = 0x2, /* committed, can no longer get modified */ desc_reusable = 0x3, /* free, not yet used by any writer */ }; Second, only 4 variants of the 3 state bits are currently used. 
It means that two bits are enough and they might use exactly the above names: I mean to do something like: #define DESC_SV_BITS (sizeof(unsigned long) * 8) #define DESC_SV(desc_state) ((unsigned long)desc_state << (DESC_SV_BITS - 2)) #define DESC_ST(state_val) ((unsigned long)state_val >> (DESC_SV_BITS - 2)) Then we could have: static enum desc_state get_desc_state(unsigned long id, unsigned long state_val) { if (id != DESC_ID(state_val)) return desc_miss; return DESC_ST(state_val); } and use in the code: unsigned long val_committed = id | DESC_SV(desc_committed); or do #define DESC_SV(id, desc_state) (id | (unsigned long)desc_state << (DESC_SV_BITS - 2)) and then use DESC_SV(id, DESC_COMMITTED). I am sorry that I did not come up with this earlier. I know how painful it is to rework bigger patchsets. But it affects the format of the ring buffer, so we should do it early. Best Regards, Petr PS: I am still in the middle of the review. It looks good so far. I wanted to send this early and separately because it is a bigger change.
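The proposed two-bit packing can be sketched as a standalone program. This follows the macros from the mail (using the `DESC_SV(id, state)` variant); `DESC_ID()`/`DESC_ID_MASK` and the exact enum values are assumptions consistent with the proposal, not the final kernel code:

```c
/* Sketch of the proposed encoding: the top two bits of state_var hold
 * one of four descriptor states, the remaining bits hold the
 * descriptor ID. */

enum desc_state {
	desc_miss      = -1,  /* ID mismatch */
	desc_modified  = 0x0, /* reserved, being modified by writer */
	desc_committed = 0x1, /* committed by writer, could get reopened */
	desc_finalized = 0x2, /* committed, can no longer get modified */
	desc_reusable  = 0x3, /* free, not yet used by any writer */
};

#define DESC_SV_BITS (sizeof(unsigned long) * 8)
#define DESC_SV(id, state) \
	((unsigned long)(id) | ((unsigned long)(state) << (DESC_SV_BITS - 2)))
#define DESC_ST(sv)  ((enum desc_state)((unsigned long)(sv) >> (DESC_SV_BITS - 2)))
#define DESC_ID_MASK (~(3UL << (DESC_SV_BITS - 2)))
#define DESC_ID(sv)  ((unsigned long)(sv) & DESC_ID_MASK)

static enum desc_state get_desc_state(unsigned long id,
				      unsigned long state_val)
{
	if (id != DESC_ID(state_val))
		return desc_miss;
	return DESC_ST(state_val);
}
```

Compared to the three separate flag bits, this encoding spends only two bits on state, which is the bit regained after finalization was added.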
[PATCH] kexec: remove the 2GB size limit on initrd file
Enable loading initrd files exceeding the INT_MAX size. Remove the INT_MAX limit completely, and let any initrd load if it fits in the memory. Signed-off-by: Robi Buranyi --- kernel/kexec_file.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c index ca40bef75a61..659a9d165198 100644 --- a/kernel/kexec_file.c +++ b/kernel/kexec_file.c @@ -222,7 +222,7 @@ kimage_file_prepare_segments(struct kimage *image, int kernel_fd, int initrd_fd, loff_t size; ret = kernel_read_file_from_fd(kernel_fd, &image->kernel_buf, - &size, INT_MAX, READING_KEXEC_IMAGE); + &size, 0, READING_KEXEC_IMAGE); if (ret) return ret; image->kernel_buf_len = size; @@ -242,7 +242,7 @@ kimage_file_prepare_segments(struct kimage *image, int kernel_fd, int initrd_fd, /* It is possible that there no initramfs is being loaded */ if (!(flags & KEXEC_FILE_NO_INITRAMFS)) { ret = kernel_read_file_from_fd(initrd_fd, &image->initrd_buf, - &size, INT_MAX, + &size, 0, READING_KEXEC_INITRAMFS); if (ret) goto out; -- 2.28.0.402.g5ffc5be6b7-goog
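The behavior change can be illustrated with a minimal model of the size check: in kernel_read_file() a positive max-size argument caps the file, while 0 (what the patch now passes) disables the cap. The helper below mimics only that check; its name is illustrative:

```c
#include <limits.h>

/* Model of the cap being relaxed: a positive max_size limits the file
 * size, 0 means unlimited. "fits" is a stand-in, not a kernel API. */

static int fits(long long file_size, long long max_size)
{
	if (max_size > 0 && file_size > max_size)
		return 0; /* would be rejected (e.g. -EFBIG) */
	return 1;
}
```

With the old INT_MAX argument a 3 GiB initrd is rejected; with 0 it is accepted and only available memory limits the load.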