On 03.12.25 15:36, Oleksii Kurochko wrote:
> [You don't often get email from [email protected]. Learn why > this is important at https://aka.ms/LearnAboutSenderIdentification ] > > Hello Oleksandr, Hello Oleksii > > On 12/3/25 12:03 PM, Oleksandr Tyshchenko wrote: >> On 02.12.25 23:33, Grygorii Strashko wrote: >>> >>> On 02.12.25 21:32, Oleksandr Tyshchenko wrote: >>>> Creating a guest with a high vCPU count (e.g., >32) fails because >>>> the guest's device tree buffer (DOMU_DTB_SIZE) overflows during >>>> creation. >>>> The FDT nodes for each vCPU quickly exhaust the 4KiB buffer, >>>> causing a guest creation failure. >>>> >>>> Increase the buffer size to 16KiB to support guests up to >>>> the MAX_VIRT_CPUS limit (128). >>>> >>>> Signed-off-by: Oleksandr Tyshchenko <[email protected]> >>>> --- >>>> Noticed when testing the boundary conditions for dom0less guest >>>> creation on Arm64. >>>> >>>> Domain configuration: >>>> fdt mknod /chosen domU0 >>>> fdt set /chosen/domU0 compatible "xen,domain" >>>> fdt set /chosen/domU0 \#address-cells <0x2> >>>> fdt set /chosen/domU0 \#size-cells <0x2> >>>> fdt set /chosen/domU0 memory <0x0 0x10000 > >>>> fdt set /chosen/domU0 cpus <33> >>>> fdt set /chosen/domU0 vpl011 >>>> fdt mknod /chosen/domU0 module@40400000 >>>> fdt set /chosen/domU0/module@40400000 compatible "multiboot,kernel" >>>> "multiboot,module" >>>> fdt set /chosen/domU0/module@40400000 reg <0x0 0x40400000 0x0 0x16000 > >>>> fdt set /chosen/domU0/module@40400000 bootargs "console=ttyAMA0" >>>> >>>> Failure log: >>>> (XEN) Xen dom0less mode detected >>>> (XEN) *** LOADING DOMU cpus=33 memory=0x10000KB *** >>>> (XEN) Loading d1 kernel from boot module @ 0000000040400000 >>>> (XEN) Allocating mappings totalling 64MB for d1: >>>> (XEN) d1 BANK[0] 0x00000040000000-0x00000044000000 (64MB) >>>> (XEN) Device tree generation failed (-22). >>>> (XEN) >>>> (XEN) **************************************** >>>> (XEN) Panic on CPU 0: >>>> (XEN) Could not set up domain domU0 (rc = -22) >>>> (XEN) **************************************** >>>> --- >>>> --- >>>> xen/common/device-tree/dom0less-build.c | 8 +++++--- >>>> 1 file changed, 5 insertions(+), 3 deletions(-) >>>> >>>> diff --git a/xen/common/device-tree/dom0less-build.c b/xen/common/ >>>> device-tree/dom0less-build.c >>>> index 3f5b987ed8..d7d0a47b97 100644 >>>> --- a/xen/common/device-tree/dom0less-build.c >>>> +++ b/xen/common/device-tree/dom0less-build.c >>>> @@ -461,10 +461,12 @@ static int __init >>>> domain_handle_dtb_boot_module(struct domain *d, >>>> /* >>>> * The max size for DT is 2MB. However, the generated DT is small >>>> (not including >>>> - * domU passthrough DT nodes whose size we account separately), 4KB >>>> are enough >>>> - * for now, but we might have to increase it in the future. >>>> + * domU passthrough DT nodes whose size we account separately). The >>>> size is >>>> + * primarily driven by the number of vCPU nodes. The previous 4KiB >>>> buffer was >>>> + * insufficient for guests with high vCPU counts, so it has been >>>> increased >>>> + * to support up to the MAX_VIRT_CPUS limit (128). >>>> */ >>>> -#define DOMU_DTB_SIZE 4096 >>>> +#define DOMU_DTB_SIZE (4096 * 4) >>> May be It wants Kconfig? >>> Or some formula which accounts MAX_VIRT_CPUS? >> >> I agree that using a formula that accounts for MAX_VIRT_CPUS is the most >> robust approach. > > One option could be to detect the size at runtime, essentially, try to > allocate > it, and if an error occurs, increase the fdtsize and try again. I don’t > really > like this approach, but I wanted to mention it in case someone finds it > useful. > The benefit of this approach is that if, in the future, something else such > as a CPU node contributes to the final FDT size, we won’t need to update > the > formula again. I got your point and understand the goal, but I see the following concerns with that: 1. Xen has to do all the work to build the device tree, fail, throw all that work away, and then start over again. This wastes time during the system's boot-up process. 2. Boot-time code should be as deterministic and predictable as possible. A static, worst-case calculation is highly predictable, whereas a retry loop is not. 3. It adds logical complexity (error handling, looping, size increments) to what should be a straightforward setup step. > >> >> Here is the empirical data (by testing with the maximum number of device >> tree nodes (e.g., hypervisor and reserved-memory nodes) and enabling all >> optional CPU properties (e.g., clock-frequency)): >> >> cpus=1 >> (XEN) Final compacted FDT size is: 1586 bytes >> >> cpus=2 >> (XEN) Final compacted FDT size is: 1698 bytes >> >> cpus=32 >> (XEN) Final compacted FDT size is: 5058 bytes >> >> cpus=128 >> (XEN) Final compacted FDT size is: 15810 bytes >> >> >> static int __init prepare_dtb_domU(struct domain *d, struct kernel_info >> *kinfo) >> { >> int addrcells, sizecells; >> @@ -569,6 +569,8 @@ static int __init prepare_dtb_domU(struct domain *d, >> struct kernel_info *kinfo) >> if ( ret < 0 ) >> goto err; >> >> + printk("Final compacted FDT size is: %d bytes\n", >> fdt_totalsize(kinfo->fdt)); >> + >> return 0; >> >> err: >> >> This data shows (assuming my testing/calculations are correct): >> >> - A marginal cost of 112 bytes per vCPU in the final, compacted device >> tree. >> - A fixed base size of 1474 bytes for all non-vCPU content. >> >> Based on that I would propose the following formula with the >> justification: >> >> /* >> * The size is calculated from a fixed baseline plus a scalable >> * portion for each potential vCPU node up to the system limit >> * (MAX_VIRT_CPUS), as the vCPU nodes are the primary consumer >> * of space. >> * >> * The baseline of 2KiB is a safe buffer for all non-vCPU FDT >> * content. The 128 bytes per vCPU is derived from a worst-case >> * analysis of the FDT construction-time size for a single >> * vCPU node. >> */ >> #define DOMU_DTB_SIZE (2048 + (MAX_VIRT_CPUS * 128)) >> >> ********************************************** >> >> Please tell me would you be happy with that? > > I would also like to note that we probably want to add a BUILD_BUG_ON() > check > to ensure that DOMU_DTB_SIZE is not larger than SZ_2M. Otherwise, we > would get > a runtime error instead of a build-time failure, since there is code > that limits > fdtsize to SZ_2M: > > /* Cap to max DT size if needed */ > fdt_size = min(fdt_size, SZ_2M); ok, sounds reasonable, will add: BUILD_BUG_ON(DOMU_DTB_SIZE > SZ_2M); > > Thanks. > > ~ Oleksii >
