On 03.12.25 15:36, Oleksii Kurochko wrote:


> [You don't often get email from [email protected]. Learn why 
> this is important at https://aka.ms/LearnAboutSenderIdentification ]
> 
> Hello Oleksandr,

Hello Oleksii

> 
> On 12/3/25 12:03 PM, Oleksandr Tyshchenko wrote:
>> On 02.12.25 23:33, Grygorii Strashko wrote:
>>>
>>> On 02.12.25 21:32, Oleksandr Tyshchenko wrote:
>>>> Creating a guest with a high vCPU count (e.g., >32) fails because
>>>> the guest's device tree buffer (DOMU_DTB_SIZE) overflows during 
>>>> creation.
>>>> The FDT nodes for each vCPU quickly exhaust the 4KiB buffer,
>>>> causing a guest creation failure.
>>>>
>>>> Increase the buffer size to 16KiB to support guests up to
>>>> the MAX_VIRT_CPUS limit (128).
>>>>
>>>> Signed-off-by: Oleksandr Tyshchenko <[email protected]>
>>>> ---
>>>> Noticed when testing the boundary conditions for dom0less guest
>>>> creation on Arm64.
>>>>
>>>> Domain configuration:
>>>> fdt mknod /chosen domU0
>>>> fdt set /chosen/domU0 compatible "xen,domain"
>>>> fdt set /chosen/domU0 \#address-cells <0x2>
>>>> fdt set /chosen/domU0 \#size-cells <0x2>
>>>> fdt set /chosen/domU0 memory <0x0 0x10000 >
>>>> fdt set /chosen/domU0 cpus <33>
>>>> fdt set /chosen/domU0 vpl011
>>>> fdt mknod /chosen/domU0 module@40400000
>>>> fdt set /chosen/domU0/module@40400000 compatible  "multiboot,kernel"
>>>> "multiboot,module"
>>>> fdt set /chosen/domU0/module@40400000 reg <0x0 0x40400000 0x0 0x16000 >
>>>> fdt set /chosen/domU0/module@40400000 bootargs "console=ttyAMA0"
>>>>
>>>> Failure log:
>>>> (XEN) Xen dom0less mode detected
>>>> (XEN) *** LOADING DOMU cpus=33 memory=0x10000KB ***
>>>> (XEN) Loading d1 kernel from boot module @ 0000000040400000
>>>> (XEN) Allocating mappings totalling 64MB for d1:
>>>> (XEN) d1 BANK[0] 0x00000040000000-0x00000044000000 (64MB)
>>>> (XEN) Device tree generation failed (-22).
>>>> (XEN)
>>>> (XEN) ****************************************
>>>> (XEN) Panic on CPU 0:
>>>> (XEN) Could not set up domain domU0 (rc = -22)
>>>> (XEN) ****************************************
>>>> ---
>>>> ---
>>>>    xen/common/device-tree/dom0less-build.c | 8 +++++---
>>>>    1 file changed, 5 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/xen/common/device-tree/dom0less-build.c b/xen/common/
>>>> device-tree/dom0less-build.c
>>>> index 3f5b987ed8..d7d0a47b97 100644
>>>> --- a/xen/common/device-tree/dom0less-build.c
>>>> +++ b/xen/common/device-tree/dom0less-build.c
>>>> @@ -461,10 +461,12 @@ static int __init
>>>> domain_handle_dtb_boot_module(struct domain *d,
>>>>    /*
>>>>     * The max size for DT is 2MB. However, the generated DT is small
>>>> (not including
>>>> - * domU passthrough DT nodes whose size we account separately), 4KB
>>>> are enough
>>>> - * for now, but we might have to increase it in the future.
>>>> + * domU passthrough DT nodes whose size we account separately). The
>>>> size is
>>>> + * primarily driven by the number of vCPU nodes. The previous 4KiB
>>>> buffer was
>>>> + * insufficient for guests with high vCPU counts, so it has been
>>>> increased
>>>> + * to support up to the MAX_VIRT_CPUS limit (128).
>>>>     */
>>>> -#define DOMU_DTB_SIZE 4096
>>>> +#define DOMU_DTB_SIZE (4096 * 4)
>>> May be It wants Kconfig?
>>> Or some formula which accounts MAX_VIRT_CPUS?
>>
>> I agree that using a formula that accounts for MAX_VIRT_CPUS is the most
>> robust approach.
> 
> One option could be to detect the size at runtime, essentially, try to 
> allocate
> it, and if an error occurs, increase the fdtsize and try again. I don’t 
> really
> like this approach, but I wanted to mention it in case someone finds it 
> useful.
> The benefit of this approach is that if, in the future, something else such
> as a CPU node contributes to the final FDT size, we won’t need to update 
> the
> formula again.

I got your point and understand the goal, but I see the following 
concerns with that:

1. Xen has to do all the work to build the device tree, fail, throw all 
that work away, and then start over again. This wastes time during the 
system's boot-up process.

2. Boot-time code should be as deterministic and predictable as 
possible. A static, worst-case calculation is highly predictable, 
whereas a retry loop is not.

3. It adds logical complexity (error handling, looping, size increments) 
to what should be a straightforward setup step.

> 
>>
>> Here is the empirical data (by testing with the maximum number of device
>> tree nodes (e.g., hypervisor and reserved-memory nodes) and enabling all
>> optional CPU properties (e.g., clock-frequency)):
>>
>> cpus=1
>> (XEN) Final compacted FDT size is: 1586 bytes
>>
>> cpus=2
>> (XEN) Final compacted FDT size is: 1698 bytes
>>
>> cpus=32
>> (XEN) Final compacted FDT size is: 5058 bytes
>>
>> cpus=128
>> (XEN) Final compacted FDT size is: 15810 bytes
>>
>>
>> static int __init prepare_dtb_domU(struct domain *d, struct kernel_info
>> *kinfo)
>>    {
>>        int addrcells, sizecells;
>> @@ -569,6 +569,8 @@ static int __init prepare_dtb_domU(struct domain *d,
>> struct kernel_info *kinfo)
>>        if ( ret < 0 )
>>            goto err;
>>
>> +    printk("Final compacted FDT size is: %d bytes\n",
>> fdt_totalsize(kinfo->fdt));
>> +
>>        return 0;
>>
>>      err:
>>
>> This data shows (assuming my testing/calculations are correct):
>>
>> - A marginal cost of 112 bytes per vCPU in the final, compacted device 
>> tree.
>> - A fixed base size of 1474 bytes for all non-vCPU content.
>>
>> Based on that I would propose the following formula with the 
>> justification:
>>
>> /*
>>    * The size is calculated from a fixed baseline plus a scalable
>>    * portion for each potential vCPU node up to the system limit
>>    * (MAX_VIRT_CPUS), as the vCPU nodes are the primary consumer
>>    * of space.
>>    *
>>    * The baseline of 2KiB is a safe buffer for all non-vCPU FDT
>>    * content. The 128 bytes per vCPU is derived from a worst-case
>>    * analysis of the FDT construction-time size for a single
>>    * vCPU node.
>>    */
>> #define DOMU_DTB_SIZE (2048 + (MAX_VIRT_CPUS * 128))
>>
>> **********************************************
>>
>> Please tell me would you be happy with that?
> 
> I would also like to note that we probably want to add a BUILD_BUG_ON() 
> check
> to ensure that DOMU_DTB_SIZE is not larger than SZ_2M. Otherwise, we 
> would get
> a runtime error instead of a build-time failure, since there is code 
> that limits
> fdtsize to SZ_2M:
> 
>      /* Cap to max DT size if needed */
>      fdt_size = min(fdt_size, SZ_2M);


ok, sounds reasonable, will add:

BUILD_BUG_ON(DOMU_DTB_SIZE > SZ_2M);

> 
> Thanks.
> 
> ~ Oleksii
> 

Reply via email to