[Resend from an account which will let me...]

On 11/09/2025 4:46 pm, Alejandro Vallejo wrote:
> On Thu Sep 11, 2025 at 9:55 AM CEST, Jan Beulich wrote:
>> On 10.09.2025 23:57, Andrew Cooper wrote:
>>> On 10/09/2025 7:58 pm, Jason Andryuk wrote:
>>>> Hi,
>>>>
>>>> We're running Android as a guest and it's running the Compatibility
>>>> Test Suite. During the CTS, the Android domU is rebooted multiple
>>>> times.
>>>>
>>>> In the middle of the CTS, we've seen reboot fail. xl -vvv shows:
>>>> domainbuilder: detail: Could not allocate memory for HVM guest as we
>>>> cannot claim memory!
>>>> xc: error: panic: xg_dom_boot.c:119: xc_dom_boot_mem_init: can't
>>>> allocate low memory for domain: Out of memory
>>>> libxl: error: libxl_dom.c:581:libxl__build_dom: xc_dom_boot_mem_init
>>>> failed: Cannot allocate memory
>>>> domainbuilder: detail: xc_dom_release: called
>>>>
>>>> So the claim failed. The system has enough memory since we're just
>>>> rebooting the same VM. As a work around, I added sleep(1) + retry,
>>>> which works.
>>>>
>>>> The curious part is the memory allocation. For d2 to d5, we have:
>>>> domainbuilder: detail: range: start=0x0 end=0xf0000000
>>>> domainbuilder: detail: range: start=0x100000000 end=0x1af000000
>>>> xc: detail: PHYSICAL MEMORY ALLOCATION:
>>>> xc: detail: 4KB PAGES: 0x0000000000000000
>>>> xc: detail: 2MB PAGES: 0x00000000000006f8
>>>> xc: detail: 1GB PAGES: 0x0000000000000003
>>>>
>>>> But when we have to retry the claim for d6, there are no 1GB pages
>>>> used:
>>>> domainbuilder: detail: range: start=0x0 end=0xf0000000
>>>> domainbuilder: detail: range: start=0x100000000 end=0x1af000000
>>>> domainbuilder: detail: HVM claim failed! attempt 0
>>>> xc: detail: PHYSICAL MEMORY ALLOCATION:
>>>> xc: detail: 4KB PAGES: 0x0000000000002800
>>>> xc: detail: 2MB PAGES: 0x0000000000000ce4
>>>> xc: detail: 1GB PAGES: 0x0000000000000000
>>>>
>>>> But subsequent reboots for d7 and d8 go back to using 1GB pages.
>>>>
>>>> Does the change in memory allocation stick out to anyone?
>>>>
>>>> Unfortunately, I don't have insight into what the failing test is
>>>> doing.
>>>>
>>>> Xen doesn't seem set up to track the claim across reboot. Retrying
>>>> the claim works in our scenario since we have a controlled
>>>> configuration.
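
(As an aside, the sleep(1) + retry workaround described above amounts to
roughly the following. This is only an illustrative sketch: it assumes
the claim is issued via xc_domain_claim_pages() in the domain builder,
and the helper name, retry budget and delay are made up.)

#include <stdint.h>
#include <unistd.h>
#include <xenctrl.h>

/* Hypothetical wrapper: retry a failed claim a few times rather than
 * failing the domain build outright, giving background scrubbing a
 * chance to release enough clean memory. */
static int claim_pages_with_retry(xc_interface *xch, uint32_t domid,
                                  unsigned long nr_pages)
{
    int rc = -1;

    for ( unsigned int attempt = 0; attempt < 5; attempt++ )
    {
        rc = xc_domain_claim_pages(xch, domid, nr_pages);
        if ( rc == 0 )
            return 0;

        /* Claim failed - most likely memory freed by the previous
         * incarnation of the domain hasn't been scrubbed yet.  Wait a
         * second and try again. */
        sleep(1);
    }

    return rc;
}
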
>>> This looks to me like a known phenomenon. Ages back, a change was
>>> made in how Xen scrubs memory, from being synchronous in
>>> domain_kill(), to being asynchronous in the idle loop.
>>>
>>> The consequence being that, on an idle system, you can shut down and
>>> reboot the domain faster, but on a busy system you end up trying to
>>> allocate the new domain while memory from the old domain is still
>>> dirty.
>>>
>>> It is a classic example of a false optimisation, which looks great on
>>> an idle system only because the idle CPUs are swallowing the work.
>> I wouldn't call this a "false optimization",

Sorry - I was referring to things more generally. There are a huge
number of things that look like great ideas when you develop and demo
them on an idle system, and then they fall off a cliff on a busy system.

This is one. Releasing the domctl lock in domain_kill() was another
(this one did get reverted, IIRC). XenServer's attempt to compress the
migrate stream, etc.

Performance is hard, and definitely harder than functional fixes. All of
these were reasonable hypotheses and valid lines of experimentation, but
they were not tested outside of idle conditions.

All of these examples have behaviour on a busy system which is far worse
than not having the improvement in the first place. Hence the "false"
part of the optimisation.

>> but rather one ...
>>
>>> This impacts the ability to find a 1G aligned block of free memory to
>>> allocate a superpage with, and by the sounds of it, claims (which
>>> predate this behaviour change) aren't aware of the "to be scrubbed"
>>> queue and fail instead.
>>
>> ... which isn't sufficiently integrated with the rest of the allocator.
>>
>> Jan

> That'd depend on the threat model.

I'm pretty sure Kconfig post-dates the change in question here.

> At the very least there ought to be a Kconfig knob to control it. You
> can't really tell a customer "your data is gone from our systems"
> unless it really is gone. I'm guessing part of the rationale was
> speeding up the obnoxiously slow destroydomain, since it hogs a dom0
> vCPU until it's done and it can take many minutes in large domains.

It was Oracle being unhappy at domain shutdown on a 2T VM taking 20
minutes.

> IOW: It's a nice optimisation, but there's multiple use cases for
> specifically not wanting something like that.

My recommendation at some point after the fact was a parameter to
domain_kill(). In a mixed system, you might care about it for some
domains and not others.

Although it occurs to me now that it really needs to be an input to
domain_create(), because if you care about it on a VM, you care about
anything that gets freed, not just the things freed right at the end of
the domain's lifetime.

~Andrew
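
P.S. To make the domain_create() thought a bit more concrete, what I have
in mind is roughly the following. This is purely a sketch: the flag name,
its bit position and the scrub_block() helper are made up, nothing like it
exists today, and the freeing path is paraphrased rather than quoted.

/* Hypothetical creation flag: scrub this domain's memory synchronously
 * whenever it is freed, rather than deferring to the idle-loop
 * scrubber. */
#define XEN_DOMCTL_CDF_sync_scrub   (1U << 15)   /* illustrative bit only */

/* Then, wherever a domain's pages are handed back to the heap
 * (free_domheap_pages() or equivalent), something along the lines of: */
    struct domain *d = page_get_owner(pg);
    bool sync = d && (d->options & XEN_DOMCTL_CDF_sync_scrub);

    if ( sync )
        scrub_block(pg, 1u << order);  /* hypothetical: scrub in place now */

    free_heap_pages(pg, order, /* need_scrub */ !sync);

Because the flag travels with the domain from creation, it covers
everything the domain ever frees (ballooning, teardown, etc.), not just
the final pass in domain_kill().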