On Wed Sep 10, 2025 at 7:01 PM CEST, Alejandro Vallejo wrote:
> On Wed Sep 10, 2025 at 5:31 PM CEST, Jan Beulich wrote:
>> On 10.09.2025 17:16, Alejandro Vallejo wrote:
>>> On Wed Sep 10, 2025 at 5:02 PM CEST, Jan Beulich wrote:
>>>> On 10.09.2025 16:49, Alejandro Vallejo wrote:
>>>>> CPU hotplug relies on the guest having access to the legacy online CPU
>>>>> bitmap that QEMU provides at PIO 0xAF00. But PVH guests have no DM, so
>>>>> this causes the MADT to get corrupted due to spurious modifications of
>>>>> the "online" flag in MADT entries and the table checksum during the
>>>>> initial acpica passes.
>>>>
>>>> I don't understand this MADT corruption aspect, which - aiui - is why
>>>> there's a Fixes: tag here. The code change itself looks plausible.
>>> 
>>> When there's no DM to provide a real and honest online CPU bitmap on PIO 
>>> 0xAF00
>>> then we get all 1s (because there's no IOREQ server). Which confuses the GPE
>>> handler.
>>> 
>>> Somehow, the GPE handler is being triggered. Whether this is due to a real 
>>> SCI
>>> or just it being spuriously executed as part of the initial acpica pass, I 
>>> don't
>>> know.
>>> 
>>> Both statements combined means the checksum and online flags in the MADT get
>>> changed after initial parsing making it appear as-if all 128 CPUs were 
>>> plugged.
>>
>> I can follow this part (the online flags one, that is).
>>
>>> This patch makes the checksums be correct after acpica init.
>>
>> I'm still in trouble with this one. If MADT is modified in the process, 
>> there's
>> only one of two possible options:
>> 1) It's expected for the checksum to no longer be correct.
>> 2) The checksum is being fixed up in the process.
>> That's independent of being HVM or PVH and independent of guest boot or 
>> later.
>> (Of course there's a sub-variant of 2, where the adjusting of the checksum
>> would be broken, but that wouldn't be covered by your change.)
>>
>> Jan
>
> I see what you mean now. The checksum correction code LOOKS correct. But I
> wonder about the table length... We report a table as big as it needs to be,
> but the checksum update is done irrespective of FLG being inside the valid 
> range
> of the MADT. If a guest with 2 vCPUs (in max_vcpus) sees vCPU127 being 
> signalled
> that'd trigger the (unseen) online flag to be enabled and the checksum 
> adjusted,
> except the checksum must not being adjusted.
>
> I could add even more AML to cover that, but that'd be QEMU misbehaving (or
> being absent). This patch covers the latter case, but it might be good to
> change the commit message to reflect the real problem.
>
> Cheers,
> Alejandro

It doesn't quite add up in the mismatch though. There might be something else
lurking in there.

Regardless, I don't want this junk in PVH. Would a commit reword suffice to have
it acked?

Cheers,
Alejandro

Reply via email to