On 18.11.2022 15:27, Andrew Cooper wrote:
> On 18/11/2022 12:54, Jan Beulich wrote:
>> On 18.11.2022 13:33, Andrew Cooper wrote:
>>> On 18/11/2022 10:31, Jan Beulich wrote:
>>>> Linux'es relatively new use of HVMOP_set_evtchn_upcall_vector has
>>>> exposed a problem with the marking of the respective vector as
>>>> pending: For quite some time Linux has been checking whether any stale
>>>> ISR or IRR bits would still be set while preparing the LAPIC for use.
>>>> This check is now triggering on the upcall vector, as the registration,
>>>> at least for APs, happens before the LAPIC is actually enabled.
>>>>
>>>> In software-disabled state an LAPIC would not accept any interrupt
>>>> requests and hence no IRR bit would newly become set while in this
>>>> state. As a result it is also wrong for us to mark the upcall vector as
>>>> having a pending request when the vLAPIC is in this state.
>>> I agree with this.
>>>
>>>> To compensate for the "enabled" check added to the assertion logic, add
>>>> logic to (conditionally) mark the upcall vector as having a request
>>>> pending at the time the LAPIC is being software-enabled by the guest.
>>> But this, I don't think is appropriate.
>>>
>>> The point of raising on enable is allegedly to work around setup race
>>> conditions.  I'm unconvinced by this reasoning, but it is what it is,
>>> and the stated behaviour is to raise there and then.
>>>
>>> If a guest enables evtchn while the LAPIC is disabled, then the
>>> interrupt is lost.  Like every other interrupt in an x86 system.
>> Edge triggered ones you mean, I suppose, but yes.
> 
> For IO-APIC systems, you mostly lose line interrupts too, don't you?
> 
> The line will remain pending at the IO-APIC, but nothing in the system
> will unwedge until someone polls the IO-APIC.
> 
> Either way...
> 
>>
>>> I don't think there is any credible way a guest kernel author can expect
>>> the weird evtchn edgecase to wait for an arbitrary point in the future,
>>> and it's a corner case that I think is worth not keeping.
>> Well - did you look at 7b5b8ca7dffd ("x86/upcall: inject a spurious event
>> after setting upcall vector"), referenced by the Fixes: tag? The issue is
>> that with evtchn_upcall_pending once set, there would never again be a
>> notification.
> 
> Ok, so we do need to do something.
> 
>>  So if what you say is to be the model we follow, then that
>> earlier change was perhaps wrong as well. Instead it should then have
>> been a guest change (as also implicit from your reply) to clear
>> evtchn_upcall_pending after vCPU info registration (there) or LAPIC
>> enabling (here), perhaps by way of "manually" invoking the handling of
>> that pending event, or by issuing a self-IPI with that vector.
>> Especially the LAPIC enabling case would then be yet another Xen-specific
>> on a guest code path which better wouldn't have to be aware of Xen. 
> 
> Without trying to prescribe how to fix this specific issue, wherever
> possible we should be trying to limit the Xen-isms from non-Xen areas. 
> There's a whole lot of poorly described and surprising behaviours which
> have not stood the test of time.
> 
> In this case, it seems that we have yet another x86 PV-ism which hasn't
> translated well x86 HVM.  Specifically, we're trying to overlay an
> entirely shared-memory (and delayed return-to-guest) interrupt
> controller onto one which is properly constructed to handle events in
> realtime.
> 
> 
> I even got as far as writing that maybe leaving it as-is was the best
> option (principle of least surprise for Xen developers), but our
> "friend" apic acceleration strikes again.
> 
> Xen doesn't always get a VMExit when the guest clears SW_DISABLE,
> because microcode may have accelerated it.

But as per "APIC-Write Emulation" in the SDM we'd still get an APIC-write
VM exit. If we didn't, how would our internal accounting of APIC enabled
state (VLAPIC_SW_DISABLED) work? And the neighboring (to where I'm adding
the new code) pt_may_unmask_irq() call then also wouldn't occur.

I'm actually pretty sure we do too much in this case - in particular none
of the vlapic_set_reg() should be necessary. But we certainly can't get
away with doing nothing, and hence we depend on that VM exit to actually
occur. Plus simply making the vlapic_set_reg() conditional also likely
wouldn't do any good, so if anything we may want to split
vlapic_reg_write() and invoke only the "2nd half" from
vlapic_apicv_write().

Jan

> A consequence of this observation is that Xen cannot have
> non-LAPIC-archtiectural behaviour in the vlapic emulation.  So I think
> we need to find a solution to this problem that doesn't hook APIC_SPIV.
> 
> ~Andrew


Reply via email to