On Tue, Nov 24, 2020 at 03:42:28PM +0100, Jan Beulich wrote:
> On 24.11.2020 11:05, Jan Beulich wrote:
> > On 23.11.2020 18:39, Manuel Bouyer wrote:
> >> On Mon, Nov 23, 2020 at 06:06:10PM +0100, Roger Pau Monné wrote:
> >>> OK, I'm afraid this is likely too verbose and messes with the timings.
> >>>
> >>> I've been looking (again) into the code, and I found something weird
> >>> that I think could be related to the issue you are seeing, but haven't
> >>> managed to try to boot the NetBSD kernel provided in order to assert
> >>> whether it solves the issue or not (or even whether I'm able to
> >>> repro it). Would you mind giving the patch below a try?
> >>
> >> With this, I get the same hang but XEN outputs don't wake up the interrupt
> >> any more. The NetBSD counter shows only one interrupt for ioapic2 pin 2,
> >> while I would have about 8 at the time of the hang.
> >>
> >> So, now it looks like interrupts are blocked forever.
> > 
> > Which may be a good thing for debugging purposes, because now we have
> > a way to investigate what is actually blocking the interrupt's
> > delivery without having to worry about more output screwing the
> > overall picture.
> > 
> >> At
> >> http://www-soc.lip6.fr/~bouyer/xen-log5.txt
> >> you'll find the output of the 'i' key.
> > 
> > (XEN)    IRQ:  34 vec:59 IO-APIC-level   status=010 aff:{0}/{0-7} 
> > in-flight=1 d0: 34(-MM)
> > 
> > (XEN)     IRQ 34 Vec 89:
> > (XEN)       Apic 0x02, Pin  2: vec=59 delivery=LoPri dest=L status=1 
> > polarity=1 irr=1 trig=L mask=0 dest_id:00000001
> 
> Since it repeats in Manuel's latest dump, perhaps the odd combination
> of status=1 and irr=1 is to tell us something? It is my understanding
> that irr ought to become set only when delivery-status clears. Yet I
> don't know what to take from this...

My reading of this is that one interrupt was accepted by the lapic
(irr=1) and that there's a further interrupt pending that hasn't yet
been accepted by the lapic (status=1) because it's still serving the
previous one. But that's odd, because there's no matching vector in
ISR, so it looks like the IRR bit on the IO-APIC has somehow become
stale or out of sync with the lapic state.
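
As a rough illustration of the fields in the quoted dump line, the
sketch below decodes a 64-bit IO-APIC redirection table entry. It
assumes the classic 82093AA layout without interrupt remapping (8-bit
destination in bits 56-63). The struct and function names are
hypothetical and are not taken from the Xen sources.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical decoded view of one redirection table entry (RTE). */
struct rte_fields {
    uint8_t vector;          /* bits 0-7 */
    uint8_t delivery_mode;   /* bits 8-10, 1 = lowest priority */
    bool logical_dest;       /* bit 11, "dest=L" in the dump */
    bool delivery_status;    /* bit 12, "status" in the dump */
    bool active_low;         /* bit 13, "polarity" in the dump */
    bool remote_irr;         /* bit 14, "irr" in the dump */
    bool level_triggered;    /* bit 15, "trig=L" in the dump */
    bool masked;             /* bit 16 */
    uint8_t dest;            /* bits 56-63, assumed non-remapped format */
};

static struct rte_fields rte_decode(uint64_t rte)
{
    struct rte_fields f;

    f.vector          = rte & 0xff;
    f.delivery_mode   = (rte >> 8) & 0x7;
    f.logical_dest    = (rte >> 11) & 1;
    f.delivery_status = (rte >> 12) & 1;
    f.active_low      = (rte >> 13) & 1;
    f.remote_irr      = (rte >> 14) & 1;
    f.level_triggered = (rte >> 15) & 1;
    f.masked          = (rte >> 16) & 1;
    f.dest            = (rte >> 56) & 0xff;
    return f;
}

/*
 * Flags the combination discussed in this thread: an unmasked,
 * level-triggered pin with both delivery status and remote IRR set.
 */
static bool rte_pending_while_irr(const struct rte_fields *f)
{
    return f->level_triggered && !f->masked &&
           f->remote_irr && f->delivery_status;
}
```

Under those assumptions, the entry shown for Apic 0x02, Pin 2 would
correspond to the raw value 0x010000000000F959, and
rte_pending_while_irr() returns true for it.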

I'm also unsure how Xen has managed to reach this state; it shouldn't
be possible in the first place.

I don't think I can instrument the paths further with printfs, because
that's likely to change the behavior itself and spam the console. I
could however create a static buffer to trace the relevant actions and
then dump them all together with the 'i' debug key output.
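
A minimal sketch of such a static trace buffer, written as plain
userland C rather than hypervisor code. The event names, the entry
layout and the buffer size are all assumptions made for illustration.
It does no locking, so a real version would need a per-CPU buffer or
an atomic index, and would print with the hypervisor's own console
routine instead of fprintf().

```c
#include <stdint.h>
#include <stdio.h>

#define TRACE_ENTRIES 256u   /* power of two, so the index wraps with a mask */

/* Hypothetical set of actions worth recording. */
enum trace_event { TRACE_MASK, TRACE_UNMASK, TRACE_EOI, TRACE_INJECT };

struct trace_entry {
    uint64_t seq;            /* global sequence number of the record */
    enum trace_event event;
    unsigned int irq;
    unsigned int vector;
};

static struct trace_entry trace_buf[TRACE_ENTRIES];
static uint64_t trace_next;  /* sequence number of the next record */

/* Record one action; cheap enough not to disturb timings much. */
static void trace_record(enum trace_event ev, unsigned int irq,
                         unsigned int vector)
{
    struct trace_entry *e = &trace_buf[trace_next & (TRACE_ENTRIES - 1)];

    e->seq = trace_next++;
    e->event = ev;
    e->irq = irq;
    e->vector = vector;
}

/* Print the surviving records, oldest first; returns how many were printed. */
static unsigned int trace_dump(FILE *out)
{
    static const char *const names[] = { "mask", "unmask", "eoi", "inject" };
    uint64_t start = trace_next > TRACE_ENTRIES ? trace_next - TRACE_ENTRIES : 0;
    unsigned int n = 0;

    for (uint64_t i = start; i < trace_next; i++, n++) {
        const struct trace_entry *e = &trace_buf[i & (TRACE_ENTRIES - 1)];

        fprintf(out, "[%llu] %s irq=%u vec=%02x\n",
                (unsigned long long)e->seq, names[e->event],
                e->irq, e->vector);
    }
    return n;
}
```

The idea would be to call trace_record() from the paths of interest
and trace_dump() from the debug key handler, so that nothing is
printed while the interrupt is in flight.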

Sorry Manuel, you seem to have hit some kind of weird bug regarding
interrupt management. If you want to progress further with NetBSD PVH
dom0, it's likely to work on a different box, but I would ask that you
keep the current box so that we can continue debugging.

Roger.
