HPET: use single, global, low-priority vector for broadcast IRQ

Roger Pau Monné Tue, 21 Oct 2025 06:50:35 -0700

On Tue, Oct 21, 2025 at 08:42:13AM +0200, Jan Beulich wrote:
> On 20.10.2025 18:22, Roger Pau Monné wrote:
> > On Mon, Oct 20, 2025 at 01:18:34PM +0200, Jan Beulich wrote:
> >> Using dynamically allocated / maintained vectors has several downsides:
> >> - possible nesting of IRQs due to the effects of IRQ migration,
> >> - reduction of vectors available for devices,
> >> - IRQs not moving as intended if there's shortage of vectors,
> >> - higher runtime overhead.
> >>
> >> As the vector also doesn't need to be of any priority (first and foremost
> >> it really shouldn't be of higher or same priority as the timer IRQ, as
> >> that raises TIMER_SOFTIRQ anyway), avoid any "ordinary" vectors altogther
> >> and use a vector from the 0x10...0x1f exception vector space. Exception vs
> >> interrupt can easily be distinguished by checking for the presence of an
> >> error code.
> >>
> >> With a fixed vector, less updating is now necessary in
> >> set_channel_irq_affinity(); in particular channels don't need transiently
> >> masking anymore, as the necessary update is now atomic. To fully leverage
> >> this, however, we want to stop using hpet_msi_set_affinity() there. With
> >> the transient masking dropped, we're no longer at risk of missing events.
> >>
> >> In principle a change to setup_vector_irq() would be necessary, but only
> >> if we used low-prio vectors as direct-APIC ones. Since the change would be
> >> at best benign here, it is being omitted.
> >>
> >> Fixes: 996576b965cc ("xen: allow up to 16383 cpus")
> >> Signed-off-by: Jan Beulich <[email protected]>
> >> Release-Acked-by: Oleksii Kurochko<[email protected]>
> >> ---
> >> This is an alternative proposal to
> >> https://lists.xen.org/archives/html/xen-devel/2014-03/msg00399.html.
> >>
> >> Should we keep hpet_msi_set_affinity() at all? We'd better not have the
> >> generic IRQ subsystem play with our IRQs' affinities ... (If so, this
> >> likely would want to be a separate patch, though.)
> > 
> > I think that needs to become a no-op, with possibly an ASSERT?  Is it
> > possibly for dom0 to try to balance this IRQ?  I would think not.
> 
> I'd consider it an error if that was possible. But then the same goes for
> other Xen-internal IRQs, like the IOMMU ones. They all implement a
> .set_affinity hook ...


We need such hook for fixup_irqs() at least, so that interrupts can be
evacuated when an AP goes offline.  However movement of Xen owned IRQs
should be limited to Xen (if it's not already).

> >> @@ -476,19 +486,50 @@ static struct hpet_event_channel *hpet_g
> >>  static void set_channel_irq_affinity(struct hpet_event_channel *ch)
> >>  {
> >>      struct irq_desc *desc = irq_to_desc(ch->msi.irq);
> >> +    struct msi_msg msg = ch->msi.msg;
> >>  
> >>      ASSERT(!local_irq_is_enabled());
> >>      spin_lock(&desc->lock);
> >> -    hpet_msi_mask(desc);
> >> -    hpet_msi_set_affinity(desc, cpumask_of(ch->cpu));
> >> -    hpet_msi_unmask(desc);
> >> +
> >> +    per_cpu(vector_irq, ch->cpu)[HPET_BROADCAST_VECTOR] = ch->msi.irq;
> >> +
> >> +    /*
> >> +     * Open-coding a reduced form of hpet_msi_set_affinity() here.  With 
> >> the
> >> +     * actual update below (either of the IRTE or of [just] message 
> >> address;
> >> +     * with interrupt remapping message address/data don't change) now 
> >> being
> >> +     * atomic, we can avoid masking the IRQ around the update.  As a 
> >> result
> >> +     * we're no longer at risk of missing IRQs (provided 
> >> hpet_broadcast_enter()
> >> +     * keeps setting the new deadline only afterwards).
> >> +     */
> >> +    cpumask_copy(desc->arch.cpu_mask, cpumask_of(ch->cpu));
> >> +
> >>      spin_unlock(&desc->lock);
> >>  
> >> -    spin_unlock(&ch->lock);
> >> +    msg.dest32 = cpu_physical_id(ch->cpu);
> >> +    msg.address_lo &= ~MSI_ADDR_DEST_ID_MASK;
> >> +    msg.address_lo |= MSI_ADDR_DEST_ID(msg.dest32);
> >> +    if ( msg.dest32 != ch->msi.msg.dest32 )
> >> +    {
> >> +        ch->msi.msg = msg;
> >> +
> >> +        if ( iommu_intremap != iommu_intremap_off )
> >> +        {
> >> +            int rc = iommu_update_ire_from_msi(&ch->msi, &msg);
> >>  
> >> -    /* We may have missed an interrupt due to the temporary masking. */
> >> -    if ( ch->event_handler && ch->next_event < NOW() )
> >> -        ch->event_handler(ch);
> >> +            ASSERT(rc <= 0);
> >> +            if ( rc > 0 )
> >> +            {
> >> +                ASSERT(msg.data == hpet_read32(HPET_Tn_ROUTE(ch->idx)));
> >> +                ASSERT(msg.address_lo ==
> >> +                       hpet_read32(HPET_Tn_ROUTE(ch->idx) + 4));
> >> +            }
> > 
> > The sequence of asserts seem wrong here, the asserts inside of the rc
> > > 0 check will never trigger, because there's an ASSERT(rc <= 0)
> > ahead of them?
> 
> Hmm. My way of thinking was that if we get back 1 (which we shouldn't),
> we ought to check (and presumably fail on) data or address having changed.

Right, but the ASSERT(rc <= 0) will prevent reaching any of the
followup ASSERTs if rc == 1?  IOW, we possibly want:

            if ( rc > 0 )
            {
                dprintk(XENLOG_ERR,
                        "Unexpected HPET MSI setup returned: data: %#x address: 
%#lx expected data %#x address %#lx\n",
                        msg.data, msg.address,
                        ch->msi.msg.data, ch->msi.msg.address);
                ASSERT_UNREACHABLE();
                hpet_msi_mask(desc);
                hpet_write32(msg.data, HPET_Tn_ROUTE(ch->idx));
                hpet_write32(msg.address_lo, HPET_Tn_ROUTE(ch->idx) + 4);
                hpet_msi_unmask(desc);
            }
            ASSERT(!rc);

I'm unsure about attempting to propagate the returned values on release
builds, I guess it's slightly better than possibly using an outdated
RTE entry?  Albeit this should never happen.

Also, should the desc->arch.cpu_mask update only be done after the MSI
fields have correctly updated, so that in case of failure of
iommu_update_ire_from_msi(9 we could return early form the function
and avoid changing desc->arch.cpu_mask?

> Whereas when we get back 0, we're told "no change" anyway, and hence
> checking isn't even needed. Did I misunderstand the purpose of the zero
> vs positive return value here?
> 
> Of could I could switch to using "rc >= 0" anyway; I actually had it that
> way first, but then decided the extra checks would be redundant in the 0
> case.
> 
> >> +        }
> >> +        else
> >> +            hpet_write32(msg.address_lo, HPET_Tn_ROUTE(ch->idx) + 4);
> > 
> > If you avoid the HPET register update here you possibly need to make
> > sure that both fields are unconditionally written on the first call
> > after resume from suspension.  hpet_resume() needs to somehow taint
> > the channels to signal that a re-write of the address and data fields
> > is mandatory regardless of what iommu_update_ire_from_msi() has
> > returned.
> 
> hpet_broadcast_resume() calls __hpet_setup_msi_irq() (and hence
> hpet_msi_write()), which I thought is enough?

Oh, sorry, I was looking at hpet_resume(), not
hpet_broadcast_resume().

Thanks, Roger.

Re: [PATCH v2 for-4.21 2/9] x86/HPET: use single, global, low-priority vector for broadcast IRQ

Reply via email to