On Mon, Oct 27, 2025 at 11:23:58AM +0100, Jan Beulich wrote:
> On 24.10.2025 15:24, Roger Pau Monné wrote:
> > On Thu, Oct 23, 2025 at 05:50:17PM +0200, Jan Beulich wrote:
> >> @@ -343,6 +347,12 @@ static int __init hpet_setup_msi_irq(str
> >>      u32 cfg = hpet_read32(HPET_Tn_CFG(ch->idx));
> >>      irq_desc_t *desc = irq_to_desc(ch->msi.irq);
> >>  
> >> +    clear_irq_vector(ch->msi.irq);
> >> +    ret = bind_irq_vector(ch->msi.irq, HPET_BROADCAST_VECTOR, &cpu_online_map);
> > 
> > By passing cpu_online_map here, it leads to _bind_irq_vector() doing:
> > 
> > cpumask_copy(desc->arch.cpu_mask, &cpu_online_map);
> > 
> > Which strictly speaking is wrong.  However this is just a cosmetic
> > issue until the irq is used for the first time, at which point it will
> > be assigned to a concrete CPU.
> > 
> > You could do:
> > 
> > cpumask_clear(desc->arch.cpu_mask);
> > cpumask_set_cpu(cpumask_any(&cpu_online_map), desc->arch.cpu_mask);
> > 
> > (Or equivalent)
> > 
> > To assign the interrupt to a concrete CPU and reflect it in the
> > cpu_mask after the bind_irq_vector() call, but I can live with it
> > being like this.  I have patches to adjust _bind_irq_vector() myself,
> > which I hope I will be able to post soon.
> 
> Hmm, I wrongly memorized hpet_broadcast_init() as being pre-SMP-init only.
> It has three call sites:
> - mwait_idle_init(), called from cpuidle_presmp_init(),
> - amd_cpuidle_init(), calling in only when invoked the very first time,
>   which is again from cpuidle_presmp_init(),
> - _disable_pit_irq(), called from the regular initcall disable_pit_irq().
> I.e. for the latter you're right that the CPU mask is too broad (in only a
> cosmetic way though). Would you be okay if I used cpumask_of(0) in place
> of &cpu_online_map?

Using cpumask_of(0) would be OK: the per-CPU vector_irq array will be
updated ahead of the interrupt being assigned to a concrete CPU, and
hence the update doesn't need to be done for every online CPU in
_bind_irq_vector().

Strictly speaking it would be more accurate to provide an empty CPU
mask here, as the interrupt isn't targeting any CPU yet.  Using CPU 0
is just a placeholder, which seems fine for the purpose.
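
To make the outcome concrete, the sketch I have in mind (assuming the
hpet_setup_msi_irq() hunk otherwise stays as in your patch) would be:

    clear_irq_vector(ch->msi.irq);
    ret = bind_irq_vector(ch->msi.irq, HPET_BROADCAST_VECTOR, cpumask_of(0));
    /*
     * desc->arch.cpu_mask now names CPU 0 only, purely as a placeholder;
     * the real target is established later, when set_channel_irq_affinity()
     * updates per_cpu(vector_irq, ch->cpu)[HPET_BROADCAST_VECTOR] and
     * desc->arch.cpu_mask.
     */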

> >> @@ -472,19 +482,50 @@ static struct hpet_event_channel *hpet_g
> >>  static void set_channel_irq_affinity(struct hpet_event_channel *ch)
> >>  {
> >>      struct irq_desc *desc = irq_to_desc(ch->msi.irq);
> >> +    struct msi_msg msg = ch->msi.msg;
> >>  
> >>      ASSERT(!local_irq_is_enabled());
> >>      spin_lock(&desc->lock);
> >> -    hpet_msi_mask(desc);
> >> -    hpet_msi_set_affinity(desc, cpumask_of(ch->cpu));
> >> -    hpet_msi_unmask(desc);
> >> +
> >> +    per_cpu(vector_irq, ch->cpu)[HPET_BROADCAST_VECTOR] = ch->msi.irq;
> >> +
> >> +    /*
> >> +     * Open-coding a reduced form of hpet_msi_set_affinity() here.  With the
> >> +     * actual update below (either of the IRTE or of [just] message address;
> >> +     * with interrupt remapping message address/data don't change) now being
> >> +     * atomic, we can avoid masking the IRQ around the update.  As a result
> >> +     * we're no longer at risk of missing IRQs (provided hpet_broadcast_enter()
> >> +     * keeps setting the new deadline only afterwards).
> >> +     */
> >> +    cpumask_copy(desc->arch.cpu_mask, cpumask_of(ch->cpu));
> >> +
> >>      spin_unlock(&desc->lock);
> >>  
> >> -    spin_unlock(&ch->lock);
> >> +    msg.dest32 = cpu_physical_id(ch->cpu);
> >> +    msg.address_lo &= ~MSI_ADDR_DEST_ID_MASK;
> >> +    msg.address_lo |= MSI_ADDR_DEST_ID(msg.dest32);
> >> +    if ( msg.dest32 != ch->msi.msg.dest32 )
> >> +    {
> >> +        ch->msi.msg = msg;
> >>  
> >> -    /* We may have missed an interrupt due to the temporary masking. */
> >> -    if ( ch->event_handler && ch->next_event < NOW() )
> >> -        ch->event_handler(ch);
> >> +        if ( iommu_intremap != iommu_intremap_off )
> >> +        {
> >> +            int rc = iommu_update_ire_from_msi(&ch->msi, &msg);
> >> +
> >> +            ASSERT(rc <= 0);
> >> +            if ( rc >= 0 )
> > 
> > I don't think the rc > 0 part of this check is meaningful, as any rc
> > value > 0 will trigger the ASSERT(rc <= 0) ahead of it.  The code
> > inside of the if block itself only contains ASSERTs, so it's only
> > relevant for debug=y builds that will also have the rc <= 0 ASSERT.
> > 
> > You could possibly use:
> > 
> > ASSERT(rc <= 0);
> > if ( !rc )
> > {
> >     ASSERT(...
> > 
> > And achieve the same result?
> 
> Yes, except that I'd like to keep the >= to cover the case where the first
> assertion was dropped / commented out, as well as for its documenting effect.

Oh, OK.  Fair enough; I hadn't taken into account that the check could
also matter in case the code gets modified.
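
For the archives, then, the shape being kept is along these lines
(just a sketch, reduced to the relevant part of your hunk):

    int rc = iommu_update_ire_from_msi(&ch->msi, &msg);

    ASSERT(rc <= 0);
    if ( rc >= 0 )  /* >= on purpose: still covers rc > 0 if the ASSERT goes away */
    {
        /* Only further ASSERT()s in here, i.e. debug-build-only checks. */
    }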

> >> @@ -991,6 +997,13 @@ void alloc_direct_apic_vector(uint8_t *v
> >>      spin_unlock(&lock);
> >>  }
> >>  
> >> +/* This could free any vectors, but is needed only for low-prio ones. */
> >> +void __init free_lopriority_vector(uint8_t vector)
> >> +{
> >> +    ASSERT(vector < FIRST_HIPRIORITY_VECTOR);
> >> +    clear_bit(vector, used_vectors);
> >> +}
> > 
> > I'm undecided whether we want to have such helper.  This is all very
> > specific to the single use by the HPET vector, and hence might be best
> > to simply put the clear_bit() inside of hpet_broadcast_late_init()
> > itself.
> 
> I wanted to avoid making used_vectors non-static.
> 
> > I could see for example other callers wanting to use this also
> > requiring cleanup of the per cpu vector_irq arrays.  Given it's (so
> > far) very limited usage it might be clearer to open-code the
> > clear_bit().
> 
> Dealing with vector_irq[] is a separate thing, though, isn't it?

Possibly, that's part of the binding, rather than the allocation
itself (which is what you cover here).
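
To illustrate the split I'm thinking of (purely hypothetical helper,
not something I'm asking for in this patch; the "unused" marker value
is an assumption on my side):

    /* Hypothetical sketch: undoing the *binding* is per-CPU work ... */
    static void __init unbind_lopriority_vector(int irq, uint8_t vector)
    {
        unsigned int cpu;

        /* Binding cleanup: drop the per-CPU vector -> IRQ mappings again. */
        for_each_online_cpu ( cpu )
            if ( per_cpu(vector_irq, cpu)[vector] == irq )
                per_cpu(vector_irq, cpu)[vector] = INT_MIN; /* assumed "unused" value */

        /* ... while allocation cleanup is what free_lopriority_vector() does. */
        free_lopriority_vector(vector);
    }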

> >> --- a/xen/drivers/passthrough/amd/iommu_intr.c
> >> +++ b/xen/drivers/passthrough/amd/iommu_intr.c
> >> @@ -551,6 +551,13 @@ int cf_check amd_iommu_msi_msg_update_ir
> >>          for ( i = 1; i < nr; ++i )
> >>              msi_desc[i].remap_index = msi_desc->remap_index + i;
> >>          msg->data = data;
> >> +        /*
> >> +         * While the low address bits don't matter, "canonicalize" the address
> >> +         * by zapping the bits that were transferred to the IRTE.  This way
> >> +         * callers can check for there actually needing to be an update to
> >> +         * wherever the address is put.
> >> +         */
> >> +        msg->address_lo &= ~(MSI_ADDR_DESTMODE_MASK | MSI_ADDR_DEST_ID_MASK);
> > 
> > You might want to mention this change on the commit message also, as
> > it could look unrelated to the rest of the code?
> 
> I thought the comment here provided enough context and detail. I've added
> "AMD interrupt remapping code so far didn't "return" a consistent MSI
>  address when translating an MSI message. Clear respective fields there, to
>  keep the respective assertion in set_channel_irq_affinity() from
>  triggering."

LGTM; I would possibly drop the last "respective", as it's repetitive
given the earlier one in the sentence.

Thanks, Roger.
