On Tue, Sep 07, 2021 at 05:16:52PM +0200, Martin Pieuchot wrote:
> On 07/09/21(Tue) 15:48, Stefan Sperling wrote:
> > This patch makes iwx(4) resume reliably for me.
> > 
> > There were missing splnet() calls which leads to an obvious race
> > between the interrupt handler and the code which triggers firmware
> > loading and then sleeps to wait for confirmation.
> > This patch adds the missing splnet().
> > 
> > However, even with splnet() protection added I need to add the
> > following two lines of code to make the patch work reliably:
> > 
> >     /* wait for the firmware to load */
> >     for (w = 0; !sc->sc_uc.uc_intr && w < 10; w++) {
> >             err = tsleep_nsec(&sc->sc_uc, 0, "iwxuc", MSEC_TO_NSEC(100));
> > +           /* XXX This should not be needed, should it: */
> > +           if (err == EWOULDBLOCK && sc->sc_uc.uc_intr)
> > +                   err = 0;
> >     }
> >     if (err || !sc->sc_uc.uc_ok)
> >             printf("%s: could not load firmware, %d\n", DEVNAME(sc), err);
> > 
> > Which seems odd. I would expect tsleep to return EWOULDBLOCK only when
> > the interrupt handler did not already set uc_intr and call wakeup().
> 
> That suggests the timeout fires before the wakeup(9).

Yes, it does.

But how could uc_intr already be set to 1 in that case?
I observe that when tsleep returns with EWOULDBLOCK above, the value
of uc_intr is already set to 1.

This situation is what is causing iwx resume failures.
The interrupt did fire, set uc_intr = 1, and it called wakeup().
Now, why is tsleep still returning EWOULDBLOCK?

I assume that if tsleep behaved as expected then the above loop should
work, provided splnet() is applied correctly.

So far, I have only seen evidence of the above loop breaking when running
with msix. This same loop has worked fine since forever in iwm(4) on msi.
I don't know for certain whether this issue is msix-specific.
But suppose msix had a bug in IPL handling? Could that explain the problem?

> That suggests 100msec might be too small.  Did you try with a bigger
> value?  Or is there something special happening during DVACT_WAKEUP?

A higher value avoids the issue. But that feels like a workaround since it
hides the unexpected behaviour of tsleep which triggers the resume issue.

> > Am I missing something? Am I adding splnet() in the wrong place?
> 
> The splnet() you're adding ensure (with the KERNEL_LOCK()) no wakeup(9)
> will happen until you go to sleep.  I believe that's what you want.

Yes, that is what I would expect.

Reply via email to