On Tue, Sep 07, 2021 at 05:16:52PM +0200, Martin Pieuchot wrote: > On 07/09/21(Tue) 15:48, Stefan Sperling wrote: > > This patch makes iwx(4) resume reliably for me. > > > > There were missing splnet() calls which leads to an obvious race > > between the interrupt handler and the code which triggers firmware > > loading and then sleeps to wait for confirmation. > > This patch adds the missing splnet(). > > > > However, even with splnet() protection added I need to add the > > following two lines of code to make the patch work reliably: > > > > /* wait for the firmware to load */ > > for (w = 0; !sc->sc_uc.uc_intr && w < 10; w++) { > > err = tsleep_nsec(&sc->sc_uc, 0, "iwxuc", MSEC_TO_NSEC(100)); > > + /* XXX This should not be needed, should it: */ > > + if (err == EWOULDBLOCK && sc->sc_uc.uc_intr) > > + err = 0; > > } > > if (err || !sc->sc_uc.uc_ok) > > printf("%s: could not load firmware, %d\n", DEVNAME(sc), err); > > > > Which seems odd. I would expect tsleep to return EWOULDBLOCK only when > > the interrupt handler did not already set uc_intr and call wakeup(). > > That suggests the timeout fires before the wakeup(9).
Yes, it does. But how could uc_intr already be set to 1 in that case? I observe that when tsleep returns with EWOULDBLOCK above, the value of uc_intr is already set to 1. This situation is what is causing iwx resume failures. The interrupt did fire, set uc_intr = 1, and it called wakeup(). Now, why is tsleep still returning EWOULDBLOCK? I assume that if tsleep behaved as expected then the above loop should work, provided splnet() is applied correctly. So far, I have only seen evidence of the above loop breaking when running with msix. This same loop has worked fine since forever in iwm(4) on msi. I don't know for certain whether this issue is msix-specific. But suppose msix had a bug in IPL handling? Could that explain the problem? > That suggests 100msec might be too small. Did you try with a bigger > value? Or is there something special happening during DVACT_WAKEUP? A higher value avoids the issue. But that feels like a workaround since it hides the unexpected behaviour of tsleep which triggers the resume issue. > > Am I missing something? Am I adding splnet() in the wrong place? > > The splnet() you're adding ensure (with the KERNEL_LOCK()) no wakeup(9) > will happen until you go to sleep. I believe that's what you want. Yes, that is what I would expect.