On Thu, Feb 06, 2020 at 12:45:38PM +0100, Stefan Sperling wrote:
> At 36c3 I noticed roaming failures with iwm(4) where we would get stuck
> trying to roam to a different AP. Debugging this with bluhm@ we found
> that the reason it gets stuck is a non-zero refcount on the ic_bss node.
>
> When roaming, we wait for this reference count to hit zero before switching
> the new AP. A non-zero reference count implies that the driver still has
> outstanding frames destined for the old AP queued to hardware.
> What we observed was that the reference count never went back to zero
> so roaming never completed and no further data frames could be sent.
> ifconfig iwm0 down/up was required to get the interface working again.
>
> iwm(4) decrements the refcount whenever hardware signals Tx completion
> for a frame at Tx queue index 'N'. We observed that we sometimes get an
> interrupt for frame 'N - 2' followed by an interrupt for frame 'N',
> with no interrupt being received for frame 'N - 1'.
> Whenever this had occurred a later decision to roam to another AP would
> fail as described above. A side-effect of this is that an mbuf gets leaked.
>
> This diff implements a workaround in iwm's interrupt handler.
> It's not pretty but I don't know the root cause.
> Given that this happens very rarely, and all we lose is success/failure
> information for the affected frames, this workaround seems acceptable to me.
>
> So far I only managed to trigger the problem at 36c3.
> If you want to see a printf when it happens, compile with 'option IWM_DEBUG'.
>
> ok?

I've been testing the diff for the last two days and haven't seen any
regressions. Unfortunately, I also didn't manage to trigger the actual
bug, so I cannot say for sure whether your workaround fixes it.
I think your solution makes sense until we find the underlying problem.
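
For readers following along, here is a minimal sketch of the kind of
workaround described in the quoted mail. It is not the actual diff and
not iwm(4) code: the ring layout, the names (tx_ring, tx_intr,
node_refcnt) and the ring size are all hypothetical, and it assumes
that completions within one Tx queue arrive in index order. When a
completion for index N arrives, any older slot that is still marked
in use is treated as completed too, so the reference count can reach
zero again.

```c
#include <assert.h>

#define RING_COUNT 8	/* hypothetical ring size */

struct tx_slot {
	int in_use;	/* frame handed to hardware, completion pending */
};

struct tx_ring {
	struct tx_slot slot[RING_COUNT];
	int tail;	/* oldest index that may still await completion */
	int queued;	/* number of frames currently in flight */
};

/* Stands in for the refcount on the ic_bss node (assumption). */
static int node_refcnt;

/* Queue a frame at idx and take a node reference for it. */
static void
tx_enqueue(struct tx_ring *r, int idx)
{
	assert(idx >= 0 && idx < RING_COUNT);
	r->slot[idx].in_use = 1;
	r->queued++;
	node_refcnt++;
}

/* Release one slot; a real driver would also free the mbuf here. */
static void
tx_done_one(struct tx_ring *r, int idx)
{
	if (!r->slot[idx].in_use)
		return;
	r->slot[idx].in_use = 0;
	r->queued--;
	node_refcnt--;
}

/*
 * Handle a Tx completion interrupt for idx. Slots between the ring
 * tail and idx that are still in use never got their own interrupt,
 * so they are reclaimed here. Returns how many such slots were found.
 */
static int
tx_intr(struct tx_ring *r, int idx)
{
	int missed = 0;

	while (r->tail != idx) {
		if (r->slot[r->tail].in_use) {
			tx_done_one(r, r->tail);
			missed++;
		}
		r->tail = (r->tail + 1) % RING_COUNT;
	}
	tx_done_one(r, idx);
	r->tail = (idx + 1) % RING_COUNT;
	return missed;
}
```

The cost of this approach matches what the quoted mail says: the
success/failure status of a reclaimed frame is unknown, so a sketch
like this can only release its resources, not report its result.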
