On Thu, Feb 06, 2020 at 12:45:38PM +0100, Stefan Sperling wrote:
> At 36c3 I noticed roaming failures with iwm(4) where we would get stuck
> trying to roam to a different AP. Debugging this with bluhm@ we found
> that the reason it gets stuck is a non-zero refcount on the ic_bss node.
> 
> When roaming, we wait for this reference count to hit zero before switching
> to the new AP. A non-zero reference count implies that the driver still has
> outstanding frames destined for the old AP queued to hardware.
> What we observed was that the reference count never went back to zero
> so roaming never completed and no further data frames could be sent.
> ifconfig iwm0 down/up was required to get the interface working again.
> 
> iwm(4) decrements the refcount whenever hardware signals Tx completion
> for a frame at Tx queue index 'N'. We observed that we sometimes get an
> interrupt for frame 'N - 2' followed by an interrupt for frame 'N',
> with no interrupt being received for frame 'N - 1'.
> Whenever this had occurred, a later decision to roam to another AP would
> fail as described above. A side-effect of this is that an mbuf gets leaked.
> 
> This diff implements a workaround in iwm's interrupt handler.
> It's not pretty but I don't know the root cause.
> Given that this happens very rarely, and all we lose is success/failure
> information for the affected frames, this workaround seems acceptable to me.
> 
> So far I only managed to trigger the problem at 36c3.
> If you want to see a printf when it happens, compile with 'option IWM_DEBUG'.
> 
> ok?

I've been testing the diff for the last two days and haven't seen any
regressions. Unfortunately, I didn't manage to trigger the actual bug
either, so I cannot say for sure whether your workaround fixes it.

I think your solution makes sense until we find the underlying problem.
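
For readers following along, here is how I understand the idea, as a
stand-alone sketch rather than the actual diff. All names below (tx_ring,
tx_slot, tx_complete, RING_COUNT) are made up for illustration and are not
taken from the iwm(4) source. The assumption is that completions arrive in
ring order, so when the interrupt for index 'N' shows up, any older slot
that is still marked in flight must have had its interrupt lost and can be
reclaimed on the spot.

```c
#include <assert.h>
#include <stddef.h>

#define RING_COUNT 8		/* hypothetical ring size */

struct node {
	int refcnt;		/* stands in for the ic_bss node refcount */
};

struct tx_slot {
	struct node *ni;	/* non-NULL while a frame is in flight */
};

struct tx_ring {
	struct tx_slot slot[RING_COUNT];
	int tail;		/* oldest slot not yet completed */
};

/* Queue a frame at ring index idx and take a reference on the node. */
static void
tx_enqueue(struct tx_ring *r, int idx, struct node *ni)
{
	assert(idx >= 0 && idx < RING_COUNT);
	ni->refcnt++;
	r->slot[idx].ni = ni;
}

/* Release one slot; a real driver would also free the mbuf here. */
static void
tx_slot_done(struct tx_slot *s)
{
	if (s->ni == NULL)
		return;
	s->ni->refcnt--;
	s->ni = NULL;
}

/*
 * Handle a Tx completion interrupt for ring index idx.
 * Older slots still in flight are reclaimed first, without any
 * success/failure information. Returns the number of such slots.
 */
static int
tx_complete(struct tx_ring *r, int idx)
{
	int missed = 0;

	assert(idx >= 0 && idx < RING_COUNT);
	while (r->tail != idx) {
		if (r->slot[r->tail].ni != NULL) {
			tx_slot_done(&r->slot[r->tail]);
			missed++;
		}
		r->tail = (r->tail + 1) % RING_COUNT;
	}
	tx_slot_done(&r->slot[idx]);
	r->tail = (idx + 1) % RING_COUNT;
	return missed;
}
```

With three frames queued at indices 0, 1 and 2, calling tx_complete() for
index 0 and then for index 2 (skipping 1) still brings the refcount back
to zero, which is what roaming waits for.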
