On Mon, Feb 10, 2020 at 01:14:25PM +0100, Mark Kettenis wrote:
> Stefan Sperling wrote on 2020-02-06 12:45:
> > At 36c3 I noticed roaming failures with iwm(4) where we would get stuck
> > trying to roam to a different AP. Debugging this with bluhm@ we found
> > that the reason it gets stuck is a non-zero refcount on the ic_bss node.
> >
> > When roaming, we wait for this reference count to hit zero before
> > switching to the new AP. A non-zero reference count implies that the
> > driver still has outstanding frames destined for the old AP queued to
> > hardware.
> > What we observed was that the reference count never went back to zero
> > so roaming never completed and no further data frames could be sent.
> > ifconfig iwm0 down/up was required to get the interface working again.
> >
> > iwm(4) decrements the refcount whenever hardware signals Tx completion
> > for a frame at Tx queue index 'N'. We observed that we sometimes get an
> > interrupt for frame 'N - 2' followed by an interrupt for frame 'N',
> > with no interrupt being received for frame 'N - 1'.
>
> I don't really know how the rings work on this hardware, but coalescing
> interrupts like that doesn't sound unreasonable.

Interrupt coalescing is indeed enabled so it could be related to that.
I don't know how it is supposed to work either. We will never know for
sure until somebody leaks datasheets.

You could be correct that "missed Tx completion" is what coalescing
looks like for individually transmitted frames.

I would expect coalescing to signal one interrupt with multiple
'packets', which contain firmware command responses and/or 802.11
frames, delivered for processing by the driver. But that's just a guess
based on how interrupts behave when receiving frames. The proposed fix
is about Tx.

Since we do not yet aggregate on Tx it seems reasonable to assume that
each frame will be completed separately, which is what happens roughly
99.9% of the time.

It is also possible that the number of frames in the air at 36c3
triggered errors in the firmware or hardware that led to the symptoms
we observed.
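
To make the accounting problem concrete, here is a minimal sketch in C.
It is not the iwm(4) code and not the proposed diff. The ring layout and
all names (tx_ring, tx_done_naive, tx_done_tolerant, leaked_refs,
RING_COUNT) are hypothetical, and it assumes that completions arrive in
ring order. It queues three frames and delivers completions for ring
indexes 0 and 2 only, which mirrors the 'N - 2' followed by 'N' pattern
described above.

```c
#include <assert.h>
#include <string.h>

#define RING_COUNT 8		/* hypothetical ring size */

struct node {
	int refcnt;		/* frames still queued for this AP */
};

struct tx_slot {
	struct node *ni;	/* NULL when the slot is free */
};

struct tx_ring {
	struct tx_slot slot[RING_COUNT];
	int cur;		/* next slot to fill */
	int tail;		/* oldest slot not yet completed */
};

/* Queue one frame: take a node reference and occupy the next slot. */
static int
tx_enqueue(struct tx_ring *r, struct node *ni)
{
	int idx = r->cur;

	if (r->slot[idx].ni != NULL)
		return -1;	/* ring full */
	ni->refcnt++;
	r->slot[idx].ni = ni;
	r->cur = (r->cur + 1) % RING_COUNT;
	return idx;
}

/* Release the node reference held by one slot, if any. */
static void
tx_slot_done(struct tx_slot *s)
{
	if (s->ni != NULL) {
		s->ni->refcnt--;
		s->ni = NULL;
	}
}

/* Naive handler: release only the slot named by the interrupt. */
static void
tx_done_naive(struct tx_ring *r, int idx)
{
	tx_slot_done(&r->slot[idx]);
}

/*
 * Tolerant handler: treat a completion for idx as covering every older
 * slot that never got its own interrupt (assumes in-order completion).
 */
static void
tx_done_tolerant(struct tx_ring *r, int idx)
{
	assert(idx >= 0 && idx < RING_COUNT);
	while (r->tail != idx) {
		tx_slot_done(&r->slot[r->tail]);
		r->tail = (r->tail + 1) % RING_COUNT;
	}
	tx_slot_done(&r->slot[idx]);
	r->tail = (idx + 1) % RING_COUNT;
}

/* Returns the references still held after completions for 0 and 2. */
int
leaked_refs(int tolerant)
{
	struct tx_ring r;
	struct node bss;
	int i;

	memset(&r, 0, sizeof(r));
	memset(&bss, 0, sizeof(bss));
	for (i = 0; i < 3; i++)
		tx_enqueue(&r, &bss);
	if (tolerant) {
		tx_done_tolerant(&r, 0);
		tx_done_tolerant(&r, 2);
	} else {
		tx_done_naive(&r, 0);
		tx_done_naive(&r, 2);
	}
	return bss.refcnt;
}
```

With the naive handler one reference is never released, so a roam that
waits for the count to hit zero would wait forever. The tolerant handler
brings the count back to zero, at the cost of not knowing whether the
skipped frame was actually transmitted successfully.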
