Hi,

This is mostly a "I should really document this for others to follow" post.

Hiren and I are seeing situations where ath(4) in hostap mode will
hang. It'll hang, but only to one station (ie, the NIC and driver
isn't TX hung; it's something busted in the software queue handling.)

Rui changed the debugging so you don't get packet printing unless the
netif "debug" flag is enabled and silenced the flood of packet
printf's that go on when a client disassociates with frames in the
software queue. I erm, put that debugging back, under ATH_DEBUG_RESET
(sysctl dev.ath.X.debug=0x20).

So, with that in mind, I turned it on, started some traffic, and
waited for it to hang.

What I saw was. Annoying.

The log is on another laptop; I'll post it here. But the summarised output is:

ath1: ath_tx_tid_drain_print: norm: <mac>: ... seqno=3764, retry=0
ath1: ath_tx_tid_drain_print: norm: <mac>: txq[1] axq_depth=0 axq_aggr_depth=0
ath1: ath_tx_tid_drain_print: norm: <mac>: tid txq_depth=129
hwq_depth=0 bar_wait=0 isfiltered=0
ath1: ath_tx_tid_drain_print: norm: <mac>: tid 0: sched=1, paused=0,
incomp=0, baw_head=125, baw_tail=61, txa_start=3700, ni_txseqs=3893

There's 128 more of them, for each of the packets in the software queue.

Now, the basics:

* seqno = assigned sequence number for this frame
* txq[1] = this TID maps to the hardware txq[1]
* the axq_depth / axq_aggr_depth shows that no hardware frames are
queued to the device. One thing that used to happen is that a frame
would get "stuck" in the MAC (mostly because of long standing DMA
setup bugs) and if axq_depth or axq_aggr_depth > 2, the software queue
code won't queue anything further.
* the tid txq_depth=129, so there's 129 frames in the queue
* hwq_depth=0, there's no frames from this TID queued to the hardware
* bar_wait=0, so the TID isn't waiting for a BAR frame to complete
before the queue is unpaused
* isfiltered=0, so there's no filtered traffic waiting to complete
before we can continue (eg power save handling)

Now the fun bits:

* sched=1 - the software scheduler is actually trying to send frames
* paused=0, incomp=0 - the TID isn't paused (eg wiating for BAR to
complete, ADDBA negotiation, etc)
* incomp=0 - ie, there's no frames that need to complete before the
queue is unpaused (see above)

So, ok. it all looks fine.

Next, I waited until the thing again hung, and this time I logged in
via a different laptop and set up software TX debugging:

# sysctl dev.ath.1.debug=0x1fc000020

And this is what I got:

<unpause>
ath1: ath_txq_sched: tid=1, paused=0
ath1: ath_tx_tid_hw_queue_aggr: tid=1
ath1: ath_tx_rate_fill_rcflags: i=0, rate=0x8b, flags=0x19, max4ms=25708
ath1: ath_tx_rate_fill_rcflags: i=1, rate=0x8a, flags=0x19, max4ms=19280
ath1: ath_tx_rate_fill_rcflags: i=2, rate=0x89, flags=0x19, max4ms=12852
ath1: ath_get_aggr_limit: max frame len= 12852
ath1: ath_tx_tid_hw_queue_aggr: ath_tx_form_aggr() status=1
<pause>

.. so, the software scheduler was running, it tried queuing something
out, and ath_tx_form_aggr() returned 1 which is ATH_AGGR_BAW_CLOSED.

_that_ happens in this condition:

                /*
                 * If the packet has a sequence number, do not
                 * step outside of the block-ack window.
                 */
                if (! BAW_WITHIN(tap->txa_start, tap->txa_wnd,
                    SEQNO(bf->bf_state.bfs_seqno))) {
                        status = ATH_AGGR_BAW_CLOSED;
                        break;
                }

.. now, let's look at what's going on.

the first frame seqno here is (say) 3764. The window is 64 frames.
txa_start is 3700. So, the first frame is definitely outside the
sequence window and ath_tx_form_aggr() is quite rightly saying "nope,
we can't schedule this yet."

Now, I have no idea right now why this is happening.

It may be some off-by-one in some comparison somewhere. The check here
isn't an off-by-one as the seqno in question is plainly out of
sequence.

It _could_ be that there was a hole in the sequence numbers being
assigned; that would certainly mess things up. I'll go and review the
various places where sequence numbers are assigned and whether all the
locking is correct. Unfortunately (!) I made a point of using the
net80211 per-TID sequence number field which in the case of net80211
didn't use to be locked at all. I may have to just ditch that and have
the driver set its own sequence numbers if aggregation is used. (Maybe
for debugging I can track it privately too in the ath driver and
complain if they drift out of sync.)

I'll file a PR with this so it's being tracked.

Anyway - this wifi stuff is mystical but there's plenty of logging and
debugging going on right now. What I do note however is that I really
wish the MAC addresses of the nodes were logged, as it's a pain to
figure out which logging entries match up with which node.

As a side note - if anyone would like a smallish project - it'd be
nice to introduce a DPRINTF() for the ath driver code that took a node
parameter too - so I can do per-node debugging as well as global
driver debugging. That way I can do things like turn on debugging only
for a given node and still use the VAP for other devices.


-a
_______________________________________________
freebsd-wireless@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-wireless
To unsubscribe, send any mail to "freebsd-wireless-unsubscr...@freebsd.org"

Reply via email to