Re: brcmf_txfinalize misses 802.1x packet leading to infinite WARNINGs

2016-09-23 Thread Rafał Miłecki
On 22 September 2016 at 16:09, Rafał Miłecki  wrote:
> On 22 September 2016 at 14:24, Rafał Miłecki  wrote:
>> I got the same problem again, but this time there was only 1 skb in my
>> flowring.
>> That resulted in less serial console messages and no reboot.
>>
>> It doesn't look good. brcmf_txfinalize was called 11 times after deleting
>> flowring. Everytime it was working with the same skb that was already freed.
>
> I'm wondering if there is some bug in flowring code. Please take a
> look if my following understanding it correct.
>
> 1) Every skb that is supposed to reach the device gets queued on a flowring.
> 2) Dequeuing means skb is not available in a flowring anymore
> 3a) brcmf_msgbuf_txflow dequeues skbs to pass them to the firmware
> 3b) brcmf_flowring_delete dequeues skbs to delete (free) them
>
> If that simple description is correct, it shouldn't be possible to skb to:
> 1) Reach the firmware
> *and*
> 2) Get freed as part of flowring deletion
>
> But this is what seems to be happening in my case. We get one skb
> freed on flowring deletion and then reported as transmitted (sometimes
> multiple times) by the firmware.
>
> Maybe there is simply a race between brcmf_flowring_delete and
> brcmf_msgbuf_txflow?

I can see a problem with my debugging code. It may happen brcmfmac
gets skb allocated at the same address. Due to brcmf_flowring_delete
bypassing brcmf_txfinalize we don't drop tracing of the old one and
that results in the same packet listed twice in my debugging code.

I got caught into the same trap that breaks pend_8021x_cnt counting.

I believe pend_8021x_cnt related code should be moved to the place
that really queues skb for the firmware to access.

-- 
Rafał


Re: brcmf_txfinalize misses 802.1x packet leading to infinite WARNINGs

2016-09-22 Thread Rafał Miłecki
On 22 September 2016 at 14:24, Rafał Miłecki  wrote:
> On 09/22/2016 01:59 PM, Rafał Miłecki wrote:
>>
>> And again...
>>
>> I decided to focus on brcmf_flowring_delete a bit more.
>>
>> As I can see flowrings are created and removed from time to time, in most
>> cases
>> they are empty when being deleted. When they are not, things go wrong. In
>> below
>> log you can see brcmfmac removing flowring that got 8 skb packets.
>>
>> First some summary, of what was already partially noticed in previous
>> e-mail:
>> 1) brcmf_flowring_delete calls brcmu_pkt_buf_free_skb directly bypassing
>>brcmf_txfinalize. It means tracing 802.1x packets and counting TX
>> errors is
>>broken.
>> 2) After freeing skb (as the result of brcmf_flowring_delete) there will
>> still
>>an IRQ, MSGBUF_TYPE_TX_STATUS and call to the brcmf_txfinalize. That's
>>definitely unexpected as brcmf_txfinalize will be operating on skb that
>> was
>>already freed.
>>
>> Now, there is something new that looks even worse. Sometimes
>> brcmf_txfinalize
>> may be called multiple times for the same skb! Take a close look at my
>> log.
>> It was called twice for skb c64a5a80. Twice for c64a5480. Twice for
>> c64a5f00.
>> Then Linux/device rebooted itself due to amount of console messages.
>
>
> I got the same problem again, but this time there was only 1 skb in my
> flowring.
> That resulted in less serial console messages and no reboot.
>
> It doesn't look good. brcmf_txfinalize was called 11 times after deleting
> flowring. Everytime it was working with the same skb that was already freed.

I'm wondering if there is some bug in flowring code. Please take a
look if my following understanding it correct.

1) Every skb that is supposed to reach the device gets queued on a flowring.
2) Dequeuing means skb is not available in a flowring anymore
3a) brcmf_msgbuf_txflow dequeues skbs to pass them to the firmware
3b) brcmf_flowring_delete dequeues skbs to delete (free) them

If that simple description is correct, it shouldn't be possible to skb to:
1) Reach the firmware
*and*
2) Get freed as part of flowring deletion

But this is what seems to be happening in my case. We get one skb
freed on flowring deletion and then reported as transmitted (sometimes
multiple times) by the firmware.

Maybe there is simply a race between brcmf_flowring_delete and
brcmf_msgbuf_txflow?


Re: brcmf_txfinalize misses 802.1x packet leading to infinite WARNINGs

2016-09-22 Thread Rafał Miłecki

On 09/22/2016 01:59 PM, Rafał Miłecki wrote:

And again...

I decided to focus on brcmf_flowring_delete a bit more.

As I can see flowrings are created and removed from time to time, in most cases
they are empty when being deleted. When they are not, things go wrong. In below
log you can see brcmfmac removing flowring that got 8 skb packets.

First some summary, of what was already partially noticed in previous e-mail:
1) brcmf_flowring_delete calls brcmu_pkt_buf_free_skb directly bypassing
   brcmf_txfinalize. It means tracing 802.1x packets and counting TX errors is
   broken.
2) After freeing skb (as the result of brcmf_flowring_delete) there will still
   an IRQ, MSGBUF_TYPE_TX_STATUS and call to the brcmf_txfinalize. That's
   definitely unexpected as brcmf_txfinalize will be operating on skb that was
   already freed.

Now, there is something new that looks even worse. Sometimes brcmf_txfinalize
may be called multiple times for the same skb! Take a close look at my log.
It was called twice for skb c64a5a80. Twice for c64a5480. Twice for c64a5f00.
Then Linux/device rebooted itself due to amount of console messages.


I got the same problem again, but this time there was only 1 skb in my flowring.
That resulted in less serial console messages and no reboot.

It doesn't look good. brcmf_txfinalize was called 11 times after deleting
flowring. Everytime it was working with the same skb that was already freed.

After my DIR-885L and brcmfmac managed to survive that, my smartphone remained
connected but I couldn't contact it (no reply to ping). I had to reconnect and
then it was back OK.

[ 2927.500112] brcmfmac: CONSOLE: 029651.820 wl0: Proxy STA 78:d6:f0:9b:ba:bc 
link is already gone !!??

[ 2927.509473] brcmfmac: [brcmf_msgbuf_process_flow_ring_delete_response -> 
__brcmf_msgbuf_remove_flowring] flowid:3
[ 2927.519800] brcmfmac: [__brcmf_msgbuf_remove_flowring -> 
__brcmf_flowring_delete] flowid:3 ring:c7850540
[ 2927.529318] brcmfmac: [__brcmf_msgbuf_remove_flowring -> 
__brcmf_flowring_delete] count:0

[ 2927.537526] brcmfmac: [brcmf_msgbuf_process_flow_ring_delete_response -> 
__brcmf_msgbuf_remove_flowring] flowid:34
[ 2927.547929] brcmfmac: [__brcmf_msgbuf_remove_flowring -> 
__brcmf_flowring_delete] flowid:34 ring:c7850180
[ 2927.557533] brcmfmac: [__brcmf_msgbuf_remove_flowring -> 
__brcmf_flowring_delete] count:0

[ 2927.559735] brcmfmac: [brcmf_msgbuf_flowring_create -> 
__brcmf_flowring_create] da:78:d6:f0:9b:ba:bc prio:0 ifidx:1
[ 2927.559739] brcmfmac: [brcmf_msgbuf_flowring_create -> 
__brcmf_flowring_create] flowid:3
[ 2927.584463] brcmfmac: [brcmf_msgbuf_process_flow_ring_create_response] 
flowid:3 status:0x

[ 2927.593211] brcmfmac: [brcmf_msgbuf_process_flow_ring_delete_response -> 
__brcmf_msgbuf_remove_flowring] flowid:3
[ 2927.603538] brcmfmac: [__brcmf_msgbuf_remove_flowring -> 
__brcmf_flowring_delete] flowid:3 ring:c658cb80
[ 2927.613059] brcmfmac: [__brcmf_flowring_delete -> __brcmu_pkt_buf_free_skb] [ifp:  
(null)] Freeing skb:c64d1240 skb->dev:c6b3f800 skb->dev->name:wlan1-1
[ 2927.626750] brcmfmac: [__brcmf_msgbuf_remove_flowring -> 
__brcmf_flowring_delete] count:1

[ 2928.348341] brcmfmac: CONSOLE: 029652.655 wl0: wlc_ampdu_resp_timeout: 
cleaning up resp tid 0 waiting forseq 0xbb9 for 1000 ms

[ 2930.887857] brcmfmac: [brcmf_msgbuf_flowring_create -> 
__brcmf_flowring_create] da:78:d6:f0:9b:ba:bc prio:0 ifidx:1
[ 2930.898340] brcmfmac: [brcmf_msgbuf_flowring_create -> 
__brcmf_flowring_create] flowid:3
[ 2930.906686] brcmfmac: [brcmf_msgbuf_process_flow_ring_create_response] 
flowid:3 status:0x

[ 2933.982285] brcmfmac: [__brcmf_txfinalize -> __brcmu_pkt_buf_free_skb] [ifp:c6b3fc80] 
***BUG*** skb:c64d1240 skb->dev:c6b3f800 skb->dev->name:wlan1-1
[ 2933.995705] [ cut here ]
[ 2934.000359] WARNING: CPU: 1 PID: 615 at 
compat-wireless-2016-06-20/drivers/net/wireless/broadcom/brcm80211/brcmfmac/core.c:70
 __brcmu_pkt_buf_free_skb+0x98/0x1ec [brcmfmac]()
[ 2934.015949] Modules linked in: pppoe ppp_async iptable_nat brcmfmac pppox 
ppp_generic nf_nat_ipv4 nf_conntrack_ipv6 nf_conntrack_ipv4 ipt_REJECT 
ipt_MASQUERADE cfg80211 xt_time xt_tcpudp xt_state xt_nat xt_multiport xt_mark 
xt_mac xt_limit xt_id xt_conntrack xt_commed
[ 2934.088284] CPU: 1 PID: 615 Comm: irq/33-brcmf_pc Not tainted 4.4.21 #0
[ 2934.094913] Hardware name: BCM5301X
[ 2934.098403] Backtrace:
[ 2934.100877] [] (dump_backtrace) from [] 
(show_stack+0x18/0x1c)
[ 2934.108465]  r7:0046 r6:bf1d1146 r5:6093 r4:
[ 2934.114183] [] (show_stack) from [] 
(dump_stack+0x84/0xa4)
[ 2934.121421] [] (dump_stack) from [] 
(warn_slowpath_common+0x8c/0xb8)
[ 2934.129527]  r5:0009 r4:
[ 2934.133126] [] (warn_slowpath_common) from [] 
(warn_slowpath_null+0x24/0x2c)
[ 2934.141931]  r8:a013 r7:c6b3fdd0 r6:bf1c9f2d r5:c6b3fc80 r4:c64d1240
[ 2934.148707] [] (warn_slowpath_null) from [] 
(__brcmu_pkt_buf_free_skb+0x98/0x1ec [brcmfmac])
[ 2934.158923] [] 

Re: brcmf_txfinalize misses 802.1x packet leading to infinite WARNINGs

2016-09-22 Thread Rafał Miłecki
And again...

I decided to focus on brcmf_flowring_delete a bit more.

As I can see flowrings are created and removed from time to time, in most cases
they are empty when being deleted. When they are not, things go wrong. In below
log you can see brcmfmac removing flowring that got 8 skb packets.

First some summary, of what was already partially noticed in previous e-mail:
1) brcmf_flowring_delete calls brcmu_pkt_buf_free_skb directly bypassing
   brcmf_txfinalize. It means tracing 802.1x packets and counting TX errors is
   broken.
2) After freeing skb (as the result of brcmf_flowring_delete) there will still
   an IRQ, MSGBUF_TYPE_TX_STATUS and call to the brcmf_txfinalize. That's
   definitely unexpected as brcmf_txfinalize will be operating on skb that was
   already freed.

Now, there is something new that looks even worse. Sometimes brcmf_txfinalize
may be called multiple times for the same skb! Take a close look at my log.
It was called twice for skb c64a5a80. Twice for c64a5480. Twice for c64a5f00.
Then Linux/device rebooted itself due to amount of console messages.

[  242.242012] brcmfmac: [brcmf_msgbuf_process_flow_ring_delete_response -> 
__brcmf_msgbuf_remove_flowring] flowid:36
[  242.252440] brcmfmac: [__brcmf_msgbuf_remove_flowring -> 
__brcmf_flowring_delete] flowid:36 ring:c667fd40
[  242.262043] brcmfmac: [__brcmf_flowring_delete -> __brcmu_pkt_buf_free_skb] 
[ifp:  (null)] Freeing skb:c64a5780 skb->dev:c647a000 skb->dev->name:wlan1-1
[  242.275733] brcmfmac: [__brcmf_flowring_delete -> __brcmu_pkt_buf_free_skb] 
[ifp:  (null)] Freeing skb:c64a5540 skb->dev:c647a000 skb->dev->name:wlan1-1
[  242.289420] brcmfmac: [__brcmf_flowring_delete -> __brcmu_pkt_buf_free_skb] 
[ifp:  (null)] Freeing skb:c64a5180 skb->dev:c647a000 skb->dev->name:wlan1-1
[  242.303112] brcmfmac: [__brcmf_flowring_delete -> __brcmu_pkt_buf_free_skb] 
[ifp:  (null)] Freeing skb:c64a5f00 skb->dev:c647a000 skb->dev->name:wlan1-1
[  242.316802] brcmfmac: [__brcmf_flowring_delete -> __brcmu_pkt_buf_free_skb] 
[ifp:  (null)] Freeing skb:c64a50c0 skb->dev:c647a000 skb->dev->name:wlan1-1
[  242.330488] brcmfmac: [__brcmf_flowring_delete -> __brcmu_pkt_buf_free_skb] 
[ifp:  (null)] Freeing skb:c64a5480 skb->dev:c647a000 skb->dev->name:wlan1-1
[  242.344177] brcmfmac: [__brcmf_flowring_delete -> __brcmu_pkt_buf_free_skb] 
[ifp:  (null)] Freeing skb:c64a5a80 skb->dev:c647a000 skb->dev->name:wlan1-1
[  242.357867] brcmfmac: [__brcmf_flowring_delete -> __brcmu_pkt_buf_free_skb] 
[ifp:  (null)] Freeing skb:c64a5900 skb->dev:c647a000 skb->dev->name:wlan1-1
[  242.371548] brcmfmac: [__brcmf_msgbuf_remove_flowring -> 
__brcmf_flowring_delete] count:8

[  242.379758] brcmfmac: [brcmf_msgbuf_process_flow_ring_delete_response -> 
__brcmf_msgbuf_remove_flowring] flowid:37
[  242.390159] brcmfmac: [__brcmf_msgbuf_remove_flowring -> 
__brcmf_flowring_delete] flowid:37 ring:c667fc80
[  242.399752] brcmfmac: [__brcmf_msgbuf_remove_flowring -> 
__brcmf_flowring_delete] count:0

[  242.433468] brcmfmac: [brcmf_msgbuf_flowring_create -> 
__brcmf_flowring_create] da:78:d6:f0:9b:ba:bc prio:0 ifidx:1
[  242.443974] brcmfmac: [brcmf_msgbuf_flowring_create -> 
__brcmf_flowring_create] flowid:36
[  242.452580] brcmfmac: [brcmf_msgbuf_process_flow_ring_create_response] 
flowid:36 status:0x

[  242.765056] brcmfmac: [__brcmf_txfinalize -> __brcmu_pkt_buf_free_skb] 
[ifp:c647a480] ***BUG*** skb:c64a5900 skb->dev:c647a000 skb->dev->name:wlan1-1
[  242.778479] [ cut here ]
[  242.783133] WARNING: CPU: 1 PID: 616 at 
compat-wireless-2016-06-20/drivers/net/wireless/broadcom/brcm80211/brcmfmac/core.c:70
 __brcmu_pkt_buf_free_skb+0x98/0x1ec [brcmfmac]()
[  242.798724] Modules linked in: pppoe ppp_async iptable_nat brcmfmac pppox 
ppp_generic nf_nat_ipv4 nf_conntrack_ipv6 nf_conntrack_ipv4 ipt_REJECT 
ipt_MASQUERADE cfg80211 xt_time xt_tcpudp xt_state xt_nat xt_multiport xt_mark 
xt_mac xt_limit xt_id xt_conntrack xt_commed
[  242.871057] CPU: 1 PID: 616 Comm: irq/33-brcmf_pc Not tainted 4.4.21 #0
[  242.877686] Hardware name: BCM5301X
[  242.881176] Backtrace: 
[  242.883651] [] (dump_backtrace) from [] 
(show_stack+0x18/0x1c)
[  242.891230]  r7:0046 r6:bf1d1146 r5:6093 r4:
[  242.896946] [] (show_stack) from [] 
(dump_stack+0x84/0xa4)
[  242.904186] [] (dump_stack) from [] 
(warn_slowpath_common+0x8c/0xb8)
[  242.912293]  r5:0009 r4:
[  242.915891] [] (warn_slowpath_common) from [] 
(warn_slowpath_null+0x24/0x2c)
[  242.924696]  r8:a013 r7:c647a5d0 r6:bf1c9f2d r5:c647a480 r4:c64a5900
[  242.931468] [] (warn_slowpath_null) from [] 
(__brcmu_pkt_buf_free_skb+0x98/0x1ec [brcmfmac])
[  242.941680] [] (__brcmu_pkt_buf_free_skb [brcmfmac]) from 
[] (__brcmf_txfinalize+0x19c/0x1d0 [brcmfmac])
[  242.952914]  r8:c04903c8 r7:c647a5d0 r6:c64a5900 r5:c648ca00 r4:c647a480
[  242.959687] [] (__brcmf_txfinalize [brcmfmac]) from [] 
(brcmf_msgbuf_txdata+0x53c/0x6e8 [brcmfmac])
[  242.970488]  r10:c64a5900 

Re: brcmf_txfinalize misses 802.1x packet leading to infinite WARNINGs

2016-09-20 Thread Rafał Miłecki
Hi Hante,

I hit this problem again and I'm afraid it's getting even more complex. Last
time you were suspecting flowring deletion but it didn't make much sense to me.
It was because I didn't see brcmf_flowring_delete anywhere in my log.

Well, today it was different. I saw brcmf_flowring_delete which makes me wonder
if there is more than 1 source of this problem.

Good news is that today I got few extra debugging messages. Bad news is I was
experimenting with MAX_WAIT_FOR_8021X_TX. I added brcmf_netdev_wait_pend8021x to
the brcmf_cfg80211_get_station and I was running
while [ 1 ]; do iw dev wlan1-1 station get 88:53:2e:50:50:00 > /dev/null; done

I hope this log may be a bit helpful anyway.

So this time brcmf_flowring_delete was called indeed and I think there was a
race in brcmfmac code. It seems brcmu_pkt_buf_free_skb was called twice for the
same skb!

For the first time it was called from brcmf_flowring_delete. We called
dev_kfree_skb_any for that skb which means we shouldn't access it anymore.

For the second time it (brcmu_pkt_buf_free_skb) was call from brcmf_txfinalize.
Unfortunately when brcmf_txfinalize was analyzing that skb (that was already
freed and invalid) it didn't contain ETH_P_PAE in the ethhdr anymore so
atomic_dec(>pend_8021x_cnt);
wasn't called.

So my guess is that:
1) We should fix brcmf_flowring_delete to use brcmf_txfinalize
2) We should avoid freeing the same skb twice

[  167.596719] brcmfmac: [brcmf_cfg80211_del_key -> __send_key_to_dongle] 
ifp:c64c6480 brcmf_ifname(ifp):wlan1-1
[  167.611257] brcmfmac: CONSOLE: 026896.770 wl0: Proxy STA 78:d6:f0:9b:ba:bc 
link is already gone !!??
[  167.623375] brcmfmac: [brcmf_flowring_delete -> __brcmu_pkt_buf_free_skb] 
[ifp:  (null)] Freeing skb:c7240240 skb->dev:c64c6000 skb->dev->name:wlan1-1
[  167.640721] brcmfmac: [__brcmf_txfinalize -> __brcmu_pkt_buf_free_skb] 
[ifp:c64c6480] ***BUG*** skb:c7240240 skb->dev:c64c6000 skb->dev->name:wlan1-1
[  167.654247] brcmfmac: [__brcmf_txfinalize -> __brcmu_pkt_buf_free_skb] 
[ifp:c64c6480] Original data: 78 d6 f0 9b ba bc  92 8d 78 66 3a 57  88 8e
[  167.658643] brcmfmac: [brcmf_netdev_wait_pend8021x] ***TIMEOUT WARNING*** 
ifp:c64c6480 brcmf_ifname(ifp):wlan1-1 brcmf_get_pend_8021x_cnt(ifp):1
[  167.680249] brcmfmac: [__brcmf_txfinalize -> __brcmu_pkt_buf_free_skb] 
[ifp:c64c6480]  Current data: 88 53 2e 50 50 00  ba 87 01 e2 06 f8  08 00
[  167.693239] [ cut here ]
[  167.697893] WARNING: CPU: 0 PID: 616 at 
compat-wireless-2016-06-20/drivers/net/wireless/broadcom/brcm80211/brcmfmac/core.c:73
 __brcmu_pkt_buf_free_skb+0x1a0/0x2f8 [brcmfmac]()
[  167.713587] Modules linked in: pppoe ppp_async iptable_nat brcmfmac pppox 
ppp_generic nf_nat_ipv4 nf_conntrack_ipv6 nf_conntrack_ipv4 ipt_REJECT 
ipt_MASQUERADE cfg80211 xt_time xt_tcpudp xt_state xt_nat xt_multiport xt_mark 
xt_mac xt_limit xt_id xt_conntrack xt_commed
[  167.785932] CPU: 0 PID: 616 Comm: irq/33-brcmf_pc Not tainted 4.4.21 #0
[  167.792560] Hardware name: BCM5301X
[  167.796050] Backtrace: 
[  167.798525] [] (dump_backtrace) from [] 
(show_stack+0x18/0x1c)
[  167.806112]  r7:0049 r6:bf1d1255 r5:6013 r4:
[  167.811831] [] (show_stack) from [] 
(dump_stack+0x84/0xa4)
[  167.819069] [] (dump_stack) from [] 
(warn_slowpath_common+0x8c/0xb8)
[  167.827175]  r5:0009 r4:
[  167.830774] [] (warn_slowpath_common) from [] 
(warn_slowpath_null+0x24/0x2c)
[  167.839579]  r8:c64c653c r7:bf1c9f39 r6:c64c6480 r5:c7240240 r4:c69571c0
[  167.846357] [] (warn_slowpath_null) from [] 
(__brcmu_pkt_buf_free_skb+0x1a0/0x2f8 [brcmfmac])
[  167.856658] [] (__brcmu_pkt_buf_free_skb [brcmfmac]) from 
[] (__brcmf_txfinalize+0x184/0x1b4 [brcmfmac])
[  167.867893]  r8:cacd1720 r7:c723d180 r6:c7240240 r5:0001 r4:c64c6480
[  167.874665] [] (__brcmf_txfinalize [brcmfmac]) from [] 
(brcmf_msgbuf_txdata+0x508/0x6b0 [brcmfmac])
[  167.885466]  r10:c7240240 r9:cacd1720 r8:cacd1720 r7:c723d180 r6:0003 
r5:0001
[  167.893352]  r4:c6801c00
[  167.895912] [] (brcmf_msgbuf_txdata [brcmfmac]) from [] 
(brcmf_proto_msgbuf_rx_trigger+0x3c/0xd0 [brcmfmac])
[  167.907496]  r10: r9:c0493882 r8:c68005e4 r7:c0057d98 r6:c78850c0 
r5:0001
[  167.915382]  r4:c6801c00
[  167.917941] [] (brcmf_proto_msgbuf_rx_trigger [brcmfmac]) from 
[] (brcmf_pcie_isr_thread+0x1c4/0x238 [brcmfmac])
[  167.929874]  r7:c0057d98 r6:c78850c0 r5:0001 r4:c6970800
[  167.935587] [] (brcmf_pcie_isr_thread [brcmfmac]) from 
[] (irq_thread_fn+0x24/0x3c)
[  167.945002]  r9:c0493882 r8:c68005e4 r7:c0057d98 r6:c78850c0 r5:c68005c0 
r4:c68005c0
[  167.952803] [] (irq_thread_fn) from [] 
(irq_thread+0xf8/0x1e4)
[  167.960381]  r7:c0057d98 r6:c73f6000 r5:c68005c0 r4:c78850c0
[  167.966089] [] (irq_thread) from [] (kthread+0xe0/0xf4)
[  167.973064]  r10: r9: r8: r7:c0057fa8 r6:c68005c0 
r5:
[  167.980948]  r4:c6800b80
[  167.983497] [] (kthread) from [] 
(ret_from_fork+0x14/0x3c)
[  167.990732]  

Re: brcmf_txfinalize misses 802.1x packet leading to infinite WARNINGs

2016-09-15 Thread Rafał Miłecki

On 15 September 2016 at 11:20, Hante Meuleman  
wrote:
> Thank you for the extensive debugging. We are looking into this. Arend wrote
> yesterday to ask for detailed timing on wen eapol is inserted. We want this
> so we can increase the timeout. This is not a "nice" way to solve the
> problem, and it should be solved in firmware, but in the meanwhile we do
> want to increase timer, because we think that ampdu issues can rise at any
> given moment and even with changes/updates in firmware it might be necessary
> to increase timeout.

I'm kindly asking to keep replies in related threads :) I'm pretty sure above is
about problem described in "AMPDU stalls with brcmfmac4366b-pcie.bin triggering
WARNINGs".


> Second problem is harder, it is good to see that the frame gets returned to
> driver at some point. Our biggest worry is that a frame remains indefinitely
> in the firmware, but that appears not to be the case. Now why could this
> fail. There is one possible reason I found, and that is when a flowring is
> deleted while it holds the eapol, see flowring.c. It does not call the
> brcmf_txfinalize, but frees the packet directly. I think this is wrong but
> need to investigate this in more detail. In the meanwhile, if you keep doing
> tests I would like to ask you to add a WARN_ON() call to the function
> __brcmu_pkt_buf_free_skb where you print ***BUG*** so we know where the
> packet got freed from.

Please take a look at my e-mail & log (& maybe diff) once again. You really
quite missed the point.

The function brcmf_txfinalize *was* called. I was describing it in my e-mail
and there is a log:
[ 1440.414653] brcmfmac: [__brcmf_txfinalize -> __brcmu_pkt_buf_free_skb] [ifp:c72e7c80] 
***BUG*** skb:c70ddc00 skb->dev:c72e7800 skb->dev->name:wlan1-1
Above means that brcmf_txfinalize was called for skb c70ddc00 and it called
brcmu_pkt_buf_free_skb.

My debugging code noticed that it wasn't alright as this packet was still
pending and pend_8021x_cnt wasn't decreased for him. Please note it was
brcmf_txfinalize's fault (which was called for 100% sure). For some reason it
didn't pass if (type == ETH_P_PAE) condition. I already described it and I
shared my guess of firmware corrupting skb data. I'm now using debugging patch
which prints copied and current content of skb data in case of fault.

You're right I should have used WARN in my ***BUG*** place. It's a stupid habit
from MIPS devices where backtraces aren't reliable. I printed mini call chain
on my own instead. I mean this part:
[__brcmf_txfinalize -> __brcmu_pkt_buf_free_skb]

So please take a look at my e-mail again and let me know if it makes more sense
now.
What do you think about my guess of firmware corrupting skb data?


RE: brcmf_txfinalize misses 802.1x packet leading to infinite WARNINGs

2016-09-15 Thread Hante Meuleman
Hi Rafal,

Thank you for the extensive debugging. We are looking into this. Arend wrote
yesterday to ask for detailed timing on wen eapol is inserted. We want this
so we can increase the timeout. This is not a "nice" way to solve the
problem, and it should be solved in firmware, but in the meanwhile we do
want to increase timer, because we think that ampdu issues can rise at any
given moment and even with changes/updates in firmware it might be necessary
to increase timeout.
Second problem is harder, it is good to see that the frame gets returned to
driver at some point. Our biggest worry is that a frame remains indefinitely
in the firmware, but that appears not to be the case. Now why could this
fail. There is one possible reason I found, and that is when a flowring is
deleted while it holds the eapol, see flowring.c. It does not call the
brcmf_txfinalize, but frees the packet directly. I think this is wrong but
need to investigate this in more detail. In the meanwhile, if you keep doing
tests I would like to ask you to add a WARN_ON() call to the function
__brcmu_pkt_buf_free_skb where you print ***BUG*** so we know where the
packet got freed from.

Regards,
Hante

-Original Message-
From: Rafał Miłecki [mailto:zaj...@gmail.com]
Sent: Thursday, September 15, 2016 10:12 AM
To: Hante Meuleman; Arend van Spriel; brcm80211-dev-l...@broadcom.com
Cc: linux-wireless@vger.kernel.org; Rafał Miłecki
Subject: brcmf_txfinalize misses 802.1x packet leading to infinite WARNINGs

Hi,

Yesterday I explained on OpenWrt forum [0] that there are 2 problems
leading to WARNINGs triggered by brcmf_netdev_wait_pend8021x.

The first one is firmware problem with A-MPDU implementation. I already
reported this in "AMPDU stalls with brcmfmac4366b-pcie.bin triggering
WARNINGs" e-mail thread [1].

Another one (I'm reporting right now) is related to brcmfmac and its
counting of 802.1x packets. The idea is simple:
1) In ndo_start_xmit callback there is check for ETH_P_PAE and code
   increasing 802.1x counter by 1.
2) In brcmf_txfinalize there is check for ETH_P_PAE and code decreasing
   802.1x counter by 1.
This is needed as some operations have to be handled without any 802.1x
packet pending.

Thanks to my debugging code (you can find it at the end) I just noticed
that:
1) Sometimes (very rarely) brcmf_txfinalize doesn't detect 802.1x packet
2) brcmu_pkt_buf_free_skb gets called and skb gets freed
3) Counter remains not-decreased and brcmf_netdev_wait_pend8021x will
   always time out.

I'm not sure why this could be happening. The check for ETH_P_PAE looks
exactly the same in both places. My only idea so far is firmware
corrupting skb data. This fools brcmfmac code and my debugging code was
lucky enough to keep tracing skbs (by their addresses) to notice that.

I'm going to extend my debugging patch by making a copy of eth header
and comparing it with corrupted one. As this bug occurs very rarely it
make take days or weeks to get any update.

I'd love to hear any comment meanwhile.

[0] https://forum.openwrt.org/viewtopic.php?pid=338235#p338235
[1] https://marc.info/?t=14738321621=1=2

[ 1438.965889] brcmfmac: CONSOLE: 028168.028 wl0.3: wlc_send_bar: seq 0xee
tid 0
[ 1438.993255] brcmfmac: CONSOLE: 028168.055 wl0.3: wlc_send_bar: seq 0xef
tid 0
[ 1439.000743] brcmfmac: CONSOLE: 028168.060 wl0.3: wlc_send_bar: seq 0xef
tid 0
[ 1439.008060] brcmfmac: CONSOLE: 028168.067 wl0.3: wlc_send_bar: seq 0xef
tid 0
[ 1439.015568] brcmfmac: CONSOLE: 028168.073 wl0.3: wlc_send_bar: seq 0xef
tid 0
[ 1439.022786] brcmfmac: CONSOLE: 028168.077 wl0.3: wlc_send_bar: seq 0xef
tid 0
[ 1439.030150] brcmfmac: CONSOLE: 028168.078 wl0.3: wlc_send_bar: seq 0xef
tid 0
[ 1439.037342] brcmfmac: CONSOLE: 028168.081 wl0.3: wlc_send_bar: seq 0xef
tid 0
[ 1439.044554] brcmfmac: CONSOLE: 028168.088 wl0.3: wlc_send_bar: seq 0xef
tid 0
[ 1439.051756] brcmfmac: CONSOLE: 028168.090 wl0.3: wlc_send_bar: seq 0xef
tid 0
[ 1439.362540] brcmfmac: CONSOLE: 028168.425 wl0.3: wlc_send_bar: seq 0xf0
tid 0
[ 1439.369827] brcmfmac: CONSOLE: 028168.431 wl0.3: wlc_send_bar: seq 0xf0
tid 0
[ 1439.378367] brcmfmac: CONSOLE: 028168.432 wl0.3: wlc_send_bar: seq 0xf0
tid 0
[ 1439.385590] brcmfmac: CONSOLE: 028168.434 wl0.3: wlc_send_bar: seq 0xf0
tid 0
[ 1439.392839] brcmfmac: CONSOLE: 028168.435 wl0.3: wlc_send_bar: seq 0xf0
tid 0
[ 1439.400041] brcmfmac: CONSOLE: 028168.438 wl0.3: wlc_send_bar: seq 0xf0
tid 0
[ 1439.407226] brcmfmac: CONSOLE: 028168.440 wl0.3: wlc_send_bar: seq 0xf0
tid 0
[ 1439.414848] brcmfmac: CONSOLE: 028168.447 wl0.3: wlc_send_bar: seq 0xf0
tid 0
[ 1439.422069] brcmfmac: CONSOLE: 028168.451 wl0.3: wlc_send_bar: seq 0xf0
tid 0
[ 1439.429269] brcmfmac: CONSOLE: 028168.454 wl0.3: wlc_send_bar: seq 0xf0
tid 0
[ 1439.436454] brcmfmac: CONSOLE: 028168.455 wl0.3: wlc_send_bar: seq 0xf0
tid 0
[ 1439.443653] brcmfmac: CONSOLE: 028168.457 wl0.3: wlc_send_bar: seq 0xf0
tid 0
[ 1439.450848] brcmfmac: CONSOLE: 028168.459 wl0.3: wlc_send

brcmf_txfinalize misses 802.1x packet leading to infinite WARNINGs

2016-09-15 Thread Rafał Miłecki
Hi,

Yesterday I explained on OpenWrt forum [0] that there are 2 problems
leading to WARNINGs triggered by brcmf_netdev_wait_pend8021x.

The first one is firmware problem with A-MPDU implementation. I already
reported this in "AMPDU stalls with brcmfmac4366b-pcie.bin triggering
WARNINGs" e-mail thread [1].

Another one (I'm reporting right now) is related to brcmfmac and its
counting of 802.1x packets. The idea is simple:
1) In ndo_start_xmit callback there is check for ETH_P_PAE and code
   increasing 802.1x counter by 1.
2) In brcmf_txfinalize there is check for ETH_P_PAE and code decreasing
   802.1x counter by 1.
This is needed as some operations have to be handled without any 802.1x
packet pending.

Thanks to my debugging code (you can find it at the end) I just noticed
that:
1) Sometimes (very rarely) brcmf_txfinalize doesn't detect 802.1x packet
2) brcmu_pkt_buf_free_skb gets called and skb gets freed
3) Counter remains not-decreased and brcmf_netdev_wait_pend8021x will
   always time out.

I'm not sure why this could be happening. The check for ETH_P_PAE looks
exactly the same in both places. My only idea so far is firmware
corrupting skb data. This fools brcmfmac code and my debugging code was
lucky enough to keep tracing skbs (by their addresses) to notice that.

I'm going to extend my debugging patch by making a copy of eth header
and comparing it with corrupted one. As this bug occurs very rarely it
make take days or weeks to get any update.

I'd love to hear any comment meanwhile.

[0] https://forum.openwrt.org/viewtopic.php?pid=338235#p338235
[1] https://marc.info/?t=14738321621=1=2

[ 1438.965889] brcmfmac: CONSOLE: 028168.028 wl0.3: wlc_send_bar: seq 0xee tid 0
[ 1438.993255] brcmfmac: CONSOLE: 028168.055 wl0.3: wlc_send_bar: seq 0xef tid 0
[ 1439.000743] brcmfmac: CONSOLE: 028168.060 wl0.3: wlc_send_bar: seq 0xef tid 0
[ 1439.008060] brcmfmac: CONSOLE: 028168.067 wl0.3: wlc_send_bar: seq 0xef tid 0
[ 1439.015568] brcmfmac: CONSOLE: 028168.073 wl0.3: wlc_send_bar: seq 0xef tid 0
[ 1439.022786] brcmfmac: CONSOLE: 028168.077 wl0.3: wlc_send_bar: seq 0xef tid 0
[ 1439.030150] brcmfmac: CONSOLE: 028168.078 wl0.3: wlc_send_bar: seq 0xef tid 0
[ 1439.037342] brcmfmac: CONSOLE: 028168.081 wl0.3: wlc_send_bar: seq 0xef tid 0
[ 1439.044554] brcmfmac: CONSOLE: 028168.088 wl0.3: wlc_send_bar: seq 0xef tid 0
[ 1439.051756] brcmfmac: CONSOLE: 028168.090 wl0.3: wlc_send_bar: seq 0xef tid 0
[ 1439.362540] brcmfmac: CONSOLE: 028168.425 wl0.3: wlc_send_bar: seq 0xf0 tid 0
[ 1439.369827] brcmfmac: CONSOLE: 028168.431 wl0.3: wlc_send_bar: seq 0xf0 tid 0
[ 1439.378367] brcmfmac: CONSOLE: 028168.432 wl0.3: wlc_send_bar: seq 0xf0 tid 0
[ 1439.385590] brcmfmac: CONSOLE: 028168.434 wl0.3: wlc_send_bar: seq 0xf0 tid 0
[ 1439.392839] brcmfmac: CONSOLE: 028168.435 wl0.3: wlc_send_bar: seq 0xf0 tid 0
[ 1439.400041] brcmfmac: CONSOLE: 028168.438 wl0.3: wlc_send_bar: seq 0xf0 tid 0
[ 1439.407226] brcmfmac: CONSOLE: 028168.440 wl0.3: wlc_send_bar: seq 0xf0 tid 0
[ 1439.414848] brcmfmac: CONSOLE: 028168.447 wl0.3: wlc_send_bar: seq 0xf0 tid 0
[ 1439.422069] brcmfmac: CONSOLE: 028168.451 wl0.3: wlc_send_bar: seq 0xf0 tid 0
[ 1439.429269] brcmfmac: CONSOLE: 028168.454 wl0.3: wlc_send_bar: seq 0xf0 tid 0
[ 1439.436454] brcmfmac: CONSOLE: 028168.455 wl0.3: wlc_send_bar: seq 0xf0 tid 0
[ 1439.443653] brcmfmac: CONSOLE: 028168.457 wl0.3: wlc_send_bar: seq 0xf0 tid 0
[ 1439.450848] brcmfmac: CONSOLE: 028168.459 wl0.3: wlc_send_bar: seq 0xf0 tid 0
[ 1439.458031] brcmfmac: CONSOLE: 028168.462 wl0.3: wlc_send_bar: seq 0xf0 tid 0
[ 1439.465229] brcmfmac: CONSOLE: 028168.465 wl0.3: wlc_send_bar: seq 0xf0 tid 0
[ 1439.472426] brcmfmac: CONSOLE: 028168.468 wl0.3: wlc_send_bar: seq 0xf0 tid 0
[ 1439.479623] brcmfmac: CONSOLE: 028168.471 wl0.3: wlc_send_bar: seq 0xf0 tid 0
[ 1439.486807] brcmfmac: CONSOLE: 028168.472 wl0.3: wlc_send_bar: seq 0xf0 tid 0
[ 1439.494008] brcmfmac: CONSOLE: 028168.474 wl0.3: wlc_send_bar: seq 0xf0 tid 0
[ 1439.501200] brcmfmac: CONSOLE: 028168.476 wl0.3: wlc_send_bar: seq 0xf0 tid 0
[ 1439.510826] brcmfmac: CONSOLE: 028168.479 wl0.3: wlc_send_bar: seq 0xf0 tid 0
[ 1439.518032] brcmfmac: CONSOLE: 028168.487 wl0.3: wlc_send_bar: seq 0xf0 tid 0
[ 1439.525252] brcmfmac: CONSOLE: 028168.489 wl0.3: wlc_send_bar: seq 0xf0 tid 0
[ 1439.532456] brcmfmac: CONSOLE: 028168.492 wl0.3: wlc_send_bar: seq 0xf0 tid 0
[ 1439.539647] brcmfmac: CONSOLE: 028168.497 wl0.3: wlc_send_bar: seq 0xf0 tid 0
[ 1439.755188] brcmfmac: CONSOLE: 028168.815 wl0.3: wlc_send_bar: seq 0xf0 tid 0
[ 1439.762642] brcmfmac: CONSOLE: 028168.819 wl0.3: wlc_send_bar: seq 0xf0 tid 0
[ 1439.769854] brcmfmac: CONSOLE: 028168.821 wl0.3: wlc_send_bar: seq 0xf0 tid 0
[ 1439.777087] brcmfmac: CONSOLE: 028168.823 wl0.3: wlc_send_bar: seq 0xf0 tid 0
[ 1439.784293] brcmfmac: CONSOLE: 028168.824 wl0.3: wlc_send_bar: seq 0xf0 tid 0
[ 1439.791861] brcmfmac: CONSOLE: 028168.826 wl0.3: wlc_send_bar: seq 0xf0 tid 0
[ 1439.799057] brcmfmac: CONSOLE: 028168.829