Re: [OpenWrt-Devel] [BUG] kernel crash in br_netfilter

2016-03-08 Thread Felix Fietkau
On 2016-03-08 12:06, Florian Westphal wrote:
>> My hot-fix to prevent the crash is to instead of passing the skb to NF_HOOK
>> directly pass it to br_handle_local_finish(). But having insufficient
>> insight into what is going on there, this is fighting the symptoms rather
>> than solving the root cause. Maybe it is even better to drop patch 120
>> (not tested yet)?
> 
> Sorry, I don't know why this patch was not merged upstream and do not know
> why it's in openwrt.
This patch exists because it's otherwise impossible to bridge a client
mode (4addr) WLAN interface when encryption is enabled.

wpa_supplicant needs to receive EAP packets before it will change the
operstate to allow the bridge and the rest of the network stack to do
their thing.

This used to work a while back, and I think it got broken by this
commit:

commit 576eb62598f10c8c7fd75703fe89010cdcfff596
Author: stephen hemminger 
Date:   Fri Dec 28 18:15:22 2012 +

 bridge: respect RFC2863 operational state

 The bridge link detection should follow the operational state
 of the lower device, rather than the carrier bit. This allows devices
 like tunnels that are controlled by userspace control plane to work
 with bridge STP link management.

 Signed-off-by: Stephen Hemminger 
 Reviewed-by: Flavio Leitner 
 Signed-off-by: David S. Miller 
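
To make the effect of that commit concrete, here is a minimal userspace
sketch of the two link tests. It is not kernel code: struct model_dev and
both function names are hypothetical, and only the state numbering and the
"UP or UNKNOWN" rule are meant to mirror what the kernel does in
<linux/if.h> and netif_oper_up(). The assumption is that an associated but
not yet authenticated client interface has carrier while its operstate is
still DORMANT.

```c
#include <assert.h>
#include <stdbool.h>

/* RFC 2863 operational states, numbered as in the kernel's <linux/if.h>. */
enum oper_state {
    OPER_UNKNOWN = 0,
    OPER_NOTPRESENT,
    OPER_DOWN,
    OPER_LOWERLAYERDOWN,
    OPER_TESTING,
    OPER_DORMANT,
    OPER_UP,
};

/* Hypothetical stand-in for a net_device; only the two fields that matter. */
struct model_dev {
    bool carrier_ok;
    enum oper_state operstate;
};

/* Old rule: the bridge port is usable as soon as the carrier bit is set. */
bool link_ok_by_carrier(const struct model_dev *dev)
{
    return dev->carrier_ok;
}

/* New rule: same test as the kernel's netif_oper_up() (UP or UNKNOWN). */
bool link_ok_by_operstate(const struct model_dev *dev)
{
    return dev->operstate == OPER_UP || dev->operstate == OPER_UNKNOWN;
}
```

With carrier_ok set and operstate at OPER_DORMANT, the first test passes and
the second fails. In this model that is the window in which the bridge port
stays disabled while wpa_supplicant still needs to see EAP frames.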

Back then I proposed a patch for upstream inclusion, got some feedback,
Stephen sent me this patch and I fixed it up a bit and re-submitted it.
I think it got lost somewhere in the process and after that I lost track
and didn't get around to re-submitting it.

So we kept the patch in OpenWrt because as far as I know, the regression
still exists in current kernels.

- Felix
___
openwrt-devel mailing list
openwrt-devel@lists.openwrt.org
https://lists.openwrt.org/cgi-bin/mailman/listinfo/openwrt-devel


Re: [OpenWrt-Devel] [BUG] kernel crash in br_netfilter

2016-03-07 Thread Zefir Kurtisi
On 02/29/2016 01:33 PM, Zefir Kurtisi wrote:
> I've been fighting a kernel bug that is producing random crashes around
> network / skb_layer for a long time and was able to isolate it (or one of
> its components) to the br_netfilter module.
> 
> I am reproducing the bug with PowerPC (TL-WDR4900v1.3) and MIPS (DB120,
> ar71xx) based systems. Florian Westphal did not see it on kvm/x86, it is
> unclear whether this requires a physical system or is CPU specific. This bug
> is in the latest OpenWRT (tested HEAD is 03b15ae9), as it happens with
> firmwares built 2+ years ago, so it is no current regression but something
> that was there for a long time.
> 
> 
> Reproducing the crash
> 1. build the firmware for the system to test
>    * use default configuration
>    * ensure to select CONFIG_BRIDGE_NETFILTER in kernel_menuconfig
> 2. boot the device and access it over serial
> 3. ensure br-lan bridge has at least two active ports
>    * tested with ath9k + Ethernet (gianfar and ag71xx)
>    * if not enabled, enable radio0 and ensure wlan0 is in bridge
> 4. run: sysctl -w net.bridge.bridge-nf-call-iptables=1
> 5. from your host, continuously ping the device over Ethernet
> 6. run: ifconfig br-lan down
> 
> The next ingress packet causes a fatal crash.
> 
> Trace logs for MIPS and PPC are attached and hint to __nf_conntrack_confirm
> 
> 
> Let me know if I could provide more information to further isolate the
> problem.
> 
> 
I made progress on this issue. After wondering why the netfilter folks were
unable to reproduce it, it finally turned out that the problematic code is
private to OpenWrt, in
target/linux/generic/patches-X/120-bridge_allow_receiption_on_disabled_port.patch

This is causing reproducible kernel crashes under the conditions given before.
In essence, it leads to a double-free (de-reference of a poisoned list) or
use-after-destruction. For more details please check the manually collected
execution trace below. The TL;DR version is this: an ingress packet to the
Ethernet port of a disabled bridge
1. gets passed to br_handle_frame()
2. enters the BR_STATE_DISABLED case in the mentioned patch
3. gets passed to the related NF_HOOK
   a) in br_nf_pre_routing() a conntrack context ct is created
   b) that in the same nf_iterate() is destroyed in br_nf_pre_routing_finish()
4. in the br_pass_frame_up() following the NF_HOOK
   a) ipv4_confirm() runs __nf_conntrack_confirm(ct) with invalid ct
   b) which attempts to nf_ct_del_from_dying_or_unconfirmed_list(ct)
   c) and with that de-references and writes to LIST_POISON2 in pprev
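
The pointer mechanics behind steps 3b and 4c can be sketched in plain
userspace C. This is only a model under stated assumptions: the model_*
names are hypothetical, the node is a simplified hlist node without the
"nulls" marker, and the conntrack object is not actually freed here. The
poison value 0x00200200 is the one 32-bit kernels of that era use for
LIST_POISON2, which is also the faulting address in the attached oops logs.
Instead of faulting, the second delete returns the address the kernel would
have written to.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* LIST_POISON2 as used by 3.18/4.1-era 32-bit kernels (linux/poison.h). */
#define MODEL_LIST_POISON2 ((uintptr_t)0x00200200)

/* Simplified hlist node: pprev points at the previous "next" pointer. */
struct model_node {
    struct model_node *next;
    struct model_node **pprev;
};

struct model_head {
    struct model_node *first;
};

/* Insert n at the front of the list, like putting ct on the unconfirmed list. */
void model_add_head(struct model_node *n, struct model_head *h)
{
    n->next = h->first;
    if (h->first)
        h->first->pprev = &n->next;
    h->first = n;
    n->pprev = &h->first;
}

/*
 * Unlink n and poison its pprev, as the kernel's hlist delete helpers do.
 * Returns 0 on success. If pprev is already poisoned, the kernel would
 * store through it and oops; this model returns that address instead.
 */
uintptr_t model_del(struct model_node *n)
{
    if ((uintptr_t)n->pprev == MODEL_LIST_POISON2)
        return (uintptr_t)n->pprev; /* kernel: "*pprev = next" faults here */

    *n->pprev = n->next;
    if (n->next)
        n->next->pprev = n->pprev;
    n->pprev = (struct model_node **)MODEL_LIST_POISON2;
    return 0;
}
```

Adding a node and calling model_del() on it twice gives 0 the first time and
0x00200200 the second time, which corresponds to the destroy in step 3b
followed by the confirm in step 4.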

This is reproducible with different kernels (tested: 3.18 and 4.4), on both
PPC and MIPS systems (it should basically happen on any platform).

My hot-fix to prevent the crash is to pass the skb directly to
br_handle_local_finish() instead of passing it to NF_HOOK. But since I have
insufficient insight into what is going on there, this is fighting the
symptoms rather than solving the root cause. Maybe it would be even better to
drop patch 120 (not tested yet)?



Cheers,
Zefir

---

br_handle_frame() with p->state = BR_STATE_DISABLED {
  NF_HOOK(br_handle_local_finish) {
    br_nf_pre_routing() {
      NF_HOOK(br_nf_pre_routing_finish) {
        ipv4_conntrack_in() {
          nf_conntrack_in()
            init_conntrack()
              ct = __nf_conntrack_alloc()
        }
        br_nf_pre_routing_finish() [=okfn] {
          NF_HOOK(br_handle_frame_finish) {
            br_handle_frame_finish() { frees skb and returns 0 }
            nf_conntrack_destroy(ct) {
              destroy_conntrack(ct) {
                nf_ct_del_from_dying_or_unconfirmed_list(ct) {
                  ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode.pprev = LIST_POISON2
                }
                nf_conntrack_free(ct)
              }
            }
          }
        }
      }
      return NF_STOLEN
    }
    /* nf_iterate() returns NF_STOLEN, but nf_hook_slow() does not
       handle NF_STOLEN and returns 0 */
  }
  br_pass_frame_up() {
    ipv4_confirm() {
      __nf_conntrack_confirm(ct) {
        /* this one attempts to write to LIST_POISON2 and causes the oops */
        nf_ct_del_from_dying_or_unconfirmed_list(ct)
      }
    }
  }
}
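
The return-value ambiguity noted in the trace comment can be shown with a
small userspace model. It is a sketch, not the kernel implementation: all
model_* names are hypothetical, the skb is reduced to a "freed" flag, and
the okfn is assumed to return 0 like br_handle_local_finish(). The verdict
numbers are the ones from the kernel's netfilter uapi header. The point is
that a caller of NF_HOOK sees 0 both when the okfn ran and when a hook stole
and released the skb.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Netfilter verdict values as defined in the kernel's uapi netfilter.h. */
enum { MODEL_NF_DROP = 0, MODEL_NF_ACCEPT = 1, MODEL_NF_STOLEN = 2 };

/* Hypothetical stand-in for an skb: only tracks whether it was released. */
struct model_skb {
    bool freed;
};

/* Simplified nf_hook_slow(): 1 = run okfn, <0 = dropped, 0 otherwise. */
int model_hook_slow(struct model_skb *skb, int verdict)
{
    if (verdict == MODEL_NF_ACCEPT)
        return 1;
    if (verdict == MODEL_NF_DROP) {
        skb->freed = true;
        return -EPERM;
    }
    return 0; /* NF_STOLEN: a hook took ownership of the skb */
}

/* okfn stand-in, assumed to return 0 ("process further"). */
int model_okfn(struct model_skb *skb)
{
    (void)skb;
    return 0;
}

/* Simplified NF_HOOK(): calls okfn only if the hooks accepted the skb. */
int model_nf_hook(struct model_skb *skb, int verdict)
{
    int ret = model_hook_slow(skb, verdict);

    if (ret == 1)
        ret = model_okfn(skb);
    return ret;
}

/* Pre-routing hook stand-in: consumes the skb and reports NF_STOLEN. */
int model_stealing_hook(struct model_skb *skb)
{
    skb->freed = true;
    return MODEL_NF_STOLEN;
}
```

In this model a caller that only checks for a non-zero return from
model_nf_hook() would carry on with an skb whose freed flag is already set,
which matches the br_pass_frame_up() call on a consumed skb in the trace
above.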


[OpenWrt-Devel] [BUG] kernel crash in br_netfilter

2016-02-29 Thread Zefir Kurtisi
I've been fighting a kernel bug that is producing random crashes around
network / skb_layer for a long time and was able to isolate it (or one of its
components) to the br_netfilter module.

I am reproducing the bug with PowerPC (TL-WDR4900v1.3) and MIPS (DB120, ar71xx)
based systems. Florian Westphal did not see it on kvm/x86; it is unclear whether
this requires a physical system or is CPU specific. This bug is in the latest
OpenWRT (tested HEAD is 03b15ae9), but it also happens with firmwares built 2+
years ago, so it is not a recent regression but something that has been there
for a long time.


Reproducing the crash
1. build the firmware for the system to test
   * use default configuration
   * ensure to select CONFIG_BRIDGE_NETFILTER in kernel_menuconfig
2. boot the device and access it over serial
3. ensure br-lan bridge has at least two active ports
   * tested with ath9k + Ethernet (gianfar and ag71xx)
   * if not enabled, enable radio0 and ensure wlan0 is in bridge
4. run: sysctl -w net.bridge.bridge-nf-call-iptables=1
5. from your host, continuously ping the device over Ethernet
6. run: ifconfig br-lan down

The next ingress packet causes a fatal crash.

Trace logs for MIPS and PPC are attached and point to __nf_conntrack_confirm.


Let me know if I can provide more information to further isolate the problem.


Thanks,
Zefir


[  191.321163] br-lan: port 1(eth0.1) entered disabled state
[  192.646656] CPU 0 Unable to handle kernel paging request at virtual address 00200200, epc == 87000670, ra == 870018f4
[  192.657446] Oops[#1]:
[  192.659761] CPU: 0 PID: 0 Comm: swapper Not tainted 4.1.16 #1
[  192.665593] task: 803ce958 ti: 803c8000 task.ti: 803c8000
[  192.671069] $ 0   :   8001 00200200
[  192.676410] $ 4   : 86c0fa20 0001  a44465b9
[  192.681742] $ 8   : 86c0fa78 86c0fa78  
[  192.687075] $12   : 115f0002   c0a80114
[  192.692408] $16   : 86c0fa20 06cc 07b6 803e5af0
[  192.697742] $20   : 06cc 0004 803e5af0 
[  192.703082] $24   :  871367d4  
[  192.708416] $28   : 803c8000 803c9a28 86c0fa60 870018f4
[  192.713750] Hi: 07b6
[  192.716670] Lo: b5a74800
[  192.719628] epc   : 87000670 nf_conntrack_find_get+0x68/0x88 [nf_conntrack]
[  192.726698] ra: 870018f4 __nf_conntrack_confirm+0xc0/0x364 [nf_conntrack]
[  192.733927] Status: 1100fc03 KERNEL EXL IE 
[  192.738196] Cause : 808c
[  192.741117] BadVA : 00200200
[  192.744040] PrId  : 0001974c (MIPS 74Kc)
[  192.748015] Modules linked in: ath9k ath9k_common pppoe ppp_async iptable_nat ath9k_hw ath pppox ppp_generic nf_nat_ipv4 nf_conntrack_ipv6 nf_conntrack_ipv4 mac80211 ipt_REJECT ipt_MASQUERADE cfg80211 xt_time xt_tcpudp xt_state xt_nan
[  192.816284] Process swapper (pid: 0, threadinfo=803c8000, task=803ce958, tls=)
[  192.824311] Stack : 87342240 87135744 0001 0200 803c9aac 803c9aec 803cabac 87342240
  0001 0004 0003 ff62  8026efb0 86c0fa20 87342240
   8731f000 86c18100 87137058  87342240 803c9aec 87342240
  803cab24 fffb 0001 8026f090 8734ca80 87865b7c  0008
   87137058 803caba4 87342240 0001 8731f000 87342240 8731f05c
  ...
[  192.860643] Call Trace:
[  192.863133] [<87000670>] nf_conntrack_find_get+0x68/0x88 [nf_conntrack]
[  192.869850] 
[  192.871356] 
Code: 00020336  8c820008  30450001 <14a2> ac62  ac430004  3c020020  
24420200  ac82000c 
[  192.881512] ---[ end trace 1e716eb17e40af8b ]---
[  192.888247] Kernel panic - not syncing: Fatal exception in interrupt
[  192.895654] Rebooting in 3 seconds..

[   69.834129] br0: port 3(eth1) entered disabled state
[   69.835427] br0: port 1(wlan0) entered disabled state
[   77.493530] Unable to handle kernel paging request for data at address 0x00200200
[   77.495415] Faulting instruction address: 0xd32ce874
[   77.496669] Oops: Kernel access of bad area, sig: 11 [#1]
[   77.498027] DT50
[   77.498493] Modules linked in: ath9k ath9k_common iptable_nat ath9k_hw ath nf_nat_ipv4 nf_conntrack_ipv4 mac80211 ipt_REJECT ipt_MASQUERADE cfg80211 xt_time xt_tcpudp xt_tcpmss xt_string xt_statistic xt_state xt_recent xt_quota xt_pkh
[   77.522830] CPU: 0 PID: 0 Comm: swapper Not tainted 3.18.23 #10
[   77.524323] task: c035b300 ti: cffe6000 task.ti: c037
[   77.525684] NIP: d32ce874 LR: d32cffec CTR: d35b13c4
[   77.526936] REGS: cffe7c60 TRAP: 0300   Not tainted  (3.18.23)
[   77.528403] MSR: 00029000   CR: 42002082  XER: 2000
[   77.529951] DEAR: 00200200 ESR: 0080 
GPR00: c762b218 cffe7d10 c035b300 c762b1c0 8db8d32d d72044d0   
GPR08: 0001 8001 00200200 332f4b8b 22002082 10025420 0020 c7654080 
GPR16: c7504d80 c7b7e540 cfba4678 c7b4a000 86dd  8000 0002 
GPR24: c762b200 0e25 02b1 c0366fc8 0225 06b1  c762b1c0 
[   77.538144] NIP [d32ce874] 0xd32ce874
[