Re: [OpenWrt-Devel] [BUG] kernel crash in br_netfilter
On 2016-03-08 12:06, Florian Westphal wrote:
>> My hot-fix to prevent the crash is to instead of passing the skb to NF_HOOK
>> directly pass it to br_handle_local_finish(). But having insufficient
>> insight into what is going on there, this is fighting the symptoms rather
>> than solving the root cause. Maybe it is even better to drop patch 120
>> (not tested yet)?
>
> Sorry, I don't know why this patch was not merged upstream and do not know
> why its in openwrt.

This patch exists because it is otherwise impossible to bridge a client
mode (4addr) WLAN interface when encryption is enabled. wpa_supplicant
needs to receive EAP packets before it will change the operstate to allow
the bridge and the rest of the network stack to do their thing.

This used to work a while back, and I think it got broken by this commit:

commit 576eb62598f10c8c7fd75703fe89010cdcfff596
Author: stephen hemminger
Date:   Fri Dec 28 18:15:22 2012 +

    bridge: respect RFC2863 operational state

    The bridge link detection should follow the operational state
    of the lower device, rather than the carrier bit. This allows devices
    like tunnels that are controlled by userspace control plane to work
    with bridge STP link management.

    Signed-off-by: Stephen Hemminger
    Reviewed-by: Flavio Leitner
    Signed-off-by: David S. Miller

Back then I proposed a patch for upstream inclusion and got some feedback.
Stephen sent me this patch, and I fixed it up a bit and re-submitted it.
I think it got lost somewhere in the process; after that I lost track and
didn't get around to re-submitting it.

So we kept the patch in OpenWrt because, as far as I know, the regression
still exists in current kernels.

- Felix
___
openwrt-devel mailing list
openwrt-devel@lists.openwrt.org
https://lists.openwrt.org/cgi-bin/mailman/listinfo/openwrt-devel
Re: [OpenWrt-Devel] [BUG] kernel crash in br_netfilter
On 02/29/2016 01:33 PM, Zefir Kurtisi wrote:
> I've been fighting a kernel bug that is producing random crashes around
> network / skb_layer for a long time and was able to isolate it (or one of
> its components) to the br_netfilter module.
>
> I am reproducing the bug with PowerPC (TL-WDR4900v1.3) and MIPS (DB120,
> ar71xx) based systems. Florian Westphal did not see it on kvm/x86, it is
> unclear whether this requires a physical system or is CPU specific. This
> bug is in the latest OpenWRT (tested HEAD is 03b15ae9), as it happens with
> firmwares built 2+ years ago, so it is no current regression but something
> that was there for a long time.
>
>
> Reproducing the crash
> 1. build the firmware for the system to test
>    * use default configuration
>    * ensure to select CONFIG_BRIDGE_NETFILTER in kernel_menuconfig
> 2. boot the device and access it over serial
> 3. ensure br-lan bridge has at least two active ports
>    * tested with ath9k + Ethernet (gianfar and ag71xx)
>    * if not enabled, enable radio0 and ensure wlan0 is in bridge
> 4. run: sysctl -w net.bridge.bridge-nf-call-iptables=1
> 5. from your host, continuously ping the device over Ethernet
> 6. run: ifconfig br-lan down
>
> The next ingress packet causes a fatal crash.
>
> Trace logs for MIPS and PPC are attached and hint to __nf_conntrack_confirm
>
>
> Let me know if I could provide more information to further isolate the
> problem.
>

I made progress on this issue. After wondering why the netfilter folks
were unable to reproduce it, it finally turned out that the problematic
code is private to OpenWrt, in

  target/linux/generic/patches-X/120-bridge_allow_receiption_on_disabled_port.patch

This patch causes reproducible kernel crashes under the conditions given
before. In essence, it leads to a double-free (de-reference of a poisoned
list) or use-after-destruction. For more details please check the manually
collected execution trace below.
The tl;dr version is this: an ingress packet to the Ethernet port of a
disabled bridge
1. gets passed to br_handle_frame()
2. enters the BR_STATE_DISABLED case in the mentioned patch
3. gets passed to the related NF_HOOK
   a) in br_nf_pre_routing() a conntrack context ct is created
   b) that in the same nf_iterate() is destroyed in
      br_nf_pre_routing_finish()
4. in the br_pass_frame_up() following the NF_HOOK
   a) ipv4_confirm() runs __nf_conntrack_confirm(ct) with invalid ct
   b) which attempts to nf_ct_del_from_dying_or_unconfirmed_list(ct)
   c) and with that de-references and writes to LIST_POISON2 in pprev

This is reproducible with different kernels (tested: 3.18 and 4.4), on
both PPC and MIPS systems (it should occur on basically any platform).

My hot-fix to prevent the crash is to pass the skb directly to
br_handle_local_finish() instead of passing it to NF_HOOK. But since I
have insufficient insight into what is going on there, this is fighting
the symptoms rather than solving the root cause. Maybe it is even better
to drop patch 120 (not tested yet)?
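
Step 4c can be illustrated with a small userspace model. This is only a
sketch under assumptions: it uses a plain NULL-terminated hlist rather
than the kernel's hlist_nulls variant, and the names hnode, hhead,
hnode_add_head(), hnode_del_checked() and demo_double_unlink() are made up
for the illustration. Only LIST_POISON2 and its 32-bit value 0x00200200
(the faulting address in the attached oopses) come from the report. The
kernel helper has no guard like the one shown here, so the second unlink
executes *pprev = next with pprev == LIST_POISON2 and faults.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's hlist node. */
struct hnode {
    struct hnode *next;
    struct hnode **pprev;   /* address of the pointer that points at us */
};

struct hhead {
    struct hnode *first;
};

/* Same numeric value the oops reports as the bad address on 32-bit. */
#define LIST_POISON2 ((struct hnode **)(uintptr_t)0x00200200)

static void hnode_add_head(struct hnode *n, struct hhead *h)
{
    n->next = h->first;
    if (h->first)
        h->first->pprev = &n->next;
    h->first = n;
    n->pprev = &h->first;
}

/*
 * Unlink n and poison pprev, as the trace shows for the conntrack entry.
 * Returns 0 on success and -1 if n was already unlinked. The -1 branch
 * is an addition for this demo; without it the write below would go to
 * the poison address.
 */
static int hnode_del_checked(struct hnode *n)
{
    if (n->pprev == LIST_POISON2)
        return -1;
    *n->pprev = n->next;
    if (n->next)
        n->next->pprev = n->pprev;
    n->pprev = LIST_POISON2;
    return 0;
}

/* First unlink models destroy_conntrack(), second models the confirm path. */
int demo_double_unlink(void)
{
    struct hhead unconfirmed = { NULL };
    struct hnode ct = { NULL, NULL };

    hnode_add_head(&ct, &unconfirmed);
    assert(unconfirmed.first == &ct);

    if (hnode_del_checked(&ct) != 0)
        return -2;                   /* not expected */
    return hnode_del_checked(&ct);   /* second unlink hits the poison */
}
```

In this model demo_double_unlink() reports the second unlink as rejected
instead of crashing, which is the point where the real kernel oopses.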
Cheers,
Zefir

---
br_handle_frame() with p->state = BR_STATE_DISABLED {
  NF_HOOK(br_handle_local_finish) {
    br_nf_pre_routing() {
      NF_HOOK(br_nf_pre_routing_finish) {
        ipv4_conntrack_in() {
          nf_conntrack_in()
            init_conntrack()
              ct = __nf_conntrack_alloc()
        }
        br_nf_pre_routing_finish() [=okfn] {
          NF_HOOK(br_handle_frame_finish) {
            br_handle_frame_finish() {
              frees skb and returns 0
            }
            nf_conntrack_destroy(ct) {
              destroy_conntrack(ct) {
                nf_ct_del_from_dying_or_unconfirmed_list(ct) {
                  ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode.pprev = LIST_POISON2
                }
                nf_conntrack_free(ct)
              }
            }
          }
        }
      }
      return NF_STOLEN
    }
    /* nf_iterate() returns NF_STOLEN, but nf_hook_slow() does not
       handle NF_STOLEN and returns 0 */
  }
  br_pass_frame_up() {
    ipv4_confirm() {
      __nf_conntrack_confirm(ct) {
        /* this one attempts to write to LIST_POISON2 and causes the oops */
        nf_ct_del_from_dying_or_unconfirmed_list(ct)
      }
    }
  }
}
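
The comment in the trace about NF_STOLEN can be modelled in userspace as
well. The following is a simplified sketch, not kernel code: the verdict
enum, struct pkt, run_hook() and the demo functions are hypothetical
names, and the model assumes (as the trace suggests) that the caller
carries on whenever the hook wrapper returns 0. It shows why a return
value of 0 is ambiguous: the caller cannot tell "okfn ran and returned 0"
from "a hook took ownership of the packet".

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for NF_DROP, NF_ACCEPT and NF_STOLEN. */
enum verdict { V_DROP, V_ACCEPT, V_STOLEN };

/* Stand-in for an skb; 'consumed' means someone else owns or freed it. */
struct pkt {
    int consumed;
};

typedef enum verdict (*hook_fn)(struct pkt *);
typedef int (*ok_fn)(struct pkt *);

/* Hook that takes ownership, like br_nf_pre_routing() in the trace. */
static enum verdict stealing_hook(struct pkt *p)
{
    p->consumed = 1;
    return V_STOLEN;
}

/* Hook that lets the packet through untouched. */
static enum verdict accepting_hook(struct pkt *p)
{
    (void)p;
    return V_ACCEPT;
}

/* Continuation that leaves the packet with the caller. */
static int local_finish(struct pkt *p)
{
    (void)p;
    return 0;
}

/* Model of the hook wrapper: 0 on stolen, okfn() result on accept. */
static int run_hook(hook_fn hook, struct pkt *p, ok_fn okfn)
{
    assert(hook != NULL && okfn != NULL);
    switch (hook(p)) {
    case V_ACCEPT:
        return okfn(p);
    case V_STOLEN:
        return 0;            /* indistinguishable from okfn() == 0 */
    default:
        p->consumed = 1;     /* dropped: packet is gone */
        return -1;
    }
}

/* Caller shaped like the disabled-port branch: continue if rc == 0. */
static int caller_touches_consumed_pkt(hook_fn hook)
{
    struct pkt p = { 0 };

    if (run_hook(hook, &p, local_finish) != 0)
        return 0;            /* error path, packet not used again */
    return p.consumed;       /* 1: the next step would use a dead packet */
}

int demo_stolen_case(void) { return caller_touches_consumed_pkt(stealing_hook); }
int demo_accept_case(void) { return caller_touches_consumed_pkt(accepting_hook); }
```

In the stolen case the model's caller goes on to use a packet it no longer
owns, which corresponds to br_pass_frame_up() running after the NF_HOOK in
the trace; in the accept case it does not.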
[OpenWrt-Devel] [BUG] kernel crash in br_netfilter
I've been fighting a kernel bug that is producing random crashes around
network / skb_layer for a long time and was able to isolate it (or one of
its components) to the br_netfilter module.

I am reproducing the bug with PowerPC (TL-WDR4900v1.3) and MIPS (DB120,
ar71xx) based systems. Florian Westphal did not see it on kvm/x86; it is
unclear whether this requires a physical system or is CPU specific. The
bug is in the latest OpenWRT (tested HEAD is 03b15ae9), and it also
happens with firmware built 2+ years ago, so it is not a recent regression
but something that has been there for a long time.


Reproducing the crash
1. build the firmware for the system to test
   * use default configuration
   * ensure to select CONFIG_BRIDGE_NETFILTER in kernel_menuconfig
2. boot the device and access it over serial
3. ensure br-lan bridge has at least two active ports
   * tested with ath9k + Ethernet (gianfar and ag71xx)
   * if not enabled, enable radio0 and ensure wlan0 is in bridge
4. run: sysctl -w net.bridge.bridge-nf-call-iptables=1
5. from your host, continuously ping the device over Ethernet
6. run: ifconfig br-lan down

The next ingress packet causes a fatal crash.

Trace logs for MIPS and PPC are attached and point to
__nf_conntrack_confirm.


Let me know if I can provide more information to further isolate the
problem.
Thanks,
Zefir

[ 191.321163] br-lan: port 1(eth0.1) entered disabled state
[ 192.646656] CPU 0 Unable to handle kernel paging request at virtual address 00200200, epc == 87000670, ra == 870018f4
[ 192.657446] Oops[#1]:
[ 192.659761] CPU: 0 PID: 0 Comm: swapper Not tainted 4.1.16 #1
[ 192.665593] task: 803ce958 ti: 803c8000 task.ti: 803c8000
[ 192.671069] $ 0 : 8001 00200200
[ 192.676410] $ 4 : 86c0fa20 0001 a44465b9
[ 192.681742] $ 8 : 86c0fa78 86c0fa78
[ 192.687075] $12 : 115f0002 c0a80114
[ 192.692408] $16 : 86c0fa20 06cc 07b6 803e5af0
[ 192.697742] $20 : 06cc 0004 803e5af0
[ 192.703082] $24 : 871367d4
[ 192.708416] $28 : 803c8000 803c9a28 86c0fa60 870018f4
[ 192.713750] Hi: 07b6
[ 192.716670] Lo: b5a74800
[ 192.719628] epc : 87000670 nf_conntrack_find_get+0x68/0x88 [nf_conntrack]
[ 192.726698] ra: 870018f4 __nf_conntrack_confirm+0xc0/0x364 [nf_conntrack]
[ 192.733927] Status: 1100fc03 KERNEL EXL IE
[ 192.738196] Cause : 808c
[ 192.741117] BadVA : 00200200
[ 192.744040] PrId : 0001974c (MIPS 74Kc)
[ 192.748015] Modules linked in: ath9k ath9k_common pppoe ppp_async iptable_nat ath9k_hw ath pppox ppp_generic nf_nat_ipv4 nf_conntrack_ipv6 nf_conntrack_ipv4 mac80211 ipt_REJECT ipt_MASQUERADE cfg80211 xt_time xt_tcpudp xt_state xt_nan
[ 192.816284] Process swapper (pid: 0, threadinfo=803c8000, task=803ce958, tls=)
[ 192.824311] Stack : 87342240 87135744 0001 0200 803c9aac 803c9aec 803cabac 87342240
        0001 0004 0003 ff62 8026efb0 86c0fa20 87342240 8731f000
        86c18100 87137058 87342240 803c9aec 87342240 803cab24 fffb 0001
        8026f090 8734ca80 87865b7c 0008 87137058 803caba4 87342240 0001
        8731f000 87342240 8731f05c
        ...
[ 192.860643] Call Trace:
[ 192.863133] [<87000670>] nf_conntrack_find_get+0x68/0x88 [nf_conntrack]
[ 192.869850]
[ 192.871356] Code: 00020336 8c820008 30450001 <14a2> ac62 ac430004 3c020020 24420200 ac82000c
[ 192.881512] ---[ end trace 1e716eb17e40af8b ]---
[ 192.888247] Kernel panic - not syncing: Fatal exception in interrupt
[ 192.895654] Rebooting in 3 seconds..

[ 69.834129] br0: port 3(eth1) entered disabled state
[ 69.835427] br0: port 1(wlan0) entered disabled state
[ 77.493530] Unable to handle kernel paging request for data at address 0x00200200
[ 77.495415] Faulting instruction address: 0xd32ce874
[ 77.496669] Oops: Kernel access of bad area, sig: 11 [#1]
[ 77.498027] DT50
[ 77.498493] Modules linked in: ath9k ath9k_common iptable_nat ath9k_hw ath nf_nat_ipv4 nf_conntrack_ipv4 mac80211 ipt_REJECT ipt_MASQUERADE cfg80211 xt_time xt_tcpudp xt_tcpmss xt_string xt_statistic xt_state xt_recent xt_quota xt_pkh
[ 77.522830] CPU: 0 PID: 0 Comm: swapper Not tainted 3.18.23 #10
[ 77.524323] task: c035b300 ti: cffe6000 task.ti: c037
[ 77.525684] NIP: d32ce874 LR: d32cffec CTR: d35b13c4
[ 77.526936] REGS: cffe7c60 TRAP: 0300 Not tainted (3.18.23)
[ 77.528403] MSR: 00029000 CR: 42002082 XER: 2000
[ 77.529951] DEAR: 00200200 ESR: 0080
GPR00: c762b218 cffe7d10 c035b300 c762b1c0 8db8d32d d72044d0
GPR08: 0001 8001 00200200 332f4b8b 22002082 10025420 0020 c7654080
GPR16: c7504d80 c7b7e540 cfba4678 c7b4a000 86dd 8000 0002
GPR24: c762b200 0e25 02b1 c0366fc8 0225 06b1 c762b1c0
[ 77.538144] NIP [d32ce874] 0xd32ce874 [