Re: [PATCH] net: Fix potential NULL pointer dereference in __skb_try_recv_datagram
On Tue, Jan 5, 2016 at 3:39 PM, Eric Dumazet <eric.duma...@gmail.com> wrote:
> On Tue, 2016-01-05 at 15:34 +0100, Jacob Siverskog wrote:
>> On Tue, Jan 5, 2016 at 3:14 PM, Eric Dumazet <eric.duma...@gmail.com> wrote:
>
>> >
>> > You might build a kernel with KASAN support to get maybe more chances to
>> > trigger the bug.
>> >
>> > ( https://www.kernel.org/doc/Documentation/kasan.txt )
>> >
>>
>> Ah. Doesn't seem to be supported on arm(32) unfortunately.
>
> Then you could at least use standard debugging features :
>
> CONFIG_SLAB=y
> CONFIG_SLABINFO=y
> CONFIG_DEBUG_SLAB=y
> CONFIG_DEBUG_SLAB_LEAK=y
>
> (Or equivalent SLUB options)
>
> and
>
> CONFIG_DEBUG_PAGEALLOC=y
>
> (If arm(32) has CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y)

I tried with those enabled and while toggling power on the Bluetooth
interface I usually get this after a few iterations:

kernel: Bluetooth: Unable to push skb to HCI core(-6)
kernel: (stc): proto stack 4's ->recv failed
kernel: Slab corruption (Not tainted): skbuff_head_cache start=c08a8a00, len=176
kernel: 0a0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6a 6b 6b a5  jkk.
kernel: Prev obj: start=c08a8940, len=176
kernel: 000: 00 00 00 00 00 00 00 00 31 73 52 00 43 17 2b 14  1sR.C.+.
kernel: 010: 00 00 00 00 00 00 00 00 04 00 00 00 01 00 00 00
kernel: Next obj: start=c08a8ac0, len=176
kernel: 000: 00 00 00 00 00 00 00 00 01 42 f6 50 36 17 2b 14  .B.P6.+.
kernel: 010: 00 00 00 00 00 00 00 00 04 00 00 00 01 00 00 00

The "Unable to push skb" and "recv failed" lines always appear before
the corruption. Unfortunately, the corruptions occur also with your patch.
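Editor's sketch, not part of the thread: with slab debugging enabled, freed objects are filled with the poison byte 0x6b and terminated with 0xa5 (POISON_FREE and POISON_END in include/linux/poison.h). The dump above shows a single 0x6a at offset 0xac, which is what a write into an already freed object looks like. The user-space helper below is hypothetical and only reproduces that check; the real one lives in the slab allocator.

```c
#include <stddef.h>
#include <string.h>

#define POISON_FREE 0x6b   /* fill byte for freed objects */
#define POISON_END  0xa5   /* marker stored in the last byte */

/*
 * Return the offset of the first byte that no longer matches the poison
 * pattern, or -1 if the object is intact. Hypothetical helper name.
 */
static long first_corrupt_offset(const unsigned char *obj, size_t len)
{
    size_t i;

    if (len == 0)
        return -1;
    for (i = 0; i + 1 < len; i++)
        if (obj[i] != POISON_FREE)
            return (long)i;
    return obj[len - 1] == POISON_END ? -1 : (long)(len - 1);
}

/* Rebuild the reported 176-byte object: poison everywhere, 0x6a at 0xac. */
static long scan_reported_object(void)
{
    unsigned char obj[176];

    memset(obj, POISON_FREE, sizeof(obj));
    obj[sizeof(obj) - 1] = POISON_END;
    obj[0xac] = 0x6a;   /* the byte shown as 6a on the "0a0:" dump line */
    return first_corrupt_offset(obj, sizeof(obj));
}
```

Calling scan_reported_object() yields the offset 0xac, the same position the kernel report points at.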
Re: [PATCH] net: Fix potential NULL pointer dereference in __skb_try_recv_datagram
On Wed, Jan 20, 2016 at 4:48 PM, Eric Dumazet <eric.duma...@gmail.com> wrote:
> On Wed, 2016-01-20 at 16:06 +0100, Jacob Siverskog wrote:
>> On Tue, Jan 5, 2016 at 3:39 PM, Eric Dumazet <eric.duma...@gmail.com> wrote:
>> > On Tue, 2016-01-05 at 15:34 +0100, Jacob Siverskog wrote:
>> >> On Tue, Jan 5, 2016 at 3:14 PM, Eric Dumazet <eric.duma...@gmail.com> wrote:
>> >
>> >> >
>> >> > You might build a kernel with KASAN support to get maybe more chances to
>> >> > trigger the bug.
>> >> >
>> >> > ( https://www.kernel.org/doc/Documentation/kasan.txt )
>> >> >
>> >>
>> >> Ah. Doesn't seem to be supported on arm(32) unfortunately.
>> >
>> > Then you could at least use standard debugging features :
>> >
>> > CONFIG_SLAB=y
>> > CONFIG_SLABINFO=y
>> > CONFIG_DEBUG_SLAB=y
>> > CONFIG_DEBUG_SLAB_LEAK=y
>> >
>> > (Or equivalent SLUB options)
>> >
>> > and
>> >
>> > CONFIG_DEBUG_PAGEALLOC=y
>> >
>> > (If arm(32) has CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y)
>>
>> I tried with those enabled and while toggling power on the Bluetooth
>> interface I usually get this after a few iterations:
>> kernel: Bluetooth: Unable to push skb to HCI core(-6)
>
> Well, this code seems to be quite buggy.
>
> I do not have time to audit it, but 5 minutes are enough to spot 2
> issues.
>
> skb, once given to another queue/layer should not be accessed anymore.
>

Ok. Unfortunately I still see the slab corruption even with your changes.
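Editor's sketch, not part of the thread: the rule quoted above ("skb, once given to another queue/layer should not be accessed anymore") can be illustrated in user space. Every name below (struct buf, upper_recv, driver_recv, rx_bytes) is hypothetical, and the snippet does not reproduce the driver code under discussion. It only shows the safe ordering: copy what you need, hand the buffer off, then never touch it again.

```c
#include <stdlib.h>

/* Hypothetical stand-in for struct sk_buff. */
struct buf {
    size_t len;
};

static size_t rx_bytes;   /* hypothetical receive statistics counter */

/* Hypothetical upper layer. It takes ownership of b and frees it,
 * whether it succeeds or fails. */
static int upper_recv(struct buf *b, int fail)
{
    free(b);
    return fail ? -6 : 0;   /* -6 only mirrors the "(-6)" in the log */
}

static int driver_recv(struct buf *b, int fail)
{
    size_t len = b->len;            /* copy what is needed before the hand-off */
    int err = upper_recv(b, fail);

    /* From here on b belongs to the other layer: no b->len, no free(b). */
    if (err < 0)
        return err;
    rx_bytes += len;
    return 0;
}

/* Allocate a buffer of the given length and push it through driver_recv(). */
static int demo_handoff(size_t len, int fail)
{
    struct buf *b = malloc(sizeof(*b));

    if (!b)
        return -1;
    b->len = len;
    return driver_recv(b, fail);
}
```

Reading b->len after upper_recv(), or freeing b again on the error path, would be a use-after-free or a double free. Both kinds of bug tend to surface later as slab corruption in an unrelated object.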
Re: [PATCH] net: Fix potential NULL pointer dereference in __skb_try_recv_datagram
On Tue, Jan 5, 2016 at 3:14 PM, Eric Dumazet <eric.duma...@gmail.com> wrote:
> On Tue, 2016-01-05 at 12:07 +0100, Jacob Siverskog wrote:
>> On Mon, Jan 4, 2016 at 4:25 PM, Eric Dumazet <eric.duma...@gmail.com> wrote:
>> > On Mon, 2016-01-04 at 10:10 +0100, Jacob Siverskog wrote:
>> >> On Wed, Dec 30, 2015 at 11:30 PM, Cong Wang <xiyou.wangc...@gmail.com> wrote:
>> >> > On Wed, Dec 30, 2015 at 6:30 AM, Jacob Siverskog
>> >> > <jacob@teenage.engineering> wrote:
>> >> >> On Wed, Dec 30, 2015 at 2:26 PM, Eric Dumazet <eduma...@google.com> wrote:
>> >> >>> How often can you trigger this bug ?
>> >> >>
>> >> >> Ok. I don't have a good repro to trigger it unfortunately, I've seen
>> >> >> it just a few times when bringing up/down network interfaces. Does
>> >> >> the trace give any clue?
>> >> >>
>> >> >
>> >> > A little bit. You need to help people to narrow down the problem
>> >> > because there are too many places using skb->next and skb->prev.
>> >> >
>> >> > Since you mentioned it seems related to network interface flip,
>> >> > what network interfaces are you using? What's is your TC setup?
>> >> >
>> >> > Thanks.
>> >>
>> >> The system contains only one physical network interface (TI WL1837,
>> >> wl18xx module).
>> >> The state prior to the crash was as follows:
>> >> - One virtual network interface active (as STA, associated with access
>> >> point)
>> >> - Bluetooth (BLE only) active (same physical chip, co-existence,
>> >> btwilink/st_drv modules)
>> >>
>> >> Actions made around the time of the crash:
>> >> - Bluetooth disabled
>> >> - One additional virtual network interface brought up (also as STA)
>> >>
>> >> I believe the crash occurred between these two actions. I just saw
>> >> that there are some interesting events in the log prior to the crash:
>> >> kernel: Bluetooth: Unable to push skb to HCI core(-6)
>> >> kernel: (stc): proto stack 4's ->recv failed
>> >> kernel: (stc): remove_channel_from_table: id 3
>> >> kernel: (stc): remove_channel_from_table: id 2
>> >> kernel: (stc): remove_channel_from_table: id 4
>> >> kernel: (stc): all chnl_ids unregistered
>> >> kernel: (stk) :ldisc_install = 0(stc): st_tty_close
>> >>
>> >> The first print is from btwilink.c. However, I can't see the
>> >> connection between Bluetooth (BLE) and UDP/IPv6 (we're not using
>> >> 6LoWPAN or anything similar).
>> >>
>> >> Thanks, Jacob
>> >
>> > Definitely these details are useful ;)
>> >
>> > Could you try :
>> >
>> > diff --git a/drivers/misc/ti-st/st_core.c b/drivers/misc/ti-st/st_core.c
>> > index 6e3af8b42cdd..0c99a74fb895 100644
>> > --- a/drivers/misc/ti-st/st_core.c
>> > +++ b/drivers/misc/ti-st/st_core.c
>> > @@ -912,7 +912,9 @@ void st_core_exit(struct st_data_s *st_gdata)
>> >  		skb_queue_purge(&st_gdata->txq);
>> >  		skb_queue_purge(&st_gdata->tx_waitq);
>> >  		kfree_skb(st_gdata->rx_skb);
>> > +		st_gdata->rx_skb = NULL;
>> >  		kfree_skb(st_gdata->tx_skb);
>> > +		st_gdata->tx_skb = NULL;
>> >  		/* TTY ldisc cleanup */
>> >  		err = tty_unregister_ldisc(N_TI_WL);
>> >  		if (err)
>> >
>> >
>>
>> Sure. Since I don't have a good way to trigger the initial issue, I
>> can't really know if there is a difference with your patch. However,
>> normal usage seems to work as expected with your patch. I've tried to
>> reproduce the initial issue with and without your patch repeatedly for
>> hours and have not seen any crash in any of the runs so far.
>> --
>
> You might build a kernel with KASAN support to get maybe more chances to
> trigger the bug.
>
> ( https://www.kernel.org/doc/Documentation/kasan.txt )
>

Ah. Doesn't seem to be supported on arm(32) unfortunately.
-- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] net: Fix potential NULL pointer dereference in __skb_try_recv_datagram
On Mon, Jan 4, 2016 at 4:25 PM, Eric Dumazet <eric.duma...@gmail.com> wrote:
> On Mon, 2016-01-04 at 10:10 +0100, Jacob Siverskog wrote:
>> On Wed, Dec 30, 2015 at 11:30 PM, Cong Wang <xiyou.wangc...@gmail.com> wrote:
>> > On Wed, Dec 30, 2015 at 6:30 AM, Jacob Siverskog
>> > <jacob@teenage.engineering> wrote:
>> >> On Wed, Dec 30, 2015 at 2:26 PM, Eric Dumazet <eduma...@google.com> wrote:
>> >>> How often can you trigger this bug ?
>> >>
>> >> Ok. I don't have a good repro to trigger it unfortunately, I've seen it
>> >> just a few times when bringing up/down network interfaces. Does the
>> >> trace give any clue?
>> >>
>> >
>> > A little bit. You need to help people to narrow down the problem
>> > because there are too many places using skb->next and skb->prev.
>> >
>> > Since you mentioned it seems related to network interface flip,
>> > what network interfaces are you using? What's is your TC setup?
>> >
>> > Thanks.
>>
>> The system contains only one physical network interface (TI WL1837,
>> wl18xx module).
>> The state prior to the crash was as follows:
>> - One virtual network interface active (as STA, associated with access point)
>> - Bluetooth (BLE only) active (same physical chip, co-existence,
>> btwilink/st_drv modules)
>>
>> Actions made around the time of the crash:
>> - Bluetooth disabled
>> - One additional virtual network interface brought up (also as STA)
>>
>> I believe the crash occurred between these two actions. I just saw
>> that there are some interesting events in the log prior to the crash:
>> kernel: Bluetooth: Unable to push skb to HCI core(-6)
>> kernel: (stc): proto stack 4's ->recv failed
>> kernel: (stc): remove_channel_from_table: id 3
>> kernel: (stc): remove_channel_from_table: id 2
>> kernel: (stc): remove_channel_from_table: id 4
>> kernel: (stc): all chnl_ids unregistered
>> kernel: (stk) :ldisc_install = 0(stc): st_tty_close
>>
>> The first print is from btwilink.c. However, I can't see the
>> connection between Bluetooth (BLE) and UDP/IPv6 (we're not using
>> 6LoWPAN or anything similar).
>>
>> Thanks, Jacob
>
> Definitely these details are useful ;)
>
> Could you try :
>
> diff --git a/drivers/misc/ti-st/st_core.c b/drivers/misc/ti-st/st_core.c
> index 6e3af8b42cdd..0c99a74fb895 100644
> --- a/drivers/misc/ti-st/st_core.c
> +++ b/drivers/misc/ti-st/st_core.c
> @@ -912,7 +912,9 @@ void st_core_exit(struct st_data_s *st_gdata)
>  		skb_queue_purge(&st_gdata->txq);
>  		skb_queue_purge(&st_gdata->tx_waitq);
>  		kfree_skb(st_gdata->rx_skb);
> +		st_gdata->rx_skb = NULL;
>  		kfree_skb(st_gdata->tx_skb);
> +		st_gdata->tx_skb = NULL;
>  		/* TTY ldisc cleanup */
>  		err = tty_unregister_ldisc(N_TI_WL);
>  		if (err)
>
>

Sure. Since I don't have a good way to trigger the initial issue, I
can't really know if there is a difference with your patch. However,
normal usage seems to work as expected with your patch. I've tried to
reproduce the initial issue with and without your patch repeatedly for
hours and have not seen any crash in any of the runs so far.
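Editor's sketch, not part of the thread: the proposed change to st_core_exit() clears each pointer right after freeing it. kfree_skb(), like free(), returns immediately when passed NULL, so a cleared pointer turns any later free of the same field into a no-op instead of a double free. The user-space model below uses hypothetical names (struct st_state, st_state_exit) and free() in place of kfree_skb().

```c
#include <stdlib.h>

/* Hypothetical stand-in for the driver state that owns two buffers. */
struct st_state {
    void *rx_buf;
    void *tx_buf;
};

/* Free both buffers and clear the pointers, mirroring the shape of the
 * proposed change. free(NULL) is a no-op, so a second call is harmless. */
static void st_state_exit(struct st_state *st)
{
    free(st->rx_buf);
    st->rx_buf = NULL;
    free(st->tx_buf);
    st->tx_buf = NULL;
}

/* Run teardown twice; returns 1 if both pointers end up NULL. */
static int demo_teardown_twice(void)
{
    struct st_state st = { malloc(16), malloc(16) };

    st_state_exit(&st);
    st_state_exit(&st);   /* would be a double free without the NULL stores */
    return st.rx_buf == NULL && st.tx_buf == NULL;
}
```

Clearing the pointer only protects accesses that go through the same field. It does nothing for a second reference held elsewhere, which may be one reason the corruption was still reported with the patch applied.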
Re: [PATCH] net: Fix potential NULL pointer dereference in __skb_try_recv_datagram
On Wed, Dec 30, 2015 at 11:30 PM, Cong Wang <xiyou.wangc...@gmail.com> wrote:
> On Wed, Dec 30, 2015 at 6:30 AM, Jacob Siverskog
> <jacob@teenage.engineering> wrote:
>> On Wed, Dec 30, 2015 at 2:26 PM, Eric Dumazet <eduma...@google.com> wrote:
>>> How often can you trigger this bug ?
>>
>> Ok. I don't have a good repro to trigger it unfortunately, I've seen it
>> just a few times when bringing up/down network interfaces. Does the
>> trace give any clue?
>>
>
> A little bit. You need to help people to narrow down the problem
> because there are too many places using skb->next and skb->prev.
>
> Since you mentioned it seems related to network interface flip,
> what network interfaces are you using? What's is your TC setup?
>
> Thanks.

The system contains only one physical network interface (TI WL1837,
wl18xx module).
The state prior to the crash was as follows:
- One virtual network interface active (as STA, associated with access point)
- Bluetooth (BLE only) active (same physical chip, co-existence,
  btwilink/st_drv modules)

Actions made around the time of the crash:
- Bluetooth disabled
- One additional virtual network interface brought up (also as STA)

I believe the crash occurred between these two actions. I just saw
that there are some interesting events in the log prior to the crash:
kernel: Bluetooth: Unable to push skb to HCI core(-6)
kernel: (stc): proto stack 4's ->recv failed
kernel: (stc): remove_channel_from_table: id 3
kernel: (stc): remove_channel_from_table: id 2
kernel: (stc): remove_channel_from_table: id 4
kernel: (stc): all chnl_ids unregistered
kernel: (stk) :ldisc_install = 0(stc): st_tty_close

The first print is from btwilink.c. However, I can't see the
connection between Bluetooth (BLE) and UDP/IPv6 (we're not using
6LoWPAN or anything similar).

Thanks, Jacob
Re: [PATCH] net: Fix potential NULL pointer dereference in __skb_try_recv_datagram
On Tue, Dec 29, 2015 at 9:08 PM, David Miller <da...@davemloft.net> wrote:
> From: Rainer Weikusat <rweiku...@mobileactivedefense.com>
> Date: Tue, 29 Dec 2015 19:42:36 +
>
>> Jacob Siverskog <jacob@teenage.engineering> writes:
>>> This should fix a NULL pointer dereference I encountered (dump
>>> below). Since __skb_unlink is called while walking,
>>> skb_queue_walk_safe should be used.
>>
>> The code in question is:
> ...
>> __skb_unlink is only called prior to returning from the function.
>> Consequently, it won't affect the skb_queue_walk code.
>
> Agreed, this patch doesn't fix anything.

Ok. Thanks for your feedback. How do you believe the issue could be
solved? Investigating it gives:

static inline void __skb_unlink(struct sk_buff *skb, struct sk_buff_head *list)
{
        struct sk_buff *next, *prev;

        list->qlen--;
 51c:   e2433001        sub     r3, r3, #1
 520:   e58b3074        str     r3, [fp, #116]  ; 0x74
        next       = skb->next;
        prev       = skb->prev;
 524:   e894000c        ldm     r4, {r2, r3}
        skb->next  = skb->prev = NULL;
 528:   e5841000        str     r1, [r4]
 52c:   e5841004        str     r1, [r4, #4]
        next->prev = prev;
 530:   e5823004        str     r3, [r2, #4]    <-- trapping instruction (r2 NULL)

Register contents:
r7 : c58cfe1c  r6 : c06351d0  r5 : c77810ac  r4 : c583eac0
r3 : r2 : r1 : r0 : 2013

If I understand this correctly, then r4 = skb, r2 = next, r3 = prev.

Should there be a check for this in __skb_try_recv_datagram?
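Editor's sketch, not part of the thread: the reviewers' objection is that a plain walk is safe when the loop returns right after unlinking, and that the _safe variant only matters when iteration continues past a removed entry. The user-space model below uses hypothetical names (struct node, walk, walk_safe) that imitate the shape of skb_queue_walk and skb_queue_walk_safe. It is not the kernel implementation.

```c
#include <stddef.h>

/* Hypothetical user-space stand-in for an skb queue: a circular doubly
 * linked list whose head acts as a sentinel node. */
struct node {
    struct node *next, *prev;
    int id;
};

/* Plain walk: reads n->next AFTER the loop body has run. */
#define walk(head, n) \
    for ((n) = (head)->next; (n) != (head); (n) = (n)->next)

/* Safe walk: fetches the successor BEFORE the loop body runs. */
#define walk_safe(head, n, tmp) \
    for ((n) = (head)->next, (tmp) = (n)->next; (n) != (head); \
         (n) = (tmp), (tmp) = (n)->next)

static void list_init(struct node *head)
{
    head->next = head->prev = head;
}

static void add_tail(struct node *head, struct node *n)
{
    n->prev = head->prev;
    n->next = head;
    head->prev->next = n;
    head->prev = n;
}

static void unlink_node(struct node *n)
{
    struct node *next = n->next, *prev = n->prev;

    n->next = n->prev = NULL;   /* as in the quoted __skb_unlink */
    next->prev = prev;
    prev->next = next;
}

/* Unlink and return immediately: the plain walk is enough, because the
 * loop never advances past the unlinked node. */
static struct node *take_by_id(struct node *head, int id)
{
    struct node *n;

    walk(head, n) {
        if (n->id == id) {
            unlink_node(n);
            return n;
        }
    }
    return NULL;
}

/* Unlink and keep iterating: this is where the safe variant is needed. */
static int remove_even(struct node *head)
{
    struct node *n, *tmp;
    int removed = 0;

    walk_safe(head, n, tmp) {
        if (n->id % 2 == 0) {
            unlink_node(n);
            removed++;
        }
    }
    return removed;
}

/* Build 1,2,3,4; take 3; drop evens; return how many nodes remain. */
static int demo_walk(void)
{
    struct node head, nodes[4], *n;
    int i, left = 0;

    list_init(&head);
    for (i = 0; i < 4; i++) {
        nodes[i].id = i + 1;
        add_tail(&head, &nodes[i]);
    }
    if (take_by_id(&head, 3) != &nodes[2])
        return -1;
    if (remove_even(&head) != 2)
        return -2;
    walk(&head, n)
        left++;
    return left;
}
```

take_by_id() has the unlink-then-return shape the reviewers describe for the function in the patch, which is why switching to the safe iterator changes nothing there.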
Re: [PATCH] net: Fix potential NULL pointer dereference in __skb_try_recv_datagram
On Wed, Dec 30, 2015 at 2:26 PM, Eric Dumazet <eduma...@google.com> wrote:
> On Wed, Dec 30, 2015 at 6:14 AM, Jacob Siverskog
> <jacob@teenage.engineering> wrote:
>
>> Ok. Thanks for your feedback. How do you believe the issue could be
>> solved? Investigating it gives:
>>
>> static inline void __skb_unlink(struct sk_buff *skb, struct sk_buff_head *list)
>> {
>>         struct sk_buff *next, *prev;
>>
>>         list->qlen--;
>>  51c:   e2433001        sub     r3, r3, #1
>>  520:   e58b3074        str     r3, [fp, #116]  ; 0x74
>>         next       = skb->next;
>>         prev       = skb->prev;
>>  524:   e894000c        ldm     r4, {r2, r3}
>>         skb->next  = skb->prev = NULL;
>>  528:   e5841000        str     r1, [r4]
>>  52c:   e5841004        str     r1, [r4, #4]
>>         next->prev = prev;
>>  530:   e5823004        str     r3, [r2, #4]    <-- trapping instruction (r2 NULL)
>>
>> Register contents:
>> r7 : c58cfe1c  r6 : c06351d0  r5 : c77810ac  r4 : c583eac0
>> r3 : r2 : r1 : r0 : 2013
>>
>> If I understand this correctly, then r4 = skb, r2 = next, r3 = prev.
>>
>> Should there be a check for this in __skb_try_recv_datagram?
>
> At this point corruption already happened.
> We can not possibly detect every possible corruption caused by bugs
> elsewhere in the kernel and just 'recover' at this point.
> We must indeed find the root cause and fix it, instead of trying to hide it.
>
> How often can you trigger this bug ?

Ok. I don't have a good repro to trigger it unfortunately, I've seen it just a
few times when bringing up/down network interfaces. Does the trace
give any clue?

[] (__skb_recv_datagram) from [] (udpv6_recvmsg+0x1d0/0x6d0)
[] (udpv6_recvmsg) from [] (inet_recvmsg+0x38/0x4c)
[] (inet_recvmsg) from [] (___sys_recvmsg+0x94/0x170)
[] (___sys_recvmsg) from [] (__sys_recvmsg+0x3c/0x6c)
[] (__sys_recvmsg) from [] (ret_fast_syscall+0x0/0x3c)
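Editor's sketch, not part of the thread: the quoted __skb_unlink stores NULL into skb->next and skb->prev, so an entry that is unlinked while already in that state faults on "next->prev = prev", which is the trapping store in the disassembly. A double unlink is only one hypothetical way to reach that state; the thread does not establish the actual cause. The user-space model below adds a check solely to make the failing step visible. As the reply above says, the kernel deliberately has no such check, because it would hide the damage rather than fix it. All names are hypothetical.

```c
#include <stddef.h>

struct node {
    struct node *next, *prev;
};

/* Hypothetical stand-in for struct sk_buff_head. */
struct list {
    struct node head;   /* sentinel */
    int qlen;
};

static void list_init(struct list *l)
{
    l->head.next = l->head.prev = &l->head;
    l->qlen = 0;
}

static void add_tail(struct list *l, struct node *n)
{
    n->prev = l->head.prev;
    n->next = &l->head;
    l->head.prev->next = n;
    l->head.prev = n;
    l->qlen++;
}

/* Same steps as the quoted __skb_unlink, plus a guard for illustration. */
static int unlink_checked(struct list *l, struct node *n)
{
    struct node *next = n->next, *prev = n->prev;

    if (!next || !prev)
        return -1;   /* the unchecked version would fault on next->prev */
    l->qlen--;
    n->next = n->prev = NULL;
    next->prev = prev;
    prev->next = next;
    return 0;
}

/* Unlink the same node twice; the second attempt is the faulting case. */
static int demo_double_unlink(void)
{
    struct list l;
    struct node a;

    list_init(&l);
    add_tail(&l, &a);
    if (unlink_checked(&l, &a) != 0)
        return 1;
    return unlink_checked(&l, &a);
}
```

In this model the first unlink succeeds and the second is rejected with -1, the point at which the unchecked kernel function dereferences a NULL next pointer.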
[PATCH] net: Fix potential NULL pointer dereference in __skb_try_recv_datagram
033 c0002029 c0087514
fc40: 0004 0817 c5a74c80 c58cfd48 0004 c58ce000
fc60: c7781040 c0019aa4 0817 c00164f8
fc80: 0817 c0016288 0004 c0607b14 c58cfd48 c58ce000
fca0: c7781040 c00092c0 c5957a00 c7738cf0 c7744940 c7158000 c7738cf0
fcc0: c0321e3c c5aa287c 004c9b81 c7744940
fce0: c58ce000 c03080a8 c5957a68 0001 c06351d0
fd00: c5941340 c5aa2800 c5aa287c c7744940 000e c58ce000
fd20: c5aa287c c037e5f0 0065 c02fc0a8 6093 c58cfd7c
fd40: c58cfe20 c0013120 2013 c583eac0 c77810ac
fd60: c06351d0 c58cfe1c c58cfe20 c58ce000 c7781040 2013 c58cfd98
fd80: c0398f1c c02fc0a8 6093 0051 0010 0029 0008
fda0: c02fbc64 0040 c778115c c03a9ed8 c58cfdc4 0065
fdc0: 005d 00ff fb00 0004 c0398d4c
fde0: c7781040 c58cff6c 22f8 be8383ec c06351d0 c06351d0 c0398f1c
fe00: c58cfe24 c58cfe70 be8383e4 0040 c77812a8
fe20: c58cfe2c 0004 c0398d4c c58cff6c c6a25c00 c58cfeb8 be8383ec
fe40: be838408 c0367a2c c58cfe5c c58cff6c
fe60: c6a25c00 c58cff6c 0040 c02efff4 be838874 be8388c0 22f8
fe80: c0618d78 ff14338a 9abb1afe b2b617e7 0ab4 b2b99e16 0ab4
fea0: c6a1fcc0 c0608374 c0608374 0001 e914000a
fec0: 80fe ff14338a 9abb1afe 0004 c00499ac
fee0: c77ac000 0002 c063a848 c0049c9c c5a74c80
ff00: c77ac018 c7123040 2dfb 0c95f107 c58cff78 05106300 2dfb 0051
ff20: be83ac08 0001 0018 be83ac08 0008 0004 be8383ec
ff40: c6a25c00 0129 c000f3a4 c58ce000 c02f0d74 be83ac08
ff60: c58cff78 fff7 c58cfeb8 22f8
ff80: c58cfe78 0001 be838408 0400 be838890 be83883f
ffa0: c000f1e0 be838890 be83883f 000e be8383ec
ffc0: be838890 be83883f 0129 be838848 be838844
ffe0: 0006a228 be8383c8 ea5c b6f39fd8 6010 000e
[] (inet_sock_destruct) from [] (sk_destruct+0x18/0xe0)
[] (sk_destruct) from [] (inet_release+0x44/0x70)
[] (inet_release) from [] (sock_release+0x20/0x98)
[] (sock_release) from [] (sock_close+0xc/0x14)
[] (sock_close) from [] (__fput+0x80/0x1fc)
[] (__fput) from [] (task_work_run+0x6c/0x9c)
[] (task_work_run) from [] (do_exit+0x2ac/0x95c)
[] (do_exit) from [] (die+0x180/0x3b0)
[] (die) from [] (__do_kernel_fault.part.0+0x64/0x1e4)
[] (__do_kernel_fault.part.0) from [] (do_page_fault+0x270/0x298)
[] (do_page_fault) from [] (do_DataAbort+0x38/0xb4)
[] (do_DataAbort) from [] (__dabt_svc+0x40/0x60)
Exception stack(0xc58cfd48 to 0xc58cfd90)
fd40: 2013 c583eac0 c77810ac
fd60: c06351d0 c58cfe1c c58cfe20 c58ce000 c7781040 2013 c58cfd98
fd80: c0398f1c c02fc0a8 6093
[] (__dabt_svc) from [] (__skb_recv_datagram+0x428/0x598)
[] (__skb_recv_datagram) from [] (udpv6_recvmsg+0x1d0/0x6d0)
[] (udpv6_recvmsg) from [] (inet_recvmsg+0x38/0x4c)
[] (inet_recvmsg) from [] (___sys_recvmsg+0x94/0x170)
[] (___sys_recvmsg) from [] (__sys_recvmsg+0x3c/0x6c)
[] (__sys_recvmsg) from [] (ret_fast_syscall+0x0/0x3c)
Code: e5842074 e8930006 e5835000 e5835004 (e5812004)

Signed-off-by: Jacob Siverskog <jacob@teenage.engineering>
---
 net/core/datagram.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/core/datagram.c b/net/core/datagram.c
index fa9dc64..e8b3cab 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -201,7 +201,7 @@ struct sk_buff *__skb_try_recv_datagram(struct sock *sk, unsigned int flags,
 					struct sk_buff **last)
 {
 	struct sk_buff_head *queue = &sk->sk_receive_queue;
-	struct sk_buff *skb;
+	struct sk_buff *skb, *next;
 	unsigned long cpu_flags;
 	/*
 	 * Caller is allowed not to check sk->sk_err before skb_recv_datagram()
@@ -222,7 +222,7 @@ struct sk_buff *__skb_try_recv_datagram(struct sock *sk, unsigned int flags,
 
 		*last = (struct sk_buff *)queue;
 		spin_lock_irqsave(&queue->lock, cpu_flags);
-		skb_queue_walk(queue, skb) {
+		skb_queue_walk_safe(queue, skb, next) {
 			*last = skb;
 			*peeked = skb->peeked;
 			if (flags & MSG_PEEK) {
--
2.6.4