Re: [PATCH] net: Fix potential NULL pointer dereference in __skb_try_recv_datagram
On Tue, Jan 5, 2016 at 3:39 PM, Eric Dumazet wrote:
> On Tue, 2016-01-05 at 15:34 +0100, Jacob Siverskog wrote:
>> On Tue, Jan 5, 2016 at 3:14 PM, Eric Dumazet wrote:
>
>> >
>> > You might build a kernel with KASAN support to get maybe more chances to
>> > trigger the bug.
>> >
>> > ( https://www.kernel.org/doc/Documentation/kasan.txt )
>> >
>>
>> Ah. Doesn't seem to be supported on arm(32) unfortunately.
>
> Then you could at least use standard debugging features :
>
> CONFIG_SLAB=y
> CONFIG_SLABINFO=y
> CONFIG_DEBUG_SLAB=y
> CONFIG_DEBUG_SLAB_LEAK=y
>
> (Or equivalent SLUB options)
>
> and
>
> CONFIG_DEBUG_PAGEALLOC=y
>
> (If arm(32) has CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y)

I tried with those enabled and while toggling power on the Bluetooth
interface I usually get this after a few iterations:

kernel: Bluetooth: Unable to push skb to HCI core(-6)
kernel: (stc): proto stack 4's ->recv failed
kernel: Slab corruption (Not tainted): skbuff_head_cache start=c08a8a00, len=176
kernel: 0a0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6a 6b 6b a5  jkk.
kernel: Prev obj: start=c08a8940, len=176
kernel: 000: 00 00 00 00 00 00 00 00 31 73 52 00 43 17 2b 14  1sR.C.+.
kernel: 010: 00 00 00 00 00 00 00 00 04 00 00 00 01 00 00 00
kernel: Next obj: start=c08a8ac0, len=176
kernel: 000: 00 00 00 00 00 00 00 00 01 42 f6 50 36 17 2b 14  .B.P6.+.
kernel: 010: 00 00 00 00 00 00 00 00 04 00 00 00 01 00 00 00

The "Unable to push skb" and "recv failed" lines always appear before
the corruption. Unfortunately, the corruptions occur also with your
patch.
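A note on reading the dump above: with slab debugging enabled, a freed object is filled with 0x6b (POISON_FREE) and its last byte is set to 0xa5 (POISON_END), both from include/linux/poison.h. The report prints the row containing a byte that no longer matches. Below is a minimal userspace sketch of that check, not the kernel's own code; the function names are made up for illustration, and the 176-byte buffer is rebuilt from the report on the assumption that every byte outside the printed row was still intact.

```c
#include <stddef.h>
#include <string.h>

/* Poison values as defined in include/linux/poison.h. */
#define POISON_FREE 0x6b /* fills a freed object */
#define POISON_END  0xa5 /* marks the last byte of a freed object */

/*
 * Return the offset of the first byte that deviates from the expected
 * poison pattern, or -1 if the object is still intact.
 * (Hypothetical helper, written for this note.)
 */
long first_corrupt_offset(const unsigned char *obj, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++) {
		unsigned char expected = (i == len - 1) ? POISON_END : POISON_FREE;

		if (obj[i] != expected)
			return (long)i;
	}
	return -1;
}

/* Rebuild the 176-byte object from the report and locate the bad byte. */
long example_report_offset(void)
{
	unsigned char obj[176];

	memset(obj, POISON_FREE, sizeof(obj));
	obj[sizeof(obj) - 1] = POISON_END;
	obj[0xa0 + 12] = 0x6a; /* the single deviating byte in row 0a0 */

	return first_corrupt_offset(obj, sizeof(obj));
}
```

The deviating byte is at offset 172 and holds 0x6a, which is 0x6b minus one. That would be consistent with something decrementing a counter inside the skb after it was freed, but the dump alone does not prove it, and which sk_buff field sits at that offset depends on the kernel version and configuration.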
Re: [PATCH] net: Fix potential NULL pointer dereference in __skb_try_recv_datagram
On Wed, 2016-01-20 at 17:17 +0100, Jacob Siverskog wrote:
> On Wed, Jan 20, 2016 at 4:48 PM, Eric Dumazet wrote:
> > On Wed, 2016-01-20 at 16:06 +0100, Jacob Siverskog wrote:
> >> On Tue, Jan 5, 2016 at 3:39 PM, Eric Dumazet
> >> wrote:
> >> > On Tue, 2016-01-05 at 15:34 +0100, Jacob Siverskog wrote:
> >> >> On Tue, Jan 5, 2016 at 3:14 PM, Eric Dumazet
> >> >> wrote:
> >> >
> >> >> >
> >> >> > You might build a kernel with KASAN support to get maybe more chances
> >> >> > to
> >> >> > trigger the bug.
> >> >> >
> >> >> > ( https://www.kernel.org/doc/Documentation/kasan.txt )
> >> >> >
> >> >>
> >> >> Ah. Doesn't seem to be supported on arm(32) unfortunately.
> >> >
> >> > Then you could at least use standard debugging features :
> >> >
> >> > CONFIG_SLAB=y
> >> > CONFIG_SLABINFO=y
> >> > CONFIG_DEBUG_SLAB=y
> >> > CONFIG_DEBUG_SLAB_LEAK=y
> >> >
> >> > (Or equivalent SLUB options)
> >> >
> >> > and
> >> >
> >> > CONFIG_DEBUG_PAGEALLOC=y
> >> >
> >> > (If arm(32) has CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y)
> >>
> >> I tried with those enabled and while toggling power on the Bluetooth
> >> interface I usually get this after a few iterations:
> >> kernel: Bluetooth: Unable to push skb to HCI core(-6)
> >
> > Well, this code seems to be quite buggy.
> >
> > I do not have time to audit it, but 5 minutes are enough to spot 2
> > issues.
> >
> > skb, once given to another queue/layer should not be accessed anymore.
> >
>
> Ok. Unfortunately I still see the slab corruption even with your changes.

Patch was only showing potential _reads_ after free, which do not
generally corrupt memory.

As I said, a full audit is needed, and I don't have time for this.
Re: [PATCH] net: Fix potential NULL pointer dereference in __skb_try_recv_datagram
Hi Jacob,

On 01/05/2016 06:34 AM, Jacob Siverskog wrote:
> On Tue, Jan 5, 2016 at 3:14 PM, Eric Dumazet wrote:
>> On Tue, 2016-01-05 at 12:07 +0100, Jacob Siverskog wrote:
>>> On Mon, Jan 4, 2016 at 4:25 PM, Eric Dumazet wrote:
>>>> On Mon, 2016-01-04 at 10:10 +0100, Jacob Siverskog wrote:
>>>>> On Wed, Dec 30, 2015 at 11:30 PM, Cong Wang
>>>>> wrote:
>>>>>> On Wed, Dec 30, 2015 at 6:30 AM, Jacob Siverskog
>>>>>> wrote:
>>>>>>> On Wed, Dec 30, 2015 at 2:26 PM, Eric Dumazet
>>>>>>> wrote:
>>>>>>>> How often can you trigger this bug ?
>>>>>>>
>>>>>>> Ok. I don't have a good repro to trigger it unfortunately, I've seen it
>>>>>>> just a
>>>>>>> few times when bringing up/down network interfaces. Does the trace
>>>>>>> give any clue?
>>>>>>>
>>>>>>
>>>>>> A little bit. You need to help people to narrow down the problem
>>>>>> because there are too many places using skb->next and skb->prev.
>>>>>>
>>>>>> Since you mentioned it seems related to network interface flip,
>>>>>> what network interfaces are you using? What's is your TC setup?
>>>>>>
>>>>>> Thanks.
>>>>>
>>>>> The system contains only one physical network interface (TI WL1837,
>>>>> wl18xx module).
>>>>> The state prior to the crash was as follows:
>>>>> - One virtual network interface active (as STA, associated with access
>>>>> point)
>>>>> - Bluetooth (BLE only) active (same physical chip, co-existence,
>>>>> btwilink/st_drv modules)
>>>>>
>>>>> Actions made around the time of the crash:
>>>>> - Bluetooth disabled
>>>>> - One additional virtual network interface brought up (also as STA)
>>>>>
>>>>> I believe the crash occurred between these two actions. I just saw
>>>>> that there are some interesting events in the log prior to the crash:
>>>>> kernel: Bluetooth: Unable to push skb to HCI core(-6)
>>>>> kernel: (stc): proto stack 4's ->recv failed
>>>>> kernel: (stc): remove_channel_from_table: id 3
>>>>> kernel: (stc): remove_channel_from_table: id 2
>>>>> kernel: (stc): remove_channel_from_table: id 4
>>>>> kernel: (stc): all chnl_ids unregistered
>>>>> kernel: (stk) :ldisc_install = 0(stc): st_tty_close
>>>>>
>>>>> The first print is from btwilink.c. However, I can't see the
>>>>> connection between Bluetooth (BLE) and UDP/IPv6 (we're not using
>>>>> 6LoWPAN or anything similar).
>>>>>
>>>>> Thanks, Jacob
>>>>
>>>> Definitely these details are useful ;)
>>>>
>>>> Could you try :
>>>>
>>>> diff --git a/drivers/misc/ti-st/st_core.c b/drivers/misc/ti-st/st_core.c
>>>> index 6e3af8b42cdd..0c99a74fb895 100644
>>>> --- a/drivers/misc/ti-st/st_core.c
>>>> +++ b/drivers/misc/ti-st/st_core.c
>>>> @@ -912,7 +912,9 @@ void st_core_exit(struct st_data_s *st_gdata)
>>>>  		skb_queue_purge(&st_gdata->txq);
>>>>  		skb_queue_purge(&st_gdata->tx_waitq);
>>>>  		kfree_skb(st_gdata->rx_skb);
>>>> +		st_gdata->rx_skb = NULL;
>>>>  		kfree_skb(st_gdata->tx_skb);
>>>> +		st_gdata->tx_skb = NULL;
>>>>  		/* TTY ldisc cleanup */
>>>>  		err = tty_unregister_ldisc(N_TI_WL);
>>>>  		if (err)

FWIW, You don't need that ti-st junk to get the WL1837 working; the
WL1837 only has BT channels.

Unfortunately, that's really all I can say about it; sorry.

Regards,
Peter Hurley
Re: [PATCH] net: Fix potential NULL pointer dereference in __skb_try_recv_datagram
On Wed, 2016-01-20 at 16:06 +0100, Jacob Siverskog wrote:
> On Tue, Jan 5, 2016 at 3:39 PM, Eric Dumazet wrote:
> > On Tue, 2016-01-05 at 15:34 +0100, Jacob Siverskog wrote:
> >> On Tue, Jan 5, 2016 at 3:14 PM, Eric Dumazet
> >> wrote:
> >
> >> >
> >> > You might build a kernel with KASAN support to get maybe more chances to
> >> > trigger the bug.
> >> >
> >> > ( https://www.kernel.org/doc/Documentation/kasan.txt )
> >> >
> >>
> >> Ah. Doesn't seem to be supported on arm(32) unfortunately.
> >
> > Then you could at least use standard debugging features :
> >
> > CONFIG_SLAB=y
> > CONFIG_SLABINFO=y
> > CONFIG_DEBUG_SLAB=y
> > CONFIG_DEBUG_SLAB_LEAK=y
> >
> > (Or equivalent SLUB options)
> >
> > and
> >
> > CONFIG_DEBUG_PAGEALLOC=y
> >
> > (If arm(32) has CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y)
>
> I tried with those enabled and while toggling power on the Bluetooth
> interface I usually get this after a few iterations:
> kernel: Bluetooth: Unable to push skb to HCI core(-6)

Well, this code seems to be quite buggy.

I do not have time to audit it, but 5 minutes are enough to spot 2
issues.

skb, once given to another queue/layer should not be accessed anymore.

diff --git a/drivers/bluetooth/btwilink.c b/drivers/bluetooth/btwilink.c
index 24a652f9252b..2d3092aa6cfe 100644
--- a/drivers/bluetooth/btwilink.c
+++ b/drivers/bluetooth/btwilink.c
@@ -98,6 +98,7 @@ static void st_reg_completion_cb(void *priv_data, char data)
 static long st_receive(void *priv_data, struct sk_buff *skb)
 {
 	struct ti_st *lhst = priv_data;
+	unsigned int len;
 	int err;
 
 	if (!skb)
@@ -109,13 +110,14 @@ static long st_receive(void *priv_data, struct sk_buff *skb)
 	}
 
 	/* Forward skb to HCI core layer */
+	len = skb->len;
 	err = hci_recv_frame(lhst->hdev, skb);
 	if (err < 0) {
 		BT_ERR("Unable to push skb to HCI core(%d)", err);
 		return err;
 	}
 
-	lhst->hdev->stat.byte_rx += skb->len;
+	lhst->hdev->stat.byte_rx += len;
 
 	return 0;
 }
@@ -245,6 +247,7 @@ static int ti_st_send_frame(struct hci_dev *hdev, struct sk_buff *skb)
 {
 	struct ti_st *hst;
 	long len;
+	u8 pkt_type;
 
 	hst = hci_get_drvdata(hdev);
 
@@ -258,6 +261,7 @@ static int ti_st_send_frame(struct hci_dev *hdev, struct sk_buff *skb)
 	 * Freeing skb memory is taken care in shared transport layer,
 	 * so don't free skb memory here.
 	 */
+	pkt_type = hci_skb_pkt_type(skb);
 	len = hst->st_write(skb);
 	if (len < 0) {
 		kfree_skb(skb);
@@ -268,7 +272,7 @@ static int ti_st_send_frame(struct hci_dev *hdev, struct sk_buff *skb)
 
 	/* ST accepted our skb. So, Go ahead and do rest */
 	hdev->stat.byte_tx += len;
-	ti_st_tx_complete(hst, hci_skb_pkt_type(skb));
+	ti_st_tx_complete(hst, pkt_type);
 
 	return 0;
 }
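The rule behind both hunks ("skb, once given to another queue/layer should not be accessed anymore") can be shown outside the kernel. The following is a rough userspace sketch only: struct buf, consume() and receive() are hypothetical stand-ins invented for this note, with consume() assumed to take ownership of its argument the way the patch treats hci_recv_frame().

```c
#include <stdlib.h>

/* Hypothetical stand-in for struct sk_buff; only the length matters here. */
struct buf {
	unsigned int len;
};

/*
 * Hypothetical consumer that takes ownership of the buffer.
 * In this sketch it always frees its argument, success or not.
 */
int consume(struct buf *b)
{
	int err = (b->len == 0) ? -1 : 0;

	free(b);
	return err;
}

/*
 * Hand a buffer to the consumer and account for its size.
 * The length is copied out before the handoff, so the buffer is never
 * touched after ownership has moved.
 */
int receive(struct buf *b, unsigned long *byte_rx)
{
	unsigned int len = b->len; /* read while we still own the buffer */
	int err = consume(b);

	if (err < 0)
		return err;

	*byte_rx += len; /* uses the saved copy, not b->len */
	return 0;
}
```

Reading b->len after consume() returned would be a read of freed memory, which is the pattern the patch removes from st_receive() and ti_st_send_frame().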
Re: [PATCH] net: Fix potential NULL pointer dereference in __skb_try_recv_datagram
On Wed, Jan 20, 2016 at 4:48 PM, Eric Dumazet wrote:
> On Wed, 2016-01-20 at 16:06 +0100, Jacob Siverskog wrote:
>> On Tue, Jan 5, 2016 at 3:39 PM, Eric Dumazet wrote:
>> > On Tue, 2016-01-05 at 15:34 +0100, Jacob Siverskog wrote:
>> >> On Tue, Jan 5, 2016 at 3:14 PM, Eric Dumazet
>> >> wrote:
>> >
>> >> >
>> >> > You might build a kernel with KASAN support to get maybe more chances to
>> >> > trigger the bug.
>> >> >
>> >> > ( https://www.kernel.org/doc/Documentation/kasan.txt )
>> >> >
>> >>
>> >> Ah. Doesn't seem to be supported on arm(32) unfortunately.
>> >
>> > Then you could at least use standard debugging features :
>> >
>> > CONFIG_SLAB=y
>> > CONFIG_SLABINFO=y
>> > CONFIG_DEBUG_SLAB=y
>> > CONFIG_DEBUG_SLAB_LEAK=y
>> >
>> > (Or equivalent SLUB options)
>> >
>> > and
>> >
>> > CONFIG_DEBUG_PAGEALLOC=y
>> >
>> > (If arm(32) has CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y)
>>
>> I tried with those enabled and while toggling power on the Bluetooth
>> interface I usually get this after a few iterations:
>> kernel: Bluetooth: Unable to push skb to HCI core(-6)
>
> Well, this code seems to be quite buggy.
>
> I do not have time to audit it, but 5 minutes are enough to spot 2
> issues.
>
> skb, once given to another queue/layer should not be accessed anymore.
>

Ok. Unfortunately I still see the slab corruption even with your changes.
Re: [PATCH] net: Fix potential NULL pointer dereference in __skb_try_recv_datagram
On Tue, Jan 5, 2016 at 3:14 PM, Eric Dumazet wrote:
> On Tue, 2016-01-05 at 12:07 +0100, Jacob Siverskog wrote:
>> On Mon, Jan 4, 2016 at 4:25 PM, Eric Dumazet wrote:
>> > On Mon, 2016-01-04 at 10:10 +0100, Jacob Siverskog wrote:
>> >> On Wed, Dec 30, 2015 at 11:30 PM, Cong Wang
>> >> wrote:
>> >> > On Wed, Dec 30, 2015 at 6:30 AM, Jacob Siverskog
>> >> > wrote:
>> >> >> On Wed, Dec 30, 2015 at 2:26 PM, Eric Dumazet
>> >> >> wrote:
>> >> >>> How often can you trigger this bug ?
>> >> >>
>> >> >> Ok. I don't have a good repro to trigger it unfortunately, I've seen
>> >> >> it just a
>> >> >> few times when bringing up/down network interfaces. Does the trace
>> >> >> give any clue?
>> >> >>
>> >> >
>> >> > A little bit. You need to help people to narrow down the problem
>> >> > because there are too many places using skb->next and skb->prev.
>> >> >
>> >> > Since you mentioned it seems related to network interface flip,
>> >> > what network interfaces are you using? What's is your TC setup?
>> >> >
>> >> > Thanks.
>> >>
>> >> The system contains only one physical network interface (TI WL1837,
>> >> wl18xx module).
>> >> The state prior to the crash was as follows:
>> >> - One virtual network interface active (as STA, associated with access
>> >> point)
>> >> - Bluetooth (BLE only) active (same physical chip, co-existence,
>> >> btwilink/st_drv modules)
>> >>
>> >> Actions made around the time of the crash:
>> >> - Bluetooth disabled
>> >> - One additional virtual network interface brought up (also as STA)
>> >>
>> >> I believe the crash occurred between these two actions. I just saw
>> >> that there are some interesting events in the log prior to the crash:
>> >> kernel: Bluetooth: Unable to push skb to HCI core(-6)
>> >> kernel: (stc): proto stack 4's ->recv failed
>> >> kernel: (stc): remove_channel_from_table: id 3
>> >> kernel: (stc): remove_channel_from_table: id 2
>> >> kernel: (stc): remove_channel_from_table: id 4
>> >> kernel: (stc): all chnl_ids unregistered
>> >> kernel: (stk) :ldisc_install = 0(stc): st_tty_close
>> >>
>> >> The first print is from btwilink.c. However, I can't see the
>> >> connection between Bluetooth (BLE) and UDP/IPv6 (we're not using
>> >> 6LoWPAN or anything similar).
>> >>
>> >> Thanks, Jacob
>> >
>> > Definitely these details are useful ;)
>> >
>> > Could you try :
>> >
>> > diff --git a/drivers/misc/ti-st/st_core.c b/drivers/misc/ti-st/st_core.c
>> > index 6e3af8b42cdd..0c99a74fb895 100644
>> > --- a/drivers/misc/ti-st/st_core.c
>> > +++ b/drivers/misc/ti-st/st_core.c
>> > @@ -912,7 +912,9 @@ void st_core_exit(struct st_data_s *st_gdata)
>> >  		skb_queue_purge(&st_gdata->txq);
>> >  		skb_queue_purge(&st_gdata->tx_waitq);
>> >  		kfree_skb(st_gdata->rx_skb);
>> > +		st_gdata->rx_skb = NULL;
>> >  		kfree_skb(st_gdata->tx_skb);
>> > +		st_gdata->tx_skb = NULL;
>> >  		/* TTY ldisc cleanup */
>> >  		err = tty_unregister_ldisc(N_TI_WL);
>> >  		if (err)
>> >
>> >
>>
>> Sure. Since I don't have a good way to trigger the initial issue, I
>> can't really know if there is a difference with your patch. However,
>> normal usage seems to work as expected with your patch. I've tried to
>> reproduce the initial issue with and without your patch repeatedly for
>> hours and have not seen any crash in any of the runs so far.
>> --
>
> You might build a kernel with KASAN support to get maybe more chances to
> trigger the bug.
>
> ( https://www.kernel.org/doc/Documentation/kasan.txt )
>

Ah. Doesn't seem to be supported on arm(32) unfortunately.
-- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] net: Fix potential NULL pointer dereference in __skb_try_recv_datagram
On Tue, 2016-01-05 at 15:34 +0100, Jacob Siverskog wrote:
> On Tue, Jan 5, 2016 at 3:14 PM, Eric Dumazet wrote:
> >
> > You might build a kernel with KASAN support to get maybe more chances to
> > trigger the bug.
> >
> > ( https://www.kernel.org/doc/Documentation/kasan.txt )
> >
>
> Ah. Doesn't seem to be supported on arm(32) unfortunately.

Then you could at least use standard debugging features :

CONFIG_SLAB=y
CONFIG_SLABINFO=y
CONFIG_DEBUG_SLAB=y
CONFIG_DEBUG_SLAB_LEAK=y

(Or equivalent SLUB options)

and

CONFIG_DEBUG_PAGEALLOC=y

(If arm(32) has CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y)
Re: [PATCH] net: Fix potential NULL pointer dereference in __skb_try_recv_datagram
On Tue, 2016-01-05 at 12:07 +0100, Jacob Siverskog wrote:
> On Mon, Jan 4, 2016 at 4:25 PM, Eric Dumazet wrote:
> > On Mon, 2016-01-04 at 10:10 +0100, Jacob Siverskog wrote:
> >> On Wed, Dec 30, 2015 at 11:30 PM, Cong Wang
> >> wrote:
> >> > On Wed, Dec 30, 2015 at 6:30 AM, Jacob Siverskog
> >> > wrote:
> >> >> On Wed, Dec 30, 2015 at 2:26 PM, Eric Dumazet
> >> >> wrote:
> >> >>> How often can you trigger this bug ?
> >> >>
> >> >> Ok. I don't have a good repro to trigger it unfortunately, I've seen it
> >> >> just a
> >> >> few times when bringing up/down network interfaces. Does the trace
> >> >> give any clue?
> >> >>
> >> >
> >> > A little bit. You need to help people to narrow down the problem
> >> > because there are too many places using skb->next and skb->prev.
> >> >
> >> > Since you mentioned it seems related to network interface flip,
> >> > what network interfaces are you using? What's is your TC setup?
> >> >
> >> > Thanks.
> >>
> >> The system contains only one physical network interface (TI WL1837,
> >> wl18xx module).
> >> The state prior to the crash was as follows:
> >> - One virtual network interface active (as STA, associated with access
> >> point)
> >> - Bluetooth (BLE only) active (same physical chip, co-existence,
> >> btwilink/st_drv modules)
> >>
> >> Actions made around the time of the crash:
> >> - Bluetooth disabled
> >> - One additional virtual network interface brought up (also as STA)
> >>
> >> I believe the crash occurred between these two actions. I just saw
> >> that there are some interesting events in the log prior to the crash:
> >> kernel: Bluetooth: Unable to push skb to HCI core(-6)
> >> kernel: (stc): proto stack 4's ->recv failed
> >> kernel: (stc): remove_channel_from_table: id 3
> >> kernel: (stc): remove_channel_from_table: id 2
> >> kernel: (stc): remove_channel_from_table: id 4
> >> kernel: (stc): all chnl_ids unregistered
> >> kernel: (stk) :ldisc_install = 0(stc): st_tty_close
> >>
> >> The first print is from btwilink.c. However, I can't see the
> >> connection between Bluetooth (BLE) and UDP/IPv6 (we're not using
> >> 6LoWPAN or anything similar).
> >>
> >> Thanks, Jacob
> >
> > Definitely these details are useful ;)
> >
> > Could you try :
> >
> > diff --git a/drivers/misc/ti-st/st_core.c b/drivers/misc/ti-st/st_core.c
> > index 6e3af8b42cdd..0c99a74fb895 100644
> > --- a/drivers/misc/ti-st/st_core.c
> > +++ b/drivers/misc/ti-st/st_core.c
> > @@ -912,7 +912,9 @@ void st_core_exit(struct st_data_s *st_gdata)
> >  		skb_queue_purge(&st_gdata->txq);
> >  		skb_queue_purge(&st_gdata->tx_waitq);
> >  		kfree_skb(st_gdata->rx_skb);
> > +		st_gdata->rx_skb = NULL;
> >  		kfree_skb(st_gdata->tx_skb);
> > +		st_gdata->tx_skb = NULL;
> >  		/* TTY ldisc cleanup */
> >  		err = tty_unregister_ldisc(N_TI_WL);
> >  		if (err)
> >
> >
>
> Sure. Since I don't have a good way to trigger the initial issue, I
> can't really know if there is a difference with your patch. However,
> normal usage seems to work as expected with your patch. I've tried to
> reproduce the initial issue with and without your patch repeatedly for
> hours and have not seen any crash in any of the runs so far.
> --

You might build a kernel with KASAN support to get maybe more chances to
trigger the bug.

( https://www.kernel.org/doc/Documentation/kasan.txt )
Re: [PATCH] net: Fix potential NULL pointer dereference in __skb_try_recv_datagram
On Mon, Jan 4, 2016 at 4:25 PM, Eric Dumazet wrote:
> On Mon, 2016-01-04 at 10:10 +0100, Jacob Siverskog wrote:
>> On Wed, Dec 30, 2015 at 11:30 PM, Cong Wang wrote:
>> > On Wed, Dec 30, 2015 at 6:30 AM, Jacob Siverskog
>> > wrote:
>> >> On Wed, Dec 30, 2015 at 2:26 PM, Eric Dumazet wrote:
>> >>> How often can you trigger this bug ?
>> >>
>> >> Ok. I don't have a good repro to trigger it unfortunately, I've seen it
>> >> just a
>> >> few times when bringing up/down network interfaces. Does the trace
>> >> give any clue?
>> >>
>> >
>> > A little bit. You need to help people to narrow down the problem
>> > because there are too many places using skb->next and skb->prev.
>> >
>> > Since you mentioned it seems related to network interface flip,
>> > what network interfaces are you using? What's is your TC setup?
>> >
>> > Thanks.
>>
>> The system contains only one physical network interface (TI WL1837,
>> wl18xx module).
>> The state prior to the crash was as follows:
>> - One virtual network interface active (as STA, associated with access point)
>> - Bluetooth (BLE only) active (same physical chip, co-existence,
>> btwilink/st_drv modules)
>>
>> Actions made around the time of the crash:
>> - Bluetooth disabled
>> - One additional virtual network interface brought up (also as STA)
>>
>> I believe the crash occurred between these two actions. I just saw
>> that there are some interesting events in the log prior to the crash:
>> kernel: Bluetooth: Unable to push skb to HCI core(-6)
>> kernel: (stc): proto stack 4's ->recv failed
>> kernel: (stc): remove_channel_from_table: id 3
>> kernel: (stc): remove_channel_from_table: id 2
>> kernel: (stc): remove_channel_from_table: id 4
>> kernel: (stc): all chnl_ids unregistered
>> kernel: (stk) :ldisc_install = 0(stc): st_tty_close
>>
>> The first print is from btwilink.c. However, I can't see the
>> connection between Bluetooth (BLE) and UDP/IPv6 (we're not using
>> 6LoWPAN or anything similar).
>>
>> Thanks, Jacob
>
> Definitely these details are useful ;)
>
> Could you try :
>
> diff --git a/drivers/misc/ti-st/st_core.c b/drivers/misc/ti-st/st_core.c
> index 6e3af8b42cdd..0c99a74fb895 100644
> --- a/drivers/misc/ti-st/st_core.c
> +++ b/drivers/misc/ti-st/st_core.c
> @@ -912,7 +912,9 @@ void st_core_exit(struct st_data_s *st_gdata)
>  		skb_queue_purge(&st_gdata->txq);
>  		skb_queue_purge(&st_gdata->tx_waitq);
>  		kfree_skb(st_gdata->rx_skb);
> +		st_gdata->rx_skb = NULL;
>  		kfree_skb(st_gdata->tx_skb);
> +		st_gdata->tx_skb = NULL;
>  		/* TTY ldisc cleanup */
>  		err = tty_unregister_ldisc(N_TI_WL);
>  		if (err)
>
>

Sure. Since I don't have a good way to trigger the initial issue, I
can't really know if there is a difference with your patch. However,
normal usage seems to work as expected with your patch. I've tried to
reproduce the initial issue with and without your patch repeatedly for
hours and have not seen any crash in any of the runs so far.
Re: [PATCH] net: Fix potential NULL pointer dereference in __skb_try_recv_datagram
On Wed, Dec 30, 2015 at 11:30 PM, Cong Wang wrote:
> On Wed, Dec 30, 2015 at 6:30 AM, Jacob Siverskog
> wrote:
>> On Wed, Dec 30, 2015 at 2:26 PM, Eric Dumazet wrote:
>>> How often can you trigger this bug ?
>>
>> Ok. I don't have a good repro to trigger it unfortunately, I've seen it just
>> a
>> few times when bringing up/down network interfaces. Does the trace
>> give any clue?
>>
>
> A little bit. You need to help people to narrow down the problem
> because there are too many places using skb->next and skb->prev.
>
> Since you mentioned it seems related to network interface flip,
> what network interfaces are you using? What's is your TC setup?
>
> Thanks.

The system contains only one physical network interface (TI WL1837,
wl18xx module).
The state prior to the crash was as follows:
- One virtual network interface active (as STA, associated with access point)
- Bluetooth (BLE only) active (same physical chip, co-existence,
  btwilink/st_drv modules)

Actions made around the time of the crash:
- Bluetooth disabled
- One additional virtual network interface brought up (also as STA)

I believe the crash occurred between these two actions. I just saw
that there are some interesting events in the log prior to the crash:
kernel: Bluetooth: Unable to push skb to HCI core(-6)
kernel: (stc): proto stack 4's ->recv failed
kernel: (stc): remove_channel_from_table: id 3
kernel: (stc): remove_channel_from_table: id 2
kernel: (stc): remove_channel_from_table: id 4
kernel: (stc): all chnl_ids unregistered
kernel: (stk) :ldisc_install = 0(stc): st_tty_close

The first print is from btwilink.c. However, I can't see the
connection between Bluetooth (BLE) and UDP/IPv6 (we're not using
6LoWPAN or anything similar).

Thanks, Jacob
Re: [PATCH] net: Fix potential NULL pointer dereference in __skb_try_recv_datagram
On Mon, 2016-01-04 at 10:10 +0100, Jacob Siverskog wrote:
> On Wed, Dec 30, 2015 at 11:30 PM, Cong Wang wrote:
> > On Wed, Dec 30, 2015 at 6:30 AM, Jacob Siverskog
> > wrote:
> >> On Wed, Dec 30, 2015 at 2:26 PM, Eric Dumazet wrote:
> >>> How often can you trigger this bug ?
> >>
> >> Ok. I don't have a good repro to trigger it unfortunately, I've seen it
> >> just a
> >> few times when bringing up/down network interfaces. Does the trace
> >> give any clue?
> >>
> >
> > A little bit. You need to help people to narrow down the problem
> > because there are too many places using skb->next and skb->prev.
> >
> > Since you mentioned it seems related to network interface flip,
> > what network interfaces are you using? What's is your TC setup?
> >
> > Thanks.
>
> The system contains only one physical network interface (TI WL1837,
> wl18xx module).
> The state prior to the crash was as follows:
> - One virtual network interface active (as STA, associated with access point)
> - Bluetooth (BLE only) active (same physical chip, co-existence,
> btwilink/st_drv modules)
>
> Actions made around the time of the crash:
> - Bluetooth disabled
> - One additional virtual network interface brought up (also as STA)
>
> I believe the crash occurred between these two actions. I just saw
> that there are some interesting events in the log prior to the crash:
> kernel: Bluetooth: Unable to push skb to HCI core(-6)
> kernel: (stc): proto stack 4's ->recv failed
> kernel: (stc): remove_channel_from_table: id 3
> kernel: (stc): remove_channel_from_table: id 2
> kernel: (stc): remove_channel_from_table: id 4
> kernel: (stc): all chnl_ids unregistered
> kernel: (stk) :ldisc_install = 0(stc): st_tty_close
>
> The first print is from btwilink.c. However, I can't see the
> connection between Bluetooth (BLE) and UDP/IPv6 (we're not using
> 6LoWPAN or anything similar).
>
> Thanks, Jacob

Definitely these details are useful ;)

Could you try :

diff --git a/drivers/misc/ti-st/st_core.c b/drivers/misc/ti-st/st_core.c
index 6e3af8b42cdd..0c99a74fb895 100644
--- a/drivers/misc/ti-st/st_core.c
+++ b/drivers/misc/ti-st/st_core.c
@@ -912,7 +912,9 @@ void st_core_exit(struct st_data_s *st_gdata)
 		skb_queue_purge(&st_gdata->txq);
 		skb_queue_purge(&st_gdata->tx_waitq);
 		kfree_skb(st_gdata->rx_skb);
+		st_gdata->rx_skb = NULL;
 		kfree_skb(st_gdata->tx_skb);
+		st_gdata->tx_skb = NULL;
 		/* TTY ldisc cleanup */
 		err = tty_unregister_ldisc(N_TI_WL);
 		if (err)
Re: [PATCH] net: Fix potential NULL pointer dereference in __skb_try_recv_datagram
On Mon, 2016-01-04 at 16:14 +, Rainer Weikusat wrote:
> Eric Dumazet writes:
> > On Mon, 2016-01-04 at 10:10 +0100, Jacob Siverskog wrote:
>
> [...]
>
> >> I believe the crash occurred between these two actions. I just saw
> >> that there are some interesting events in the log prior to the crash:
> >> kernel: Bluetooth: Unable to push skb to HCI core(-6)
> >> kernel: (stc): proto stack 4's ->recv failed
> >> kernel: (stc): remove_channel_from_table: id 3
> >> kernel: (stc): remove_channel_from_table: id 2
> >> kernel: (stc): remove_channel_from_table: id 4
> >> kernel: (stc): all chnl_ids unregistered
> >> kernel: (stk) :ldisc_install = 0(stc): st_tty_close
> >>
> >> The first print is from btwilink.c. However, I can't see the
> >> connection between Bluetooth (BLE) and UDP/IPv6 (we're not using
> >> 6LoWPAN or anything similar).
> >>
> >> Thanks, Jacob
> >
> > Definitely these details are useful ;)
> >
> > Could you try :
> >
> > diff --git a/drivers/misc/ti-st/st_core.c b/drivers/misc/ti-st/st_core.c
> > index 6e3af8b42cdd..0c99a74fb895 100644
> > --- a/drivers/misc/ti-st/st_core.c
> > +++ b/drivers/misc/ti-st/st_core.c
> > @@ -912,7 +912,9 @@ void st_core_exit(struct st_data_s *st_gdata)
> >  		skb_queue_purge(&st_gdata->txq);
> >  		skb_queue_purge(&st_gdata->tx_waitq);
> >  		kfree_skb(st_gdata->rx_skb);
> > +		st_gdata->rx_skb = NULL;
> >  		kfree_skb(st_gdata->tx_skb);
> > +		st_gdata->tx_skb = NULL;
> >  		/* TTY ldisc cleanup */
> >  		err = tty_unregister_ldisc(N_TI_WL);
> >  		if (err)
>
> Hmm ... the code continues with
>
> 		err = tty_unregister_ldisc(N_TI_WL);
> 		if (err)
> 			pr_err("unable to un-register ldisc %ld", err);
> 		/* free the global data pointer */
> 		kfree(st_gdata);
>
> So who would ever see that the rx_skb and tx_skb pointers were cleared
> prior to freeing the data structure containing them?

This is the theory, but I suspect a use after free.

kfree(st_gdata) does not clear all content with 0, unless you use
special SLUB/SLAB debugging features.
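The point about kfree() not zeroing memory can be illustrated without touching freed memory for real. This is a simplified userspace sketch under stated assumptions: struct ctx, the single static slot and every function name below are hypothetical, and the static slot stands in for slab memory whose contents survive a free when no debug poisoning is enabled.

```c
#include <stddef.h>

/* Hypothetical stand-in for the driver's global data; not the real struct. */
struct ctx {
	void *rx_buf;
};

/* Backing storage that outlives "free", like an unpoisoned slab object. */
static struct ctx slot;
static int slot_in_use;

struct ctx *ctx_alloc(void *rx_buf)
{
	slot.rx_buf = rx_buf;
	slot_in_use = 1;
	return &slot;
}

/* Like kfree() without debug poisoning: contents are left as they were. */
void ctx_free(struct ctx *c)
{
	(void)c;
	slot_in_use = 0;
}

int ctx_is_live(void)
{
	return slot_in_use;
}

/* Teardown in the style of the patch: optionally clear the pointer, then free. */
void ctx_exit(struct ctx *c, int clear_pointers)
{
	if (clear_pointers)
		c->rx_buf = NULL;
	ctx_free(c);
}

/* What a buggy late user holding a stale ctx pointer would read. */
void *stale_read(const struct ctx *c)
{
	return c->rx_buf;
}
```

Without the clearing step, a stale reader still finds the old buffer pointer and may go on to use a buffer that has already been freed. With it, the same reader finds NULL, so the test patch can change what a use after free does even though the containing structure is released right afterwards.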
Re: [PATCH] net: Fix potential NULL pointer dereference in __skb_try_recv_datagram
Eric Dumazet writes:
> On Mon, 2016-01-04 at 10:10 +0100, Jacob Siverskog wrote:

[...]

>> I believe the crash occurred between these two actions. I just saw
>> that there are some interesting events in the log prior to the crash:
>> kernel: Bluetooth: Unable to push skb to HCI core(-6)
>> kernel: (stc): proto stack 4's ->recv failed
>> kernel: (stc): remove_channel_from_table: id 3
>> kernel: (stc): remove_channel_from_table: id 2
>> kernel: (stc): remove_channel_from_table: id 4
>> kernel: (stc): all chnl_ids unregistered
>> kernel: (stk) :ldisc_install = 0(stc): st_tty_close
>>
>> The first print is from btwilink.c. However, I can't see the
>> connection between Bluetooth (BLE) and UDP/IPv6 (we're not using
>> 6LoWPAN or anything similar).
>>
>> Thanks, Jacob
>
> Definitely these details are useful ;)
>
> Could you try :
>
> diff --git a/drivers/misc/ti-st/st_core.c b/drivers/misc/ti-st/st_core.c
> index 6e3af8b42cdd..0c99a74fb895 100644
> --- a/drivers/misc/ti-st/st_core.c
> +++ b/drivers/misc/ti-st/st_core.c
> @@ -912,7 +912,9 @@ void st_core_exit(struct st_data_s *st_gdata)
>  		skb_queue_purge(&st_gdata->txq);
>  		skb_queue_purge(&st_gdata->tx_waitq);
>  		kfree_skb(st_gdata->rx_skb);
> +		st_gdata->rx_skb = NULL;
>  		kfree_skb(st_gdata->tx_skb);
> +		st_gdata->tx_skb = NULL;
>  		/* TTY ldisc cleanup */
>  		err = tty_unregister_ldisc(N_TI_WL);
>  		if (err)

Hmm ... the code continues with

		err = tty_unregister_ldisc(N_TI_WL);
		if (err)
			pr_err("unable to un-register ldisc %ld", err);
		/* free the global data pointer */
		kfree(st_gdata);

So who would ever see that the rx_skb and tx_skb pointers were cleared
prior to freeing the data structure containing them?
Re: [PATCH] net: Fix potential NULL pointer dereference in __skb_try_recv_datagram
On Wed, Dec 30, 2015 at 6:30 AM, Jacob Siverskog wrote:
> On Wed, Dec 30, 2015 at 2:26 PM, Eric Dumazet wrote:
>> How often can you trigger this bug ?
>
> Ok. I don't have a good repro to trigger it unfortunately, I've seen it just a
> few times when bringing up/down network interfaces. Does the trace
> give any clue?
>

A little bit. You need to help people to narrow down the problem
because there are too many places using skb->next and skb->prev.

Since you mentioned it seems related to network interface flip,
what network interfaces are you using? What's is your TC setup?

Thanks.
Re: [PATCH] net: Fix potential NULL pointer dereference in __skb_try_recv_datagram
On Tue, Dec 29, 2015 at 9:08 PM, David Miller wrote:
> From: Rainer Weikusat
> Date: Tue, 29 Dec 2015 19:42:36 +
>
>> Jacob Siverskog writes:
>>> This should fix a NULL pointer dereference I encountered (dump
>>> below). Since __skb_unlink is called while walking,
>>> skb_queue_walk_safe should be used.
>>
>> The code in question is:
> ...
>> __skb_unlink is only called prior to returning from the function.
>> Consequently, it won't affect the skb_queue_walk code.
>
> Agreed, this patch doesn't fix anything.

Ok. Thanks for your feedback. How do you believe the issue could be
solved? Investigating it gives:

static inline void __skb_unlink(struct sk_buff *skb, struct sk_buff_head *list)
{
        struct sk_buff *next, *prev;

        list->qlen--;
 51c:   e2433001        sub     r3, r3, #1
 520:   e58b3074        str     r3, [fp, #116]  ; 0x74
        next       = skb->next;
        prev       = skb->prev;
 524:   e894000c        ldm     r4, {r2, r3}
        skb->next  = skb->prev = NULL;
 528:   e5841000        str     r1, [r4]
 52c:   e5841004        str     r1, [r4, #4]
        next->prev = prev;
 530:   e5823004        str     r3, [r2, #4]   <-- trapping instruction (r2 NULL)

Register contents:
r7 : c58cfe1c  r6 : c06351d0  r5 : c77810ac  r4 : c583eac0
r3 : r2 : r1 : r0 : 2013

If I understand this correctly, then r4 = skb, r2 = next, r3 = prev.

Should there be a check for this in __skb_try_recv_datagram?
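To make the trapping store easier to follow, here is a small userspace sketch of the same pointer updates on a circular doubly linked list. It is meant to illustrate the failure mode, not to propose a fix: struct node and the function names are hypothetical stand-ins, and the NULL guard in unlink_checked() is something the kernel's __skb_unlink() does not have.

```c
#include <stddef.h>

/* Minimal stand-in for the list linkage at the start of struct sk_buff. */
struct node {
	struct node *next;
	struct node *prev;
};

/* Circular list head, in the style of struct sk_buff_head. */
void list_init(struct node *head)
{
	head->next = head->prev = head;
}

void list_add_tail(struct node *head, struct node *n)
{
	n->next = head;
	n->prev = head->prev;
	head->prev->next = n;
	head->prev = n;
}

/*
 * Same pointer updates as the quoted __skb_unlink() body, plus a guard
 * added for this sketch. Returns -1 instead of dereferencing NULL.
 */
int unlink_checked(struct node *n)
{
	struct node *next = n->next;
	struct node *prev = n->prev;

	if (!next || !prev)
		return -1; /* already unlinked, or linkage overwritten */

	n->next = n->prev = NULL;
	next->prev = prev;
	prev->next = next;
	return 0;
}
```

On a healthy circular list, next and prev are never NULL, because even a single queued element points back at the list head. A NULL next therefore means the element was already unlinked once, or its linkage was overwritten by something else; the sketch returns -1 in exactly the situation where the store at 530 faults.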
Re: [PATCH] net: Fix potential NULL pointer dereference in __skb_try_recv_datagram
On Wed, Dec 30, 2015 at 2:26 PM, Eric Dumazet wrote:
> On Wed, Dec 30, 2015 at 6:14 AM, Jacob Siverskog
> wrote:
>
>> Ok. Thanks for your feedback. How do you believe the issue could be
>> solved? Investigating it gives:
>>
>> static inline void __skb_unlink(struct sk_buff *skb, struct sk_buff_head
>> *list)
>> {
>> struct sk_buff *next, *prev;
>>
>> list->qlen--;
>> 51c: e2433001 sub r3, r3, #1
>> 520: e58b3074 str r3, [fp, #116] ; 0x74
>> next = skb->next;
>> prev = skb->prev;
>> 524: e894000c ldm r4, {r2, r3}
>> skb->next = skb->prev = NULL;
>> 528: e5841000 str r1, [r4]
>> 52c: e5841004 str r1, [r4, #4]
>> next->prev = prev;
>> 530: e5823004 str r3, [r2, #4] <--
>> trapping instruction (r2 NULL)
>>
>> Register contents:
>> r7 : c58cfe1c r6 : c06351d0 r5 : c77810ac r4 : c583eac0
>> r3 : r2 : r1 : r0 : 2013
>>
>> If I understand this correctly, then r4 = skb, r2 = next, r3 = prev.
>>
>> Should there be a check for this in __skb_try_recv_datagram?
>
> At this point corruption already happened.
> We can not possibly detect every possible corruption caused by bugs
> elsewhere in the kernel and just 'recover' at this point.
> We must indeed find the root cause and fix it, instead of trying to hide it.
>
> How often can you trigger this bug ?

Ok. I don't have a good repro to trigger it unfortunately, I've seen it
just a few times when bringing up/down network interfaces. Does the
trace give any clue?

[] (__skb_recv_datagram) from [] (udpv6_recvmsg+0x1d0/0x6d0)
[] (udpv6_recvmsg) from [] (inet_recvmsg+0x38/0x4c)
[] (inet_recvmsg) from [] (___sys_recvmsg+0x94/0x170)
[] (___sys_recvmsg) from [] (__sys_recvmsg+0x3c/0x6c)
[] (__sys_recvmsg) from [] (ret_fast_syscall+0x0/0x3c)
Re: [PATCH] net: Fix potential NULL pointer dereference in __skb_try_recv_datagram
On Wed, Dec 30, 2015 at 6:14 AM, Jacob Siverskog wrote:
> Ok. Thanks for your feedback. How do you believe the issue could be
> solved? Investigating it gives:
>
> static inline void __skb_unlink(struct sk_buff *skb, struct sk_buff_head
> *list)
> {
> struct sk_buff *next, *prev;
>
> list->qlen--;
> 51c: e2433001 sub r3, r3, #1
> 520: e58b3074 str r3, [fp, #116] ; 0x74
> next = skb->next;
> prev = skb->prev;
> 524: e894000c ldm r4, {r2, r3}
> skb->next = skb->prev = NULL;
> 528: e5841000 str r1, [r4]
> 52c: e5841004 str r1, [r4, #4]
> next->prev = prev;
> 530: e5823004 str r3, [r2, #4] <--
> trapping instruction (r2 NULL)
>
> Register contents:
> r7 : c58cfe1c r6 : c06351d0 r5 : c77810ac r4 : c583eac0
> r3 : r2 : r1 : r0 : 2013
>
> If I understand this correctly, then r4 = skb, r2 = next, r3 = prev.
>
> Should there be a check for this in __skb_try_recv_datagram?

At this point corruption already happened.
We can not possibly detect every possible corruption caused by bugs
elsewhere in the kernel and just 'recover' at this point.
We must indeed find the root cause and fix it, instead of trying to hide it.

How often can you trigger this bug ?
Re: [PATCH] net: Fix potential NULL pointer dereference in __skb_try_recv_datagram
On Wed, Dec 30, 2015 at 9:30 AM, Jacob Siverskog wrote:
> On Wed, Dec 30, 2015 at 2:26 PM, Eric Dumazet wrote:
>> At this point corruption already happened.
>> We can not possibly detect every possible corruption caused by bugs
>> elsewhere in the kernel and just 'recover' at this point.
>> We must indeed find the root cause and fix it, instead of trying to hide it.
>>
>> How often can you trigger this bug ?
>
> Ok. I don't have a good repro to trigger it unfortunately, I've seen it just a
> few times when bringing up/down network interfaces. Does the trace
> give any clue?
>
> [] (__skb_recv_datagram) from []
> (udpv6_recvmsg+0x1d0/0x6d0)
> [] (udpv6_recvmsg) from [] (inet_recvmsg+0x38/0x4c)
> [] (inet_recvmsg) from [] (___sys_recvmsg+0x94/0x170)
> [] (___sys_recvmsg) from [] (__sys_recvmsg+0x3c/0x6c)
> [] (__sys_recvmsg) from [] (ret_fast_syscall+0x0/0x3c)

Not really : it only shows the point where the corruption is detected,
not the point where the corruption happened.

This might be caused by a netfilter module, a buggy driver...
it is hard to know.

You might add some traces on the skb itself, like its length or/and
content.
Re: [PATCH] net: Fix potential NULL pointer dereference in __skb_try_recv_datagram
Jacob Siverskog writes:
> On Tue, Dec 29, 2015 at 9:08 PM, David Miller wrote:
>> From: Rainer Weikusat
>> Date: Tue, 29 Dec 2015 19:42:36 +
>>
>>> Jacob Siverskog writes:
>>>> This should fix a NULL pointer dereference I encountered (dump
>>>> below). Since __skb_unlink is called while walking,
>>>> skb_queue_walk_safe should be used.
>>>
>>> The code in question is:
>> ...
>>> __skb_unlink is only called prior to returning from the function.
>>> Consequently, it won't affect the skb_queue_walk code.
>>
>> Agreed, this patch doesn't fix anything.
>
> Ok. Thanks for your feedback. How do you believe the issue could be
> solved? Investigating it gives:
>
> static inline void __skb_unlink(struct sk_buff *skb, struct sk_buff_head
> *list)
> {

[...]

> next->prev = prev;
> 530: e5823004 str r3, [r2, #4] <--
> trapping instruction (r2 NULL)
>
> Register contents:
> r7 : c58cfe1c r6 : c06351d0 r5 : c77810ac r4 : c583eac0
> r3 : r2 : r1 : r0 : 2013
>
> If I understand this correctly, then r4 = skb, r2 = next, r3 = prev.

Some additional information which may be helpful: The next->prev = prev
was pretty obvious from the original error message alone: The invalid
access happened at 4 but no register contained 4. Considering that this
is for ARM, this must have been caused by an instruction using an
address of the form [Rx, #4], i.e., the value of register x + 4. And the
next->prev = prev is the only access to something located 4 bytes
beyond something else.

> Should there be a check for this in __skb_try_recv_datagram?

These lists are supposed to be circular, i.e., the next pointer of the
last element should point to the first and the prev pointer of the
first to the last. If there's an element with ->next == NULL on the
list, something either didn't do inserts correctly or corrupted an
originally intact list.

General advice: The original error occurred with 4.3.0. Had this
happened to me, I'd have tried either to locate the error in the same
kernel version or to reproduce the bug with the one I was planning to
modify. Trying to fix a 'strange memory access' error which was
observed with version x.y by modifying version x.z is IMHO needlessly
moving on shaky ground.
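The two points above (the fault at address 4 and the circular-list invariant) can be illustrated with a small userspace sketch in C. It is only a model under stated assumptions: `struct node`, `struct list` and the `list_*` helpers are hypothetical stand-ins for `struct sk_buff`, `struct sk_buff_head` and the kernel helpers, only the two leading pointer members are modelled, and `list_unlink()` simply repeats the steps of the `__skb_unlink()` quoted in the thread.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the first two members of struct sk_buff. */
struct node {
    struct node *next;   /* offset 0 */
    struct node *prev;   /* offset sizeof(void *), i.e. 4 on 32-bit ARM */
};

/* Hypothetical stand-in for struct sk_buff_head: the head is part of the ring. */
struct list {
    struct node head;
    unsigned int qlen;
};

static void list_init(struct list *l)
{
    l->head.next = l->head.prev = &l->head;
    l->qlen = 0;
}

static void list_add_tail(struct list *l, struct node *n)
{
    assert(l->head.prev != NULL);   /* the list must have been initialised */
    n->next = &l->head;
    n->prev = l->head.prev;
    l->head.prev->next = n;
    l->head.prev = n;
    l->qlen++;
}

/* Same sequence of steps as the quoted __skb_unlink(). */
static void list_unlink(struct list *l, struct node *n)
{
    struct node *next, *prev;

    l->qlen--;
    next = n->next;
    prev = n->prev;
    n->next = n->prev = NULL;
    next->prev = prev;   /* with next == NULL this writes to address offsetof(prev) */
    prev->next = next;
}

/* Debug helper: returns 1 if the ring closes on the head with consistent links. */
static int list_is_circular(const struct list *l)
{
    const struct node *n = &l->head;
    unsigned int seen = 0;

    do {
        if (n->next == NULL || n->next->prev != n)
            return 0;
        n = n->next;
    } while (n != &l->head && seen++ <= l->qlen);

    return n == &l->head;
}
```

On a 32-bit target `offsetof(struct node, prev)` is 4, so a NULL `next` makes the store in `list_unlink()` hit address 4, which matches the reported fault. A check such as `list_is_circular()` would only detect the damage; as noted elsewhere in the thread, it says nothing about where the corruption happened.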
Re: [PATCH] net: Fix potential NULL pointer dereference in __skb_try_recv_datagram
From: Rainer Weikusat
Date: Tue, 29 Dec 2015 19:42:36 +

> Jacob Siverskog writes:
>> This should fix a NULL pointer dereference I encountered (dump
>> below). Since __skb_unlink is called while walking,
>> skb_queue_walk_safe should be used.
>
> The code in question is:
...
> __skb_unlink is only called prior to returning from the function.
> Consequently, it won't affect the skb_queue_walk code.

Agreed, this patch doesn't fix anything.
Re: [PATCH] net: Fix potential NULL pointer dereference in __skb_try_recv_datagram
Jacob Siverskog writes:
> This should fix a NULL pointer dereference I encountered (dump
> below). Since __skb_unlink is called while walking,
> skb_queue_walk_safe should be used.

The code in question is:

        skb_queue_walk(queue, skb) {
                *last = skb;
                *peeked = skb->peeked;
                if (flags & MSG_PEEK) {
                        if (_off >= skb->len && (skb->len || _off ||
                                                 skb->peeked)) {
                                _off -= skb->len;
                                continue;
                        }

                        skb = skb_set_peeked(skb);
                        error = PTR_ERR(skb);
                        if (IS_ERR(skb)) {
                                spin_unlock_irqrestore(&queue->lock,
                                                       cpu_flags);
                                goto no_packet;
                        }

                        atomic_inc(&skb->users);
                } else
                        __skb_unlink(skb, queue);

                spin_unlock_irqrestore(&queue->lock, cpu_flags);
                *off = _off;
                return skb;
        }

__skb_unlink is only called prior to returning from the function.
Consequently, it won't affect the skb_queue_walk code.
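To make the argument concrete, here is a minimal userspace sketch in C of the two iteration patterns. The `list_walk` and `list_walk_safe` macros are modelled on `skb_queue_walk` and `skb_queue_walk_safe`; `struct node`, `struct list`, `take_first()` and `drop_all()` are hypothetical names introduced only for this illustration and are not kernel code.

```c
#include <assert.h>
#include <stddef.h>

struct node { struct node *next, *prev; int id; };
struct list { struct node head; };

/* Modelled on skb_queue_walk(): the advance step reads n->next after the body. */
#define list_walk(l, n) \
    for ((n) = (l)->head.next; (n) != &(l)->head; (n) = (n)->next)

/* Modelled on skb_queue_walk_safe(): the successor is cached before the body. */
#define list_walk_safe(l, n, tmp) \
    for ((n) = (l)->head.next, (tmp) = (n)->next; (n) != &(l)->head; \
         (n) = (tmp), (tmp) = (n)->next)

static void list_init(struct list *l)
{
    l->head.next = l->head.prev = &l->head;
}

static void list_add_tail(struct list *l, struct node *n)
{
    n->next = &l->head;
    n->prev = l->head.prev;
    l->head.prev->next = n;
    l->head.prev = n;
}

static void list_unlink(struct node *n)
{
    assert(n->next != NULL && n->prev != NULL);   /* n must be on a list */
    n->next->prev = n->prev;
    n->prev->next = n->next;
    n->next = n->prev = NULL;
}

/* Pattern of the quoted loop: unlink, then leave the loop immediately. */
static struct node *take_first(struct list *l, int id)
{
    struct node *n;

    list_walk(l, n) {
        if (n->id != id)
            continue;
        list_unlink(n);
        return n;   /* the advance step n = n->next never runs for this n */
    }
    return NULL;
}

/* Pattern that does need the safe variant: keep walking after an unlink. */
static int drop_all(struct list *l, int id)
{
    struct node *n, *tmp;
    int dropped = 0;

    list_walk_safe(l, n, tmp) {
        if (n->id == id) {
            list_unlink(n);
            dropped++;
        }
    }
    return dropped;
}
```

`take_first()` is safe with the plain walk because it returns before the loop's advance expression can read the now-NULL `next` pointer, which is the situation in the quoted receive loop. Only a loop like `drop_all()`, which continues after unlinking, depends on the cached successor.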