Re: net: hang in unregister_netdevice: waiting for lo to become free
On Fri, May 11, 2018 at 5:19 AM, Dmitry Vyukov wrote:
> On Thu, May 10, 2018 at 12:23 PM, Dan Streetman wrote:
> On 20.02.2018 18:26, Neil Horman wrote:
>> On Tue, Feb 20, 2018 at 09:14:41AM +0100, Dmitry Vyukov wrote:
>>> On Tue, Feb 20, 2018 at 8:56 AM, Tommi Rantala wrote:
On 19.02.2018 20:59, Dmitry Vyukov wrote:
> Is this meant to be fixed already? I am still seeing this on the
> latest upstream tree.

These two commits are in v4.16-rc1:

commit 4a31a6b19f9ddf498c81f5c9b089742b7472a6f8
Author: Tommi Rantala
Date: Mon Feb 5 21:48:14 2018 +0200

    sctp: fix dst refcnt leak in sctp_v4_get_dst
    ...
    Fixes: 410f03831 ("sctp: add routing output fallback")
    Fixes: 0ca50d12f ("sctp: fix src address selection if using secondary addresses")

commit 957d761cf91cdbb175ad7d8f5472336a4d54dbf2
Author: Alexey Kodanev
Date: Mon Feb 5 15:10:35 2018 +0300

    sctp: fix dst refcnt leak in sctp_v6_get_dst()
    ...
    Fixes: dbc2b5e9a09e ("sctp: fix src address selection if using secondary addresses for ipv6")

I guess we missed something if it's still reproducible.

I can check it later this week, unless someone else beat me to it.
>>>
>>> Hi Tommi,
>>>
>>> Hmmm, I can't claim that it's exactly the same bug. Perhaps it's
>>> another one then. But I am still seeing these:
>>>
>>> [ 58.799130] unregister_netdevice: waiting for lo to become free. Usage count = 4
>>> [ 60.847138] unregister_netdevice: waiting for lo to become free. Usage count = 4
>>> [ 62.895093] unregister_netdevice: waiting for lo to become free. Usage count = 4
>>> [ 64.943103] unregister_netdevice: waiting for lo to become free. Usage count = 4
>>>
>>> on upstream tree pulled ~12 hours ago.
>>>
>> Can you write a systemtap script to probe dev_hold, and dev_put, printing
>> out a backtrace if the device name matches "lo". That should tell us
>> definitively if the problem is in the same location or not
>
> Hi Dmitry, I tested with the reproducer and the kernel .config file that you
> sent in the first email in this thread:
>
> With 4.16-rc2 unable to reproduce.
>
> With 4.15-rc9 bug reproducible, and I get "unregister_netdevice: waiting for
> lo to become free. Usage count = 3"
>
> With 4.15-rc9 and Alexey's "sctp: fix dst refcnt leak in sctp_v6_get_dst()"
> cherry-picked on top, unable to reproduce.
>
> Is syzkaller doing something else now to trigger the bug...?
> Can you still trigger the bug with the same reproducer?

Hi Neil, Tommi,

Reviving this old thread about "unregister_netdevice: waiting for lo
to become free. Usage count = 3" hangs.
I still did not have time to deep dive into what happens there (too
many bugs coming from syzbot). But this still actively happens and I
suspect accounts to a significant portion of various hang reports,
which are quite unpleasant.

One idea that could make it all simpler:

Is this wait loop in netdev_wait_allrefs() supposed to wait for any
prolonged periods of time under any non-buggy conditions? E.g. more
than 1-2 minutes?
If it only supposed to wait briefly for things that already supposed
to be shutting down, and we add a WARNING there after some timeout,
then syzbot will report all info how/when it happens, hopefully
extracting reproducers, and all the nice things.
But this WARNING should not have any false positives under any
realistic conditions (e.g. waiting for arrival of remote packets with
large timeouts).

Looking at some task hung reports, it seems that this code holds some
mutexes, takes workqueue thread and prevents any progress with
destruction of other devices (and net namespace creation/destruction),
so I guess it should not wait for any indefinite periods of time?
>>>
>>> I'm working on this currently:
>>> https://bugs.launchpad.net/ubuntu/zesty/+source/linux/+bug/1711407
>>>
>>> I added a summary of what I've found to
Re: net: hang in unregister_netdevice: waiting for lo to become free
On Thu, May 10, 2018 at 12:23 PM, Dan Streetman wrote:
On 20.02.2018 18:26, Neil Horman wrote:
> On Tue, Feb 20, 2018 at 09:14:41AM +0100, Dmitry Vyukov wrote:
>> On Tue, Feb 20, 2018 at 8:56 AM, Tommi Rantala wrote:
>>> On 19.02.2018 20:59, Dmitry Vyukov wrote:
Is this meant to be fixed already? I am still seeing this on the
latest upstream tree.
>>>
>>> These two commits are in v4.16-rc1:
>>>
>>> commit 4a31a6b19f9ddf498c81f5c9b089742b7472a6f8
>>> Author: Tommi Rantala
>>> Date: Mon Feb 5 21:48:14 2018 +0200
>>>
>>>     sctp: fix dst refcnt leak in sctp_v4_get_dst
>>>     ...
>>>     Fixes: 410f03831 ("sctp: add routing output fallback")
>>>     Fixes: 0ca50d12f ("sctp: fix src address selection if using secondary addresses")
>>>
>>> commit 957d761cf91cdbb175ad7d8f5472336a4d54dbf2
>>> Author: Alexey Kodanev
>>> Date: Mon Feb 5 15:10:35 2018 +0300
>>>
>>>     sctp: fix dst refcnt leak in sctp_v6_get_dst()
>>>     ...
>>>     Fixes: dbc2b5e9a09e ("sctp: fix src address selection if using secondary addresses for ipv6")
>>>
>>> I guess we missed something if it's still reproducible.
>>>
>>> I can check it later this week, unless someone else beat me to it.
>>
>> Hi Tommi,
>>
>> Hmmm, I can't claim that it's exactly the same bug. Perhaps it's
>> another one then. But I am still seeing these:
>>
>> [ 58.799130] unregister_netdevice: waiting for lo to become free. Usage count = 4
>> [ 60.847138] unregister_netdevice: waiting for lo to become free. Usage count = 4
>> [ 62.895093] unregister_netdevice: waiting for lo to become free. Usage count = 4
>> [ 64.943103] unregister_netdevice: waiting for lo to become free. Usage count = 4
>>
>> on upstream tree pulled ~12 hours ago.
>>
> Can you write a systemtap script to probe dev_hold, and dev_put, printing
> out a backtrace if the device name matches "lo". That should tell us
> definitively if the problem is in the same location or not

Hi Dmitry, I tested with the reproducer and the kernel .config file that you
sent in the first email in this thread:

With 4.16-rc2 unable to reproduce.

With 4.15-rc9 bug reproducible, and I get "unregister_netdevice: waiting for
lo to become free. Usage count = 3"

With 4.15-rc9 and Alexey's "sctp: fix dst refcnt leak in sctp_v6_get_dst()"
cherry-picked on top, unable to reproduce.

Is syzkaller doing something else now to trigger the bug...?
Can you still trigger the bug with the same reproducer?
>>>
>>> Hi Neil, Tommi,
>>>
>>> Reviving this old thread about "unregister_netdevice: waiting for lo
>>> to become free. Usage count = 3" hangs.
>>> I still did not have time to deep dive into what happens there (too
>>> many bugs coming from syzbot). But this still actively happens and I
>>> suspect accounts to a significant portion of various hang reports,
>>> which are quite unpleasant.
>>>
>>> One idea that could make it all simpler:
>>>
>>> Is this wait loop in netdev_wait_allrefs() supposed to wait for any
>>> prolonged periods of time under any non-buggy conditions? E.g. more
>>> than 1-2 minutes?
>>> If it only supposed to wait briefly for things that already supposed
>>> to be shutting down, and we add a WARNING there after some timeout,
>>> then syzbot will report all info how/when it happens, hopefully
>>> extracting reproducers, and all the nice things.
>>> But this WARNING should not have any false positives under any
>>> realistic conditions (e.g. waiting for arrival of remote packets with
>>> large timeouts).
>>>
>>> Looking at some task hung reports, it seems that this code holds some
>>> mutexes, takes workqueue thread and prevents any progress with
>>> destruction of other devices (and net namespace creation/destruction),
>>> so I guess it should not wait for any indefinite periods of time?
>>
>> I'm working on this currently:
>> https://bugs.launchpad.net/ubuntu/zesty/+source/linux/+bug/1711407
>>
>> I added a summary of what I've found to be the cause (or at least, one
>> possible cause) of this:
>> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1711407/comments/72
>>
>> I'm working on a
Re: net: hang in unregister_netdevice: waiting for lo to become free
On Thu, May 10, 2018 at 2:46 AM, Dmitry Vyukov wrote:
> On Mon, Apr 16, 2018 at 9:42 PM, Dan Streetman wrote:
>> On Wed, Feb 21, 2018 at 3:53 PM, Tommi Rantala wrote:
>>> On 20.02.2018 18:26, Neil Horman wrote:
On Tue, Feb 20, 2018 at 09:14:41AM +0100, Dmitry Vyukov wrote:
> On Tue, Feb 20, 2018 at 8:56 AM, Tommi Rantala wrote:
>> On 19.02.2018 20:59, Dmitry Vyukov wrote:
>>> Is this meant to be fixed already? I am still seeing this on the
>>> latest upstream tree.
>>
>> These two commits are in v4.16-rc1:
>>
>> commit 4a31a6b19f9ddf498c81f5c9b089742b7472a6f8
>> Author: Tommi Rantala
>> Date: Mon Feb 5 21:48:14 2018 +0200
>>
>>     sctp: fix dst refcnt leak in sctp_v4_get_dst
>>     ...
>>     Fixes: 410f03831 ("sctp: add routing output fallback")
>>     Fixes: 0ca50d12f ("sctp: fix src address selection if using secondary addresses")
>>
>> commit 957d761cf91cdbb175ad7d8f5472336a4d54dbf2
>> Author: Alexey Kodanev
>> Date: Mon Feb 5 15:10:35 2018 +0300
>>
>>     sctp: fix dst refcnt leak in sctp_v6_get_dst()
>>     ...
>>     Fixes: dbc2b5e9a09e ("sctp: fix src address selection if using secondary addresses for ipv6")
>>
>> I guess we missed something if it's still reproducible.
>>
>> I can check it later this week, unless someone else beat me to it.
>
> Hi Tommi,
>
> Hmmm, I can't claim that it's exactly the same bug. Perhaps it's
> another one then. But I am still seeing these:
>
> [ 58.799130] unregister_netdevice: waiting for lo to become free. Usage count = 4
> [ 60.847138] unregister_netdevice: waiting for lo to become free. Usage count = 4
> [ 62.895093] unregister_netdevice: waiting for lo to become free. Usage count = 4
> [ 64.943103] unregister_netdevice: waiting for lo to become free. Usage count = 4
>
> on upstream tree pulled ~12 hours ago.
>
Can you write a systemtap script to probe dev_hold, and dev_put, printing
out a backtrace if the device name matches "lo". That should tell us
definitively if the problem is in the same location or not
>>>
>>> Hi Dmitry, I tested with the reproducer and the kernel .config file that you
>>> sent in the first email in this thread:
>>>
>>> With 4.16-rc2 unable to reproduce.
>>>
>>> With 4.15-rc9 bug reproducible, and I get "unregister_netdevice: waiting for
>>> lo to become free. Usage count = 3"
>>>
>>> With 4.15-rc9 and Alexey's "sctp: fix dst refcnt leak in sctp_v6_get_dst()"
>>> cherry-picked on top, unable to reproduce.
>>>
>>> Is syzkaller doing something else now to trigger the bug...?
>>> Can you still trigger the bug with the same reproducer?
>>
>> Hi Neil, Tommi,
>>
>> Reviving this old thread about "unregister_netdevice: waiting for lo
>> to become free. Usage count = 3" hangs.
>> I still did not have time to deep dive into what happens there (too
>> many bugs coming from syzbot). But this still actively happens and I
>> suspect accounts to a significant portion of various hang reports,
>> which are quite unpleasant.
>>
>> One idea that could make it all simpler:
>>
>> Is this wait loop in netdev_wait_allrefs() supposed to wait for any
>> prolonged periods of time under any non-buggy conditions? E.g. more
>> than 1-2 minutes?
>> If it only supposed to wait briefly for things that already supposed
>> to be shutting down, and we add a WARNING there after some timeout,
>> then syzbot will report all info how/when it happens, hopefully
>> extracting reproducers, and all the nice things.
>> But this WARNING should not have any false positives under any
>> realistic conditions (e.g. waiting for arrival of remote packets with
>> large timeouts).
>>
>> Looking at some task hung reports, it seems that this code holds some
>> mutexes, takes workqueue thread and prevents any progress with
>> destruction of other devices (and net namespace creation/destruction),
>> so I guess it should not wait for any indefinite periods of time?
>
> I'm working on this currently:
> https://bugs.launchpad.net/ubuntu/zesty/+source/linux/+bug/1711407
>
> I added a summary of what I've found to be the cause (or at least, one
> possible cause) of this:
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1711407/comments/72
>
> I'm working on a patch to
Re: net: hang in unregister_netdevice: waiting for lo to become free
On Mon, Apr 16, 2018 at 9:42 PM, Dan Streetman wrote:
> On Wed, Feb 21, 2018 at 3:53 PM, Tommi Rantala wrote:
>> On 20.02.2018 18:26, Neil Horman wrote:
>>> On Tue, Feb 20, 2018 at 09:14:41AM +0100, Dmitry Vyukov wrote:
On Tue, Feb 20, 2018 at 8:56 AM, Tommi Rantala wrote:
> On 19.02.2018 20:59, Dmitry Vyukov wrote:
>> Is this meant to be fixed already? I am still seeing this on the
>> latest upstream tree.
>
> These two commits are in v4.16-rc1:
>
> commit 4a31a6b19f9ddf498c81f5c9b089742b7472a6f8
> Author: Tommi Rantala
> Date: Mon Feb 5 21:48:14 2018 +0200
>
>     sctp: fix dst refcnt leak in sctp_v4_get_dst
>     ...
>     Fixes: 410f03831 ("sctp: add routing output fallback")
>     Fixes: 0ca50d12f ("sctp: fix src address selection if using secondary addresses")
>
> commit 957d761cf91cdbb175ad7d8f5472336a4d54dbf2
> Author: Alexey Kodanev
> Date: Mon Feb 5 15:10:35 2018 +0300
>
>     sctp: fix dst refcnt leak in sctp_v6_get_dst()
>     ...
>     Fixes: dbc2b5e9a09e ("sctp: fix src address selection if using secondary addresses for ipv6")
>
> I guess we missed something if it's still reproducible.
>
> I can check it later this week, unless someone else beat me to it.

Hi Tommi,

Hmmm, I can't claim that it's exactly the same bug. Perhaps it's
another one then. But I am still seeing these:

[ 58.799130] unregister_netdevice: waiting for lo to become free. Usage count = 4
[ 60.847138] unregister_netdevice: waiting for lo to become free. Usage count = 4
[ 62.895093] unregister_netdevice: waiting for lo to become free. Usage count = 4
[ 64.943103] unregister_netdevice: waiting for lo to become free. Usage count = 4

on upstream tree pulled ~12 hours ago.
>>> Can you write a systemtap script to probe dev_hold, and dev_put, printing
>>> out a backtrace if the device name matches "lo". That should tell us
>>> definitively if the problem is in the same location or not
>>
>> Hi Dmitry, I tested with the reproducer and the kernel .config file that you
>> sent in the first email in this thread:
>>
>> With 4.16-rc2 unable to reproduce.
>>
>> With 4.15-rc9 bug reproducible, and I get "unregister_netdevice: waiting for
>> lo to become free. Usage count = 3"
>>
>> With 4.15-rc9 and Alexey's "sctp: fix dst refcnt leak in sctp_v6_get_dst()"
>> cherry-picked on top, unable to reproduce.
>>
>> Is syzkaller doing something else now to trigger the bug...?
>> Can you still trigger the bug with the same reproducer?
>
> Hi Neil, Tommi,
>
> Reviving this old thread about "unregister_netdevice: waiting for lo
> to become free. Usage count = 3" hangs.
> I still did not have time to deep dive into what happens there (too
> many bugs coming from syzbot). But this still actively happens and I
> suspect accounts to a significant portion of various hang reports,
> which are quite unpleasant.
>
> One idea that could make it all simpler:
>
> Is this wait loop in netdev_wait_allrefs() supposed to wait for any
> prolonged periods of time under any non-buggy conditions? E.g. more
> than 1-2 minutes?
> If it only supposed to wait briefly for things that already supposed
> to be shutting down, and we add a WARNING there after some timeout,
> then syzbot will report all info how/when it happens, hopefully
> extracting reproducers, and all the nice things.
> But this WARNING should not have any false positives under any
> realistic conditions (e.g. waiting for arrival of remote packets with
> large timeouts).
>
> Looking at some task hung reports, it seems that this code holds some
> mutexes, takes workqueue thread and prevents any progress with
> destruction of other devices (and net namespace creation/destruction),
> so I guess it should not wait for any indefinite periods of time?

I'm working on this currently:
https://bugs.launchpad.net/ubuntu/zesty/+source/linux/+bug/1711407

I added a summary of what I've found to be the cause (or at least, one
possible cause) of this:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1711407/comments/72

I'm working on a patch to work around the main side-effect of this, which
is hanging while holding the global net mutex. Hangs will still happen
(e.g. if a dst leaks) but should not affect
Re: net: hang in unregister_netdevice: waiting for lo to become free
On Mon, Apr 16, 2018 at 3:35 AM, Dmitry Vyukov wrote: > On Fri, Apr 13, 2018 at 5:54 PM, Dmitry Vyukov wrote: >> On Fri, Apr 13, 2018 at 2:43 PM, Dan Streetman wrote: >>> On Thu, Apr 12, 2018 at 8:15 AM, Dmitry Vyukov wrote: On Wed, Feb 21, 2018 at 3:53 PM, Tommi Rantala wrote: > On 20.02.2018 18:26, Neil Horman wrote: >> >> On Tue, Feb 20, 2018 at 09:14:41AM +0100, Dmitry Vyukov wrote: >>> >>> On Tue, Feb 20, 2018 at 8:56 AM, Tommi Rantala >>> wrote: On 19.02.2018 20:59, Dmitry Vyukov wrote: > > Is this meant to be fixed already? I am still seeing this on the > latest upstream tree. > These two commits are in v4.16-rc1: commit 4a31a6b19f9ddf498c81f5c9b089742b7472a6f8 Author: Tommi Rantala Date: Mon Feb 5 21:48:14 2018 +0200 sctp: fix dst refcnt leak in sctp_v4_get_dst ... Fixes: 410f03831 ("sctp: add routing output fallback") Fixes: 0ca50d12f ("sctp: fix src address selection if using secondary addresses") commit 957d761cf91cdbb175ad7d8f5472336a4d54dbf2 Author: Alexey Kodanev Date: Mon Feb 5 15:10:35 2018 +0300 sctp: fix dst refcnt leak in sctp_v6_get_dst() ... Fixes: dbc2b5e9a09e ("sctp: fix src address selection if using secondary addresses for ipv6") I guess we missed something if it's still reproducible. I can check it later this week, unless someone else beat me to it. >>> >>> >>> Hi Tommi, >>> >>> Hmmm, I can't claim that it's exactly the same bug. Perhaps it's >>> another one then. But I am still seeing these: >>> >>> [ 58.799130] unregister_netdevice: waiting for lo to become free. >>> Usage count = 4 >>> [ 60.847138] unregister_netdevice: waiting for lo to become free. >>> Usage count = 4 >>> [ 62.895093] unregister_netdevice: waiting for lo to become free. >>> Usage count = 4 >>> [ 64.943103] unregister_netdevice: waiting for lo to become free. >>> Usage count = 4 >>> >>> on upstream tree pulled ~12 hours ago. >>> >> Can you write a systemtap script to probe dev_hold, and dev_put, printing >> out a >> backtrace if the device name matches "lo". 
That should tell us >> definitively if >> the problem is in the same location or not > > > Hi Dmitry, I tested with the reproducer and the kernel .config file that > you > sent in the first email in this thread: > > With 4.16-rc2 unable to reproduce. > > With 4.15-rc9 bug reproducible, and I get "unregister_netdevice: waiting > for > lo to become free. Usage count = 3" > > With 4.15-rc9 and Alexey's "sctp: fix dst refcnt leak in > sctp_v6_get_dst()" > cherry-picked on top, unable to reproduce. > > > Is syzkaller doing something else now to trigger the bug...? > Can you still trigger the bug with the same reproducer? Hi Neil, Tommi, Reviving this old thread about "unregister_netdevice: waiting for lo to become free. Usage count = 3" hangs. I still did not have time to deep dive into what happens there (too many bugs coming from syzbot). But this still actively happens and I suspect accounts to a significant portion of various hang reports, which are quite unpleasant. One idea that could make it all simpler: Is this wait loop in netdev_wait_allrefs() supposed to wait for any prolonged periods of time under any non-buggy conditions? E.g. more than 1-2 minutes? If it only supposed to wait briefly for things that already supposed to be shutting down, and we add a WARNING there after some timeout, then syzbot will report all info how/when it happens, hopefully extracting reproducers, and all the nice things. But this WARNING should not have any false positives under any realistic conditions (e.g. waiting for arrival of remote packets with large timeouts). Looking at some task hung reports, it seems that this code holds some mutexes, takes workqueue thread and prevents any progress with destruction of other devices (and net namespace creation/destruction), so I guess it should not wait for any indefinite periods of time? 
>>> >>> I'm working on this currently: >>> https://bugs.launchpad.net/ubuntu/zesty/+source/linux/+bug/1711407 >>> >>> I added a summary of what I've found to be the cause (or at least, one >>> possible cause) of this: >>> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1711407/comments/72 >>> >>> I'm working on a patch to work around the main side-effect of this, >>> which is hanging while holding the global net mutex. Hangs will still >>> happen (e.g. if
Re: net: hang in unregister_netdevice: waiting for lo to become free
On Fri, Apr 13, 2018 at 5:54 PM, Dmitry Vyukov wrote: > On Fri, Apr 13, 2018 at 2:43 PM, Dan Streetman wrote: >> On Thu, Apr 12, 2018 at 8:15 AM, Dmitry Vyukov wrote: >>> On Wed, Feb 21, 2018 at 3:53 PM, Tommi Rantala >>> wrote: On 20.02.2018 18:26, Neil Horman wrote: > > On Tue, Feb 20, 2018 at 09:14:41AM +0100, Dmitry Vyukov wrote: >> >> On Tue, Feb 20, 2018 at 8:56 AM, Tommi Rantala >> wrote: >>> >>> On 19.02.2018 20:59, Dmitry Vyukov wrote: Is this meant to be fixed already? I am still seeing this on the latest upstream tree. >>> >>> These two commits are in v4.16-rc1: >>> >>> commit 4a31a6b19f9ddf498c81f5c9b089742b7472a6f8 >>> Author: Tommi Rantala >>> Date: Mon Feb 5 21:48:14 2018 +0200 >>> >>> sctp: fix dst refcnt leak in sctp_v4_get_dst >>> ... >>> Fixes: 410f03831 ("sctp: add routing output fallback") >>> Fixes: 0ca50d12f ("sctp: fix src address selection if using >>> secondary >>> addresses") >>> >>> >>> commit 957d761cf91cdbb175ad7d8f5472336a4d54dbf2 >>> Author: Alexey Kodanev >>> Date: Mon Feb 5 15:10:35 2018 +0300 >>> >>> sctp: fix dst refcnt leak in sctp_v6_get_dst() >>> ... >>> Fixes: dbc2b5e9a09e ("sctp: fix src address selection if using >>> secondary >>> addresses for ipv6") >>> >>> >>> I guess we missed something if it's still reproducible. >>> >>> I can check it later this week, unless someone else beat me to it. >> >> >> Hi Tommi, >> >> Hmmm, I can't claim that it's exactly the same bug. Perhaps it's >> another one then. But I am still seeing these: >> >> [ 58.799130] unregister_netdevice: waiting for lo to become free. >> Usage count = 4 >> [ 60.847138] unregister_netdevice: waiting for lo to become free. >> Usage count = 4 >> [ 62.895093] unregister_netdevice: waiting for lo to become free. >> Usage count = 4 >> [ 64.943103] unregister_netdevice: waiting for lo to become free. >> Usage count = 4 >> >> on upstream tree pulled ~12 hours ago. 
>> > Can you write a systemtap script to probe dev_hold, and dev_put, printing > out a > backtrace if the device name matches "lo". That should tell us > definitively if > the problem is in the same location or not Hi Dmitry, I tested with the reproducer and the kernel .config file that you sent in the first email in this thread: With 4.16-rc2 unable to reproduce. With 4.15-rc9 bug reproducible, and I get "unregister_netdevice: waiting for lo to become free. Usage count = 3" With 4.15-rc9 and Alexey's "sctp: fix dst refcnt leak in sctp_v6_get_dst()" cherry-picked on top, unable to reproduce. Is syzkaller doing something else now to trigger the bug...? Can you still trigger the bug with the same reproducer? >>> >>> Hi Neil, Tommi, >>> >>> Reviving this old thread about "unregister_netdevice: waiting for lo >>> to become free. Usage count = 3" hangs. >>> I still did not have time to deep dive into what happens there (too >>> many bugs coming from syzbot). But this still actively happens and I >>> suspect accounts to a significant portion of various hang reports, >>> which are quite unpleasant. >>> >>> One idea that could make it all simpler: >>> >>> Is this wait loop in netdev_wait_allrefs() supposed to wait for any >>> prolonged periods of time under any non-buggy conditions? E.g. more >>> than 1-2 minutes? >>> If it only supposed to wait briefly for things that already supposed >>> to be shutting down, and we add a WARNING there after some timeout, >>> then syzbot will report all info how/when it happens, hopefully >>> extracting reproducers, and all the nice things. >>> But this WARNING should not have any false positives under any >>> realistic conditions (e.g. waiting for arrival of remote packets with >>> large timeouts). 
>>> >>> Looking at some task hung reports, it seems that this code holds some >>> mutexes, takes workqueue thread and prevents any progress with >>> destruction of other devices (and net namespace creation/destruction), >>> so I guess it should not wait for any indefinite periods of time? >> >> I'm working on this currently: >> https://bugs.launchpad.net/ubuntu/zesty/+source/linux/+bug/1711407 >> >> I added a summary of what I've found to be the cause (or at least, one >> possible cause) of this: >> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1711407/comments/72 >> >> I'm working on a patch to work around the main side-effect of this, >> which is hanging while holding the global net mutex. Hangs will still >> happen (e.g. if a dst leaks) but should not affect anything else, >> other than a leak of the dst and its net namespace. >> >> Fixing the dst leaks is important too, of course, but a dst leak (or >> other
Re: net: hang in unregister_netdevice: waiting for lo to become free
On Fri, Apr 13, 2018 at 2:43 PM, Dan Streetman wrote: > On Thu, Apr 12, 2018 at 8:15 AM, Dmitry Vyukov wrote: >> On Wed, Feb 21, 2018 at 3:53 PM, Tommi Rantala >> wrote: >>> On 20.02.2018 18:26, Neil Horman wrote: On Tue, Feb 20, 2018 at 09:14:41AM +0100, Dmitry Vyukov wrote: > > On Tue, Feb 20, 2018 at 8:56 AM, Tommi Rantala > wrote: >> >> On 19.02.2018 20:59, Dmitry Vyukov wrote: >>> >>> Is this meant to be fixed already? I am still seeing this on the >>> latest upstream tree. >>> >> >> These two commits are in v4.16-rc1: >> >> commit 4a31a6b19f9ddf498c81f5c9b089742b7472a6f8 >> Author: Tommi Rantala >> Date: Mon Feb 5 21:48:14 2018 +0200 >> >> sctp: fix dst refcnt leak in sctp_v4_get_dst >> ... >> Fixes: 410f03831 ("sctp: add routing output fallback") >> Fixes: 0ca50d12f ("sctp: fix src address selection if using >> secondary >> addresses") >> >> >> commit 957d761cf91cdbb175ad7d8f5472336a4d54dbf2 >> Author: Alexey Kodanev >> Date: Mon Feb 5 15:10:35 2018 +0300 >> >> sctp: fix dst refcnt leak in sctp_v6_get_dst() >> ... >> Fixes: dbc2b5e9a09e ("sctp: fix src address selection if using >> secondary >> addresses for ipv6") >> >> >> I guess we missed something if it's still reproducible. >> >> I can check it later this week, unless someone else beat me to it. > > > Hi Tommi, > > Hmmm, I can't claim that it's exactly the same bug. Perhaps it's > another one then. But I am still seeing these: > > [ 58.799130] unregister_netdevice: waiting for lo to become free. > Usage count = 4 > [ 60.847138] unregister_netdevice: waiting for lo to become free. > Usage count = 4 > [ 62.895093] unregister_netdevice: waiting for lo to become free. > Usage count = 4 > [ 64.943103] unregister_netdevice: waiting for lo to become free. > Usage count = 4 > > on upstream tree pulled ~12 hours ago. > Can you write a systemtap script to probe dev_hold, and dev_put, printing out a backtrace if the device name matches "lo". 
That should tell us definitively if the problem is in the same location or not >>> >>> >>> Hi Dmitry, I tested with the reproducer and the kernel .config file that you >>> sent in the first email in this thread: >>> >>> With 4.16-rc2 unable to reproduce. >>> >>> With 4.15-rc9 bug reproducible, and I get "unregister_netdevice: waiting for >>> lo to become free. Usage count = 3" >>> >>> With 4.15-rc9 and Alexey's "sctp: fix dst refcnt leak in sctp_v6_get_dst()" >>> cherry-picked on top, unable to reproduce. >>> >>> >>> Is syzkaller doing something else now to trigger the bug...? >>> Can you still trigger the bug with the same reproducer? >> >> Hi Neil, Tommi, >> >> Reviving this old thread about "unregister_netdevice: waiting for lo >> to become free. Usage count = 3" hangs. >> I still did not have time to deep dive into what happens there (too >> many bugs coming from syzbot). But this still actively happens and I >> suspect accounts to a significant portion of various hang reports, >> which are quite unpleasant. >> >> One idea that could make it all simpler: >> >> Is this wait loop in netdev_wait_allrefs() supposed to wait for any >> prolonged periods of time under any non-buggy conditions? E.g. more >> than 1-2 minutes? >> If it only supposed to wait briefly for things that already supposed >> to be shutting down, and we add a WARNING there after some timeout, >> then syzbot will report all info how/when it happens, hopefully >> extracting reproducers, and all the nice things. >> But this WARNING should not have any false positives under any >> realistic conditions (e.g. waiting for arrival of remote packets with >> large timeouts). >> >> Looking at some task hung reports, it seems that this code holds some >> mutexes, takes workqueue thread and prevents any progress with >> destruction of other devices (and net namespace creation/destruction), >> so I guess it should not wait for any indefinite periods of time? 
> > I'm working on this currently: > https://bugs.launchpad.net/ubuntu/zesty/+source/linux/+bug/1711407 > > I added a summary of what I've found to be the cause (or at least, one > possible cause) of this: > https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1711407/comments/72 > > I'm working on a patch to work around the main side-effect of this, > which is hanging while holding the global net mutex. Hangs will still > happen (e.g. if a dst leaks) but should not affect anything else, > other than a leak of the dst and its net namespace. > > Fixing the dst leaks is important too, of course, but a dst leak (or > other cause) shouldn't break the entire system. Leaking some memory is definitely better than hanging the system. So I've made syzkaller to recognize "unregister_netdevice: waiting for (.*) to
Re: net: hang in unregister_netdevice: waiting for lo to become free
On Thu, Apr 12, 2018 at 8:15 AM, Dmitry Vyukov wrote: > On Wed, Feb 21, 2018 at 3:53 PM, Tommi Rantala > wrote: >> On 20.02.2018 18:26, Neil Horman wrote: >>> >>> On Tue, Feb 20, 2018 at 09:14:41AM +0100, Dmitry Vyukov wrote: On Tue, Feb 20, 2018 at 8:56 AM, Tommi Rantala wrote: > > On 19.02.2018 20:59, Dmitry Vyukov wrote: >> >> Is this meant to be fixed already? I am still seeing this on the >> latest upstream tree. >> > > These two commits are in v4.16-rc1: > > commit 4a31a6b19f9ddf498c81f5c9b089742b7472a6f8 > Author: Tommi Rantala > Date: Mon Feb 5 21:48:14 2018 +0200 > > sctp: fix dst refcnt leak in sctp_v4_get_dst > ... > Fixes: 410f03831 ("sctp: add routing output fallback") > Fixes: 0ca50d12f ("sctp: fix src address selection if using > secondary > addresses") > > > commit 957d761cf91cdbb175ad7d8f5472336a4d54dbf2 > Author: Alexey Kodanev > Date: Mon Feb 5 15:10:35 2018 +0300 > > sctp: fix dst refcnt leak in sctp_v6_get_dst() > ... > Fixes: dbc2b5e9a09e ("sctp: fix src address selection if using > secondary > addresses for ipv6") > > > I guess we missed something if it's still reproducible. > > I can check it later this week, unless someone else beat me to it. Hi Tommi, Hmmm, I can't claim that it's exactly the same bug. Perhaps it's another one then. But I am still seeing these: [ 58.799130] unregister_netdevice: waiting for lo to become free. Usage count = 4 [ 60.847138] unregister_netdevice: waiting for lo to become free. Usage count = 4 [ 62.895093] unregister_netdevice: waiting for lo to become free. Usage count = 4 [ 64.943103] unregister_netdevice: waiting for lo to become free. Usage count = 4 on upstream tree pulled ~12 hours ago. >>> Can you write a systemtap script to probe dev_hold, and dev_put, printing >>> out a >>> backtrace if the device name matches "lo". 
That should tell us >>> definitively if >>> the problem is in the same location or not >> >> >> Hi Dmitry, I tested with the reproducer and the kernel .config file that you >> sent in the first email in this thread: >> >> With 4.16-rc2 unable to reproduce. >> >> With 4.15-rc9 bug reproducible, and I get "unregister_netdevice: waiting for >> lo to become free. Usage count = 3" >> >> With 4.15-rc9 and Alexey's "sctp: fix dst refcnt leak in sctp_v6_get_dst()" >> cherry-picked on top, unable to reproduce. >> >> >> Is syzkaller doing something else now to trigger the bug...? >> Can you still trigger the bug with the same reproducer? > > Hi Neil, Tommi, > > Reviving this old thread about "unregister_netdevice: waiting for lo > to become free. Usage count = 3" hangs. > I still did not have time to deep dive into what happens there (too > many bugs coming from syzbot). But this still actively happens and I > suspect accounts to a significant portion of various hang reports, > which are quite unpleasant. > > One idea that could make it all simpler: > > Is this wait loop in netdev_wait_allrefs() supposed to wait for any > prolonged periods of time under any non-buggy conditions? E.g. more > than 1-2 minutes? > If it only supposed to wait briefly for things that already supposed > to be shutting down, and we add a WARNING there after some timeout, > then syzbot will report all info how/when it happens, hopefully > extracting reproducers, and all the nice things. > But this WARNING should not have any false positives under any > realistic conditions (e.g. waiting for arrival of remote packets with > large timeouts). > > Looking at some task hung reports, it seems that this code holds some > mutexes, takes workqueue thread and prevents any progress with > destruction of other devices (and net namespace creation/destruction), > so I guess it should not wait for any indefinite periods of time? 
I'm working on this currently:
https://bugs.launchpad.net/ubuntu/zesty/+source/linux/+bug/1711407

I added a summary of what I've found to be the cause (or at least, one possible cause) of this:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1711407/comments/72

I'm working on a patch to work around the main side-effect of this, which is hanging while holding the global net mutex. Hangs will still happen (e.g. if a dst leaks) but should not affect anything else, other than a leak of the dst and its net namespace. Fixing the dst leaks is important too, of course, but a dst leak (or other cause) shouldn't break the entire system.
Re: net: hang in unregister_netdevice: waiting for lo to become free
On Thu, Apr 12, 2018 at 02:15:30PM +0200, Dmitry Vyukov wrote: > On Wed, Feb 21, 2018 at 3:53 PM, Tommi Rantala > wrote: > > On 20.02.2018 18:26, Neil Horman wrote: > >> > >> On Tue, Feb 20, 2018 at 09:14:41AM +0100, Dmitry Vyukov wrote: > >>> > >>> On Tue, Feb 20, 2018 at 8:56 AM, Tommi Rantala > >>> wrote: > > On 19.02.2018 20:59, Dmitry Vyukov wrote: > > > > Is this meant to be fixed already? I am still seeing this on the > > latest upstream tree. > > > > These two commits are in v4.16-rc1: > > commit 4a31a6b19f9ddf498c81f5c9b089742b7472a6f8 > Author: Tommi Rantala > Date: Mon Feb 5 21:48:14 2018 +0200 > > sctp: fix dst refcnt leak in sctp_v4_get_dst > ... > Fixes: 410f03831 ("sctp: add routing output fallback") > Fixes: 0ca50d12f ("sctp: fix src address selection if using > secondary > addresses") > > > commit 957d761cf91cdbb175ad7d8f5472336a4d54dbf2 > Author: Alexey Kodanev > Date: Mon Feb 5 15:10:35 2018 +0300 > > sctp: fix dst refcnt leak in sctp_v6_get_dst() > ... > Fixes: dbc2b5e9a09e ("sctp: fix src address selection if using > secondary > addresses for ipv6") > > > I guess we missed something if it's still reproducible. > > I can check it later this week, unless someone else beat me to it. > >>> > >>> > >>> Hi Tommi, > >>> > >>> Hmmm, I can't claim that it's exactly the same bug. Perhaps it's > >>> another one then. But I am still seeing these: > >>> > >>> [ 58.799130] unregister_netdevice: waiting for lo to become free. > >>> Usage count = 4 > >>> [ 60.847138] unregister_netdevice: waiting for lo to become free. > >>> Usage count = 4 > >>> [ 62.895093] unregister_netdevice: waiting for lo to become free. > >>> Usage count = 4 > >>> [ 64.943103] unregister_netdevice: waiting for lo to become free. > >>> Usage count = 4 > >>> > >>> on upstream tree pulled ~12 hours ago. > >>> > >> Can you write a systemtap script to probe dev_hold, and dev_put, printing > >> out a > >> backtrace if the device name matches "lo". 
That should tell us > >> definitively if > >> the problem is in the same location or not > > > > > > Hi Dmitry, I tested with the reproducer and the kernel .config file that you > > sent in the first email in this thread: > > > > With 4.16-rc2 unable to reproduce. > > > > With 4.15-rc9 bug reproducible, and I get "unregister_netdevice: waiting for > > lo to become free. Usage count = 3" > > > > With 4.15-rc9 and Alexey's "sctp: fix dst refcnt leak in sctp_v6_get_dst()" > > cherry-picked on top, unable to reproduce. > > > > > > Is syzkaller doing something else now to trigger the bug...? > > Can you still trigger the bug with the same reproducer? > > Hi Neil, Tommi, > > Reviving this old thread about "unregister_netdevice: waiting for lo > to become free. Usage count = 3" hangs. > I still did not have time to deep dive into what happens there (too > many bugs coming from syzbot). But this still actively happens and I > suspect accounts to a significant portion of various hang reports, > which are quite unpleasant. > > One idea that could make it all simpler: > > Is this wait loop in netdev_wait_allrefs() supposed to wait for any > prolonged periods of time under any non-buggy conditions? E.g. more > than 1-2 minutes?

As the name implies, it's supposed to wait for the reference count to be zero indefinitely, but yes, under normal operation, it's intended to not have to wait very long at all. The issuance of the NETDEV_UNREGISTER_FINAL notification is meant to be a subscribable signal to any code path holding a reference that it needs to be dropped so that progress can be made.

Note that the "waiting for %s to become free" message is triggered after 10 seconds of waiting, and is likely the trigger you want. It's just an emergency-level log message rather than a WARN. I don't think we want to change that permanently, but you could certainly alter it in the code to cause syzbot to catch it, i.e. WARN_ON(time_after(jiffies, warning_time + 10 * HZ)).

> If it only supposed to wait briefly for things that already supposed > to be shutting down, and we add a WARNING there after some timeout, > then syzbot will report all info how/when it happens, hopefully > extracting reproducers, and all the nice things. > But this WARNING should not have any false positives under any > realistic conditions (e.g. waiting for arrival of remote packets with > large timeouts). > > Looking at some task hung reports, it seems that this code holds some > mutexes, takes workqueue thread and prevents any progress with > destruction of other devices (and net namespace creation/destruction), > so I guess it should not wait for any indefinite periods of time?

Well, it drops everything and sleeps periodically, so it's safe in and of itself. The problem is it's waiting for the reference count of a device to drop
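
As a rough illustration of the timeout check suggested above, here is a userspace sketch in C of a wraparound-safe "more than 10 seconds since the last warning" test. The comparison is modelled on the kernel's time_after(), which casts an unsigned tick difference to a signed value. Everything else is an assumption made for this sketch and not a kernel identifier: ticks_t, SKETCH_HZ (an arbitrary tick rate), ticks_after() and wait_exceeded().

```c
#include <stdbool.h>

/* Userspace stand-in for the type of the kernel's jiffies counter. */
typedef unsigned long ticks_t;

/* Hypothetical tick rate for this sketch; the kernel's HZ is set at build time. */
#define SKETCH_HZ 250UL

/*
 * Wraparound-safe "a is later than b", modelled on time_after():
 * the unsigned difference is reinterpreted as a signed value, so the
 * result stays correct when the counter overflows between b and a.
 */
static bool ticks_after(ticks_t a, ticks_t b)
{
    return (long)(b - a) < 0;
}

/* True once more than 10 seconds of ticks have passed since warning_time. */
static bool wait_exceeded(ticks_t now, ticks_t warning_time)
{
    return ticks_after(now, warning_time + 10 * SKETCH_HZ);
}
```

A caller would sample the tick counter on each pass of its wait loop and feed wait_exceeded() into whatever reporting it wants (a log line, or a WARN-style report as discussed above). The signed cast is the reason to use this form instead of a plain `now > warning_time + 10 * SKETCH_HZ`, which gives the wrong answer across a counter overflow.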
Re: net: hang in unregister_netdevice: waiting for lo to become free
On Wed, Feb 21, 2018 at 3:53 PM, Tommi Rantala wrote: > On 20.02.2018 18:26, Neil Horman wrote: >> >> On Tue, Feb 20, 2018 at 09:14:41AM +0100, Dmitry Vyukov wrote: >>> >>> On Tue, Feb 20, 2018 at 8:56 AM, Tommi Rantala >>> wrote: On 19.02.2018 20:59, Dmitry Vyukov wrote: > > Is this meant to be fixed already? I am still seeing this on the > latest upstream tree. > These two commits are in v4.16-rc1: commit 4a31a6b19f9ddf498c81f5c9b089742b7472a6f8 Author: Tommi Rantala Date: Mon Feb 5 21:48:14 2018 +0200 sctp: fix dst refcnt leak in sctp_v4_get_dst ... Fixes: 410f03831 ("sctp: add routing output fallback") Fixes: 0ca50d12f ("sctp: fix src address selection if using secondary addresses") commit 957d761cf91cdbb175ad7d8f5472336a4d54dbf2 Author: Alexey Kodanev Date: Mon Feb 5 15:10:35 2018 +0300 sctp: fix dst refcnt leak in sctp_v6_get_dst() ... Fixes: dbc2b5e9a09e ("sctp: fix src address selection if using secondary addresses for ipv6") I guess we missed something if it's still reproducible. I can check it later this week, unless someone else beat me to it. >>> >>> >>> Hi Tommi, >>> >>> Hmmm, I can't claim that it's exactly the same bug. Perhaps it's >>> another one then. But I am still seeing these: >>> >>> [ 58.799130] unregister_netdevice: waiting for lo to become free. >>> Usage count = 4 >>> [ 60.847138] unregister_netdevice: waiting for lo to become free. >>> Usage count = 4 >>> [ 62.895093] unregister_netdevice: waiting for lo to become free. >>> Usage count = 4 >>> [ 64.943103] unregister_netdevice: waiting for lo to become free. >>> Usage count = 4 >>> >>> on upstream tree pulled ~12 hours ago. >>> >> Can you write a systemtap script to probe dev_hold, and dev_put, printing >> out a >> backtrace if the device name matches "lo". 
That should tell us >> definitively if >> the problem is in the same location or not > > > Hi Dmitry, I tested with the reproducer and the kernel .config file that you > sent in the first email in this thread: > > With 4.16-rc2 unable to reproduce. > > With 4.15-rc9 bug reproducible, and I get "unregister_netdevice: waiting for > lo to become free. Usage count = 3" > > With 4.15-rc9 and Alexey's "sctp: fix dst refcnt leak in sctp_v6_get_dst()" > cherry-picked on top, unable to reproduce. > > > Is syzkaller doing something else now to trigger the bug...? > Can you still trigger the bug with the same reproducer? Hi Neil, Tommi, Reviving this old thread about "unregister_netdevice: waiting for lo to become free. Usage count = 3" hangs. I still did not have time to deep dive into what happens there (too many bugs coming from syzbot). But this still actively happens and I suspect accounts to a significant portion of various hang reports, which are quite unpleasant. One idea that could make it all simpler: Is this wait loop in netdev_wait_allrefs() supposed to wait for any prolonged periods of time under any non-buggy conditions? E.g. more than 1-2 minutes? If it only supposed to wait briefly for things that already supposed to be shutting down, and we add a WARNING there after some timeout, then syzbot will report all info how/when it happens, hopefully extracting reproducers, and all the nice things. But this WARNING should not have any false positives under any realistic conditions (e.g. waiting for arrival of remote packets with large timeouts). Looking at some task hung reports, it seems that this code holds some mutexes, takes workqueue thread and prevents any progress with destruction of other devices (and net namespace creation/destruction), so I guess it should not wait for any indefinite periods of time?
Re: net: hang in unregister_netdevice: waiting for lo to become free
On 20.02.2018 18:26, Neil Horman wrote: On Tue, Feb 20, 2018 at 09:14:41AM +0100, Dmitry Vyukov wrote: On Tue, Feb 20, 2018 at 8:56 AM, Tommi Rantala wrote: On 19.02.2018 20:59, Dmitry Vyukov wrote: Is this meant to be fixed already? I am still seeing this on the latest upstream tree. These two commits are in v4.16-rc1: commit 4a31a6b19f9ddf498c81f5c9b089742b7472a6f8 Author: Tommi Rantala Date: Mon Feb 5 21:48:14 2018 +0200 sctp: fix dst refcnt leak in sctp_v4_get_dst ... Fixes: 410f03831 ("sctp: add routing output fallback") Fixes: 0ca50d12f ("sctp: fix src address selection if using secondary addresses") commit 957d761cf91cdbb175ad7d8f5472336a4d54dbf2 Author: Alexey Kodanev Date: Mon Feb 5 15:10:35 2018 +0300 sctp: fix dst refcnt leak in sctp_v6_get_dst() ... Fixes: dbc2b5e9a09e ("sctp: fix src address selection if using secondary addresses for ipv6") I guess we missed something if it's still reproducible. I can check it later this week, unless someone else beat me to it. Hi Tommi, Hmmm, I can't claim that it's exactly the same bug. Perhaps it's another one then. But I am still seeing these: [ 58.799130] unregister_netdevice: waiting for lo to become free. Usage count = 4 [ 60.847138] unregister_netdevice: waiting for lo to become free. Usage count = 4 [ 62.895093] unregister_netdevice: waiting for lo to become free. Usage count = 4 [ 64.943103] unregister_netdevice: waiting for lo to become free. Usage count = 4 on upstream tree pulled ~12 hours ago. Can you write a systemtap script to probe dev_hold, and dev_put, printing out a backtrace if the device name matches "lo". That should tell us definitively if the problem is in the same location or not Hi Dmitry, I tested with the reproducer and the kernel .config file that you sent in the first email in this thread: With 4.16-rc2 unable to reproduce. With 4.15-rc9 bug reproducible, and I get "unregister_netdevice: waiting for lo to become free. 
Usage count = 3" With 4.15-rc9 and Alexey's "sctp: fix dst refcnt leak in sctp_v6_get_dst()" cherry-picked on top, unable to reproduce. Is syzkaller doing something else now to trigger the bug...? Can you still trigger the bug with the same reproducer? Tommi
Re: net: hang in unregister_netdevice: waiting for lo to become free
On Tue, Feb 20, 2018 at 09:14:41AM +0100, Dmitry Vyukov wrote: > On Tue, Feb 20, 2018 at 8:56 AM, Tommi Rantala > wrote: > > On 19.02.2018 20:59, Dmitry Vyukov wrote: > >> > >> On Sat, Feb 3, 2018 at 1:15 PM, Xin Long wrote: > > > > On 1/30/18 1:57 PM, David Ahern wrote: > >> > >> On 1/30/18 1:08 PM, Daniel Borkmann wrote: > >>> > >>> On 01/30/2018 07:32 PM, Cong Wang wrote: > > On Tue, Jan 30, 2018 at 4:09 AM, Dmitry Vyukov > wrote: > > > > Hello, > > > > The following program creates a hang in unregister_netdevice. > > cleanup_net work hangs there forever periodically printing > > "unregister_netdevice: waiting for lo to become free. Usage count = > > 3" > > and creation of any new network namespaces hangs forever. > > > Interestingly, this is not reproducible on net-next. > >>> > >>> > >>> The most recent change on netns refcnt was 4ee806d51176 ("net: tcp: > >>> close > >>> sock if net namespace is exiting") in net/net-next from 5 days ago, > >>> maybe > >>> fixed due to that? > >>> > >> > >> This appears to be the commit introducing the refcnt leak: > >> > >> $ git bisect bad > >> dbc2b5e9a09e9a6664679a667ff81cff6e5f2641 is the first bad commit > >> commit dbc2b5e9a09e9a6664679a667ff81cff6e5f2641 > >> Author: Xin Long > >> Date: Fri May 12 14:39:52 2017 +0800 > >> > >> sctp: fix src address selection if using secondary addresses for > >> ipv6 > >> > >> > >> v4.14 is bad. Running bisect in the background while doing other > >> things > >> > > > > Interesting. The commit that avoids the refcnt leak is > > > > commit 955ec4cb3b54c7c389a9f830be7d3ae2056b9212 > > Author: David Ahern > > Date: Wed Jan 24 19:45:29 2018 -0800 > > > > net/ipv6: Do not allow route add with a device that is down > > > > That commit does not intentionally address the problem so it is just > > masking the problematic code introduced by the commit above. > > Thanks, David A. > > I'm still on a trip. will look into this asap. 
> >>> > >>> > >>> Alexey and Tommi already had the patches for this issue on > >>> both SCTP v4 and v6 dst_get, Thanks. > >> > >> > >> > >> > >> Is this meant to be fixed already? I am still seeing this on the > >> latest upstream tree. > >> > > > > These two commits are in v4.16-rc1: > > > > commit 4a31a6b19f9ddf498c81f5c9b089742b7472a6f8 > > Author: Tommi Rantala > > Date: Mon Feb 5 21:48:14 2018 +0200 > > > > sctp: fix dst refcnt leak in sctp_v4_get_dst > > ... > > Fixes: 410f03831 ("sctp: add routing output fallback") > > Fixes: 0ca50d12f ("sctp: fix src address selection if using secondary > > addresses") > > > > > > commit 957d761cf91cdbb175ad7d8f5472336a4d54dbf2 > > Author: Alexey Kodanev > > Date: Mon Feb 5 15:10:35 2018 +0300 > > > > sctp: fix dst refcnt leak in sctp_v6_get_dst() > > ... > > Fixes: dbc2b5e9a09e ("sctp: fix src address selection if using secondary > > addresses for ipv6") > > > > > > I guess we missed something if it's still reproducible. > > > > I can check it later this week, unless someone else beat me to it. > > Hi Tommi, > > Hmmm, I can't claim that it's exactly the same bug. Perhaps it's > another one then. But I am still seeing these: > > [ 58.799130] unregister_netdevice: waiting for lo to become free. > Usage count = 4 > [ 60.847138] unregister_netdevice: waiting for lo to become free. > Usage count = 4 > [ 62.895093] unregister_netdevice: waiting for lo to become free. > Usage count = 4 > [ 64.943103] unregister_netdevice: waiting for lo to become free. > Usage count = 4 > > on upstream tree pulled ~12 hours ago. > Can you write a systemtap script to probe dev_hold, and dev_put, printing out a backtrace if the device name matches "lo". That should tell us definitively if the problem is in the same location or not Neil > Kernel does not detect this as any kind of BUG/WARNING, so > syzkaller/syzbot do not catch it as bug and do not try to reproduce, > localize and report. 
Re: net: hang in unregister_netdevice: waiting for lo to become free
On Tue, Feb 20, 2018 at 8:56 AM, Tommi Rantala wrote:
> On 19.02.2018 20:59, Dmitry Vyukov wrote:
>>
>> On Sat, Feb 3, 2018 at 1:15 PM, Xin Long wrote:
>>>>>
>>>>> On 1/30/18 1:57 PM, David Ahern wrote:
>>>>>>
>>>>>> On 1/30/18 1:08 PM, Daniel Borkmann wrote:
>>>>>>>
>>>>>>> On 01/30/2018 07:32 PM, Cong Wang wrote:
>>>>>>>>
>>>>>>>> On Tue, Jan 30, 2018 at 4:09 AM, Dmitry Vyukov wrote:
>>>>>>>>>
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> The following program creates a hang in unregister_netdevice.
>>>>>>>>> cleanup_net work hangs there forever periodically printing
>>>>>>>>> "unregister_netdevice: waiting for lo to become free. Usage count =
>>>>>>>>> 3"
>>>>>>>>> and creation of any new network namespaces hangs forever.
>>>>>>>>
>>>>>>>>
>>>>>>>> Interestingly, this is not reproducible on net-next.
>>>>>>>
>>>>>>>
>>>>>>> The most recent change on netns refcnt was 4ee806d51176 ("net: tcp:
>>>>>>> close
>>>>>>> sock if net namespace is exiting") in net/net-next from 5 days ago,
>>>>>>> maybe
>>>>>>> fixed due to that?
>>>>>>>
>>>>>>
>>>>>> This appears to be the commit introducing the refcnt leak:
>>>>>>
>>>>>> $ git bisect bad
>>>>>> dbc2b5e9a09e9a6664679a667ff81cff6e5f2641 is the first bad commit
>>>>>> commit dbc2b5e9a09e9a6664679a667ff81cff6e5f2641
>>>>>> Author: Xin Long
>>>>>> Date:   Fri May 12 14:39:52 2017 +0800
>>>>>>
>>>>>>     sctp: fix src address selection if using secondary addresses for
>>>>>> ipv6
>>>>>>
>>>>>>
>>>>>> v4.14 is bad. Running bisect in the background while doing other
>>>>>> things
>>>>>>
>>>>>
>>>>> Interesting. The commit that avoids the refcnt leak is
>>>>>
>>>>> commit 955ec4cb3b54c7c389a9f830be7d3ae2056b9212
>>>>> Author: David Ahern
>>>>> Date:   Wed Jan 24 19:45:29 2018 -0800
>>>>>
>>>>>     net/ipv6: Do not allow route add with a device that is down
>>>>>
>>>>> That commit does not intentionally address the problem so it is just
>>>>> masking the problematic code introduced by the commit above.
>>>>
>>>> Thanks, David A.
>>>>
>>>> I'm still on a trip. will look into this asap.
>>>
>>>
>>> Alexey and Tommi already had the patches for this issue on
>>> both SCTP v4 and v6 dst_get, Thanks.
>>
>>
>>
>>
>> Is this meant to be fixed already? I am still seeing this on the
>> latest upstream tree.
>>
>
> These two commits are in v4.16-rc1:
>
> commit 4a31a6b19f9ddf498c81f5c9b089742b7472a6f8
> Author: Tommi Rantala
> Date:   Mon Feb 5 21:48:14 2018 +0200
>
>     sctp: fix dst refcnt leak in sctp_v4_get_dst
> ...
> Fixes: 410f03831 ("sctp: add routing output fallback")
> Fixes: 0ca50d12f ("sctp: fix src address selection if using secondary
> addresses")
>
>
> commit 957d761cf91cdbb175ad7d8f5472336a4d54dbf2
> Author: Alexey Kodanev
> Date:   Mon Feb 5 15:10:35 2018 +0300
>
>     sctp: fix dst refcnt leak in sctp_v6_get_dst()
> ...
> Fixes: dbc2b5e9a09e ("sctp: fix src address selection if using secondary
> addresses for ipv6")
>
>
> I guess we missed something if it's still reproducible.
>
> I can check it later this week, unless someone else beat me to it.

Hi Tommi,

Hmmm, I can't claim that it's exactly the same bug. Perhaps it's
another one then. But I am still seeing these:

[ 58.799130] unregister_netdevice: waiting for lo to become free. Usage count = 4
[ 60.847138] unregister_netdevice: waiting for lo to become free. Usage count = 4
[ 62.895093] unregister_netdevice: waiting for lo to become free. Usage count = 4
[ 64.943103] unregister_netdevice: waiting for lo to become free. Usage count = 4

on upstream tree pulled ~12 hours ago.

Kernel does not detect this as any kind of BUG/WARNING, so
syzkaller/syzbot do not catch it as bug and do not try to reproduce,
localize and report.
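Since the kernel does not flag this condition itself, one way to notice it from the outside is to scan the kernel log for the repeated message. The following is a minimal Python sketch under stated assumptions: the function name `find_stuck_devices` and the `min_repeats` threshold are hypothetical, and the regular expression is written against the message text quoted in this thread rather than against any particular kernel source.

```python
import re
from collections import defaultdict

# Matches the message as it is quoted in this thread.
PATTERN = re.compile(
    r"unregister_netdevice: waiting for (\S+) to become free\. "
    r"Usage count = (\d+)"
)


def find_stuck_devices(lines, min_repeats=3):
    """Return {device: (times_seen, last_usage_count)} for devices whose
    "waiting for ... to become free" message shows up at least
    `min_repeats` times in `lines`.

    `min_repeats` is an arbitrary illustrative threshold, not a kernel value.
    """
    seen = defaultdict(list)
    for line in lines:
        match = PATTERN.search(line)
        if match:
            seen[match.group(1)].append(int(match.group(2)))
    return {
        dev: (len(counts), counts[-1])
        for dev, counts in seen.items()
        if len(counts) >= min_repeats
    }


# The four log lines reported above, each joined onto a single line.
log = [
    "[ 58.799130] unregister_netdevice: waiting for lo to become free. Usage count = 4",
    "[ 60.847138] unregister_netdevice: waiting for lo to become free. Usage count = 4",
    "[ 62.895093] unregister_netdevice: waiting for lo to become free. Usage count = 4",
    "[ 64.943103] unregister_netdevice: waiting for lo to become free. Usage count = 4",
]
print(find_stuck_devices(log))
```

This only detects the symptom after the fact; it says nothing about which code path leaked the reference.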
Re: net: hang in unregister_netdevice: waiting for lo to become free
On 19.02.2018 20:59, Dmitry Vyukov wrote:
>
> On Sat, Feb 3, 2018 at 1:15 PM, Xin Long wrote:
>>>>
>>>> On 1/30/18 1:57 PM, David Ahern wrote:
>>>>>
>>>>> On 1/30/18 1:08 PM, Daniel Borkmann wrote:
>>>>>>
>>>>>> On 01/30/2018 07:32 PM, Cong Wang wrote:
>>>>>>>
>>>>>>> On Tue, Jan 30, 2018 at 4:09 AM, Dmitry Vyukov wrote:
>>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> The following program creates a hang in unregister_netdevice.
>>>>>>>> cleanup_net work hangs there forever periodically printing
>>>>>>>> "unregister_netdevice: waiting for lo to become free. Usage count =
>>>>>>>> 3"
>>>>>>>> and creation of any new network namespaces hangs forever.
>>>>>>>
>>>>>>>
>>>>>>> Interestingly, this is not reproducible on net-next.
>>>>>>
>>>>>>
>>>>>> The most recent change on netns refcnt was 4ee806d51176 ("net: tcp:
>>>>>> close
>>>>>> sock if net namespace is exiting") in net/net-next from 5 days ago,
>>>>>> maybe
>>>>>> fixed due to that?
>>>>>>
>>>>>
>>>>> This appears to be the commit introducing the refcnt leak:
>>>>>
>>>>> $ git bisect bad
>>>>> dbc2b5e9a09e9a6664679a667ff81cff6e5f2641 is the first bad commit
>>>>> commit dbc2b5e9a09e9a6664679a667ff81cff6e5f2641
>>>>> Author: Xin Long
>>>>> Date:   Fri May 12 14:39:52 2017 +0800
>>>>>
>>>>>     sctp: fix src address selection if using secondary addresses for
>>>>> ipv6
>>>>>
>>>>>
>>>>> v4.14 is bad. Running bisect in the background while doing other
>>>>> things
>>>>>
>>>>
>>>> Interesting. The commit that avoids the refcnt leak is
>>>>
>>>> commit 955ec4cb3b54c7c389a9f830be7d3ae2056b9212
>>>> Author: David Ahern
>>>> Date:   Wed Jan 24 19:45:29 2018 -0800
>>>>
>>>>     net/ipv6: Do not allow route add with a device that is down
>>>>
>>>> That commit does not intentionally address the problem so it is just
>>>> masking the problematic code introduced by the commit above.
>>>
>>> Thanks, David A.
>>>
>>> I'm still on a trip. will look into this asap.
>>
>>
>> Alexey and Tommi already had the patches for this issue on
>> both SCTP v4 and v6 dst_get, Thanks.
>
>
>
>
> Is this meant to be fixed already? I am still seeing this on the
> latest upstream tree.
>

These two commits are in v4.16-rc1:

commit 4a31a6b19f9ddf498c81f5c9b089742b7472a6f8
Author: Tommi Rantala
Date:   Mon Feb 5 21:48:14 2018 +0200

    sctp: fix dst refcnt leak in sctp_v4_get_dst
    ...
    Fixes: 410f03831 ("sctp: add routing output fallback")
    Fixes: 0ca50d12f ("sctp: fix src address selection if using secondary addresses")

commit 957d761cf91cdbb175ad7d8f5472336a4d54dbf2
Author: Alexey Kodanev
Date:   Mon Feb 5 15:10:35 2018 +0300

    sctp: fix dst refcnt leak in sctp_v6_get_dst()
    ...
    Fixes: dbc2b5e9a09e ("sctp: fix src address selection if using secondary addresses for ipv6")

I guess we missed something if it's still reproducible.

I can check it later this week, unless someone else beat me to it.

Tommi
Re: net: hang in unregister_netdevice: waiting for lo to become free
On Sat, Feb 3, 2018 at 1:15 PM, Xin Long wrote:
>>> On 1/30/18 1:57 PM, David Ahern wrote:
>>>> On 1/30/18 1:08 PM, Daniel Borkmann wrote:
>>>>> On 01/30/2018 07:32 PM, Cong Wang wrote:
>>>>>> On Tue, Jan 30, 2018 at 4:09 AM, Dmitry Vyukov
>>>>>> wrote:
>>>>>>> Hello,
>>>>>>>
>>>>>>> The following program creates a hang in unregister_netdevice.
>>>>>>> cleanup_net work hangs there forever periodically printing
>>>>>>> "unregister_netdevice: waiting for lo to become free. Usage count = 3"
>>>>>>> and creation of any new network namespaces hangs forever.
>>>>>>
>>>>>> Interestingly, this is not reproducible on net-next.
>>>>>
>>>>> The most recent change on netns refcnt was 4ee806d51176 ("net: tcp: close
>>>>> sock if net namespace is exiting") in net/net-next from 5 days ago, maybe
>>>>> fixed due to that?
>>>>>
>>>>
>>>> This appears to be the commit introducing the refcnt leak:
>>>>
>>>> $ git bisect bad
>>>> dbc2b5e9a09e9a6664679a667ff81cff6e5f2641 is the first bad commit
>>>> commit dbc2b5e9a09e9a6664679a667ff81cff6e5f2641
>>>> Author: Xin Long
>>>> Date:   Fri May 12 14:39:52 2017 +0800
>>>>
>>>>     sctp: fix src address selection if using secondary addresses for ipv6
>>>>
>>>>
>>>> v4.14 is bad. Running bisect in the background while doing other things
>>>>
>>>
>>> Interesting. The commit that avoids the refcnt leak is
>>>
>>> commit 955ec4cb3b54c7c389a9f830be7d3ae2056b9212
>>> Author: David Ahern
>>> Date:   Wed Jan 24 19:45:29 2018 -0800
>>>
>>>     net/ipv6: Do not allow route add with a device that is down
>>>
>>> That commit does not intentionally address the problem so it is just
>>> masking the problematic code introduced by the commit above.
>> Thanks, David A.
>>
>> I'm still on a trip. will look into this asap.
>
> Alexey and Tommi already had the patches for this issue on
> both SCTP v4 and v6 dst_get, Thanks.

Is this meant to be fixed already? I am still seeing this on the
latest upstream tree.
Re: net: hang in unregister_netdevice: waiting for lo to become free
On Thu, Feb 1, 2018 at 1:49 AM, Xin Long wrote:
> On Tue, Jan 30, 2018 at 11:59 PM, David Ahern wrote:
>> On 1/30/18 1:57 PM, David Ahern wrote:
>>> On 1/30/18 1:08 PM, Daniel Borkmann wrote:
>>>> On 01/30/2018 07:32 PM, Cong Wang wrote:
>>>>> On Tue, Jan 30, 2018 at 4:09 AM, Dmitry Vyukov wrote:
>>>>>> Hello,
>>>>>>
>>>>>> The following program creates a hang in unregister_netdevice.
>>>>>> cleanup_net work hangs there forever periodically printing
>>>>>> "unregister_netdevice: waiting for lo to become free. Usage count = 3"
>>>>>> and creation of any new network namespaces hangs forever.
>>>>>
>>>>> Interestingly, this is not reproducible on net-next.
>>>>
>>>> The most recent change on netns refcnt was 4ee806d51176 ("net: tcp: close
>>>> sock if net namespace is exiting") in net/net-next from 5 days ago, maybe
>>>> fixed due to that?
>>>>
>>>
>>> This appears to be the commit introducing the refcnt leak:
>>>
>>> $ git bisect bad
>>> dbc2b5e9a09e9a6664679a667ff81cff6e5f2641 is the first bad commit
>>> commit dbc2b5e9a09e9a6664679a667ff81cff6e5f2641
>>> Author: Xin Long
>>> Date:   Fri May 12 14:39:52 2017 +0800
>>>
>>>     sctp: fix src address selection if using secondary addresses for ipv6
>>>
>>>
>>> v4.14 is bad. Running bisect in the background while doing other things
>>>
>>
>> Interesting. The commit that avoids the refcnt leak is
>>
>> commit 955ec4cb3b54c7c389a9f830be7d3ae2056b9212
>> Author: David Ahern
>> Date:   Wed Jan 24 19:45:29 2018 -0800
>>
>>     net/ipv6: Do not allow route add with a device that is down
>>
>> That commit does not intentionally address the problem so it is just
>> masking the problematic code introduced by the commit above.
> Thanks, David A.
>
> I'm still on a trip. will look into this asap.

Alexey and Tommi already had the patches for this issue on
both SCTP v4 and v6 dst_get, Thanks.
Re: net: hang in unregister_netdevice: waiting for lo to become free
On Tue, Jan 30, 2018 at 11:59 PM, David Ahern wrote:
> On 1/30/18 1:57 PM, David Ahern wrote:
>> On 1/30/18 1:08 PM, Daniel Borkmann wrote:
>>> On 01/30/2018 07:32 PM, Cong Wang wrote:
>>>> On Tue, Jan 30, 2018 at 4:09 AM, Dmitry Vyukov wrote:
>>>>> Hello,
>>>>>
>>>>> The following program creates a hang in unregister_netdevice.
>>>>> cleanup_net work hangs there forever periodically printing
>>>>> "unregister_netdevice: waiting for lo to become free. Usage count = 3"
>>>>> and creation of any new network namespaces hangs forever.
>>>>
>>>> Interestingly, this is not reproducible on net-next.
>>>
>>> The most recent change on netns refcnt was 4ee806d51176 ("net: tcp: close
>>> sock if net namespace is exiting") in net/net-next from 5 days ago, maybe
>>> fixed due to that?
>>>
>>
>> This appears to be the commit introducing the refcnt leak:
>>
>> $ git bisect bad
>> dbc2b5e9a09e9a6664679a667ff81cff6e5f2641 is the first bad commit
>> commit dbc2b5e9a09e9a6664679a667ff81cff6e5f2641
>> Author: Xin Long
>> Date:   Fri May 12 14:39:52 2017 +0800
>>
>>     sctp: fix src address selection if using secondary addresses for ipv6
>>
>>
>> v4.14 is bad. Running bisect in the background while doing other things
>>
>
> Interesting. The commit that avoids the refcnt leak is
>
> commit 955ec4cb3b54c7c389a9f830be7d3ae2056b9212
> Author: David Ahern
> Date:   Wed Jan 24 19:45:29 2018 -0800
>
>     net/ipv6: Do not allow route add with a device that is down
>
> That commit does not intentionally address the problem so it is just
> masking the problematic code introduced by the commit above.

Thanks, David A.

I'm still on a trip. will look into this asap.
Re: net: hang in unregister_netdevice: waiting for lo to become free
On 1/30/18 1:57 PM, David Ahern wrote:
> On 1/30/18 1:08 PM, Daniel Borkmann wrote:
>> On 01/30/2018 07:32 PM, Cong Wang wrote:
>>> On Tue, Jan 30, 2018 at 4:09 AM, Dmitry Vyukov wrote:
>>>> Hello,
>>>>
>>>> The following program creates a hang in unregister_netdevice.
>>>> cleanup_net work hangs there forever periodically printing
>>>> "unregister_netdevice: waiting for lo to become free. Usage count = 3"
>>>> and creation of any new network namespaces hangs forever.
>>>
>>> Interestingly, this is not reproducible on net-next.
>>
>> The most recent change on netns refcnt was 4ee806d51176 ("net: tcp: close
>> sock if net namespace is exiting") in net/net-next from 5 days ago, maybe
>> fixed due to that?
>>
>
> This appears to be the commit introducing the refcnt leak:
>
> $ git bisect bad
> dbc2b5e9a09e9a6664679a667ff81cff6e5f2641 is the first bad commit
> commit dbc2b5e9a09e9a6664679a667ff81cff6e5f2641
> Author: Xin Long
> Date:   Fri May 12 14:39:52 2017 +0800
>
>     sctp: fix src address selection if using secondary addresses for ipv6
>
>
> v4.14 is bad. Running bisect in the background while doing other things
>

Interesting. The commit that avoids the refcnt leak is

commit 955ec4cb3b54c7c389a9f830be7d3ae2056b9212
Author: David Ahern
Date:   Wed Jan 24 19:45:29 2018 -0800

    net/ipv6: Do not allow route add with a device that is down

That commit does not intentionally address the problem so it is just
masking the problematic code introduced by the commit above.
Re: net: hang in unregister_netdevice: waiting for lo to become free
On 1/30/18 1:08 PM, Daniel Borkmann wrote:
> On 01/30/2018 07:32 PM, Cong Wang wrote:
>> On Tue, Jan 30, 2018 at 4:09 AM, Dmitry Vyukov wrote:
>>> Hello,
>>>
>>> The following program creates a hang in unregister_netdevice.
>>> cleanup_net work hangs there forever periodically printing
>>> "unregister_netdevice: waiting for lo to become free. Usage count = 3"
>>> and creation of any new network namespaces hangs forever.
>>
>> Interestingly, this is not reproducible on net-next.
>
> The most recent change on netns refcnt was 4ee806d51176 ("net: tcp: close
> sock if net namespace is exiting") in net/net-next from 5 days ago, maybe
> fixed due to that?
>

This appears to be the commit introducing the refcnt leak:

$ git bisect bad
dbc2b5e9a09e9a6664679a667ff81cff6e5f2641 is the first bad commit
commit dbc2b5e9a09e9a6664679a667ff81cff6e5f2641
Author: Xin Long
Date:   Fri May 12 14:39:52 2017 +0800

    sctp: fix src address selection if using secondary addresses for ipv6


v4.14 is bad. Running bisect in the background while doing other things
Re: net: hang in unregister_netdevice: waiting for lo to become free
On 01/30/2018 07:32 PM, Cong Wang wrote:
> On Tue, Jan 30, 2018 at 4:09 AM, Dmitry Vyukov wrote:
>> Hello,
>>
>> The following program creates a hang in unregister_netdevice.
>> cleanup_net work hangs there forever periodically printing
>> "unregister_netdevice: waiting for lo to become free. Usage count = 3"
>> and creation of any new network namespaces hangs forever.
>
> Interestingly, this is not reproducible on net-next.

The most recent change on netns refcnt was 4ee806d51176 ("net: tcp: close
sock if net namespace is exiting") in net/net-next from 5 days ago, maybe
fixed due to that?
Re: net: hang in unregister_netdevice: waiting for lo to become free
On Tue, Jan 30, 2018 at 4:09 AM, Dmitry Vyukov wrote:
> Hello,
>
> The following program creates a hang in unregister_netdevice.
> cleanup_net work hangs there forever periodically printing
> "unregister_netdevice: waiting for lo to become free. Usage count = 3"
> and creation of any new network namespaces hangs forever.

Interestingly, this is not reproducible on net-next.