Re: net: hang in unregister_netdevice: waiting for lo to become free

2018-05-11 Thread Dan Streetman
On Fri, May 11, 2018 at 5:19 AM, Dmitry Vyukov  wrote:
> On Thu, May 10, 2018 at 12:23 PM, Dan Streetman  wrote:
  wrote:
> On 20.02.2018 18:26, Neil Horman wrote:
>>
>> On Tue, Feb 20, 2018 at 09:14:41AM +0100, Dmitry Vyukov wrote:
>>>
>>> On Tue, Feb 20, 2018 at 8:56 AM, Tommi Rantala
>>>  wrote:

 On 19.02.2018 20:59, Dmitry Vyukov wrote:
>
> Is this meant to be fixed already? I am still seeing this on the
> latest upstream tree.
>

 These two commits are in v4.16-rc1:

 commit 4a31a6b19f9ddf498c81f5c9b089742b7472a6f8
 Author: Tommi Rantala 
 Date:   Mon Feb 5 21:48:14 2018 +0200

  sctp: fix dst refcnt leak in sctp_v4_get_dst
 ...
  Fixes: 410f03831 ("sctp: add routing output fallback")
  Fixes: 0ca50d12f ("sctp: fix src address selection if using secondary addresses")


 commit 957d761cf91cdbb175ad7d8f5472336a4d54dbf2
 Author: Alexey Kodanev 
 Date:   Mon Feb 5 15:10:35 2018 +0300

  sctp: fix dst refcnt leak in sctp_v6_get_dst()
 ...
  Fixes: dbc2b5e9a09e ("sctp: fix src address selection if using secondary addresses for ipv6")


 I guess we missed something if it's still reproducible.

 I can check it later this week, unless someone else beat me to it.
>>>
>>>
>>> Hi Tommi,
>>>
>>> Hmmm, I can't claim that it's exactly the same bug. Perhaps it's
>>> another one then. But I am still seeing these:
>>>
>>> [   58.799130] unregister_netdevice: waiting for lo to become free. Usage count = 4
>>> [   60.847138] unregister_netdevice: waiting for lo to become free. Usage count = 4
>>> [   62.895093] unregister_netdevice: waiting for lo to become free. Usage count = 4
>>> [   64.943103] unregister_netdevice: waiting for lo to become free. Usage count = 4
>>>
>>> on upstream tree pulled ~12 hours ago.
>>>
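A minimal sketch, not taken from the thread, of one way to spot this repeating message in a saved kernel log. The function name `find_stuck_devices`, the `min_repeats` threshold, and the sample text are assumptions made for illustration; only the message format comes from the log lines quoted above.

```python
import re

# Matches the kernel message quoted in this thread; the timestamp is optional.
PATTERN = re.compile(
    r"(?:\[\s*(?P<ts>\d+\.\d+)\]\s*)?"
    r"unregister_netdevice: waiting for (?P<dev>\S+) to become free\. "
    r"Usage count = (?P<count>-?\d+)"
)


def find_stuck_devices(log_text, min_repeats=3):
    """Return {device: usage_count} for every device whose message appears
    at least min_repeats times in a row with an unchanged usage count."""
    streak = {}  # device -> (last seen count, consecutive repeats)
    stuck = {}
    for match in PATTERN.finditer(log_text):
        dev = match.group("dev")
        count = int(match.group("count"))
        last, repeats = streak.get(dev, (None, 0))
        repeats = repeats + 1 if count == last else 1
        streak[dev] = (count, repeats)
        if repeats >= min_repeats:
            stuck[dev] = count
    return stuck


# Hypothetical input: the four log lines from this thread, one message per line.
SAMPLE = """\
[   58.799130] unregister_netdevice: waiting for lo to become free. Usage count = 4
[   60.847138] unregister_netdevice: waiting for lo to become free. Usage count = 4
[   62.895093] unregister_netdevice: waiting for lo to become free. Usage count = 4
[   64.943103] unregister_netdevice: waiting for lo to become free. Usage count = 4
"""

stuck = find_stuck_devices(SAMPLE)
```

A count that never changes across repeats only suggests a leaked reference; it does not say who holds it, which is what probing dev_hold and dev_put would be for.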
>> Can you write a systemtap script to probe dev_hold and dev_put, printing
>> out a backtrace if the device name matches "lo"? That should tell us
>> definitively whether the problem is in the same location or not.
>
>
> Hi Dmitry, I tested with the reproducer and the kernel .config file that you
> sent in the first email in this thread:
>
> With 4.16-rc2 unable to reproduce.
>
> With 4.15-rc9 bug reproducible, and I get "unregister_netdevice: waiting for
> lo to become free. Usage count = 3"
>
> With 4.15-rc9 and Alexey's "sctp: fix dst refcnt leak in sctp_v6_get_dst()"
> cherry-picked on top, unable to reproduce.
>
>
> Is syzkaller doing something else now to trigger the bug...?
> Can you still trigger the bug with the same reproducer?

 Hi Neil, Tommi,

 Reviving this old thread about "unregister_netdevice: waiting for lo
 to become free. Usage count = 3" hangs.
 I still have not had time to dive deep into what happens there (too
 many bugs coming from syzbot). But this still actively happens, and I
 suspect it accounts for a significant portion of various hang reports,
 which are quite unpleasant.

 One idea that could make it all simpler:

 Is this wait loop in netdev_wait_allrefs() supposed to wait for any
 prolonged periods of time under any non-buggy conditions? E.g. more
 than 1-2 minutes?
 If it is only supposed to wait briefly for things that are already supposed
 to be shutting down, and we add a WARNING there after some timeout,
 then syzbot will report all info how/when it happens, hopefully
 extracting reproducers, and all the nice things.
 But this WARNING should not have any false positives under any
 realistic conditions (e.g. waiting for arrival of remote packets with
 large timeouts).

 Looking at some task hung reports, it seems that this code holds some
 mutexes, takes a workqueue thread, and prevents any progress with
 destruction of other devices (and net namespace creation/destruction),
 so I guess it should not wait indefinitely?
>>>
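To make the WARNING idea concrete, here is a rough userspace model in Python rather than kernel code. It polls a reference count, warns once after a timeout, and keeps waiting. The names `wait_allrefs`, `warn_after`, `poll`, and `FakeClock` are hypothetical, and the 120-second threshold only mirrors the "1-2 minutes" figure mentioned above; the real netdev_wait_allrefs() loop is not reproduced here.

```python
import warnings


def wait_allrefs(get_refcnt, now, sleep, warn_after=120.0, poll=1.0):
    """Poll until get_refcnt() returns zero; return the seconds waited.

    Model only: a single warning is emitted once warn_after seconds have
    passed, and the loop then keeps waiting instead of giving up.
    """
    start = now()
    warned = False
    while True:
        refcnt = get_refcnt()
        if refcnt == 0:
            return now() - start
        waited = now() - start
        if not warned and waited >= warn_after:
            warnings.warn(f"refcount still {refcnt} after {waited:.0f}s")
            warned = True
        sleep(poll)


class FakeClock:
    """Deterministic clock so the model runs instantly (assumption for the demo)."""

    def __init__(self):
        self.t = 0.0

    def now(self):
        return self.t

    def sleep(self, seconds):
        self.t += seconds


# Hypothetical scenario: the last reference is dropped 300 simulated seconds in.
clock = FakeClock()
RELEASE_AT = 300.0


def refcnt():
    return 0 if clock.now() >= RELEASE_AT else 1


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    waited = wait_allrefs(refcnt, clock.now, clock.sleep)
```

Warning only once keeps the report to a single event per stuck wait, which is the shape a tool like syzbot could pick up; whether a fixed threshold avoids false positives is exactly the open question raised above.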
>>> I'm working on this currently:
>>> https://bugs.launchpad.net/ubuntu/zesty/+source/linux/+bug/1711407
>>>
>>> I added a summary of what I've found to 

Re: net: hang in unregister_netdevice: waiting for lo to become free

2018-05-11 Thread Dmitry Vyukov
On Thu, May 10, 2018 at 12:23 PM, Dan Streetman  wrote:
>>>  wrote:
 On 20.02.2018 18:26, Neil Horman wrote:
>
> On Tue, Feb 20, 2018 at 09:14:41AM +0100, Dmitry Vyukov wrote:
>>
>> On Tue, Feb 20, 2018 at 8:56 AM, Tommi Rantala
>>  wrote:
>>>
>>> On 19.02.2018 20:59, Dmitry Vyukov wrote:

 Is this meant to be fixed already? I am still seeing this on the
 latest upstream tree.

>>>
>>> These two commits are in v4.16-rc1:
>>>
>>> commit 4a31a6b19f9ddf498c81f5c9b089742b7472a6f8
>>> Author: Tommi Rantala 
>>> Date:   Mon Feb 5 21:48:14 2018 +0200
>>>
>>>  sctp: fix dst refcnt leak in sctp_v4_get_dst
>>> ...
>>>  Fixes: 410f03831 ("sctp: add routing output fallback")
>>>  Fixes: 0ca50d12f ("sctp: fix src address selection if using secondary addresses")
>>>
>>>
>>> commit 957d761cf91cdbb175ad7d8f5472336a4d54dbf2
>>> Author: Alexey Kodanev 
>>> Date:   Mon Feb 5 15:10:35 2018 +0300
>>>
>>>  sctp: fix dst refcnt leak in sctp_v6_get_dst()
>>> ...
>>>  Fixes: dbc2b5e9a09e ("sctp: fix src address selection if using secondary addresses for ipv6")
>>>
>>>
>>> I guess we missed something if it's still reproducible.
>>>
>>> I can check it later this week, unless someone else beat me to it.
>>
>>
>> Hi Tommi,
>>
>> Hmmm, I can't claim that it's exactly the same bug. Perhaps it's
>> another one then. But I am still seeing these:
>>
>> [   58.799130] unregister_netdevice: waiting for lo to become free. Usage count = 4
>> [   60.847138] unregister_netdevice: waiting for lo to become free. Usage count = 4
>> [   62.895093] unregister_netdevice: waiting for lo to become free. Usage count = 4
>> [   64.943103] unregister_netdevice: waiting for lo to become free. Usage count = 4
>>
>> on upstream tree pulled ~12 hours ago.
>>
> Can you write a systemtap script to probe dev_hold and dev_put, printing
> out a backtrace if the device name matches "lo"? That should tell us
> definitively whether the problem is in the same location or not.


 Hi Dmitry, I tested with the reproducer and the kernel .config file that you
 sent in the first email in this thread:

 With 4.16-rc2 unable to reproduce.

 With 4.15-rc9 bug reproducible, and I get "unregister_netdevice: waiting for
 lo to become free. Usage count = 3"

 With 4.15-rc9 and Alexey's "sctp: fix dst refcnt leak in sctp_v6_get_dst()"
 cherry-picked on top, unable to reproduce.


 Is syzkaller doing something else now to trigger the bug...?
 Can you still trigger the bug with the same reproducer?
>>>
>>> Hi Neil, Tommi,
>>>
>>> Reviving this old thread about "unregister_netdevice: waiting for lo
>>> to become free. Usage count = 3" hangs.
>>> I still have not had time to dive deep into what happens there (too
>>> many bugs coming from syzbot). But this still actively happens, and I
>>> suspect it accounts for a significant portion of various hang reports,
>>> which are quite unpleasant.
>>>
>>> One idea that could make it all simpler:
>>>
>>> Is this wait loop in netdev_wait_allrefs() supposed to wait for any
>>> prolonged periods of time under any non-buggy conditions? E.g. more
>>> than 1-2 minutes?
>>> If it is only supposed to wait briefly for things that are already supposed
>>> to be shutting down, and we add a WARNING there after some timeout,
>>> then syzbot will report all info how/when it happens, hopefully
>>> extracting reproducers, and all the nice things.
>>> But this WARNING should not have any false positives under any
>>> realistic conditions (e.g. waiting for arrival of remote packets with
>>> large timeouts).
>>>
>>> Looking at some task hung reports, it seems that this code holds some
>>> mutexes, takes a workqueue thread, and prevents any progress with
>>> destruction of other devices (and net namespace creation/destruction),
>>> so I guess it should not wait indefinitely?
>>
>> I'm working on this currently:
>> https://bugs.launchpad.net/ubuntu/zesty/+source/linux/+bug/1711407
>>
>> I added a summary of what I've found to be the cause (or at least, one
>> possible cause) of this:
>> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1711407/comments/72
>>
>> I'm working on a 

Re: net: hang in unregister_netdevice: waiting for lo to become free

2018-05-10 Thread Dan Streetman
On Thu, May 10, 2018 at 2:46 AM, Dmitry Vyukov  wrote:
> On Mon, Apr 16, 2018 at 9:42 PM, Dan Streetman  wrote:
>> On Wed, Feb 21, 2018 at 3:53 PM, Tommi Rantala
>>  wrote:
>>> On 20.02.2018 18:26, Neil Horman wrote:

 On Tue, Feb 20, 2018 at 09:14:41AM +0100, Dmitry Vyukov wrote:
>
> On Tue, Feb 20, 2018 at 8:56 AM, Tommi Rantala
>  wrote:
>>
>> On 19.02.2018 20:59, Dmitry Vyukov wrote:
>>>
>>> Is this meant to be fixed already? I am still seeing this on the
>>> latest upstream tree.
>>>
>>
>> These two commits are in v4.16-rc1:
>>
>> commit 4a31a6b19f9ddf498c81f5c9b089742b7472a6f8
>> Author: Tommi Rantala 
>> Date:   Mon Feb 5 21:48:14 2018 +0200
>>
>>  sctp: fix dst refcnt leak in sctp_v4_get_dst
>> ...
>>  Fixes: 410f03831 ("sctp: add routing output fallback")
>>  Fixes: 0ca50d12f ("sctp: fix src address selection if using secondary addresses")
>>
>>
>> commit 957d761cf91cdbb175ad7d8f5472336a4d54dbf2
>> Author: Alexey Kodanev 
>> Date:   Mon Feb 5 15:10:35 2018 +0300
>>
>>  sctp: fix dst refcnt leak in sctp_v6_get_dst()
>> ...
>>  Fixes: dbc2b5e9a09e ("sctp: fix src address selection if using secondary addresses for ipv6")
>>
>>
>> I guess we missed something if it's still reproducible.
>>
>> I can check it later this week, unless someone else beat me to it.
>
>
> Hi Tommi,
>
> Hmmm, I can't claim that it's exactly the same bug. Perhaps it's
> another one then. But I am still seeing these:
>
> [   58.799130] unregister_netdevice: waiting for lo to become free. Usage count = 4
> [   60.847138] unregister_netdevice: waiting for lo to become free. Usage count = 4
> [   62.895093] unregister_netdevice: waiting for lo to become free. Usage count = 4
> [   64.943103] unregister_netdevice: waiting for lo to become free. Usage count = 4
>
> on upstream tree pulled ~12 hours ago.
>
 Can you write a systemtap script to probe dev_hold and dev_put, printing
 out a backtrace if the device name matches "lo"? That should tell us
 definitively whether the problem is in the same location or not.
>>>
>>>
>>> Hi Dmitry, I tested with the reproducer and the kernel .config file that you
>>> sent in the first email in this thread:
>>>
>>> With 4.16-rc2 unable to reproduce.
>>>
>>> With 4.15-rc9 bug reproducible, and I get "unregister_netdevice: waiting for
>>> lo to become free. Usage count = 3"
>>>
>>> With 4.15-rc9 and Alexey's "sctp: fix dst refcnt leak in sctp_v6_get_dst()"
>>> cherry-picked on top, unable to reproduce.
>>>
>>>
>>> Is syzkaller doing something else now to trigger the bug...?
>>> Can you still trigger the bug with the same reproducer?
>>
>> Hi Neil, Tommi,
>>
>> Reviving this old thread about "unregister_netdevice: waiting for lo
>> to become free. Usage count = 3" hangs.
>> I still have not had time to dive deep into what happens there (too
>> many bugs coming from syzbot). But this still actively happens, and I
>> suspect it accounts for a significant portion of various hang reports,
>> which are quite unpleasant.
>>
>> One idea that could make it all simpler:
>>
>> Is this wait loop in netdev_wait_allrefs() supposed to wait for any
>> prolonged periods of time under any non-buggy conditions? E.g. more
>> than 1-2 minutes?
>> If it is only supposed to wait briefly for things that are already supposed
>> to be shutting down, and we add a WARNING there after some timeout,
>> then syzbot will report all info how/when it happens, hopefully
>> extracting reproducers, and all the nice things.
>> But this WARNING should not have any false positives under any
>> realistic conditions (e.g. waiting for arrival of remote packets with
>> large timeouts).
>>
>> Looking at some task hung reports, it seems that this code holds some
>> mutexes, takes a workqueue thread, and prevents any progress with
>> destruction of other devices (and net namespace creation/destruction),
>> so I guess it should not wait indefinitely?
>
> I'm working on this currently:
> https://bugs.launchpad.net/ubuntu/zesty/+source/linux/+bug/1711407
>
> I added a summary of what I've found to be the cause (or at least, one
> possible cause) of this:
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1711407/comments/72
>
> I'm working on a patch to 

Re: net: hang in unregister_netdevice: waiting for lo to become free

2018-05-10 Thread Dmitry Vyukov
On Mon, Apr 16, 2018 at 9:42 PM, Dan Streetman  wrote:
> On Wed, Feb 21, 2018 at 3:53 PM, Tommi Rantala
>  wrote:
>> On 20.02.2018 18:26, Neil Horman wrote:
>>>
>>> On Tue, Feb 20, 2018 at 09:14:41AM +0100, Dmitry Vyukov wrote:

 On Tue, Feb 20, 2018 at 8:56 AM, Tommi Rantala
  wrote:
>
> On 19.02.2018 20:59, Dmitry Vyukov wrote:
>>
>> Is this meant to be fixed already? I am still seeing this on the
>> latest upstream tree.
>>
>
> These two commits are in v4.16-rc1:
>
> commit 4a31a6b19f9ddf498c81f5c9b089742b7472a6f8
> Author: Tommi Rantala 
> Date:   Mon Feb 5 21:48:14 2018 +0200
>
>  sctp: fix dst refcnt leak in sctp_v4_get_dst
> ...
>  Fixes: 410f03831 ("sctp: add routing output fallback")
>  Fixes: 0ca50d12f ("sctp: fix src address selection if using secondary addresses")
>
>
> commit 957d761cf91cdbb175ad7d8f5472336a4d54dbf2
> Author: Alexey Kodanev 
> Date:   Mon Feb 5 15:10:35 2018 +0300
>
>  sctp: fix dst refcnt leak in sctp_v6_get_dst()
> ...
>  Fixes: dbc2b5e9a09e ("sctp: fix src address selection if using secondary addresses for ipv6")
>
>
> I guess we missed something if it's still reproducible.
>
> I can check it later this week, unless someone else beat me to it.


 Hi Tommi,

 Hmmm, I can't claim that it's exactly the same bug. Perhaps it's
 another one then. But I am still seeing these:

 [   58.799130] unregister_netdevice: waiting for lo to become free. Usage count = 4
 [   60.847138] unregister_netdevice: waiting for lo to become free. Usage count = 4
 [   62.895093] unregister_netdevice: waiting for lo to become free. Usage count = 4
 [   64.943103] unregister_netdevice: waiting for lo to become free. Usage count = 4

 on upstream tree pulled ~12 hours ago.

>>> Can you write a systemtap script to probe dev_hold and dev_put, printing
>>> out a backtrace if the device name matches "lo"? That should tell us
>>> definitively whether the problem is in the same location or not.
>>
>>
>> Hi Dmitry, I tested with the reproducer and the kernel .config file that you
>> sent in the first email in this thread:
>>
>> With 4.16-rc2 unable to reproduce.
>>
>> With 4.15-rc9 bug reproducible, and I get "unregister_netdevice: waiting for
>> lo to become free. Usage count = 3"
>>
>> With 4.15-rc9 and Alexey's "sctp: fix dst refcnt leak in sctp_v6_get_dst()"
>> cherry-picked on top, unable to reproduce.
>>
>>
>> Is syzkaller doing something else now to trigger the bug...?
>> Can you still trigger the bug with the same reproducer?
>
> Hi Neil, Tommi,
>
> Reviving this old thread about "unregister_netdevice: waiting for lo
> to become free. Usage count = 3" hangs.
> I still have not had time to dive deep into what happens there (too
> many bugs coming from syzbot). But this still actively happens, and I
> suspect it accounts for a significant portion of various hang reports,
> which are quite unpleasant.
>
> One idea that could make it all simpler:
>
> Is this wait loop in netdev_wait_allrefs() supposed to wait for any
> prolonged periods of time under any non-buggy conditions? E.g. more
> than 1-2 minutes?
> If it only supposed to wait briefly for things that already supposed
> to be shutting down, and we add a WARNING there after some timeout,
> then syzbot will report all info how/when it happens, hopefully
> extracting reproducers, and all the nice things.
> But this WARNING should not have any false positives under any
> realistic conditions (e.g. waiting for arrival of remote packets with
> large timeouts).
>
> Looking at some task hung reports, it seems that this code holds some
> mutexes, takes workqueue thread and prevents any progress with
> destruction of other devices (and net namespace creation/destruction),
> so I guess it should not wait for any indefinite periods of time?

 I'm working on this currently:
 https://bugs.launchpad.net/ubuntu/zesty/+source/linux/+bug/1711407

 I added a summary of what I've found to be the cause (or at least, one
 possible cause) of this:
 https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1711407/comments/72

 I'm working on a patch to work around the main side-effect of this,
 which is hanging while holding the global net mutex.  Hangs will still
 happen (e.g. if a dst leaks) but should not affect 

Re: net: hang in unregister_netdevice: waiting for lo to become free

2018-04-16 Thread Dan Streetman
On Mon, Apr 16, 2018 at 3:35 AM, Dmitry Vyukov  wrote:
> On Fri, Apr 13, 2018 at 5:54 PM, Dmitry Vyukov  wrote:
>> On Fri, Apr 13, 2018 at 2:43 PM, Dan Streetman  wrote:
>>> On Thu, Apr 12, 2018 at 8:15 AM, Dmitry Vyukov  wrote:
>>>> On Wed, Feb 21, 2018 at 3:53 PM, Tommi Rantala wrote:
>>>>> On 20.02.2018 18:26, Neil Horman wrote:
>>>>>>
>>>>>> On Tue, Feb 20, 2018 at 09:14:41AM +0100, Dmitry Vyukov wrote:
>>>>>>>
>>>>>>> On Tue, Feb 20, 2018 at 8:56 AM, Tommi Rantala wrote:
>>>>>>>>
>>>>>>>> On 19.02.2018 20:59, Dmitry Vyukov wrote:
>>>>>>>>>
>>>>>>>>> Is this meant to be fixed already? I am still seeing this on the
>>>>>>>>> latest upstream tree.
>>>>>>>>>
>>>>>>>>
>>>>>>>> These two commits are in v4.16-rc1:
>>>>>>>>
>>>>>>>> commit 4a31a6b19f9ddf498c81f5c9b089742b7472a6f8
>>>>>>>> Author: Tommi Rantala
>>>>>>>> Date:   Mon Feb 5 21:48:14 2018 +0200
>>>>>>>>
>>>>>>>>  sctp: fix dst refcnt leak in sctp_v4_get_dst
>>>>>>>> ...
>>>>>>>>  Fixes: 410f03831 ("sctp: add routing output fallback")
>>>>>>>>  Fixes: 0ca50d12f ("sctp: fix src address selection if using secondary addresses")
>>>>>>>>
>>>>>>>>
>>>>>>>> commit 957d761cf91cdbb175ad7d8f5472336a4d54dbf2
>>>>>>>> Author: Alexey Kodanev
>>>>>>>> Date:   Mon Feb 5 15:10:35 2018 +0300
>>>>>>>>
>>>>>>>>  sctp: fix dst refcnt leak in sctp_v6_get_dst()
>>>>>>>> ...
>>>>>>>>  Fixes: dbc2b5e9a09e ("sctp: fix src address selection if using secondary addresses for ipv6")
>>>>>>>>
>>>>>>>>
>>>>>>>> I guess we missed something if it's still reproducible.
>>>>>>>>
>>>>>>>> I can check it later this week, unless someone else beat me to it.
>>>>>>>
>>>>>>>
>>>>>>> Hi Tommi,
>>>>>>>
>>>>>>> Hmmm, I can't claim that it's exactly the same bug. Perhaps it's
>>>>>>> another one then. But I am still seeing these:
>>>>>>>
>>>>>>> [   58.799130] unregister_netdevice: waiting for lo to become free. Usage count = 4
>>>>>>> [   60.847138] unregister_netdevice: waiting for lo to become free. Usage count = 4
>>>>>>> [   62.895093] unregister_netdevice: waiting for lo to become free. Usage count = 4
>>>>>>> [   64.943103] unregister_netdevice: waiting for lo to become free. Usage count = 4
>>>>>>>
>>>>>>> on upstream tree pulled ~12 hours ago.
>>>>>>>
>>>>>> Can you write a systemtap script to probe dev_hold, and dev_put, printing out a
>>>>>> backtrace if the device name matches "lo".  That should tell us definitively if
>>>>>> the problem is in the same location or not
>>>>>
>>>>>
>>>>> Hi Dmitry, I tested with the reproducer and the kernel .config file that you
>>>>> sent in the first email in this thread:
>>>>>
>>>>> With 4.16-rc2 unable to reproduce.
>>>>>
>>>>> With 4.15-rc9 bug reproducible, and I get "unregister_netdevice: waiting for
>>>>> lo to become free. Usage count = 3"
>>>>>
>>>>> With 4.15-rc9 and Alexey's "sctp: fix dst refcnt leak in sctp_v6_get_dst()"
>>>>> cherry-picked on top, unable to reproduce.
>>>>>
>>>>>
>>>>> Is syzkaller doing something else now to trigger the bug...?
>>>>> Can you still trigger the bug with the same reproducer?
>>>>
>>>> Hi Neil, Tommi,
>>>>
>>>> Reviving this old thread about "unregister_netdevice: waiting for lo
>>>> to become free. Usage count = 3" hangs.
>>>> I still did not have time to deep dive into what happens there (too
>>>> many bugs coming from syzbot). But this still actively happens and I
>>>> suspect accounts to a significant portion of various hang reports,
>>>> which are quite unpleasant.
>>>>
>>>> One idea that could make it all simpler:
>>>>
>>>> Is this wait loop in netdev_wait_allrefs() supposed to wait for any
>>>> prolonged periods of time under any non-buggy conditions? E.g. more
>>>> than 1-2 minutes?
>>>> If it only supposed to wait briefly for things that already supposed
>>>> to be shutting down, and we add a WARNING there after some timeout,
>>>> then syzbot will report all info how/when it happens, hopefully
>>>> extracting reproducers, and all the nice things.
>>>> But this WARNING should not have any false positives under any
>>>> realistic conditions (e.g. waiting for arrival of remote packets with
>>>> large timeouts).
>>>>
>>>> Looking at some task hung reports, it seems that this code holds some
>>>> mutexes, takes workqueue thread and prevents any progress with
>>>> destruction of other devices (and net namespace creation/destruction),
>>>> so I guess it should not wait for any indefinite periods of time?
>>>
>>> I'm working on this currently:
>>> https://bugs.launchpad.net/ubuntu/zesty/+source/linux/+bug/1711407
>>>
>>> I added a summary of what I've found to be the cause (or at least, one
>>> possible cause) of this:
>>> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1711407/comments/72
>>>
>>> I'm working on a patch to work around the main side-effect of this,
>>> which is hanging while holding the global net mutex.  Hangs will still
>>> happen (e.g. if 

Re: net: hang in unregister_netdevice: waiting for lo to become free

2018-04-16 Thread Dmitry Vyukov
On Fri, Apr 13, 2018 at 5:54 PM, Dmitry Vyukov  wrote:
> On Fri, Apr 13, 2018 at 2:43 PM, Dan Streetman  wrote:
>> On Thu, Apr 12, 2018 at 8:15 AM, Dmitry Vyukov  wrote:
>>> On Wed, Feb 21, 2018 at 3:53 PM, Tommi Rantala
>>>  wrote:
>>>> On 20.02.2018 18:26, Neil Horman wrote:
>>>>>
>>>>> On Tue, Feb 20, 2018 at 09:14:41AM +0100, Dmitry Vyukov wrote:
>>>>>>
>>>>>> On Tue, Feb 20, 2018 at 8:56 AM, Tommi Rantala wrote:
>>>>>>>
>>>>>>> On 19.02.2018 20:59, Dmitry Vyukov wrote:
>>>>>>>>
>>>>>>>> Is this meant to be fixed already? I am still seeing this on the
>>>>>>>> latest upstream tree.
>>>>>>>>
>>>>>>>
>>>>>>> These two commits are in v4.16-rc1:
>>>>>>>
>>>>>>> commit 4a31a6b19f9ddf498c81f5c9b089742b7472a6f8
>>>>>>> Author: Tommi Rantala
>>>>>>> Date:   Mon Feb 5 21:48:14 2018 +0200
>>>>>>>
>>>>>>>  sctp: fix dst refcnt leak in sctp_v4_get_dst
>>>>>>> ...
>>>>>>>  Fixes: 410f03831 ("sctp: add routing output fallback")
>>>>>>>  Fixes: 0ca50d12f ("sctp: fix src address selection if using secondary addresses")
>>>>>>>
>>>>>>>
>>>>>>> commit 957d761cf91cdbb175ad7d8f5472336a4d54dbf2
>>>>>>> Author: Alexey Kodanev
>>>>>>> Date:   Mon Feb 5 15:10:35 2018 +0300
>>>>>>>
>>>>>>>  sctp: fix dst refcnt leak in sctp_v6_get_dst()
>>>>>>> ...
>>>>>>>  Fixes: dbc2b5e9a09e ("sctp: fix src address selection if using secondary addresses for ipv6")
>>>>>>>
>>>>>>>
>>>>>>> I guess we missed something if it's still reproducible.
>>>>>>>
>>>>>>> I can check it later this week, unless someone else beat me to it.
>>>>>>
>>>>>>
>>>>>> Hi Tommi,
>>>>>>
>>>>>> Hmmm, I can't claim that it's exactly the same bug. Perhaps it's
>>>>>> another one then. But I am still seeing these:
>>>>>>
>>>>>> [   58.799130] unregister_netdevice: waiting for lo to become free. Usage count = 4
>>>>>> [   60.847138] unregister_netdevice: waiting for lo to become free. Usage count = 4
>>>>>> [   62.895093] unregister_netdevice: waiting for lo to become free. Usage count = 4
>>>>>> [   64.943103] unregister_netdevice: waiting for lo to become free. Usage count = 4
>>>>>>
>>>>>> on upstream tree pulled ~12 hours ago.
>>>>>>
>>>>> Can you write a systemtap script to probe dev_hold, and dev_put, printing out a
>>>>> backtrace if the device name matches "lo".  That should tell us definitively if
>>>>> the problem is in the same location or not
>>>>
>>>>
>>>> Hi Dmitry, I tested with the reproducer and the kernel .config file that you
>>>> sent in the first email in this thread:
>>>>
>>>> With 4.16-rc2 unable to reproduce.
>>>>
>>>> With 4.15-rc9 bug reproducible, and I get "unregister_netdevice: waiting for
>>>> lo to become free. Usage count = 3"
>>>>
>>>> With 4.15-rc9 and Alexey's "sctp: fix dst refcnt leak in sctp_v6_get_dst()"
>>>> cherry-picked on top, unable to reproduce.
>>>>
>>>>
>>>> Is syzkaller doing something else now to trigger the bug...?
>>>> Can you still trigger the bug with the same reproducer?
>>>
>>> Hi Neil, Tommi,
>>>
>>> Reviving this old thread about "unregister_netdevice: waiting for lo
>>> to become free. Usage count = 3" hangs.
>>> I still did not have time to deep dive into what happens there (too
>>> many bugs coming from syzbot). But this still actively happens and I
>>> suspect accounts to a significant portion of various hang reports,
>>> which are quite unpleasant.
>>>
>>> One idea that could make it all simpler:
>>>
>>> Is this wait loop in netdev_wait_allrefs() supposed to wait for any
>>> prolonged periods of time under any non-buggy conditions? E.g. more
>>> than 1-2 minutes?
>>> If it only supposed to wait briefly for things that already supposed
>>> to be shutting down, and we add a WARNING there after some timeout,
>>> then syzbot will report all info how/when it happens, hopefully
>>> extracting reproducers, and all the nice things.
>>> But this WARNING should not have any false positives under any
>>> realistic conditions (e.g. waiting for arrival of remote packets with
>>> large timeouts).
>>>
>>> Looking at some task hung reports, it seems that this code holds some
>>> mutexes, takes workqueue thread and prevents any progress with
>>> destruction of other devices (and net namespace creation/destruction),
>>> so I guess it should not wait for any indefinite periods of time?
>>
>> I'm working on this currently:
>> https://bugs.launchpad.net/ubuntu/zesty/+source/linux/+bug/1711407
>>
>> I added a summary of what I've found to be the cause (or at least, one
>> possible cause) of this:
>> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1711407/comments/72
>>
>> I'm working on a patch to work around the main side-effect of this,
>> which is hanging while holding the global net mutex.  Hangs will still
>> happen (e.g. if a dst leaks) but should not affect anything else,
>> other than a leak of the dst and its net namespace.
>>
>> Fixing the dst leaks is important too, of course, but a dst leak (or
>> other 

Re: net: hang in unregister_netdevice: waiting for lo to become free

2018-04-13 Thread Dmitry Vyukov
On Fri, Apr 13, 2018 at 2:43 PM, Dan Streetman  wrote:
> On Thu, Apr 12, 2018 at 8:15 AM, Dmitry Vyukov  wrote:
>> On Wed, Feb 21, 2018 at 3:53 PM, Tommi Rantala
>>  wrote:
>>> On 20.02.2018 18:26, Neil Horman wrote:

>>>> On Tue, Feb 20, 2018 at 09:14:41AM +0100, Dmitry Vyukov wrote:
>>>>>
>>>>> On Tue, Feb 20, 2018 at 8:56 AM, Tommi Rantala wrote:
>>>>>>
>>>>>> On 19.02.2018 20:59, Dmitry Vyukov wrote:
>>>>>>>
>>>>>>> Is this meant to be fixed already? I am still seeing this on the
>>>>>>> latest upstream tree.
>>>>>>>
>>>>>>
>>>>>> These two commits are in v4.16-rc1:
>>>>>>
>>>>>> commit 4a31a6b19f9ddf498c81f5c9b089742b7472a6f8
>>>>>> Author: Tommi Rantala
>>>>>> Date:   Mon Feb 5 21:48:14 2018 +0200
>>>>>>
>>>>>>  sctp: fix dst refcnt leak in sctp_v4_get_dst
>>>>>> ...
>>>>>>  Fixes: 410f03831 ("sctp: add routing output fallback")
>>>>>>  Fixes: 0ca50d12f ("sctp: fix src address selection if using secondary addresses")
>>>>>>
>>>>>>
>>>>>> commit 957d761cf91cdbb175ad7d8f5472336a4d54dbf2
>>>>>> Author: Alexey Kodanev
>>>>>> Date:   Mon Feb 5 15:10:35 2018 +0300
>>>>>>
>>>>>>  sctp: fix dst refcnt leak in sctp_v6_get_dst()
>>>>>> ...
>>>>>>  Fixes: dbc2b5e9a09e ("sctp: fix src address selection if using secondary addresses for ipv6")
>>>>>>
>>>>>>
>>>>>> I guess we missed something if it's still reproducible.
>>>>>>
>>>>>> I can check it later this week, unless someone else beat me to it.
>>>>>
>>>>>
>>>>> Hi Tommi,
>>>>>
>>>>> Hmmm, I can't claim that it's exactly the same bug. Perhaps it's
>>>>> another one then. But I am still seeing these:
>>>>>
>>>>> [   58.799130] unregister_netdevice: waiting for lo to become free. Usage count = 4
>>>>> [   60.847138] unregister_netdevice: waiting for lo to become free. Usage count = 4
>>>>> [   62.895093] unregister_netdevice: waiting for lo to become free. Usage count = 4
>>>>> [   64.943103] unregister_netdevice: waiting for lo to become free. Usage count = 4
>>>>>
>>>>> on upstream tree pulled ~12 hours ago.
>>>>>
>>>> Can you write a systemtap script to probe dev_hold, and dev_put, printing out a
>>>> backtrace if the device name matches "lo".  That should tell us definitively if
>>>> the problem is in the same location or not
>>>
>>>
>>> Hi Dmitry, I tested with the reproducer and the kernel .config file that you
>>> sent in the first email in this thread:
>>>
>>> With 4.16-rc2 unable to reproduce.
>>>
>>> With 4.15-rc9 bug reproducible, and I get "unregister_netdevice: waiting for
>>> lo to become free. Usage count = 3"
>>>
>>> With 4.15-rc9 and Alexey's "sctp: fix dst refcnt leak in sctp_v6_get_dst()"
>>> cherry-picked on top, unable to reproduce.
>>>
>>>
>>> Is syzkaller doing something else now to trigger the bug...?
>>> Can you still trigger the bug with the same reproducer?
>>
>> Hi Neil, Tommi,
>>
>> Reviving this old thread about "unregister_netdevice: waiting for lo
>> to become free. Usage count = 3" hangs.
>> I still did not have time to deep dive into what happens there (too
>> many bugs coming from syzbot). But this still actively happens and I
>> suspect accounts to a significant portion of various hang reports,
>> which are quite unpleasant.
>>
>> One idea that could make it all simpler:
>>
>> Is this wait loop in netdev_wait_allrefs() supposed to wait for any
>> prolonged periods of time under any non-buggy conditions? E.g. more
>> than 1-2 minutes?
>> If it only supposed to wait briefly for things that already supposed
>> to be shutting down, and we add a WARNING there after some timeout,
>> then syzbot will report all info how/when it happens, hopefully
>> extracting reproducers, and all the nice things.
>> But this WARNING should not have any false positives under any
>> realistic conditions (e.g. waiting for arrival of remote packets with
>> large timeouts).
>>
>> Looking at some task hung reports, it seems that this code holds some
>> mutexes, takes workqueue thread and prevents any progress with
>> destruction of other devices (and net namespace creation/destruction),
>> so I guess it should not wait for any indefinite periods of time?
>
> I'm working on this currently:
> https://bugs.launchpad.net/ubuntu/zesty/+source/linux/+bug/1711407
>
> I added a summary of what I've found to be the cause (or at least, one
> possible cause) of this:
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1711407/comments/72
>
> I'm working on a patch to work around the main side-effect of this,
> which is hanging while holding the global net mutex.  Hangs will still
> happen (e.g. if a dst leaks) but should not affect anything else,
> other than a leak of the dst and its net namespace.
>
> Fixing the dst leaks is important too, of course, but a dst leak (or
> other cause) shouldn't break the entire system.

Leaking some memory is definitely better than hanging the system.

So I've made syzkaller to recognize "unregister_netdevice: waiting for
(.*) to 
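
A minimal sketch of this kind of log matching, in Python. The pattern below is an assumption modeled on the dmesg lines quoted earlier in the thread; it is not the expression syzkaller actually uses, and the helper name `find_netdev_waits` is hypothetical.

```python
import re

# Hypothetical pattern based on the dmesg lines quoted in this thread;
# not the actual expression used by syzkaller.
WAIT_RE = re.compile(
    r"unregister_netdevice: waiting for (?P<dev>\S+) to become free\. "
    r"Usage count = (?P<count>\d+)"
)


def find_netdev_waits(log_text):
    """Return a (device, usage_count) tuple for every matching log line."""
    return [
        (m.group("dev"), int(m.group("count")))
        for m in WAIT_RE.finditer(log_text)
    ]


sample = (
    "[   58.799130] unregister_netdevice: waiting for lo to become free. "
    "Usage count = 4\n"
    "[   60.847138] unregister_netdevice: waiting for lo to become free. "
    "Usage count = 4\n"
)
print(find_netdev_waits(sample))
```

Capturing the device name and the usage count separately would let a report be keyed on the device rather than on the whole message text.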

Re: net: hang in unregister_netdevice: waiting for lo to become free

2018-04-13 Thread Dan Streetman
On Thu, Apr 12, 2018 at 8:15 AM, Dmitry Vyukov  wrote:
> On Wed, Feb 21, 2018 at 3:53 PM, Tommi Rantala
>  wrote:
>> On 20.02.2018 18:26, Neil Horman wrote:
>>>
>>> On Tue, Feb 20, 2018 at 09:14:41AM +0100, Dmitry Vyukov wrote:

>>>> On Tue, Feb 20, 2018 at 8:56 AM, Tommi Rantala wrote:
>>>>>
>>>>> On 19.02.2018 20:59, Dmitry Vyukov wrote:
>>>>>>
>>>>>> Is this meant to be fixed already? I am still seeing this on the
>>>>>> latest upstream tree.
>>>>>>
>>>>>
>>>>> These two commits are in v4.16-rc1:
>>>>>
>>>>> commit 4a31a6b19f9ddf498c81f5c9b089742b7472a6f8
>>>>> Author: Tommi Rantala
>>>>> Date:   Mon Feb 5 21:48:14 2018 +0200
>>>>>
>>>>>  sctp: fix dst refcnt leak in sctp_v4_get_dst
>>>>> ...
>>>>>  Fixes: 410f03831 ("sctp: add routing output fallback")
>>>>>  Fixes: 0ca50d12f ("sctp: fix src address selection if using secondary addresses")
>>>>>
>>>>>
>>>>> commit 957d761cf91cdbb175ad7d8f5472336a4d54dbf2
>>>>> Author: Alexey Kodanev
>>>>> Date:   Mon Feb 5 15:10:35 2018 +0300
>>>>>
>>>>>  sctp: fix dst refcnt leak in sctp_v6_get_dst()
>>>>> ...
>>>>>  Fixes: dbc2b5e9a09e ("sctp: fix src address selection if using secondary addresses for ipv6")
>>>>>
>>>>>
>>>>> I guess we missed something if it's still reproducible.
>>>>>
>>>>> I can check it later this week, unless someone else beat me to it.
>>>>
>>>>
>>>> Hi Tommi,
>>>>
>>>> Hmmm, I can't claim that it's exactly the same bug. Perhaps it's
>>>> another one then. But I am still seeing these:
>>>>
>>>> [   58.799130] unregister_netdevice: waiting for lo to become free. Usage count = 4
>>>> [   60.847138] unregister_netdevice: waiting for lo to become free. Usage count = 4
>>>> [   62.895093] unregister_netdevice: waiting for lo to become free. Usage count = 4
>>>> [   64.943103] unregister_netdevice: waiting for lo to become free. Usage count = 4
>>>>
>>>> on upstream tree pulled ~12 hours ago.
>>>>
>>> Can you write a systemtap script to probe dev_hold, and dev_put, printing
>>> out a
>>> backtrace if the device name matches "lo".  That should tell us
>>> definitively if
>>> the problem is in the same location or not
>>
>>
>> Hi Dmitry, I tested with the reproducer and the kernel .config file that you
>> sent in the first email in this thread:
>>
>> With 4.16-rc2 unable to reproduce.
>>
>> With 4.15-rc9 bug reproducible, and I get "unregister_netdevice: waiting for
>> lo to become free. Usage count = 3"
>>
>> With 4.15-rc9 and Alexey's "sctp: fix dst refcnt leak in sctp_v6_get_dst()"
>> cherry-picked on top, unable to reproduce.
>>
>>
>> Is syzkaller doing something else now to trigger the bug...?
>> Can you still trigger the bug with the same reproducer?
>
> Hi Neil, Tommi,
>
> Reviving this old thread about "unregister_netdevice: waiting for lo
> to become free. Usage count = 3" hangs.
> I still did not have time to deep dive into what happens there (too
> many bugs coming from syzbot). But this still actively happens and I
> suspect accounts to a significant portion of various hang reports,
> which are quite unpleasant.
>
> One idea that could make it all simpler:
>
> Is this wait loop in netdev_wait_allrefs() supposed to wait for any
> prolonged periods of time under any non-buggy conditions? E.g. more
> than 1-2 minutes?
> If it only supposed to wait briefly for things that already supposed
> to be shutting down, and we add a WARNING there after some timeout,
> then syzbot will report all info how/when it happens, hopefully
> extracting reproducers, and all the nice things.
> But this WARNING should not have any false positives under any
> realistic conditions (e.g. waiting for arrival of remote packets with
> large timeouts).
>
> Looking at some task hung reports, it seems that this code holds some
> mutexes, takes workqueue thread and prevents any progress with
> destruction of other devices (and net namespace creation/destruction),
> so I guess it should not wait for any indefinite periods of time?

I'm working on this currently:
https://bugs.launchpad.net/ubuntu/zesty/+source/linux/+bug/1711407

I added a summary of what I've found to be the cause (or at least, one
possible cause) of this:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1711407/comments/72

I'm working on a patch to work around the main side-effect of this,
which is hanging while holding the global net mutex.  Hangs will still
happen (e.g. if a dst leaks) but should not affect anything else,
other than a leak of the dst and its net namespace.

Fixing the dst leaks is important too, of course, but a dst leak (or
other cause) shouldn't break the entire system.


Re: net: hang in unregister_netdevice: waiting for lo to become free

2018-04-13 Thread Neil Horman
On Thu, Apr 12, 2018 at 02:15:30PM +0200, Dmitry Vyukov wrote:
> On Wed, Feb 21, 2018 at 3:53 PM, Tommi Rantala
>  wrote:
> > On 20.02.2018 18:26, Neil Horman wrote:
> >>
> >> On Tue, Feb 20, 2018 at 09:14:41AM +0100, Dmitry Vyukov wrote:
> >>>
> >>> On Tue, Feb 20, 2018 at 8:56 AM, Tommi Rantala
> >>>  wrote:
> 
> >>>> On 19.02.2018 20:59, Dmitry Vyukov wrote:
> >>>>>
> >>>>> Is this meant to be fixed already? I am still seeing this on the
> >>>>> latest upstream tree.
> >>>>>
> >>>>
> >>>> These two commits are in v4.16-rc1:
> >>>>
> >>>> commit 4a31a6b19f9ddf498c81f5c9b089742b7472a6f8
> >>>> Author: Tommi Rantala
> >>>> Date:   Mon Feb 5 21:48:14 2018 +0200
> >>>>
> >>>>  sctp: fix dst refcnt leak in sctp_v4_get_dst
> >>>> ...
> >>>>  Fixes: 410f03831 ("sctp: add routing output fallback")
> >>>>  Fixes: 0ca50d12f ("sctp: fix src address selection if using
> >>>> secondary
> >>>> addresses")
> >>>>
> >>>>
> >>>> commit 957d761cf91cdbb175ad7d8f5472336a4d54dbf2
> >>>> Author: Alexey Kodanev
> >>>> Date:   Mon Feb 5 15:10:35 2018 +0300
> >>>>
> >>>>  sctp: fix dst refcnt leak in sctp_v6_get_dst()
> >>>> ...
> >>>>  Fixes: dbc2b5e9a09e ("sctp: fix src address selection if using
> >>>> secondary
> >>>> addresses for ipv6")
> >>>>
> >>>>
> >>>> I guess we missed something if it's still reproducible.
> >>>>
> >>>> I can check it later this week, unless someone else beat me to it.
> >>>
> >>>
> >>> Hi Tommi,
> >>>
> >>> Hmmm, I can't claim that it's exactly the same bug. Perhaps it's
> >>> another one then. But I am still seeing these:
> >>>
> >>> [   58.799130] unregister_netdevice: waiting for lo to become free.
> >>> Usage count = 4
> >>> [   60.847138] unregister_netdevice: waiting for lo to become free.
> >>> Usage count = 4
> >>> [   62.895093] unregister_netdevice: waiting for lo to become free.
> >>> Usage count = 4
> >>> [   64.943103] unregister_netdevice: waiting for lo to become free.
> >>> Usage count = 4
> >>>
> >>> on upstream tree pulled ~12 hours ago.
> >>>
> >> Can you write a systemtap script to probe dev_hold, and dev_put, printing
> >> out a
> >> backtrace if the device name matches "lo".  That should tell us
> >> definitively if
> >> the problem is in the same location or not
> >
> >
> > Hi Dmitry, I tested with the reproducer and the kernel .config file that you
> > sent in the first email in this thread:
> >
> > With 4.16-rc2 unable to reproduce.
> >
> > With 4.15-rc9 bug reproducible, and I get "unregister_netdevice: waiting for
> > lo to become free. Usage count = 3"
> >
> > With 4.15-rc9 and Alexey's "sctp: fix dst refcnt leak in sctp_v6_get_dst()"
> > cherry-picked on top, unable to reproduce.
> >
> >
> > Is syzkaller doing something else now to trigger the bug...?
> > Can you still trigger the bug with the same reproducer?
> 
> Hi Neil, Tommi,
> 
> Reviving this old thread about "unregister_netdevice: waiting for lo
> to become free. Usage count = 3" hangs.
> I still did not have time to deep dive into what happens there (too
> many bugs coming from syzbot). But this still actively happens and I
> suspect accounts to a significant portion of various hang reports,
> which are quite unpleasant.
> 
> One idea that could make it all simpler:
> 
> Is this wait loop in netdev_wait_allrefs() supposed to wait for any
> prolonged periods of time under any non-buggy conditions? E.g. more
> than 1-2 minutes?
As the name implies, it's supposed to wait indefinitely for the reference count
to reach zero, but yes, under normal operation it's intended not to have to wait
very long at all.  The issuance of the NETDEV_UNREGISTER_FINAL notification is
meant to be a subscribable signal to any code path holding a reference that it
needs to be dropped so that progress can be made.

Note that the "waiting for %s to become free" message is triggered after 10
seconds of waiting, and is likely the trigger you want. It's just an emergency
level log message rather than a WARN.  I don't think we want to change that
permanently, but you could certainly alter it in the code to cause syzbot to
catch it (e.g. WARN_ON(time_after(jiffies, warning_time + 10 * HZ)) )


> If it only supposed to wait briefly for things that already supposed
> to be shutting down, and we add a WARNING there after some timeout,
> then syzbot will report all info how/when it happens, hopefully
> extracting reproducers, and all the nice things.
> But this WARNING should not have any false positives under any
> realistic conditions (e.g. waiting for arrival of remote packets with
> large timeouts).
> 
> Looking at some task hung reports, it seems that this code holds some
> mutexes, takes workqueue thread and prevents any progress with
> destruction of other devices (and net namespace creation/destruction),
> so I guess it should not wait for any indefinite periods of time?
Well, it drops everything and sleeps periodically, so it's safe in and of itself.
The problem is it's waiting for the reference count of a device to drop 

Re: net: hang in unregister_netdevice: waiting for lo to become free

2018-04-12 Thread Dmitry Vyukov
On Wed, Feb 21, 2018 at 3:53 PM, Tommi Rantala
 wrote:
> On 20.02.2018 18:26, Neil Horman wrote:
>>
>> On Tue, Feb 20, 2018 at 09:14:41AM +0100, Dmitry Vyukov wrote:
>>>
>>> On Tue, Feb 20, 2018 at 8:56 AM, Tommi Rantala
>>>  wrote:

>>>> On 19.02.2018 20:59, Dmitry Vyukov wrote:
>>>>>
>>>>> Is this meant to be fixed already? I am still seeing this on the
>>>>> latest upstream tree.
>>>>>
>>>>
>>>> These two commits are in v4.16-rc1:
>>>>
>>>> commit 4a31a6b19f9ddf498c81f5c9b089742b7472a6f8
>>>> Author: Tommi Rantala
>>>> Date:   Mon Feb 5 21:48:14 2018 +0200
>>>>
>>>>  sctp: fix dst refcnt leak in sctp_v4_get_dst
>>>> ...
>>>>  Fixes: 410f03831 ("sctp: add routing output fallback")
>>>>  Fixes: 0ca50d12f ("sctp: fix src address selection if using
>>>> secondary
>>>> addresses")
>>>>
>>>>
>>>> commit 957d761cf91cdbb175ad7d8f5472336a4d54dbf2
>>>> Author: Alexey Kodanev
>>>> Date:   Mon Feb 5 15:10:35 2018 +0300
>>>>
>>>>  sctp: fix dst refcnt leak in sctp_v6_get_dst()
>>>> ...
>>>>  Fixes: dbc2b5e9a09e ("sctp: fix src address selection if using
>>>> secondary
>>>> addresses for ipv6")
>>>>
>>>>
>>>> I guess we missed something if it's still reproducible.
>>>>
>>>> I can check it later this week, unless someone else beat me to it.
>>>
>>>
>>> Hi Tommi,
>>>
>>> Hmmm, I can't claim that it's exactly the same bug. Perhaps it's
>>> another one then. But I am still seeing these:
>>>
>>> [   58.799130] unregister_netdevice: waiting for lo to become free.
>>> Usage count = 4
>>> [   60.847138] unregister_netdevice: waiting for lo to become free.
>>> Usage count = 4
>>> [   62.895093] unregister_netdevice: waiting for lo to become free.
>>> Usage count = 4
>>> [   64.943103] unregister_netdevice: waiting for lo to become free.
>>> Usage count = 4
>>>
>>> on upstream tree pulled ~12 hours ago.
>>>
>> Can you write a systemtap script to probe dev_hold, and dev_put, printing
>> out a
>> backtrace if the device name matches "lo".  That should tell us
>> definitively if
>> the problem is in the same location or not
>
>
> Hi Dmitry, I tested with the reproducer and the kernel .config file that you
> sent in the first email in this thread:
>
> With 4.16-rc2 unable to reproduce.
>
> With 4.15-rc9 bug reproducible, and I get "unregister_netdevice: waiting for
> lo to become free. Usage count = 3"
>
> With 4.15-rc9 and Alexey's "sctp: fix dst refcnt leak in sctp_v6_get_dst()"
> cherry-picked on top, unable to reproduce.
>
>
> Is syzkaller doing something else now to trigger the bug...?
> Can you still trigger the bug with the same reproducer?

Hi Neil, Tommi,

Reviving this old thread about "unregister_netdevice: waiting for lo
to become free. Usage count = 3" hangs.
I still have not had time to dive deep into what happens there (too
many bugs coming from syzbot). But this still actively happens and I
suspect it accounts for a significant portion of various hang reports,
which are quite unpleasant.

One idea that could make it all simpler:

Is this wait loop in netdev_wait_allrefs() supposed to wait for any
prolonged periods of time under any non-buggy conditions? E.g. more
than 1-2 minutes?
If it is only supposed to wait briefly for things that are already supposed
to be shutting down, and we add a WARNING there after some timeout,
then syzbot will report all the info on how/when it happens, hopefully
extracting reproducers, and all the nice things.
But this WARNING should not have any false positives under any
realistic conditions (e.g. waiting for arrival of remote packets with
large timeouts).

Looking at some task hung reports, it seems that this code holds some
mutexes, takes a workqueue thread and prevents any progress with
destruction of other devices (and net namespace creation/destruction),
so I guess it should not wait for any indefinite period of time?


Re: net: hang in unregister_netdevice: waiting for lo to become free

2018-02-21 Thread Tommi Rantala

On 20.02.2018 18:26, Neil Horman wrote:

> On Tue, Feb 20, 2018 at 09:14:41AM +0100, Dmitry Vyukov wrote:
>>
>> On Tue, Feb 20, 2018 at 8:56 AM, Tommi Rantala
>>  wrote:
>>>
>>> On 19.02.2018 20:59, Dmitry Vyukov wrote:
>>>>
>>>> Is this meant to be fixed already? I am still seeing this on the
>>>> latest upstream tree.
>>>>
>>>
>>>
>>> These two commits are in v4.16-rc1:
>>>
>>> commit 4a31a6b19f9ddf498c81f5c9b089742b7472a6f8
>>> Author: Tommi Rantala
>>> Date:   Mon Feb 5 21:48:14 2018 +0200
>>>
>>>  sctp: fix dst refcnt leak in sctp_v4_get_dst
>>> ...
>>>  Fixes: 410f03831 ("sctp: add routing output fallback")
>>>  Fixes: 0ca50d12f ("sctp: fix src address selection if using secondary
>>> addresses")
>>>
>>>
>>> commit 957d761cf91cdbb175ad7d8f5472336a4d54dbf2
>>> Author: Alexey Kodanev
>>> Date:   Mon Feb 5 15:10:35 2018 +0300
>>>
>>>  sctp: fix dst refcnt leak in sctp_v6_get_dst()
>>> ...
>>>  Fixes: dbc2b5e9a09e ("sctp: fix src address selection if using secondary
>>> addresses for ipv6")
>>>
>>>
>>> I guess we missed something if it's still reproducible.
>>>
>>> I can check it later this week, unless someone else beat me to it.
>>
>>
>> Hi Tommi,
>>
>> Hmmm, I can't claim that it's exactly the same bug. Perhaps it's
>> another one then. But I am still seeing these:
>>
>> [   58.799130] unregister_netdevice: waiting for lo to become free.
>> Usage count = 4
>> [   60.847138] unregister_netdevice: waiting for lo to become free.
>> Usage count = 4
>> [   62.895093] unregister_netdevice: waiting for lo to become free.
>> Usage count = 4
>> [   64.943103] unregister_netdevice: waiting for lo to become free.
>> Usage count = 4
>>
>> on upstream tree pulled ~12 hours ago.
>>
>
> Can you write a systemtap script to probe dev_hold, and dev_put, printing out a
> backtrace if the device name matches "lo".  That should tell us definitively if
> the problem is in the same location or not


Hi Dmitry, I tested with the reproducer and the kernel .config file that 
you sent in the first email in this thread:


With 4.16-rc2 unable to reproduce.

With 4.15-rc9 bug reproducible, and I get "unregister_netdevice: waiting 
for lo to become free. Usage count = 3"


With 4.15-rc9 and Alexey's "sctp: fix dst refcnt leak in 
sctp_v6_get_dst()" cherry-picked on top, unable to reproduce.



Is syzkaller doing something else now to trigger the bug...?
Can you still trigger the bug with the same reproducer?


Tommi


Re: net: hang in unregister_netdevice: waiting for lo to become free

2018-02-20 Thread Neil Horman
On Tue, Feb 20, 2018 at 09:14:41AM +0100, Dmitry Vyukov wrote:
> On Tue, Feb 20, 2018 at 8:56 AM, Tommi Rantala
>  wrote:
> > On 19.02.2018 20:59, Dmitry Vyukov wrote:
> >>
> >> On Sat, Feb 3, 2018 at 1:15 PM, Xin Long  wrote:
> >
> > On 1/30/18 1:57 PM, David Ahern wrote:
> >>
> >> On 1/30/18 1:08 PM, Daniel Borkmann wrote:
> >>>
> >>> On 01/30/2018 07:32 PM, Cong Wang wrote:
> 
>  On Tue, Jan 30, 2018 at 4:09 AM, Dmitry Vyukov 
>  wrote:
> >
> > Hello,
> >
> > The following program creates a hang in unregister_netdevice.
> > cleanup_net work hangs there forever periodically printing
> > "unregister_netdevice: waiting for lo to become free. Usage count =
> > 3"
> > and creation of any new network namespaces hangs forever.
> 
> 
>  Interestingly, this is not reproducible on net-next.
> >>>
> >>>
> >>> The most recent change on netns refcnt was 4ee806d51176 ("net: tcp:
> >>> close
> >>> sock if net namespace is exiting") in net/net-next from 5 days ago,
> >>> maybe
> >>> fixed due to that?
> >>>
> >>
> >> This appears to be the commit introducing the refcnt leak:
> >>
> >> $ git bisect bad
> >> dbc2b5e9a09e9a6664679a667ff81cff6e5f2641 is the first bad commit
> >> commit dbc2b5e9a09e9a6664679a667ff81cff6e5f2641
> >> Author: Xin Long 
> >> Date:   Fri May 12 14:39:52 2017 +0800
> >>
> >>  sctp: fix src address selection if using secondary addresses for
> >> ipv6
> >>
> >>
> >> v4.14 is bad. Running bisect in the background while doing other
> >> things
> >>
> >
> > Interesting. The commit that avoids the refcnt leak is
> >
> > commit 955ec4cb3b54c7c389a9f830be7d3ae2056b9212
> > Author: David Ahern 
> > Date:   Wed Jan 24 19:45:29 2018 -0800
> >
> >  net/ipv6: Do not allow route add with a device that is down
> >
> > That commit does not intentionally address the problem so it is just
> > masking the problematic code introduced by the commit above.
> 
>  Thanks, David A.
> 
>  I'm still on a trip. will look into this asap.
> >>>
> >>>
> >>> Alexey and Tommi already had the patches for this issue on
> >>> both SCTP v4 and v6 dst_get, Thanks.
> >>
> >>
> >>
> >>
> >> Is this meant to be fixed already? I am still seeing this on the
> >> latest upstream tree.
> >>
> >
> > These two commits are in v4.16-rc1:
> >
> > commit 4a31a6b19f9ddf498c81f5c9b089742b7472a6f8
> > Author: Tommi Rantala 
> > Date:   Mon Feb 5 21:48:14 2018 +0200
> >
> > sctp: fix dst refcnt leak in sctp_v4_get_dst
> > ...
> > Fixes: 410f03831 ("sctp: add routing output fallback")
> > Fixes: 0ca50d12f ("sctp: fix src address selection if using secondary
> > addresses")
> >
> >
> > commit 957d761cf91cdbb175ad7d8f5472336a4d54dbf2
> > Author: Alexey Kodanev 
> > Date:   Mon Feb 5 15:10:35 2018 +0300
> >
> > sctp: fix dst refcnt leak in sctp_v6_get_dst()
> > ...
> > Fixes: dbc2b5e9a09e ("sctp: fix src address selection if using secondary
> > addresses for ipv6")
> >
> >
> > I guess we missed something if it's still reproducible.
> >
> > I can check it later this week, unless someone else beat me to it.
> 
> Hi Tommi,
> 
> Hmmm, I can't claim that it's exactly the same bug. Perhaps it's
> another one then. But I am still seeing these:
> 
> [   58.799130] unregister_netdevice: waiting for lo to become free.
> Usage count = 4
> [   60.847138] unregister_netdevice: waiting for lo to become free.
> Usage count = 4
> [   62.895093] unregister_netdevice: waiting for lo to become free.
> Usage count = 4
> [   64.943103] unregister_netdevice: waiting for lo to become free.
> Usage count = 4
> 
> on upstream tree pulled ~12 hours ago.
> 
Can you write a systemtap script to probe dev_hold, and dev_put, printing out a
backtrace if the device name matches "lo".  That should tell us definitively if
the problem is in the same location or not

Neil

> Kernel does not detect this as any kind of BUG/WARNING, so
> syzkaller/syzbot do not catch it as bug and do not try to reproduce,
> localize and report.
> 


Re: net: hang in unregister_netdevice: waiting for lo to become free

2018-02-20 Thread Dmitry Vyukov
On Tue, Feb 20, 2018 at 8:56 AM, Tommi Rantala
 wrote:
> On 19.02.2018 20:59, Dmitry Vyukov wrote:
>>
>> On Sat, Feb 3, 2018 at 1:15 PM, Xin Long  wrote:
>>>>>
>>>>> On 1/30/18 1:57 PM, David Ahern wrote:
>>>>>>
>>>>>> On 1/30/18 1:08 PM, Daniel Borkmann wrote:
>>>>>>>
>>>>>>> On 01/30/2018 07:32 PM, Cong Wang wrote:
>>>>>>>>
>>>>>>>> On Tue, Jan 30, 2018 at 4:09 AM, Dmitry Vyukov wrote:
>>>>>>>>>
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> The following program creates a hang in unregister_netdevice.
>>>>>>>>> cleanup_net work hangs there forever periodically printing
>>>>>>>>> "unregister_netdevice: waiting for lo to become free. Usage count = 3"
>>>>>>>>> and creation of any new network namespaces hangs forever.
>>>>>>>>
>>>>>>>>
>>>>>>>> Interestingly, this is not reproducible on net-next.
>>>>>>>
>>>>>>>
>>>>>>> The most recent change on netns refcnt was 4ee806d51176 ("net: tcp: close
>>>>>>> sock if net namespace is exiting") in net/net-next from 5 days ago, maybe
>>>>>>> fixed due to that?
>>>>>>>
>>>>>>
>>>>>> This appears to be the commit introducing the refcnt leak:
>>>>>>
>>>>>> $ git bisect bad
>>>>>> dbc2b5e9a09e9a6664679a667ff81cff6e5f2641 is the first bad commit
>>>>>> commit dbc2b5e9a09e9a6664679a667ff81cff6e5f2641
>>>>>> Author: Xin Long
>>>>>> Date:   Fri May 12 14:39:52 2017 +0800
>>>>>>
>>>>>>  sctp: fix src address selection if using secondary addresses for ipv6
>>>>>>
>>>>>>
>>>>>> v4.14 is bad. Running bisect in the background while doing other things
>>>>>>
>>>>>
>>>>> Interesting. The commit that avoids the refcnt leak is
>>>>>
>>>>> commit 955ec4cb3b54c7c389a9f830be7d3ae2056b9212
>>>>> Author: David Ahern
>>>>> Date:   Wed Jan 24 19:45:29 2018 -0800
>>>>>
>>>>>  net/ipv6: Do not allow route add with a device that is down
>>>>>
>>>>> That commit does not intentionally address the problem so it is just
>>>>> masking the problematic code introduced by the commit above.
>>>>
>>>> Thanks, David A.
>>>>
>>>> I'm still on a trip. will look into this asap.
>>>
>>>
>>> Alexey and Tommi already had the patches for this issue on
>>> both SCTP v4 and v6 dst_get, Thanks.
>>
>>
>>
>>
>> Is this meant to be fixed already? I am still seeing this on the
>> latest upstream tree.
>>
>
> These two commits are in v4.16-rc1:
>
> commit 4a31a6b19f9ddf498c81f5c9b089742b7472a6f8
> Author: Tommi Rantala 
> Date:   Mon Feb 5 21:48:14 2018 +0200
>
> sctp: fix dst refcnt leak in sctp_v4_get_dst
> ...
> Fixes: 410f03831 ("sctp: add routing output fallback")
> Fixes: 0ca50d12f ("sctp: fix src address selection if using secondary
> addresses")
>
>
> commit 957d761cf91cdbb175ad7d8f5472336a4d54dbf2
> Author: Alexey Kodanev 
> Date:   Mon Feb 5 15:10:35 2018 +0300
>
> sctp: fix dst refcnt leak in sctp_v6_get_dst()
> ...
> Fixes: dbc2b5e9a09e ("sctp: fix src address selection if using secondary
> addresses for ipv6")
>
>
> I guess we missed something if it's still reproducible.
>
> I can check it later this week, unless someone else beat me to it.

Hi Tommi,

Hmmm, I can't claim that it's exactly the same bug. Perhaps it's
another one then. But I am still seeing these:

[   58.799130] unregister_netdevice: waiting for lo to become free.
Usage count = 4
[   60.847138] unregister_netdevice: waiting for lo to become free.
Usage count = 4
[   62.895093] unregister_netdevice: waiting for lo to become free.
Usage count = 4
[   64.943103] unregister_netdevice: waiting for lo to become free.
Usage count = 4

on upstream tree pulled ~12 hours ago.

Kernel does not detect this as any kind of BUG/WARNING, so
syzkaller/syzbot do not catch it as bug and do not try to reproduce,
localize and report.


Re: net: hang in unregister_netdevice: waiting for lo to become free

2018-02-19 Thread Tommi Rantala

On 19.02.2018 20:59, Dmitry Vyukov wrote:
> On Sat, Feb 3, 2018 at 1:15 PM, Xin Long wrote:
>>>> On 1/30/18 1:57 PM, David Ahern wrote:
>>>>> On 1/30/18 1:08 PM, Daniel Borkmann wrote:
>>>>>> On 01/30/2018 07:32 PM, Cong Wang wrote:
>>>>>>> On Tue, Jan 30, 2018 at 4:09 AM, Dmitry Vyukov wrote:
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> The following program creates a hang in unregister_netdevice.
>>>>>>>> cleanup_net work hangs there forever periodically printing
>>>>>>>> "unregister_netdevice: waiting for lo to become free. Usage count = 3"
>>>>>>>> and creation of any new network namespaces hangs forever.
>>>>>>>
>>>>>>> Interestingly, this is not reproducible on net-next.
>>>>>>
>>>>>> The most recent change on netns refcnt was 4ee806d51176 ("net: tcp: close
>>>>>> sock if net namespace is exiting") in net/net-next from 5 days ago, maybe
>>>>>> fixed due to that?
>>>>>
>>>>> This appears to be the commit introducing the refcnt leak:
>>>>>
>>>>> $ git bisect bad
>>>>> dbc2b5e9a09e9a6664679a667ff81cff6e5f2641 is the first bad commit
>>>>> commit dbc2b5e9a09e9a6664679a667ff81cff6e5f2641
>>>>> Author: Xin Long
>>>>> Date:   Fri May 12 14:39:52 2017 +0800
>>>>>
>>>>>  sctp: fix src address selection if using secondary addresses for ipv6
>>>>>
>>>>> v4.14 is bad. Running bisect in the background while doing other things
>>>>
>>>> Interesting. The commit that avoids the refcnt leak is
>>>>
>>>> commit 955ec4cb3b54c7c389a9f830be7d3ae2056b9212
>>>> Author: David Ahern
>>>> Date:   Wed Jan 24 19:45:29 2018 -0800
>>>>
>>>>  net/ipv6: Do not allow route add with a device that is down
>>>>
>>>> That commit does not intentionally address the problem so it is just
>>>> masking the problematic code introduced by the commit above.
>>>
>>> Thanks, David A.
>>>
>>> I'm still on a trip. will look into this asap.
>>
>> Alexey and Tommi already had the patches for this issue on
>> both SCTP v4 and v6 dst_get, Thanks.
>
> Is this meant to be fixed already? I am still seeing this on the
> latest upstream tree.


These two commits are in v4.16-rc1:

commit 4a31a6b19f9ddf498c81f5c9b089742b7472a6f8
Author: Tommi Rantala 
Date:   Mon Feb 5 21:48:14 2018 +0200

sctp: fix dst refcnt leak in sctp_v4_get_dst
...
Fixes: 410f03831 ("sctp: add routing output fallback")
Fixes: 0ca50d12f ("sctp: fix src address selection if using 
secondary addresses")



commit 957d761cf91cdbb175ad7d8f5472336a4d54dbf2
Author: Alexey Kodanev 
Date:   Mon Feb 5 15:10:35 2018 +0300

sctp: fix dst refcnt leak in sctp_v6_get_dst()
...
Fixes: dbc2b5e9a09e ("sctp: fix src address selection if using 
secondary addresses for ipv6")



I guess we missed something if it's still reproducible.

I can check it later this week, unless someone else beat me to it.

Tommi


Re: net: hang in unregister_netdevice: waiting for lo to become free

2018-02-19 Thread Dmitry Vyukov
On Sat, Feb 3, 2018 at 1:15 PM, Xin Long  wrote:
>>> On 1/30/18 1:57 PM, David Ahern wrote:
>>>> On 1/30/18 1:08 PM, Daniel Borkmann wrote:
>>>>> On 01/30/2018 07:32 PM, Cong Wang wrote:
>>>>>> On Tue, Jan 30, 2018 at 4:09 AM, Dmitry Vyukov wrote:
>>>>>>> Hello,
>>>>>>>
>>>>>>> The following program creates a hang in unregister_netdevice.
>>>>>>> cleanup_net work hangs there forever periodically printing
>>>>>>> "unregister_netdevice: waiting for lo to become free. Usage count = 3"
>>>>>>> and creation of any new network namespaces hangs forever.
>>>>>>
>>>>>> Interestingly, this is not reproducible on net-next.
>>>>>
>>>>> The most recent change on netns refcnt was 4ee806d51176 ("net: tcp: close
>>>>> sock if net namespace is exiting") in net/net-next from 5 days ago, maybe
>>>>> fixed due to that?
>>>>>
>>>>
>>>> This appears to be the commit introducing the refcnt leak:
>>>>
>>>> $ git bisect bad
>>>> dbc2b5e9a09e9a6664679a667ff81cff6e5f2641 is the first bad commit
>>>> commit dbc2b5e9a09e9a6664679a667ff81cff6e5f2641
>>>> Author: Xin Long
>>>> Date:   Fri May 12 14:39:52 2017 +0800
>>>>
>>>> sctp: fix src address selection if using secondary addresses for ipv6
>>>>
>>>>
>>>> v4.14 is bad. Running bisect in the background while doing other things
>>>>
>>>
>>> Interesting. The commit that avoids the refcnt leak is
>>>
>>> commit 955ec4cb3b54c7c389a9f830be7d3ae2056b9212
>>> Author: David Ahern 
>>> Date:   Wed Jan 24 19:45:29 2018 -0800
>>>
>>> net/ipv6: Do not allow route add with a device that is down
>>>
>>> That commit does not intentionally address the problem so it is just
>>> masking the problematic code introduced by the commit above.
>> Thanks, David A.
>>
>> I'm still on a trip. will look into this asap.
>
> Alexey and Tommi already had the patches for this issue on
> both SCTP v4 and v6 dst_get, Thanks.



Is this meant to be fixed already? I am still seeing this on the
latest upstream tree.


Re: net: hang in unregister_netdevice: waiting for lo to become free

2018-02-03 Thread Xin Long
On Thu, Feb 1, 2018 at 1:49 AM, Xin Long  wrote:
> On Tue, Jan 30, 2018 at 11:59 PM, David Ahern  wrote:
>> On 1/30/18 1:57 PM, David Ahern wrote:
>>> On 1/30/18 1:08 PM, Daniel Borkmann wrote:
>>>> On 01/30/2018 07:32 PM, Cong Wang wrote:
>>>>> On Tue, Jan 30, 2018 at 4:09 AM, Dmitry Vyukov wrote:
>>>>>> Hello,
>>>>>>
>>>>>> The following program creates a hang in unregister_netdevice.
>>>>>> cleanup_net work hangs there forever periodically printing
>>>>>> "unregister_netdevice: waiting for lo to become free. Usage count = 3"
>>>>>> and creation of any new network namespaces hangs forever.
>>>>>
>>>>> Interestingly, this is not reproducible on net-next.
>>>>
>>>> The most recent change on netns refcnt was 4ee806d51176 ("net: tcp: close
>>>> sock if net namespace is exiting") in net/net-next from 5 days ago, maybe
>>>> fixed due to that?
>>>>
>>>
>>> This appears to be the commit introducing the refcnt leak:
>>>
>>> $ git bisect bad
>>> dbc2b5e9a09e9a6664679a667ff81cff6e5f2641 is the first bad commit
>>> commit dbc2b5e9a09e9a6664679a667ff81cff6e5f2641
>>> Author: Xin Long 
>>> Date:   Fri May 12 14:39:52 2017 +0800
>>>
>>> sctp: fix src address selection if using secondary addresses for ipv6
>>>
>>>
>>> v4.14 is bad. Running bisect in the background while doing other things
>>>
>>
>> Interesting. The commit that avoids the refcnt leak is
>>
>> commit 955ec4cb3b54c7c389a9f830be7d3ae2056b9212
>> Author: David Ahern 
>> Date:   Wed Jan 24 19:45:29 2018 -0800
>>
>> net/ipv6: Do not allow route add with a device that is down
>>
>> That commit does not intentionally address the problem so it is just
>> masking the problematic code introduced by the commit above.
> Thanks, David A.
>
> I'm still on a trip. will look into this asap.

Alexey and Tommi already had the patches for this issue on
both SCTP v4 and v6 dst_get, Thanks.


Re: net: hang in unregister_netdevice: waiting for lo to become free

2018-01-31 Thread Xin Long
On Tue, Jan 30, 2018 at 11:59 PM, David Ahern  wrote:
> On 1/30/18 1:57 PM, David Ahern wrote:
>> On 1/30/18 1:08 PM, Daniel Borkmann wrote:
>>> On 01/30/2018 07:32 PM, Cong Wang wrote:
>>>> On Tue, Jan 30, 2018 at 4:09 AM, Dmitry Vyukov wrote:
>>>>> Hello,
>>>>>
>>>>> The following program creates a hang in unregister_netdevice.
>>>>> cleanup_net work hangs there forever periodically printing
>>>>> "unregister_netdevice: waiting for lo to become free. Usage count = 3"
>>>>> and creation of any new network namespaces hangs forever.
>>>>
>>>> Interestingly, this is not reproducible on net-next.
>>>
>>> The most recent change on netns refcnt was 4ee806d51176 ("net: tcp: close
>>> sock if net namespace is exiting") in net/net-next from 5 days ago, maybe
>>> fixed due to that?
>>>
>>
>> This appears to be the commit introducing the refcnt leak:
>>
>> $ git bisect bad
>> dbc2b5e9a09e9a6664679a667ff81cff6e5f2641 is the first bad commit
>> commit dbc2b5e9a09e9a6664679a667ff81cff6e5f2641
>> Author: Xin Long 
>> Date:   Fri May 12 14:39:52 2017 +0800
>>
>> sctp: fix src address selection if using secondary addresses for ipv6
>>
>>
>> v4.14 is bad. Running bisect in the background while doing other things
>>
>
> Interesting. The commit that avoids the refcnt leak is
>
> commit 955ec4cb3b54c7c389a9f830be7d3ae2056b9212
> Author: David Ahern 
> Date:   Wed Jan 24 19:45:29 2018 -0800
>
> net/ipv6: Do not allow route add with a device that is down
>
> That commit does not intentionally address the problem so it is just
> masking the problematic code introduced by the commit above.
Thanks, David A.

I'm still on a trip. will look into this asap.


Re: net: hang in unregister_netdevice: waiting for lo to become free

2018-01-30 Thread David Ahern
On 1/30/18 1:57 PM, David Ahern wrote:
> On 1/30/18 1:08 PM, Daniel Borkmann wrote:
>> On 01/30/2018 07:32 PM, Cong Wang wrote:
>>> On Tue, Jan 30, 2018 at 4:09 AM, Dmitry Vyukov  wrote:
>>>> Hello,
>>>>
>>>> The following program creates a hang in unregister_netdevice.
>>>> cleanup_net work hangs there forever periodically printing
>>>> "unregister_netdevice: waiting for lo to become free. Usage count = 3"
>>>> and creation of any new network namespaces hangs forever.
>>>
>>> Interestingly, this is not reproducible on net-next.
>>
>> The most recent change on netns refcnt was 4ee806d51176 ("net: tcp: close
>> sock if net namespace is exiting") in net/net-next from 5 days ago, maybe
>> fixed due to that?
>>
> 
> This appears to be the commit introducing the refcnt leak:
> 
> $ git bisect bad
> dbc2b5e9a09e9a6664679a667ff81cff6e5f2641 is the first bad commit
> commit dbc2b5e9a09e9a6664679a667ff81cff6e5f2641
> Author: Xin Long 
> Date:   Fri May 12 14:39:52 2017 +0800
> 
> sctp: fix src address selection if using secondary addresses for ipv6
> 
> 
> v4.14 is bad. Running bisect in the background while doing other things
> 

Interesting. The commit that avoids the refcnt leak is

commit 955ec4cb3b54c7c389a9f830be7d3ae2056b9212
Author: David Ahern 
Date:   Wed Jan 24 19:45:29 2018 -0800

net/ipv6: Do not allow route add with a device that is down

That commit does not intentionally address the problem so it is just
masking the problematic code introduced by the commit above.


Re: net: hang in unregister_netdevice: waiting for lo to become free

2018-01-30 Thread David Ahern
On 1/30/18 1:08 PM, Daniel Borkmann wrote:
> On 01/30/2018 07:32 PM, Cong Wang wrote:
>> On Tue, Jan 30, 2018 at 4:09 AM, Dmitry Vyukov  wrote:
>>> Hello,
>>>
>>> The following program creates a hang in unregister_netdevice.
>>> cleanup_net work hangs there forever periodically printing
>>> "unregister_netdevice: waiting for lo to become free. Usage count = 3"
>>> and creation of any new network namespaces hangs forever.
>>
>> Interestingly, this is not reproducible on net-next.
> 
> The most recent change on netns refcnt was 4ee806d51176 ("net: tcp: close
> sock if net namespace is exiting") in net/net-next from 5 days ago, maybe
> fixed due to that?
> 

This appears to be the commit introducing the refcnt leak:

$ git bisect bad
dbc2b5e9a09e9a6664679a667ff81cff6e5f2641 is the first bad commit
commit dbc2b5e9a09e9a6664679a667ff81cff6e5f2641
Author: Xin Long 
Date:   Fri May 12 14:39:52 2017 +0800

sctp: fix src address selection if using secondary addresses for ipv6


v4.14 is bad. Running bisect in the background while doing other things


Re: net: hang in unregister_netdevice: waiting for lo to become free

2018-01-30 Thread Daniel Borkmann
On 01/30/2018 07:32 PM, Cong Wang wrote:
> On Tue, Jan 30, 2018 at 4:09 AM, Dmitry Vyukov  wrote:
>> Hello,
>>
>> The following program creates a hang in unregister_netdevice.
>> cleanup_net work hangs there forever periodically printing
>> "unregister_netdevice: waiting for lo to become free. Usage count = 3"
>> and creation of any new network namespaces hangs forever.
> 
> Interestingly, this is not reproducible on net-next.

The most recent change on netns refcnt was 4ee806d51176 ("net: tcp: close
sock if net namespace is exiting") in net/net-next from 5 days ago, maybe
fixed due to that?


Re: net: hang in unregister_netdevice: waiting for lo to become free

2018-01-30 Thread Cong Wang
On Tue, Jan 30, 2018 at 4:09 AM, Dmitry Vyukov  wrote:
> Hello,
>
> The following program creates a hang in unregister_netdevice.
> cleanup_net work hangs there forever periodically printing
> "unregister_netdevice: waiting for lo to become free. Usage count = 3"
> and creation of any new network namespaces hangs forever.

Interestingly, this is not reproducible on net-next.

