Re: unregister_netdevice: waiting for eth0 to become free. Usage count = 1

2017-08-13 Thread David Ahern
On 8/13/17 2:56 PM, Wei Wang wrote:
>> Looking at my patch to move host routes from loopback to device with the
>> address, I have this:
>>
>> @@ -2789,7 +2808,8 @@ static int fib6_ifdown(struct rt6_info *rt, void *arg)
>> const struct arg_dev_net *adn = arg;
>> const struct net_device *dev = adn->dev;
>>
>> -   if ((rt->dst.dev == dev || !dev) &&
>> +   if ((rt->dst.dev == dev || !dev ||
>> +(netdev_unregistering(dev) && rt->rt6i_idev->dev == dev)) &&
>> rt != adn->net->ipv6.ip6_null_entry &&
>> (rt->rt6i_nsiblings == 0 ||
>>  (dev && netdev_unregistering(dev)) ||
> 
> As you explained earlier, after your patch, all entries in the fib6
> tree will have rt->dst.dev be the same as rt->rt6i_idev->dev except
> those ones created by p6_rt_cache_alloc() and ip6_rt_pcpu_alloc().
> Then the above newly added check is mainly to catch those cached dst
> entries (created by ip6_rt_cached_alloc()). right?
> And it is required because __ipv6_ifa_notify() -> ip6_del_rt() won't
> take care of those cached dst entries.
> 
> Then I think I should wait for your patches to get merged before
> submitting my patch?

no. your patch will need to go back to 4.12; my changes will not be
appropriate for that.


Re: unregister_netdevice: waiting for eth0 to become free. Usage count = 1

2017-08-13 Thread Wei Wang
> Looking at my patch to move host routes from loopback to device with the
> address, I have this:
>
> @@ -2789,7 +2808,8 @@ static int fib6_ifdown(struct rt6_info *rt, void *arg)
> const struct arg_dev_net *adn = arg;
> const struct net_device *dev = adn->dev;
>
> -   if ((rt->dst.dev == dev || !dev) &&
> +   if ((rt->dst.dev == dev || !dev ||
> +(netdev_unregistering(dev) && rt->rt6i_idev->dev == dev)) &&
> rt != adn->net->ipv6.ip6_null_entry &&
> (rt->rt6i_nsiblings == 0 ||
>  (dev && netdev_unregistering(dev)) ||

As you explained earlier, after your patch, all entries in the fib6
tree will have rt->dst.dev be the same as rt->rt6i_idev->dev except
those ones created by p6_rt_cache_alloc() and ip6_rt_pcpu_alloc().
Then the above newly added check is mainly to catch those cached dst
entries (created by ip6_rt_cached_alloc()). right?
And it is required because __ipv6_ifa_notify() -> ip6_del_rt() won't
take care of those cached dst entries.

Then I think I should wait for your patches to get merged before
submitting my patch?

Thanks.
Wei


On Sun, Aug 13, 2017 at 9:24 AM, David Ahern  wrote:
> On 8/12/17 1:42 PM, Wei Wang wrote:
>> Hi Ido,
>>
 - if ((rt->dst.dev == dev || !dev) &&
 + if ((rt->dst.dev == dev || !dev ||
 +  rt->rt6i_idev->dev == dev) &&
>>>
>>> Can you please explain why this line is needed? While host routes aren't
>>> removed from the FIB by rt6_ifdown() (when dst.dev goes down), they are
>>> removed later on in addrconf_ifdown().
>>>
>>
>> Yes.. Agree. But one difference is that if the route is removed from
>> addrconf_ifdown(), dst_dev_put() won't be called to release the
>> devices before doing dst_release(). It is OK if dst_release() sees the
>> refcnt on dst already drops to 0 and directly destroys the dst. But I
>> think it will cause problem if at the time, the dst is still held by
>> some other users because then the refcnt on the device going down will
>> not get released.
>> That's why I think we should remove the dst with either dst->dev ==
>> going down dev or rt6->rt6i_idev->dev == going down dev from the fib6
>> tree always because there, we always call dst_dev_put() to release the
>> device.
>>
>>> With your patch, if I check the return value of ip6_del_rt() in
>>> __ipv6_ifa_notify() I see that -ENONET is returned. Because the host
>>> route was already removed by rt6_ifdown(). When the line in question is
>>> removed from the patch I don't get the error anymore.
>>>
>>
>> Right. That is expected as the route is already removed from the tree.
>>
>>> Is it possible that in John's case the host route was correctly removed
>>> from the FIB and that the unreleased reference was due to a wrong check
>>> in ip6_dst_ifdown() (which you patched correctly AFAICT)?
>>>
>>
>> Yes. possible. But as I explained earlier, I still think we should
>> also remove routes with rt6->rt6i_idev->dev == going down dev from the
>> tree.
>
> Looking at my patch to move host routes from loopback to device with the
> address, I have this:
>
> @@ -2789,7 +2808,8 @@ static int fib6_ifdown(struct rt6_info *rt, void *arg)
> const struct arg_dev_net *adn = arg;
> const struct net_device *dev = adn->dev;
>
> -   if ((rt->dst.dev == dev || !dev) &&
> +   if ((rt->dst.dev == dev || !dev ||
> +(netdev_unregistering(dev) && rt->rt6i_idev->dev == dev)) &&
> rt != adn->net->ipv6.ip6_null_entry &&
> (rt->rt6i_nsiblings == 0 ||
>  (dev && netdev_unregistering(dev)) ||
>
>


Re: unregister_netdevice: waiting for eth0 to become free. Usage count = 1

2017-08-13 Thread David Ahern
On 8/12/17 1:42 PM, Wei Wang wrote:
> Hi Ido,
> 
>>> - if ((rt->dst.dev == dev || !dev) &&
>>> + if ((rt->dst.dev == dev || !dev ||
>>> +  rt->rt6i_idev->dev == dev) &&
>>
>> Can you please explain why this line is needed? While host routes aren't
>> removed from the FIB by rt6_ifdown() (when dst.dev goes down), they are
>> removed later on in addrconf_ifdown().
>>
> 
> Yes.. Agree. But one difference is that if the route is removed from
> addrconf_ifdown(), dst_dev_put() won't be called to release the
> devices before doing dst_release(). It is OK if dst_release() sees the
> refcnt on dst already drops to 0 and directly destroys the dst. But I
> think it will cause problem if at the time, the dst is still held by
> some other users because then the refcnt on the device going down will
> not get released.
> That's why I think we should remove the dst with either dst->dev ==
> going down dev or rt6->rt6i_idev->dev == going down dev from the fib6
> tree always because there, we always call dst_dev_put() to release the
> device.
> 
>> With your patch, if I check the return value of ip6_del_rt() in
>> __ipv6_ifa_notify() I see that -ENONET is returned. Because the host
>> route was already removed by rt6_ifdown(). When the line in question is
>> removed from the patch I don't get the error anymore.
>>
> 
> Right. That is expected as the route is already removed from the tree.
> 
>> Is it possible that in John's case the host route was correctly removed
>> from the FIB and that the unreleased reference was due to a wrong check
>> in ip6_dst_ifdown() (which you patched correctly AFAICT)?
>>
> 
> Yes. possible. But as I explained earlier, I still think we should
> also remove routes with rt6->rt6i_idev->dev == going down dev from the
> tree.

Looking at my patch to move host routes from loopback to device with the
address, I have this:

@@ -2789,7 +2808,8 @@ static int fib6_ifdown(struct rt6_info *rt, void *arg)
const struct arg_dev_net *adn = arg;
const struct net_device *dev = adn->dev;

-   if ((rt->dst.dev == dev || !dev) &&
+   if ((rt->dst.dev == dev || !dev ||
+(netdev_unregistering(dev) && rt->rt6i_idev->dev == dev)) &&
rt != adn->net->ipv6.ip6_null_entry &&
(rt->rt6i_nsiblings == 0 ||
 (dev && netdev_unregistering(dev)) ||




Re: unregister_netdevice: waiting for eth0 to become free. Usage count = 1

2017-08-12 Thread Wei Wang
Hi Ido,

>> - if ((rt->dst.dev == dev || !dev) &&
>> + if ((rt->dst.dev == dev || !dev ||
>> +  rt->rt6i_idev->dev == dev) &&
>
> Can you please explain why this line is needed? While host routes aren't
> removed from the FIB by rt6_ifdown() (when dst.dev goes down), they are
> removed later on in addrconf_ifdown().
>

Yes.. Agree. But one difference is that if the route is removed from
addrconf_ifdown(), dst_dev_put() won't be called to release the
devices before doing dst_release(). It is OK if dst_release() sees the
refcnt on dst already drops to 0 and directly destroys the dst. But I
think it will cause problem if at the time, the dst is still held by
some other users because then the refcnt on the device going down will
not get released.
That's why I think we should remove the dst with either dst->dev ==
going down dev or rt6->rt6i_idev->dev == going down dev from the fib6
tree always because there, we always call dst_dev_put() to release the
device.

> With your patch, if I check the return value of ip6_del_rt() in
> __ipv6_ifa_notify() I see that -ENONET is returned. Because the host
> route was already removed by rt6_ifdown(). When the line in question is
> removed from the patch I don't get the error anymore.
>

Right. That is expected as the route is already removed from the tree.

> Is it possible that in John's case the host route was correctly removed
> from the FIB and that the unreleased reference was due to a wrong check
> in ip6_dst_ifdown() (which you patched correctly AFAICT)?
>

Yes. possible. But as I explained earlier, I still think we should
also remove routes with rt6->rt6i_idev->dev == going down dev from the
tree.

Thanks.
Wei

On Sat, Aug 12, 2017 at 11:01 AM, Ido Schimmel  wrote:
> Hi Wei,
>
> On Fri, Aug 11, 2017 at 05:10:02PM -0700, Wei Wang wrote:
>> I think we have a potential fix for this issue.
>> Martin and I found that when addrconf_dst_alloc() creates a rt6, it is
>> possible that rt6->dst.dev points to loopback device while
>> rt6->rt6i_idev->dev points to a real device.
>> When the real device goes down, the current fib6 clean up code only
>> checks for rt6->dst.dev and assumes rt6->rt6i_idev->dev is the same.
>> That leaves unreleased refcnt on the real device if rt6->dst.dev
>> points to loopback dev.
>
> [...]
>
>> From 2d8861808c2029013f6b6e86120ba6902329145b Mon Sep 17 00:00:00 2001
>> From: Wei Wang 
>> Date: Fri, 11 Aug 2017 16:36:04 -0700
>> Subject: [PATCH 1/2] potential fix for unregister_netdevice()
>>
>> Change-Id: I5d5f6f7a7ad0f5dd769f33487db17ff2570d52ea
>> ---
>>  net/ipv6/route.c | 17 -
>>  1 file changed, 8 insertions(+), 9 deletions(-)
>>
>> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
>> index 4d30c96a819d..105922903932 100644
>> --- a/net/ipv6/route.c
>> +++ b/net/ipv6/route.c
>> @@ -417,14 +417,12 @@ static void ip6_dst_ifdown(struct dst_entry *dst, 
>> struct net_device *dev,
>>   struct net_device *loopback_dev =
>>   dev_net(dev)->loopback_dev;
>>
>> - if (dev != loopback_dev) {
>> - if (idev && idev->dev == dev) {
>> - struct inet6_dev *loopback_idev =
>> - in6_dev_get(loopback_dev);
>> - if (loopback_idev) {
>> - rt->rt6i_idev = loopback_idev;
>> - in6_dev_put(idev);
>> - }
>> + if (idev && idev->dev != loopback_dev) {
>> + struct inet6_dev *loopback_idev =
>> + in6_dev_get(loopback_dev);
>> + if (loopback_idev) {
>> + rt->rt6i_idev = loopback_idev;
>> + in6_dev_put(idev);
>>   }
>>   }
>>  }
>> @@ -2789,7 +2787,8 @@ static int fib6_ifdown(struct rt6_info *rt, void *arg)
>>   const struct arg_dev_net *adn = arg;
>>   const struct net_device *dev = adn->dev;
>>
>> - if ((rt->dst.dev == dev || !dev) &&
>> + if ((rt->dst.dev == dev || !dev ||
>> +  rt->rt6i_idev->dev == dev) &&
>
> Can you please explain why this line is needed? While host routes aren't
> removed from the FIB by rt6_ifdown() (when dst.dev goes down), they are
> removed later on in addrconf_ifdown().
>
> With your patch, if I check the return value of ip6_del_rt() in
> __ipv6_ifa_notify() I see that -ENONET is returned. Because the host
> route was already removed by rt6_ifdown(). When the line in question is
> removed from the patch I don't get the error anymore.
>
> Is it possible that in John's case the host route was correctly removed
> from the FIB and that the unreleased reference was due to a wrong check
> in ip6_dst_ifdown() (which you patched correctly AFAICT)?
>
> Thanks
>
>>   rt != adn->net->ipv6.ip6_null_entry &&
>>   (rt->rt6i_nsiblings == 0 ||
>>(dev && netdev_unregistering(dev)) ||
>> --
>> 2.14.0.434.g98096fd7a8-goog
>>


Re: unregister_netdevice: waiting for eth0 to become free. Usage count = 1

2017-08-12 Thread Wei Wang
On Fri, Aug 11, 2017 at 8:37 PM, David Ahern  wrote:
> On 8/11/17 6:25 PM, Wei Wang wrote:
>> By "a patch to fix that" do you mean after your patch, for every rt6,
>> rt6->rt6i_idev will be the same as rt6->dst.dev?
>
> FIB entries should have them the same device with my patch.
>
> The copies done (ip6_rt_cache_alloc and ip6_rt_pcpu_alloc) will have to
> set dst.dev to loopback or VRF device for RTF_LOCAL routes; it's the
> only way to get local traffic to work and this is similar to what IPv4 does.

Got it. Thanks David.

Wei


Re: unregister_netdevice: waiting for eth0 to become free. Usage count = 1

2017-08-12 Thread Wei Wang
On Fri, Aug 11, 2017 at 8:07 PM, John Stultz  wrote:
> On Fri, Aug 11, 2017 at 5:31 PM, John Stultz  wrote:
>> On Fri, Aug 11, 2017 at 5:10 PM, Wei Wang  wrote:
 If after Cong's fix, the issue still happens, could you help try the
 patch attached and collect all logs when you try the reproduce the
 issue? It would be great to have logs for both success case and the
 failure case.

 Thanks so much for your help.

>>>
>>> I think we have a potential fix for this issue.
>>> Martin and I found that when addrconf_dst_alloc() creates a rt6, it is
>>> possible that rt6->dst.dev points to loopback device while
>>> rt6->rt6i_idev->dev points to a real device.
>>> When the real device goes down, the current fib6 clean up code only
>>> checks for rt6->dst.dev and assumes rt6->rt6i_idev->dev is the same.
>>> That leaves unreleased refcnt on the real device if rt6->dst.dev
>>> points to loopback dev.
>>>
>>> The attached potential fix is tested by Martin and made sure it fixes his 
>>> issue.
>>>
>>> John,
>>> It will be great if you can also give it a try and see if it fixes the
>>> issue on your side before I submit an official patch.
>>
>> So yes, sorry I haven't been able to get back quicker on the other
>> patches sent, was mucking about in other work.
>>
>> So yea, this patch  (potential fix for unregister_netdevice()) seems
>> to avoid the issue.
>>
>> I'm going to do some further testing, but its looking good so far.
>
> Looks good so far! I've not hit the issue yet.
>
> Thanks so much for sorting out a fix!
>
> If its useful:
> Tested-by: John Stultz 
>
> thanks again
> -john


Re: unregister_netdevice: waiting for eth0 to become free. Usage count = 1

2017-08-12 Thread Wei Wang
> Looks good so far! I've not hit the issue yet.
>

Great. I will prepare an official patch then.

> Thanks so much for sorting out a fix!
>
> If its useful:
> Tested-by: John Stultz 

Sure will do.

Thanks.
Wei

On Fri, Aug 11, 2017 at 8:07 PM, John Stultz  wrote:
> On Fri, Aug 11, 2017 at 5:31 PM, John Stultz  wrote:
>> On Fri, Aug 11, 2017 at 5:10 PM, Wei Wang  wrote:
 If after Cong's fix, the issue still happens, could you help try the
 patch attached and collect all logs when you try the reproduce the
 issue? It would be great to have logs for both success case and the
 failure case.

 Thanks so much for your help.

>>>
>>> I think we have a potential fix for this issue.
>>> Martin and I found that when addrconf_dst_alloc() creates a rt6, it is
>>> possible that rt6->dst.dev points to loopback device while
>>> rt6->rt6i_idev->dev points to a real device.
>>> When the real device goes down, the current fib6 clean up code only
>>> checks for rt6->dst.dev and assumes rt6->rt6i_idev->dev is the same.
>>> That leaves unreleased refcnt on the real device if rt6->dst.dev
>>> points to loopback dev.
>>>
>>> The attached potential fix is tested by Martin and made sure it fixes his 
>>> issue.
>>>
>>> John,
>>> It will be great if you can also give it a try and see if it fixes the
>>> issue on your side before I submit an official patch.
>>
>> So yes, sorry I haven't been able to get back quicker on the other
>> patches sent, was mucking about in other work.
>>
>> So yea, this patch  (potential fix for unregister_netdevice()) seems
>> to avoid the issue.
>>
>> I'm going to do some further testing, but its looking good so far.
>
> Looks good so far! I've not hit the issue yet.
>
> Thanks so much for sorting out a fix!
>
> If its useful:
> Tested-by: John Stultz 
>
> thanks again
> -john


Re: unregister_netdevice: waiting for eth0 to become free. Usage count = 1

2017-08-12 Thread Ido Schimmel
Hi Wei,

On Fri, Aug 11, 2017 at 05:10:02PM -0700, Wei Wang wrote:
> I think we have a potential fix for this issue.
> Martin and I found that when addrconf_dst_alloc() creates a rt6, it is
> possible that rt6->dst.dev points to loopback device while
> rt6->rt6i_idev->dev points to a real device.
> When the real device goes down, the current fib6 clean up code only
> checks for rt6->dst.dev and assumes rt6->rt6i_idev->dev is the same.
> That leaves unreleased refcnt on the real device if rt6->dst.dev
> points to loopback dev.

[...]

> From 2d8861808c2029013f6b6e86120ba6902329145b Mon Sep 17 00:00:00 2001
> From: Wei Wang 
> Date: Fri, 11 Aug 2017 16:36:04 -0700
> Subject: [PATCH 1/2] potential fix for unregister_netdevice()
> 
> Change-Id: I5d5f6f7a7ad0f5dd769f33487db17ff2570d52ea
> ---
>  net/ipv6/route.c | 17 -
>  1 file changed, 8 insertions(+), 9 deletions(-)
> 
> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> index 4d30c96a819d..105922903932 100644
> --- a/net/ipv6/route.c
> +++ b/net/ipv6/route.c
> @@ -417,14 +417,12 @@ static void ip6_dst_ifdown(struct dst_entry *dst, 
> struct net_device *dev,
>   struct net_device *loopback_dev =
>   dev_net(dev)->loopback_dev;
>  
> - if (dev != loopback_dev) {
> - if (idev && idev->dev == dev) {
> - struct inet6_dev *loopback_idev =
> - in6_dev_get(loopback_dev);
> - if (loopback_idev) {
> - rt->rt6i_idev = loopback_idev;
> - in6_dev_put(idev);
> - }
> + if (idev && idev->dev != loopback_dev) {
> + struct inet6_dev *loopback_idev =
> + in6_dev_get(loopback_dev);
> + if (loopback_idev) {
> + rt->rt6i_idev = loopback_idev;
> + in6_dev_put(idev);
>   }
>   }
>  }
> @@ -2789,7 +2787,8 @@ static int fib6_ifdown(struct rt6_info *rt, void *arg)
>   const struct arg_dev_net *adn = arg;
>   const struct net_device *dev = adn->dev;
>  
> - if ((rt->dst.dev == dev || !dev) &&
> + if ((rt->dst.dev == dev || !dev ||
> +  rt->rt6i_idev->dev == dev) &&

Can you please explain why this line is needed? While host routes aren't
removed from the FIB by rt6_ifdown() (when dst.dev goes down), they are
removed later on in addrconf_ifdown().

With your patch, if I check the return value of ip6_del_rt() in
__ipv6_ifa_notify() I see that -ENONET is returned. Because the host
route was already removed by rt6_ifdown(). When the line in question is
removed from the patch I don't get the error anymore.

Is it possible that in John's case the host route was correctly removed
from the FIB and that the unreleased reference was due to a wrong check
in ip6_dst_ifdown() (which you patched correctly AFAICT)?

Thanks

>   rt != adn->net->ipv6.ip6_null_entry &&
>   (rt->rt6i_nsiblings == 0 ||
>(dev && netdev_unregistering(dev)) ||
> -- 
> 2.14.0.434.g98096fd7a8-goog
> 


Re: unregister_netdevice: waiting for eth0 to become free. Usage count = 1

2017-08-11 Thread David Ahern
On 8/11/17 6:25 PM, Wei Wang wrote:
> By "a patch to fix that" do you mean after your patch, for every rt6,
> rt6->rt6i_idev will be the same as rt6->dst.dev?

FIB entries should have them the same device with my patch.

The copies done (ip6_rt_cache_alloc and ip6_rt_pcpu_alloc) will have to
set dst.dev to loopback or VRF device for RTF_LOCAL routes; it's the
only way to get local traffic to work and this is similar to what IPv4 does.


Re: unregister_netdevice: waiting for eth0 to become free. Usage count = 1

2017-08-11 Thread John Stultz
On Fri, Aug 11, 2017 at 5:31 PM, John Stultz  wrote:
> On Fri, Aug 11, 2017 at 5:10 PM, Wei Wang  wrote:
>>> If after Cong's fix, the issue still happens, could you help try the
>>> patch attached and collect all logs when you try the reproduce the
>>> issue? It would be great to have logs for both success case and the
>>> failure case.
>>>
>>> Thanks so much for your help.
>>>
>>
>> I think we have a potential fix for this issue.
>> Martin and I found that when addrconf_dst_alloc() creates a rt6, it is
>> possible that rt6->dst.dev points to loopback device while
>> rt6->rt6i_idev->dev points to a real device.
>> When the real device goes down, the current fib6 clean up code only
>> checks for rt6->dst.dev and assumes rt6->rt6i_idev->dev is the same.
>> That leaves unreleased refcnt on the real device if rt6->dst.dev
>> points to loopback dev.
>>
>> The attached potential fix is tested by Martin and made sure it fixes his 
>> issue.
>>
>> John,
>> It will be great if you can also give it a try and see if it fixes the
>> issue on your side before I submit an official patch.
>
> So yes, sorry I haven't been able to get back quicker on the other
> patches sent, was mucking about in other work.
>
> So yea, this patch  (potential fix for unregister_netdevice()) seems
> to avoid the issue.
>
> I'm going to do some further testing, but its looking good so far.

Looks good so far! I've not hit the issue yet.

Thanks so much for sorting out a fix!

If its useful:
Tested-by: John Stultz 

thanks again
-john


Re: unregister_netdevice: waiting for eth0 to become free. Usage count = 1

2017-08-11 Thread Wei Wang
> So yes, sorry I haven't been able to get back quicker on the other
> patches sent, was mucking about in other work.
>
> So yea, this patch  (potential fix for unregister_netdevice()) seems
> to avoid the issue.
>
> I'm going to do some further testing, but its looking good so far.
>

Great. Thanks.

> Do you still want feedback on the previous changes?

If this patch is good, then I don't really need the previous feedback.

Thanks.
Wei


On Fri, Aug 11, 2017 at 5:31 PM, John Stultz  wrote:
> On Fri, Aug 11, 2017 at 5:10 PM, Wei Wang  wrote:
>>> If after Cong's fix, the issue still happens, could you help try the
>>> patch attached and collect all logs when you try the reproduce the
>>> issue? It would be great to have logs for both success case and the
>>> failure case.
>>>
>>> Thanks so much for your help.
>>>
>>
>> I think we have a potential fix for this issue.
>> Martin and I found that when addrconf_dst_alloc() creates a rt6, it is
>> possible that rt6->dst.dev points to loopback device while
>> rt6->rt6i_idev->dev points to a real device.
>> When the real device goes down, the current fib6 clean up code only
>> checks for rt6->dst.dev and assumes rt6->rt6i_idev->dev is the same.
>> That leaves unreleased refcnt on the real device if rt6->dst.dev
>> points to loopback dev.
>>
>> The attached potential fix is tested by Martin and made sure it fixes his 
>> issue.
>>
>> John,
>> It will be great if you can also give it a try and see if it fixes the
>> issue on your side before I submit an official patch.
>
> So yes, sorry I haven't been able to get back quicker on the other
> patches sent, was mucking about in other work.
>
> So yea, this patch  (potential fix for unregister_netdevice()) seems
> to avoid the issue.
>
> I'm going to do some further testing, but its looking good so far.
>
> Do you still want feedback on the previous changes?
>
> thanks
> -john


Re: unregister_netdevice: waiting for eth0 to become free. Usage count = 1

2017-08-11 Thread John Stultz
On Fri, Aug 11, 2017 at 5:10 PM, Wei Wang  wrote:
>> If after Cong's fix, the issue still happens, could you help try the
>> patch attached and collect all logs when you try the reproduce the
>> issue? It would be great to have logs for both success case and the
>> failure case.
>>
>> Thanks so much for your help.
>>
>
> I think we have a potential fix for this issue.
> Martin and I found that when addrconf_dst_alloc() creates a rt6, it is
> possible that rt6->dst.dev points to loopback device while
> rt6->rt6i_idev->dev points to a real device.
> When the real device goes down, the current fib6 clean up code only
> checks for rt6->dst.dev and assumes rt6->rt6i_idev->dev is the same.
> That leaves unreleased refcnt on the real device if rt6->dst.dev
> points to loopback dev.
>
> The attached potential fix is tested by Martin and made sure it fixes his 
> issue.
>
> John,
> It will be great if you can also give it a try and see if it fixes the
> issue on your side before I submit an official patch.

So yes, sorry I haven't been able to get back quicker on the other
patches sent, was mucking about in other work.

So yea, this patch  (potential fix for unregister_netdevice()) seems
to avoid the issue.

I'm going to do some further testing, but its looking good so far.

Do you still want feedback on the previous changes?

thanks
-john


Re: unregister_netdevice: waiting for eth0 to become free. Usage count = 1

2017-08-11 Thread Wei Wang
On Fri, Aug 11, 2017 at 5:19 PM, David Ahern  wrote:
> On 8/11/17 6:10 PM, Wei Wang wrote:
>> I think we have a potential fix for this issue.
>> Martin and I found that when addrconf_dst_alloc() creates a rt6, it is
>> possible that rt6->dst.dev points to loopback device while
>> rt6->rt6i_idev->dev points to a real device.
>> When the real device goes down, the current fib6 clean up code only
>> checks for rt6->dst.dev and assumes rt6->rt6i_idev->dev is the same.
>> That leaves unreleased refcnt on the real device if rt6->dst.dev
>> points to loopback dev.
>
> Yes, host routes and anycast routes.
>
> I have a patch to fix that but it is held up on a few VRF test cases
> failing. Hopefully I can get that figured out next week. These unrelated
> routes against the loopback device have been a source of a number of
> problems (e.g. take down 'lo' and all of IPv6 networking stops for that
> namespace).

Thanks David.
By "a patch to fix that" do you mean after your patch, for every rt6,
rt6->rt6i_idev will be the same as rt6->dst.dev?


Re: unregister_netdevice: waiting for eth0 to become free. Usage count = 1

2017-08-11 Thread David Ahern
On 8/11/17 6:10 PM, Wei Wang wrote:
> I think we have a potential fix for this issue.
> Martin and I found that when addrconf_dst_alloc() creates a rt6, it is
> possible that rt6->dst.dev points to loopback device while
> rt6->rt6i_idev->dev points to a real device.
> When the real device goes down, the current fib6 clean up code only
> checks for rt6->dst.dev and assumes rt6->rt6i_idev->dev is the same.
> That leaves unreleased refcnt on the real device if rt6->dst.dev
> points to loopback dev.

Yes, host routes and anycast routes.

I have a patch to fix that but it is held up on a few VRF test cases
failing. Hopefully I can get that figured out next week. These unrelated
routes against the loopback device have been a source of a number of
problems (e.g. take down 'lo' and all of IPv6 networking stops for that
namespace).


Re: unregister_netdevice: waiting for eth0 to become free. Usage count = 1

2017-08-11 Thread Wei Wang
> If after Cong's fix, the issue still happens, could you help try the
> patch attached and collect all logs when you try the reproduce the
> issue? It would be great to have logs for both success case and the
> failure case.
>
> Thanks so much for your help.
>

I think we have a potential fix for this issue.
Martin and I found that when addrconf_dst_alloc() creates a rt6, it is
possible that rt6->dst.dev points to loopback device while
rt6->rt6i_idev->dev points to a real device.
When the real device goes down, the current fib6 clean up code only
checks for rt6->dst.dev and assumes rt6->rt6i_idev->dev is the same.
That leaves unreleased refcnt on the real device if rt6->dst.dev
points to loopback dev.

The attached potential fix is tested by Martin and made sure it fixes his issue.

John,
It will be great if you can also give it a try and see if it fixes the
issue on your side before I submit an official patch.

Thanks very much for the help from everyone.

Wei

On Fri, Aug 11, 2017 at 10:25 AM, Wei Wang  wrote:
> On Fri, Aug 11, 2017 at 9:48 AM, Cong Wang  wrote:
>> Hi,
>>
>> On Thu, Aug 10, 2017 at 11:12 AM, John Stultz  wrote:
>>> On Wed, Aug 9, 2017 at 10:41 PM, Wei Wang  wrote:
 Hi John,

 Is it possible to try the attached patch?
>>>
>>> Thanks so much for the quick turn around!
>>>
>>> So I dropped all the reverts you suggested, and applied this one
>>> against 4.13-rc4, but I'm still seeing the problematic behavior.
>>
>> Does the following one-line fix make a difference?
>>
>> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
>> index a640fbcba15d..c145a35763a0 100644
>> --- a/net/ipv6/route.c
>> +++ b/net/ipv6/route.c
>> @@ -141,7 +141,7 @@ static void rt6_uncached_list_del(struct rt6_info *rt)
>> struct uncached_list *ul = rt->rt6i_uncached_list;
>>
>> spin_lock_bh(>lock);
>> -   list_del(>rt6i_uncached);
>> +   list_del_init(>rt6i_uncached);
>> spin_unlock_bh(>lock);
>> }
>>  }
>
>
> Thanks a lot Cong for proposing this fix.
>
> For the last few days, John has been helping me running debug image
> and we found out that the leaked dst is probably in addrconf.c.
> Martin and I are looking through the code and trying to put more debugs.
>
> John,
>
> If after Cong's fix, the issue still happens, could you help try the
> patch attached and collect all logs when you try the reproduce the
> issue? It would be great to have logs for both success case and the
> failure case.
>
> Thanks so much for your help.
>
> Wei
From 2d8861808c2029013f6b6e86120ba6902329145b Mon Sep 17 00:00:00 2001
From: Wei Wang 
Date: Fri, 11 Aug 2017 16:36:04 -0700
Subject: [PATCH 1/2] potential fix for unregister_netdevice()

Change-Id: I5d5f6f7a7ad0f5dd769f33487db17ff2570d52ea
---
 net/ipv6/route.c | 17 -
 1 file changed, 8 insertions(+), 9 deletions(-)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 4d30c96a819d..105922903932 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -417,14 +417,12 @@ static void ip6_dst_ifdown(struct dst_entry *dst, struct net_device *dev,
 	struct net_device *loopback_dev =
 		dev_net(dev)->loopback_dev;
 
-	if (dev != loopback_dev) {
-		if (idev && idev->dev == dev) {
-			struct inet6_dev *loopback_idev =
-in6_dev_get(loopback_dev);
-			if (loopback_idev) {
-rt->rt6i_idev = loopback_idev;
-in6_dev_put(idev);
-			}
+	if (idev && idev->dev != loopback_dev) {
+		struct inet6_dev *loopback_idev =
+			in6_dev_get(loopback_dev);
+		if (loopback_idev) {
+			rt->rt6i_idev = loopback_idev;
+			in6_dev_put(idev);
 		}
 	}
 }
@@ -2789,7 +2787,8 @@ static int fib6_ifdown(struct rt6_info *rt, void *arg)
 	const struct arg_dev_net *adn = arg;
 	const struct net_device *dev = adn->dev;
 
-	if ((rt->dst.dev == dev || !dev) &&
+	if ((rt->dst.dev == dev || !dev ||
+	 rt->rt6i_idev->dev == dev) &&
 	rt != adn->net->ipv6.ip6_null_entry &&
 	(rt->rt6i_nsiblings == 0 ||
 	 (dev && netdev_unregistering(dev)) ||
-- 
2.14.0.434.g98096fd7a8-goog



Re: unregister_netdevice: waiting for eth0 to become free. Usage count = 1

2017-08-11 Thread Wei Wang
On Fri, Aug 11, 2017 at 9:48 AM, Cong Wang  wrote:
> Hi,
>
> On Thu, Aug 10, 2017 at 11:12 AM, John Stultz  wrote:
>> On Wed, Aug 9, 2017 at 10:41 PM, Wei Wang  wrote:
>>> Hi John,
>>>
>>> Is it possible to try the attached patch?
>>
>> Thanks so much for the quick turn around!
>>
>> So I dropped all the reverts you suggested, and applied this one
>> against 4.13-rc4, but I'm still seeing the problematic behavior.
>
> Does the following one-line fix make a difference?
>
> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> index a640fbcba15d..c145a35763a0 100644
> --- a/net/ipv6/route.c
> +++ b/net/ipv6/route.c
> @@ -141,7 +141,7 @@ static void rt6_uncached_list_del(struct rt6_info *rt)
> struct uncached_list *ul = rt->rt6i_uncached_list;
>
> spin_lock_bh(>lock);
> -   list_del(>rt6i_uncached);
> +   list_del_init(>rt6i_uncached);
> spin_unlock_bh(>lock);
> }
>  }


Thanks a lot Cong for proposing this fix.

For the last few days, John has been helping me running debug image
and we found out that the leaked dst is probably in addrconf.c.
Martin and I are looking through the code and trying to put more debugs.

John,

If after Cong's fix, the issue still happens, could you help try the
patch attached and collect all logs when you try the reproduce the
issue? It would be great to have logs for both success case and the
failure case.

Thanks so much for your help.

Wei
From 0ff591e00eac13888c5ee9c84ee51b286b27389d Mon Sep 17 00:00:00 2001
From: Wei Wang 
Date: Thu, 10 Aug 2017 21:12:32 -0700
Subject: [PATCH] ipv6: unregister_netdevice debug

Change-Id: I97b044b957a146de480e5427212c1ce80bc3dd3c
---
 include/net/dst.h |  7 +++
 include/net/dst_ops.h |  1 +
 include/net/ip6_fib.h | 10 ++
 net/core/dev.c| 16 +++
 net/core/dst.c| 10 --
 net/ipv6/addrconf.c   | 24 +++---
 net/ipv6/route.c  | 55 ---
 7 files changed, 115 insertions(+), 8 deletions(-)

diff --git a/include/net/dst.h b/include/net/dst.h
index f73611ec4017..48d9f0322492 100644
--- a/include/net/dst.h
+++ b/include/net/dst.h
@@ -56,6 +56,13 @@ struct dst_entry {
 #define DST_XFRM_TUNNEL		0x0020
 #define DST_XFRM_QUEUE		0x0040
 #define DST_METADATA		0x0080
+#define RTF_ICMPV6_DST		0x0100
+#define RTF_ADDRCONF_DST	0x0200
+#define RTF_UNCACHED_DST	0x0400
+#define RTF_CACHE_DST		0x0800
+#define RTF_FIB6_DST		0x1000
+#define RTF_PCPU_DST		0x2000
+
 
 	short			error;
 
diff --git a/include/net/dst_ops.h b/include/net/dst_ops.h
index c84b3287e38b..eef7d6b6f3cd 100644
--- a/include/net/dst_ops.h
+++ b/include/net/dst_ops.h
@@ -39,6 +39,7 @@ struct dst_ops {
 	struct kmem_cache	*kmem_cachep;
 
 	struct percpu_counter	pcpuc_entries cacheline_aligned_in_smp;
+	void			(*do_account)(struct dst_entry *dst);
 };
 
 static inline int dst_entries_get_fast(struct dst_ops *dst)
diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h
index 1a88008cc6f5..af382d123f63 100644
--- a/include/net/ip6_fib.h
+++ b/include/net/ip6_fib.h
@@ -214,6 +214,16 @@ struct rt6_statistics {
 	__u32		fib_rt_entries;		/* rt entries in table	*/
 	__u32		fib_rt_cache;		/* cache routes		*/
 	__u32		fib_discarded_routes;
+	__u32		icmpv6_dst;
+	__u32		addrconf_dst;
+	__u32		fib6_dst;
+	__u32		cache_dst;
+	__u32		uncached_dst;
+	__u32		pcpu_dst;
+	__u32		dev_free_fib6;
+	__u32		dev_free_uncached;
+	__u32		dev_free_icmpv6;
+	__u32		dev_free_addrconf;
 };
 
 #define RTN_TL_ROOT	0x0001
diff --git a/net/core/dev.c b/net/core/dev.c
index 8ea6b4b42611..5c6fea78003f 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1,3 +1,4 @@
+
 /*
  *  NET3Protocol independent device support routines.
  *
@@ -7738,8 +7739,23 @@ static void netdev_wait_allrefs(struct net_device *dev)
 {
 	unsigned long rebroadcast_time, warning_time;
 	int refcnt;
+	struct net *net = dev_net(dev);
 
 	linkwatch_forget_dev(dev);
+	printk("%s: dev_name %s, icmpv6_dst %u, addrconf_dst %u, fib6_dst %u, pcpu_dst %u, cache_dst %u, uncached_list %u, dev_free_fib6 %u, dev_free_uncached %u, dev_free_icmpv6 %u, dev_free_addrconf %u, fib_rt_entries %u\n", __func__,
+	   dev->name,
+	   net->ipv6.rt6_stats->icmpv6_dst,
+	   net->ipv6.rt6_stats->addrconf_dst,
+	   net->ipv6.rt6_stats->fib6_dst,
+	   net->ipv6.rt6_stats->pcpu_dst,
+	   net->ipv6.rt6_stats->cache_dst,
+	   net->ipv6.rt6_stats->uncached_dst,
+	   net->ipv6.rt6_stats->dev_free_fib6,
+	   net->ipv6.rt6_stats->dev_free_uncached,
+	   net->ipv6.rt6_stats->dev_free_icmpv6,
+	   net->ipv6.rt6_stats->dev_free_addrconf,
+	   net->ipv6.rt6_stats->fib_rt_entries);
+
 
 	rebroadcast_time = warning_time = jiffies;
 	refcnt = netdev_refcnt_read(dev);
diff --git a/net/core/dst.c b/net/core/dst.c
index 00aa972ad1a1..8c20deec398b 100644
--- 

Re: unregister_netdevice: waiting for eth0 to become free. Usage count = 1

2017-08-11 Thread Cong Wang
Hi,

On Thu, Aug 10, 2017 at 11:12 AM, John Stultz  wrote:
> On Wed, Aug 9, 2017 at 10:41 PM, Wei Wang  wrote:
>> Hi John,
>>
>> Is it possible to try the attached patch?
>
> Thanks so much for the quick turn around!
>
> So I dropped all the reverts you suggested, and applied this one
> against 4.13-rc4, but I'm still seeing the problematic behavior.

Does the following one-line fix make a difference?

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index a640fbcba15d..c145a35763a0 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -141,7 +141,7 @@ static void rt6_uncached_list_del(struct rt6_info *rt)
struct uncached_list *ul = rt->rt6i_uncached_list;

spin_lock_bh(>lock);
-   list_del(>rt6i_uncached);
+   list_del_init(>rt6i_uncached);
spin_unlock_bh(>lock);
}
 }


Re: unregister_netdevice: waiting for eth0 to become free. Usage count = 1

2017-08-10 Thread Wei Wang
On Thu, Aug 10, 2017 at 11:12 AM, John Stultz  wrote:
> On Wed, Aug 9, 2017 at 10:41 PM, Wei Wang  wrote:
>> Hi John,
>>
>> Is it possible to try the attached patch?
>
> Thanks so much for the quick turn around!
>
> So I dropped all the reverts you suggested, and applied this one
> against 4.13-rc4, but I'm still seeing the problematic behavior.
>

Thanks for confirming.
I have been going through the code and not yet found any leaks.
I am also trying to reproduce the issue myself.
Martin seems to also see this issue.
I will continue investigating.

>> I am not sure if it actually fixes the issue. But I think it is worth a try.
>> Also, could you get me all the ipv6 routes when you plug in the usb
>> using "ip -6 route show"? (If you have multiple routing tables
>> configured, could you dump them all?)
>
> # ip -6 route show
> 2601:1c2:1002:83f0::/64 dev eth0  proto kernel  metric 256  expires
> 345599sec pref medium
> fe80::/64 dev eth0  proto kernel  metric 256  pref medium
> default via fe80::200:caff:fe11:2233 dev eth0  proto ra  metric 1024
> expires 1799sec hoplimit 64 pref medium
>
>
> After unplugging the device (and getting the unregister_netdevice errors):
> # ip -6 route show
> #
>

Thanks for the logs.


>
> thanks
> -john


Re: unregister_netdevice: waiting for eth0 to become free. Usage count = 1

2017-08-10 Thread John Stultz
On Wed, Aug 9, 2017 at 10:41 PM, Wei Wang  wrote:
> Hi John,
>
> Is it possible to try the attached patch?

Thanks so much for the quick turn around!

So I dropped all the reverts you suggested, and applied this one
against 4.13-rc4, but I'm still seeing the problematic behavior.

> I am not sure if it actually fixes the issue. But I think it is worth a try.
> Also, could you get me all the ipv6 routes when you plug in the usb
> using "ip -6 route show"? (If you have multiple routing tables
> configured, could you dump them all?)

# ip -6 route show
2601:1c2:1002:83f0::/64 dev eth0  proto kernel  metric 256  expires
345599sec pref medium
fe80::/64 dev eth0  proto kernel  metric 256  pref medium
default via fe80::200:caff:fe11:2233 dev eth0  proto ra  metric 1024
expires 1799sec hoplimit 64 pref medium


After unplugging the device (and getting the unregister_netdevice errors):
# ip -6 route show
#


thanks
-john


Re: unregister_netdevice: waiting for eth0 to become free. Usage count = 1

2017-08-09 Thread Wei Wang
Hi John,

Is it possible to try the attached patch?
I am not sure if it actually fixes the issue. But I think it is worth a try.
Also, could you get me all the ipv6 routes when you plug in the usb
using "ip -6 route show"? (If you have multiple routing tables
configured, could you dump them all?)

Thanks a lot.
Wei

On Wed, Aug 9, 2017 at 6:36 PM, Wei Wang  wrote:
> On Wed, Aug 9, 2017 at 6:26 PM, John Stultz  wrote:
>> On Wed, Aug 9, 2017 at 5:36 PM, Wei Wang  wrote:
>>> On Wed, Aug 9, 2017 at 4:44 PM, John Stultz  wrote:
 On Wed, Aug 9, 2017 at 4:34 PM, Cong Wang  wrote:
> (Cc'ing Wei whose commit was blamed)
>
> On Mon, Aug 7, 2017 at 2:15 PM, John Stultz  
> wrote:
>> On Mon, Aug 7, 2017 at 2:05 PM, John Stultz  
>> wrote:
>>> So, with recent testing with my HiKey board, I've been noticing some
>>> quirky behavior with my USB eth adapter.
>>>
>>> Basically, pluging the usb eth adapter in and then removing it, when
>>> plugging it back in I often find that its not detected, and the system
>>> slowly spits out the following message over and over:
>>>   unregister_netdevice: waiting for eth0 to become free. Usage count = 1
>>
>> The other bit is that after this starts printing, the board will no
>> longer reboot (it hangs continuing to occasionally print the above
>> message), and I have to manually reset the device.
>>
>
> So this warning is not temporarily shown but lasts until a reboot,
> right? If so it is a dst refcnt leak.

 Correct, once I get into the state it lasts until a reboot.

> How reproducible is it for you? From my reading, it seems always
> reproduced when you unplug and plug your usb eth interface?
> Is there anything else involved? For example, network namespace.

 So with 4.13-rc3/4 I seem to trigger it easily, often with the first
 unplug of the USB eth adapter.

 But as I get back closer to 4.12, it seemingly becomes harder to
 trigger, but sometimes still happens.

 So far, I've not been able to trigger it with 4.12.

 I don't think network namespaces are involved?  Though its out of my
 area, so AOSP may be using them these days.  Is there a simple way to
 check?

 I'll also do another bisection to see if the bad point moves back any 
 further.
>>
>> So I went through another bisection around and got  9514528d92d4 ipv6:
>> call dst_dev_put() properly as the first bad commit again.
>>
>>> If you see the problem starts to happen on commit
>>> 9514528d92d4cbe086499322370155ed69f5d06c, could you try reverting all
>>> the following commits:
>>> (from new to old)
>>> 1eb04e7c9e63 net: reorder all the dst flags
>>> a4c2fd7f7891 net: remove DST_NOCACHE flag
>>> b2a9c0ed75a3 net: remove DST_NOGC flag
>>> 5b7c9a8ff828 net: remove dst gc related code
>>> db916649b5dd ipv6: get rid of icmp6 dst garbage collector
>>> 587fea741134 ipv6: mark DST_NOGC and remove the operation of dst_free()
>>> ad65a2f05695 ipv6: call dst_hold_safe() properly
>>> 9514528d92d4 ipv6: call dst_dev_put() properly
>>
>>
>> And reverting this set off of 4.13-rc4 seems to make the issue go away.
>>
>> Is there anything I can test to help narrow down the specific problem
>> with that patchset?
>>
>
> Thanks John for confirming.
> Let me spend some time on the commits and I will let you know if I
> have some debug image for you to try.
>
> Wei
>
>
>> thanks
>> -john
From 93f2836679c81915b110ff56617f9f5dae2e6927 Mon Sep 17 00:00:00 2001
From: Wei Wang 
Date: Wed, 9 Aug 2017 22:27:36 -0700
Subject: [PATCH] ipv6: unregister netdev bug fix

Change-Id: I30fa739989ac50fbc7f4cbc6a04130005589cc25
---
 include/net/ip6_route.h |  1 +
 net/ipv6/addrconf.c | 10 +++---
 net/ipv6/anycast.c  |  3 ++-
 net/ipv6/route.c|  2 +-
 4 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/include/net/ip6_route.h b/include/net/ip6_route.h
index 907d39a42f6b..dec1424ce619 100644
--- a/include/net/ip6_route.h
+++ b/include/net/ip6_route.h
@@ -94,6 +94,7 @@ int ipv6_route_ioctl(struct net *net, unsigned int cmd, void __user *arg);
 int ip6_route_add(struct fib6_config *cfg, struct netlink_ext_ack *extack);
 int ip6_ins_rt(struct rt6_info *);
 int ip6_del_rt(struct rt6_info *);
+void rt6_uncached_list_add(struct rt6_info *rt);
 
 static inline int ip6_route_get_saddr(struct net *net, struct rt6_info *rt,
   const struct in6_addr *daddr,
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 3c46e9513a31..06a27addb93c 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -3079,7 +3079,8 @@ static void init_loopback(struct net_device *dev)
 			/* Failure cases are ignored */
 			if (!IS_ERR(sp_rt)) {
 sp_ifa->rt = sp_rt;
-ip6_ins_rt(sp_rt);
+if 

Re: unregister_netdevice: waiting for eth0 to become free. Usage count = 1

2017-08-09 Thread Wei Wang
On Wed, Aug 9, 2017 at 6:26 PM, John Stultz  wrote:
> On Wed, Aug 9, 2017 at 5:36 PM, Wei Wang  wrote:
>> On Wed, Aug 9, 2017 at 4:44 PM, John Stultz  wrote:
>>> On Wed, Aug 9, 2017 at 4:34 PM, Cong Wang  wrote:
 (Cc'ing Wei whose commit was blamed)

 On Mon, Aug 7, 2017 at 2:15 PM, John Stultz  wrote:
> On Mon, Aug 7, 2017 at 2:05 PM, John Stultz  
> wrote:
>> So, with recent testing with my HiKey board, I've been noticing some
>> quirky behavior with my USB eth adapter.
>>
>> Basically, pluging the usb eth adapter in and then removing it, when
>> plugging it back in I often find that its not detected, and the system
>> slowly spits out the following message over and over:
>>   unregister_netdevice: waiting for eth0 to become free. Usage count = 1
>
> The other bit is that after this starts printing, the board will no
> longer reboot (it hangs continuing to occasionally print the above
> message), and I have to manually reset the device.
>

 So this warning is not temporarily shown but lasts until a reboot,
 right? If so it is a dst refcnt leak.
>>>
>>> Correct, once I get into the state it lasts until a reboot.
>>>
 How reproducible is it for you? From my reading, it seems always
 reproduced when you unplug and plug your usb eth interface?
 Is there anything else involved? For example, network namespace.
>>>
>>> So with 4.13-rc3/4 I seem to trigger it easily, often with the first
>>> unplug of the USB eth adapter.
>>>
>>> But as I get back closer to 4.12, it seemingly becomes harder to
>>> trigger, but sometimes still happens.
>>>
>>> So far, I've not been able to trigger it with 4.12.
>>>
>>> I don't think network namespaces are involved?  Though its out of my
>>> area, so AOSP may be using them these days.  Is there a simple way to
>>> check?
>>>
>>> I'll also do another bisection to see if the bad point moves back any 
>>> further.
>
> So I went through another bisection around and got  9514528d92d4 ipv6:
> call dst_dev_put() properly as the first bad commit again.
>
>> If you see the problem starts to happen on commit
>> 9514528d92d4cbe086499322370155ed69f5d06c, could you try reverting all
>> the following commits:
>> (from new to old)
>> 1eb04e7c9e63 net: reorder all the dst flags
>> a4c2fd7f7891 net: remove DST_NOCACHE flag
>> b2a9c0ed75a3 net: remove DST_NOGC flag
>> 5b7c9a8ff828 net: remove dst gc related code
>> db916649b5dd ipv6: get rid of icmp6 dst garbage collector
>> 587fea741134 ipv6: mark DST_NOGC and remove the operation of dst_free()
>> ad65a2f05695 ipv6: call dst_hold_safe() properly
>> 9514528d92d4 ipv6: call dst_dev_put() properly
>
>
> And reverting this set off of 4.13-rc4 seems to make the issue go away.
>
> Is there anything I can test to help narrow down the specific problem
> with that patchset?
>

Thanks John for confirming.
Let me spend some time on the commits and I will let you know if I
have some debug image for you to try.

Wei


> thanks
> -john


Re: unregister_netdevice: waiting for eth0 to become free. Usage count = 1

2017-08-09 Thread John Stultz
On Wed, Aug 9, 2017 at 5:36 PM, Wei Wang  wrote:
> On Wed, Aug 9, 2017 at 4:44 PM, John Stultz  wrote:
>> On Wed, Aug 9, 2017 at 4:34 PM, Cong Wang  wrote:
>>> (Cc'ing Wei whose commit was blamed)
>>>
>>> On Mon, Aug 7, 2017 at 2:15 PM, John Stultz  wrote:
 On Mon, Aug 7, 2017 at 2:05 PM, John Stultz  wrote:
> So, with recent testing with my HiKey board, I've been noticing some
> quirky behavior with my USB eth adapter.
>
> Basically, pluging the usb eth adapter in and then removing it, when
> plugging it back in I often find that its not detected, and the system
> slowly spits out the following message over and over:
>   unregister_netdevice: waiting for eth0 to become free. Usage count = 1

 The other bit is that after this starts printing, the board will no
 longer reboot (it hangs continuing to occasionally print the above
 message), and I have to manually reset the device.

>>>
>>> So this warning is not temporarily shown but lasts until a reboot,
>>> right? If so it is a dst refcnt leak.
>>
>> Correct, once I get into the state it lasts until a reboot.
>>
>>> How reproducible is it for you? From my reading, it seems always
>>> reproduced when you unplug and plug your usb eth interface?
>>> Is there anything else involved? For example, network namespace.
>>
>> So with 4.13-rc3/4 I seem to trigger it easily, often with the first
>> unplug of the USB eth adapter.
>>
>> But as I get back closer to 4.12, it seemingly becomes harder to
>> trigger, but sometimes still happens.
>>
>> So far, I've not been able to trigger it with 4.12.
>>
>> I don't think network namespaces are involved?  Though its out of my
>> area, so AOSP may be using them these days.  Is there a simple way to
>> check?
>>
>> I'll also do another bisection to see if the bad point moves back any 
>> further.

So I went through another bisection around and got  9514528d92d4 ipv6:
call dst_dev_put() properly as the first bad commit again.

> If you see the problem starts to happen on commit
> 9514528d92d4cbe086499322370155ed69f5d06c, could you try reverting all
> the following commits:
> (from new to old)
> 1eb04e7c9e63 net: reorder all the dst flags
> a4c2fd7f7891 net: remove DST_NOCACHE flag
> b2a9c0ed75a3 net: remove DST_NOGC flag
> 5b7c9a8ff828 net: remove dst gc related code
> db916649b5dd ipv6: get rid of icmp6 dst garbage collector
> 587fea741134 ipv6: mark DST_NOGC and remove the operation of dst_free()
> ad65a2f05695 ipv6: call dst_hold_safe() properly
> 9514528d92d4 ipv6: call dst_dev_put() properly


And reverting this set off of 4.13-rc4 seems to make the issue go away.

Is there anything I can test to help narrow down the specific problem
with that patchset?

thanks
-john


Re: unregister_netdevice: waiting for eth0 to become free. Usage count = 1

2017-08-09 Thread John Stultz
On Wed, Aug 9, 2017 at 5:36 PM, Wei Wang  wrote:
>
> Does your USB adapter get an IPv6 address?

Yes, it does.

> If you see the problem starts to happen on commit
> 9514528d92d4cbe086499322370155ed69f5d06c, could you try reverting all
> the following commits:
> (from new to old)
> 1eb04e7c9e63 net: reorder all the dst flags
> a4c2fd7f7891 net: remove DST_NOCACHE flag
> b2a9c0ed75a3 net: remove DST_NOGC flag
> 5b7c9a8ff828 net: remove dst gc related code
> db916649b5dd ipv6: get rid of icmp6 dst garbage collector
> 587fea741134 ipv6: mark DST_NOGC and remove the operation of dst_free()
> ad65a2f05695 ipv6: call dst_hold_safe() properly
> 9514528d92d4 ipv6: call dst_dev_put() properly
>
> and try if it starts to work?

I'll give that a shot!

Thanks so much for the help! I really appreciate it!
-john


Re: unregister_netdevice: waiting for eth0 to become free. Usage count = 1

2017-08-09 Thread Wei Wang
On Wed, Aug 9, 2017 at 4:44 PM, John Stultz  wrote:
> On Wed, Aug 9, 2017 at 4:34 PM, Cong Wang  wrote:
>> (Cc'ing Wei whose commit was blamed)
>>
>> On Mon, Aug 7, 2017 at 2:15 PM, John Stultz  wrote:
>>> On Mon, Aug 7, 2017 at 2:05 PM, John Stultz  wrote:
 So, with recent testing with my HiKey board, I've been noticing some
 quirky behavior with my USB eth adapter.

 Basically, pluging the usb eth adapter in and then removing it, when
 plugging it back in I often find that its not detected, and the system
 slowly spits out the following message over and over:
   unregister_netdevice: waiting for eth0 to become free. Usage count = 1
>>>
>>> The other bit is that after this starts printing, the board will no
>>> longer reboot (it hangs continuing to occasionally print the above
>>> message), and I have to manually reset the device.
>>>
>>
>> So this warning is not temporarily shown but lasts until a reboot,
>> right? If so it is a dst refcnt leak.
>
> Correct, once I get into the state it lasts until a reboot.
>
>> How reproducible is it for you? From my reading, it seems always
>> reproduced when you unplug and plug your usb eth interface?
>> Is there anything else involved? For example, network namespace.
>
> So with 4.13-rc3/4 I seem to trigger it easily, often with the first
> unplug of the USB eth adapter.
>
> But as I get back closer to 4.12, it seemingly becomes harder to
> trigger, but sometimes still happens.
>
> So far, I've not been able to trigger it with 4.12.
>
> I don't think network namespaces are involved?  Though its out of my
> area, so AOSP may be using them these days.  Is there a simple way to
> check?
>
> I'll also do another bisection to see if the bad point moves back any further.
>
> thanks
> -john

Hi John,

Does your USB adapter get an IPv6 address?

If you see the problem starts to happen on commit
9514528d92d4cbe086499322370155ed69f5d06c, could you try reverting all
the following commits:
(from new to old)
1eb04e7c9e63 net: reorder all the dst flags
a4c2fd7f7891 net: remove DST_NOCACHE flag
b2a9c0ed75a3 net: remove DST_NOGC flag
5b7c9a8ff828 net: remove dst gc related code
db916649b5dd ipv6: get rid of icmp6 dst garbage collector
587fea741134 ipv6: mark DST_NOGC and remove the operation of dst_free()
ad65a2f05695 ipv6: call dst_hold_safe() properly
9514528d92d4 ipv6: call dst_dev_put() properly

and try if it starts to work?

By only reverting 9514528d92d4 definitely won't work as all the later
commits depend on this one.

Thanks a lot.
Wei


Re: unregister_netdevice: waiting for eth0 to become free. Usage count = 1

2017-08-09 Thread John Stultz
On Wed, Aug 9, 2017 at 4:34 PM, Cong Wang  wrote:
> (Cc'ing Wei whose commit was blamed)
>
> On Mon, Aug 7, 2017 at 2:15 PM, John Stultz  wrote:
>> On Mon, Aug 7, 2017 at 2:05 PM, John Stultz  wrote:
>>> So, with recent testing with my HiKey board, I've been noticing some
>>> quirky behavior with my USB eth adapter.
>>>
>>> Basically, pluging the usb eth adapter in and then removing it, when
>>> plugging it back in I often find that its not detected, and the system
>>> slowly spits out the following message over and over:
>>>   unregister_netdevice: waiting for eth0 to become free. Usage count = 1
>>
>> The other bit is that after this starts printing, the board will no
>> longer reboot (it hangs continuing to occasionally print the above
>> message), and I have to manually reset the device.
>>
>
> So this warning is not temporarily shown but lasts until a reboot,
> right? If so it is a dst refcnt leak.

Correct, once I get into the state it lasts until a reboot.

> How reproducible is it for you? From my reading, it seems always
> reproduced when you unplug and plug your usb eth interface?
> Is there anything else involved? For example, network namespace.

So with 4.13-rc3/4 I seem to trigger it easily, often with the first
unplug of the USB eth adapter.

But as I get back closer to 4.12, it seemingly becomes harder to
trigger, but sometimes still happens.

So far, I've not been able to trigger it with 4.12.

I don't think network namespaces are involved?  Though its out of my
area, so AOSP may be using them these days.  Is there a simple way to
check?

I'll also do another bisection to see if the bad point moves back any further.

thanks
-john


Re: unregister_netdevice: waiting for eth0 to become free. Usage count = 1

2017-08-09 Thread Cong Wang
(Cc'ing Wei whose commit was blamed)

On Mon, Aug 7, 2017 at 2:15 PM, John Stultz  wrote:
> On Mon, Aug 7, 2017 at 2:05 PM, John Stultz  wrote:
>> So, with recent testing with my HiKey board, I've been noticing some
>> quirky behavior with my USB eth adapter.
>>
>> Basically, pluging the usb eth adapter in and then removing it, when
>> plugging it back in I often find that its not detected, and the system
>> slowly spits out the following message over and over:
>>   unregister_netdevice: waiting for eth0 to become free. Usage count = 1
>
> The other bit is that after this starts printing, the board will no
> longer reboot (it hangs continuing to occasionally print the above
> message), and I have to manually reset the device.
>

So this warning is not temporarily shown but lasts until a reboot,
right? If so it is a dst refcnt leak.

How reproducible is it for you? From my reading, it seems always
reproduced when you unplug and plug your usb eth interface?
Is there anything else involved? For example, network namespace.

Thanks.


Re: unregister_netdevice: waiting for eth0 to become free. Usage count = 1

2017-08-07 Thread John Stultz
On Mon, Aug 7, 2017 at 2:05 PM, John Stultz  wrote:
> So, with recent testing with my HiKey board, I've been noticing some
> quirky behavior with my USB eth adapter.
>
> Basically, pluging the usb eth adapter in and then removing it, when
> plugging it back in I often find that its not detected, and the system
> slowly spits out the following message over and over:
>   unregister_netdevice: waiting for eth0 to become free. Usage count = 1

The other bit is that after this starts printing, the board will no
longer reboot (it hangs continuing to occasionally print the above
message), and I have to manually reset the device.

thanks
-john