Re: Long delays creating a netns after deleting one (possibly RCU related)

2016-11-18 Thread Eric Dumazet
On Fri, 2016-11-18 at 16:38 -0800, Jarno Rajahalme wrote:

> This fixes the problem for me, so for whatever it’s worth:
> 
> Tested-by: Jarno Rajahalme 
> 

Thanks for testing !

https://git.kernel.org/cgit/linux/kernel/git/davem/net.git/commit/?id=e88a2766143a27bfe6704b4493b214de4094cf29





Re: Long delays creating a netns after deleting one (possibly RCU related)

2016-11-18 Thread Jarno Rajahalme

> On Nov 14, 2016, at 3:09 PM, Eric Dumazet  wrote:
> 
> On Mon, 2016-11-14 at 14:46 -0800, Eric Dumazet wrote:
>> On Mon, 2016-11-14 at 16:12 -0600, Eric W. Biederman wrote:
>> 
>>> synchronize_rcu_expedited is not enough if you have multiple network
>>> devices in play.
>>> 
>>> Looking at the code it comes down to this commit, and it appears there
>>> is a promise to add rcu grace period combining by Eric Dumazet.
>>> 
>>> Eric, since people are hitting noticeable stalls because of the rcu
>>> grace period taking a long time, do you think you could look at this
>>> code path a bit more?
>>> 
>>> commit 93d05d4a320cb16712bb3d57a9658f395d8cecb9
>>> Author: Eric Dumazet 
>>> Date:   Wed Nov 18 06:31:03 2015 -0800
>> 
>> Absolutely, I will take a look asap.
> 
> The worst offender should be fixed by the following patch.
> 
> busy poll needs to poll the physical device, not a virtual one...
> 
> diff --git a/include/net/gro_cells.h b/include/net/gro_cells.h
> index d15214d673b2e8e08fd6437b572278fb1359f10d..2a1abbf8da74368cd01adc40cef6c0644e059ef2 100644
> --- a/include/net/gro_cells.h
> +++ b/include/net/gro_cells.h
> @@ -68,6 +68,9 @@ static inline int gro_cells_init(struct gro_cells *gcells, struct net_device *dev)
>  		struct gro_cell *cell = per_cpu_ptr(gcells->cells, i);
> 
>  		__skb_queue_head_init(&cell->napi_skbs);
> +
> +		set_bit(NAPI_STATE_NO_BUSY_POLL, &cell->napi.state);
> +
>  		netif_napi_add(dev, &cell->napi, gro_cell_poll, 64);
>  		napi_enable(&cell->napi);
>  	}
> 
> 
> 
> 
> 

This fixes the problem for me, so for whatever it’s worth:

Tested-by: Jarno Rajahalme 



Re: Long delays creating a netns after deleting one (possibly RCU related)

2016-11-17 Thread Jarno Rajahalme

> On Nov 14, 2016, at 3:09 PM, Eric Dumazet  wrote:
> 
> On Mon, 2016-11-14 at 14:46 -0800, Eric Dumazet wrote:
>> On Mon, 2016-11-14 at 16:12 -0600, Eric W. Biederman wrote:
>> 
>>> synchronize_rcu_expedited is not enough if you have multiple network
>>> devices in play.
>>> 
>>> Looking at the code it comes down to this commit, and it appears there
>>> is a promise to add rcu grace period combining by Eric Dumazet.
>>> 
>>> Eric, since people are hitting noticeable stalls because of the rcu
>>> grace period taking a long time, do you think you could look at this
>>> code path a bit more?
>>> 
>>> commit 93d05d4a320cb16712bb3d57a9658f395d8cecb9
>>> Author: Eric Dumazet 
>>> Date:   Wed Nov 18 06:31:03 2015 -0800
>> 
>> Absolutely, I will take a look asap.
> 
> The worst offender should be fixed by the following patch.
> 
> busy poll needs to poll the physical device, not a virtual one...
> 
> diff --git a/include/net/gro_cells.h b/include/net/gro_cells.h
> index d15214d673b2e8e08fd6437b572278fb1359f10d..2a1abbf8da74368cd01adc40cef6c0644e059ef2 100644
> --- a/include/net/gro_cells.h
> +++ b/include/net/gro_cells.h
> @@ -68,6 +68,9 @@ static inline int gro_cells_init(struct gro_cells *gcells, struct net_device *dev)
>  		struct gro_cell *cell = per_cpu_ptr(gcells->cells, i);
> 
>  		__skb_queue_head_init(&cell->napi_skbs);
> +
> +		set_bit(NAPI_STATE_NO_BUSY_POLL, &cell->napi.state);
> +
>  		netif_napi_add(dev, &cell->napi, gro_cell_poll, 64);
>  		napi_enable(&cell->napi);
>  	}
> 

This solved a ~20 second slowdown between OVS datapath unit tests for me.

  Jarno



Re: Long delays creating a netns after deleting one (possibly RCU related)

2016-11-14 Thread Eric Dumazet
On Mon, 2016-11-14 at 14:46 -0800, Eric Dumazet wrote:
> On Mon, 2016-11-14 at 16:12 -0600, Eric W. Biederman wrote:
> 
> > synchronize_rcu_expedited is not enough if you have multiple network
> > devices in play.
> > 
> > Looking at the code it comes down to this commit, and it appears there
> > is a promise to add rcu grace period combining by Eric Dumazet.
> > 
> > Eric, since people are hitting noticeable stalls because of the rcu
> > grace period taking a long time, do you think you could look at this
> > code path a bit more?
> > 
> > commit 93d05d4a320cb16712bb3d57a9658f395d8cecb9
> > Author: Eric Dumazet 
> > Date:   Wed Nov 18 06:31:03 2015 -0800
> 
> Absolutely, I will take a look asap.

The worst offender should be fixed by the following patch.

busy poll needs to poll the physical device, not a virtual one...

diff --git a/include/net/gro_cells.h b/include/net/gro_cells.h
index d15214d673b2e8e08fd6437b572278fb1359f10d..2a1abbf8da74368cd01adc40cef6c0644e059ef2 100644
--- a/include/net/gro_cells.h
+++ b/include/net/gro_cells.h
@@ -68,6 +68,9 @@ static inline int gro_cells_init(struct gro_cells *gcells, struct net_device *dev)
 		struct gro_cell *cell = per_cpu_ptr(gcells->cells, i);
 
 		__skb_queue_head_init(&cell->napi_skbs);
+
+		set_bit(NAPI_STATE_NO_BUSY_POLL, &cell->napi.state);
+
 		netif_napi_add(dev, &cell->napi, gro_cell_poll, 64);
 		napi_enable(&cell->napi);
 	}
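
For readers who want the full picture, the initializer with this hunk applied would read roughly as below. This is a sketch reconstructed from the diff against the include/net/gro_cells.h of that time, so surrounding lines may differ slightly from any given tree.

/* Sketch of gro_cells_init() with the hunk above applied: each per-CPU
 * GRO cell's NAPI context is marked NAPI_STATE_NO_BUSY_POLL before it
 * is registered, so these virtual NAPI instances are never added to
 * the busy-poll hash and their teardown no longer pays a per-device
 * synchronize_net() during netns cleanup.
 */
static inline int gro_cells_init(struct gro_cells *gcells, struct net_device *dev)
{
	int i;

	gcells->cells = alloc_percpu(struct gro_cell);
	if (!gcells->cells)
		return -ENOMEM;

	for_each_possible_cpu(i) {
		struct gro_cell *cell = per_cpu_ptr(gcells->cells, i);

		__skb_queue_head_init(&cell->napi_skbs);

		set_bit(NAPI_STATE_NO_BUSY_POLL, &cell->napi.state);

		netif_napi_add(dev, &cell->napi, gro_cell_poll, 64);
		napi_enable(&cell->napi);
	}
	return 0;
}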







Re: Long delays creating a netns after deleting one (possibly RCU related)

2016-11-14 Thread Eric Dumazet
On Mon, 2016-11-14 at 16:12 -0600, Eric W. Biederman wrote:

> synchronize_rcu_expedited is not enough if you have multiple network
> devices in play.
> 
> Looking at the code it comes down to this commit, and it appears there
> is a promise to add rcu grace period combining by Eric Dumazet.
> 
> Eric, since people are hitting noticeable stalls because of the rcu
> grace period taking a long time, do you think you could look at this
> code path a bit more?
> 
> commit 93d05d4a320cb16712bb3d57a9658f395d8cecb9
> Author: Eric Dumazet 
> Date:   Wed Nov 18 06:31:03 2015 -0800

Absolutely, I will take a look asap.

Thanks Eric.




Re: Long delays creating a netns after deleting one (possibly RCU related)

2016-11-14 Thread Eric W. Biederman
"Paul E. McKenney"  writes:

> On Mon, Nov 14, 2016 at 09:44:35AM -0800, Cong Wang wrote:
>> On Mon, Nov 14, 2016 at 8:24 AM, Paul E. McKenney
>>  wrote:
>> > On Sun, Nov 13, 2016 at 10:47:01PM -0800, Cong Wang wrote:
>> >> On Fri, Nov 11, 2016 at 4:55 PM, Cong Wang  
>> >> wrote:
>> >> > On Fri, Nov 11, 2016 at 4:23 PM, Paul E. McKenney
>> >> >  wrote:
>> >> >>
>> >> >> Ah!  This net_mutex is different than RTNL.  Should synchronize_net() 
>> >> >> be
>> >> >> modified to check for net_mutex being held in addition to the current
>> >> >> checks for RTNL being held?
>> >> >>
>> >> >
>> >> > Good point!
>> >> >
>> >> > Like commit be3fc413da9eb17cce0991f214ab0, checking
>> >> > for net_mutex for this case seems to be an optimization, I assume
>> >> > synchronize_rcu_expedited() and synchronize_rcu() have the same
>> >> > behavior...
>> >>
>> >> Thinking a bit more, I think commit be3fc413da9eb17cce0991f
>> >> gets rtnl_is_locked() wrong: the lock could be held by another
>> >> process, not by the current one, so it should be
>> >> lockdep_rtnl_is_held() which, however, is defined only when LOCKDEP
>> >> is enabled... Sigh.
>> >>
>> >> I don't see any better way than letting callers decide if they want the
>> >> expedited version or not, but this requires changes of all callers of
>> >> synchronize_net(). Hm.
>> >
>> > I must confess that I don't understand how it would help to use an
>> > expedited grace period when some other process is holding RTNL.
>> > In contrast, I do well understand how it helps when the current process
>> > is holding RTNL.
>> 
>> Yeah, this is exactly my point. And same for ASSERT_RTNL() which checks
>> rtnl_is_locked(), clearly we need to assert "it is held by the current 
>> process"
>> rather than "it is locked by whatever process".
>> 
>> But given that *_is_held() is always defined by LOCKDEP, we probably
>> need the mutex code to provide such a helper directly, and mutex->owner
>> is not always defined either. :-/
>
> There is always the option of making acquisition and release set a per-task
> variable that can be tested.  (Where did I put that asbestos suit, anyway?)
>
>   Thanx, Paul

synchronize_rcu_expedited is not enough if you have multiple network
devices in play.

Looking at the code it comes down to this commit, and it appears there
is a promise to add rcu grace period combining by Eric Dumazet.

Eric, since people are hitting noticeable stalls because of the rcu grace
period taking a long time, do you think you could look at this code path
a bit more?

commit 93d05d4a320cb16712bb3d57a9658f395d8cecb9
Author: Eric Dumazet 
Date:   Wed Nov 18 06:31:03 2015 -0800

net: provide generic busy polling to all NAPI drivers

NAPI drivers no longer need to observe a particular protocol
to benefit from busy polling (CONFIG_NET_RX_BUSY_POLL=y)

napi_hash_add() and napi_hash_del() are automatically called
from core networking stack, respectively from
netif_napi_add() and netif_napi_del()

This patch depends on free_netdev() and netif_napi_del() being
called from process context, which seems to be the norm.

Drivers might still prefer to call napi_hash_del() on their
own, since they might combine all the rcu grace periods into
a single one, knowing their NAPI structures lifetime, while
core networking stack has no idea of a possible combining.

Once this patch proves to not bring serious regressions,
we will cleanup drivers to either remove napi_hash_del()
or provide appropriate rcu grace periods combining.

Signed-off-by: Eric Dumazet 
Signed-off-by: David S. Miller 

Eric
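
The "rcu grace periods combining" the commit message mentions looks, in drivers that manage several NAPI contexts, roughly like the sketch below. The priv, napi[] and n_rings names are hypothetical placeholders, not code from any particular driver.

/* Hypothetical sketch of grace-period combining for a driver owning
 * several NAPI contexts: unhash all of them first, wait for a single
 * grace period, then delete them, instead of letting netif_napi_del()
 * synchronize once per instance.
 */
int i;

for (i = 0; i < priv->n_rings; i++)
	napi_hash_del(&priv->napi[i]);

synchronize_net();	/* one grace period covers every unhashed NAPI */

for (i = 0; i < priv->n_rings; i++)
	netif_napi_del(&priv->napi[i]);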


Re: Long delays creating a netns after deleting one (possibly RCU related)

2016-11-14 Thread Paul E. McKenney
On Mon, Nov 14, 2016 at 09:44:35AM -0800, Cong Wang wrote:
> On Mon, Nov 14, 2016 at 8:24 AM, Paul E. McKenney
>  wrote:
> > On Sun, Nov 13, 2016 at 10:47:01PM -0800, Cong Wang wrote:
> >> On Fri, Nov 11, 2016 at 4:55 PM, Cong Wang  
> >> wrote:
> >> > On Fri, Nov 11, 2016 at 4:23 PM, Paul E. McKenney
> >> >  wrote:
> >> >>
> >> >> Ah!  This net_mutex is different than RTNL.  Should synchronize_net() be
> >> >> modified to check for net_mutex being held in addition to the current
> >> >> checks for RTNL being held?
> >> >>
> >> >
> >> > Good point!
> >> >
> >> > Like commit be3fc413da9eb17cce0991f214ab0, checking
> >> > for net_mutex for this case seems to be an optimization, I assume
> >> > synchronize_rcu_expedited() and synchronize_rcu() have the same
> >> > behavior...
> >>
> >> Thinking a bit more, I think commit be3fc413da9eb17cce0991f
> >> gets rtnl_is_locked() wrong: the lock could be held by another
> >> process, not by the current one, so it should be
> >> lockdep_rtnl_is_held() which, however, is defined only when LOCKDEP
> >> is enabled... Sigh.
> >>
> >> I don't see any better way than letting callers decide if they want the
> >> expedited version or not, but this requires changes of all callers of
> >> synchronize_net(). Hm.
> >
> > I must confess that I don't understand how it would help to use an
> > expedited grace period when some other process is holding RTNL.
> > In contrast, I do well understand how it helps when the current process
> > is holding RTNL.
> 
> Yeah, this is exactly my point. And same for ASSERT_RTNL() which checks
> rtnl_is_locked(), clearly we need to assert "it is held by the current 
> process"
> rather than "it is locked by whatever process".
> 
> But given that *_is_held() is always defined by LOCKDEP, we probably
> need the mutex code to provide such a helper directly, and mutex->owner
> is not always defined either. :-/

There is always the option of making acquisition and release set a per-task
variable that can be tested.  (Where did I put that asbestos suit, anyway?)

Thanx, Paul
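
One way to read the per-task-variable suggestion is sketched below: record the owner when net_mutex is taken, so "held by the current task" can be tested without lockdep. The helpers and the owner variable are hypothetical, purely to illustrate the shape of the idea; a field in task_struct would generalize it to other locks, at the cost of touching every acquisition and release.

/* Hypothetical helpers only -- nothing below exists in the kernel.
 * A single owner pointer gives the "held by me" test for this one
 * mutex.
 */
static struct task_struct *net_mutex_owner;

static void net_mutex_acquire(void)
{
	mutex_lock(&net_mutex);
	net_mutex_owner = current;	/* only the owner ever writes this */
}

static void net_mutex_release(void)
{
	net_mutex_owner = NULL;
	mutex_unlock(&net_mutex);
}

static bool net_mutex_held_by_me(void)
{
	return READ_ONCE(net_mutex_owner) == current;
}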



Re: Long delays creating a netns after deleting one (possibly RCU related)

2016-11-14 Thread Cong Wang
On Mon, Nov 14, 2016 at 8:24 AM, Paul E. McKenney
 wrote:
> On Sun, Nov 13, 2016 at 10:47:01PM -0800, Cong Wang wrote:
>> On Fri, Nov 11, 2016 at 4:55 PM, Cong Wang  wrote:
>> > On Fri, Nov 11, 2016 at 4:23 PM, Paul E. McKenney
>> >  wrote:
>> >>
>> >> Ah!  This net_mutex is different than RTNL.  Should synchronize_net() be
>> >> modified to check for net_mutex being held in addition to the current
>> >> checks for RTNL being held?
>> >>
>> >
>> > Good point!
>> >
>> > Like commit be3fc413da9eb17cce0991f214ab0, checking
>> > for net_mutex for this case seems to be an optimization, I assume
>> > synchronize_rcu_expedited() and synchronize_rcu() have the same
>> > behavior...
>>
>> Thinking a bit more, I think commit be3fc413da9eb17cce0991f
>> gets rtnl_is_locked() wrong: the lock could be held by another
>> process, not by the current one, so it should be
>> lockdep_rtnl_is_held() which, however, is defined only when LOCKDEP
>> is enabled... Sigh.
>>
>> I don't see any better way than letting callers decide if they want the
>> expedited version or not, but this requires changes of all callers of
>> synchronize_net(). Hm.
>
> I must confess that I don't understand how it would help to use an
> expedited grace period when some other process is holding RTNL.
> In contrast, I do well understand how it helps when the current process
> is holding RTNL.

Yeah, this is exactly my point. And same for ASSERT_RTNL() which checks
rtnl_is_locked(), clearly we need to assert "it is held by the current process"
rather than "it is locked by whatever process".

But given that *_is_held() is always defined by LOCKDEP, we probably
need the mutex code to provide such a helper directly, and mutex->owner
is not always defined either. :-/


Re: Long delays creating a netns after deleting one (possibly RCU related)

2016-11-14 Thread Hannes Frederic Sowa
Hi Cong,

On Sat, Nov 12, 2016, at 01:55, Cong Wang wrote:
> On Fri, Nov 11, 2016 at 4:23 PM, Paul E. McKenney
>  wrote:
> >
> > Ah!  This net_mutex is different than RTNL.  Should synchronize_net() be
> > modified to check for net_mutex being held in addition to the current
> > checks for RTNL being held?
> >
> 
> Good point!
> 
> Like commit be3fc413da9eb17cce0991f214ab0, checking
> for net_mutex for this case seems to be an optimization, I assume
> synchronize_rcu_expedited() and synchronize_rcu() have the same
> behavior...
> 
> diff --git a/net/core/dev.c b/net/core/dev.c
> index eaad4c2..3415b6b 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -7762,7 +7762,7 @@ EXPORT_SYMBOL(free_netdev);
>  void synchronize_net(void)
>  {
>  	might_sleep();
> -	if (rtnl_is_locked())
> +	if (rtnl_is_locked() || lockdep_is_held(&net_mutex))
>  		synchronize_rcu_expedited();

I don't think we should depend on lockdep for this check but rather use
mutex_is_locked here (I think it would fail to build like this without
CONFIG_LOCKDEP).

Bye,
Hannes
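
Putting Hannes' suggestion into Cong's diff gives roughly the following; a sketch only. Note that mutex_is_locked() reports that net_mutex is held by some task, not necessarily the current one, which is the caveat Paul and Cong discuss elsewhere in the thread.

/* Sketch: Cong's synchronize_net() change with mutex_is_locked()
 * instead of lockdep_is_held(), so it also builds without
 * CONFIG_LOCKDEP. mutex_is_locked() is true whenever *any* task holds
 * net_mutex, not only when the caller does.
 */
void synchronize_net(void)
{
	might_sleep();
	if (rtnl_is_locked() || mutex_is_locked(&net_mutex))
		synchronize_rcu_expedited();
	else
		synchronize_rcu();
}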


Re: Long delays creating a netns after deleting one (possibly RCU related)

2016-11-14 Thread Paul E. McKenney
On Sun, Nov 13, 2016 at 10:47:01PM -0800, Cong Wang wrote:
> On Fri, Nov 11, 2016 at 4:55 PM, Cong Wang  wrote:
> > On Fri, Nov 11, 2016 at 4:23 PM, Paul E. McKenney
> >  wrote:
> >>
> >> Ah!  This net_mutex is different than RTNL.  Should synchronize_net() be
> >> modified to check for net_mutex being held in addition to the current
> >> checks for RTNL being held?
> >>
> >
> > Good point!
> >
> > Like commit be3fc413da9eb17cce0991f214ab0, checking
> > for net_mutex for this case seems to be an optimization, I assume
> > synchronize_rcu_expedited() and synchronize_rcu() have the same
> > behavior...
> 
> Thinking a bit more, I think commit be3fc413da9eb17cce0991f
> gets rtnl_is_locked() wrong: the lock could be held by another
> process, not by the current one, so it should be
> lockdep_rtnl_is_held() which, however, is defined only when LOCKDEP
> is enabled... Sigh.
> 
> I don't see any better way than letting callers decide if they want the
> expedited version or not, but this requires changes of all callers of
> synchronize_net(). Hm.

I must confess that I don't understand how it would help to use an
expedited grace period when some other process is holding RTNL.
In contrast, I do well understand how it helps when the current process
is holding RTNL.

So what am I missing here?

Thanx, Paul



Re: Long delays creating a netns after deleting one (possibly RCU related)

2016-11-13 Thread Cong Wang
On Fri, Nov 11, 2016 at 4:55 PM, Cong Wang  wrote:
> On Fri, Nov 11, 2016 at 4:23 PM, Paul E. McKenney
>  wrote:
>>
>> Ah!  This net_mutex is different than RTNL.  Should synchronize_net() be
>> modified to check for net_mutex being held in addition to the current
>> checks for RTNL being held?
>>
>
> Good point!
>
> Like commit be3fc413da9eb17cce0991f214ab0, checking
> for net_mutex for this case seems to be an optimization, I assume
> synchronize_rcu_expedited() and synchronize_rcu() have the same
> behavior...

Thinking a bit more, I think commit be3fc413da9eb17cce0991f
gets rtnl_is_locked() wrong: the lock could be held by another
process, not by the current one, so it should be
lockdep_rtnl_is_held() which, however, is defined only when LOCKDEP
is enabled... Sigh.

I don't see any better way than letting callers decide if they want the
expedited version or not, but this requires changes of all callers of
synchronize_net(). Hm.


Re: Long delays creating a netns after deleting one (possibly RCU related)

2016-11-11 Thread Cong Wang
On Fri, Nov 11, 2016 at 4:23 PM, Paul E. McKenney
 wrote:
>
> Ah!  This net_mutex is different than RTNL.  Should synchronize_net() be
> modified to check for net_mutex being held in addition to the current
> checks for RTNL being held?
>

Good point!

Like commit be3fc413da9eb17cce0991f214ab0, checking
for net_mutex for this case seems to be an optimization, I assume
synchronize_rcu_expedited() and synchronize_rcu() have the same
behavior...

diff --git a/net/core/dev.c b/net/core/dev.c
index eaad4c2..3415b6b 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -7762,7 +7762,7 @@ EXPORT_SYMBOL(free_netdev);
 void synchronize_net(void)
 {
 	might_sleep();
-	if (rtnl_is_locked())
+	if (rtnl_is_locked() || lockdep_is_held(&net_mutex))
 		synchronize_rcu_expedited();
 	else
 		synchronize_rcu();


Re: Long delays creating a netns after deleting one (possibly RCU related)

2016-11-11 Thread Paul E. McKenney
On Fri, Nov 11, 2016 at 01:11:01PM +, Rolf Neugebauer wrote:
> On Thu, Nov 10, 2016 at 9:24 PM, Paul E. McKenney
>  wrote:
> > On Thu, Nov 10, 2016 at 09:37:47AM -0800, Cong Wang wrote:
> >> (Cc'ing Paul)
> >>
> >> On Wed, Nov 9, 2016 at 7:42 AM, Rolf Neugebauer
> >>  wrote:
> >> > Hi
> >> >
> >> > We noticed some long delays starting docker containers on some newer
> >> > kernels (starting with 4.5.x and still present in 4.9-rc4, 4.4.x is
> >> > fine). We narrowed this down to the creation of a network namespace
> >> > being delayed directly after removing another one (details and
> >> > reproduction below). We have seen delays of up to 60s on some systems.
> >> >
> >> > - The delay is proportional to the number of CPUs (online or offline).
> >> > We first discovered it with a Hyper-V Linux VM. Hyper-V advertises up
> >> > to 240 offline vCPUs even if one configures the VM with only, say 2
> >> > vCPUs. We see linear increase in delay when we change NR_CPUS in the
> >> > kernel config.
> >> >
> >> > - The delay is also dependent on some tunnel network interfaces being
> >> > present (which we had compiled in in one of our kernel configs).
> >> >
> >> > - We can reproduce this issue with stock kernels from
> >> > http://kernel.ubuntu.com/~kernel-ppa/mainline/ running in Hyper-V VMs
> >> > as well as other hypervisors like qemu and hyperkit where we have good
> >> > control over the number of CPUs.
> >> >
> >> > A simple test is:
> >> > modprobe ipip
> >> > modprobe ip_gre
> >> > modprobe ip_vti
> >> > echo -n "add netns foo ===> "; /usr/bin/time -f "%E" ip netns add foo
> >> > echo -n "del netns foo ===> "; /usr/bin/time -f "%E" ip netns delete foo
> >> > echo -n "add netns bar ===> "; /usr/bin/time -f "%E" ip netns add bar
> >> > echo -n "del netns bar ===> "; /usr/bin/time -f "%E" ip netns delete bar
> >> >
> >> > with an output like:
> >> > add netns foo ===> 0:00.00
> >> > del netns foo ===> 0:00.01
> >> > add netns bar ===> 0:08.53
> >> > del netns bar ===> 0:00.01
> >> >
> >> > This is on a 4.9-rc4 kernel from the above URL configured with
> >> > NR_CPUS=256 running in a Hyper-V VM (kernel config attached).
> >> >
> >> > Below is a dump of the work queues while the second 'ip add netns' is
> >> > hanging. The state of the work queues does not seem to change while
> >> > the command is delayed and the pattern shown is consistent across
> >> > different kernel versions.
> >> >
> >> > Is this a known issue and/or is someone working on a fix?
> >>
> >> Not to me.
> >>
> >>
> >> >
> >> > [  610.356272] sysrq: SysRq : Show Blocked State
> >> > [  610.356742]   taskPC stack   pid father
> >> > [  610.357252] kworker/u480:1  D0  1994  2 0x
> >> > [  610.357752] Workqueue: netns cleanup_net
> >> > [  610.358239]  9892f1065800  9892ee1e1e00
> >> > 9892f8e59340
> >> > [  610.358705]  9892f4526900 bf0104b5ba88 be486df3
> >> > bf0104b5ba60
> >> > [  610.359168]  00ffbdcbe663 9892f8e59340 000100012e70
> >> > 9892ee1e1e00
> >> > [  610.359677] Call Trace:
> >> > [  610.360169]  [] ? __schedule+0x233/0x6e0
> >> > [  610.360723]  [] schedule+0x36/0x80
> >> > [  610.361194]  [] schedule_timeout+0x22a/0x3f0
> >> > [  610.361789]  [] ? __schedule+0x23b/0x6e0
> >> > [  610.362260]  [] wait_for_completion+0xb4/0x140
> >> > [  610.362736]  [] ? wake_up_q+0x80/0x80
> >> > [  610.363306]  [] __wait_rcu_gp+0xc8/0xf0
> >> > [  610.363782]  [] synchronize_sched+0x5c/0x80
> >> > [  610.364137]  [] ? call_rcu_bh+0x20/0x20
> >> > [  610.364742]  [] ?
> >> > trace_raw_output_rcu_utilization+0x60/0x60
> >> > [  610.365337]  [] synchronize_net+0x1c/0x30
> >>
> >> This is a worker which holds the net_mutex and is waiting for
> >> a RCU grace period to elapse.

Ah!  This net_mutex is different than RTNL.  Should synchronize_net() be
modified to check for net_mutex being held in addition to the current
checks for RTNL being held?

Thanx, Paul

> >> > [  610.365846]  [] netif_napi_del+0x23/0x80
> >> > [  610.367494]  [] ip_tunnel_dev_free+0x68/0xf0 
> >> > [ip_tunnel]
> >> > [  610.368007]  [] netdev_run_todo+0x230/0x330
> >> > [  610.368454]  [] rtnl_unlock+0xe/0x10
> >> > [  610.369001]  [] ip_tunnel_delete_net+0xdf/0x120 
> >> > [ip_tunnel]
> >> > [  610.369500]  [] ipip_exit_net+0x2c/0x30 [ipip]
> >> > [  610.369997]  [] ops_exit_list.isra.4+0x38/0x60
> >> > [  610.370636]  [] cleanup_net+0x1c4/0x2b0
> >> > [  610.371130]  [] process_one_work+0x1fc/0x4b0
> >> > [  610.371812]  [] worker_thread+0x4b/0x500
> >> > [  610.373074]  [] ? process_one_work+0x4b0/0x4b0
> >> > [  610.373622]  [] ? process_one_work+0x4b0/0x4b0
> >> > [  610.374100]  [] kthread+0xd9/0xf0
> >> > [  610.374574]  [] ? kthread_park+0x60/0x60
> >> > [  610.375198]  [] ret_from_fork+0x25/0x30
> >> > [  610.375678] ip  D0  2149   2148 

Re: Long delays creating a netns after deleting one (possibly RCU related)

2016-11-11 Thread Rolf Neugebauer
On Thu, Nov 10, 2016 at 9:24 PM, Paul E. McKenney
 wrote:
> On Thu, Nov 10, 2016 at 09:37:47AM -0800, Cong Wang wrote:
>> (Cc'ing Paul)
>>
>> On Wed, Nov 9, 2016 at 7:42 AM, Rolf Neugebauer
>>  wrote:
>> > Hi
>> >
>> > We noticed some long delays starting docker containers on some newer
>> > kernels (starting with 4.5.x and still present in 4.9-rc4, 4.4.x is
>> > fine). We narrowed this down to the creation of a network namespace
>> > being delayed directly after removing another one (details and
>> > reproduction below). We have seen delays of up to 60s on some systems.
>> >
>> > - The delay is proportional to the number of CPUs (online or offline).
>> > We first discovered it with a Hyper-V Linux VM. Hyper-V advertises up
>> > to 240 offline vCPUs even if one configures the VM with only, say 2
>> > vCPUs. We see linear increase in delay when we change NR_CPUS in the
>> > kernel config.
>> >
>> > - The delay is also dependent on some tunnel network interfaces being
>> > present (which we had compiled in in one of our kernel configs).
>> >
>> > - We can reproduce this issue with stock kernels from
>> > http://kernel.ubuntu.com/~kernel-ppa/mainline/ running in Hyper-V VMs
>> > as well as other hypervisors like qemu and hyperkit where we have good
>> > control over the number of CPUs.
>> >
>> > A simple test is:
>> > modprobe ipip
>> > modprobe ip_gre
>> > modprobe ip_vti
>> > echo -n "add netns foo ===> "; /usr/bin/time -f "%E" ip netns add foo
>> > echo -n "del netns foo ===> "; /usr/bin/time -f "%E" ip netns delete foo
>> > echo -n "add netns bar ===> "; /usr/bin/time -f "%E" ip netns add bar
>> > echo -n "del netns bar ===> "; /usr/bin/time -f "%E" ip netns delete bar
>> >
>> > with an output like:
>> > add netns foo ===> 0:00.00
>> > del netns foo ===> 0:00.01
>> > add netns bar ===> 0:08.53
>> > del netns bar ===> 0:00.01
>> >
>> > This is on a 4.9-rc4 kernel from the above URL configured with
>> > NR_CPUS=256 running in a Hyper-V VM (kernel config attached).
>> >
>> > Below is a dump of the work queues while the second 'ip add netns' is
>> > hanging. The state of the work queues does not seem to change while
>> > the command is delayed and the pattern shown is consistent across
>> > different kernel versions.
>> >
>> > Is this a known issue and/or is someone working on a fix?
>>
>> Not to me.
>>
>>
>> >
>> > [  610.356272] sysrq: SysRq : Show Blocked State
>> > [  610.356742]   taskPC stack   pid father
>> > [  610.357252] kworker/u480:1  D0  1994  2 0x
>> > [  610.357752] Workqueue: netns cleanup_net
>> > [  610.358239]  9892f1065800  9892ee1e1e00
>> > 9892f8e59340
>> > [  610.358705]  9892f4526900 bf0104b5ba88 be486df3
>> > bf0104b5ba60
>> > [  610.359168]  00ffbdcbe663 9892f8e59340 000100012e70
>> > 9892ee1e1e00
>> > [  610.359677] Call Trace:
>> > [  610.360169]  [] ? __schedule+0x233/0x6e0
>> > [  610.360723]  [] schedule+0x36/0x80
>> > [  610.361194]  [] schedule_timeout+0x22a/0x3f0
>> > [  610.361789]  [] ? __schedule+0x23b/0x6e0
>> > [  610.362260]  [] wait_for_completion+0xb4/0x140
>> > [  610.362736]  [] ? wake_up_q+0x80/0x80
>> > [  610.363306]  [] __wait_rcu_gp+0xc8/0xf0
>> > [  610.363782]  [] synchronize_sched+0x5c/0x80
>> > [  610.364137]  [] ? call_rcu_bh+0x20/0x20
>> > [  610.364742]  [] ?
>> > trace_raw_output_rcu_utilization+0x60/0x60
>> > [  610.365337]  [] synchronize_net+0x1c/0x30
>>
>> This is a worker which holds the net_mutex and is waiting for
>> a RCU grace period to elapse.
>>
>>
>> > [  610.365846]  [] netif_napi_del+0x23/0x80
>> > [  610.367494]  [] ip_tunnel_dev_free+0x68/0xf0 
>> > [ip_tunnel]
>> > [  610.368007]  [] netdev_run_todo+0x230/0x330
>> > [  610.368454]  [] rtnl_unlock+0xe/0x10
>> > [  610.369001]  [] ip_tunnel_delete_net+0xdf/0x120 
>> > [ip_tunnel]
>> > [  610.369500]  [] ipip_exit_net+0x2c/0x30 [ipip]
>> > [  610.369997]  [] ops_exit_list.isra.4+0x38/0x60
>> > [  610.370636]  [] cleanup_net+0x1c4/0x2b0
>> > [  610.371130]  [] process_one_work+0x1fc/0x4b0
>> > [  610.371812]  [] worker_thread+0x4b/0x500
>> > [  610.373074]  [] ? process_one_work+0x4b0/0x4b0
>> > [  610.373622]  [] ? process_one_work+0x4b0/0x4b0
>> > [  610.374100]  [] kthread+0xd9/0xf0
>> > [  610.374574]  [] ? kthread_park+0x60/0x60
>> > [  610.375198]  [] ret_from_fork+0x25/0x30
>> > [  610.375678] ip  D0  2149   2148 0x
>> > [  610.376185]  9892f0a99000  9892f0a66900
>> > [  610.376185]  9892f8e59340
>> > [  610.376185]  9892f4526900 bf0101173db8 be486df3
>> > [  610.376753]  0005fecffd76
>> > [  610.376762]  00ff9892f11d9820 9892f8e59340 9892
>> > 9892f0a66900
>> > [  610.377274] Call Trace:
>> > [  610.377789]  [] ? __schedule+0x233/0x6e0
>> > [  610.378306]  [] schedule+0x36/0x80
>> > [  610.378992]  [] schedule_preempt_disabled+0xe/0x10

Re: Long delays creating a netns after deleting one (possibly RCU related)

2016-11-10 Thread Paul E. McKenney
On Thu, Nov 10, 2016 at 09:37:47AM -0800, Cong Wang wrote:
> (Cc'ing Paul)
> 
> On Wed, Nov 9, 2016 at 7:42 AM, Rolf Neugebauer
>  wrote:
> > Hi
> >
> > We noticed some long delays starting docker containers on some newer
> > kernels (starting with 4.5.x and still present in 4.9-rc4, 4.4.x is
> > fine). We narrowed this down to the creation of a network namespace
> > being delayed directly after removing another one (details and
> > reproduction below). We have seen delays of up to 60s on some systems.
> >
> > - The delay is proportional to the number of CPUs (online or offline).
> > We first discovered it with a Hyper-V Linux VM. Hyper-V advertises up
> > to 240 offline vCPUs even if one configures the VM with only, say 2
> > vCPUs. We see linear increase in delay when we change NR_CPUS in the
> > kernel config.
> >
> > - The delay is also dependent on some tunnel network interfaces being
> > present (which we had compiled in in one of our kernel configs).
> >
> > - We can reproduce this issue with stock kernels from
> > http://kernel.ubuntu.com/~kernel-ppa/mainline/ running in Hyper-V VMs
> > as well as other hypervisors like qemu and hyperkit where we have good
> > control over the number of CPUs.
> >
> > A simple test is:
> > modprobe ipip
> > modprobe ip_gre
> > modprobe ip_vti
> > echo -n "add netns foo ===> "; /usr/bin/time -f "%E" ip netns add foo
> > echo -n "del netns foo ===> "; /usr/bin/time -f "%E" ip netns delete foo
> > echo -n "add netns bar ===> "; /usr/bin/time -f "%E" ip netns add bar
> > echo -n "del netns bar ===> "; /usr/bin/time -f "%E" ip netns delete bar
> >
> > with an output like:
> > add netns foo ===> 0:00.00
> > del netns foo ===> 0:00.01
> > add netns bar ===> 0:08.53
> > del netns bar ===> 0:00.01
> >
> > This is on a 4.9-rc4 kernel from the above URL configured with
> > NR_CPUS=256 running in a Hyper-V VM (kernel config attached).
> >
> > Below is a dump of the work queues while the second 'ip add netns' is
> > hanging. The state of the work queues does not seem to change while
> > the command is delayed and the pattern shown is consistent across
> > different kernel versions.
> >
> > Is this a known issue and/or is someone working on a fix?
> 
> Not to me.
> 
> 
> >
> > [  610.356272] sysrq: SysRq : Show Blocked State
> > [  610.356742]   taskPC stack   pid father
> > [  610.357252] kworker/u480:1  D0  1994  2 0x
> > [  610.357752] Workqueue: netns cleanup_net
> > [  610.358239]  9892f1065800  9892ee1e1e00
> > 9892f8e59340
> > [  610.358705]  9892f4526900 bf0104b5ba88 be486df3
> > bf0104b5ba60
> > [  610.359168]  00ffbdcbe663 9892f8e59340 000100012e70
> > 9892ee1e1e00
> > [  610.359677] Call Trace:
> > [  610.360169]  [] ? __schedule+0x233/0x6e0
> > [  610.360723]  [] schedule+0x36/0x80
> > [  610.361194]  [] schedule_timeout+0x22a/0x3f0
> > [  610.361789]  [] ? __schedule+0x23b/0x6e0
> > [  610.362260]  [] wait_for_completion+0xb4/0x140
> > [  610.362736]  [] ? wake_up_q+0x80/0x80
> > [  610.363306]  [] __wait_rcu_gp+0xc8/0xf0
> > [  610.363782]  [] synchronize_sched+0x5c/0x80
> > [  610.364137]  [] ? call_rcu_bh+0x20/0x20
> > [  610.364742]  [] ?
> > trace_raw_output_rcu_utilization+0x60/0x60
> > [  610.365337]  [] synchronize_net+0x1c/0x30
> 
> This is a worker which holds the net_mutex and is waiting for
> a RCU grace period to elapse.
> 
> 
> > [  610.365846]  [] netif_napi_del+0x23/0x80
> > [  610.367494]  [] ip_tunnel_dev_free+0x68/0xf0 
> > [ip_tunnel]
> > [  610.368007]  [] netdev_run_todo+0x230/0x330
> > [  610.368454]  [] rtnl_unlock+0xe/0x10
> > [  610.369001]  [] ip_tunnel_delete_net+0xdf/0x120 
> > [ip_tunnel]
> > [  610.369500]  [] ipip_exit_net+0x2c/0x30 [ipip]
> > [  610.369997]  [] ops_exit_list.isra.4+0x38/0x60
> > [  610.370636]  [] cleanup_net+0x1c4/0x2b0
> > [  610.371130]  [] process_one_work+0x1fc/0x4b0
> > [  610.371812]  [] worker_thread+0x4b/0x500
> > [  610.373074]  [] ? process_one_work+0x4b0/0x4b0
> > [  610.373622]  [] ? process_one_work+0x4b0/0x4b0
> > [  610.374100]  [] kthread+0xd9/0xf0
> > [  610.374574]  [] ? kthread_park+0x60/0x60
> > [  610.375198]  [] ret_from_fork+0x25/0x30
> > [  610.375678] ip  D0  2149   2148 0x
> > [  610.376185]  9892f0a99000  9892f0a66900
> > [  610.376185]  9892f8e59340
> > [  610.376185]  9892f4526900 bf0101173db8 be486df3
> > [  610.376753]  0005fecffd76
> > [  610.376762]  00ff9892f11d9820 9892f8e59340 9892
> > 9892f0a66900
> > [  610.377274] Call Trace:
> > [  610.377789]  [] ? __schedule+0x233/0x6e0
> > [  610.378306]  [] schedule+0x36/0x80
> > [  610.378992]  [] schedule_preempt_disabled+0xe/0x10
> > [  610.379514]  [] __mutex_lock_slowpath+0xb9/0x130
> > [  610.380031]  [] ? __kmalloc+0x162/0x1e0
> > [  610.380556]  [] mutex_lock+0x1f/0x30
> > [  610.381135]  [] 

Re: Long delays creating a netns after deleting one (possibly RCU related)

2016-11-10 Thread Cong Wang
(Cc'ing Paul)

On Wed, Nov 9, 2016 at 7:42 AM, Rolf Neugebauer
 wrote:
> Hi
>
> We noticed some long delays starting docker containers on some newer
> kernels (starting with 4.5.x and still present in 4.9-rc4, 4.4.x is
> fine). We narrowed this down to the creation of a network namespace
> being delayed directly after removing another one (details and
> reproduction below). We have seen delays of up to 60s on some systems.
>
> - The delay is proportional to the number of CPUs (online or offline).
> We first discovered it with a Hyper-V Linux VM. Hyper-V advertises up
> to 240 offline vCPUs even if one configures the VM with only, say 2
> vCPUs. We see linear increase in delay when we change NR_CPUS in the
> kernel config.
>
> - The delay is also dependent on some tunnel network interfaces being
> present (which we had compiled in in one of our kernel configs).
>
> - We can reproduce this issue with stock kernels from
> http://kernel.ubuntu.com/~kernel-ppa/mainline/ running in Hyper-V VMs
> as well as other hypervisors like qemu and hyperkit where we have good
> control over the number of CPUs.
>
> A simple test is:
> modprobe ipip
> modprobe ip_gre
> modprobe ip_vti
> echo -n "add netns foo ===> "; /usr/bin/time -f "%E" ip netns add foo
> echo -n "del netns foo ===> "; /usr/bin/time -f "%E" ip netns delete foo
> echo -n "add netns bar ===> "; /usr/bin/time -f "%E" ip netns add bar
> echo -n "del netns bar ===> "; /usr/bin/time -f "%E" ip netns delete bar
>
> with an output like:
> add netns foo ===> 0:00.00
> del netns foo ===> 0:00.01
> add netns bar ===> 0:08.53
> del netns bar ===> 0:00.01
>
> This is on a 4.9-rc4 kernel from the above URL configured with
> NR_CPUS=256 running in a Hyper-V VM (kernel config attached).
>
> Below is a dump of the work queues while the second 'ip add netns' is
> hanging. The state of the work queues does not seem to change while
> the command is delayed and the pattern shown is consistent across
> different kernel versions.
>
> Is this a known issue and/or is someone working on a fix?

Not to me.


>
> [  610.356272] sysrq: SysRq : Show Blocked State
> [  610.356742]   taskPC stack   pid father
> [  610.357252] kworker/u480:1  D0  1994  2 0x
> [  610.357752] Workqueue: netns cleanup_net
> [  610.358239]  9892f1065800  9892ee1e1e00
> 9892f8e59340
> [  610.358705]  9892f4526900 bf0104b5ba88 be486df3
> bf0104b5ba60
> [  610.359168]  00ffbdcbe663 9892f8e59340 000100012e70
> 9892ee1e1e00
> [  610.359677] Call Trace:
> [  610.360169]  [] ? __schedule+0x233/0x6e0
> [  610.360723]  [] schedule+0x36/0x80
> [  610.361194]  [] schedule_timeout+0x22a/0x3f0
> [  610.361789]  [] ? __schedule+0x23b/0x6e0
> [  610.362260]  [] wait_for_completion+0xb4/0x140
> [  610.362736]  [] ? wake_up_q+0x80/0x80
> [  610.363306]  [] __wait_rcu_gp+0xc8/0xf0
> [  610.363782]  [] synchronize_sched+0x5c/0x80
> [  610.364137]  [] ? call_rcu_bh+0x20/0x20
> [  610.364742]  [] ?
> trace_raw_output_rcu_utilization+0x60/0x60
> [  610.365337]  [] synchronize_net+0x1c/0x30

This is a worker which holds the net_mutex and is waiting for
a RCU grace period to elapse.


> [  610.365846]  [] netif_napi_del+0x23/0x80
> [  610.367494]  [] ip_tunnel_dev_free+0x68/0xf0 [ip_tunnel]
> [  610.368007]  [] netdev_run_todo+0x230/0x330
> [  610.368454]  [] rtnl_unlock+0xe/0x10
> [  610.369001]  [] ip_tunnel_delete_net+0xdf/0x120 
> [ip_tunnel]
> [  610.369500]  [] ipip_exit_net+0x2c/0x30 [ipip]
> [  610.369997]  [] ops_exit_list.isra.4+0x38/0x60
> [  610.370636]  [] cleanup_net+0x1c4/0x2b0
> [  610.371130]  [] process_one_work+0x1fc/0x4b0
> [  610.371812]  [] worker_thread+0x4b/0x500
> [  610.373074]  [] ? process_one_work+0x4b0/0x4b0
> [  610.373622]  [] ? process_one_work+0x4b0/0x4b0
> [  610.374100]  [] kthread+0xd9/0xf0
> [  610.374574]  [] ? kthread_park+0x60/0x60
> [  610.375198]  [] ret_from_fork+0x25/0x30
> [  610.375678] ip  D0  2149   2148 0x
> [  610.376185]  9892f0a99000  9892f0a66900
> [  610.376185]  9892f8e59340
> [  610.376185]  9892f4526900 bf0101173db8 be486df3
> [  610.376753]  0005fecffd76
> [  610.376762]  00ff9892f11d9820 9892f8e59340 9892
> 9892f0a66900
> [  610.377274] Call Trace:
> [  610.377789]  [] ? __schedule+0x233/0x6e0
> [  610.378306]  [] schedule+0x36/0x80
> [  610.378992]  [] schedule_preempt_disabled+0xe/0x10
> [  610.379514]  [] __mutex_lock_slowpath+0xb9/0x130
> [  610.380031]  [] ? __kmalloc+0x162/0x1e0
> [  610.380556]  [] mutex_lock+0x1f/0x30
> [  610.381135]  [] copy_net_ns+0x9f/0x170
> [  610.381647]  [] create_new_namespaces+0x11b/0x200
> [  610.382249]  [] unshare_nsproxy_namespaces+0x5a/0xb0
> [  610.382818]  [] SyS_unshare+0x1cd/0x360
> [  610.383319]  [] entry_SYSCALL_64_fastpath+0x1e/0xad

This process is apparently waiting for the net_mutex held by the previous kworker.