Re: [openstack-dev] [neutron] Neutron router and nf_conntrack performance problems

2014-08-18 Thread Brian Haley
Stuart,

I also can't say I've seen this, but I am curious now.  I did have a few
questions for you though.

1. When you say you set nf_conntrack_max/nf_conntrack_hash to 256k, did you
really set the hash size that large?  Typically the hash is 1/8 of the max,
meaning a full table averages 8 entries per hash bucket.

2. Does /sys/module/nf_conntrack/parameters/hashsize look correct?

3. Are you seeing any messages such as
"nf_conntrack: table full, dropping packet"?

4. How many entries are in the conntrack table ('sudo conntrack -C')?

5. Have you been able to drill down any further into what's taking all the time
in nf_conntrack_tuple_taken()?  I can't imagine you have a single bucket with
tons of entries and you're spinning looking at each, but it could be that
simple.
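
For questions 1, 2 and 4, the arithmetic can be sketched in a few lines of
Python. This is only an illustration, not something from the original report:
it assumes "256k" means 262144, and the helper names (bucket_load, read_int,
report) are made up for this example. The three paths are the usual procfs and
sysfs locations and only exist while the nf_conntrack module is loaded.

```python
from pathlib import Path

# Standard kernel locations; present only while nf_conntrack is loaded.
MAX_PATH = Path("/proc/sys/net/netfilter/nf_conntrack_max")
COUNT_PATH = Path("/proc/sys/net/netfilter/nf_conntrack_count")
HASHSIZE_PATH = Path("/sys/module/nf_conntrack/parameters/hashsize")


def bucket_load(entries: int, hashsize: int) -> float:
    """Average number of conntrack entries per hash bucket (hypothetical helper)."""
    if hashsize <= 0:
        raise ValueError("hashsize must be positive")
    return entries / hashsize


def read_int(path: Path) -> int:
    """Read a single decimal integer from a procfs/sysfs file."""
    return int(path.read_text().strip())


def report() -> str:
    """Summarise the live table. Raises FileNotFoundError if the module is absent."""
    max_entries = read_int(MAX_PATH)
    count = read_int(COUNT_PATH)
    hashsize = read_int(HASHSIZE_PATH)
    return (
        f"max={max_entries} count={count} hashsize={hashsize} "
        f"avg/bucket now={bucket_load(count, hashsize):.2f} "
        f"avg/bucket when full={bucket_load(max_entries, hashsize):.2f}"
    )


# Assumed values: 256k taken as 262144 entries.
typical = bucket_load(262144, 262144 // 8)  # hash sized at 1/8 of the max
as_described = bucket_load(262144, 262144)  # hash really set as large as the max
```

Running print(report()) on the router node would show the live numbers. An
average says nothing about how evenly entries are spread, so a low figure here
does not rule out one very long chain.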

Thanks,

-Brian

On 08/16/2014 12:12 PM, Stuart Fox wrote:
> Hey neutron dev!
> 
> I'm having a serious problem with my neutron router getting spin locked in
> nf_conntrack_tuple_taken.
> Has anybody else experienced it?
> "perf top" shows nf_conntrack_tuple_taken at 75%.
> As the incoming request rate goes up, nf_conntrack_tuple_taken runs very hot
> on CPU0, causing ksoftirqd/0 to run at 100%. At that point internal pings on
> the GRE network go sky high and it's game over. Pinging from a VM to the
> subnet default gateway on the neutron router goes from 0.2ms to 11s! Pinging
> from the same VM to another VM in the same subnet stays constant at 0.2ms.
> 
> This very much indicates to me that the neutron router is having serious
> problems. No other part of the system seems under pressure.
> 
> IPv6 is disabled, and nf_conntrack_max/nf_conntrack_hash are set to 256k.
> We've tried the default 3.13 and the utopic 3.16 kernel (3.16 has lots of work
> on removing spinlocks around nf_conntrack). 3.16 survives a little longer but
> still gets into the same state.
> 
> Neutron router
> 1 x Ubuntu 14.04/Icehouse 2014.1.1 on an IBM x3550 with 4 10G Intel NICs.
> eth0 - Mgt
> eth1 - GRE
> eth2 - Public
> eth3 - unused 
> 
> Compute/controller nodes
> 43 x Ubuntu 14.04/Icehouse 2014.1.1 IBM x240 Flex blades with 4 Emulex NICs
> eth0 Mgt
> eth2 GRE
> 
> Any help very much appreciated!
> Replacing the l2/l3 functions with hardware is very much an option if that's
> a better solution.
> I'm running out of time before my client decides to stay on AWS.
> 
> 
> 
> BR,
> Stuart
> 
> 
> ___
> OpenStack-dev mailing list
> OpenStack-dev@lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
> 




Re: [openstack-dev] [neutron] Neutron router and nf_conntrack performance problems

2014-08-16 Thread Salvatore Orlando
Hi Stuart,

As far as I can tell, this is the first time I've heard about this problem.
I can't make any judgment from the details you've shared here, but I would
initially focus on OVS, the kernel and their interactions.
For Neutron's l3 agent the only thing I can say is that it uses the
conntrack module for doing SNAT on the default gateway and for managing
floating IPs - but I guess that won't help you much.

I think the neutron community could do more to help you if we understood
more about your particular situation.
- You mentioned 43 nodes between compute and controllers, but a single
"neutron router" (which I reckon is the l3 agent). How many logical
routers is that agent hosting? Are you able to share how many internal
interfaces are connected to those routers?
This is just to get an idea of the traffic passing through the l3 agent.
- Have you noticed any other call counter spiking up? The one you mentioned
seems to be called only by nf_nat_used_tuple, which is actually used in a
number of places.
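
A complementary check to the perf call counters asked about above is the
kernel's own per-CPU conntrack statistics in /proc/net/stat/nf_conntrack, the
same data that 'conntrack -S' prints. The Python sketch below is an
illustration rather than anything used in this thread. It assumes the usual
layout of that file (a header row of column names followed by one row of
hexadecimal values per CPU), and the function names are made up for the
example.

```python
from pathlib import Path

# Per-CPU conntrack counters; exists only while nf_conntrack is loaded.
STAT_PATH = Path("/proc/net/stat/nf_conntrack")


def parse_conntrack_stats(text: str) -> list[dict[str, int]]:
    """Return one dict of counters per CPU row (hypothetical helper).

    Column names come from the file's own header row, so nothing is
    assumed about which counters a given kernel exposes. Values are
    parsed as hexadecimal.
    """
    lines = [ln.split() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return []
    header, rows = lines[0], lines[1:]
    return [dict(zip(header, (int(v, 16) for v in row))) for row in rows]


def busiest_cpu(stats: list[dict[str, int]], counter: str) -> int:
    """Index of the row with the largest `counter`; raises ValueError if empty."""
    return max(range(len(stats)), key=lambda i: stats[i].get(counter, 0))


def read_stats() -> list[dict[str, int]]:
    """Parse the live file. Raises FileNotFoundError if the module is absent."""
    return parse_conntrack_stats(STAT_PATH.read_text())
```

Taking two snapshots with read_stats() a few seconds apart and comparing the
rows would show whether one CPU accounts for most of the change, which is
what the ksoftirqd/0 observation suggests.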

Regards,
Salvatore

PS: If you have not already done so, consider also submitting this kind of
question to ask.openstack.org

