Re: [Openstack] Directional network performance issues with Neutron + OpenvSwitch
Hi Thiago, I updated your bug report with my own tests and I don't experience your performance issues. George

On Tue, Nov 19, 2013 at 6:53 PM, Martinx - ジェームズ thiagocmarti...@gmail.com wrote: [...]

___ Mailing list: http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack Post to : openstack@lists.openstack.org Unsubscribe : http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack
Re: [Openstack] Directional network performance issues with Neutron + OpenvSwitch
Yup :)

On 18 Nov 2013, at 22:09, Martinx - ジェームズ wrote: Guys, can I file a bug about this issue?! If yes, where?! The Neutron Launchpad page? Tks, Thiago [...]
Re: [Openstack] Directional network performance issues with Neutron + OpenvSwitch
Okay! Bug filed: https://bugs.launchpad.net/neutron/+bug/1252900 Regards, Thiago

On 19 November 2013 16:00, Razique Mahroua razique.mahr...@gmail.com wrote: [...]
Re: [Openstack] Directional network performance issues with Neutron + OpenvSwitch
I suddenly have the identical situation occurring here. Of note: I am using Grizzly, and there have been two changes to the environment that have seemingly caused this: an upgrade of OVS to 1.11, and an upgrade of quantum-* from 2013.1.2 to 2013.1.3. I haven't tried the default OVS 1.4 from 12.04, and I can't, as this is a prod system. However, if the OpenStack update is causing it, then here is the place to start, I suspect: https://launchpad.net/neutron/grizzly/2013.1.3 Performance of 1.4 in my env makes that unusable. -- Geraint Jones

On 11/11/13 2:47 am, Jay Pipes jaypi...@gmail.com wrote: [...]
Re: [Openstack] Directional network performance issues with Neutron + OpenvSwitch
On 11/09/2013 07:09 PM, Martinx - ジェームズ wrote: [...]

I'd just like to point out that it is indeed possible to achieve good (bi-directional) network performance with Ubuntu 12.04, OVS 1.11, and OpenStack Grizzly with Neutron and GRE tunnels. We've deployed two zones with it, and after upgrading to OVS 1.11 we are seeing pretty good performance. We use the OpenStack Chef cookbooks to configure Neutron: https://github.com/stackforge/cookbook-openstack-network You may want to go through the above cookbook and check the default settings that are in the attributes and written to the configuration file templates.

I don't know of anything that changed between Grizzly and Havana that would have had an impact on network performance, but perhaps someone from the Neutron dev community could chime in here if there's been anything added in the Havana timeframe that may affect it. Best, -jay
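For context on the MTU = 1400 that comes up repeatedly in this thread: GRE and VXLAN encapsulation consume part of the physical MTU, so guest MTUs have to be lowered to avoid fragmentation or black-holed packets. The arithmetic, as a quick sketch (assuming a 1500-byte underlay, an IPv4 outer header, and Ethernet frames carried inside the tunnel, as OVS does):

```shell
# Tunnel overhead against a 1500-byte physical MTU:
#   GRE:   outer IP (20) + GRE (4)             + inner Ethernet (14) = 38 bytes
#   VXLAN: outer IP (20) + UDP (8) + VXLAN (8) + inner Ethernet (14) = 50 bytes
phys_mtu=1500

gre_mtu=$((phys_mtu - 38))
vxlan_mtu=$((phys_mtu - 50))

echo "max guest MTU over GRE:   $gre_mtu"    # 1462
echo "max guest MTU over VXLAN: $vxlan_mtu"  # 1450
```

1400 is simply a safe round number below both limits.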
Re: [Openstack] Directional network performance issues with Neutron + OpenvSwitch
Hi Jay! Thank you! I'll definitely take a look at those cookbooks, but I already tried Havana (Cloud Archive) with OVS 1.11.0, with the same poor results. Also, my previous region, based on Grizzly / Quantum / GRE, worked perfectly for months (except for needing MTU = 1400), and Havana is somehow different. Thanks! Thiago

On 10 November 2013 15:21, Jay Pipes jaypi...@gmail.com wrote: [...]
Re: [Openstack] Directional network performance issues with Neutron + OpenvSwitch
On 11/10/2013 01:35 PM, Martinx - ジェームズ wrote: [...] Interesting. Well, we're just beginning the process of our Havana deployment testing and changes, so we'll certainly be double-checking performance based on the above feedback. Best, -jay
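One common way to make instances actually use a reduced MTU such as 1400 is to push it via DHCP option 26 (interface MTU) through the DHCP agent's dnsmasq. A sketch; the file paths are typical but should be treated as assumptions for your deployment:

```ini
# /etc/neutron/dhcp_agent.ini (assumed path)
[DEFAULT]
dnsmasq_config_file = /etc/neutron/dnsmasq-neutron.conf

# /etc/neutron/dnsmasq-neutron.conf (assumed path)
# DHCP option 26 = interface MTU; guests that honor it will use MTU 1400
dhcp-option-force=26,1400
```

Restarting the neutron-dhcp-agent is needed for the change to take effect, and only instances that renew their DHCP lease will pick it up.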
Re: [Openstack] Directional network performance issues with Neutron + OpenvSwitch
Cool! Let me know what you'll need. I'll make a tenant / project / user for you here at my cloud, and I can give you root access to the network node (or any OpenStack node). Let me know if that is enough for you to debug / test it. Cheers! Thiago

On 10 November 2013 07:34, James Page james.p...@ubuntu.com wrote: On 10/11/13 00:09, Martinx - ジェームズ wrote: [...] BTW, I can give full access into my environment for you guys, no problem... I can build a lab from scratch, following your instructions; I can also give root access to OpenStack experts... Just let me know... =)

Hey, if you can set this up I can spare some time to help you debug tomorrow (Monday) between 0900 and 1800 UTC. Cheers, James -- James Page Ubuntu and Debian Developer james.p...@ubuntu.com jamesp...@debian.org
Re: [Openstack] Directional network performance issues with Neutron + OpenvSwitch
Guys, this problem is kind of a deal breaker... I was counting on OpenStack Havana (with Ubuntu) for my first public cloud, which I'm (was) about to announce / launch, but this problem changed everything. I cannot put Havana with Ubuntu LTS into production because of this network issue. This is a very serious problem for me, since all sites, and even SSH connections, that pass through the Floating IPs into the tenants' subnets are very slow, and all the connections freeze for seconds, every minute.

Again, I'm seeing that there is no way to put Havana into production (using Per-Tenant Routers with Private Networks), *because the Network Node is broken*. At least with Ubuntu... I'll try it with Debian 7, or CentOS (I don't like it), just to see if the problem persists, but I have preferred the Ubuntu distro since Warty Warthog... :-/

So, what is being done to fix it? I already tried everything I could, without any kind of success... Also, I followed this doc (to triple-triple re-check my env): http://docs.openstack.org/havana/install-guide/install/apt/content/section_networking-routers-with-private-networks.html but it does not work as expected.

BTW, I can give full access into my environment for you guys, no problem... I can build a lab from scratch, following your instructions; I can also give root access to OpenStack experts... Just let me know... =) Thanks! Thiago

On 6 November 2013 09:20, Martinx - ジェームズ thiagocmarti...@gmail.com wrote: Hello Stackers! Sorry for not getting back to this topic last week; too many things to do... So, instead of trying this and that, reply this, reply again... I made a video about this problem. I hope that helps more than those e-mails I've been writing! =P Honestly, I don't know the source of this problem, whether it is with OpenStack / Neutron, or with Linux / Namespaces / OVS...

It would be great to test it alone, Ubuntu Linux + Namespace + OVS (without Neutron), to see if the problem persists, but I have no idea how to set everything up just like Neutron does. Maybe I just need to reproduce the Namespace and OVS bridges / ports / VXLAN, as is, without Neutron?! I can try that...

Also, my Grizzly setup is gone; I deleted it... Sorry about that... I know it works because this is the first time I'm seeing this problem... I had used Grizzly for ~5 months with only 1 problem (related to MTU 1400), but this problem with Havana is totally different...

Video: OpenStack Havana L3 Router problem - Ubuntu 12.04.3 LTS: http://www.youtube.com/watch?v=jVjiphMuuzM * After 5 minutes, I inserted a new video, showing how I fixed it by running Squid within the Tenant router. You guys can see that, using the default Tenant router (10:30), it will take about 1 hour to finish the apt-get download, and, with Squid (09:27), it goes down to about 3 minutes (no, it is still not cached; I clean it for each test). Sorry about the size of the video; it is about 12 minutes and high-res (to see the screen details), but it is a serious problem and I think it is worth watching...

NOTE: Sorry about my English! It is very hard to speak a non-native language while handling an Android phone and typing on the keyboard... :-) Best! Thiago

On 28 October 2013 07:00, Darragh O'Reilly dara2002-openst...@yahoo.com wrote: [...]
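The idea of reproducing the Namespace + OVS path without Neutron can be sketched roughly as below. This is a hypothetical minimal setup, not Neutron's exact wiring: all names (testns, br-test, qg-test) and addresses are made up, it requires root, and it assumes br-ex already exists and leads to the external gateway.

```shell
# Create a namespace playing the role of a Neutron router:
ip netns add testns

# An "internal" bridge with an internal port moved into the namespace
# (stands in for the qr- interface on br-int):
ovs-vsctl add-br br-test
ovs-vsctl add-port br-test tap-test -- set interface tap-test type=internal
ip link set tap-test netns testns
ip netns exec testns ip addr add 10.99.0.1/24 dev tap-test
ip netns exec testns ip link set tap-test up

# A second internal port on the existing br-ex, mimicking the qg- interface
# of a Neutron router (addresses are examples):
ovs-vsctl add-port br-ex qg-test -- set interface qg-test type=internal
ip link set qg-test netns testns
ip netns exec testns ip addr add 192.168.1.250/24 dev qg-test
ip netns exec testns ip link set qg-test up
ip netns exec testns ip route add default via 192.168.1.1
ip netns exec testns iptables -t nat -A POSTROUTING -s 10.99.0.0/24 -j MASQUERADE
```

Benchmarking through this hand-built namespace (e.g. with iperf) and comparing against the qrouter- namespace on the same box would help separate a kernel/OVS issue from a Neutron one.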
Re: [Openstack] Directional network performance issues with Neutron + OpenvSwitch
Thiago, some more answers below. Btw: I saw the problem with a qemu-nbd -c process using all the CPU on the compute. It happened just once; must be a bug in it. You can disable libvirt injection, if you don't want it, by setting libvirt_inject_partition = -2 in nova.conf.

On Saturday, 26 October 2013, 16:58, Martinx - ジェームズ thiagocmarti...@gmail.com wrote: Hi Darragh, yes, on the same net-node machine, Grizzly works and Havana doesn't... But, for Grizzly, I have Ubuntu 12.04 with Linux 3.2 and OVS 1.4.0-1ubuntu1.6.

[Darragh:] So we don't know if the problem is due to Neutron, the Ubuntu kernel, or OVS. I suspect the kernel, as it implements the routing/NAT, interfaces, and namespaces. I don't think Neutron Havana changes how these things are set up too much. Can you try running Havana on a network node with the Linux 3.2 kernel?

[Thiago:] If I replace the Havana net-node hardware entirely, the problem persists (i.e. it follows the Havana net-node), so, I think, it cannot be related to the hardware. I tried Havana with both OVS 1.10.2 (from Cloud Archive) and with OVS 1.11.0 (compiled and installed by myself using dpkg-buildpackage / dpkg). My logs (including Open vSwitch) right after starting an Instance (nothing in the OVS logs): http://paste.openstack.org/show/49870/ I tried everything, including installing the Network Node on top of a KVM virtual machine or directly on a dedicated server; same result, the problem follows the Havana node (virtual or physical). The Grizzly Network Node works both on a KVM VM and on a dedicated server. Regards, Thiago

On 26 October 2013 06:28, Darragh OReilly wrote: [...]
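For reference, the libvirt injection workaround Darragh mentions is a one-line nova.conf change. The file path and section below are as commonly used at the time; verify against your own deployment:

```ini
# /etc/nova/nova.conf on the compute node (assumed path)
[DEFAULT]
# -2 disables file/key injection entirely, so qemu-nbd is never used to
# mount the guest image at boot
libvirt_inject_partition = -2
```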
Re: [Openstack] Directional network performance issues with Neutron + OpenvSwitch
Stackers, I have a small report from my latest tests.

Tests:
* Namespace (br-ex) *-* Internet - OK
* Namespace (vxlan, gre, vlan) *-* Tenant - OK
* Tenant *-* Namespace *-* Internet - *NOT OK* (very slow / unstable / intermittent)

Since the connectivity from a Tenant to its Namespace is fine AND from its Namespace to the Internet is also fine, it came to my mind: hey, why not run Squid WITHIN the Tenant Namespace as a workaround?! And... Voilà! There, I fixed it! =P

New Test:
* Tenant *-* *Namespace with Squid* *-* Internet - OK!

*NOTE:* I'm sure that the entire Ethernet path (without L3, Namespace, OVS, VXLANs, GREs, or Linux bridges; just plain Linux + IPs), *from the hypervisor to the Internet*, *passing through the same Network Node hardware / path*, is working smoothly. I mean, I tested the entire path BEFORE installing OpenStack Havana... So, it cannot be an infrastructure / hardware issue; it must be something else, located at the software layer running within the Network Node itself. I'm about to send more info about this problem. Thanks! Thiago

On 26 October 2013 13:57, Martinx - ジェームズ thiagocmarti...@gmail.com wrote: [...]
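The Squid-in-the-Namespace workaround can be reproduced with ip netns exec. A hedged sketch: the router UUID is a placeholder, the router's internal IP is an example, and the Squid config path assumes Ubuntu's squid3 package.

```shell
# Find the qrouter namespace for the tenant router (UUID is a placeholder):
ip netns list
# e.g. qrouter-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx

# Run Squid inside that namespace, so its upstream traffic originates from
# the router's own interfaces instead of being forwarded through them:
ip netns exec qrouter-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx \
    squid -f /etc/squid3/squid.conf

# Instances then point at the router's internal IP on Squid's default port
# (3128), e.g. for apt (address is an example):
#   echo 'Acquire::http::Proxy "http://10.0.0.1:3128";' \
#       > /etc/apt/apt.conf.d/01proxy
```

That this helps at all is itself a diagnostic: it suggests the forwarding/NAT path through the namespace is the slow part, not the namespace's own connectivity.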
Re: [Openstack] Directional network performance issues with Neutron + OpenvSwitch
Hi Darragh, Yes, on the same net-node machine, Grizzly works and Havana doesn't... But, for Grizzly, I have Ubuntu 12.04 with Linux 3.2 and OVS 1.4.0-1ubuntu1.6. If I replace the Havana net-node hardware entirely, the problem persists (i.e. it follows the Havana net-node), so I think it cannot be related to the hardware. I tried Havana with both OVS 1.10.2 (from Cloud Archive) and with OVS 1.11.0 (compiled and installed by myself using dpkg-buildpackage / dpkg). My logs (including Open vSwitch) right after starting an Instance (nothing at OVS logs): http://paste.openstack.org/show/49870/ I tried everything, including installing the Network Node on top of a KVM virtual machine or directly on a dedicated server, same result, the problem follows the Havana node (virtual or physical). The Grizzly Network Node works both on a KVM VM and on a dedicated server. Regards, Thiago
Re: [Openstack] Directional network performance issues with Neutron + OpenvSwitch
Hi Thiago, you have configured DHCP to push out an MTU of 1400. Can you confirm that the 1400 MTU is actually getting out to the instances by running 'ip link' on them? There is an open problem where the veth used to connect the OVS and Linux bridges causes a performance drop on some kernels - https://bugs.launchpad.net/nova-project/+bug/1223267 . If you are using the LibvirtHybridOVSBridgeDriver VIF driver, can you try changing to LibvirtOpenVswitchDriver and repeat the iperf test between instances on different compute-nodes. What NICs (maker+model) are you using? You could try disabling any off-load functionality - 'ethtool -k iface-used-for-gre'. What kernel are you using: 'uname -a'? Re, Darragh. Hi Daniel, I followed that page, my Instances' MTU is lowered by the DHCP Agent but, same result: poor network performance (internal between Instances and when trying to reach the Internet). No matter if I use dnsmasq_config_file=/etc/neutron/dnsmasq-neutron.conf + dhcp-option-force=26,1400 for my Neutron DHCP agent, or not (i.e. MTU = 1500), the result is almost the same. I'll try VXLAN (or just VLANs) this weekend to see if I can get better results... Thanks! Thiago
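As an aside on why the thread keeps coming back to dhcp-option-force=26,1400: the instance (inner) MTU has to leave room for the tunnel headers on a 1500-byte underlay. A minimal sketch, assuming the standard IPv4 header sizes (the helper name max_inner_mtu is mine, not from the thread):

```python
# Hypothetical helper: largest instance (inner) MTU that fits a tunnelled
# Ethernet frame into the physical network's MTU without fragmentation.
OUTER_IPV4 = 20      # outer IPv4 header
GRE_WITH_KEY = 8     # 4-byte GRE header + 4-byte key, as OVS tunnels use
UDP = 8              # UDP header (VXLAN only)
VXLAN = 8            # VXLAN header
INNER_ETH = 14       # inner Ethernet header carried inside the tunnel

def max_inner_mtu(phys_mtu, encap):
    """Largest inner-packet MTU for a given encapsulation."""
    if encap == "gre":
        overhead = OUTER_IPV4 + GRE_WITH_KEY + INNER_ETH   # 42 bytes
    elif encap == "vxlan":
        overhead = OUTER_IPV4 + UDP + VXLAN + INNER_ETH    # 50 bytes
    else:
        raise ValueError("unknown encapsulation: %s" % encap)
    return phys_mtu - overhead

print(max_inner_mtu(1500, "gre"))    # 1458
print(max_inner_mtu(1500, "vxlan"))  # 1450
```

So 1400 leaves comfortable headroom under either encapsulation; anything above 1458 (GRE with key) or 1450 (VXLAN) would force fragmentation or silent drops on a 1500-byte physical path.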
Re: [Openstack] Directional network performance issues with Neutron + OpenvSwitch
Hi Thiago, for the VIF error: you will need to change qemu.conf as described here: http://openvswitch.org/openstack/documentation/ Re, Darragh. On Friday, 25 October 2013, 15:14, Martinx - ジェームズ thiagocmarti...@gmail.com wrote: Hi Darragh, Yes, Instances are getting MTU 1400. I'm using LibvirtHybridOVSBridgeDriver at my Compute Nodes. I'll check bug 1223267 right now! The LibvirtOpenVswitchDriver doesn't work, look: http://paste.openstack.org/show/49709/ http://paste.openstack.org/show/49710/ My NICs are RTL8111/8168/8411 PCI Express Gigabit Ethernet, the Hypervisors' motherboard is an MSI-890FXA-GD70. The command ethtool -K eth1 gro off did not have any effect on the communication between instances on different hypervisors, still poor, around 248Mbit/sec, while its physical path reaches 1Gbit/s (where the GRE is built). My Linux version is Linux hypervisor-1 3.8.0-32-generic #47~precise1-Ubuntu, same kernel on the Network Node and the other nodes too (Ubuntu 12.04.3 installed from scratch for this Havana deployment). The only difference I can see right now between my two hypervisors is that the second is just a spare machine with a slow CPU but, I don't think it will have a negative impact on the network throughput, since I have only 1 Instance running on it (plus a qemu-nbd process eating 90% of its CPU). I'll replace this CPU tomorrow to redo these tests but, I don't think this is the source of my problem. The MOBOs of the two hypervisors are identical, with 1 3Com (manageable) switch connecting the two.
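For readers without access to that openvswitch.org page (it has since gone offline): as best I recall, the qemu.conf change it described was letting qemu open the tap device by extending the device cgroup ACL in /etc/libvirt/qemu.conf. Treat the fragment below as a reconstruction from memory, to be verified against your libvirt version:

```
# /etc/libvirt/qemu.conf -- reconstructed sketch, not the original page's text.
# The key addition for the OVS VIF driver is "/dev/net/tun".
cgroup_device_acl = [
    "/dev/null", "/dev/full", "/dev/zero",
    "/dev/random", "/dev/urandom",
    "/dev/ptmx", "/dev/kvm", "/dev/kqemu",
    "/dev/rtc", "/dev/hpet", "/dev/net/tun"
]
```

Restart the libvirt daemon after editing for the change to take effect.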
Re: [Openstack] Directional network performance issues with Neutron + OpenvSwitch
the uneven ssh performance is strange - maybe learning on the tunnel mesh is not stabilizing. It is easy to mess it up by giving a wrong local_ip in the ovs-plugin config file. Check the tunnel ports on br-tun with 'ovs-vsctl show'. Is each one using the correct IPs? Br-tun should have N-1 gre-x ports - no more! Maybe you can put 'ovs-vsctl show' from the nodes on paste.openstack if there are not too many? Re, Darragh. On Friday, 25 October 2013, 16:20, Martinx - ジェームズ thiagocmarti...@gmail.com wrote: I think I can say... YAY!! :-D With LibvirtOpenVswitchDriver my internal communication is double now! It goes from ~200 (with LibvirtHybridOVSBridgeDriver) to 400Mbit/s (with LibvirtOpenVswitchDriver)! Still far from 1Gbit/s (my physical path limit) but, more acceptable now. The command ethtool -K eth1 gro off still makes no difference. So, there is only 1 remaining problem: when traffic passes through L3 / Namespace, it is still useless. Even the SSH connection into my Instances, via their Floating IPs, is slow as hell; sometimes it just stops responding for a few seconds, and comes back online again out of nothing... I just detected a weird behavior: when I run apt-get update from instance-1, it is slow as I said plus, its ssh connection (where I'm running apt-get update) stops responding right after I run apt-get update AND, all my other ssh connections also stop working too! For a few seconds... This means that when I run apt-get update from within instance-1, the SSH session of instance-2 is affected too!! There is something pretty bad going on at L3 / Namespace. BTW, do you think that ~400MBit/sec intra-vm-communication (GRE tunnel) on top of 1Gbit ethernet is acceptable?! It is still less than half... Thank you! Thiago
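Darragh's "N-1 gre ports" rule of thumb can be checked mechanically against the pasted outputs. A small sketch (the function name and the sample text are mine, modeled on the plain-text layout ovs-vsctl show used circa OVS 1.10):

```python
import re

def gre_remote_ips(ovs_vsctl_show_output):
    """Extract the remote_ip of every gre-* tunnel port on br-tun.

    Assumes lines like:
        options: {in_key=flow, local_ip="10.20.2.52", remote_ip="10.20.2.53"}
    nested under a 'Bridge br-tun' section.
    """
    ips = []
    in_br_tun = False
    for line in ovs_vsctl_show_output.splitlines():
        if line.strip().startswith("Bridge"):
            in_br_tun = "br-tun" in line
        m = re.search(r'remote_ip="([\d.]+)"', line)
        if in_br_tun and m:
            ips.append(m.group(1))
    return ips

# Illustrative sample standing in for one node's real paste (net-node-1).
sample = """
    Bridge br-tun
        Port "gre-1"
            Interface "gre-1"
                type: gre
                options: {in_key=flow, local_ip="10.20.2.52", out_key=flow, remote_ip="10.20.2.53"}
        Port "gre-2"
            Interface "gre-2"
                type: gre
                options: {in_key=flow, local_ip="10.20.2.52", out_key=flow, remote_ip="10.20.2.57"}
"""
nodes = ["10.20.2.52", "10.20.2.53", "10.20.2.57"]  # net-node-1 + 2 hypervisors
ips = gre_remote_ips(sample)
# With 3 nodes in a full mesh, each br-tun should carry exactly N-1 = 2
# tunnels, one to each *other* node, and never one back to its own local_ip.
assert len(ips) == len(nodes) - 1
assert "10.20.2.52" not in ips
```

Running the same check over each node's paste would confirm the mesh is complete and free of stale or self-pointing tunnel ports.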
Re: [Openstack] Directional network performance issues with Neutron + OpenvSwitch
Here we go:

---
root@net-node-1:~# grep local_ip /etc/neutron/plugins/openvswitch/ovs_neutron_plugin.ini
local_ip = 10.20.2.52
root@net-node-1:~# ip r | grep 10.\20
10.20.2.0/24 dev eth1 proto kernel scope link src 10.20.2.52
---
---
root@hypervisor-1:~# grep local_ip /etc/neutron/plugins/openvswitch/ovs_neutron_plugin.ini
local_ip = 10.20.2.53
root@hypervisor-1:~# ip r | grep 10.\20
10.20.2.0/24 dev eth1 proto kernel scope link src 10.20.2.53
---
---
root@hypervisor-2:~# grep local_ip /etc/neutron/plugins/openvswitch/ovs_neutron_plugin.ini
local_ip = 10.20.2.57
root@hypervisor-2:~# ip r | grep 10.\20
10.20.2.0/24 dev eth1 proto kernel scope link src 10.20.2.57
---

Each ovs-vsctl show: net-node-1: http://paste.openstack.org/show/49727/ hypervisor-1: http://paste.openstack.org/show/49728/ hypervisor-2: http://paste.openstack.org/show/49729/ Best, Thiago
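Those grep/ip-route pairs lend themselves to an automated consistency check per node. A hedged sketch (the sample strings are taken from the outputs above; the sed pattern is illustrative, not from the thread):

```shell
# Sketch: confirm the plugin's local_ip matches the source address the kernel
# actually uses on the GRE-carrying subnet, per Darragh's warning that a
# wrong local_ip quietly breaks the tunnel mesh.
local_ip="10.20.2.52"
route_line='10.20.2.0/24 dev eth1 proto kernel scope link src 10.20.2.52'

src=$(printf '%s\n' "$route_line" | sed -n 's/.* src \([0-9.]*\).*/\1/p')
if [ "$src" = "$local_ip" ]; then
    echo "ok: local_ip matches the address on eth1"
else
    echo "MISMATCH: config says $local_ip but the route uses $src"
fi
# On a live node, local_ip would come from the ini file and route_line from:
#   ip r | grep '10\.20'
```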
Re: [Openstack] Directional network performance issues with Neutron + OpenvSwitch
ok, the tunnels look fine. One thing that looks funny on the network node is these untagged tap* devices. I guess you switched to using veths and then switched back to not using them. I don't know if they matter, but you should clean them up by stopping everything, running neutron-ovs-cleanup (check bridges empty) and rebooting.

Bridge br-int
    Port tapa1376f61-05
        Interface tapa1376f61-05
    ...
    Port qr-a1376f61-05
        tag: 1
        Interface qr-a1376f61-05
            type: internal

Re, Darragh.
Re: [Openstack] Directional network performance issues with Neutron + OpenvSwitch
Okay, cool! tap* devices removed, neutron-ovs-cleanup ok, bridges empty, all nodes rebooted. BUT, there is still poor performance when reaching the External network from within an Instance (plus SSH lags)... I'll install a new Network Node, on other hardware, to test it more... The weird thing is, my Grizzly Network Node works perfectly on this very same hardware (same OpenStack Network topology, of course)... Hardware of my current net-node-1: * Grizzly - Okay * Havana - Fails... ;-( Best, Thiago
Re: [Openstack] Directional network performance issues with Neutron + OpenvSwitch
Hi Rick, On 25 October 2013 13:44, Rick Jones rick.jon...@hp.com wrote: On 10/25/2013 08:19 AM, Martinx - ジェームズ wrote: I think I can say... YAY!! :-D With LibvirtOpenVswitchDriver my internal communication is double now! It goes from ~200 (with LibvirtHybridOVSBridgeDriver) to *400Mbit/s* (with LibvirtOpenVswitchDriver)! Still far from 1Gbit/s (my physical path limit) but, more acceptable now. The command ethtool -K eth1 gro off still makes no difference. Does GRO happen if there isn't RX CKO on the NIC? Ouch! I missed that lesson... hehe No idea, how can I check / test this? If I disable RX CKO (using ethtool?) on the NIC, how can I verify whether GRO is actually happening or not? Anyway, I'm googling about all this stuff right now. Thanks for pointing it out! Refs: * JLS2009: Generic receive offload - http://lwn.net/Articles/358910/ Can your NIC peer into a GRE tunnel (?) to do CKO on the encapsulated traffic? Again, no idea... :-/ Listen, maybe this sounds too dumb on my part but, it is the first time I'm talking about this stuff (like NIC peering into GRE?, or GRO / CKO)... GRE tunnels sound too damn complex and problematic... I guess it is time to try VXLAN (or NVP?)... If you guys say: VXLAN is a completely different beast (i.e. it does not touch ANY GRE tunnel), and it works smoothly (without GRO / CKO / MTU / lags / low-speed troubles and issues), I'll move to it right now (are the VXLAN docs ready?). NOTE: I don't want to hijack this thread because of other (internal communication VS the Directional network performance issues with Neutron + OpenvSwitch thread subject) problems with my OpenStack environment, please let me know if this becomes a problem for you guys.
I just detect a weird behavior, when I run apt-get update from instance-1, it is slow as I said plus, its ssh connection (where I'm running apt-get update), stops responding right after I run apt-get update AND, _all my others ssh connections also stops working too!_ For a few seconds... This means that when I run apt-get update from within instance-1, the SSH session of instance-2 is affected too!! There is something pretty bad going on at L3 / Namespace. BTW, do you think that a ~400MBit/sec intra-vm-communication (GRE tunnel) on top of a 1Gbit ethernet is acceptable?! It is still less than a half... I would suggest checking for individual CPUs maxing-out during the 400 Mbit/s transfers. Okay, I'll. rick jones Thiago ___ Mailing list: http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack Post to : openstack@lists.openstack.org Unsubscribe : http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack
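Rick's suggestion to check for individual CPUs maxing out during the transfer can be done roughly like this (a sketch; mpstat comes from the sysstat package, which may not be installed, and eth1 / the 400 Mbit/s figure are from the thread):

```shell
# While the iperf transfer runs, watch per-CPU utilization once per second.
# A single core pinned near 100% (often in %soft, i.e. softirq time spent
# on GRE encap/decap) would explain a throughput ceiling well below line rate.
mpstat -P ALL 1

# Without sysstat installed, running "top" and pressing "1" shows the
# same per-CPU breakdown interactively.
```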
Re: [Openstack] Directional network performance issues with Neutron + OpenvSwitch
You can use "ethtool -k eth0" to view the settings and "ethtool -K eth0 gro off" to turn off GRO.

On Fri, Oct 25, 2013 at 3:03 PM, Martinx - ジェームズ thiagocmarti...@gmail.com wrote: [full quote of the previous message trimmed]
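Rick's question (does GRO happen without RX CKO on the NIC?) can be probed directly with ethtool; a sketch, assuming eth1 is the GRE-carrying NIC as elsewhere in the thread:

```shell
# List the offload settings relevant to this discussion.
ethtool -k eth1 | grep -E 'rx-checksumming|generic-receive-offload|large-receive-offload|tcp-segmentation-offload'

# Toggle them independently. GRO is implemented in the kernel stack, but
# it relies on the NIC having validated the receive checksum (RX CKO), so
# turning rx checksumming off should effectively stop GRO coalescing too.
ethtool -K eth1 rx off      # disable RX checksum offload (CKO)
ethtool -K eth1 gro off     # disable GRO explicitly
```

Comparing iperf results with each combination is one way to verify which offload is actually in play.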
Re: [Openstack] Directional network performance issues with Neutron + OpenvSwitch
Listen, maybe this sounds too dumb on my part, but it is the first time I'm dealing with this stuff ("NIC peer into GRE"?, GRO / CKO...).

No worries. So, a slightly brief history of stateless offloads in NICs. It may be too basic, and I may get some details wrong, but it should give the gist.

Go back to the old days - 10 Mbit/s Ethernet was it (all you Token Ring fans can keep quiet :). Systems got faster than 10 Mbit/s. By a fair margin. 100BT came out, and it wasn't all that long before systems were faster than that, but things like interrupt rates were starting to become an issue for performance, so 100BT NICs started implementing interrupt-avoidance heuristics. The next bump in network speed, to 1000 Mbit/s, managed to get well out ahead of the systems.

All this time, while the link speeds were increasing, the IEEE was doing little to nothing to make sending and receiving Ethernet traffic any easier on the end stations (e.g. increasing the MTU). It was taking just as many CPU cycles to send/receive a frame over 1000BT as it did over 100BT as it did over 10BT. (Insert segue here about how FDDI was doing things to make life easier, as well as what the FDDI NIC vendors were doing to enable copy-free networking.)

So the Ethernet NIC vendors started getting creative and started borrowing some techniques from FDDI. The base of it all is CKO - ChecKsum Offload: offloading the checksum calculation for the TCP and UDP checksums. In broad handwaving terms, for inbound packets the NIC is made either smart enough to recognize an incoming frame as a TCP segment (UDP datagram), or it performs the Internet Checksum across the entire frame and leaves it to the driver to fix up. For outbound traffic, the stack, via the driver, tells the NIC a starting value (perhaps), where to start computing the checksum, how far to go, and where to stick it... So we can save the CPU cycles used calculating/verifying the checksums.
In rough terms, in the presence of copies, that is perhaps a 10% or 15% savings.

Systems still needed more. It was just as many trips up and down the protocol stack in the host to send a MB of data as it was before - the IEEE hanging on to the 1500-byte MTU. So some NIC vendors came up with Jumbo Frames - I think the first may have been Alteon with their AceNICs and switches. A 9000-byte MTU allows one to send bulk data across the network in ~1/6 the number of trips up and down the protocol stack. But that has problems - in particular, you have to have support for Jumbo Frames from end to end.

So someone, I don't recall who, had the flash of inspiration: What if... the NIC could perform the TCP segmentation on behalf of the stack? When sending a big chunk of data over TCP in one direction, the only things which change from TCP segment to TCP segment are the sequence number and the checksum (insert some handwaving about the IP datagram ID here). The NIC already knows how to compute the checksum, so let's teach it how to very simply increment the TCP sequence number. Now we can give it A Lot of Data (tm) in one trip down the protocol stack and save even more CPU cycles than Jumbo Frames. Now the NIC has to know a little bit more about the traffic - it has to know that it is TCP so it can know where the TCP sequence number goes. We also tell it the MSS to use when it is doing the segmentation on our behalf. Thus was born TCP Segmentation Offload, aka TSO, or "Poor Man's Jumbo Frames".

That works pretty well for servers at the time - they tend to send more data than they receive. The clients receiving the data don't need to be able to keep up at 1000 Mbit/s, and the server can be sending to multiple clients. However, we get another order of magnitude bump in link speeds, to 10 Gbit/s. Now people need/want to receive at the higher speeds too. So some 10 Gbit/s NIC vendors come up with the mirror image of TSO and call it LRO - Large Receive Offload.
The LRO NIC will coalesce several consecutive TCP segments into one uber-segment and hand that to the host. There are some issues with LRO, though - for example when a system is acting as a router - so in Linux, and perhaps other stacks, LRO is taken out of the hands of the NIC and given to the stack in the form of GRO - Generic Receive Offload. GRO operates above the NIC/driver but below IP. It detects the consecutive segments and coalesces them before passing them further up the stack. It becomes possible to receive data at link rate over 10 GbE. All is happiness and joy.

OK, so now we have all these stateless offloads that know about the basic traffic flow. They are all built on the foundation of CKO. They are all dealing with *un*encapsulated traffic. (They also don't do anything for small packets.) Now, toss in some encapsulation. Take your pick; in the abstract it doesn't really matter which, I suspect, at least for a little longer. What is arriving at the NIC on inbound is no longer a TCP segment in an IP
Re: [Openstack] Directional network performance issues with Neutron + OpenvSwitch
WOW!! Thank you for your time Rick! Awesome answer!! =D

I'll do these tests (with ethtool GRO / CKO) tonight but, do you think this is the main root of the problem?! I mean, I'm seeing two distinct problems here:

1- Slow connectivity to the External network plus SSH lags all over the cloud (everything that passes through L3 / Namespace is problematic), and;

2- Communication between two Instances on different hypervisors (i.e. maybe it is related to this GRO / CKO thing).

So, two different problems, right?!

Thanks!
Thiago

On 25 October 2013 18:56, Rick Jones rick.jon...@hp.com wrote: [Rick's history of stateless offloads quoted in full; trimmed]
Re: [Openstack] Directional network performance issues with Neutron + OpenvSwitch
LOL... One day, Internet via Quantum Entanglement! Oops, Neutron! =P

I'll ignore the problems related to the performance between two instances on different hypervisors for now. My priority is the connectivity issue with the External networks... At least internal is slow but it works.

I'm about to remove the L3 Agent / Namespaces entirely from my topology... It is a shame, because it is pretty cool! With Grizzly I had no problems at all. Plus, I need to put Havana into production ASAP! :-/

Why am I giving it up (L3 / NS) for now? Because I tried the option tenant_network_type with gre, vxlan and vlan (range physnet1:206:256, configured at the 3Com switch as tagged). From the instances, the connection with the External network *is always slow*, no matter whether I choose GRE, VXLAN or VLAN for Tenants. For example, right now I'm using VLAN; same problem.

Don't you guys think this could be a problem with the bridge br-ex and its internals? Since I swapped the Tenant Network Type 3 times with the same result... But I still have not removed br-ex from the scene. If someone wants to debug it, I can give the root password, no problem, it is just a lab... =)

Thanks!
Thiago

On 25 October 2013 19:45, Rick Jones rick.jon...@hp.com wrote: On 10/25/2013 02:37 PM, Martinx - ジェームズ wrote: WOW!! Thank you for your time Rick! [...] So, two different problems, right?!

One or two problems I cannot say. Certainly if one got the benefit of stateless offloads in one direction and not the other, one could see different performance limits in each direction. All I can really say is I liked it better when we were called Quantum, because then I could refer to it as "Spooky networking at a distance". Sadly, describing Neutron as "Networking with no inherent charge" doesn't work as well :)

rick jones
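One way to separate the two suspected problems is to measure each path in both directions; a sketch using iperf (the router ID and addresses are placeholders, not values from the thread):

```shell
# 1) Instance <-> instance on different hypervisors (GRE path, no L3 agent):
#    on instance-2:   iperf -s
#    on instance-1:   iperf -c <instance-2-fixed-ip> -r
# The -r flag runs the test in both directions in one go, which matters
# here since the reported problem is directional.

# 2) Instance <-> external side, through the router namespace:
#    on the network node, inside the tenant router's namespace:
ip netns exec qrouter-<router-id> iperf -s
#    on the instance:  iperf -c <qrouter-gateway-ip> -r

# If (1) is roughly symmetric but (2) chokes in one direction, the problem
# sits in the L3 / namespace path rather than in GRE offload behavior.
```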
Re: [Openstack] Directional network performance issues with Neutron + OpenvSwitch
I was able to enable ovs_use_veth and start Instances (VXLAN / DHCP / Metadata okay)... But, same problem when accessing the External network. BTW, I have valid Floating IPs and easy access to the Internet from the Network Node; if someone wants to debug, just ping me a message.

On 26 October 2013 02:25, Martinx - ジェームズ thiagocmarti...@gmail.com wrote: [full quote of the previous two messages trimmed]
Re: [Openstack] Directional network performance issues with Neutron + OpenvSwitch
Ok, so that says that PMTUD is failing, probably due to a bug/limitation in openvswitch. Can we please make sure a bug is filed - both on Neutron and on the upstream component - as soon as someone tracks it down: manual MTU lowering is only needed when a network component is failing to report failed delivery of DF packets correctly.

-Rob

On 25 October 2013 08:38, Speichert,Daniel djs...@drexel.edu wrote:

We managed to bring the upload speed back to maximum on the instances through the use of this guide: http://docs.openstack.org/trunk/openstack-network/admin/content/openvswitch_plugin.html

Basically, the MTU needs to be lowered for GRE tunnels. It can be done with DHCP as explained in the new trunk manual.

Regards,
Daniel

From: annegen...@justwriteclick.com [mailto:annegen...@justwriteclick.com] On Behalf Of Anne Gentle
Sent: Thursday, October 24, 2013 12:08 PM
To: Martinx - ジェームズ
Cc: Speichert,Daniel; openstack@lists.openstack.org
Subject: Re: [Openstack] Directional network performance issues with Neutron + OpenvSwitch

On Thu, Oct 24, 2013 at 10:37 AM, Martinx - ジェームズ thiagocmarti...@gmail.com wrote:

Precisely! The doc currently says to disable Namespaces when using GRE; I never did this before, look: http://docs.openstack.org/trunk/install-guide/install/apt/content/install-neutron.install-plugin.ovs.gre.html But on this very same doc, they say to enable it... Who knows?! =P http://docs.openstack.org/trunk/install-guide/install/apt/content/section_networking-routers-with-private-networks.html I'll stick with Namespaces enabled...

Just a reminder, /trunk/ links are works in progress. Thanks for bringing the mismatch to our attention; we already have a doc bug filed: https://bugs.launchpad.net/openstack-manuals/+bug/1241056 Review this patch: https://review.openstack.org/#/c/53380/

Anne

Let me ask you something: when you enable ovs_use_veth, do the Metadata and DHCP still work?!

Cheers!
Thiago

On 24 October 2013 12:22, Speichert,Daniel djs...@drexel.edu wrote:

Hello everyone,

It seems we also ran into the same issue. We are running Ubuntu Saucy with OpenStack Havana from the Ubuntu Cloud archives (precise-updates). The download speed to the VMs increased from 5 Mbps to maximum after enabling ovs_use_veth. Upload speed from the VMs is still terrible (max 1 Mbps, usually 0.04 Mbps).

Here is the iperf between the instance and the L3 agent (network node) inside the namespace:

root@cloud:~# ip netns exec qrouter-a29e0200-d390-40d1-8cf7-7ac1cef5863a iperf -c 10.1.0.24 -r
Server listening on TCP port 5001
TCP window size: 85.3 KByte (default)
Client connecting to 10.1.0.24, TCP port 5001
TCP window size:  585 KByte (default)
[  7] local 10.1.0.1 port 37520 connected with 10.1.0.24 port 5001
[ ID] Interval       Transfer     Bandwidth
[  7]  0.0-10.0 sec   845 MBytes   708 Mbits/sec
[  6] local 10.1.0.1 port 5001 connected with 10.1.0.24 port 53006
[  6]  0.0-31.4 sec   256 KBytes  66.7 Kbits/sec

We are using Neutron OpenVSwitch with GRE and namespaces.

A side question: the documentation says to disable namespaces with GRE and enable them with VLANs. It was always working well for us on Grizzly with GRE and namespaces, and we could never get it to work without namespaces. Is there any specific reason why the documentation advises to disable it?

Regards,
Daniel

From: Martinx - ジェームズ [mailto:thiagocmarti...@gmail.com]
Sent: Thursday, October 24, 2013 3:58 AM
To: Aaron Rosen
Cc: openstack@lists.openstack.org
Subject: Re: [Openstack] Directional network performance issues with Neutron + OpenvSwitch

Hi Aaron,

Thanks for answering! =) Let's work...
--- TEST #1 - iperf between Network Node and its uplink router (data center's Internet gateway) - OVS br-ex / eth2

# Tenant Namespace route table
root@net-node-1:~# ip netns exec qrouter-46cb8f7a-a3c5-4da7-ad69-4de63f7c34f1 ip route
default via 172.16.0.1 dev qg-50b615b7-c2
172.16.0.0/20 dev qg-50b615b7-c2  proto kernel  scope link  src 172.16.0.2
192.168.210.0/24 dev qr-a1376f61-05  proto kernel  scope link  src 192.168.210.1

# there is an iperf -s running at 172.16.0.1 (Internet side), testing it
root@net-node-1:~# ip netns exec qrouter-46cb8f7a-a3c5-4da7-ad69-4de63f7c34f1 iperf -c 172.16.0.1
Client connecting to 172.16.0.1, TCP port 5001
TCP window size: 22.9 KByte (default)
[  5] local 172.16.0.2 port 58342 connected with 172.16.0.1 port 5001
[ ID] Interval       Transfer     Bandwidth
[  5]  0.0
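Daniel's fix above (lowering the MTU for GRE and pushing it via DHCP) comes down to simple arithmetic plus one dnsmasq option; a sketch (the 1454 value is the one the cited guide uses, which also leaves room for VXLAN's larger overhead, and the file paths are the usual packaging defaults, not taken from the thread):

```shell
# GRE over IPv4 adds a 20-byte outer IP header plus a GRE header:
# 4 bytes basic, 8 bytes when the tunnel key is used (as OVS GRE does).
PHYS_MTU=1500
echo $((PHYS_MTU - 20 - 8))   # largest safe inner MTU with keyed GRE: 1472

# To hand the lowered MTU to instances over DHCP, point the DHCP agent
# at a custom dnsmasq config:
#   /etc/neutron/dhcp_agent.ini:
#       dnsmasq_config_file = /etc/neutron/dnsmasq-neutron.conf
#   /etc/neutron/dnsmasq-neutron.conf:
#       dhcp-option-force=26,1454    # DHCP option 26 = interface MTU
```

As Robert notes, this is a workaround; with working PMTUD the manual lowering would not be needed.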
Re: [Openstack] Directional network performance issues with Neutron + OpenvSwitch
James, I think I'm hitting this problem. I'm using Per-Tenant Routers with Private Networks, GRE tunnels and an L3+DHCP Network Node. The connectivity from behind my Instances is very slow. It takes an eternity to finish apt-get update. If I run apt-get update from within the tenant's Namespace, it goes fine.

If I enable ovs_use_veth, Metadata (and/or DHCP) stops working and I am unable to start new Ubuntu Instances and log in to them... Look:

--
cloud-init start running: Tue, 22 Oct 2013 05:57:39 +. up 4.01 seconds
2013-10-22 06:01:42,989 - util.py[WARNING]: 'http://169.254.169.254/2009-04-04/meta-data/instance-id' failed [3/120s]: url error [[Errno 113] No route to host]
2013-10-22 06:01:45,988 - util.py[WARNING]: 'http://169.254.169.254/2009-04-04/meta-data/instance-id' failed [6/120s]: url error [[Errno 113] No route to host]
--

Is this problem still around?! Should I stay away from GRE tunnels with Havana + Ubuntu 12.04.3? Is it possible to re-enable Metadata when ovs_use_veth = true?

Thanks!
Thiago

On 3 October 2013 06:27, James Page james.p...@ubuntu.com wrote: On 02/10/13 22:49, James Page wrote:

sudo ip netns exec qrouter-d3baf1b1-55ee-42cb-a3f6-9629288e3221 traceroute -n 10.5.0.2 -p 4 --mtu
traceroute to 10.5.0.2 (10.5.0.2), 30 hops max, 65000 byte packets
 1  10.5.0.2  0.950 ms F=1500  0.598 ms  0.566 ms

The PMTU from the l3 gateway to the instance looks OK to me.

I spent a bit more time debugging this; performance from within the router netns on the L3 gateway node looks good in both directions when accessing via the tenant network (10.5.0.2) over the qr-X interface, but when accessing through the external network from within the netns I see the same performance choke upstream into the tenant network. Which would indicate that my problem lies somewhere around the qg-X interface in the router netns - just trying to figure out exactly what - maybe iptables is doing something wonky?
OK - I found a fix, but I'm not sure why this makes a difference; neither my l3-agent nor dhcp-agent configuration had 'ovs_use_veth = True'. I switched this on, cleared everything down, rebooted, and now I see symmetric good performance across all neutron routers. This would point to some sort of underlying bug when ovs_use_veth = False.

- --
James Page
Ubuntu and Debian Developer
james.p...@ubuntu.com
jamesp...@debian.org
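James's workaround, expressed as a config fragment (file paths assumed from the standard Ubuntu packaging; restart the l3 and dhcp agents afterwards):

```ini
; /etc/neutron/l3_agent.ini and /etc/neutron/dhcp_agent.ini
[DEFAULT]
ovs_use_veth = True
```

Note that elsewhere in this thread ovs_use_veth was reported to break Metadata and DHCP on some Havana setups, so test it before relying on it in production.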
Re: [Openstack] Directional network performance issues with Neutron + OpenvSwitch
On 03/10/13 04:43, Martinx - ジェームズ wrote: Mmm... I am unable to compile openvswitch-datapath-dkms from Havana Ubuntu Cloud Archive (on top of a fresh install of Ubuntu 12.04.3), look:

There is a bug in that version; I'm deploying from ppa:ubuntu-cloud-archive/havana-staging which has a version that does work - we are testing everything prior to push through to proposed and updates for rc1 (i.e. this week).

- --
James Page
Ubuntu and Debian Developer
james.p...@ubuntu.com
jamesp...@debian.org
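For anyone following along, switching to the staging archive looks roughly like this (a sketch; the PPA name is from James's message, while the exact package set to reinstall is an assumption):

```shell
# Add the staging cloud archive PPA and pull the fixed packages.
sudo add-apt-repository ppa:ubuntu-cloud-archive/havana-staging
sudo apt-get update
sudo apt-get install openvswitch-datapath-dkms openvswitch-switch

# dkms rebuilds the datapath module against the running kernel; verify:
dkms status
```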
Re: [Openstack] Directional network performance issues with Neutron + OpenvSwitch
Cool! The `ppa:ubuntu-cloud-archive/havana-staging' is the repository I was looking for. It works now... Thanks!

On 3 October 2013 03:02, James Page james.p...@ubuntu.com wrote: [full quote of the previous message trimmed]
Re: [Openstack] Directional network performance issues with Neutron + OpenvSwitch
On 02/10/13 22:49, James Page wrote:

sudo ip netns exec qrouter-d3baf1b1-55ee-42cb-a3f6-9629288e3221 traceroute -n 10.5.0.2 -p 4 --mtu
traceroute to 10.5.0.2 (10.5.0.2), 30 hops max, 65000 byte packets
 1  10.5.0.2  0.950 ms F=1500  0.598 ms  0.566 ms

The PMTU from the l3 gateway to the instance looks OK to me.

I spent a bit more time debugging this; performance from within the router netns on the L3 gateway node looks good in both directions when accessing via the tenant network (10.5.0.2) over the qr-X interface, but when accessing through the external network from within the netns I see the same performance choke upstream into the tenant network. Which would indicate that my problem lies somewhere around the qg-X interface in the router netns - just trying to figure out exactly what - maybe iptables is doing something wonky?

OK - I found a fix but I'm not sure why this makes a difference; neither my l3-agent or dhcp-agent configuration had 'ovs_use_veth = True'; I switched this on, clearing everything down, rebooted and now I see symmetric good performance across all neutron routers. This would point to some sort of underlying bug when ovs_use_veth = False.
Re: [Openstack] Directional network performance issues with Neutron + OpenvSwitch
On 10/02/2013 02:14 AM, James Page wrote: I tcpdump'ed the traffic and I see a lot of duplicate acks, which makes me suspect some sort of packet fragmentation, but it's got me puzzled. Anyone have any ideas about how to debug this further? Or has anyone seen anything like this before?

Duplicate ACKs can be triggered by missing or out-of-order TCP segments. Presumably that would show up in the tcpdump trace, though it might be easier to see if you run the .pcap file through tcptrace -G. Iperf may have a similar option, but if there are actual TCP retransmissions during the run, netperf can be told to tell you about them (when running under Linux):

netperf -H remote -t TCP_STREAM -- -o throughput,local_transport_retrans,remote_transport_retrans

will give the to-remote direction, and

netperf -H remote -t TCP_MAERTS -- -o throughput,local_transport_retrans,remote_transport_retrans

will give the from-remote direction. Or you can take snapshots of netstat -s output from before and after your iperf run(s) and do the math by hand.

rick jones

If the netperf in multiverse isn't new enough to grok the -o option, you can grab the top-of-trunk from http://www.netperf.org/svn/netperf2/trunk via svn.
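Rick's "do the math by hand" step can be scripted. A minimal sketch (the snapshot strings and function name are my own; note that many Linux kernels spell the netstat field "retransmited", so the pattern accepts both spellings):

```python
import re

def tcp_retrans(netstat_s_output: str) -> int:
    """Extract the TCP segments-retransmitted counter from `netstat -s` text."""
    # Matches both "segments retransmited" (common kernel spelling)
    # and "segments retransmitted".
    m = re.search(r"(\d+) segments? retransmit?ted", netstat_s_output)
    return int(m.group(1)) if m else 0

# Abbreviated example snapshots taken before and after an iperf run:
before = "Tcp:\n    123456 segments sent out\n    17 segments retransmited\n"
after = "Tcp:\n    234567 segments sent out\n    942 segments retransmited\n"

# Retransmissions that occurred during the run:
print(tcp_retrans(after) - tcp_retrans(before))  # → 925
```

A large delta in the push direction but not the pull direction would corroborate the duplicate-ACK/loss theory from the tcpdump.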
Re: [Openstack] Directional network performance issues with Neutron + OpenvSwitch
Hi James, have you tried setting the MTU to a lower number of bytes, instead of a higher-than-1500 setting? Say... 1454 instead of 1546? Curious to see if that resolves the issue. If it does, then perhaps there is a path somewhere that had a 1546 PMTU?

-jay

On 10/02/2013 05:14 AM, James Page wrote:

Hi Folks

I'm seeing an odd directional performance issue with my Havana test rig which I'm struggling to debug; details: Ubuntu 12.04 with Linux 3.8 backports kernel, Havana Cloud Archive (currently Havana b3, OpenvSwitch 1.10.2), OpenvSwitch plugin with GRE overlay networks. I've configured the MTUs on all of the physical host network interfaces to 1546 to add capacity for the GRE network headers.

Performance between instances within a single tenant network on different physical hosts is as I would expect (near 1Gbit/s), but I see issues when data transits the Neutron L3 gateway - in the example below churel is a physical host on the same network as the layer 3 gateway:

ubuntu@churel:~$ scp hardware.dump 10.98.191.103:
hardware.dump 100% 67MB 4.8MB/s 00:14

ubuntu@churel:~$ scp 10.98.191.103:hardware.dump .
hardware.dump 100% 67MB 66.8MB/s 00:01

As you can see, pushing data to the instance (via a floating ip 10.98.191.103) is painfully slow, whereas pulling the same data is 10x+ faster (and closer to what I would expect).
iperf confirms the same:

ubuntu@churel:~$ iperf -c 10.98.191.103 -m
Client connecting to 10.98.191.103, TCP port 5001
TCP window size: 22.9 KByte (default)
[ 3] local 10.98.191.11 port 55330 connected with 10.98.191.103 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 60.8 MBytes 50.8 Mbits/sec
[ 3] MSS size 1448 bytes (MTU 1500 bytes, ethernet)

ubuntu@james-page-bastion:~$ iperf -c 10.98.191.11 -m
Client connecting to 10.98.191.11, TCP port 5001
TCP window size: 23.3 KByte (default)
[ 3] local 10.5.0.2 port 52190 connected with 10.98.191.11 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 1.07 GBytes 918 Mbits/sec
[ 3] MSS size 1448 bytes (MTU 1500 bytes, ethernet)

918Mbit/s vs 50Mbit/s. I tcpdump'ed the traffic and I see a lot of duplicate acks, which makes me suspect some sort of packet fragmentation, but it's got me puzzled. Anyone have any ideas about how to debug this further? Or has anyone seen anything like this before?

Cheers

James
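As a side note, the MSS iperf reports is consistent with a plain 1500-byte path MTU rather than the raised 1546: with standard 20-byte IPv4 and TCP headers plus the 12-byte TCP timestamp option, 1500 - 52 = 1448. A quick sanity check (the header sizes are the usual IPv4/TCP values and the function is my own, not anything from this thread):

```python
IP_HDR = 20   # IPv4 header, no options
TCP_HDR = 20  # TCP header, no options
TS_OPT = 12   # TCP timestamp option (10 bytes, padded to 12)

def expected_mss(mtu: int, timestamps: bool = True) -> int:
    """Largest TCP segment payload that fits in one IP packet of `mtu` bytes."""
    return mtu - IP_HDR - TCP_HDR - (TS_OPT if timestamps else 0)

print(expected_mss(1500))  # → 1448, matching iperf's "MSS size 1448 bytes"
print(expected_mss(1546))  # → 1494, what you'd expect if 1546 were the end-to-end MTU
```

So the instance-facing path is negotiating against a 1500-byte MTU, which fits Jay's PMTU line of questioning below.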
Re: [Openstack] Directional network performance issues with Neutron + OpenvSwitch
Hi Gangur

On 02/10/13 17:24, Gangur, Hrushikesh (R D HP Cloud) wrote:
http://techbackground.blogspot.co.uk/2013/06/path-mtu-discovery-and-gre.html

Yeah - I read that already:

sudo ip netns exec qrouter-d3baf1b1-55ee-42cb-a3f6-9629288e3221 traceroute -n 10.5.0.2 -p 4 --mtu
traceroute to 10.5.0.2 (10.5.0.2), 30 hops max, 65000 byte packets
 1 10.5.0.2 0.950 ms F=1500 0.598 ms 0.566 ms

The PMTU from the l3 gateway to the instance looks OK to me.

On 02/10/13 16:37, Jay Pipes wrote: Hi James, have you tried setting the MTU to a lower number of bytes, instead of a higher-than-1500 setting? Say... 1454 instead of 1546? Curious to see if that resolves the issue. If it does, then perhaps there is a path somewhere that had a 1546 PMTU?

Do you mean in instances, or on the physical servers? For context, I hit this problem prior to tweaking MTUs (defaults of 1500 everywhere).
Re: [Openstack] Directional network performance issues with Neutron + OpenvSwitch
On 10/02/2013 12:17 PM, James Page wrote:

Hi Jay

On 02/10/13 16:37, Jay Pipes wrote: Hi James, have you tried setting the MTU to a lower number of bytes, instead of a higher-than-1500 setting? Say... 1454 instead of 1546? Curious to see if that resolves the issue. If it does, then perhaps there is a path somewhere that had a 1546 PMTU?

Do you mean in instances, or on the physical servers?

I mean on the instance vNICs.

For context I hit this problem prior to tweaking MTUs (defaults of 1500 everywhere).

Right, I'm just curious :)

-jay
Re: [Openstack] Directional network performance issues with Neutron + OpenvSwitch
On 02/10/13 17:28, Jay Pipes wrote:

On 02/10/13 16:37, Jay Pipes wrote: Hi James, have you tried setting the MTU to a lower number of bytes, instead of a higher-than-1500 setting? Say... 1454 instead of 1546? Curious to see if that resolves the issue. If it does, then perhaps there is a path somewhere that had a 1546 PMTU?

Do you mean in instances, or on the physical servers?

I mean on the instance vNICs.

Yeah - that's what I thought - that makes no difference either.

--
James Page
Technical Lead
Ubuntu Server Team
james.p...@canonical.com
Re: [Openstack] Directional network performance issues with Neutron + OpenvSwitch
On 02/10/13 17:33, James Page wrote:

On 02/10/13 17:24, Gangur, Hrushikesh (R D HP Cloud) wrote:
http://techbackground.blogspot.co.uk/2013/06/path-mtu-discovery-and-gre.html

Yeah - I read that already:

sudo ip netns exec qrouter-d3baf1b1-55ee-42cb-a3f6-9629288e3221 traceroute -n 10.5.0.2 -p 4 --mtu
traceroute to 10.5.0.2 (10.5.0.2), 30 hops max, 65000 byte packets
 1 10.5.0.2 0.950 ms F=1500 0.598 ms 0.566 ms

The PMTU from the l3 gateway to the instance looks OK to me.

I spent a bit more time debugging this; performance from within the router netns on the L3 gateway node looks good in both directions when accessing via the tenant network (10.5.0.2) over the qr-X interface, but when accessing through the external network from within the netns I see the same performance choke upstream into the tenant network. Which would indicate that my problem lies somewhere around the qg-X interface in the router netns - just trying to figure out exactly what - maybe iptables is doing something wonky?
Re: [Openstack] Directional network performance issues with Neutron + OpenvSwitch
Hi James,

Let me ask you something... Are you using the package `openvswitch-datapath-dkms` from the Havana Ubuntu Cloud Archive with Linux 3.8? I am unable to compile that module on top of Ubuntu 12.04.3 (with Linux 3.8) and I'm wondering if it is still required or not...

Thanks!
Thiago
Re: [Openstack] Directional network performance issues with Neutron + OpenvSwitch
I believe it's still needed: the upstream kernel has pushed back against the modules it provides, but Neutron needs them to deliver the GRE tunnels.

-Rob

On 3 October 2013 13:15, Martinx - ジェームズ thiagocmarti...@gmail.com wrote: Hi James, Let me ask you something... Are you using the package `openvswitch-datapath-dkms` from the Havana Ubuntu Cloud Archive with Linux 3.8? I am unable to compile that module on top of Ubuntu 12.04.3 (with Linux 3.8) and I'm wondering if it is still required or not...
--
Robert Collins
rbtcoll...@hp.com
Distinguished Technologist
HP Converged Cloud
Re: [Openstack] Directional network performance issues with Neutron + OpenvSwitch
Mmm... I am unable to compile openvswitch-datapath-dkms from the Havana Ubuntu Cloud Archive (on top of a fresh install of Ubuntu 12.04.3), look:

root@havabuntu-1:~# uname -a
Linux havabuntu-1 3.8.0-31-generic #46~precise1-Ubuntu SMP Wed Sep 11 18:21:16 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

root@havabuntu-1:~# dpkg -l | grep openvswitch-datapath-dkms
ii openvswitch-datapath-dkms 1.10.2-0ubuntu1~cloud0 Open vSwitch datapath module source - DKMS version

root@havabuntu-1:~# dpkg-reconfigure openvswitch-datapath-dkms
Deleting module version: 1.10.2 completely from the DKMS tree.
Done.
Creating symlink /var/lib/dkms/openvswitch/1.10.2/source -> /usr/src/openvswitch-1.10.2
DKMS: add completed.
Kernel preparation unnecessary for this kernel. Skipping...
Building module:
cleaning build area (bad exit status: 2)
./configure --with-linux='/lib/modules/3.8.0-31-generic/build'
make -C datapath/linux (bad exit status: 2)
Error! Bad return status for module build on kernel: 3.8.0-31-generic (x86_64)
Consult /var/lib/dkms/openvswitch/1.10.2/build/make.log for more information.

Contents of /var/lib/dkms/openvswitch/1.10.2/build/make.log: http://paste.openstack.org/show/47888/

I also have the packages build-essential, linux-headers, etc. installed... So, James, do you have this module compiled on your test environment? I mean, does the command dpkg-reconfigure openvswitch-datapath-dkms work for you?!

NOTE: It also doesn't compile with Linux 3.2 (Ubuntu 12.04.1).

Thanks,
Thiago

On 2 October 2013 22:28, Robert Collins robe...@robertcollins.net wrote: I believe it's still needed: the upstream kernel has pushed back against the modules it provides, but Neutron needs them to deliver the GRE tunnels. -Rob