Re: [Openstack] A Grizzly GRE failure [SOLVED]

2013-05-12 Thread Greg Chavez
I've had a terrible time getting the community to help me with this
problem.  So special thanks to Darragh O'Reilly, and to rkeene on
#openstack, who was mean and a bit of a wisenheimer (I'd use different
words elsewhere) but at least he talked to me and got me to think
twice about my GRE setup.

But enough of that, problem solved and a bug report has been
submitted: https://bugs.launchpad.net/quantum/+bug/1179223.  I added
an "s" to the front of "persists" in the subject, but whatever.  I
always leave one thing in the hotel room, and I always leave one
embarrassing typo.

Here's the part explaining how it was fixed:

SOLUTION:

mysql> delete from ovs_tunnel_endpoints where id = 1;
Query OK, 1 row affected (0.00 sec)

mysql> select * from ovs_tunnel_endpoints;
+-----------------+----+
| ip_address      | id |
+-----------------+----+
| 192.168.239.110 |  3 |
| 192.168.239.114 |  4 |
| 192.168.239.115 |  5 |
| 192.168.239.99  |  2 |
+-----------------+----+
4 rows in set (0.00 sec)

* After doing that, I simply restarted the quantum ovs agents on the
network and compute nodes. The old GRE tunnel is not re-created.
Thereafter, VM network traffic to and from the external network
proceeds without incident.
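For anyone repeating this, the restart on an Ubuntu/Raring install is
roughly the following (the exact service name is my guess from the
Grizzly packaging and may differ on other distros):

# on the network node and on each compute node
service quantum-plugin-openvswitch-agent restart

# then confirm the stale gre-1 port is gone from the tunnel bridge
ovs-vsctl list-ports br-tun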

* Should these tables be cleaned up as well, I wonder:

mysql> select * from ovs_network_bindings;
+--------------------------------------+--------------+------------------+-----------------+
| network_id                           | network_type | physical_network | segmentation_id |
+--------------------------------------+--------------+------------------+-----------------+
| 4e8aacca-8b38-40ac-a628-18cac3168fe6 | gre          | NULL             |               2 |
| af224f3f-8de6-4e0d-b043-6bcd5cb014c5 | gre          | NULL             |               1 |
+--------------------------------------+--------------+------------------+-----------------+
2 rows in set (0.00 sec)

mysql> select * from ovs_tunnel_allocations where allocated != 0;
+-----------+-----------+
| tunnel_id | allocated |
+-----------+-----------+
|         1 |         1 |
|         2 |         1 |
+-----------+-----------+
2 rows in set (0.00 sec)
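
For what it's worth, a quick way to check whether those allocation rows
are actually stale is to cross-reference them against the bindings (just
a sketch, and it assumes the GRE segmentation_id in ovs_network_bindings
is the same value as tunnel_id):

mysql> select a.tunnel_id, b.network_id
    ->   from ovs_tunnel_allocations a
    ->   left join ovs_network_bindings b on b.segmentation_id = a.tunnel_id
    ->  where a.allocated != 0;

Since tunnel IDs 1 and 2 both line up with segmentation IDs of live GRE
networks above, those rows are presumably still in use and not leftovers
from the bad endpoint.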

Cheers, and happy openstacking.  Even you, rkeene!

--Greg Chavez

On Sat, May 11, 2013 at 2:28 PM, Greg Chavez greg.cha...@gmail.com wrote:
 So to be clear:

 * I have three NICs on my network node.  The VM traffic goes out the
 1st nic on 192.168.239.99/24 to the other compute nodes, while
 management traffic goes out the 2nd nic on 192.168.241.99. The 3rd nic
 is external and has no IP.

 * I have four GRE endpoints on the VM network, one at the network node
 (192.168.239.99) and three on compute nodes
 (192.168.239.{110,114,115}), all with IDs 2-5.

 * I have a fifth GRE endpoint with id 1 to 192.168.241.99, the network
 node's management interface.  This was the first tunnel created when I
 deployed the network node because that is how I set the remote_ip in
 the ovs plugin ini.  I corrected the setting later, but the
 192.168.241.99 endpoint persists and,  as your response implies, *this
 extraneous endpoint is the cause of my troubles*.

 My next question then is what is happening? My guess:

 * I ping a guest from the external network using its floater (10.21.166.4).

 * It gets NAT'd at the tenant router on the network node to
 192.168.252.3, at which point an arp request is sent over the unified
 GRE broadcast domain.

 * On a compute node, the arp request is received by the VM, which then
 sends a reply to the tenant router's MAC (which I verified with
 tcpdumps).

 * There are four endpoints for the packet to go down:

 Bridge br-tun
     Port br-tun
         Interface br-tun
             type: internal
     Port gre-1
         Interface gre-1
             type: gre
             options: {in_key=flow, out_key=flow, remote_ip=192.168.241.99}
     Port gre-4
         Interface gre-4
             type: gre
             options: {in_key=flow, out_key=flow, remote_ip=192.168.239.114}
     Port gre-3
         Interface gre-3
             type: gre
             options: {in_key=flow, out_key=flow, remote_ip=192.168.239.110}
     Port patch-int
         Interface patch-int
             type: patch
             options: {peer=patch-tun}
     Port gre-2
         Interface gre-2
             type: gre
             options: {in_key=flow, out_key=flow, remote_ip=192.168.239.99}

 Here's where I get confused.  Does it know that gre-1 is a different
 broadcast domain than the others, or does it see all endpoints as the
 same domain?

 What happens here?  Is this the cause of my network timeouts on
 external connections to the VMs? Does this also explain the sporadic
 nature of the timeouts, why they aren't consistent in frequency or
 duration?

 Finally, what happens when I remove the oddball endpoint from the DB?
 Sounds risky!

 Thanks for your help
 --Greg Chavez

 On Fri, May 10, 2013 at 7:17 PM, Darragh O'Reilly
 dara2002-openst...@yahoo.com wrote:
 I'm not sure how to rectify that. You may have to delete the bad row from
 the DB and restart the agents.

Re: [Openstack] A Grizzly GRE failure

2013-05-11 Thread Greg Chavez
So to be clear:

* I have three NICs on my network node.  The VM traffic goes out the
1st nic on 192.168.239.99/24 to the other compute nodes, while
management traffic goes out the 2nd nic on 192.168.241.99. The 3rd nic
is external and has no IP.

* I have four GRE endpoints on the VM network, one at the network node
(192.168.239.99) and three on compute nodes
(192.168.239.{110,114,115}), all with IDs 2-5.

* I have a fifth GRE endpoint with id 1 to 192.168.241.99, the network
node's management interface.  This was the first tunnel created when I
deployed the network node because that is how I set the remote_ip in
the ovs plugin ini.  I corrected the setting later, but the
192.168.241.99 endpoint persists and,  as your response implies, *this
extraneous endpoint is the cause of my troubles*.

My next question then is what is happening? My guess:

* I ping a guest from the external network using its floater (10.21.166.4).

* It gets NAT'd at the tenant router on the network node to
192.168.252.3, at which point an arp request is sent over the unified
GRE broadcast domain.

* On a compute node, the arp request is received by the VM, which then
sends a reply to the tenant router's MAC (which I verified with
tcpdumps).

* There are four endpoints for the packet to go down:

Bridge br-tun
    Port br-tun
        Interface br-tun
            type: internal
    Port gre-1
        Interface gre-1
            type: gre
            options: {in_key=flow, out_key=flow, remote_ip=192.168.241.99}
    Port gre-4
        Interface gre-4
            type: gre
            options: {in_key=flow, out_key=flow, remote_ip=192.168.239.114}
    Port gre-3
        Interface gre-3
            type: gre
            options: {in_key=flow, out_key=flow, remote_ip=192.168.239.110}
    Port patch-int
        Interface patch-int
            type: patch
            options: {peer=patch-tun}
    Port gre-2
        Interface gre-2
            type: gre
            options: {in_key=flow, out_key=flow, remote_ip=192.168.239.99}

Here's where I get confused.  Does it know that gre-1 is a different
broadcast domain than the others, or does it see all endpoints as the
same domain?
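
One way to poke at this is to dump the flows the OVS agent programs on
br-tun; the flood entries show which output ports a given tunnel key gets
sent to:

ovs-ofctl show br-tun          # map OpenFlow port numbers to the gre-N ports
ovs-ofctl dump-flows br-tun    # see which ports the flood entries output to

That should make it visible whether traffic for this network's tunnel key
is flooded out gre-1 as well as the 192.168.239.x endpoints.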

What happens here?  Is this the cause of my network timeouts on
external connections to the VMs? Does this also explain the sporadic
nature of the timeouts, why they aren't consistent in frequency or
duration?

Finally, what happens when I remove the oddball endpoint from the DB?
Sounds risky!

Thanks for your help
--Greg Chavez

On Fri, May 10, 2013 at 7:17 PM, Darragh O'Reilly
dara2002-openst...@yahoo.com wrote:
 I'm not sure how to rectify that. You may have to delete the bad row from the 
 DB and restart the agents:

 mysql> use quantum;
 mysql> select * from ovs_tunnel_endpoints;
 ...

On Fri, May 10, 2013 at 6:43 PM, Greg Chavez greg.cha...@gmail.com wrote:
  I'm refactoring my question once again (see "A Grizzly arping
  failure" and "Failure to arp by quantum router").

  Quickly, the problem is in a multi-node Grizzly+Raring setup with a
  separate network node and a dedicated VLAN for VM traffic.  External
  connections time out within a minute and don't resume until traffic is
  initiated from the VM.

  I got some rather annoying and hostile assistance just now on IRC and
  while it didn't result in a fix, it got me to realize that the problem
  is possibly with my GRE setup.

  I made a mistake when I originally set this up, assigning the mgmt
  interface of the network node (192.168.241.99) as its GRE remote_ip
  instead of the vm_config network interface (192.168.239.99).  I
  realized my mistake and reconfigured the OVS plugin on the network
  node and moved on.  But now, taking a look at my OVS bridges on the
  network node, I see that the old remote IP is still there!

  Bridge br-tun
      <snip>
      Port gre-1
          Interface gre-1
              type: gre
              options: {in_key=flow, out_key=flow, remote_ip=192.168.241.99}
      <snip>

  This is also on all the compute nodes.

  (Full ovs-vsctl show output here: http://pastebin.com/xbre1fNV)

  What's more, I have this error every time I restart OVS:

  2013-05-10 18:21:24 ERROR [quantum.agent.linux.ovs_lib] Unable to
  execute ['ovs-vsctl', '--timeout=2', 'add-port', 'br-tun', 'gre-5'].
  Exception:
  Command: ['sudo', 'quantum-rootwrap', '/etc/quantum/rootwrap.conf',
  'ovs-vsctl', '--timeout=2', 'add-port', 'br-tun', 'gre-5']
  Exit code: 1
  Stdout: ''
  Stderr: 'ovs-vsctl: cannot create a port named gre-5 because a port
  named gre-5 already exists on bridge br-tun\n'

  Could that be because gre-1 is vestigial and possibly fouling up the
  works by creating two possible paths for VM traffic?

  Is it as simple as removing it with ovs-vsctl or is something else required?

  Or is this actually needed for some reason?  Argh... help!


[Openstack] A Grizzly GRE failure

2013-05-10 Thread Greg Chavez
I'm refactoring my question once again (see "A Grizzly arping
failure" and "Failure to arp by quantum router").

Quickly, the problem is in a multi-node Grizzly+Raring setup with a
separate network node and a dedicated VLAN for VM traffic.  External
connections time out within a minute and don't resume until traffic is
initiated from the VM.
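
(A quick way to see where the replies stop is to watch the raw GRE
traffic on the VM-network NIC of the network node and one compute node,
something like:

tcpdump -n -i eth1 ip proto 47

where 47 is the GRE protocol number and eth1 is just a placeholder for
whichever NIC carries the 192.168.239.0/24 traffic.)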

I got some rather annoying and hostile assistance just now on IRC and
while it didn't result in a fix, it got me to realize that the problem
is possibly with my GRE setup.

I made a mistake when I originally set this up, assigning the mgmt
interface of the network node (192.168.241.99) as its GRE remote_ip
instead of the vm_config network interface (192.168.239.99).  I
realized my mistake and reconfigured the OVS plugin on the network
node and moved on.  But now, taking a look at my OVS bridges on the
network node, I see that the old remote IP is still there!

Bridge br-tun
    <snip>
    Port gre-1
        Interface gre-1
            type: gre
            options: {in_key=flow, out_key=flow, remote_ip=192.168.241.99}
    <snip>

This is also on all the compute nodes.

(Full ovs-vsctl show output here: http://pastebin.com/xbre1fNV)

What's more, I have this error every time I restart OVS:

2013-05-10 18:21:24 ERROR [quantum.agent.linux.ovs_lib] Unable to
execute ['ovs-vsctl', '--timeout=2', 'add-port', 'br-tun', 'gre-5'].
Exception:
Command: ['sudo', 'quantum-rootwrap', '/etc/quantum/rootwrap.conf',
'ovs-vsctl', '--timeout=2', 'add-port', 'br-tun', 'gre-5']
Exit code: 1
Stdout: ''
Stderr: 'ovs-vsctl: cannot create a port named gre-5 because a port
named gre-5 already exists on bridge br-tun\n'
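
The error itself just means the agent tried to add a port that is already
on the bridge, which is easy to confirm with:

ovs-vsctl list-ports br-tun

That should list gre-5 (and the leftover gre-1) already present on br-tun.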

Could that be because gre-1 is vestigial and possibly fouling up the
works by creating two possible paths for VM traffic?

Is it as simple as removing it with ovs-vsctl or is something else required?

Or is this actually needed for some reason?  Argh... help!
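
For the record, manually dropping the port would just be:

ovs-vsctl del-port br-tun gre-1

run on the network node and each compute node. But that alone probably
only hides the symptom: the agent builds its tunnel ports from the
ovs_tunnel_endpoints table, so gre-1 will presumably come back as long as
the stale 192.168.241.99 row is still in the DB, which is why the eventual
fix (first message in this thread) was to delete that row and restart the
agents.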

--
\*..+.-
--Greg Chavez
+//..;};

Re: [Openstack] A Grizzly GRE failure

2013-05-10 Thread Darragh O'Reilly
I'm not sure how to rectify that. You may have to delete the bad row from the 
DB and restart the agents:

mysql> use quantum;
mysql> select * from ovs_tunnel_endpoints;
...


___
Mailing list: https://launchpad.net/~openstack
Post to : openstack@lists.launchpad.net
Unsubscribe : https://launchpad.net/~openstack
More help   : https://help.launchpad.net/ListHelp