Re: [openstack-dev] [Neutron] l2pop problems

2014-08-06 Thread Mathieu Rohon
Hi Zang,

On Tue, Aug 5, 2014 at 1:18 PM, Zang MingJie zealot0...@gmail.com wrote:
 Hi Mathieu:

 We have deployed the new l2pop described in the previous mail in our
 environment, and it works pretty well. It solved the timing problem and
 also reduced lots of l2pop RPC calls. I'm going to file a blueprint to
 propose the changes.

great, I would be pleased to review this BP.

 On Fri, Jul 18, 2014 at 10:26 PM, Mathieu Rohon mathieu.ro...@gmail.com 
 wrote:
 Hi Zang,

 On Wed, Jul 16, 2014 at 4:43 PM, Zang MingJie zealot0...@gmail.com wrote:
 Hi, all:

 While resolving ovs restart rebuild br-tun flows[1], we have found
 several l2pop problems:

 1. L2pop depends on agent_boot_time to decide whether to send all
 port information, but agent_boot_time is unreliable: for example, if
 the service receives a port-up message before the agent status report,
 the agent will never receive the ports on other agents.

 You're right, there is a race condition here: if the agent has more
 than one port on the same network and sends update_device_up() for
 every port before it sends its report_state(), it won't receive the
 fdb entries concerning these networks. Is that the race you are
 mentioning above?
 Since the report_state is done in a dedicated greenthread, launched
 before the greenthread that manages ovsdb_monitor, the state of the
 agent should be updated before the agent becomes aware of its ports
 and sends get_device_details()/update_device_up(), am I wrong?
 So, after an agent restart, agent_uptime() should still be less than
 the agent_boot_time configured in the conf when the agent sends its
 first update_device_up(); the l2pop MD will then detect the restart
 and trigger the cast of all fdb entries to the restarted agent.

 But I agree that this relies on eventlet thread management and on an
 agent_boot_time that can be misconfigured by the provider.
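
As a sketch of the heuristic being debated above (function and parameter
names here are illustrative assumptions, as is the 180-second default; the
real l2pop mechanism driver differs in detail):

```python
# Sketch of the agent_boot_time heuristic discussed above. If the agent
# reported an uptime smaller than agent_boot_time, the service assumes it
# just (re)started and casts the full FDB to it. The race: if
# update_device_up() arrives before the first report_state(), the service
# sees a stale (large) uptime and never sends the full FDB.

def should_cast_all_fdb_entries(agent_uptime, agent_boot_time=180):
    """Return True when the service should send the agent the full FDB."""
    return agent_uptime < agent_boot_time
```

This is exactly why the heuristic is fragile: correctness hinges on message
ordering and on an operator-tuned threshold rather than on explicit state.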

 2. If openvswitch restarts, all flows are lost, including all the
 l2pop flows, and the agent is unable to fetch or recreate them.

 To resolve the problems, I'm suggesting some changes:

 1. Because agent_boot_time is unreliable, the service can't decide
 whether to send the flooding entry or not. But the agent can build up
 the flooding entries from the unicast entries; this has already been
 implemented[2].

 2. Create an RPC from agent to service that fetches all fdb entries;
 the agent calls it in `provision_local_vlan`, before setting up any
 port.[3]
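
The first change above, deriving the flooding set from unicast entries, can
be sketched as follows (the entry shape is a simplification I'm assuming for
illustration, not the actual RPC payload):

```python
# Derive the flooding (broadcast) set for a network from its unicast FDB
# entries: the set of remote VTEPs (tunnel endpoints) that host at least
# one port on the network. Entry shape (tunnel_ip, mac, ip) is assumed.

def build_flooding_entries(unicast_entries):
    """Return the sorted, de-duplicated list of remote tunnel endpoints."""
    return sorted({tunnel_ip for tunnel_ip, _mac, _ip in unicast_entries})
```

Because the flooding set is wholly derivable from the unicast entries, the
service never needs to send broadcast entries at all.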

 After these changes, the l2pop service part becomes simpler and more
 robust, with mainly two functions: first, return all fdb entries at
 once when requested; second, broadcast a single fdb entry when a port
 goes up or down.
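
The simplified service side described here might look roughly like this
sketch (class, method, and callback names are hypothetical, not Neutron's
actual RPC API):

```python
# Hypothetical sketch of the simplified l2pop service side: one call
# returning the full FDB for a network, plus single-entry broadcasts
# when a port goes up or down.

class L2popService:
    def __init__(self, notify):
        self._notify = notify   # callback casting one entry to all agents
        self._fdb = {}          # network_id -> {mac: (vtep_ip, ip)}

    def get_fdb_entries(self, network_id):
        # Called by an agent from provision_local_vlan(), before it wires
        # any port: return the whole table at once.
        return dict(self._fdb.get(network_id, {}))

    def port_up(self, network_id, mac, vtep_ip, ip):
        self._fdb.setdefault(network_id, {})[mac] = (vtep_ip, ip)
        self._notify("add", network_id, mac, vtep_ip, ip)

    def port_down(self, network_id, mac):
        entry = self._fdb.get(network_id, {}).pop(mac, None)
        if entry is not None:
            self._notify("remove", network_id, mac, *entry)
```

Fetching the table before any port is set up removes the ordering dependency
between update_device_up() and report_state() entirely.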

 That's an implementation we had been thinking about during the l2pop
 development. Our purpose was to minimize RPC calls. But if this
 implementation is buggy due to uncontrolled thread order and/or bad
 usage of the agent_boot_time parameter, it's worth investigating your
 proposal [3]. However, I don't get why [3] depends on [2]. Couldn't we
 have a network_sync() sent by the agent during provision_local_vlan()
 which would reconfigure ovs when the agent and/or ovs restarts?

 Actually, [3] doesn't strictly depend on [2]. We have encountered
 l2pop problems several times where the unicast entries are correct but
 the broadcast entries fail, so we decided to completely ignore the
 broadcast entries in the RPC, deal only with unicast entries, and use
 them to build the broadcast rules.

Understood, but it could be interesting to understand why the MD sends
wrong broadcast entries. Do you have any clue?




 [1] https://bugs.launchpad.net/neutron/+bug/1332450
 [2] https://review.openstack.org/#/c/101581/
 [3] https://review.openstack.org/#/c/107409/

 ___
 OpenStack-dev mailing list
 OpenStack-dev@lists.openstack.org
 http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
