Re: [openstack-dev] [Neutron] l2pop problems

2014-08-06 Thread Mathieu Rohon
Hi Zang,

On Tue, Aug 5, 2014 at 1:18 PM, Zang MingJie  wrote:
> Hi Mathieu:
>
> We have deployed the new l2pop described in the previous mail in our
> environment, and it works pretty well. It solved the timing problem
> and also saves a lot of l2pop rpc calls. I'm going to file a
> blueprint to propose the changes.

Great, I would be pleased to review this BP.

> On Fri, Jul 18, 2014 at 10:26 PM, Mathieu Rohon wrote:
>> Hi Zang,
>>
>> On Wed, Jul 16, 2014 at 4:43 PM, Zang MingJie  wrote:
>>> Hi, all:
>>>
>>> While working on rebuilding the br-tun flows after an ovs restart[1],
>>> we have found several l2pop problems:
>>>
>>> 1. L2pop depends on agent_boot_time to decide whether to send all
>>> port information or not, but agent_boot_time is unreliable. For
>>> example, if the service receives a port up message before the agent
>>> status report, the agent will never receive the ports hosted on
>>> other agents.
>>
>> You're right, there is a race condition here: if the agent has more
>> than 1 port on the same network and sends update_device_up() for
>> every port before it sends its report_state(), it won't receive the
>> fdb entries concerning this network. Is that the race you are
>> mentioning above?
>> Since report_state is done in a dedicated greenthread, which is
>> launched before the greenthread that manages ovsdb_monitor, the state
>> of the agent should be updated before the agent becomes aware of its
>> ports and sends get_device_details()/update_device_up(), am I wrong?
>> So, after a restart of an agent, agent_uptime() should be less than
>> the agent_boot_time configured by default in the conf when the agent
>> sends its first update_device_up(), so the l2pop MD will be aware of
>> this restart and trigger the cast of all fdb entries to the restarted
>> agent.
>>
>> But I agree that this relies on eventlet thread management and on
>> agent_boot_time, which can be misconfigured by the provider.
>>
>>> 2. If openvswitch is restarted, all flows are lost, including all
>>> l2pop flows, and the agent is unable to fetch or recreate them.
>>>
>>> To resolve the problems, I'm suggesting some changes:
>>>
>>> 1. Because agent_boot_time is unreliable, the service can't decide
>>> whether to send the flooding entry or not. But the agent can build
>>> the flooding entries from the unicast entries; this has already been
>>> implemented[2].
>>>
>>> 2. Create an rpc from agent to service which fetches all fdb entries;
>>> the agent calls the rpc in `provision_local_vlan`, before setting up
>>> any port[3].
>>>
>>> After these changes, the l2pop service part becomes simpler and more
>>> robust, with mainly 2 functions: first, return all fdb entries at
>>> once when requested; second, broadcast a single fdb entry when a
>>> port goes up/down.
>>
>> That's an approach that we considered during the l2pop
>> implementation.
>> Our purpose was to minimize RPC calls. But if this implementation is
>> buggy due to uncontrolled thread order and/or bad usage of the
>> agent_boot_time parameter, it's worth investigating your proposal [3].
>> However, I don't get why [3] depends on [2]. Couldn't we have a
>> network_sync() sent by the agent during provision_local_vlan() which
>> would reconfigure ovs when the agent and/or ovs restarts?
>
> Actually, [3] doesn't strictly depend on [2]. We have encountered
> l2pop problems several times where the unicast is correct but the
> broadcast fails, so we decided to completely ignore the broadcast
> entries in the rpc, deal only with unicast entries, and use the
> unicast entries to build the broadcast rules.

Understood, but it would be interesting to understand why the MD sends
wrong broadcast entries. Do you have any clue?

>
>>
>>
>>> [1] https://bugs.launchpad.net/neutron/+bug/1332450
>>> [2] https://review.openstack.org/#/c/101581/
>>> [3] https://review.openstack.org/#/c/107409/

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Neutron] l2pop problems

2014-08-05 Thread Zang MingJie
Hi Mathieu:

We have deployed the new l2pop described in the previous mail in our
environment, and it works pretty well. It solved the timing problem
and also saves a lot of l2pop rpc calls. I'm going to file a
blueprint to propose the changes.

On Fri, Jul 18, 2014 at 10:26 PM, Mathieu Rohon  wrote:
> Hi Zang,
>
> On Wed, Jul 16, 2014 at 4:43 PM, Zang MingJie  wrote:
>> Hi, all:
>>
>> While working on rebuilding the br-tun flows after an ovs restart[1],
>> we have found several l2pop problems:
>>
>> 1. L2pop depends on agent_boot_time to decide whether to send all
>> port information or not, but agent_boot_time is unreliable. For
>> example, if the service receives a port up message before the agent
>> status report, the agent will never receive the ports hosted on
>> other agents.
>
> You're right, there is a race condition here: if the agent has more
> than 1 port on the same network and sends update_device_up() for
> every port before it sends its report_state(), it won't receive the
> fdb entries concerning this network. Is that the race you are
> mentioning above?
> Since report_state is done in a dedicated greenthread, which is
> launched before the greenthread that manages ovsdb_monitor, the state
> of the agent should be updated before the agent becomes aware of its
> ports and sends get_device_details()/update_device_up(), am I wrong?
> So, after a restart of an agent, agent_uptime() should be less than
> the agent_boot_time configured by default in the conf when the agent
> sends its first update_device_up(), so the l2pop MD will be aware of
> this restart and trigger the cast of all fdb entries to the restarted
> agent.
>
> But I agree that this relies on eventlet thread management and on
> agent_boot_time, which can be misconfigured by the provider.
>
>> 2. If openvswitch is restarted, all flows are lost, including all
>> l2pop flows, and the agent is unable to fetch or recreate them.
>>
>> To resolve the problems, I'm suggesting some changes:
>>
>> 1. Because agent_boot_time is unreliable, the service can't decide
>> whether to send the flooding entry or not. But the agent can build
>> the flooding entries from the unicast entries; this has already been
>> implemented[2].
>>
>> 2. Create an rpc from agent to service which fetches all fdb entries;
>> the agent calls the rpc in `provision_local_vlan`, before setting up
>> any port[3].
>>
>> After these changes, the l2pop service part becomes simpler and more
>> robust, with mainly 2 functions: first, return all fdb entries at
>> once when requested; second, broadcast a single fdb entry when a
>> port goes up/down.
>
> That's an approach that we considered during the l2pop
> implementation.
> Our purpose was to minimize RPC calls. But if this implementation is
> buggy due to uncontrolled thread order and/or bad usage of the
> agent_boot_time parameter, it's worth investigating your proposal [3].
> However, I don't get why [3] depends on [2]. Couldn't we have a
> network_sync() sent by the agent during provision_local_vlan() which
> would reconfigure ovs when the agent and/or ovs restarts?

Actually, [3] doesn't strictly depend on [2]. We have encountered
l2pop problems several times where the unicast is correct but the
broadcast fails, so we decided to completely ignore the broadcast
entries in the rpc, deal only with unicast entries, and use the
unicast entries to build the broadcast rules.
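
A minimal sketch of how broadcast (flooding) targets could be derived
from unicast entries alone. The data layout and the name
`flood_targets` are hypothetical, chosen for illustration, and are not
taken from the change in [2]:

```python
def flood_targets(unicast_entries, local_ip):
    """Derive the tunnel endpoints that should receive broadcast traffic.

    unicast_entries maps a remote agent IP to the (mac, ip) pairs of the
    ports it hosts on one network (hypothetical layout). Any remote agent
    with at least one port belongs to the flooding set.
    """
    return sorted(
        agent_ip
        for agent_ip, ports in unicast_entries.items()
        if ports and agent_ip != local_ip
    )


entries = {
    "192.0.2.10": [("fa:16:3e:00:00:01", "10.0.0.3")],
    "192.0.2.11": [("fa:16:3e:00:00:02", "10.0.0.4"),
                   ("fa:16:3e:00:00:03", "10.0.0.5")],
    "192.0.2.12": [],  # agent with no remaining port on this network
}
targets = flood_targets(entries, local_ip="192.0.2.10")
print(targets)  # ['192.0.2.11']
```

With this approach the agent no longer needs a separate flooding entry
from the service: a wrong or missing broadcast entry in the rpc cannot
leave the flooding rules out of step with the unicast ones.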

>
>
>> [1] https://bugs.launchpad.net/neutron/+bug/1332450
>> [2] https://review.openstack.org/#/c/101581/
>> [3] https://review.openstack.org/#/c/107409/


Re: [openstack-dev] [Neutron] l2pop problems

2014-07-18 Thread Mathieu Rohon
Hi Zang,

On Wed, Jul 16, 2014 at 4:43 PM, Zang MingJie  wrote:
> Hi, all:
>
> While working on rebuilding the br-tun flows after an ovs restart[1],
> we have found several l2pop problems:
>
> 1. L2pop depends on agent_boot_time to decide whether to send all
> port information or not, but agent_boot_time is unreliable. For
> example, if the service receives a port up message before the agent
> status report, the agent will never receive the ports hosted on
> other agents.

You're right, there is a race condition here: if the agent has more
than 1 port on the same network and sends update_device_up() for
every port before it sends its report_state(), it won't receive the
fdb entries concerning this network. Is that the race you are
mentioning above?
Since report_state is done in a dedicated greenthread, which is
launched before the greenthread that manages ovsdb_monitor, the state
of the agent should be updated before the agent becomes aware of its
ports and sends get_device_details()/update_device_up(), am I wrong?
So, after a restart of an agent, agent_uptime() should be less than
the agent_boot_time configured by default in the conf when the agent
sends its first update_device_up(), so the l2pop MD will be aware of
this restart and trigger the cast of all fdb entries to the restarted
agent.

But I agree that this relies on eventlet thread management and on
agent_boot_time, which can be misconfigured by the provider.
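
The race discussed above can be illustrated with a small, simplified
model. This is only a sketch, not the actual l2pop mechanism driver
code: `AgentRecord`, `should_send_all_fdb` and the 180 second value are
assumptions made for the example, and it assumes the server computes
the uptime from the start time it learned through the last
report_state().

```python
from dataclasses import dataclass

AGENT_BOOT_TIME = 180  # seconds; illustrative value for agent_boot_time


@dataclass
class AgentRecord:
    """Server-side view of an agent (hypothetical, simplified)."""
    started_at: float  # start time learned from the last report_state()


def should_send_all_fdb(agent, now, active_ports_on_network):
    """Return True when the server would cast every fdb entry to the agent."""
    uptime = now - agent.started_at
    # Full sync for the first port on a network, or for a recently
    # started agent.
    return active_ports_on_network == 1 or uptime < AGENT_BOOT_TIME


# The agent originally started at t=0 and restarted at t=10000.
stale = AgentRecord(started_at=0.0)      # report_state() not yet processed
fresh = AgentRecord(started_at=10000.0)  # report_state() processed first

# Expected order: the restart is known, so a second port still triggers
# a full sync.
in_order = should_send_all_fdb(fresh, now=10005.0, active_ports_on_network=2)
# Race: update_device_up() is handled first, the uptime looks large, and
# no full sync is sent.
raced = should_send_all_fdb(stale, now=10005.0, active_ports_on_network=2)
print(in_order, raced)  # True False
```

In the second case the restarted agent never gets the full set of fdb
entries, which is the failure described in problem 1.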

> 2. If openvswitch is restarted, all flows are lost, including all
> l2pop flows, and the agent is unable to fetch or recreate them.
>
> To resolve the problems, I'm suggesting some changes:
>
> 1. Because agent_boot_time is unreliable, the service can't decide
> whether to send the flooding entry or not. But the agent can build
> the flooding entries from the unicast entries; this has already been
> implemented[2].
>
> 2. Create an rpc from agent to service which fetches all fdb entries;
> the agent calls the rpc in `provision_local_vlan`, before setting up
> any port[3].
>
> After these changes, the l2pop service part becomes simpler and more
> robust, with mainly 2 functions: first, return all fdb entries at
> once when requested; second, broadcast a single fdb entry when a
> port goes up/down.

That's an approach that we considered during the l2pop
implementation.
Our purpose was to minimize RPC calls. But if this implementation is
buggy due to uncontrolled thread order and/or bad usage of the
agent_boot_time parameter, it's worth investigating your proposal [3].
However, I don't get why [3] depends on [2]. Couldn't we have a
network_sync() sent by the agent during provision_local_vlan() which
would reconfigure ovs when the agent and/or ovs restarts?
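
To make the idea concrete, here is a rough in-memory sketch of a
service with the two functions described in the proposal and an agent
that pulls the full state during provision_local_vlan(). All class and
method names except provision_local_vlan are hypothetical, and no real
RPC layer or Neutron API is involved:

```python
class L2popService:
    """In-memory stand-in for the server side (hypothetical names)."""

    def __init__(self):
        self.fdb = {}          # network_id -> {agent_ip: set of (mac, ip)}
        self.subscribers = []  # agents notified of single-entry changes

    def get_all_fdb_entries(self, network_id):
        # First function: return everything known for a network at once.
        known = self.fdb.get(network_id, {})
        return {ip: set(ports) for ip, ports in known.items()}

    def port_up(self, network_id, agent_ip, mac, ip):
        # Second function: record one entry and broadcast only that entry.
        ports = self.fdb.setdefault(network_id, {}).setdefault(agent_ip, set())
        ports.add((mac, ip))
        for agent in self.subscribers:
            if agent.local_ip != agent_ip:
                agent.add_fdb_entry(network_id, agent_ip, mac, ip)


class Agent:
    def __init__(self, local_ip, service):
        self.local_ip = local_ip
        self.service = service
        self.flows = {}  # network_id -> {agent_ip: set of (mac, ip)}
        service.subscribers.append(self)

    def provision_local_vlan(self, network_id):
        # Pull the full state before any port is set up, so a restart of
        # the agent or of the switch starts from a complete picture.
        entries = self.service.get_all_fdb_entries(network_id)
        entries.pop(self.local_ip, None)
        self.flows[network_id] = entries

    def add_fdb_entry(self, network_id, agent_ip, mac, ip):
        ports = self.flows.setdefault(network_id, {}).setdefault(agent_ip, set())
        ports.add((mac, ip))


service = L2popService()
a = Agent("192.0.2.10", service)
b = Agent("192.0.2.11", service)
service.port_up("net1", "192.0.2.10", "fa:16:3e:00:00:01", "10.0.0.3")
b.flows.clear()                 # simulate an ovs restart losing all flows on b
b.provision_local_vlan("net1")  # the resync restores them
print(b.flows)
```

The point of the sketch is that the resync does not depend on any
timing assumption on the server side: the agent asks for the state
whenever it provisions a network.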


> [1] https://bugs.launchpad.net/neutron/+bug/1332450
> [2] https://review.openstack.org/#/c/101581/
> [3] https://review.openstack.org/#/c/107409/