Re: [ovs-discuss] [ovs-dev] connect VM on OVN/OVS and BMS on L2

2023-05-19 Thread Tony Liu via discuss
Hi Numan,

Yes, BMS is Bare Metal Server. Another case is SRIOV. I think they are
the same case and I am looking for a solution to cover both of them.

VTEP is the solution I used before. Provider VLAN works for L2
but not L3, and it doesn't support multi-tenancy.
Now I am looking for a solution supported by OVN.
I will look into the reference you provided.


Thanks!
Tony

From: Numan Siddique 
Sent: May 19, 2023 04:27 PM
To: Tony Liu
Cc: ovs-dev; ovs-discuss; Vladislav Odintsov
Subject: Re: [ovs-dev] [ovs-discuss] connect VM on OVN/OVS and BMS on L2

On Fri, May 19, 2023 at 12:09 AM Tony Liu  wrote:
>
> Hi Numan,
>
> A provider VLAN network is able to connect VM and BMS on L2.
> I am going to push this topic further.
>
> Provider VLAN network is different from regular virtual network.
> It seems that I can't create a logical router to connect a provider VLAN
> and regular VN. The way I am using provider VLAN is as external network
> whose GW is on physical router. Also in a multi-tenancy cloud, provider
> VLAN network can't be created by user. I wonder if we can build a regular
> VN to connect VM and BMS?
>
> OVN is using Geneve, which is not commonly supported by networking devices.
> VXLAN doesn't seem to be an option because OVN needs Geneve to carry metadata.
> I see some VXLAN support in OVN, but I am not sure how it works or for which
> case exactly.
>
> Tungsten Fabric supports this because it uses VXLAN as the overlay. To connect
> VM to BMS, vrouter will create a VXLAN tunnel from the compute node to the BMS
> VTEP (typically the ToR). That's how BMS is brought into the overlay by VXLAN,
> and it will be treated just like a VM. With EVPN support in the control plane,
> routing info is populated between vrouter and VTEP. And with some orchestration
> of networking devices, the networking support for BMS is seamless. One concern
> is that there is no SG for BMS, which can actually be supported by the
> networking device.
>
> Can something similar be supported by OVN, or is there any other overlay
> solution supported by OVN to connect BMS?
>

I think I should have asked this question earlier. What is BMS? I
presumed it to be bare metal server, and thought that you wanted
communication between a bare metal server on your L2 network and a VM
on an OVN logical switch with a localnet port.

OVN supports ovn-controller-vtep to connect OVN to a vtep switch.  I
don't have much experience there.

Maybe you can check it out ?
https://www.ovn.org/support/dist-docs/ovn-controller-vtep.8.html

Adding @Vladislav Odintsov to the thread, who has been using
ovn-controller-vtep to connect to a VTEP switch, in case he has any
comments.
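
For reference, a minimal sketch of what the OVN side of that looks like, assuming an ovn-controller-vtep instance is already connected to the VTEP database. The names sw0, sw0-vtep, br-vtep and lswitch0 are placeholders, not taken from this thread, and run() is a hypothetical dry-run helper so the commands can be reviewed before they touch the NB database:

```shell
#!/bin/sh
# Dry-run wrapper: with DRY_RUN set, print the command instead of running it.
run() { if [ -n "${DRY_RUN:-}" ]; then echo "+ $*"; else "$@"; fi; }

# Attach a VTEP gateway to an OVN logical switch.
# $1: OVN logical switch     $2: new logical port name (placeholder)
# $3: VTEP physical switch   $4: VTEP logical switch
add_vtep_port() {
    run ovn-nbctl lsp-add "$1" "$2"
    run ovn-nbctl lsp-set-type "$2" vtep
    run ovn-nbctl lsp-set-options "$2" \
        "vtep-physical-switch=$3" "vtep-logical-switch=$4"
}

# Example: DRY_RUN=1 add_vtep_port sw0 sw0-vtep br-vtep lswitch0
```

Running it with DRY_RUN=1 only prints the three ovn-nbctl commands; dropping DRY_RUN executes them.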

Thanks
Numan

>
> Thanks!
> Tony
> 
> From: Numan Siddique 
> Sent: May 18, 2023 11:18 AM
> To: Tony Liu
> Cc: ovs-dev; ovs-discuss
> Subject: Re: [ovs-discuss] [ovs-dev] connect VM on OVN/OVS and BMS on L2
>
>
>
> On Thu, May 18, 2023, 1:15 PM Tony Liu via discuss
> <ovs-discuss@openvswitch.org> wrote:
> Hi Numan,
>
> Good to see you pick it up, no need to bother OpenStack alias.
> My ultimate target is to support VM and BMS L2 connectivity with OpenStack.
> I used to make that work with another virtual networking stack, but I am not
> sure how much it's supported by OVN/OVS. Any comments in that context?
>
>
> It is definitely supported with OpenStack.
>
> I think you need to create a provider VLAN Neutron network.
>
> Thanks
> Numan
>
>
> Thanks!
> Tony
> 
> From: Numan Siddique <num...@ovn.org>
> Sent: May 18, 2023 09:51 AM
> To: Tony Liu
> Cc: ovs-discuss; ovs-dev
> Subject: Re: [ovs-dev] connect VM on OVN/OVS and BMS on L2
>
> On Thu, May 18, 2023 at 12:19 PM Tony Liu
> <tonyliu0...@hotmail.com> wrote:
> >
> > Hi,
> >
> > Could anyone share experiences or point to some reference about how
> > to connect VM on OVN/OVS and BMS on L2? Or say, how can I connect BMS
> > to a logical switch on OVN/OVS?
>
> For this you need to create a localnet port in the logical switch.
>
> Something like this:
>
> ovn-nbctl ls-add public
> # localnet port
> ovn-nbctl lsp-add public ln-public
> ovn-nbctl lsp-set-type ln-public localnet
> ovn-nbctl lsp-set-addresses ln-public unknown
> ovn-nbctl lsp-set-options ln-public network_name=public
>
> # create a few VM ports
>
> ovn-nbctl lsp-add public pub-port1
> ovn-nbctl lsp-set-addresses pub-port1 "50:54:00:00:00:03 172.168.0.100"
> # (assuming your L2 network is 172.168.0.0/24)
>
> ovn-nbctl lsp-add public pub-port2
> ovn-nbctl lsp-set-addresses pub-port2 "50:54:00:00:00:04 172.168.0.

Re: [ovs-discuss] [ovs-dev] connect VM on OVN/OVS and BMS on L2

2023-05-18 Thread Tony Liu via discuss
Hi Numan,

A provider VLAN network is able to connect VM and BMS on L2.
I am going to push this topic further.

Provider VLAN network is different from regular virtual network.
It seems that I can't create a logical router to connect a provider VLAN
and regular VN. The way I am using provider VLAN is as external network
whose GW is on physical router. Also in a multi-tenancy cloud, provider
VLAN network can't be created by user. I wonder if we can build a regular
VN to connect VM and BMS?

OVN is using Geneve, which is not commonly supported by networking devices.
VXLAN doesn't seem to be an option because OVN needs Geneve to carry metadata.
I see some VXLAN support in OVN, but I am not sure how it works or for which
case exactly.

Tungsten Fabric supports this because it uses VXLAN as the overlay. To connect
VM to BMS, vrouter will create a VXLAN tunnel from the compute node to the BMS
VTEP (typically the ToR). That's how BMS is brought into the overlay by VXLAN,
and it will be treated just like a VM. With EVPN support in the control plane,
routing info is populated between vrouter and VTEP. And with some orchestration
of networking devices, the networking support for BMS is seamless. One concern
is that there is no SG for BMS, which can actually be supported by the
networking device.

Can something similar be supported by OVN, or is there any other overlay
solution supported by OVN to connect BMS?


Thanks!
Tony

From: Numan Siddique 
Sent: May 18, 2023 11:18 AM
To: Tony Liu
Cc: ovs-dev; ovs-discuss
Subject: Re: [ovs-discuss] [ovs-dev] connect VM on OVN/OVS and BMS on L2



On Thu, May 18, 2023, 1:15 PM Tony Liu via discuss
<ovs-discuss@openvswitch.org> wrote:
Hi Numan,

Good to see you pick it up, no need to bother OpenStack alias.
My ultimate target is to support VM and BMS L2 connectivity with OpenStack.
I used to make that work with another virtual networking stack, but I am not
sure how much it's supported by OVN/OVS. Any comments in that context?


It is definitely supported with OpenStack.

I think you need to create a provider VLAN Neutron network.

Thanks
Numan


Thanks!
Tony

From: Numan Siddique <num...@ovn.org>
Sent: May 18, 2023 09:51 AM
To: Tony Liu
Cc: ovs-discuss; ovs-dev
Subject: Re: [ovs-dev] connect VM on OVN/OVS and BMS on L2

On Thu, May 18, 2023 at 12:19 PM Tony Liu
<tonyliu0...@hotmail.com> wrote:
>
> Hi,
>
> Could anyone share experiences or point to some reference about how
> to connect VM on OVN/OVS and BMS on L2? Or say, how can I connect BMS
> to a logical switch on OVN/OVS?

For this you need to create a localnet port in the logical switch.

Something like this:

ovn-nbctl ls-add public
# localnet port
ovn-nbctl lsp-add public ln-public
ovn-nbctl lsp-set-type ln-public localnet
ovn-nbctl lsp-set-addresses ln-public unknown
ovn-nbctl lsp-set-options ln-public network_name=public

# create a few VM ports

ovn-nbctl lsp-add public pub-port1
ovn-nbctl lsp-set-addresses pub-port1 "50:54:00:00:00:03 172.168.0.100"
# (assuming your L2 network is 172.168.0.0/24)

ovn-nbctl lsp-add public pub-port2
ovn-nbctl lsp-set-addresses pub-port2 "50:54:00:00:00:04 172.168.0.101"

# On the compute node(s) where you create the VMs

ovs-vsctl set open . external_ids:ovn-bridge-mappings="public:br-ex"

ovs-vsctl add-br br-ex
# assuming eth1 is your physical interface connecting to your L2 switch
ovs-vsctl add-port br-ex eth1

After this, your VM (bound to logical port pub-port1)
should be able to communicate with your BMS.
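
A quick way to sanity-check that wiring on a compute node, as a sketch: has_mapping is a hypothetical helper that only inspects the mapping string, and the names public, br-ex, eth1 and ln-public are the ones assumed in the example commands.

```shell
#!/bin/sh
# Return 0 if the ovn-bridge-mappings value in $1 (comma-separated
# "network:bridge" pairs) maps network name $2 to bridge $3.
has_mapping() {
    case ",$1," in
        *",$2:$3,"*) return 0 ;;
        *) return 1 ;;
    esac
}

# On a compute node (needs ovs-vsctl and ovn-nbctl access):
#   mappings=$(ovs-vsctl get open . external_ids:ovn-bridge-mappings | tr -d '"')
#   has_mapping "$mappings" public br-ex && echo "bridge mapping ok"
#   ovs-vsctl port-to-br eth1          # bridge that eth1 is attached to
#   ovn-nbctl lsp-get-type ln-public   # type of the logical port
```

The network_name option on the localnet port has to match the network name on the left-hand side of the mapping, which is what the helper checks.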


Thanks
Numan

>
>
> Thanks!
> Tony
> ___
> dev mailing list
> d...@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>
___
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss


Re: [ovs-discuss] [ovs-dev] connect VM on OVN/OVS and BMS on L2

2023-05-18 Thread Tony Liu via discuss
Hi Numan,

Good to see you pick it up, no need to bother OpenStack alias.
My ultimate target is to support VM and BMS L2 connectivity with OpenStack.
I used to make that work with another virtual networking stack, but I am not
sure how much it's supported by OVN/OVS. Any comments in that context?


Thanks!
Tony

From: Numan Siddique 
Sent: May 18, 2023 09:51 AM
To: Tony Liu
Cc: ovs-discuss; ovs-dev
Subject: Re: [ovs-dev] connect VM on OVN/OVS and BMS on L2

On Thu, May 18, 2023 at 12:19 PM Tony Liu  wrote:
>
> Hi,
>
> Could anyone share experiences or point to some reference about how
> to connect VM on OVN/OVS and BMS on L2? Or say, how can I connect BMS
> to a logical switch on OVN/OVS?

For this you need to create a localnet port in the logical switch.

Something like this:

ovn-nbctl ls-add public
# localnet port
ovn-nbctl lsp-add public ln-public
ovn-nbctl lsp-set-type ln-public localnet
ovn-nbctl lsp-set-addresses ln-public unknown
ovn-nbctl lsp-set-options ln-public network_name=public

# create a few VM ports

ovn-nbctl lsp-add public pub-port1
ovn-nbctl lsp-set-addresses pub-port1 "50:54:00:00:00:03 172.168.0.100"
# (assuming your L2 network is 172.168.0.0/24)

ovn-nbctl lsp-add public pub-port2
ovn-nbctl lsp-set-addresses pub-port2 "50:54:00:00:00:04 172.168.0.101"

# On the compute node(s) where you create the VMs

ovs-vsctl set open . external_ids:ovn-bridge-mappings="public:br-ex"

ovs-vsctl add-br br-ex
# assuming eth1 is your physical interface connecting to your L2 switch
ovs-vsctl add-port br-ex eth1

After this, your VM (bound to logical port pub-port1)
should be able to communicate with your BMS.


Thanks
Numan

>
>
> Thanks!
> Tony
> ___
> dev mailing list
> d...@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>


[ovs-discuss] connect VM on OVN/OVS and BMS on L2

2023-05-18 Thread Tony Liu via discuss
Hi,

Could anyone share experiences or point to some reference about how
to connect VM on OVN/OVS and BMS on L2? Or say, how can I connect BMS
to a logical switch on OVN/OVS?


Thanks!
Tony


Re: [ovs-discuss] [External] : Re: SR-IOV OVN OpenStack Mellanox

2022-02-15 Thread Tony Liu
You want a virtual networking data plane on the SmartNIC.
This idea came up a few years ago, but I am not aware of any SmartNIC supporting
any virtual networking implementation. I think that's because it's a lot
easier to achieve the same with a physical networking device.

Tony

From: Brendan Doyle 
Sent: February 15, 2022 10:06 AM
To: Tony Liu; Satish Patel
Cc: ovs-discuss
Subject: Re: [ovs-discuss] [External] : Re: SR-IOV OVN OpenStack Mellanox


I want the libvirt/KVM/QEMU VMs that I currently create, which "hook" into an
OVN overlay network using the libvirt 'openvswitch' network, to work in the
same OVN overlay network, with the various logical switches, routers, gateways,
NAT rules, ACLs, etc. working as is, except that they bypass br-int and go to
the SmartNIC. So somehow the OVS flows that OVN creates in br-int will need to
be implemented on the SmartNIC.

It is not enough to have the SmartNIC do the encapsulation. The OVN logical
gateways do various NAT and routing operations, and the OVN logical switches
have various ACL rules for security, etc. If traffic bypasses br-int and goes
directly to the SmartNIC, all of these still need to be "honored" by the
SmartNIC. Like I said, I thought I saw a Mellanox presentation that talks about
doing something like this with representor ports, but I don't know the details.
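
I believe the mechanism behind that kind of setup is OVS hardware offload: the NIC's embedded switch is put into switchdev mode, each VF gets a representor port that is plugged into br-int like any other VIF, and OVS pushes its datapath flows down to the NIC. A rough sketch of the host-side switches, assuming a PF at the hypothetical PCI address 0000:03:00.0. Driver-specific preparation (creating VFs, unbinding them before the mode change) is left out, and run() is a hypothetical dry-run helper:

```shell
#!/bin/sh
# Dry-run wrapper: with DRY_RUN set, print the command instead of running it.
run() { if [ -n "${DRY_RUN:-}" ]; then echo "+ $*"; else "$@"; fi; }

# Put the NIC embedded switch into switchdev mode and turn on OVS
# hardware offload. $1: PCI address of the PF (placeholder value).
enable_hw_offload() {
    run devlink dev eswitch set "pci/$1" mode switchdev
    run ovs-vsctl set Open_vSwitch . other_config:hw-offload=true
}

# Example: DRY_RUN=1 enable_hw_offload 0000:03:00.0
# ovs-vswitchd has to be restarted for hw-offload to take effect; after
# that, offloaded datapath flows can be listed with:
#   ovs-appctl dpctl/dump-flows type=offloaded
```

In this model OVN itself is unchanged: ovn-controller still programs br-int, and only the datapath flows derived from it are offloaded.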


On 15/02/2022 17:27, Tony Liu wrote:
> OVN is a virtual networking implementation.
> With OpenStack Neutron, follow the doc and the VM will connect directly to a NIC VF.
> I don't see that OVN integration is needed.
>
> It's another story if you want connectivity between OVN-based VM and
> SRIOV-based VM.
>
> Tony
> 
> From: Brendan Doyle 
> Sent: February 15, 2022 08:53 AM
> To: Tony Liu; Satish Patel
> Cc: ovs-discuss
> Subject: Re: [ovs-discuss] [External] : Re: SR-IOV OVN OpenStack Mellanox
>
>
> Yes but how is that integrated into OVN?
>
> On 15/02/2022 16:41, Tony Liu wrote:
>> SRIOV connects VM directly to NIC VF and bypasses virtual networking stack.
>> SmartNIC is another story where virtual networking stack can be installed on 
>> the NIC.
>> SRIOV is supported by Neutron.
>> https://docs.openstack.org/neutron/xena/admin/config-sriov.html
>>
>> Tony
>> 
>> From: discuss  on behalf of Brendan 
>> Doyle 
>> Sent: February 15, 2022 07:10 AM
>> To: Satish Patel
>> Cc: ovs-discuss
>> Subject: Re: [ovs-discuss] [External] : Re:  SR-IOV OVN OpenStack Mellanox
>>
>> I'm kind of looking for a high-level answer first: yes, it is possible and is
>> integrated into the OVN control plane, or no, it's not there yet.
>> And is anyone doing this?
>>
>>
>> On 15/02/2022 14:03, Satish Patel wrote:
>>> Not sure if this is what you are looking for
>>> https://docs.nvidia.com/networking/display/TAN10/ASAP+OVS+Offload
>>>
>>> On Tue, Feb 15, 2022 at 5:47 AM Brendan Doyle  
>>> wrote:
>>>> Hi,
>>>>
>>>> I'm trying to understand if OVN supports SR-IOV. I found some OpenStack
>>>> documentation:
>>>> https://docs.openstack.org/project-deploy-guide/charm-deployment-guide/wallaby/app-ovn.html#configuration
>>>> that suggests it might, but it is short on details, with specifics
>>>> abstracted via the OpenStack CMS.
>>>>
>>>> Also in the OVN Architecture documentation there are hints of support:
>>>>
>>>> "For instances connected through representor ports, typically used with
>>>> hardware offload, the ovn-controller may on CMS direction consult a VIF
>>>> plug provider for representor port lookup and plug them into the
>>>> integration bridge (please refer to
>>>> Documentation/topics/vif-plug-providers/vif-plug-providers.rst for more
>>>> information)."
>>>>
>>>> But again short on details.
>>>>
>>>> So I believe something like a CX-5/6/7 would have the capability to do
>>>> this, but there would have to be some sort of OVN hook for the OVS flows
>>>> created by OVN to be "copied/moved" to the H/W so that encapsulation,
>>>> NAT, distributed routin

Re: [ovs-discuss] [External] : Re: SR-IOV OVN OpenStack Mellanox

2022-02-15 Thread Tony Liu
OVN is a virtual networking implementation.
With OpenStack Neutron, follow the doc and the VM will connect directly to a NIC VF.
I don't see that OVN integration is needed.

It's another story if you want connectivity between OVN-based VM and
SRIOV-based VM.

Tony

From: Brendan Doyle 
Sent: February 15, 2022 08:53 AM
To: Tony Liu; Satish Patel
Cc: ovs-discuss
Subject: Re: [ovs-discuss] [External] : Re: SR-IOV OVN OpenStack Mellanox


Yes but how is that integrated into OVN?

On 15/02/2022 16:41, Tony Liu wrote:
> SRIOV connects VM directly to NIC VF and bypasses virtual networking stack.
> SmartNIC is another story where virtual networking stack can be installed on 
> the NIC.
> SRIOV is supported by Neutron.
> https://docs.openstack.org/neutron/xena/admin/config-sriov.html
>
> Tony
> 
> From: discuss  on behalf of Brendan 
> Doyle 
> Sent: February 15, 2022 07:10 AM
> To: Satish Patel
> Cc: ovs-discuss
> Subject: Re: [ovs-discuss] [External] : Re:  SR-IOV OVN OpenStack Mellanox
>
> I'm kind of looking for a high-level answer first: yes, it is possible and is
> integrated into the OVN control plane, or no, it's not there yet.
> And is anyone doing this?
>
>
> On 15/02/2022 14:03, Satish Patel wrote:
>> Not sure if this is what you are looking for
>> https://docs.nvidia.com/networking/display/TAN10/ASAP+OVS+Offload
>>
>> On Tue, Feb 15, 2022 at 5:47 AM Brendan Doyle  
>> wrote:
>>> Hi,
>>>
>>> I'm trying to understand if OVN supports SR-IOV. I found some OpenStack
>>> documentation:
>>> https://docs.openstack.org/project-deploy-guide/charm-deployment-guide/wallaby/app-ovn.html#configuration
>>> that suggests it might, but it is short on details, with specifics
>>> abstracted via the OpenStack CMS.
>>>
>>> Also in the OVN Architecture documentation there are hints of support:
>>>
>>> "For instances connected through representor ports, typically used with
>>> hardware offload, the ovn-controller may on CMS direction consult a VIF
>>> plug provider for representor port lookup and plug them into the
>>> integration bridge (please refer to
>>> Documentation/topics/vif-plug-providers/vif-plug-providers.rst for more
>>> information)."
>>>
>>> But again short on details.
>>>
>>> So I believe something like a CX-5/6/7 would have the capability to do
>>> this, but there would have to be some sort of OVN hook for the OVS flows
>>> created by OVN to be "copied/moved" to the H/W so that encapsulation, NAT,
>>> distributed routing, ACLs, etc. are done in the hardware. I can't find any
>>> details on this, nor on what the control plane for programming the hardware
>>> would be: ovn-nbctl? ovsdbapp? Some other out-of-band control plane?
>>> Also, from what I gather from the OpenStack docs, this seems experimental
>>> and limited to VXLAN encapsulation?
>>>
>>> At present I use a libvirt OVN hook that hooks KVM/QEMU VMs into OVN
>>> br-int, but these are using software VIFs. I'm trying to ascertain whether
>>> I can have these VMs use SR-IOV and still have them integrated into the
>>> OVN logical networks.
>>>
>>>
>>> Any pointers would be welcome.
>>>
>>> Thanks
>>>
>>>
>>> Brendan
>>>


Re: [ovs-discuss] [External] : Re: SR-IOV OVN OpenStack Mellanox

2022-02-15 Thread Tony Liu
SRIOV connects VM directly to NIC VF and bypasses virtual networking stack.
SmartNIC is another story where virtual networking stack can be installed on 
the NIC.
SRIOV is supported by Neutron.
https://docs.openstack.org/neutron/xena/admin/config-sriov.html

Tony

From: discuss  on behalf of Brendan Doyle 

Sent: February 15, 2022 07:10 AM
To: Satish Patel
Cc: ovs-discuss
Subject: Re: [ovs-discuss] [External] : Re:  SR-IOV OVN OpenStack Mellanox

I'm kind of looking for a high-level answer first: yes, it is possible and is
integrated into the OVN control plane, or no, it's not there yet.
And is anyone doing this?


On 15/02/2022 14:03, Satish Patel wrote:
> Not sure if this is what you are looking for
> https://docs.nvidia.com/networking/display/TAN10/ASAP+OVS+Offload
>
> On Tue, Feb 15, 2022 at 5:47 AM Brendan Doyle  
> wrote:
>> Hi,
>>
>> I'm trying to understand if OVN supports SR-IOV. I found some OpenStack
>> documentation:
>> https://docs.openstack.org/project-deploy-guide/charm-deployment-guide/wallaby/app-ovn.html#configuration
>> that suggests it might, but it is short on details, with specifics
>> abstracted via the OpenStack CMS.
>>
>> Also in the OVN Architecture documentation there are hints of support:
>>
>> "For instances connected through representor ports, typically used with
>> hardware offload, the ovn-controller may on CMS direction consult a VIF
>> plug provider for representor port lookup and plug them into the
>> integration bridge (please refer to
>> Documentation/topics/vif-plug-providers/vif-plug-providers.rst for more
>> information)."
>>
>> But again short on details.
>>
>> So I believe something like a CX-5/6/7 would have the capability to do
>> this, but there would have to be some sort of OVN hook for the OVS flows
>> created by OVN to be "copied/moved" to the H/W so that encapsulation, NAT,
>> distributed routing, ACLs, etc. are done in the hardware. I can't find any
>> details on this, nor on what the control plane for programming the hardware
>> would be: ovn-nbctl? ovsdbapp? Some other out-of-band control plane?
>> Also, from what I gather from the OpenStack docs, this seems experimental
>> and limited to VXLAN encapsulation?
>>
>> At present I use a libvirt OVN hook that hooks KVM/QEMU VMs into OVN
>> br-int, but these are using software VIFs. I'm trying to ascertain whether
>> I can have these VMs use SR-IOV and still have them integrated into the
>> OVN logical networks.
>>
>>
>> Any pointers would be welcome.
>>
>> Thanks
>>
>>
>> Brendan
>>


Re: [ovs-discuss] Timeout waiting for network-vif-plugged for instance with vm_state building and task_state spawning

2022-01-03 Thread Tony Liu
You can send SIGHUP to Neutron to apply configuration without restarting it.
Enable debug in the configuration, and "kill -s SIGHUP <pid>" will get you
all debugging logs.
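
As a sketch of that step: reload_neutron is a hypothetical helper, and finding the PID with pgrep assumes the process command line contains "neutron-server" (in containerized deployments such as kolla, this would need to run where the process is visible). run() is a hypothetical dry-run helper.

```shell
#!/bin/sh
# Dry-run wrapper: with DRY_RUN set, print the command instead of running it.
run() { if [ -n "${DRY_RUN:-}" ]; then echo "+ $*"; else "$@"; fi; }

# Send SIGHUP to neutron-server so it re-reads its configuration.
# $1: PID of the neutron-server parent process.
reload_neutron() {
    [ -n "$1" ] || { echo "usage: reload_neutron <pid>" >&2; return 2; }
    run kill -s HUP "$1"
}

# Example, after setting "debug = True" in the [DEFAULT] section of
# neutron.conf:
#   reload_neutron "$(pgrep -o -f neutron-server)"
```

The POSIX signal name HUP is used here; shells that accept SIGHUP treat both spellings the same.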

Tony

From: discuss  on behalf of Christian 
Stelter 
Sent: January 3, 2022 08:54 AM
To: ovs-discuss@openvswitch.org
Subject: Re: [ovs-discuss] Timeout waiting for network-vif-plugged for instance 
with vm_state building and task_state spawning

Hi!

A small update after restarting neutron server (the problem has vanished since
the restart last week, as in your case) and reading the posts/bug reports you
sent.

We observed an additional phenomenon when running into the problem, which makes
us wonder whether OVN may also be part of the problem, and which isn't
mentioned in the posts/bugs.

Neutron gets the port up/network-vif-plugged report from OVN twice, and in at
least one case with a delay of >5h.

We will have a closer look when this problem reoccurs and verify whether the
multiple up/plugged reports can also be reproduced.

Kind regards,

Christian Stelter

On Wed, Dec 29, 2021 at 11:28 PM Tony Liu
<tonyliu0...@hotmail.com> wrote:
I've been hitting this issue from time to time. There are discussions and bugs 
for this.

http://lists.openstack.org/pipermail/openstack-discuss/2021-November/025941.html
http://lists.openstack.org/pipermail/openstack-discuss/2021-November/025892.html
https://bugs.launchpad.net/heat/+bug/1694371
https://bugs.launchpad.net/nova/+bug/1944779

I don't think it's about OVN/OVS, because the port is up in OVN NB,
and restarting Neutron makes it work again. I suspect Neutron either
doesn't pick it up or doesn't send it out to Nova compute.

When I hit this issue, I enable debugging for Neutron and restart it, but that
makes the problem go away. If I keep logging at debug level until the next
occurrence, it will dump too many logs. It would be good for Neutron to support
enabling debug logging on the fly.

If it happens to you more frequently, you can try to enable debug and see if
that captures anything helpful.


Tony


From: discuss <ovs-discuss-boun...@openvswitch.org> on behalf of Christian Stelter <refu...@last-refuge.net>
Sent: December 29, 2021 09:51 AM
To: ovs-discuss@openvswitch.org
Subject: [ovs-discuss] Timeout waiting for network-vif-plugged for instance 
with vm_state building and task_state spawning

Hi!

We observe in our well aged OpenStack installation (kolla-ansible deployed 
victoria release) strange timeouts while creating new instances. It seems that 
the feedback from the ovn controller(?) doesn't reach the nova compute/libvirt 
process within 300s.

This does not happen all the time, but 1/3 to 1/5 of the launches have problems.

This is what we see in the logs:

---  nova-compute ---
2021-12-29 07:27:06.972 7 INFO nova.compute.claims 
[req-57321447-272f-4820-813d-925a43d69462 0b16e2645828414d86f094c2a7a693eb 
66fcb198228e4d9d97f09dc7ce1d5130 - default default] [instance: 
aef0a1b3-99d6-43e2-89db-b7bd4e0516d9] Claim successful on node compute07
2021-12-29 07:27:07.757 7 INFO nova.virt.libvirt.driver 
[req-57321447-272f-4820-813d-925a43d69462 0b16e2645828414d86f094c2a7a693eb 
66fcb198228e4d9d97f09dc7ce1d5130 - default default] [instance: 
aef0a1b3-99d6-43e2-89db-b7bd4e0516d9] Creating image
2021-12-29 07:27:12.514 7 INFO nova.compute.manager 
[req-01948206-8ff3-404c-b71f-86ed7d1791d4 - - - - -] [instance: 
aef0a1b3-99d6-43e2-89db-b7bd4e0516d9] VM Started (Lifecycle Event)
2021-12-29 07:27:12.559 7 INFO nova.compute.manager 
[req-01948206-8ff3-404c-b71f-86ed7d1791d4 - - - - -] [instance: 
aef0a1b3-99d6-43e2-89db-b7bd4e0516d9] VM Paused (Lifecycle Event)
2021-12-29 07:27:12.649 7 INFO nova.compute.manager 
[req-01948206-8ff3-404c-b71f-86ed7d1791d4 - - - - -] [instance: 
aef0a1b3-99d6-43e2-89db-b7bd4e0516d9] During sync_power_state the instance has 
a pending task (spawning). Skip.
2021-12-29 07:32:12.518 7 WARNING nova.virt.libvirt.driver 
[req-57321447-272f-4820-813d-925a43d69462 0b16e2645828414d86f094c2a7a693eb 
66fcb198228e4d9d97f09dc7ce1d5130 - default default] [instance: 
aef0a1b3-99d6-43e2-89db-b7bd4e0516d9] Timeout waiting for 
[('network-vif-plugged', 'c641bae9-edcc-4e64-93ec-2dcbf5010009')] for instance 
with vm_state building and task_state spawning: eventlet.timeout.Timeout: 300 
seconds
2021-12-29 07:32:12.522 7 INFO nova.compute.manager 
[req-01948206-8ff3-404c-b71f-86ed7d1791d4 - - - - -] [instance: 
aef0a1b3-99d6-43e2-89db-b7bd4e0516d9] VM Resumed (Lifecycle Event)
2021-12-29 07:32:12.526 7 INFO nova.virt.libvirt.driver [-] [instance: 
aef0a1b3-99d6-43e2-89db-b7bd4e0516d9] Instance spawned successfully.
2021-12-29 07:32:12.527 7 INFO nova.compute.manager 
[req-57321447-272f-4820-813d-925a43d69462 0b16e2645828414d86f094c2a7a693eb 
66fcb198228e4d9d97f09dc7ce1d5130 - default default] [instance: 
aef0a1b3-99d6-43e2-89db-b7bd4

Re: [ovs-discuss] Timeout waiting for network-vif-plugged for instance with vm_state building and task_state spawning

2021-12-29 Thread Tony Liu
I've been hitting this issue from time to time. There are discussions and bugs 
for this.

http://lists.openstack.org/pipermail/openstack-discuss/2021-November/025941.html
http://lists.openstack.org/pipermail/openstack-discuss/2021-November/025892.html
https://bugs.launchpad.net/heat/+bug/1694371
https://bugs.launchpad.net/nova/+bug/1944779

I don't think it's about OVN/OVS, because the port is up in OVN NB,
and restarting Neutron makes it work again. I suspect Neutron either
doesn't pick it up or doesn't send it out to Nova compute.
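
To compare the two views of a port when this happens, something along these lines can help. A sketch, assuming the ML2/OVN driver (where the logical switch port name is the Neutron port UUID) and that both ovn-nbctl and the openstack client are usable from the same shell; port_views and run() are hypothetical helpers:

```shell
#!/bin/sh
# Dry-run wrapper: with DRY_RUN set, print the command instead of running it.
run() { if [ -n "${DRY_RUN:-}" ]; then echo "+ $*"; else "$@"; fi; }

# Show the port state as seen by OVN NB and by Neutron.
# $1: Neutron port UUID (also the OVN logical switch port name).
port_views() {
    run ovn-nbctl lsp-get-up "$1"
    run openstack port show "$1" -c status -f value
}

# Example with the port from the nova-compute log in this thread:
#   port_views c641bae9-edcc-4e64-93ec-2dcbf5010009
```

If OVN reports the port as up while Neutron still shows it as DOWN, that points at the Neutron side rather than OVN.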

When I hit this issue, I enable debugging for Neutron and restart it, but that
makes the problem go away. If I keep logging at debug level until the next
occurrence, it will dump too many logs. It would be good for Neutron to support
enabling debug logging on the fly.

If it happens to you more frequently, you can try to enable debug and see if
that captures anything helpful.


Tony


From: discuss  on behalf of Christian 
Stelter 
Sent: December 29, 2021 09:51 AM
To: ovs-discuss@openvswitch.org
Subject: [ovs-discuss] Timeout waiting for network-vif-plugged for instance 
with vm_state building and task_state spawning

Hi!

We observe in our well aged OpenStack installation (kolla-ansible deployed 
victoria release) strange timeouts while creating new instances. It seems that 
the feedback from the ovn controller(?) doesn't reach the nova compute/libvirt 
process within 300s.

This does not happen all the time, but 1/3 to 1/5 of the launches have problems.

This is what we see in the logs:

---  nova-compute ---
2021-12-29 07:27:06.972 7 INFO nova.compute.claims [req-57321447-272f-4820-813d-925a43d69462 0b16e2645828414d86f094c2a7a693eb 66fcb198228e4d9d97f09dc7ce1d5130 - default default] [instance: aef0a1b3-99d6-43e2-89db-b7bd4e0516d9] Claim successful on node compute07
2021-12-29 07:27:07.757 7 INFO nova.virt.libvirt.driver [req-57321447-272f-4820-813d-925a43d69462 0b16e2645828414d86f094c2a7a693eb 66fcb198228e4d9d97f09dc7ce1d5130 - default default] [instance: aef0a1b3-99d6-43e2-89db-b7bd4e0516d9] Creating image
2021-12-29 07:27:12.514 7 INFO nova.compute.manager [req-01948206-8ff3-404c-b71f-86ed7d1791d4 - - - - -] [instance: aef0a1b3-99d6-43e2-89db-b7bd4e0516d9] VM Started (Lifecycle Event)
2021-12-29 07:27:12.559 7 INFO nova.compute.manager [req-01948206-8ff3-404c-b71f-86ed7d1791d4 - - - - -] [instance: aef0a1b3-99d6-43e2-89db-b7bd4e0516d9] VM Paused (Lifecycle Event)
2021-12-29 07:27:12.649 7 INFO nova.compute.manager [req-01948206-8ff3-404c-b71f-86ed7d1791d4 - - - - -] [instance: aef0a1b3-99d6-43e2-89db-b7bd4e0516d9] During sync_power_state the instance has a pending task (spawning). Skip.
2021-12-29 07:32:12.518 7 WARNING nova.virt.libvirt.driver [req-57321447-272f-4820-813d-925a43d69462 0b16e2645828414d86f094c2a7a693eb 66fcb198228e4d9d97f09dc7ce1d5130 - default default] [instance: aef0a1b3-99d6-43e2-89db-b7bd4e0516d9] Timeout waiting for [('network-vif-plugged', 'c641bae9-edcc-4e64-93ec-2dcbf5010009')] for instance with vm_state building and task_state spawning: eventlet.timeout.Timeout: 300 seconds
2021-12-29 07:32:12.522 7 INFO nova.compute.manager [req-01948206-8ff3-404c-b71f-86ed7d1791d4 - - - - -] [instance: aef0a1b3-99d6-43e2-89db-b7bd4e0516d9] VM Resumed (Lifecycle Event)
2021-12-29 07:32:12.526 7 INFO nova.virt.libvirt.driver [-] [instance: aef0a1b3-99d6-43e2-89db-b7bd4e0516d9] Instance spawned successfully.
2021-12-29 07:32:12.527 7 INFO nova.compute.manager [req-57321447-272f-4820-813d-925a43d69462 0b16e2645828414d86f094c2a7a693eb 66fcb198228e4d9d97f09dc7ce1d5130 - default default] [instance: aef0a1b3-99d6-43e2-89db-b7bd4e0516d9] Took 304.77 seconds to spawn the instance on the hypervisor.
2021-12-29 07:32:12.613 7 INFO nova.compute.manager [req-01948206-8ff3-404c-b71f-86ed7d1791d4 - - - - -] [instance: aef0a1b3-99d6-43e2-89db-b7bd4e0516d9] During sync_power_state the instance has a pending task (spawning). Skip.
2021-12-29 07:32:12.632 7 INFO nova.compute.manager [req-57321447-272f-4820-813d-925a43d69462 0b16e2645828414d86f094c2a7a693eb 66fcb198228e4d9d97f09dc7ce1d5130 - default default] [instance: aef0a1b3-99d6-43e2-89db-b7bd4e0516d9] Took 305.70 seconds to build instance.
---  nova-compute ---
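
The spawn time reported above is dominated by the wait for the network-vif-plugged event. A minimal sketch for checking this from the log timestamps, assuming the nova timestamp format shown in the lines above (the helper name elapsed_seconds is made up for illustration):

```python
from datetime import datetime

# Timestamp layout used by the nova-compute log lines above.
FMT = "%Y-%m-%d %H:%M:%S.%f"


def elapsed_seconds(start: str, end: str) -> float:
    """Return the number of seconds between two nova log timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds()


# "VM Started" and the "Timeout waiting for network-vif-plugged" warning,
# both copied from the log above.
vm_started = "2021-12-29 07:27:12.514"
vif_timeout = "2021-12-29 07:32:12.518"

print(f"{elapsed_seconds(vm_started, vif_timeout):.3f} s")
```

If the result is close to the 300-second timeout in the warning, almost all of the build time was spent waiting for the port to be reported as plugged, not in the hypervisor.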

--- neutron-server ---
2021-12-29 07:27:09.115 23 INFO neutron.db.ovn_revision_numbers_db 
[req-cb418ddf-a068-4511-9fed-f0914945c76d 0b16e2645828414d86f094c2a7a693eb 
66fcb198228e4d9d97f09dc7ce1d5130 - default default] Successfully bumped 
revision number for resource c641bae9-edcc-4e64-93ec-2dcbf5010009 (type: ports) 
to 1
2021-12-29 07:27:10.054 23 INFO neutron.db.ovn_revision_numbers_db 
[req-8e3f0f54-8164-4d6d-acab-3be912b0afb6 217d8068ede5469ebfbebaf3e2242840 
8775aa7d41654c84b8b17b3a570adf4d - default default] Successfully bumped 
revision number for resource c641bae9-edcc-4e64-93ec-2dcbf5010009 (type: ports) 
to 2
2021-12-29 07:27:10.393 23 INFO neutron.db.ovn_revision_numbers_db 

Re: [ovs-discuss] [ovn] recommendation to gateway chassis

2021-04-15 Thread Tony Liu
Any comments?

Thanks!
Tony
> -Original Message-
> From: discuss  On Behalf Of Tony
> Liu
> Sent: Saturday, April 3, 2021 12:05 PM
> To: ovs-discuss 
> Subject: [ovs-discuss] [ovn] recommendation to gateway chassis
> 
> Hi,
> 
> For now, I have a setup with DVR to route FIP traffic directly on
> compute node and dedicated central GWs to handle SNAT traffic.
> Then I realize that I can put GW chassis on compute node as well.
> Is this recommended over dedicated GW chassis?
> 
> If yes, is there any difference between 1) picking up a few compute
> nodes as GW and 2) putting GW on all compute nodes (like 200 nodes), in
> terms of the impact to OVN control plane and data plane, like DB sizing
> and handling, performance, health check peering, etc.?
> 
> Or the question is that, what's the comfortable scale of gateway chassis
> in OVN. If it's a few hundreds, then having GW on all compute nodes
> would be the best option.
> 
> Any comment is appreciated.
> 
> 
> Thanks!
> Tony
> 
___
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss


[ovs-discuss] [ovn] recommendation to gateway chassis

2021-04-03 Thread Tony Liu
Hi,

For now, I have a setup with DVR to route FIP traffic directly
on the compute node and dedicated central GWs to handle SNAT traffic.
Then I realized that I can put GW chassis on compute nodes as well.
Is this recommended over dedicated GW chassis?

If yes, is there any difference between 1) picking a few compute
nodes as GWs and 2) putting a GW on all compute nodes (like 200 nodes),
in terms of the impact on the OVN control plane and data plane, such as
DB sizing and handling, performance, health check peering, etc.?

Put another way, what is a comfortable scale of gateway
chassis in OVN? If it's a few hundred, then having a GW on all
compute nodes would be the best option.

Any comments are appreciated.


Thanks!
Tony



Re: [ovs-discuss] [ovs-dev] [ovn] no tunnel from GW to compute

2020-10-28 Thread Tony Liu
Thanks, Numan, for the hint! I thought it was about OF flows and DP
flows and didn't realize the problem could be upstream. I checked the
NB DB and found an LRP left over from a previous deletion. Someone was
testing a deployment with a network and VMs, destroying and recreating
the deployment back-to-back. It's likely an issue in the Neutron OVN
ML2 driver, which is responsible for programming the NB DB. Somehow,
the driver didn't get a chance to finish the deletion before the new
creation started. After I manually deleted that stale LRP, everything
started working fine.
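
A minimal sketch of how such leftovers could be spotted, assuming the logical router port names have already been exported from the NB DB and the router port IDs from Neutron. The "lrp-<port id>" naming is an assumption about how the Neutron OVN driver names these ports, and find_stale_lrps and the sample IDs are made up for illustration:

```python
def find_stale_lrps(nb_lrp_names, neutron_port_ids, prefix="lrp-"):
    """Return NB logical router port names that have no matching Neutron port.

    nb_lrp_names:     names of Logical_Router_Port rows taken from the NB DB
    neutron_port_ids: IDs of the router ports that Neutron still knows about
    prefix:           assumed naming convention, "lrp-" + Neutron port ID
    """
    expected = {prefix + port_id for port_id in neutron_port_ids}
    return sorted(
        name
        for name in nb_lrp_names
        if name.startswith(prefix) and name not in expected
    )


# Hypothetical sample data, not taken from the deployment in this thread.
nb_names = ["lrp-1111", "lrp-2222", "lrp-3333"]
neutron_ids = ["1111", "3333"]
print(find_stale_lrps(nb_names, neutron_ids))
```

Anything the function returns is only a candidate for cleanup; it is worth confirming that no creation is still in flight before deleting it.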

Tony
> -Original Message-
> From: Numan Siddique 
> Sent: Wednesday, October 28, 2020 1:56 AM
> To: Tony Liu 
> Cc: ovs-discuss ; ovs-dev  d...@openvswitch.org>
> Subject: Re: [ovs-dev] [ovn] no tunnel from GW to compute
> 
> On Wed, Oct 28, 2020 at 11:29 AM Tony Liu 
> wrote:
> >
> > Checked OF flows for working and non-working FIPs, can't find any
> > difference. DP flow is installed by vswitchd based on OF flow when
> > processing the first packet. I enabled debug logging, no log for
> > working FIP, a few logs for non-working FIP. Does that mean something
> > is wrong with the non-working FIP?
> 
> Could you share your OVN NB DB ? or ovn-nbctl commands to create the
> resources so that I can try it out locally ?
> 
> Thanks
> Numan
> 
> > ==
> > 2020-10-28T05:15:22.058Z|01203|dpif(handler8)|DBG|system@ovs-system:
> miss upcall:
> > recirc_id(0),dp_hash(0),skb_priority(0),in_port(4),skb_mark(0),ct_stat
> > e(0),ct_zone(0),ct_mark(0),ct_label(0),eth(src=e8:1c:ba:9f:b7:c6,dst=f
> > a:16:3e:45:da:61),eth_type(0x0800),ipv4(src=172.16.160.1,dst=10.59.53.
> > 8,proto=1,tos=0,ttl=125,frag=no),icmp(type=8,code=0)
> > icmp,vlan_tci=0x,dl_src=e8:1c:ba:9f:b7:c6,dl_dst=fa:16:3e:45:da:61
> > ,nw_src=172.16.160.1,nw_dst=10.59.53.8,nw_tos=0,nw_ecn=0,nw_ttl=125,ic
> mp_type=8,icmp_code=0 icmp_csum:6013 ..
> > 2020-10-28T05:15:22.059Z|00808|vconn|DBG|unix#1: sent (Success):
> > NXT_PACKET_IN2 (OF1.3) (xid=0x0): table_id=24 cookie=0x9173f925
> > total_len=74
> > ct_state=new|trk|dnat,ct_zone=507,ct_nw_src=172.16.160.1,ct_nw_dst=10.
> > 59.53.8,ct_nw_proto=1,ct_tp_src=8,ct_tp_dst=0,ip,reg0=0xa0a000a,reg1=0
> > xa0a0001,reg9=0x8,reg10=0x1,reg11=0x1fb,reg12=0x1fa,reg14=0x1,reg15=0x
> > 7,metadata=0x121,in_port=1 (via action) data_len=74 (unbuffered)
> >
> > userdata=00.00.00.00.00.00.00.00.00.19.00.10.80.00.06.06.ff.ff.ff.ff.f
> > f.ff.00.00.ff.ff.00.18.00.00.23.20.00.06.00.20.00.40.00.00.00.01.de.10
> > .00.00.20.04.ff.ff.00.18.00.00.23.20.00.06.00.20.00.60.00.00.00.01.de.
> > 10.00.00.22.04.00.19.00.10.80.00.2a.02.00.01.00.00.00.00.00.00.ff.ff.0
> > 0.10.00.00.23.20.00.0e.ff.f8.20.00.00.00
> > icmp,vlan_tci=0x,dl_src=fa:16:3e:93:f4:1e,dl_dst=00:00:00:00:00:00
> > ,nw_src=172.16.160.1,nw_dst=10.10.0.10,nw_tos=0,nw_ecn=0,nw_ttl=124,ic
> > mp_type=8,icmp_code=0 icmp_csum:6013
> >
> > ..
> > 2020-10-28T05:15:27.060Z|01102|dpif(handler10)|DBG|system@ovs-system:
> action upcall:
> > recirc_id(0x3ad25),dp_hash(0),skb_priority(0),in_port(4),skb_mark(0),c
> > t_state(0xa1),ct_zone(0x1fb),ct_mark(0),ct_label(0),ct_tuple4(src=172.
> > 16.160.1,dst=10.59.53.8,proto=1,tp_src=8,tp_dst=0),eth(src=fa:16:3e:93
> > :f4:1e,dst=00:00:00:00:00:00),eth_type(0x0800),ipv4(src=172.16.160.1,d
> > st=10.10.0.10,proto=1,tos=0,ttl=124,frag=no),icmp(type=8,code=0)
> > icmp,vlan_tci=0x,dl_src=fa:16:3e:93:f4:1e,dl_dst=00:00:00:00:00:00
> > ,nw_src=172.16.160.1,nw_dst=10.10.0.10,nw_tos=0,nw_ecn=0,nw_ttl=124,ic
> mp_type=8,icmp_code=0 icmp_csum:6012 ..
> > 2020-10-28T05:15:27.061Z|00834|vconn|DBG|unix#1: sent (Success):
> > NXT_PACKET_IN2 (OF1.3) (xid=0x0): table_id=24 cookie=0x9173f925
> > total_len=74
> > ct_state=new|trk|dnat,ct_zone=507,ct_nw_src=172.16.160.1,ct_nw_dst=10.
> > 59.53.8,ct_nw_proto=1,ct_tp_src=8,ct_tp_dst=0,ip,reg0=0xa0a000a,reg1=0
> > xa0a0001,reg9=0x8,reg10=0x1,reg11=0x1fb,reg12=0x1fa,reg14=0x1,reg15=0x
> > 7,metadata=0x121,in_port=1 (via action) data_len=74 (unbuffered)
> >
> > userdata=00.00.00.00.00.00.00.00.00.19.00.10.80.00.06.06.ff.ff.ff.ff.f
> > f.ff.00.00.ff.ff.00.18.00.00.23.20.00.06.00.20.00.40.00.00.00.01.de.10
> > .00.00.20.04.ff.ff.00.18.00.00.23.20.00.06.00.20.00.60.00.00.00.01.de.
> > 10.00.00.22.04.00.19.00.10.80.00.2a.02.00.01.00.00.00.00.00.00.ff.ff.0
> > 0.10.00.00.23.20.00.0e.ff.f8.20.00.00.00
> > icmp,vlan_tci=0x,dl_src=fa:16:3e:93:f4:1e,dl_dst=00:00:00:00:00:00
> > ,nw_src=172.16.160.1,nw_dst=10.10.0.10,nw_tos=0,nw_ecn=0,nw_ttl=124,ic
> mp_type=8,icmp_code=0 icmp_csum:6012 ..
> > 2020-10-2

Re: [ovs-discuss] [ovn] no tunnel from GW to compute

2020-10-27 Thread Tony Liu
Saw the same problem again. After recreating the network, attaching
it to the router and launching a VM, the problem is gone. Probably some
glitch happened during the early deployment. Any hints on how to look
into it?

Thanks!
Tony
> -Original Message-
> From: discuss  On Behalf Of Tony
> Liu
> Sent: Friday, October 16, 2020 6:48 PM
> To: ovs-discuss 
> Subject: [ovs-discuss] [ovn] no tunnel from GW to compute
> 
> Hi,
> 
> I am seeing an interesting issue today.
> When ping a FIP from external, request arrives on GW, but no tunnel from
> GW to compute.
> When ping from VM to external, egress works fine, request goes through
> tunnel from compute to GW, then to external.
> Reply arrives at GW, no tunnel from GW back to compute.
> 
> I checked DP flows on GW and compared working vs. non-working.
> 
> non-working, no tunnel
> 
> recirc_id(0),in_port(3),ct_state(-new-est-rel-rpl-inv-
> trk),ct_label(0/0x1),eth(src=e8:1c:ba:9f:b7:c6,dst=fa:16:3e:67:5c:d9),et
> h_type(0x0800),ipv4(src=128.0.0.0/192.0.0.0,dst=10.59.53.18,proto=1,ttl=
> 63,frag=no),icmp(type=8/0xf8), packets:8, bytes:784, used:0.992s,
> actions:ct_clear,ct(zone=20,nat),recirc(0x6e1)
> 
> recirc_id(0x6e1),in_port(3),ct_state(+new-est-rel-rpl-
> inv+trk),ct_label(0/0x1),eth(),eth_type(0x0800),ipv4(dst=10.59.53.18,fra
> g=no), packets:29, bytes:2842, used:0.992s,
> actions:ct(commit,zone=21,nat(dst=192.168.1.8)),recirc(0x6e2)
> 
> recirc_id(0x6e2),in_port(3),ct_state(+new-est-rel-rpl-
> inv+trk),ct_label(0/0x1),eth(src=e8:1c:ba:9f:b7:c6,dst=fa:16:3e:67:5c:d9
> ),eth_type(0x0800),ipv4(dst=192.168.1.8,proto=1,ttl=63,frag=no),icmp(typ
> e=8/0xf8), packets:8, bytes:784, used:0.992s, actions:ct_clear
> 
> 
> working, with tunnel
> 
> recirc_id(0),in_port(3),ct_state(-new-est-rel-rpl-inv-
> trk),ct_label(0/0x1),eth(src=e8:1c:ba:9f:b7:c6,dst=fa:16:3e:67:5c:d9),et
> h_type(0x0800),ipv4(src=128.0.0.0/192.0.0.0,dst=10.59.53.14,proto=1,ttl=
> 63,frag=no),icmp(type=8/0xf8), packets:2, bytes:196, used:3.427s,
> actions:ct_clear,ct(zone=20,nat),recirc(0x716)
> 
> recirc_id(0x716),in_port(3),ct_state(+new-est-rel-rpl-
> inv+trk),ct_label(0/0x1),eth(),eth_type(0x0800),ipv4(dst=10.59.53.14,fra
> g=no), packets:2, bytes:196, used:3.428s,
> actions:ct(commit,zone=21,nat(dst=192.168.1.5)),recirc(0x717)
> 
> recirc_id(0x717),in_port(3),ct_state(+new-est-rel-rpl-
> inv+trk),ct_label(0/0x1),eth(src=e8:1c:ba:9f:b7:c6,dst=fa:16:3e:67:5c:d9
> ),eth_type(0x0800),ipv4(dst=192.168.1.5,proto=1,tos=0/0x3,ttl=63,frag=no
> ),icmp(type=8/0xf8), packets:0, bytes:0, used:never,
> actions:ct_clear,set(tunnel(tun_id=0x139,dst=10.6.30.63,ttl=64,tp_dst=60
> 81,geneve({class=0x102,type=0x80,len=4,0x2000a}),flags(df|csum|key))),se
> t(eth(src=fa:16:3e:aa:2a:5d,dst=fa:16:3e:a6:79:6f)),set(ipv4(ttl=62)),1
> 
> 
> The difference is on the third flow (0x6e2 and 0x717).
> In non-working case, "set(tunnel..." is missing.
> Note, the working VM and non-working VM are on the same compute.
> I want to trace the root cause. Any hints or comments where and how I
> should look into it?
> 
> 
> Thanks!
> Tony
> 


[ovs-discuss] [ovn] does not match prerequisite

2020-10-20 Thread Tony Liu
Hi,

In the ovnsb log, I see many of the following messages.
What do they mean? Are they a concern?

2020-10-20T18:52:50.483Z|00093|raft|INFO|current entry eid 
2ab3eff8-87e1-4e19-9a1f-d359ad56a9ad does not match prerequisite 
c6ffd854-6f6e-4533-a6d8-b297acb542e0 in execute_command_request


Thanks!
Tony



[ovs-discuss] [ovn] no tunnel from GW to compute

2020-10-16 Thread Tony Liu
Hi,

I am seeing an interesting issue today.
When pinging a FIP from external, the request arrives on the GW, but
there is no tunnel from the GW to the compute node.
When pinging from the VM to external, egress works fine: the request
goes through the tunnel from compute to GW, then to external.
The reply arrives at the GW, but there is no tunnel from the GW back
to compute.

I checked DP flows on GW and compared working vs. non-working.

non-working, no tunnel

recirc_id(0),in_port(3),ct_state(-new-est-rel-rpl-inv-trk),ct_label(0/0x1),eth(src=e8:1c:ba:9f:b7:c6,dst=fa:16:3e:67:5c:d9),eth_type(0x0800),ipv4(src=128.0.0.0/192.0.0.0,dst=10.59.53.18,proto=1,ttl=63,frag=no),icmp(type=8/0xf8),
 packets:8, bytes:784, used:0.992s, 
actions:ct_clear,ct(zone=20,nat),recirc(0x6e1)

recirc_id(0x6e1),in_port(3),ct_state(+new-est-rel-rpl-inv+trk),ct_label(0/0x1),eth(),eth_type(0x0800),ipv4(dst=10.59.53.18,frag=no),
 packets:29, bytes:2842, used:0.992s, 
actions:ct(commit,zone=21,nat(dst=192.168.1.8)),recirc(0x6e2)

recirc_id(0x6e2),in_port(3),ct_state(+new-est-rel-rpl-inv+trk),ct_label(0/0x1),eth(src=e8:1c:ba:9f:b7:c6,dst=fa:16:3e:67:5c:d9),eth_type(0x0800),ipv4(dst=192.168.1.8,proto=1,ttl=63,frag=no),icmp(type=8/0xf8),
 packets:8, bytes:784, used:0.992s, actions:ct_clear


working, with tunnel

recirc_id(0),in_port(3),ct_state(-new-est-rel-rpl-inv-trk),ct_label(0/0x1),eth(src=e8:1c:ba:9f:b7:c6,dst=fa:16:3e:67:5c:d9),eth_type(0x0800),ipv4(src=128.0.0.0/192.0.0.0,dst=10.59.53.14,proto=1,ttl=63,frag=no),icmp(type=8/0xf8),
 packets:2, bytes:196, used:3.427s, 
actions:ct_clear,ct(zone=20,nat),recirc(0x716)

recirc_id(0x716),in_port(3),ct_state(+new-est-rel-rpl-inv+trk),ct_label(0/0x1),eth(),eth_type(0x0800),ipv4(dst=10.59.53.14,frag=no),
 packets:2, bytes:196, used:3.428s, 
actions:ct(commit,zone=21,nat(dst=192.168.1.5)),recirc(0x717)

recirc_id(0x717),in_port(3),ct_state(+new-est-rel-rpl-inv+trk),ct_label(0/0x1),eth(src=e8:1c:ba:9f:b7:c6,dst=fa:16:3e:67:5c:d9),eth_type(0x0800),ipv4(dst=192.168.1.5,proto=1,tos=0/0x3,ttl=63,frag=no),icmp(type=8/0xf8),
 packets:0, bytes:0, used:never, 
actions:ct_clear,set(tunnel(tun_id=0x139,dst=10.6.30.63,ttl=64,tp_dst=6081,geneve({class=0x102,type=0x80,len=4,0x2000a}),flags(df|csum|key))),set(eth(src=fa:16:3e:aa:2a:5d,dst=fa:16:3e:a6:79:6f)),set(ipv4(ttl=62)),1


The difference is in the third flow (0x6e2 and 0x717).
In the non-working case, "set(tunnel..." is missing.
Note that the working VM and the non-working VM are on the same compute
node. I want to trace the root cause. Any hints or comments on where and
how I should look into it?
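
The comparison above can be sketched as a small check on the datapath flow dump text. This is only an illustration: the two sample flows are the third flows from the dumps above with the match fields trimmed for brevity, and the helper names are made up:

```python
import re


def flow_actions(dp_flow: str) -> str:
    """Return the text after 'actions:' in one datapath flow dump line."""
    _, _, actions = dp_flow.partition("actions:")
    return actions.strip()


def has_tunnel_action(dp_flow: str) -> bool:
    """True if the flow's actions set tunnel metadata before output."""
    return "set(tunnel(" in flow_actions(dp_flow)


def recirc_id(dp_flow: str):
    """Return the recirculation ID at the start of the flow, or None."""
    m = re.match(r"recirc_id\((0x[0-9a-f]+|0)\)", dp_flow)
    return m.group(1) if m else None


# Third flow of each case from the dumps above, match fields trimmed.
NON_WORKING = (
    "recirc_id(0x6e2),in_port(3),eth_type(0x0800),"
    "ipv4(dst=192.168.1.8,proto=1,ttl=63,frag=no), "
    "packets:8, bytes:784, used:0.992s, actions:ct_clear"
)
WORKING = (
    "recirc_id(0x717),in_port(3),eth_type(0x0800),"
    "ipv4(dst=192.168.1.5,proto=1,tos=0/0x3,ttl=63,frag=no), "
    "packets:0, bytes:0, used:never, "
    "actions:ct_clear,set(tunnel(tun_id=0x139,dst=10.6.30.63,ttl=64,"
    "tp_dst=6081,geneve({class=0x102,type=0x80,len=4,0x2000a}),"
    "flags(df|csum|key))),set(eth(src=fa:16:3e:aa:2a:5d,"
    "dst=fa:16:3e:a6:79:6f)),set(ipv4(ttl=62)),1"
)

for flow in (NON_WORKING, WORKING):
    print(recirc_id(flow), "tunnel" if has_tunnel_action(flow) else "no tunnel")
```

A flow whose actions end at ct_clear with no output is effectively a drop, so scanning the dump for such entries narrows down which FIPs are affected.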


Thanks!
Tony



[ovs-discuss] [ovn] gateway router vs. distributed gateway port

2020-09-24 Thread Tony Liu
Hi,

Based on "Distributed Gateway Ports" in [1] and my own tests,
central SNAT, central FIP and distributed FIP are all supported by
a distributed gateway port (DGP).

In which cases does a gateway router have to be used?

One case is mentioned in [2]: DR is used in K8S to support node
services and avoid central SNAT. But the comments in [3] suggest
that LB is supported by DGP and that distributed SNAT could possibly
be supported by DGP as well.

Other than the above case, do any other cases really require DR?

[1] http://www.openvswitch.org/support/dist-docs/ovn-architecture.7.html
[2] https://www.mail-archive.com/ovs-discuss@openvswitch.org/msg06989.html
[3] https://www.mail-archive.com/ovs-discuss@openvswitch.org/msg06993.html

Thanks!
Tony



Re: [ovs-discuss] [ovn] distributed router port and distributed SNAT

2020-09-24 Thread Tony Liu
> -Original Message-
> From: Numan Siddique 
> Sent: Wednesday, September 23, 2020 11:12 PM
> To: Tony Liu 
> Cc: ovs-discuss 
> Subject: Re: [ovs-discuss] [ovn] distributed router port and distributed
> SNAT
> 
> 
> 
> On Thu, Sep 24, 2020 at 10:14 AM Tony Liu <tonyliu0...@hotmail.com> wrote:
> 
> 
>   Hi,
> 
>   I read through this long discussion [1].
> 
>   Here is what I am doing.
> 
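>   +-----------------------------------------+
>   |         external logical switch         |
>   +----+-----------+------------------+-----+
>        |           |                  |
>     +--+--+     +--+--+          +----+----+
>     |dgp1 |     |dgp2 |   ...    | dgp1000 |
>     +--+--+     +--+--+          +----+----+
>        |           |                  |
>     +--+--+     +--+--+          +----+----+
>     | LR1 |     | LR2 |          | LR1000  |
>     +-----+     +-----+          +---------+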
> 
>   First of all, I see the same flow explosion in lr_in_arp_resolve
>   table. I'd like to confirm the patch [2] will also avoid explosion
>   in my case?
> 
> 
> 
> 
> I think so. Maybe Han or Dumitru can confirm. I suggest that you test it
> out yourself.
> You can stop the neutron server and run a script which sets this option
> on each logical router.
> 
> something like
> 
> # Disable learning from ARP requests on every logical router.
> for i in $(ovn-nbctl --bare --columns=_uuid list logical_router); do
>     ovn-nbctl set logical_router $i options:always_learn_from_arp_request=false
> done
> 
> 
> 
>   In my case, LRs are not bound to any specific compute chassis.
>   All DGPs are bound on the central set of gateway chassis.
>   It's central SNAT and FIP.
> 
>   I am looking for the possibility to do distributed SNAT and FIP to
>   avoid central gateway nodes. With OpenStack integration,
>   distributed FIP is supported, but not distributed SNAT, because
>   there is no chassis-specific address that can be used as the source
>   IP for SNAT.
> 
> 
> 
> 
> I  don't think OVN supports distributed SNAT.
> 
> 
> 
>   Given the idea in [3], DGP can be bound on compute chassis.
>   I don't need the support to have multiple DGPs on one LR.
>   Then is that going to work for distributed SNAT?
>   Any details, like how to allocate a chassis-specific address
>   as the source IP for SNAT, and how ARP works for that address?
> 
> 
> 
> I am not sure how easy it is going to be to support this.

Two pieces here: 1) multiple DGPs, 2) DGP binding.
I know #1 is not supported, and I actually don't need it.
Is #2 already supported? If yes, can distributed SNAT be
supported that way?


Thanks!
Tony
> 
> Thanks
> Numan
> 
> 
> 
>   [1] https://www.mail-archive.com/ovs-discuss@openvswitch.org/msg06948.html
>   [2] https://www.mail-archive.com/ovs-dev@openvswitch.org/msg45681.html
>   [3] https://www.mail-archive.com/ovs-discuss@openvswitch.org/msg06987.html
> 
>   Thanks!
>   Tony
> 
> 
> 



[ovs-discuss] [ovn] distributed router port and distributed SNAT

2020-09-23 Thread Tony Liu
Hi,

I read through this long discussion [1].

Here is what I am doing.

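+-----------------------------------------+
|         external logical switch         |
+----+-----------+------------------+-----+
     |           |                  |
  +--+--+     +--+--+          +----+----+
  |dgp1 |     |dgp2 |   ...    | dgp1000 |
  +--+--+     +--+--+          +----+----+
     |           |                  |
  +--+--+     +--+--+          +----+----+
  | LR1 |     | LR2 |          | LR1000  |
  +-----+     +-----+          +---------+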

First of all, I see the same flow explosion in lr_in_arp_resolve
table. I'd like to confirm the patch [2] will also avoid explosion
in my case?

In my case, LRs are not bound to any specific compute chassis.
All DGPs are bound on the central set of gateway chassis.
It's central SNAT and FIP.

I am looking for the possibility to do distributed SNAT and FIP to
avoid central gateway nodes. With OpenStack integration,
distributed FIP is supported, but not distributed SNAT, because
there is no chassis-specific address that can be used as the source
IP for SNAT.

Given the idea in [3], DGP can be bound on compute chassis.
I don't need the support to have multiple DGPs on one LR.
Then is that going to work for distributed SNAT?
Any details, like how to allocate a chassis-specific address
as the source IP for SNAT, and how ARP works for that address?

[1] https://www.mail-archive.com/ovs-discuss@openvswitch.org/msg06948.html
[2] https://www.mail-archive.com/ovs-dev@openvswitch.org/msg45681.html
[3] https://www.mail-archive.com/ovs-discuss@openvswitch.org/msg06987.html

Thanks!
Tony



Re: [ovs-discuss] [ovn] MAC in table acl and lb

2020-09-22 Thread Tony Liu
That MAC is also in table 19.
==
  table=19(ls_in_l2_lkup      ), priority=110  , match=(eth.dst == 1a:d2:77:6e:42:98), action=(handle_svc_check(inport);)
==

Thanks!
Tony
> -Original Message-
> From: Tony Liu 
> Sent: Tuesday, September 22, 2020 7:19 PM
> To: ovs-discuss 
> Subject: [ovn] MAC in table acl and lb
> 
> Hi,
> 
> When I look at a datapath ingress pipeline, =
>   table=3 (ls_in_pre_acl  ), priority=110  , match=(eth.dst ==
> 1a:d2:77:6e:42:98), action=(next;)
>   table=3 (ls_in_pre_acl  ), priority=0, match=(1),
> action=(next;)
>   table=4 (ls_in_pre_lb   ), priority=110  , match=(eth.dst ==
> 1a:d2:77:6e:42:98), action=(next;)
>   table=4 (ls_in_pre_lb   ), priority=110  , match=(nd || nd_rs ||
> nd_ra || icmp4.type == 3 ||icmp6.type == 1 || (tcp && tcp.flags == 20)),
> action=(next;)
>   table=4 (ls_in_pre_lb   ), priority=0, match=(1),
> action=(next;)
>   table=5 (ls_in_pre_stateful ), priority=100  , match=(reg0[0] == 1),
> action=(ct_next;)
>   table=5 (ls_in_pre_stateful ), priority=0, match=(1),
> action=(next;)
>   table=6 (ls_in_acl  ), priority=34000, match=(eth.dst ==
> 1a:d2:77:6e:42:98), action=(next;)
>   table=6 (ls_in_acl  ), priority=0, match=(1),
> action=(next;)
>   table=7 (ls_in_qos_mark ), priority=0, match=(1),
> action=(next;)
>   table=8 (ls_in_qos_meter), priority=0, match=(1),
> action=(next;)
>   table=9 (ls_in_lb   ), priority=0, match=(1),
> action=(next;)
> =
> What's that MAC 1a:d2:77:6e:42:98? What's it for in acl and lb tables?
> I can't find any port with that MAC.
> This datapath is for a LS that is created from OpenStack.
> 
> Thanks!
> Tony



[ovs-discuss] [ovn] MAC in table acl and lb

2020-09-22 Thread Tony Liu
Hi,

When I look at a datapath ingress pipeline,
=
  table=3 (ls_in_pre_acl      ), priority=110  , match=(eth.dst == 1a:d2:77:6e:42:98), action=(next;)
  table=3 (ls_in_pre_acl      ), priority=0    , match=(1), action=(next;)
  table=4 (ls_in_pre_lb       ), priority=110  , match=(eth.dst == 1a:d2:77:6e:42:98), action=(next;)
  table=4 (ls_in_pre_lb       ), priority=110  , match=(nd || nd_rs || nd_ra || icmp4.type == 3 ||icmp6.type == 1 || (tcp && tcp.flags == 20)), action=(next;)
  table=4 (ls_in_pre_lb       ), priority=0    , match=(1), action=(next;)
  table=5 (ls_in_pre_stateful ), priority=100  , match=(reg0[0] == 1), action=(ct_next;)
  table=5 (ls_in_pre_stateful ), priority=0    , match=(1), action=(next;)
  table=6 (ls_in_acl          ), priority=34000, match=(eth.dst == 1a:d2:77:6e:42:98), action=(next;)
  table=6 (ls_in_acl          ), priority=0    , match=(1), action=(next;)
  table=7 (ls_in_qos_mark     ), priority=0    , match=(1), action=(next;)
  table=8 (ls_in_qos_meter    ), priority=0    , match=(1), action=(next;)
  table=9 (ls_in_lb           ), priority=0    , match=(1), action=(next;)
=
What's that MAC 1a:d2:77:6e:42:98? What's it for in acl and lb tables?
I can't find any port with that MAC.
This datapath is for a LS that is created from OpenStack.
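
To see at a glance which stages reference a given MAC, the logical flow listing can be filtered with a few lines of script. This is a minimal sketch, assuming each logical flow sits on one line in the layout shown above; the regular expression and the function name are made up for illustration:

```python
import re

# One logical flow per line, in the layout of the listing above.
LFLOW_RE = re.compile(
    r"table=(\d+)\s*\((\w+)\s*\),\s*priority=(\d+)\s*,"
    r"\s*match=\((.*)\),\s*action=\((.*)\)$"
)


def tables_matching_mac(lflow_text: str, mac: str):
    """Return (table, stage, priority, action) for flows whose match has mac."""
    hits = []
    for line in lflow_text.splitlines():
        m = LFLOW_RE.search(line.strip())
        if m and mac in m.group(4):
            hits.append((int(m.group(1)), m.group(2), int(m.group(3)), m.group(5)))
    return hits


# A few flows copied from the listing above.
SAMPLE = """\
  table=3 (ls_in_pre_acl      ), priority=110  , match=(eth.dst == 1a:d2:77:6e:42:98), action=(next;)
  table=3 (ls_in_pre_acl      ), priority=0    , match=(1), action=(next;)
  table=4 (ls_in_pre_lb       ), priority=110  , match=(eth.dst == 1a:d2:77:6e:42:98), action=(next;)
  table=6 (ls_in_acl          ), priority=34000, match=(eth.dst == 1a:d2:77:6e:42:98), action=(next;)
  table=9 (ls_in_lb           ), priority=0    , match=(1), action=(next;)
"""

for hit in tables_matching_mac(SAMPLE, "1a:d2:77:6e:42:98"):
    print(hit)
```

In every stage listed here the flow for this MAC only does next;, so traffic addressed to it skips the ACL and LB processing in those stages.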

Thanks!
Tony



[ovs-discuss] [OVN] failed to connect 5K networks to 5K routers

2020-09-09 Thread Tony Liu
Hi,

Here is what I did.

#1 Create 5K networks, 5K routers and 1 provider/physical network.
All resources are created in NB and translated to SB properly.

#2 Connect 5K routers to provider network.
All resources are created and updated in NB properly, but SB
is not fully updated.

I took 3 snapshots of NB: snapshot-init, which has no resources,
and snapshot-1 and snapshot-2, taken after each of the two steps above.

I restored snapshot-init to NB. SB was not restored because northd
didn't work properly, so I cleaned up SB manually.

I restored snapshot-1 to NB, and SB was fully updated fairly quickly.

I restored snapshot-2 to NB, and SB got only a partial update.

I am still on 20.03...

When restoring snapshot-1, does northd read all resources from NB
and translate them to SB in one shot, or are there multiple
transitions?

When restoring snapshot-2, does northd take a lot more time to
translate than for snapshot-1?

When restoring snapshot-2, the controller on the gateway chassis is
also involved, updating the local OVSDB and the chassis port status
in SB. Could that be affecting the translation from NB to SB?

Is upgrading to 20.06 going to help with this?


Thanks!
Tony



[ovs-discuss] [OVN] Gateway router scale

2020-08-30 Thread Tony Liu
Hi,

Could anyone share experiences in gateway routers scaling?
How many gateway routers on a chassis have been tested: hundreds,
thousands?
What may be the bottleneck for gateway router scaling, resources
like memory, CPU, or ovn-controller, OVSDB?

BTW, the link in ovn-architecture.7 is not valid anymore.
https://github.com/ovn-org/ovn/blob/master/ovn-architecture.7.xml#L1758
Could anyone fix it?

Thanks!
Tony



Re: [ovs-discuss] How to restart raft cluster after a complete shutdown?

2020-08-25 Thread Tony Liu
Start the first node to create the cluster.
https://github.com/ovn-org/ovn/blob/master/utilities/ovn-ctl#L228
https://github.com/openvswitch/ovs/blob/master/utilities/ovs-lib.in#L478

Start the remaining nodes to join the cluster.
https://github.com/ovn-org/ovn/blob/master/utilities/ovn-ctl#L226
https://github.com/openvswitch/ovs/blob/master/utilities/ovs-lib.in#L478

Tony
> -Original Message-
> From: discuss  On Behalf Of Matthew
> Booth
> Sent: Tuesday, August 25, 2020 7:08 AM
> To: ovs-discuss 
> Subject: [ovs-discuss] How to restart raft cluster after a complete
> shutdown?
> 
> I'm deploying ovsdb-server (and only ovsdb-server) in K8S as a
> StatefulSet:
> 
> https://github.com/openstack-k8s-operators/dev-
> tools/blob/master/ansible/files/ocp/ovn/ovsdb.yaml
> 
> I'm going to replace this with an operator in due course, which may make
> the following simpler. I'm not necessarily constrained to only things
> which are easy to do in a StatefulSet.
> 
> I've noticed an issue when I kill all 3 pods simultaneously: it is no
> longer possible to start the cluster. The issue is presumably one of
> quorum: when a node comes up it can't contact any other node to make
> quorum, and therefore can't come up. All nodes are similarly affected,
> so the cluster stays down. Ignoring kubernetes, how is this situation
> intended to be handled? Do I have to convert it to a single-node deployment,
> convert that to a new cluster and re-bootstrap it? This wouldn't be
> ideal. Is there any way, for example, I can bring up the first node
> while asserting to that node that the other 2 are definitely down?
> 
> Thanks,
> 
> Matt
> --
> Matthew Booth
> Red Hat OpenStack Engineer, Compute DFG
> 
> Phone: +442070094448 (UK)
> 


Re: [ovs-discuss] ovsdb-server unix socket permission

2020-08-22 Thread Tony Liu
> -Original Message-
> From: Matthew Booth 
> Sent: Saturday, August 22, 2020 3:12 PM
> To: Tony Liu 
> Cc: ovs-discuss@openvswitch.org; ovs-dev 
> Subject: Re: [ovs-discuss] ovsdb-server unix socket permission
> 
> On Fri, 21 Aug 2020 at 20:40, Tony Liu  wrote:
> >
> > Hi,
> >
> > The ovsdb-server UNIX socket permission is 0750. It works fine for OVS
> > services, like ovs-vswitchd and ovn-controller who run as root.
> >
> > When integrate with OpenStack, neutron-ovn-metadata-agent running as
> > user "neutron" needs to connect to ovsdb-server.
> > TCP connection works fine. But, since it's local connection, it would
> > be better to use UNIX socket to get better performance and avoid
> > inactivity probe.
> 
> Are you still using RAFT? If so I think you must connect to all tcp
> endpoints, or leader-only operations will execute on the wrong node. I
> know that locking specifically doesn't work unless all clients pick the
> same node to lock on, which means they must all be connected to all
> nodes.

It has nothing to do with RAFT. This is the connection to the local
ovsdb-server on the compute node.

> > So, is there any option for ovsdb-server to create UNIX socket with
> > permission 0777? Or any better option for the agent to connect to UNIX
> > socket?
> 
> Assuming you're not using RAFT, can you workaround by just chowning it?

Yes, I can. The caveat is that the socket is owned by ovsdb-server,
so when ovsdb-server restarts, the socket is recreated and the
chown change is lost.
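
One possible way to live with that, as a minimal sketch only: a small
idempotent helper that re-applies ownership and mode to the socket, to be
run again after every ovsdb-server restart. The function name, the default
mode 0o770 and the idea of calling it from a restart hook are assumptions
for illustration; this is not an ovsdb-server feature.

```python
import os
import stat


def ensure_socket_access(path, mode=0o770, uid=-1, gid=-1):
    """Re-apply owner/group and permission bits on `path`.

    uid/gid of -1 mean "leave unchanged" (same convention as os.chown).
    Returns True if anything had to be changed, False otherwise.
    """
    st = os.stat(path)
    changed = False

    # Fix ownership only when a target was given and it differs.
    wrong_uid = uid != -1 and st.st_uid != uid
    wrong_gid = gid != -1 and st.st_gid != gid
    if wrong_uid or wrong_gid:
        os.chown(path, uid, gid)
        changed = True

    # Fix the permission bits only when they differ.
    if stat.S_IMODE(st.st_mode) != mode:
        os.chmod(path, mode)
        changed = True

    return changed
```

Because the helper only changes what differs, it is safe to call it
repeatedly, for example every time the service is (re)started.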

Thanks!
Tony

> 
> Matt
> --
> Matthew Booth
> Red Hat OpenStack Engineer, Compute DFG
> 
> Phone: +442070094448 (UK)



[ovs-discuss] ovsdb-server unix socket permission

2020-08-21 Thread Tony Liu
Hi,

The ovsdb-server UNIX socket permission is 0750. It works
fine for OVS services, like ovs-vswitchd and ovn-controller,
which run as root.

When integrating with OpenStack, neutron-ovn-metadata-agent,
running as user "neutron", needs to connect to ovsdb-server.
A TCP connection works fine. But, since it's a local connection,
it would be better to use the UNIX socket to get better performance
and avoid the inactivity probe.

So, is there any option for ovsdb-server to create the UNIX socket
with permission 0777? Or any better option for the agent to
connect to the UNIX socket?


Thanks!
Tony



Re: [ovs-discuss] [OVN] not-equal in ACL

2020-08-13 Thread Tony Liu
> -----Original Message-----
> From: discuss  On Behalf Of Tony
> Liu
> Sent: Monday, August 10, 2020 10:41 AM
> To: Numan Siddique 
> Cc: ovs-discuss@openvswitch.org
> Subject: [ovs-discuss] [OVN] not-equal in ACL
> 
> Hi Numan,
> 
> Create a new thread here to follow up ACL questions.
> 
> > > > I think this is a big problem here. We should not use "!=" in
> > > > logical flows, although OVN allows.
> > >
> > > Is this a generic recommendation or for certain cases?
> > > Is it OK to add an ACL with "!=", like below?
> > >
> > > ovn-nbctl acl-add 12b1681c-b3e7-4ec9-b324-e780d9dfdc0d from-lport
> 1005
> > > 'ip4.dst == 192.168.0.0/16 && inport !=
> > > "d93619c3-dab9-4f6d-8261-4211f6937fd1"' drop
> >
> >
> > This is a generic recommendation. The above ACL would also result in
> > many OF flows.
> >
> > To handle cases like above, you can add a couple of ACLs like below
> with
> > high priority flow to allow the desired inport and low priority ACL to
> > drop all the traffic.
> >
> >  ovn-nbctl acl-add 12b1681c-b3e7-4ec9-b324-e780d9dfdc0d from-lport
> > 1006 'ip4.dst == 192.168.0.0/16 && inport == "d93619c3-dab9-4f6d-8261-
> > 4211f6937fd1"' allow  ovn-nbctl acl-add 12b1681c-b3e7-4ec9-b324-
> > e780d9dfdc0d from-lport
> > 1005 'ip4.dst == 192.168.0.0/16"' drop
> 
> In my case, two LS connect to one LR who has external access.
> There are 3 ports on each LS.
> * vm_port
> * gw_port (connect to LR)
> * svc_port (localport for DHCP and metadata)
> 
> What I want is to disable the connection between two LS while allow
> external access for them.
> 
> Option #1, create one ACL for each VM on each LS.
> 
> acl-add $ls from-lport 1005 'ip4.dst == 192.168.0.0/16 && inport ==
> "$vm_port"' drop
> 
> This works fine for me, but the ACL has to be per VM.
> 
> Option #2, create one ACL to exclude gw_port and svc_port.
> 
> acl-add $ls from-lport 1005 'ip4.dst == 192.168.0.0/16 && inport !=
> "$gw_port" && inport != "svc_port"' drop
> 
> As you mentioned, this is not recommended, cause it will result many
> OF flows. I actually tried, but I don't see any OF flows created for
> that ACL. Is there any policy in ovn-controller to not translate such
> policy to OF flow?
> 
> Option #3, as you suggested, I tried 2 ACLs.
> 
> acl-add $ls from-lport 1006 'ip4.dst == 192.168.0.0/16 && (inport ==
> "$gw_port" || inport == "svc_port")' allow
> acl-add $ls from-lport 1005 'ip4.dst == 192.168.0.0/16' drop
> 
> On compute node, I see the "drop" OF flow only, not the "allow" flow.
> Am I missing anything here?

Hi Numan,

This works! The '$' was missing from "svc_port"!

Thanks for the advice!
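
To avoid this kind of quoting slip, the match string can be built from
real port names instead of being typed by hand. The following is only a
sketch: the helper names, the "known_ports" check and the sample port names
are made up for illustration. It only builds the argument lists for
"ovn-nbctl acl-add" and does not run anything.

```python
def inport_clause(ports, known_ports):
    """Build 'inport == "a" || inport == "b"' from real port names.

    Raises ValueError when a name is not a known logical port, which
    catches an unexpanded placeholder such as "svc_port".
    """
    unknown = [p for p in ports if p not in known_ports]
    if unknown:
        raise ValueError(f"unknown logical ports: {unknown}")
    return " || ".join(f'inport == "{p}"' for p in ports)


def acl_add_argv(switch, priority, match, verdict, direction="from-lport"):
    """Argument list for: ovn-nbctl acl-add SWITCH DIRECTION PRIORITY MATCH VERDICT."""
    return ["ovn-nbctl", "acl-add", switch, direction, str(priority), match, verdict]


def isolation_acls(switch, subnet, exempt_ports, known_ports):
    """High-priority allow for the exempt ports, lower-priority drop for the rest."""
    allow = f"ip4.dst == {subnet} && ({inport_clause(exempt_ports, known_ports)})"
    drop = f"ip4.dst == {subnet}"
    return [
        acl_add_argv(switch, 1006, allow, "allow"),
        acl_add_argv(switch, 1005, drop, "drop"),
    ]
```

For example, isolation_acls("ls1", "192.168.0.0/16", ["gw-1", "svc-1"],
{"gw-1", "svc-1", "vm-1"}) returns the two argument lists, while passing the
literal "svc_port" raises ValueError instead of silently producing an ACL
that matches no port.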

Tony
> 
> 
> Thanks!
> 
> Tony
> 


Re: [ovs-discuss] [OVN] not-equal in ACL

2020-08-12 Thread Tony Liu
> -----Original Message-----
> From: Numan Siddique 
> Sent: Wednesday, August 12, 2020 10:25 AM
> To: Tony Liu 
> Cc: ovs-discuss@openvswitch.org
> Subject: Re: [ovs-discuss] [OVN] not-equal in ACL
> 
> On Wed, Aug 12, 2020 at 10:41 PM Tony Liu 
> wrote:
> >
> > > -----Original Message-----
> > > From: Numan Siddique 
> > > Sent: Wednesday, August 12, 2020 2:17 AM
> > > To: Tony Liu 
> > > Cc: ovs-discuss@openvswitch.org
> > > Subject: Re: [ovs-discuss] [OVN] not-equal in ACL
> > >
> > > On Mon, Aug 10, 2020 at 11:11 PM Tony Liu 
> > > wrote:
> > > >
> > > > Hi Numan,
> > > >
> > > > Create a new thread here to follow up ACL questions.
> > > >
> > > > > > > I think this is a big problem here. We should not use "!="
> > > > > > > in logical flows, although OVN allows.
> > > > > >
> > > > > > Is this a generic recommendation or for certain cases?
> > > > > > Is it OK to add an ACL with "!=", like below?
> > > > > >
> > > > > > ovn-nbctl acl-add 12b1681c-b3e7-4ec9-b324-e780d9dfdc0d
> > > > > > from-lport
> > > 1005
> > > > > > 'ip4.dst == 192.168.0.0/16 && inport !=
> > > > > > "d93619c3-dab9-4f6d-8261-4211f6937fd1"' drop
> > > > >
> > > > >
> > > > > This is a generic recommendation. The above ACL would also
> > > > > result in many OF flows.
> > > > >
> > > > > To handle cases like above, you can add a couple of ACLs like
> > > > > below
> > > with
> > > > > high priority flow to allow the desired inport and low priority
> > > > > ACL
> > > to
> > > > > drop all the traffic.
> > > > >
> > > > >  ovn-nbctl acl-add 12b1681c-b3e7-4ec9-b324-e780d9dfdc0d
> > > > > from-lport
> > > > > 1006 'ip4.dst == 192.168.0.0/16 && inport ==
> > > > > "d93619c3-dab9-4f6d-
> > > 8261-
> > > > > 4211f6937fd1"' allow  ovn-nbctl acl-add 12b1681c-b3e7-4ec9-b324-
> > > > > e780d9dfdc0d from-lport
> > > > > 1005 'ip4.dst == 192.168.0.0/16"' drop
> > > >
> > > > In my case, two LS connect to one LR who has external access.
> > > > There are 3 ports on each LS.
> > > > * vm_port
> > > > * gw_port (connect to LR)
> > > > * svc_port (localport for DHCP and metadata)
> > > >
> > > > What I want is to disable the connection between two LS while
> > > > allow external access for them.
> > > >
> > > > Option #1, create one ACL for each VM on each LS.
> > > > 
> > > > acl-add $ls from-lport 1005 'ip4.dst == 192.168.0.0/16 && inport
> > > > ==
> > > "$vm_port"' drop
> > > > 
> > > > This works fine for me, but the ACL has to be per VM.
> > > >
> > > > Option #2, create one ACL to exclude gw_port and svc_port.
> > > > 
> > > > acl-add $ls from-lport 1005 'ip4.dst == 192.168.0.0/16 && inport
> > > > !=
> > > "$gw_port" && inport != "svc_port"' drop
> > > > 
> > > > As you mentioned, this is not recommended, cause it will result
> > > > many OF flows. I actually tried, but I don't see any OF flows
> > > > created for that ACL. Is there any policy in ovn-controller to not
> > > > translate such policy to OF flow?
> > > >
> > > > Option #3, as you suggested, I tried 2 ACLs.
> > > > 
> > > > acl-add $ls from-lport 1006 'ip4.dst == 192.168.0.0/16 && (inport
> > > > ==
> > > "$gw_port" || inport == "svc_port")' allow
> > > > acl-add $ls from-lport 1005 'ip4.dst == 192.168.0.0/16' drop
> > > >  On compute node, I see the "drop" OF flow only, not the
> > > > "allow" flow.
> > > > Am I missing anything here?
> > > >
> > >
> > > If there is a logical flow like "inport == port1 && .", the
> > > ovn-controller that binds this logical port converts the logical
> > > flow to an OF rule.
> > > Other ovn-controllers ignore this logical flow. I think that's what
> > >

Re: [ovs-discuss] [OVN] not-equal in ACL

2020-08-12 Thread Tony Liu
> -----Original Message-----
> From: Numan Siddique 
> Sent: Wednesday, August 12, 2020 2:17 AM
> To: Tony Liu 
> Cc: ovs-discuss@openvswitch.org
> Subject: Re: [ovs-discuss] [OVN] not-equal in ACL
> 
> On Mon, Aug 10, 2020 at 11:11 PM Tony Liu 
> wrote:
> >
> > Hi Numan,
> >
> > Create a new thread here to follow up ACL questions.
> >
> > > > > I think this is a big problem here. We should not use "!=" in
> > > > > logical flows, although OVN allows.
> > > >
> > > > Is this a generic recommendation or for certain cases?
> > > > Is it OK to add an ACL with "!=", like below?
> > > >
> > > > ovn-nbctl acl-add 12b1681c-b3e7-4ec9-b324-e780d9dfdc0d from-lport
> 1005
> > > > 'ip4.dst == 192.168.0.0/16 && inport !=
> > > > "d93619c3-dab9-4f6d-8261-4211f6937fd1"' drop
> > >
> > >
> > > This is a generic recommendation. The above ACL would also result in
> > > many OF flows.
> > >
> > > To handle cases like above, you can add a couple of ACLs like below
> with
> > > high priority flow to allow the desired inport and low priority ACL
> to
> > > drop all the traffic.
> > >
> > >  ovn-nbctl acl-add 12b1681c-b3e7-4ec9-b324-e780d9dfdc0d from-lport
> > > 1006 'ip4.dst == 192.168.0.0/16 && inport == "d93619c3-dab9-4f6d-
> 8261-
> > > 4211f6937fd1"' allow  ovn-nbctl acl-add 12b1681c-b3e7-4ec9-b324-
> > > e780d9dfdc0d from-lport
> > > 1005 'ip4.dst == 192.168.0.0/16"' drop
> >
> > In my case, two LS connect to one LR who has external access.
> > There are 3 ports on each LS.
> > * vm_port
> > * gw_port (connect to LR)
> > * svc_port (localport for DHCP and metadata)
> >
> > What I want is to disable the connection between two LS while allow
> > external access for them.
> >
> > Option #1, create one ACL for each VM on each LS.
> > 
> > acl-add $ls from-lport 1005 'ip4.dst == 192.168.0.0/16 && inport ==
> "$vm_port"' drop
> > 
> > This works fine for me, but the ACL has to be per VM.
> >
> > Option #2, create one ACL to exclude gw_port and svc_port.
> > 
> > acl-add $ls from-lport 1005 'ip4.dst == 192.168.0.0/16 && inport !=
> "$gw_port" && inport != "svc_port"' drop
> > 
> > As you mentioned, this is not recommended, cause it will result many
> > OF flows. I actually tried, but I don't see any OF flows created for
> > that ACL. Is there any policy in ovn-controller to not translate such
> > policy to OF flow?
> >
> > Option #3, as you suggested, I tried 2 ACLs.
> > 
> > acl-add $ls from-lport 1006 'ip4.dst == 192.168.0.0/16 && (inport ==
> "$gw_port" || inport == "svc_port")' allow
> > acl-add $ls from-lport 1005 'ip4.dst == 192.168.0.0/16' drop
> > 
> > On compute node, I see the "drop" OF flow only, not the "allow" flow.
> > Am I missing anything here?
> >
> 
> If there is a logical flow like "inport == port1 && .", the
> ovn-controller that binds this logical port converts the logical
> flow to an OF rule.
> Other ovn-controllers ignore this logical flow. I think that's what is
> happening in your case.

I don't quite get it. Are you saying that ovn-controller on the compute
node ignores the rule because those ports are not all bound on that
chassis? The gw_port and svc_port are not bound to any chassis by
any ovn-controller.

If that's true, I'd say it's a bug. gw_port and svc_port exist on
every chassis that has a VM launched on that logical switch.
ovn-controller on those chassis should not ignore the ACL.
Otherwise, those ports can't be used in ACLs at all.

> I think there are many ways to solve your case.
> 
> 1. Have separate logical router for each logical switch and connect
> these logical routers to your provider network logical switch.

I thought about that. If I have 5K such networks, I will need 5K
logical routers, and also 5K routes on the underlay physical router
pointing to those logical routers. Without enabling BGP (I haven't
tried the Neutron BGP agent), that means 5K static routes on the
underlay router. That's why I chose a middle ground between one router
for all networks and one router for each network.

> 2. Add ACLs on the egress pipeline. I'd suggest this rather than on
> the ingress pipeline.

I'd like to drop the packet as early as possible to get better
performance. How much difference between droppi

[ovs-discuss] [OVN] ToR as the gateway

2020-08-11 Thread Tony Liu
Hi,

In http://www.openvswitch.org/support/dist-docs/ovn-architecture.7.html,
I see this.

For connecting to gateways, in addition to Geneve and STT, OVN
supports VXLAN, because only VXLAN support is common on
top-of-rack (ToR) switches.


Does anyone have experience using a physical ToR as the gateway
with VXLAN to connect private virtual networks to external networks?


Thanks!

Tony



[ovs-discuss] [OVN] Dependency to OVS

2020-08-11 Thread Tony Liu
Hi,

Is there any version dependency between OVN and OVS,
such as which OVS version has to be used with a given OVN release?
Is such info documented anywhere?


Thanks!

Tony



[ovs-discuss] [OVN] not-equal in ACL

2020-08-10 Thread Tony Liu
Hi Numan,

Create a new thread here to follow up ACL questions.

> > > I think this is a big problem here. We should not use "!=" in
> > > logical flows, although OVN allows.
> >
> > Is this a generic recommendation or for certain cases?
> > Is it OK to add an ACL with "!=", like below?
> >
> > ovn-nbctl acl-add 12b1681c-b3e7-4ec9-b324-e780d9dfdc0d from-lport 1005
> > 'ip4.dst == 192.168.0.0/16 && inport !=
> > "d93619c3-dab9-4f6d-8261-4211f6937fd1"' drop
> 
> 
> This is a generic recommendation. The above ACL would also result in
> many OF flows.
> 
> To handle cases like above, you can add a couple of ACLs like below with
> high priority flow to allow the desired inport and low priority ACL to
> drop all the traffic.
> 
>  ovn-nbctl acl-add 12b1681c-b3e7-4ec9-b324-e780d9dfdc0d from-lport
> 1006 'ip4.dst == 192.168.0.0/16 && inport == "d93619c3-dab9-4f6d-8261-
> 4211f6937fd1"' allow  ovn-nbctl acl-add 12b1681c-b3e7-4ec9-b324-
> e780d9dfdc0d from-lport
> 1005 'ip4.dst == 192.168.0.0/16"' drop

In my case, two LSes connect to one LR that has external access.
There are 3 ports on each LS.
* vm_port
* gw_port (connect to LR)
* svc_port (localport for DHCP and metadata)

What I want is to disable the connection between the two LSes while
still allowing external access for them.

Option #1, create one ACL for each VM on each LS.

acl-add $ls from-lport 1005 'ip4.dst == 192.168.0.0/16 && inport == "$vm_port"' drop

This works fine for me, but the ACL has to be per VM.

Option #2, create one ACL to exclude gw_port and svc_port.

acl-add $ls from-lport 1005 'ip4.dst == 192.168.0.0/16 && inport != "$gw_port" && inport != "svc_port"' drop

As you mentioned, this is not recommended, because it will result in
many OF flows. I actually tried it, but I don't see any OF flows created
for that ACL. Is there any policy in ovn-controller to not translate such
an ACL to OF flows?

Option #3, as you suggested, I tried 2 ACLs.

acl-add $ls from-lport 1006 'ip4.dst == 192.168.0.0/16 && (inport == "$gw_port" || inport == "svc_port")' allow
acl-add $ls from-lport 1005 'ip4.dst == 192.168.0.0/16' drop

On the compute node, I see only the "drop" OF flow, not the "allow" flow.
Am I missing anything here?
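
For reference, the intended behaviour of the two ACLs in option #3 can be
modelled as "highest priority match wins". This is a toy model only, not
how ovn-controller evaluates flows; the packet dictionary, the port names
and the default verdict when nothing matches are assumptions made for the
sketch.

```python
import ipaddress

SUBNET = ipaddress.ip_network("192.168.0.0/16")


def in_subnet(pkt):
    """True when the packet's IPv4 destination is inside 192.168.0.0/16."""
    return ipaddress.ip_address(pkt["ip4.dst"]) in SUBNET


# (priority, predicate, verdict), mirroring the two ACLs of option #3.
ACLS = [
    (1006, lambda p: in_subnet(p) and p["inport"] in {"gw_port", "svc_port"}, "allow"),
    (1005, in_subnet, "drop"),
]


def evaluate(acls, pkt, default="allow"):
    """Return the verdict of the highest-priority ACL that matches pkt.

    `default` is what happens when no ACL matches (assumed to be allow).
    """
    for _priority, predicate, verdict in sorted(acls, key=lambda a: a[0], reverse=True):
        if predicate(pkt):
            return verdict
    return default
```

With this model, a packet from vm_port to 192.168.1.5 is dropped, the same
packet from gw_port is allowed, and a packet from vm_port to an address
outside 192.168.0.0/16 matches neither ACL.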


Thanks!

Tony



Re: [ovs-discuss] ovn-k8s scale: how to make new ovn-controller process keep the previous Open Flow in br-int

2020-08-08 Thread Tony Liu
Hi,

I wonder if we can have a solution combining all your excellent ideas?
* Don't clean up flows until recompute is done.
* Use nb_cfg to determine if update is required.
* Compute and apply the diff only.
* Save ovn-controller version in ovsdb.

Two scenarios to cover, restart and upgrade.

If we can save ovn-controller version in ovsdb, we will be able to
differentiate restart and upgrade.

if version match            # restart
    if nb_cfg changed
        1) Compute and apply the diff only, this has the minimum impact.
        or
        2) Clean up and rebuild all flows, after recompute is done.
           This causes some down time, but it should be acceptable.
    else
        return              # No update is required.
else                        # upgrade/downgrade
    Clean up and rebuild all flows to avoid any inconsistency caused by
    the version change, after recompute is done. For upgrade, such down
    time should be acceptable. To achieve the minimum impact, it's still
    possible to find out the diff for both flow and schema changes and
    apply the diff only, but it's too complicated and not worth it for
    the upgrade case.

IMHO, the highest priority is to minimize the impact on the data plane
when ovn-controller restarts for whatever reason.
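
The decision above can be written down as a small function. This is just a
sketch of the proposal, not existing ovn-controller code; the names, the
"can_diff" switch and the idea that the saved version and nb_cfg are read
back from ovsdb at startup are assumptions.

```python
from enum import Enum


class Action(Enum):
    NONE = "none"              # flows already up to date, touch nothing
    APPLY_DIFF = "apply_diff"  # compute and apply only the difference
    REBUILD = "rebuild"        # clean up and reinstall after recompute


def startup_action(saved_version, running_version,
                   saved_nb_cfg, current_nb_cfg, can_diff=True):
    """Pick what a freshly started ovn-controller should do with old flows."""
    if saved_version != running_version:
        # Upgrade or downgrade: rebuild to avoid version inconsistencies.
        return Action.REBUILD
    if saved_nb_cfg == current_nb_cfg:
        # Plain restart and nothing changed in the meantime.
        return Action.NONE
    # Plain restart, but the configuration moved on while we were down.
    return Action.APPLY_DIFF if can_diff else Action.REBUILD
```

In every case the old flows stay in place until the recompute is done, so
the only difference between the outcomes is how much gets reinstalled.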


Thanks!

Tony

> -----Original Message-----
> From: discuss  On Behalf Of Han
> Zhou
> Sent: Friday, August 7, 2020 1:04 PM
> To: Numan Siddique 
> Cc: ovn-kuberne...@googlegroups.com; ovs-discuss@openvswitch.org;
> Venugopal Iyer 
> Subject: Re: [ovs-discuss] ovn-k8s scale: how to make new ovn-controller
> process keep the previous Open Flow in br-int
> 
> 
> 
> On Fri, Aug 7, 2020 at 12:35 PM Numan Siddique   > wrote:
> 
> 
> 
> 
>   On Sat, Aug 8, 2020 at 12:16 AM Han Zhou   > wrote:
> 
> 
> 
> 
>   On Thu, Aug 6, 2020 at 10:22 AM Han Zhou   > wrote:
> 
> 
> 
> 
>   On Thu, Aug 6, 2020 at 9:15 AM Numan Siddique
> mailto:num...@ovn.org> > wrote:
> 
> 
> 
> 
>   On Thu, Aug 6, 2020 at 9:25 PM Venugopal Iyer
> mailto:venugop...@nvidia.com> > wrote:
> 
> 
>   Hi, Han:
> 
> 
> 
>   A comment inline:
> 
> 
> 
>   From: ovn-kuberne...@googlegroups.com
>    kuberne...@googlegroups.com  >
> On Behalf Of Han Zhou
>   Sent: Wednesday, August 5, 2020 3:36 PM
>   To: Winson Wang   >
>   Cc: ovs-discuss@openvswitch.org 
>  disc...@openvswitch.org> ; ovn-kuberne...@googlegroups.com  kuberne...@googlegroups.com> ; Dumitru Ceara   >; Han Zhou   >
>   Subject: Re: ovn-k8s scale: how to make 
> new
> ovn-controller process keep the previous Open Flow in br-int
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
>   On Wed, Aug 5, 2020 at 12:58 PM Winson 
> Wang
> mailto:windson.w...@gmail.com> > wrote:
> 
>   Hello OVN Experts,
> 
>   With ovn-k8s, we need to keep the flows needed by running pods on
>   the k8s node always present on br-int.
> 
>   Is there an ongoing project to address this problem?
>   If not, I have one proposal, though I am not sure if it is doable.
>   Please share your thoughts.
> 
>   The issue:
> 
>   In a large-scale ovn-k8s cluster there are 200K+ OpenFlow flows on
>   br-int on every K8s node. When we restart ovn-controller for upgrade
>   using `ovs-appctl -t ovn-controller exit --restart`, the remaining
>   traffic still works fine since the flows on br-int stay installed.
> 
>   However, when a new ovn-controller starts, it will connect to the OVS
>   IDL and do an engine init run, clearing all OpenFlow flows and
>   installing flows based on the SB DB.
> 
>   With the flow count above 200K, it took more than 15 seconds to get
>   all the flows installed on the br-int bridge again.
> 
>   Proposed solution for the issue:
> 
>   When the ovn-controller gets “exit --restart”, it will write an
>   “ovs-cond-seqno” to the OVS IDL and store the value in the
>   external-ids column of the Open_vSwitch table. When the new
>   ovn-controller starts, it will check if the “ovs-cond-seqno” exists
>   in the Open_vSwitch table,
> and get the seqno from OVS IDL to decide 

[ovs-discuss] [OVN] Kolla Ansible deployment

2020-08-07 Thread Tony Liu
Hi Michal,

Based on some comments from the community, I'd like to draw your
attention to a couple of things here.

1. For now, "--db-nb-create-insecure-remote=yes" is set in
ovn-nb-db.json.j2 when deploying ovn-nb-db and ovn-sb-db.

As a result, the ovn-ctl script runs ovsdb-server with a TCP method
set on the command line.

It's recommended not to set the TCP method on the command line and
instead to create a connection table entry for the TCP method. That
way, the inactivity probe interval can be set in the connection
table as well.

2. On a chassis node (compute node or gateway node), ovsdb-server
opens a TCP socket and ovn-controller connects to it. Since this
is local communication, it's recommended to use a UNIX socket.
That also avoids the inactivity probe.

How can I get updated Kolla container images?
For example, I am currently using ussuri, where OVN is 20.03.
I wonder if I can get an upgraded ussuri image with OVN 20.06?
Or is the Kolla container upgrade aligned with the OpenStack release,
so that I won't be able to get an upgraded OVN until the next
OpenStack release? If that's the case, I will just upgrade the
container myself.


Thanks!

Tony
> -----Original Message-----
> From: Michał Nasiadka 
> Sent: Thursday, July 23, 2020 3:27 AM
> To: Tony Liu 
> Cc: Daniel Alvarez ; ovs-discuss@openvswitch.org
> Subject: Re: [ovs-discuss] OVN scale
> 
> Hi Tony,
> 
> I’m the core reviewer/developer behind OVN implementation in Kolla-
> Ansible.
> Just to make some clarifications to the thread - yes Kolla-Ansible
> deploys a raft cluster.
> 
> I would be happy to see some results/reports from that OVN scale test -
> and if you would see any improvements/bugs we could resolve - please
> just let me know.
> 
> Best regards,
> 
> Michal



Re: [ovs-discuss] [OVN] ovn-controller takes 100% cpu while no changes in sb-db

2020-08-07 Thread Tony Liu
Good to know. Yes, I am using 20.03.
Will try to upgrade.


Thanks!

Tony
> -----Original Message-----
> From: Han Zhou 
> Sent: Friday, August 7, 2020 3:39 PM
> To: Tony Liu 
> Cc: ovs-dev ; ovs-discuss  disc...@openvswitch.org>
> Subject: Re: [ovs-discuss] [OVN] ovn-controller takes 100% cpu while no
> changes in sb-db
> 
> The change in external-ids is not monitored by ovn-controller any more
> after version 20.06. Probably you are still using an older version?
> 
> 
> On Fri, Aug 7, 2020 at 3:11 PM Tony Liu  <mailto:tonyliu0...@hotmail.com> > wrote:
> 
> 
>   Is the changes made by Neutron liveness-check in SB properly
> handled
>   by inc_proc_eng?
> 
>   Given my testing scale, even the frequency is lowered by that fix,
>   ovn-controller still takes quite lot of CPU and time to compute SB.
> 
> 
>   Thanks!
> 
>   Tony
>   > -----Original Message-----
>   > From: Tony Liu  <mailto:tonyliu0...@hotmail.com> >
>   > Sent: Friday, August 7, 2020 1:25 PM
>   > To: Tony Liu  <mailto:tonyliu0...@hotmail.com> >; Han Zhou  <mailto:zhou...@gmail.com> >
>   > Cc: ovs-dev mailto:ovs-
> d...@openvswitch.org> >; ovs-discuss> disc...@openvswitch.org <mailto:disc...@openvswitch.org> >
>   > Subject: RE: [ovs-discuss] [OVN] ovn-controller takes 100% cpu
> while no
>   > changes in sb-db
>   >
>   > Thanks for the hint!
>   > 
>   > 2020-08-07T19:44:18.019Z|616614|jsonrpc|DBG|tcp:10.6.20.84:6642
> <http://10.6.20.84:6642> :
>   > received notification, method="update3",
>   > params=[["monid","OVN_Southbound"],"5002cb22-13e5-490a-9a64-
>   > 5d48914138ca",{"Chassis":{"0390b621-152b-48a0-a3d1-
>   >
> 2973c0b823cc":{"modify":{"external_ids":["map",[["neutron:liveness_check
>   > _at","2020-08-07T19:44:07.130233+00:00"]]]]
>   > 
>   >
>   > Nailed it...
>   > https://bugs.launchpad.net/neutron/+bug/1883554
>   >
>   >
>   > Tony
>   > > -----Original Message-----
>   > > From: dev mailto:ovs-dev-
> boun...@openvswitch.org> > On Behalf Of Tony Liu
>   > > Sent: Friday, August 7, 2020 1:14 PM
>   > > To: Han Zhou mailto:zhou...@gmail.com> >
>   > > Cc: ovs-dev mailto:ovs-
> d...@openvswitch.org> >; ovs-discuss> > disc...@openvswitch.org <mailto:disc...@openvswitch.org> >
>   > > Subject: Re: [ovs-dev] [ovs-discuss] [OVN] ovn-controller takes
> 100%
>   > > cpu while no changes in sb-db
>   > >
>   > >
>   > > Here is the output.
>   > > 
>   > > [root@gateway-1 ~]# docker exec ovn_controller ovs-appctl -t
>   > > /run/ovn/ovn-controller.6.ctl coverage/show Event coverage, avg
> rate
>   > > over last: 5 seconds, last minute, last hour,  hash=e70a83c8:
>   > > lflow_run  0.0/sec 0.083/sec
> 0.0725/sec
>   > > total: 295
>   > > miniflow_malloc0.0/sec 44356.817/sec
> 44527.3975/sec
>   > > total: 180635403
>   > > hindex_pathological0.0/sec 0.000/sec
> 0./sec
>   > > total: 7187
>   > > hindex_expand  0.0/sec 0.000/sec
> 0./sec
>   > > total: 17
>   > > hmap_pathological  0.0/sec 4.167/sec
> 4.1806/sec
>   > > total: 25091
>   > > hmap_expand0.0/sec  5366.500/sec
> 5390.0800/sec
>   > > total: 23680738
>   > > txn_unchanged  0.0/sec 0.300/sec
> 0.3175/sec
>   > > total: 11024
>   > > txn_incomplete 0.0/sec 0.100/sec
> 0.0836/sec
>   > > total: 974
>   > > txn_success0.0/sec 0.033/sec
> 0.0308/sec
>   > > total: 129
>   > > txn_try_again  0.0/sec 0.000/sec
> 0.0003/sec
>   > > total: 1
>   > > poll_create_node   0.4/sec 1.933/sec
> 1.9575/sec
>   > > total: 55611
>   > > poll_zero_timeout  0.0/sec 0.067/sec
> 0.0556/sec
>   > > total: 241
>   > > rconn_queued   0.0/sec 0.050/sec
> 0.0594/sec
>   > > total: 1208720
>   > > rconn_sent  

[ovs-discuss] [OVN] log file rotation and time zone

2020-08-07 Thread Tony Liu
Hi,

Are log file rotation and compression supported for all
OVN services?

How is the time zone of the log timestamps determined?
Is it /etc/localtime, hardcoded UTC, or anything else?


Thanks!

Tony



Re: [ovs-discuss] [OVN] ovn-controller takes 100% cpu while no changes in sb-db

2020-08-07 Thread Tony Liu
Are the changes made by the Neutron liveness check in SB properly
handled by inc_proc_eng?

Given my testing scale, even with the frequency lowered by that fix,
ovn-controller still takes quite a lot of CPU and time to compute SB.


Thanks!

Tony
> -----Original Message-----
> From: Tony Liu 
> Sent: Friday, August 7, 2020 1:25 PM
> To: Tony Liu ; Han Zhou 
> Cc: ovs-dev ; ovs-discuss  disc...@openvswitch.org>
> Subject: RE: [ovs-discuss] [OVN] ovn-controller takes 100% cpu while no
> changes in sb-db
> 
> Thanks for the hint!
> 
> 2020-08-07T19:44:18.019Z|616614|jsonrpc|DBG|tcp:10.6.20.84:6642:
> received notification, method="update3",
> params=[["monid","OVN_Southbound"],"5002cb22-13e5-490a-9a64-
> 5d48914138ca",{"Chassis":{"0390b621-152b-48a0-a3d1-
> 2973c0b823cc":{"modify":{"external_ids":["map",[["neutron:liveness_check
> _at","2020-08-07T19:44:07.130233+00:00"]]]}}}}]
> 
> 
> Nailed it...
> https://bugs.launchpad.net/neutron/+bug/1883554
> 
> 
> Tony
> > -----Original Message-----
> > From: dev  On Behalf Of Tony Liu
> > Sent: Friday, August 7, 2020 1:14 PM
> > To: Han Zhou 
> > Cc: ovs-dev ; ovs-discuss  > disc...@openvswitch.org>
> > Subject: Re: [ovs-dev] [ovs-discuss] [OVN] ovn-controller takes 100%
> > cpu while no changes in sb-db
> >
> >
> > Here is the output.
> > 
> > [root@gateway-1 ~]# docker exec ovn_controller ovs-appctl -t
> > /run/ovn/ovn-controller.6.ctl coverage/show Event coverage, avg rate
> > over last: 5 seconds, last minute, last hour,  hash=e70a83c8:
> > lflow_run  0.0/sec 0.083/sec0.0725/sec
> > total: 295
> > miniflow_malloc0.0/sec 44356.817/sec44527.3975/sec
> > total: 180635403
> > hindex_pathological0.0/sec 0.000/sec0./sec
> > total: 7187
> > hindex_expand  0.0/sec 0.000/sec0./sec
> > total: 17
> > hmap_pathological  0.0/sec 4.167/sec4.1806/sec
> > total: 25091
> > hmap_expand0.0/sec  5366.500/sec 5390.0800/sec
> > total: 23680738
> > txn_unchanged  0.0/sec 0.300/sec0.3175/sec
> > total: 11024
> > txn_incomplete 0.0/sec 0.100/sec0.0836/sec
> > total: 974
> > txn_success0.0/sec 0.033/sec0.0308/sec
> > total: 129
> > txn_try_again  0.0/sec 0.000/sec0.0003/sec
> > total: 1
> > poll_create_node   0.4/sec 1.933/sec1.9575/sec
> > total: 55611
> > poll_zero_timeout  0.0/sec 0.067/sec0.0556/sec
> > total: 241
> > rconn_queued   0.0/sec 0.050/sec0.0594/sec
> > total: 1208720
> > rconn_sent 0.0/sec 0.050/sec0.0594/sec
> > total: 1208720
> > seq_change 0.2/sec 0.783/sec0.7492/sec
> > total: 13962
> > pstream_open   0.0/sec 0.000/sec0./sec
> > total: 1
> > stream_open0.0/sec 0.000/sec0.0003/sec
> > total: 5
> > unixctl_received   0.0/sec 0.000/sec0.0011/sec
> > total: 4
> > unixctl_replied0.0/sec 0.000/sec0.0011/sec
> > total: 4
> > util_xalloc0.8/sec 1396586.967/sec   240916.6047/sec
> > total: 5834154064
> > vconn_open 0.0/sec 0.000/sec0./sec
> > total: 2
> > vconn_received 0.0/sec 0.050/sec0.0594/sec
> > total: 632
> > vconn_sent 0.0/sec 0.050/sec0.0494/sec
> > total: 1213248
> > netlink_received   0.0/sec 0.300/sec0.2900/sec
> > total: 1188
> > netlink_recv_jumbo 0.0/sec 0.083/sec0.0725/sec
> > total: 296
> > netlink_sent   0.0/sec 0.300/sec0.2900/sec
> > total: 1188
> > cmap_expand0.0/sec 0.000/sec0./sec
> > total: 3
> > 82 events never hit
> > [root@gateway-1 ~]# docker exec ovn_controller ovs-appctl -t
> > /run/ovn/ovn-controller.6.ctl coverage/show Event coverage, avg rate
> > over last: 5 seconds, last minute, last hour,  hash=d0107601:
> > lflow_run  0.2/sec 0.083/sec0.0717/sec
> > total: 296
> > miniflow_malloc  122834.2/sec 51180.917/sec43930.2869/sec
> > total: 181249574
> > hindex_pathological0.0/sec 0.0

Re: [ovs-discuss] [OVN] ovn-controller takes 100% cpu while no changes in sb-db

2020-08-07 Thread Tony Liu
Thanks for the hint!

2020-08-07T19:44:18.019Z|616614|jsonrpc|DBG|tcp:10.6.20.84:6642: received 
notification, method="update3", 
params=[["monid","OVN_Southbound"],"5002cb22-13e5-490a-9a64-5d48914138ca",{"Chassis":{"0390b621-152b-48a0-a3d1-2973c0b823cc":{"modify":{"external_ids":["map",[["neutron:liveness_check_at","2020-08-07T19:44:07.130233+00:00"]]]}}}}]


Nailed it...
https://bugs.launchpad.net/neutron/+bug/1883554
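
To spot this kind of churn quickly, the "params" of an update3 notification
can be summarized per table, row and column. A minimal sketch, assuming the
params have already been cut out of the log line and are valid JSON; the
function name is made up, and SAMPLE is the notification from the log
above.

```python
import json

SAMPLE = (
    '[["monid","OVN_Southbound"],"5002cb22-13e5-490a-9a64-5d48914138ca",'
    '{"Chassis":{"0390b621-152b-48a0-a3d1-2973c0b823cc":{"modify":'
    '{"external_ids":["map",[["neutron:liveness_check_at",'
    '"2020-08-07T19:44:07.130233+00:00"]]]}}}}]'
)


def changed_columns(params):
    """Yield (table, row_uuid, operation, [columns]) for update3 params.

    params is [monitor-id, last-txn-id, table-updates]; each row update
    maps an operation ("initial", "insert", "modify", "delete") to the
    affected columns ("delete" carries no columns).
    """
    _monitor_id, _last_txn_id, table_updates = params
    for table, rows in table_updates.items():
        for row_uuid, row_update in rows.items():
            for operation, columns in row_update.items():
                names = sorted(columns) if isinstance(columns, dict) else []
                yield table, row_uuid, operation, names


if __name__ == "__main__":
    for change in changed_columns(json.loads(SAMPLE)):
        print(change)
```

For the sample, the only change reported is a "modify" of external_ids on
one Chassis row, which is exactly the liveness check update.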


Tony
> -----Original Message-----
> From: dev  On Behalf Of Tony Liu
> Sent: Friday, August 7, 2020 1:14 PM
> To: Han Zhou 
> Cc: ovs-dev ; ovs-discuss  disc...@openvswitch.org>
> Subject: Re: [ovs-dev] [ovs-discuss] [OVN] ovn-controller takes 100% cpu
> while no changes in sb-db
> 
> 
> Here is the output.
> 
> [root@gateway-1 ~]# docker exec ovn_controller ovs-appctl -t
> /run/ovn/ovn-controller.6.ctl coverage/show Event coverage, avg rate
> over last: 5 seconds, last minute, last hour,  hash=e70a83c8:
> lflow_run  0.0/sec 0.083/sec0.0725/sec
> total: 295
> miniflow_malloc0.0/sec 44356.817/sec44527.3975/sec
> total: 180635403
> hindex_pathological0.0/sec 0.000/sec0./sec
> total: 7187
> hindex_expand  0.0/sec 0.000/sec0./sec
> total: 17
> hmap_pathological  0.0/sec 4.167/sec4.1806/sec
> total: 25091
> hmap_expand0.0/sec  5366.500/sec 5390.0800/sec
> total: 23680738
> txn_unchanged  0.0/sec 0.300/sec0.3175/sec
> total: 11024
> txn_incomplete 0.0/sec 0.100/sec0.0836/sec
> total: 974
> txn_success0.0/sec 0.033/sec0.0308/sec
> total: 129
> txn_try_again  0.0/sec 0.000/sec0.0003/sec
> total: 1
> poll_create_node   0.4/sec 1.933/sec1.9575/sec
> total: 55611
> poll_zero_timeout  0.0/sec 0.067/sec0.0556/sec
> total: 241
> rconn_queued   0.0/sec 0.050/sec0.0594/sec
> total: 1208720
> rconn_sent 0.0/sec 0.050/sec0.0594/sec
> total: 1208720
> seq_change 0.2/sec 0.783/sec0.7492/sec
> total: 13962
> pstream_open   0.0/sec 0.000/sec0./sec
> total: 1
> stream_open0.0/sec 0.000/sec0.0003/sec
> total: 5
> unixctl_received   0.0/sec 0.000/sec0.0011/sec
> total: 4
> unixctl_replied0.0/sec 0.000/sec0.0011/sec
> total: 4
> util_xalloc0.8/sec 1396586.967/sec   240916.6047/sec
> total: 5834154064
> vconn_open 0.0/sec 0.000/sec0./sec
> total: 2
> vconn_received 0.0/sec 0.050/sec0.0594/sec
> total: 632
> vconn_sent 0.0/sec 0.050/sec0.0494/sec
> total: 1213248
> netlink_received   0.0/sec 0.300/sec0.2900/sec
> total: 1188
> netlink_recv_jumbo 0.0/sec 0.083/sec0.0725/sec
> total: 296
> netlink_sent   0.0/sec 0.300/sec0.2900/sec
> total: 1188
> cmap_expand0.0/sec 0.000/sec0./sec
> total: 3
> 82 events never hit
> [root@gateway-1 ~]# docker exec ovn_controller ovs-appctl -t
> /run/ovn/ovn-controller.6.ctl coverage/show Event coverage, avg rate
> over last: 5 seconds, last minute, last hour,  hash=d0107601:
> lflow_run  0.2/sec 0.083/sec0.0717/sec
> total: 296
> miniflow_malloc  122834.2/sec 51180.917/sec43930.2869/sec
> total: 181249574
> hindex_pathological0.0/sec 0.000/sec0./sec
> total: 7187
> hindex_expand  0.0/sec 0.000/sec0./sec
> total: 17
> hmap_pathological 13.2/sec 4.967/sec4.1264/sec
> total: 25157
> hmap_expand  14982.2/sec  6205.067/sec 5317.9547/sec
> total: 23755649
> txn_unchanged  1.4/sec 0.400/sec0.3144/sec
> total: 11031
> txn_incomplete 0.4/sec 0.117/sec0.0825/sec
> total: 976
> txn_success0.2/sec 0.050/sec0.0306/sec
> total: 130
> txn_try_again  0.0/sec 0.000/sec0.0003/sec
> total: 1
> poll_create_node   7.6/sec 2.467/sec1.9353/sec
> total: 55649
> poll_zero_timeout  0.4/sec 0.100/sec0.0547/sec
> total: 243
> rconn_queued   0.4/sec 0.083/sec0.0592/sec
> total: 1208722
> rconn_sent 0.4/sec 0.083/sec0.0592/sec

Re: [ovs-discuss] [OVN] ovn-controller takes 100% cpu while no changes in sb-db

2020-08-07 Thread Tony Liu
 rate over last: 5 seconds, last minute, last hour,  
hash=069e0b25:
lflow_run                  0.2/sec        0.100/sec       0.0719/sec   total: 297
miniflow_malloc       122834.2/sec    61417.100/sec   44100.8900/sec   total: 181863745
hindex_pathological        0.0/sec        0.000/sec           0./sec   total: 7187
hindex_expand              0.0/sec        0.000/sec           0./sec   total: 17
hmap_pathological         10.0/sec        5.800/sec       4.1403/sec   total: 25207
hmap_expand            14756.6/sec     7434.783/sec    5338.4500/sec   total: 23829432
txn_unchanged              0.4/sec        0.433/sec       0.3150/sec   total: 11033
txn_incomplete             0.0/sec        0.117/sec       0.0825/sec   total: 976
txn_success                0.0/sec        0.050/sec       0.0306/sec   total: 130
txn_try_again              0.0/sec        0.000/sec       0.0003/sec   total: 1
poll_create_node           2.6/sec        2.650/sec       1.9389/sec   total: 55662
poll_zero_timeout          0.0/sec        0.100/sec       0.0547/sec   total: 243
rconn_queued               0.0/sec        0.083/sec       0.0592/sec   total: 1208722
rconn_sent                 0.0/sec        0.083/sec       0.0592/sec   total: 1208722
seq_change                 1.4/sec        1.000/sec       0.7414/sec   total: 13980
pstream_open               0.0/sec        0.000/sec           0./sec   total: 1
stream_open                0.0/sec        0.000/sec       0.0003/sec   total: 5
unixctl_received           0.2/sec        0.033/sec       0.0017/sec   total: 6
unixctl_replied            0.2/sec        0.033/sec       0.0017/sec   total: 6
util_xalloc          3864890.0/sec  1933841.933/sec  227487.1978/sec   total: 5872830887
vconn_open                 0.0/sec        0.000/sec           0./sec   total: 2
vconn_received             0.0/sec        0.083/sec       0.0592/sec   total: 634
vconn_sent                 0.0/sec        0.083/sec       0.0492/sec   total: 1213250
netlink_received           0.8/sec        0.400/sec       0.2872/sec   total: 1196
netlink_recv_jumbo         0.2/sec        0.100/sec       0.0719/sec   total: 298
netlink_sent               0.8/sec        0.400/sec       0.2872/sec   total: 1196
cmap_expand                0.0/sec        0.000/sec           0./sec   total: 3
82 events never hit
[root@gateway-1 ~]# docker exec ovn_controller ovs-appctl -t /run/ovn/ovn-controller.6.ctl coverage/show
Event coverage, avg rate over last: 5 seconds, last minute, last hour,  hash=069e0b25:
lflow_run                  0.0/sec        0.083/sec       0.0719/sec   total: 297
miniflow_malloc            0.0/sec    51180.917/sec   44100.8900/sec   total: 181863745
hindex_pathological        0.0/sec        0.000/sec           0./sec   total: 7187
hindex_expand              0.0/sec        0.000/sec           0./sec   total: 17
hmap_pathological          2.4/sec        4.967/sec       4.1436/sec   total: 25219
hmap_expand              171.8/sec     6205.350/sec    5338.6886/sec   total: 23830291
txn_unchanged              1.2/sec        0.433/sec       0.3167/sec   total: 11039
txn_incomplete             0.0/sec        0.100/sec       0.0825/sec   total: 976
txn_success                0.0/sec        0.033/sec       0.0306/sec   total: 130
txn_try_again              0.0/sec        0.000/sec       0.0003/sec   total: 1
poll_create_node           4.6/sec        2.583/sec       1.9453/sec   total: 55685
poll_zero_timeout          0.0/sec        0.067/sec       0.0547/sec   total: 243
rconn_queued               0.2/sec        0.083/sec       0.0594/sec   total: 1208723
rconn_sent                 0.2/sec        0.083/sec       0.0594/sec   total: 1208723
seq_change                 1.0/sec        0.933/sec       0.7428/sec   total: 13985
pstream_open               0.0/sec        0.000/sec           0./sec   total: 1
stream_open                0.0/sec        0.000/sec       0.0003/sec   total: 5
unixctl_received           0.2/sec        0.050/sec       0.0019/sec   total: 7
unixctl_replied            0.2/sec        0.050/sec       0.0019/sec   total: 7
util_xalloc             4345.0/sec  1611785.933/sec  227493.2325/sec   total: 5872852612
vconn_open                 0.0/sec        0.000/sec           0./sec   total: 2
vconn_received             0.2/sec        0.083/sec       0.0594/sec   total: 635
vconn_sent                 0.2/sec        0.083/sec       0.0494/sec   total: 1213251
netlink_received           0.0/sec        0.333/sec       0.2872/sec   total: 1196
netlink_recv_jumbo         0.0/sec        0.083/sec       0.0719/sec   total: 298
netlink_sent               0.0/sec        0.333/sec       0.2872/sec   total: 1196
cmap_expand                0.0/sec        0.000/sec           0./sec   total: 3
82 events never hit
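
To see which counters actually moved between two coverage/show samples like the ones above, the running totals can be compared directly. This is only a minimal sketch in Python: the helper names parse_totals and changed_totals are made up for illustration, and the pattern assumes each row is a counter name, three "/sec" columns, and a "total:" value, as in the unquoted samples above.

```python
import re

# One counter row: name, three "<rate>/sec" columns, then "total: <n>".
# The \s* between columns tolerates the fused columns and the wrapped
# "total:" lines seen in the archived output.
ROW = re.compile(r"([a-z_]+)\s+(?:[\d.]+/sec\s*){3}total:\s*(\d+)")


def parse_totals(text):
    """Map counter name -> running total for one coverage/show sample."""
    return {name: int(total) for name, total in ROW.findall(text)}


def changed_totals(before, after):
    """Return {name: delta} for counters whose total moved between samples."""
    return {
        name: after[name] - before[name]
        for name in after
        if name in before and after[name] != before[name]
    }
```

Feeding it the last two samples above would show, for example, that lflow_run stayed at 297 while hmap_expand and util_xalloc kept growing. Quoted copies (lines starting with ">") would need the quote prefix stripped first.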



Thanks!

Tony
> -Original Message-
> From: Han Zhou 
> Sent: Friday, August 7, 2020 1:09 PM
> To: Tony Liu 
> Cc: ovs-discuss ; ovs-dev  d...@openvswitch.org>
> Subject: Re: [ovs-discuss] [OVN] ovn-controller takes 100% cpu while 

Re: [ovs-discuss] [OVN] ovn-controller takes 100% cpu while no changes in sb-db

2020-08-07 Thread Tony Liu
I enabled debug logging, and there are tons of messages.
Note there are 4353 datapath bindings and 13078 port bindings in SB.
4097 LS, 8470 LSP, 256 LR and 4352 LRP in NB. Every 16 LS connect to
a router. All routers connect to the external network.
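
As a sanity check on these numbers, the counts can be cross-checked against the described layout. This is a rough sketch, not anything OVN provides: check_counts is a hypothetical helper, and it assumes one shared external switch, one uplink port per router, and one datapath binding per logical switch or router.

```python
def check_counts(ls, lsp, lr, lrp, datapaths, ports, ls_per_router=16):
    """Cross-check NB/SB object counts against the described topology.

    Assumptions (not stated by any tool): one external LS shared by all
    routers, and each router has one port per attached LS plus one uplink.
    """
    return {
        # 16 tenant switches per router, plus the single external switch
        "ls_matches": ls == lr * ls_per_router + 1,
        # one router port per tenant switch, plus one uplink per router
        "lrp_matches": lrp == lr * ls_per_router + lr,
        # every LS and every LR gets one datapath binding
        "datapaths_match": datapaths == ls + lr,
        # port bindings not accounted for by LSP + LRP
        "unexplained_port_bindings": ports - (lsp + lrp),
    }


result = check_counts(ls=4097, lsp=8470, lr=256, lrp=4352,
                      datapaths=4353, ports=13078)
```

With the figures above, the three checks hold and 256 port bindings are left over, which equals the number of routers. That would be consistent with one extra binding per router gateway port (for example a chassisredirect port), but that is an inference, not something the numbers prove.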

ovn-controller on the compute nodes is fine, but ovn-controller on the
gateway node is taking 100% CPU. Is it related to the ports on the
external network? Are there any specific messages I should check?

Any hint on where to look is appreciated!


Thanks!

Tony
> -Original Message-
> From: Han Zhou 
> Sent: Friday, August 7, 2020 12:39 PM
> To: Tony Liu 
> Cc: ovs-discuss ; ovs-dev  d...@openvswitch.org>
> Subject: Re: [ovs-discuss] [OVN] ovn-controller takes 100% cpu while no
> changes in sb-db
> 
> 
> 
> On Fri, Aug 7, 2020 at 12:35 PM Tony Liu <tonyliu0...@hotmail.com> wrote:
> 
> 
>   Inline...
> 
>   Thanks!
> 
>   Tony
>   > -Original Message-
>   > From: Han Zhou mailto:zhou...@gmail.com> >
>   > Sent: Friday, August 7, 2020 12:29 PM
>   > To: Tony Liu  <mailto:tonyliu0...@hotmail.com> >
>   > Cc: ovs-discuss mailto:ovs-
> disc...@openvswitch.org> >; ovs-dev> d...@openvswitch.org <mailto:d...@openvswitch.org> >
>   > Subject: Re: [ovs-discuss] [OVN] ovn-controller takes 100% cpu
> while no
>   > changes in sb-db
>   >
>   >
>   >
>   > On Fri, Aug 7, 2020 at 12:19 PM Tony Liu  <mailto:tonyliu0...@hotmail.com>
>   > <mailto:tonyliu0...@hotmail.com
> <mailto:tonyliu0...@hotmail.com> > > wrote:
>   >
>   >
>   >   ovn-controller is using UNIX socket connecting to local ovsdb-server.
>   >
>   > From the log you were showing, you were using tcp:127.0.0.1:6640
>   > to connect the local ovsdb.
>   > >   2020-08-07T16:38:04.022Z|29253|reconnect|WARN|tcp:127.0.0.1:6640: connection dropped (Broken pipe)
> 
>   Sorry, what I meant was, given your advice, I just made the change for
>   ovn-controller to use UNIX socket.
> 
> Oh, I see, no worries.
> 
>   >
>   >
>   >   Inactivity probe doesn't seem to be the cause of high cpu usage.
>   >
>   >   The wakeup on connection to sb-db is always followed by a "unreasonably
>   >   long" warning. I guess the pollin event loop is stuck for too long, like
>   >   10s as below.
>   >
>   >   2020-08-07T18:46:49.301Z|00296|poll_loop|INFO|wakeup due to [POLLIN] on fd 19 (10.6.20.91:60712<->10.6.20.86:6642) at lib/stream-fd.c:157 (99% CPU usage)
>   >   2020-08-07T18:46:59.460Z|00297|timeval|WARN|Unreasonably long 10153ms poll interval (10075ms user, 1ms system)
>   >   
>   >
>   >   Could that stuck loop be the cause of high cpu usage?
>   >   What is it polling in?
>   >   Why is it stuck, waiting for message from sb-db?
>   >   Isn't it supposed to release the cpu while waiting?
>   >
>   >
>   >
>   > This log means there are messages received from 10.6.20.86:6642
>   > (the SB DB). Is there SB change? The CPU is spent on handling the
>   > SB change. Some type of SB changes are not handled incrementally.
> 
>   SB update is driven by ovn-northd in case anything changed in NB,
>   and ovn-controller in case anything changed on chassis. No, there
>   is nothing changed in NB, neither chassis.
> 
>   Should I bump logging level up to dbg? Is that going to show me
>   what messages ovn-controller is handling?
> 
> 
> 
> Yes, debug log should show the details.
> 
> 
> 
>   >
>   >   Thanks!
>   >
>   >   Tony
>   >
>   >   > -Original Message-
>   >   > From: Han Zhou  <mailto:zhou...@gmail.com>  <mailto:zhou...@gmail.com
> <mailto:zhou...@gmail.com> > >
>   >   > Sent: Fr

Re: [ovs-discuss] [OVN] no response to inactivity probe

2020-08-07 Thread Tony Liu
Good to know, thanks!

Tony
> -Original Message-
> From: Han Zhou 
> Sent: Friday, August 7, 2020 12:36 PM
> To: Tony Liu 
> Cc: Han Zhou ; Numan Siddique ; ovs-dev
> ; ovs-discuss 
> Subject: Re: [ovs-discuss] [OVN] no response to inactivity probe
> 
> The raft probe is disabled if you use the latest version of OVS, e.g.
> 2.13.1.
> 
> 
> On Fri, Aug 7, 2020 at 12:28 PM Tony Liu <tonyliu0...@hotmail.com> wrote:
> 
> 
>   Another one here, there is inactivity probe on the raft cluster
>   port.
>   
>   2020-08-07T19:04:53.184Z|02735|reconnect|ERR|tcp:10.6.20.85:6644: no response to inactivity probe after 5 seconds, disconnecting
>   2020-08-07T19:04:53.184Z|02736|reconnect|INFO|tcp:10.6.20.85:6644: connection dropped
>   2020-08-07T19:04:54.185Z|02737|reconnect|INFO|tcp:10.6.20.85:6644: connecting...
>   2020-08-07T19:04:54.185Z|02738|reconnect|INFO|tcp:10.6.20.85:6644: connected
>   2020-08-07T19:15:26.228Z|02739|reconnect|ERR|tcp:10.6.20.84:49440: no response to inactivity probe after 5 seconds, disconnecting
>   2020-08-07T19:15:26.769Z|02740|reconnect|ERR|tcp:10.6.20.84:6644: no response to inactivity probe after 5 seconds, disconnecting
>   2020-08-07T19:15:26.769Z|02741|reconnect|INFO|tcp:10.6.20.84:6644: connection dropped
>   2020-08-07T19:15:27.771Z|02742|reconnect|INFO|tcp:10.6.20.84:6644: connecting...
>   2020-08-07T19:15:27.771Z|02743|reconnect|INFO|tcp:10.6.20.84:6644: connected
>   
>   Which configuration is for that probe interval?
> 
> 
>   Thanks!
> 
>   Tony
> 
>   > -Original Message-
>   > From: dev mailto:ovs-dev-
> boun...@openvswitch.org> > On Behalf Of Tony Liu
>   > Sent: Thursday, August 6, 2020 7:45 PM
>   > To: Han Zhou mailto:hz...@ovn.org> >; Numan
> Siddique mailto:num...@ovn.org> >
>   > Cc: ovs-dev mailto:ovs-
> d...@openvswitch.org> >; ovs-discuss> disc...@openvswitch.org <mailto:disc...@openvswitch.org> >
>   > Subject: Re: [ovs-dev] [ovs-discuss] [OVN] no response to
> inactivity
>   > probe
>   >
>   > Hi Han and Numan,
>   >
>   > I'd like to have a few more clarifications.
>   >
>   > For inactivity probe:
>   > From ovn-controller to ovn-sb-db: ovn-remote-probe-interval
>   >
>   > From ovn-controller to ovs-vswitchd: ovn-openflow-probe-interval
>   >
>   > From ovn-controller to local ovsdb: which interval?
>   >
>   > From local ovsdb to ovn-controller: which interval?
>   >
>   > From ovs-vswitchd to ovn-controller: which interval?
>   >
>   >
>   > Regarding to the connection between ovn-controller and local
> ovsdb-
>   > server, I recall that UNIX socket is lighter than TCP socket and
> UNIX
>   > socket is recommended for local communication.
>   > Is that right?
>   >
>   >
>   > Thanks!
>   >
>   > Tony
>   >
>   > > -Original Message-
>   > > From: Han Zhou mailto:hz...@ovn.org> >
>   > > Sent: Thursday, August 6, 2020 12:42 PM
>   > > To: Tony Liu  <mailto:tonyliu0...@hotmail.com> >
>   > > Cc: Han Zhou mailto:hz...@ovn.org> >; Numan
> Siddique mailto:num...@ovn.org> >; ovs-dev
>   > > mailto:ovs-...@openvswitch.org> >;
> ovs-discuss mailto:ovs-
> disc...@openvswitch.org> >
>   > > Subject: Re: [ovs-discuss] [OVN] no response to inactivity
> probe
>   > >
>   > >
>   > >
>   > > On Thu, Aug 6, 2020 at 12:07 PM Tony Liu
> mailto:tonyliu0...@hotmail.com>
>   > > <mailto:tonyliu0...@hotmail.com
> <mailto:tonyliu0...@hotmail.com> > > wrote:
>   > > >
>   > > > Inline...
>   > > >
>   > > > Thanks!
>   > > >
>   > > > Tony
>   > > > > -Original Message-
>   > > > > From: Han Zhou mailto:hz...@ovn.org>
> <mailto:hz...@ovn.org <mailto:hz...@ovn.org> > >
>   > > &

Re: [ovs-discuss] [OVN] ovn-controller takes 100% cpu while no changes in sb-db

2020-08-07 Thread Tony Liu
Inline...

Thanks!

Tony
> -Original Message-
> From: Han Zhou 
> Sent: Friday, August 7, 2020 12:29 PM
> To: Tony Liu 
> Cc: ovs-discuss ; ovs-dev  d...@openvswitch.org>
> Subject: Re: [ovs-discuss] [OVN] ovn-controller takes 100% cpu while no
> changes in sb-db
> 
> 
> 
> On Fri, Aug 7, 2020 at 12:19 PM Tony Liu <tonyliu0...@hotmail.com> wrote:
> 
> 
>   ovn-controller is using UNIX socket connecting to local ovsdb-server.
> 
> From the log you were showing, you were using tcp:127.0.0.1:6640
> to connect the local ovsdb.
> >   2020-08-07T16:38:04.022Z|29253|reconnect|WARN|tcp:127.0.0.1:6640: connection dropped (Broken pipe)

Sorry, what I meant was that, following your advice, I have just changed
ovn-controller to use a UNIX socket.

> 
> 
>   Inactivity probe doesn't seem to be the cause of high cpu usage.
> 
>   The wakeup on connection to sb-db is always followed by a "unreasonably
>   long" warning. I guess the pollin event loop is stuck for too long, like
>   10s as below.
>   
>   2020-08-07T18:46:49.301Z|00296|poll_loop|INFO|wakeup due to [POLLIN] on fd 19 (10.6.20.91:60712<->10.6.20.86:6642) at lib/stream-fd.c:157 (99% CPU usage)
>   2020-08-07T18:46:59.460Z|00297|timeval|WARN|Unreasonably long 10153ms poll interval (10075ms user, 1ms system)
>   
> 
>   Could that stuck loop be the cause of high cpu usage?
>   What is it polling in?
>   Why is it stuck, waiting for message from sb-db?
>   Isn't it supposed to release the cpu while waiting?
> 
> 
> 
> This log means there are messages received from 10.6.20.86:6642
> (the SB DB). Is there SB change? The CPU is spent on handling the
> SB change. Some type of SB changes are not handled incrementally.

SB updates are driven by ovn-northd when anything changes in NB, and by
ovn-controller when anything changes on a chassis. No, nothing has
changed in NB or on any chassis.

Should I bump the logging level up to dbg? Will that show me
which messages ovn-controller is handling?

> 
>   Thanks!
> 
>   Tony
> 
>   > -Original Message-
>   > From: Han Zhou mailto:zhou...@gmail.com> >
>   > Sent: Friday, August 7, 2020 10:32 AM
>   > To: Tony Liu  <mailto:tonyliu0...@hotmail.com> >
>   > Cc: ovs-discuss mailto:ovs-
> disc...@openvswitch.org> >; ovs-dev> d...@openvswitch.org <mailto:d...@openvswitch.org> >
>   > Subject: Re: [ovs-discuss] [OVN] ovn-controller takes 100% cpu
> while no
>   > changes in sb-db
>   >
>   >
>   >
>   > On Fri, Aug 7, 2020 at 10:05 AM Tony Liu  <mailto:tonyliu0...@hotmail.com>
>   > <mailto:tonyliu0...@hotmail.com
> <mailto:tonyliu0...@hotmail.com> > > wrote:
>   >
>   >
>   >   Hi,
>   >
>   >   Here are some logging snippets from ovn-controller.
>   >   
>   >   2020-08-07T16:38:04.020Z|29250|timeval|WARN|Unreasonably
> long
>   > 8954ms poll interval (8895ms user, 0ms system)
>   >   
>   >   What's that mean? Is it harmless?
>   >
>   >   
>   >   2020-08-07T16:38:04.021Z|29251|timeval|WARN|context
> switches: 0
>   > voluntary, 6 involuntary
>   >   2020-08-07T16:38:04.022Z|29252|poll_loop|INFO|wakeup due to
> [POLLIN]
>   > on fd 19 (10.6.20.91:60398 <http://10.6.20.91:60398>
> <http://10.6.20.91:60398> <->10.6.20.86:6642 <http://10.6.20.86:6642>
>   > <http://10.6.20.86:6642> ) at lib/stream-fd.c:157 (99% CPU usage)
>   >   
>   >   Is this wakeup caused by changes in sb-db?
>   >   Why is ovn-controller so busy?
>   >
>   >   
>   >   2020-08-
> 07T16:38:04.022Z|29253|reconnect|WARN|tcp:127.0.0.1:6640
> <http://127.0.0.1:6640>
>   > <http://127.0.0.1:6640> : connection dropped (Broken pipe)
>   >   
>   >   Connection to local ovsdb-server is dropped.
>   >   Is this caused by the timeout of inactivity probe?
>   >
>   >   
>   >   2020-08-07T16:38:04.035Z|29254|poll_loop|INFO|wakeup due to
> [POLLIN]
>   > on fd 20 (<->

Re: [ovs-discuss] [OVN] no response to inactivity probe

2020-08-07 Thread Tony Liu
Another one here: there is an inactivity probe on the raft cluster
port.

2020-08-07T19:04:53.184Z|02735|reconnect|ERR|tcp:10.6.20.85:6644: no response 
to inactivity probe after 5 seconds, disconnecting
2020-08-07T19:04:53.184Z|02736|reconnect|INFO|tcp:10.6.20.85:6644: connection 
dropped
2020-08-07T19:04:54.185Z|02737|reconnect|INFO|tcp:10.6.20.85:6644: connecting...
2020-08-07T19:04:54.185Z|02738|reconnect|INFO|tcp:10.6.20.85:6644: connected
2020-08-07T19:15:26.228Z|02739|reconnect|ERR|tcp:10.6.20.84:49440: no response 
to inactivity probe after 5 seconds, disconnecting
2020-08-07T19:15:26.769Z|02740|reconnect|ERR|tcp:10.6.20.84:6644: no response 
to inactivity probe after 5 seconds, disconnecting
2020-08-07T19:15:26.769Z|02741|reconnect|INFO|tcp:10.6.20.84:6644: connection 
dropped
2020-08-07T19:15:27.771Z|02742|reconnect|INFO|tcp:10.6.20.84:6644: connecting...
2020-08-07T19:15:27.771Z|02743|reconnect|INFO|tcp:10.6.20.84:6644: connected

Which configuration is for that probe interval?
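
To see which peers the probe failures cluster on, the reconnect errors can be tallied per remote. A minimal sketch in Python follows; probe_failures is a hypothetical helper name, and the pattern only assumes the "no response to inactivity probe after N seconds" wording shown in the log above.

```python
import re
from collections import Counter

# Captures the remote (e.g. "tcp:10.6.20.85:6644") and the probe timeout.
# \s+ between words tolerates the hard line wraps in the mail archive.
PROBE = re.compile(
    r"\|reconnect\|ERR\|(\S+?):\s+no\s+response\s+to\s+inactivity\s+probe"
    r"\s+after\s+([\d.]+)\s+seconds"
)


def probe_failures(log_text):
    """Count inactivity-probe disconnects per remote address."""
    return Counter(peer for peer, _seconds in PROBE.findall(log_text))
```

On the excerpt above this gives one failure each for 10.6.20.85:6644, 10.6.20.84:49440 and 10.6.20.84:6644, i.e. both the outgoing and the incoming side of the raft port are affected.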


Thanks!

Tony

> -Original Message-
> From: dev  On Behalf Of Tony Liu
> Sent: Thursday, August 6, 2020 7:45 PM
> To: Han Zhou ; Numan Siddique 
> Cc: ovs-dev ; ovs-discuss  disc...@openvswitch.org>
> Subject: Re: [ovs-dev] [ovs-discuss] [OVN] no response to inactivity
> probe
> 
> Hi Han and Numan,
> 
> I'd like to have a few more clarifications.
> 
> For inactivity probe:
> From ovn-controller to ovn-sb-db: ovn-remote-probe-interval
> 
> From ovn-controller to ovs-vswitchd: ovn-openflow-probe-interval
> 
> From ovn-controller to local ovsdb: which interval?
> 
> From local ovsdb to ovn-controller: which interval?
> 
> From ovs-vswitchd to ovn-controller: which interval?
> 
> 
> Regarding to the connection between ovn-controller and local ovsdb-
> server, I recall that UNIX socket is lighter than TCP socket and UNIX
> socket is recommended for local communication.
> Is that right?
> 
> 
> Thanks!
> 
> Tony
> 
> > -Original Message-
> > From: Han Zhou 
> > Sent: Thursday, August 6, 2020 12:42 PM
> > To: Tony Liu 
> > Cc: Han Zhou ; Numan Siddique ; ovs-dev
> > ; ovs-discuss 
> > Subject: Re: [ovs-discuss] [OVN] no response to inactivity probe
> >
> >
> >
> > On Thu, Aug 6, 2020 at 12:07 PM Tony Liu  > <mailto:tonyliu0...@hotmail.com> > wrote:
> > >
> > > Inline...
> > >
> > > Thanks!
> > >
> > > Tony
> > > > -Original Message-
> > > > From: Han Zhou mailto:hz...@ovn.org> >
> > > > Sent: Thursday, August 6, 2020 11:37 AM
> > > > To: Tony Liu  > > > <mailto:tonyliu0...@hotmail.com> >
> > > > Cc: Han Zhou mailto:hz...@ovn.org> >; Numan
> > > > Siddique mailto:num...@ovn.org> >; ovs-dev
> > > > mailto:ovs-...@openvswitch.org> >;
> > > > ovs-discuss  > > > <mailto:ovs-discuss@openvswitch.org> >
> > > > Subject: Re: [ovs-discuss] [OVN] no response to inactivity probe
> > > >
> > > >
> > > >
> > > > On Thu, Aug 6, 2020 at 11:11 AM Tony Liu  > > > <mailto:tonyliu0...@hotmail.com> <mailto:tonyliu0...@hotmail.com
> > <mailto:tonyliu0...@hotmail.com> > > wrote:
> > > > >
> > > > > Inline... (please read with monospaced font:))
> > > > >
> > > > > Thanks!
> > > > >
> > > > > Tony
> > > > > > -Original Message-
> > > > > > From: Han Zhou mailto:hz...@ovn.org>
> > > > > > <mailto:hz...@ovn.org <mailto:hz...@ovn.org> > >
> > > > > > Sent: Wednesday, August 5, 2020 11:48 PM
> > > > > > To: Tony Liu  > > > > > <mailto:tonyliu0...@hotmail.com>
> > > > > > <mailto:tonyliu0...@hotmail.com
> > > > > > <mailto:tonyliu0...@hotmail.com> > >
> > > > > > Cc: Han Zhou mailto:hz...@ovn.org>
> > > > > > <mailto:hz...@ovn.org <mailto:hz...@ovn.org> > >; Numan
> > > > > > Siddique mailto:num...@ovn.org>
> > > > > > <mailto:num...@ovn.org <mailto:num...@ovn.org> > >; ovs-dev
> > > > > > mailto:ovs-...@openvswitch.org>
> > > > > > <mailto:ovs-...@openvswitch.org
> > > > > > <mailto:ovs-...@openvswitch.org>
> > > > > > > >; ovs-discuss  > > > > > <mailto:ovs-discuss@openvswitch.org>
> > > > > > <mailto:ovs-discu

Re: [ovs-discuss] [OVN] ovn-controller takes 100% cpu while no changes in sb-db

2020-08-07 Thread Tony Liu
ovn-controller is now using a UNIX socket to connect to the local ovsdb-server.
The inactivity probe doesn't seem to be the cause of the high CPU usage.

The wakeup on the connection to sb-db is always followed by an "unreasonably
long" warning. I guess the POLLIN event loop is stuck for too long, about
10s as shown below.

2020-08-07T18:46:49.301Z|00296|poll_loop|INFO|wakeup due to [POLLIN] on fd 19 
(10.6.20.91:60712<->10.6.20.86:6642) at lib/stream-fd.c:157 (99% CPU usage)
2020-08-07T18:46:59.460Z|00297|timeval|WARN|Unreasonably long 10153ms poll 
interval (10075ms user, 1ms system)


Could that stuck loop be the cause of high cpu usage?
What is it polling in?
Why is it stuck, waiting for message from sb-db?
Isn't it supposed to release the cpu while waiting?
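
One way to tell "busy" from "waiting" is to compare the CPU time in the warning with the wall-clock interval. The sketch below is only illustrative (cpu_share is a made-up helper name); it assumes the "Unreasonably long ... poll interval (... user, ... system)" wording shown above.

```python
import re

# Wall-clock interval, user CPU time and system CPU time, all in ms.
# \s+ between words tolerates the hard line wraps in the mail archive.
POLL = re.compile(
    r"Unreasonably\s+long\s+(\d+)ms\s+poll\s+interval\s+"
    r"\((\d+)ms\s+user,\s+(\d+)ms\s+system\)"
)


def cpu_share(line):
    """Fraction of the poll interval spent on CPU, or None if no match."""
    match = POLL.search(line)
    if match is None:
        return None
    wall, user, system = (int(value) for value in match.groups())
    return (user + system) / wall
```

For the warning above (10153ms interval, 10075ms user, 1ms system) the share is about 0.99, which suggests the process was computing in user space for nearly the whole interval rather than sleeping in poll.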


Thanks!

Tony

> -Original Message-
> From: Han Zhou 
> Sent: Friday, August 7, 2020 10:32 AM
> To: Tony Liu 
> Cc: ovs-discuss ; ovs-dev  d...@openvswitch.org>
> Subject: Re: [ovs-discuss] [OVN] ovn-controller takes 100% cpu while no
> changes in sb-db
> 
> 
> 
> On Fri, Aug 7, 2020 at 10:05 AM Tony Liu <tonyliu0...@hotmail.com> wrote:
> 
> 
>   Hi,
> 
>   Here are some logging snippets from ovn-controller.
>   
>   2020-08-07T16:38:04.020Z|29250|timeval|WARN|Unreasonably long 8954ms poll interval (8895ms user, 0ms system)
>   
>   What's that mean? Is it harmless?
> 
>   
>   2020-08-07T16:38:04.021Z|29251|timeval|WARN|context switches: 0 voluntary, 6 involuntary
>   2020-08-07T16:38:04.022Z|29252|poll_loop|INFO|wakeup due to [POLLIN] on fd 19 (10.6.20.91:60398<->10.6.20.86:6642) at lib/stream-fd.c:157 (99% CPU usage)
>   
>   Is this wakeup caused by changes in sb-db?
>   Why is ovn-controller so busy?
> 
>   
>   2020-08-07T16:38:04.022Z|29253|reconnect|WARN|tcp:127.0.0.1:6640: connection dropped (Broken pipe)
>   
>   Connection to local ovsdb-server is dropped.
>   Is this caused by the timeout of inactivity probe?
> 
>   
>   2020-08-07T16:38:04.035Z|29254|poll_loop|INFO|wakeup due to [POLLIN] on fd 20 (<->/var/run/openvswitch/br-int.mgmt) at lib/stream-fd.c:157 (99% CPU usage)
>   
>   What causes this wakeup?
> 
>   
>   2020-08-07T16:38:04.048Z|29255|poll_loop|INFO|wakeup due to 0-ms timeout at lib/ovsdb-idl.c:5391 (99% CPU usage)
>   
>   What's this 0-ms wakeup mean?
> 
>   
>   2020-08-07T16:38:05.022Z|29256|poll_loop|INFO|wakeup due to 962-ms timeout at lib/reconnect.c:643 (99% CPU usage)
>   2020-08-07T16:38:05.023Z|29257|reconnect|INFO|tcp:127.0.0.1:6640: connecting...
>   2020-08-07T16:38:05.041Z|29258|poll_loop|INFO|wakeup due to [POLLOUT] on fd 14 (127.0.0.1:51478<->127.0.0.1:6640) at lib/stream-fd.c:153 (99% CPU usage)
>   2020-08-07T16:38:05.041Z|29259|reconnect|INFO|tcp:127.0.0.1:6640: connected
>   
>   Retry to connect to local ovsdb-server. A pollout event is triggered
>   right after connection is established. What's poolout?
> 
>   ovn-controller is taking 100% CPU now, and there is no changes in
>   sb-db (not busy). It seems that it's busy with local ovsdb-server
>   or vswitchd. I'd like to understand why ovn-controller is so busy?
>   All inactivity probe intervals are set to 30s.
> 
> 
> 
> 
> Is there change from the local ovsdb? You can enable dbg log to see what
> is happening.
> For the local ovsdb probe, I have mentioned in the other thread that
> UNIX socket is recommended (instead of tcp 127.0.0.1). Using UNIX socket
> disables probe by default.
> 
> Thanks,
> Han

___
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss


[ovs-discuss] [OVN] ovn-controller takes 100% cpu while no changes in sb-db

2020-08-07 Thread Tony Liu
Hi,

Here are some logging snippets from ovn-controller.

2020-08-07T16:38:04.020Z|29250|timeval|WARN|Unreasonably long 8954ms poll 
interval (8895ms user, 0ms system)

What does that mean? Is it harmless?


2020-08-07T16:38:04.021Z|29251|timeval|WARN|context switches: 0 voluntary, 6 
involuntary
2020-08-07T16:38:04.022Z|29252|poll_loop|INFO|wakeup due to [POLLIN] on fd 19 
(10.6.20.91:60398<->10.6.20.86:6642) at lib/stream-fd.c:157 (99% CPU usage)

Is this wakeup caused by changes in sb-db?
Why is ovn-controller so busy?


2020-08-07T16:38:04.022Z|29253|reconnect|WARN|tcp:127.0.0.1:6640: connection 
dropped (Broken pipe)

Connection to local ovsdb-server is dropped.
Is this caused by the timeout of inactivity probe?


2020-08-07T16:38:04.035Z|29254|poll_loop|INFO|wakeup due to [POLLIN] on fd 20 
(<->/var/run/openvswitch/br-int.mgmt) at lib/stream-fd.c:157 (99% CPU usage)

What causes this wakeup?


2020-08-07T16:38:04.048Z|29255|poll_loop|INFO|wakeup due to 0-ms timeout at 
lib/ovsdb-idl.c:5391 (99% CPU usage)

What does this 0-ms wakeup mean?


2020-08-07T16:38:05.022Z|29256|poll_loop|INFO|wakeup due to 962-ms timeout at 
lib/reconnect.c:643 (99% CPU usage)
2020-08-07T16:38:05.023Z|29257|reconnect|INFO|tcp:127.0.0.1:6640: connecting...
2020-08-07T16:38:05.041Z|29258|poll_loop|INFO|wakeup due to [POLLOUT] on fd 14 
(127.0.0.1:51478<->127.0.0.1:6640) at lib/stream-fd.c:153 (99% CPU usage)
2020-08-07T16:38:05.041Z|29259|reconnect|INFO|tcp:127.0.0.1:6640: connected

ovn-controller retries the connection to the local ovsdb-server. A POLLOUT
event is triggered right after the connection is established. What is POLLOUT?

ovn-controller is taking 100% CPU now, and there are no changes in
sb-db (it is not busy). It seems to be busy with the local ovsdb-server
or vswitchd. I'd like to understand why ovn-controller is so busy.
All inactivity probe intervals are set to 30s.
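
Since the log lines above are hard-wrapped by the mail client, it helps to rejoin them before counting what wakes the loop up. This is a minimal sketch in Python with made-up helper names (join_wrapped, wakeup_reasons); it assumes every record starts with an ISO timestamp followed by "|" and that a blank line separates log excerpts from prose.

```python
import re
from collections import Counter

# A record starts with e.g. "2020-08-07T16:38:04.035Z|"
STAMP = re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}Z\|")
# Wakeup reason: a poll event such as [POLLIN], or an "N-ms timeout"
WAKEUP = re.compile(r"\|poll_loop\|INFO\|wakeup due to (\[POLL\w+\]|\d+-ms timeout)")


def join_wrapped(text):
    """Rebuild one string per log record from hard-wrapped mail text."""
    records, in_record = [], False
    for line in text.splitlines():
        stripped = line.strip()
        if STAMP.match(stripped):
            records.append(stripped)
            in_record = True
        elif in_record and stripped:
            # continuation of the previous record
            records[-1] += " " + stripped
        else:
            # a blank line (or prose after one) ends the current record
            in_record = False
    return records


def wakeup_reasons(records):
    """Count poll_loop wakeups by reason."""
    reasons = Counter()
    for record in records:
        match = WAKEUP.search(record)
        if match:
            reasons[match.group(1)] += 1
    return reasons
```

Note the limitation: a prose line placed directly under a log line, with no blank line in between, would be treated as a continuation.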


Thanks!

Tony



Re: [ovs-discuss] [OVN] no response to inactivity probe

2020-08-06 Thread Tony Liu
Hi,

There are still some connection errors from ovn-controller.
Will such a connection drop cause flows to be deleted from vswitchd?

..
2020-08-07T03:55:22.269Z|03988|jsonrpc|WARN|tcp:127.0.0.1:6640: send error: 
Broken pipe
..
2020-08-07T03:55:31.551Z|03996|reconnect|WARN|tcp:127.0.0.1:6640: connection 
dropped (Broken pipe)



2020-08-07T03:55:22.268Z|03986|poll_loop|INFO|wakeup due to [POLLIN] on fd 14 
(127.0.0.1:49514<->127.0.0.1:6640) at lib/stream-fd.c:157 (99% CPU usage)
2020-08-07T03:55:22.268Z|03987|poll_loop|INFO|wakeup due to [POLLIN] on fd 19 
(10.6.20.91:42854<->10.6.20.84:6642) at lib/stream-fd.c:157 (99% CPU usage)
2020-08-07T03:55:22.269Z|03988|jsonrpc|WARN|tcp:127.0.0.1:6640: send error: 
Broken pipe
2020-08-07T03:55:31.549Z|03989|timeval|WARN|Unreasonably long 9280ms poll 
interval (9220ms user, 1ms system)
2020-08-07T03:55:31.550Z|03990|timeval|WARN|disk: 0 reads, 8 writes
2020-08-07T03:55:31.550Z|03991|timeval|WARN|context switches: 0 voluntary, 5 
involuntary
2020-08-07T03:55:31.550Z|03992|coverage|INFO|Dropped 4 log messages in last 47 
seconds (most recently, 9 seconds ago) due to excessive rate
2020-08-07T03:55:31.551Z|03993|coverage|INFO|Skipping details of duplicate 
event coverage for hash=824dd6ab
2020-08-07T03:55:31.551Z|03994|poll_loop|INFO|wakeup due to [POLLIN] on fd 20 
(<->/var/run/openvswitch/br-int.mgmt) at lib/stream-fd.c:157 (100% CPU usage)
2020-08-07T03:55:31.551Z|03995|poll_loop|INFO|wakeup due to [POLLIN] on fd 19 
(10.6.20.91:42854<->10.6.20.84:6642) at lib/stream-fd.c:157 (100% CPU usage)
2020-08-07T03:55:31.551Z|03996|reconnect|WARN|tcp:127.0.0.1:6640: connection 
dropped (Broken pipe)
2020-08-07T03:55:31.552Z|03997|poll_loop|INFO|wakeup due to 0-ms timeout at 
controller/ovn-controller.c:2123 (100% CPU usage)
2020-08-07T03:55:40.752Z|03998|timeval|WARN|Unreasonably long 9176ms poll 
interval (9118ms user, 0ms system)
2020-08-07T03:55:40.752Z|03999|timeval|WARN|context switches: 0 voluntary, 7 
involuntary
2020-08-07T03:55:40.753Z|04000|poll_loop|INFO|Dropped 2 log messages in last 10 
seconds (most recently, 10 seconds ago) due to excessive rate
2020-08-07T03:55:40.753Z|04001|poll_loop|INFO|wakeup due to 0-ms timeout at 
lib/reconnect.c:643 (99% CPU usage)
2020-08-07T03:55:40.754Z|04002|reconnect|INFO|tcp:127.0.0.1:6640: connecting...
2020-08-07T03:55:40.771Z|04003|reconnect|INFO|tcp:127.0.0.1:6640: connected
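
The stalls in a log like the one above can also be located from the timestamps alone, by looking for large gaps between consecutive records. A minimal sketch in Python, with hypothetical helper names (parse_stamp, find_stalls) and an arbitrary 5-second threshold:

```python
from datetime import datetime


def parse_stamp(line):
    """Return (datetime, sequence) for a log line, or None for other lines."""
    parts = line.split("|", 2)
    if len(parts) < 3:
        return None
    try:
        when = datetime.strptime(parts[0], "%Y-%m-%dT%H:%M:%S.%fZ")
    except ValueError:
        return None
    return when, parts[1]


def find_stalls(lines, threshold=5.0):
    """List (seq_before, seq_after, gap_seconds) for gaps >= threshold."""
    stalls = []
    previous = None
    for line in lines:
        current = parse_stamp(line.strip())
        if current is None:
            continue  # wrapped continuation or prose line
        if previous is not None:
            gap = (current[0] - previous[0]).total_seconds()
            if gap >= threshold:
                stalls.append((previous[1], current[1], gap))
        previous = current
    return stalls
```

In the excerpt above, records 03988 (03:55:22.269) and 03989 (03:55:31.549) are 9.28 seconds apart, which lines up with the "Unreasonably long 9280ms poll interval" warning that follows.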


Thanks!

Tony
> -Original Message-
> From: discuss  On Behalf Of Tony
> Liu
> Sent: Thursday, August 6, 2020 8:23 PM
> To: Han Zhou ; Numan Siddique 
> Cc: ovs-dev ; ovs-discuss  disc...@openvswitch.org>
> Subject: Re: [ovs-discuss] [OVN] no response to inactivity probe
> 
> Interesting...
> 
> with this configuration on gateway (chassis) node, 
> external_ids: {ovn-bridge-mappings="physnet1:br-ex", ovn-cms-
> options=enable-chassis-as-gw, ovn-encap-ip="10.6.30.91", ovn-encap-
> type=geneve, ovn-openflow-probe-interval="30", ovn-
> remote="tcp:10.6.20.84:6642,tcp:10.6.20.85:6642,tcp:10.6.20.86:6642",
> ovn-remote-probe-interval="3", system-id="gateway-1"}
> 
> 
> I still see error from ovn-controller.
> 
> 2020-08-07T03:17:48.186Z|02737|reconnect|ERR|tcp:127.0.0.1:6640: no
> response to inactivity probe after 8.74 seconds, disconnecting 
> That tcp:127.0.0.1:6640 is the connection between ovn-controller and
> local ovsdb-server.
> 
> Any settings I missed?
> 
> 
> Thanks!
> 
> Tony
> > -Original Message-
> > From: dev  On Behalf Of Tony Liu
> > Sent: Thursday, August 6, 2020 7:45 PM
> > To: Han Zhou ; Numan Siddique 
> > Cc: ovs-dev ; ovs-discuss  > disc...@openvswitch.org>
> > Subject: Re: [ovs-dev] [ovs-discuss] [OVN] no response to inactivity
> > probe
> >
> > Hi Han and Numan,
> >
> > I'd like to have a few more clarifications.
> >
> > For inactivity probe:
> > From ovn-controller to ovn-sb-db: ovn-remote-probe-interval
> >
> > From ovn-controller to ovs-vswitchd: ovn-openflow-probe-interval
> >
> > From ovn-controller to local ovsdb: which interval?
> >
> > From local ovsdb to ovn-controller: which interval?
> >
> > From ovs-vswitchd to ovn-controller: which interval?
> >
> >
> > Regarding to the connection between ovn-controller and local ovsdb-
> > server, I recall that UNIX socket is lighter than TCP socket and UNIX
> > socket is recommended for local communication.
> > Is that right?
> >
> >
> > Thanks!
> >
> > Tony
> >
> > > -Original Message-
> > > From: Han Zhou 
> > > Sent: Thursday, Augus

Re: [ovs-discuss] [OVN] no response to inactivity probe

2020-08-06 Thread Tony Liu
Interesting...

With this configuration on the gateway (chassis) node,

external_ids: {ovn-bridge-mappings="physnet1:br-ex", 
ovn-cms-options=enable-chassis-as-gw, ovn-encap-ip="10.6.30.91", 
ovn-encap-type=geneve, ovn-openflow-probe-interval="30", 
ovn-remote="tcp:10.6.20.84:6642,tcp:10.6.20.85:6642,tcp:10.6.20.86:6642", 
ovn-remote-probe-interval="3", system-id="gateway-1"}
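
To check which probe settings are actually present, the external_ids map can be split into key/value pairs. This is a small illustrative sketch in Python (parse_external_ids is a made-up name); it assumes the "{key=value, key="value", ...}" layout printed above. As far as I recall from the ovn-controller manual page, ovn-remote-probe-interval is given in milliseconds and ovn-openflow-probe-interval in seconds, but treat that as an assumption to verify against your version.

```python
import re

# key=value pairs; values are either double-quoted or bare tokens
PAIR = re.compile(r'([\w-]+)=("[^"]*"|[^,}\s]+)')


def parse_external_ids(text):
    """Turn the printed external_ids map into a plain dict of strings."""
    return {key: value.strip('"') for key, value in PAIR.findall(text)}


ids = parse_external_ids(
    'external_ids: {ovn-bridge-mappings="physnet1:br-ex", '
    'ovn-cms-options=enable-chassis-as-gw, ovn-encap-ip="10.6.30.91", '
    'ovn-encap-type=geneve, ovn-openflow-probe-interval="30", '
    'ovn-remote="tcp:10.6.20.84:6642,tcp:10.6.20.85:6642,tcp:10.6.20.86:6642", '
    'ovn-remote-probe-interval="3", system-id="gateway-1"}'
)
probe_keys = sorted(key for key in ids if "probe" in key)
```

Here probe_keys contains only the SB and OpenFlow probe settings; nothing in this map refers to the tcp:127.0.0.1:6640 connection to the local ovsdb-server.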


I still see an error from ovn-controller.

2020-08-07T03:17:48.186Z|02737|reconnect|ERR|tcp:127.0.0.1:6640: no response to 
inactivity probe after 8.74 seconds, disconnecting

That tcp:127.0.0.1:6640 is the connection between ovn-controller
and local ovsdb-server.

Did I miss any settings?


Thanks!

Tony
> -----Original Message-----
> From: dev On Behalf Of Tony Liu
> Sent: Thursday, August 6, 2020 7:45 PM
> To: Han Zhou; Numan Siddique
> Cc: ovs-dev; ovs-discuss
> Subject: Re: [ovs-dev] [ovs-discuss] [OVN] no response to inactivity probe
> 
> Hi Han and Numan,
> 
> I'd like to have a few more clarifications.
> 
> For inactivity probe:
> From ovn-controller to ovn-sb-db: ovn-remote-probe-interval
> 
> From ovn-controller to ovs-vswitchd: ovn-openflow-probe-interval
> 
> From ovn-controller to local ovsdb: which interval?
> 
> From local ovsdb to ovn-controller: which interval?
> 
> From ovs-vswitchd to ovn-controller: which interval?
> 
> 
> Regarding to the connection between ovn-controller and local ovsdb-
> server, I recall that UNIX socket is lighter than TCP socket and UNIX
> socket is recommended for local communication.
> Is that right?
> 
> 
> Thanks!
> 
> Tony
> 
> > -----Original Message-----
> > From: Han Zhou
> > Sent: Thursday, August 6, 2020 12:42 PM
> > To: Tony Liu
> > Cc: Han Zhou; Numan Siddique; ovs-dev; ovs-discuss
> > Subject: Re: [ovs-discuss] [OVN] no response to inactivity probe
> >
> > On Thu, Aug 6, 2020 at 12:07 PM Tony Liu wrote:
> > >
> > > Inline...
> > >
> > > Thanks!
> > >
> > > Tony
> > > > -----Original Message-----
> > > > From: Han Zhou
> > > > Sent: Thursday, August 6, 2020 11:37 AM
> > > > To: Tony Liu
> > > > Cc: Han Zhou; Numan Siddique; ovs-dev; ovs-discuss
> > > > Subject: Re: [ovs-discuss] [OVN] no response to inactivity probe
> > > >
> > > > On Thu, Aug 6, 2020 at 11:11 AM Tony Liu wrote:
> > > > >
> > > > > Inline... (please read with monospaced font:))
> > > > >
> > > > > Thanks!
> > > > >
> > > > > Tony
> > > > > > -----Original Message-----
> > > > > > From: Han Zhou
> > > > > > Sent: Wednesday, August 5, 2020 11:48 PM
> > > > > > To: Tony Liu
> > > > > > Cc: Han Zhou; Numan Siddique; ovs-dev; ovs-discuss
> > > > > > Subject: Re: [ovs-discuss] [OVN] no response to inactivity probe
> > > > > >
> >

Re: [ovs-discuss] [OVN] no response to inactivity probe

2020-08-06 Thread Tony Liu
Hi Han and Numan,

I'd like to have a few more clarifications.

For inactivity probe:
From ovn-controller to ovn-sb-db: ovn-remote-probe-interval

From ovn-controller to ovs-vswitchd: ovn-openflow-probe-interval

From ovn-controller to local ovsdb: which interval?

From local ovsdb to ovn-controller: which interval?

From ovs-vswitchd to ovn-controller: which interval?


Regarding the connection between ovn-controller and the local
ovsdb-server, I recall that a UNIX socket is lighter than a TCP socket
and that a UNIX socket is recommended for local communication.
Is that right?


Thanks!

Tony

> -----Original Message-----
> From: Han Zhou
> Sent: Thursday, August 6, 2020 12:42 PM
> To: Tony Liu
> Cc: Han Zhou; Numan Siddique; ovs-dev; ovs-discuss
> Subject: Re: [ovs-discuss] [OVN] no response to inactivity probe
>
> On Thu, Aug 6, 2020 at 12:07 PM Tony Liu wrote:
> >
> > Inline...
> >
> > Thanks!
> >
> > Tony
> > > -----Original Message-----
> > > From: Han Zhou
> > > Sent: Thursday, August 6, 2020 11:37 AM
> > > To: Tony Liu
> > > Cc: Han Zhou; Numan Siddique; ovs-dev; ovs-discuss
> > > Subject: Re: [ovs-discuss] [OVN] no response to inactivity probe
> > >
> > > On Thu, Aug 6, 2020 at 11:11 AM Tony Liu wrote:
> > > >
> > > > Inline... (please read with monospaced font:))
> > > >
> > > > Thanks!
> > > >
> > > > Tony
> > > > > -----Original Message-----
> > > > > From: Han Zhou
> > > > > Sent: Wednesday, August 5, 2020 11:48 PM
> > > > > To: Tony Liu
> > > > > Cc: Han Zhou; Numan Siddique; ovs-dev; ovs-discuss
> > > > > Subject: Re: [ovs-discuss] [OVN] no response to inactivity probe
> > > > >
> > > > > On Wed, Aug 5, 2020 at 9:14 PM Tony Liu wrote:
> > > > >
> > > > >
> > > > > >   I set the connection target="ptcp:6641:10.6.20.84" for ovn-nb-db
> > > > > >   and "ptcp:6642:10.6.20.84" for ovn-sb-db. .84 is the first node
> > > > > >   of cluster. Also ovn-openflow-probe-interval=30 on compute node.
> > > > > >   It seems helping. Not that many connect/drop/reconnect in logging.
> > > > > >   That "commit failure" is also gone.
> > > > > >   The issue I reported in another thread "packet drop" seems gone.
> > > > > >   And launching VM starts working.
> > > > > >
> > > > > >   How should I set connection table for all ovn-nb-db and ovn-sb-db
> > > > > >   nodes in the cluster to set inactivity_probe?
> > > > >   One row w

Re: [ovs-discuss] [ovs-dev] packet drop

2020-08-06 Thread Tony Liu
Inline...

Thanks!

Tony
> -Original Message-
> From: Numan Siddique 
> Sent: Thursday, August 6, 2020 11:49 AM
> To: Tony Liu 
> Cc: ovs-...@openvswitch.org; ovs-discuss@openvswitch.org
> Subject: Re: [ovs-discuss] [ovs-dev] packet drop
> 
> On Fri, Aug 7, 2020 at 12:10 AM Tony Liu  wrote:
> >
> > Inline...
> >
> > Thanks!
> >
> > Tony
> > > -Original Message-
> > > From: Numan Siddique 
> > > Sent: Thursday, August 6, 2020 10:03 AM
> > > To: Tony Liu 
> > > Cc: ovs-discuss@openvswitch.org; ovs-...@openvswitch.org
> > > Subject: Re: [ovs-dev] packet drop
> > >
> > >
> > >
> > > On Thu, Aug 6, 2020 at 4:05 AM Tony Liu wrote:
> > >
> > >
> > >
> > >   The drop is caused by flow change.
> > >
> > >   When packet is dropped.
> > >   
> > >
> > >   recirc_id(0),tunnel(tun_id=0x19aca,src=10.6.30.92,dst=10.6.30.22,geneve({class=0x102,type=0x80,len=4,0x20003/0x7fff}),flags(-df+csum+key)),in_port(3),eth(src=fa:16:3e:df:1e:85,dst=00:00:00:00:00:00/01:00:00:00:00:00),eth_type(0x0800),ipv4(proto=1,frag=no),icmp(type=8/0xf8), packets:14, bytes:1372, used:0.846s, actions:drop
> > >   recirc_id(0),in_port(12),eth(src=fa:16:3e:7d:bb:85,dst=fa:16:3e:df:1e:85),eth_type(0x0800),ipv4(src=192.168.236.152/255.255.255.252,dst=10.6.40.9,proto=1,tos=0/0x3,ttl=64,frag=no),icmp(type=0), packets:6, bytes:588, used:8.983s, actions:drop
> > >
> > >   When packet goes through.
> > >
> > >   recirc_id(0),tunnel(tun_id=0x19aca,src=10.6.30.92,dst=10.6.30.22,geneve({class=0x102,type=0x80,len=4,0x20003/0x7fff}),flags(-df+csum+key)),in_port(3),eth(src=fa:16:3e:df:1e:85,dst=00:00:00:00:00:00/01:00:00:00:00:00),eth_type(0x0800),ipv4(proto=1,frag=no),icmp(type=8/0xf8), packets:3, bytes:294, used:0.104s, actions:12
> > >   recirc_id(0),in_port(12),eth(src=fa:16:3e:7d:bb:85,dst=fa:16:3e:df:1e:85),eth_type(0x0800),ipv4(src=192.168.236.152/255.255.255.252,dst=10.6.40.9,proto=1,tos=0/0x3,ttl=64,frag=no),icmp(type=0), packets:3, bytes:294, used:0.103s, actions:ct_clear,set(tunnel(tun_id=0x1a8ee,dst=10.6.30.92,ttl=64,tp_dst=6081,geneve({class=0x102,type=0x80,len=4,0x1000b}),flags(df|csum|key))),set(eth(src=fa:16:3e:75:b7:e5,dst=52:54:00:0c:ef:b9)),set(ipv4(ttl=63)),3
> > >   
> > >
> > >   Is that flow programmed by ovn-controller via ovs-vswitchd?
> > >
> > > What version of OVN and OVS are you using ?
> >
> > ovn-20.03.0-2.el8.x86_64
> > openvswitch-2.12.0-1.el8.x86_64
> >
> > > Can you share your OVN NB DB ?
> >
> > Yes, I can. Let me know how.
> >
> > > If I understand correctly the packet is received from the patch port
> > > to br-int on the gateway node and then tunnelled to the compute node, right?
> > > And the packet is dropped on the compute node?
> >
> > Yes and yes.
> >
> > > If you could share your NB DB if it's fine with you and tell the
> > > destination logical port, I can try it out locally.
> >
> > Here is what I had.
> > On compute node, ovn-controller is very busy. It keeps saying "commit
> > failed".
> > 
> > 2020-08-05T02:44:23.927Z|04125|reconnect|INFO|tcp:10.6.20.84:6642: connected
> > 2020-08-05T02:44:23.936Z|04126|main|INFO|OVNSB commit failed, force recompute next time.
> > 2020-08-05T02:44:23.938Z|04127|ovsdb_idl|INFO|tcp:10.6.20.84:6642: clustered database server is disconnected from cluster; trying another server
> > 2020-08-05T02:44:23.939Z|04128|reconnect|INFO|tcp:10.6.20.84:6642: connection attempt timed out
> > 2020-08-05T02:44:23.939Z|04129|reconnect|INFO|tcp:10.6.20.84:6642: waiting 2 seconds before reconnect
> >
> > The connection to local OVSDB keeps being dropped, because no probe

Re: [ovs-discuss] [OVN] no response to inactivity probe

2020-08-06 Thread Tony Liu
Inline...

Thanks!

Tony
> -----Original Message-----
> From: Han Zhou
> Sent: Thursday, August 6, 2020 11:37 AM
> To: Tony Liu
> Cc: Han Zhou; Numan Siddique; ovs-dev; ovs-discuss
> Subject: Re: [ovs-discuss] [OVN] no response to inactivity probe
>
> On Thu, Aug 6, 2020 at 11:11 AM Tony Liu wrote:
> >
> > Inline... (please read with monospaced font:))
> >
> > Thanks!
> >
> > Tony
> > > -----Original Message-----
> > > From: Han Zhou
> > > Sent: Wednesday, August 5, 2020 11:48 PM
> > > To: Tony Liu
> > > Cc: Han Zhou; Numan Siddique; ovs-dev; ovs-discuss
> > > Subject: Re: [ovs-discuss] [OVN] no response to inactivity probe
> > >
> > > On Wed, Aug 5, 2020 at 9:14 PM Tony Liu wrote:
> > >
> > >
> > >   I set the connection target="ptcp:6641:10.6.20.84" for ovn-nb-db
> > >   and "ptcp:6642:10.6.20.84" for ovn-sb-db. .84 is the first node
> > >   of cluster. Also ovn-openflow-probe-interval=30 on compute node.
> > >   It seems helping. Not that many connect/drop/reconnect in logging.
> > >   That "commit failure" is also gone.
> > >   The issue I reported in another thread "packet drop" seems gone.
> > >   And launching VM starts working.
> > >
> > >   How should I set connection table for all ovn-nb-db and ovn-sb-db
> > >   nodes in the cluster to set inactivity_probe?
> > >   One row with address 0.0.0.0 seems not working.
> > >
> > > You can simply use 0.0.0.0 in the connection table, but don't
> > > specify the same connection method on the command line when starting
> > > ovsdb-server for NB/SB DB. Otherwise, these are conflicting and
> > > that's why you saw "Address already in use" error.
> >
> > Could you share a bit details how it works?
> > I thought the row in connection table only tells nbdb and sbdb the
> > probe interval. Isn't that right? Does nbdb and sbdb also create
> > socket based on target column?
> 
> >
> 
> In --remote option of ovsdb-server, you can specify either a connection
> method directly, or specify the db,table,column which contains the
> connection information.
> Please see manpage ovsdb-server(1).

Here is how the ovsdb-server for sbdb is invoked on one of those 3 nodes.

ovsdb-server -vconsole:off -vfile:info 
--log-file=/var/log/kolla/openvswitch/ovn-sb-db.log 
--remote=punix:/var/run/ovn/ovnsb_db.sock --pidfile=/run/ovn/ovnsb_db.pid 
--unixctl=/var/run/ovn/ovnsb_db.ctl 
--remote=db:OVN_Southbound,SB_Global,connections 
--private-key=db:OVN_Southbound,SSL,private_key 
--certificate=db:OVN_Southbound,SSL,certificate 
--ca-cert=db:OVN_Southbound,SSL,ca_cert 
--ssl-protocols=db:OVN_Southbound,SSL,ssl_protocols 
--ssl-ciphers=db:OVN_Southbound,SSL,ssl_ciphers --remote=ptcp:6642:10.6.20.84 
/var/lib/openvswitch/ovn-sb/ov sb.db

It creates UNIX and TCP sockets, and takes configuration from the DB.
Does that look OK?
Given that, what should the target column be for all nodes of the cluster?
And whatever target is set, ovsdb-server will create a socket for it, right?
Oh... Should I use "--remote=ptcp:6642:0.0.0.0"? Then I could set the same
target in the connection table, and it wouldn't cause a conflict?
If --remote and the connection target are the same, whichever comes in later
will be ignored, right?
In the code, does ovsdb-server create a connection object for each of
--remote and the connection target, or is it one single connection object
for both of them because method:port:address is the same? I'd expect
the single object.
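
As a way to spot this kind of overlap before restarting ovsdb-server, here is a minimal Python sketch that compares the passive TCP listeners given with --remote against the targets in the connection table. It does not describe how ovsdb-server handles duplicates internally; the helper names are hypothetical, and it assumes that a ptcp target without an address, or with 0.0.0.0, listens on all IPv4 addresses and therefore collides with any listener on the same port.

```python
def parse_ptcp(remote):
    """Split 'ptcp:PORT[:IP]' into (port, ip); return None for other methods."""
    parts = remote.split(":", 2)
    if parts[0] != "ptcp" or len(parts) < 2:
        return None
    port = int(parts[1])
    # Assumption: an omitted address means "listen on all IPv4 addresses".
    address = parts[2] if len(parts) == 3 and parts[2] else "0.0.0.0"
    return port, address


def find_listen_conflicts(cmdline_remotes, table_targets):
    """Return (cmdline, table) pairs that would try to bind the same port."""
    conflicts = []
    for cmd in cmdline_remotes:
        cmd_parsed = parse_ptcp(cmd)
        if cmd_parsed is None:
            continue
        for target in table_targets:
            tgt_parsed = parse_ptcp(target)
            if tgt_parsed is None:
                continue
            same_port = cmd_parsed[0] == tgt_parsed[0]
            overlapping = (
                cmd_parsed[1] == tgt_parsed[1]
                or "0.0.0.0" in (cmd_parsed[1], tgt_parsed[1])
            )
            if same_port and overlapping:
                conflicts.append((cmd, target))
    return conflicts


cmdline = ["punix:/var/run/ovn/ovnsb_db.sock", "ptcp:6642:10.6.20.84"]
table = ["ptcp:6642:0.0.0.0"]
print(find_listen_conflicts(cmdline, table))
```

With the command line shown above and a connection-table target of ptcp:6642:0.0.0.0, the sketch reports one conflicting pair, which matches the "Address already in use" error reported earlier in this thread.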

> > >   Is "external_ids:ovn-remote-probe-interval" in ovsdb-server on
> > >   compute node for ovn-controller to probe ovn-sb-db?
> > >
> > > OVSDB probe is bidirectional, so you need to set this value, too, if
> > > you don't want too many probes handled by the SB server. (setting
> > > the connection table for SB only changes the server side).
> >
> > In that case, how do I set probe interval for ovn-controller?
> > My understanding is that, ovn-controller reads configuration from
> > ovsdb-server on the local compute n

Re: [ovs-discuss] [ovs-dev] packet drop

2020-08-06 Thread Tony Liu
Inline...

Thanks!

Tony
> -Original Message-
> From: Numan Siddique 
> Sent: Thursday, August 6, 2020 10:03 AM
> To: Tony Liu 
> Cc: ovs-discuss@openvswitch.org; ovs-...@openvswitch.org
> Subject: Re: [ovs-dev] packet drop
> 
> 
> 
> On Thu, Aug 6, 2020 at 4:05 AM Tony Liu wrote:
> 
> 
> 
>   The drop is caused by flow change.
> 
>   When packet is dropped.
>   
>   recirc_id(0),tunnel(tun_id=0x19aca,src=10.6.30.92,dst=10.6.30.22,geneve({class=0x102,type=0x80,len=4,0x20003/0x7fff}),flags(-df+csum+key)),in_port(3),eth(src=fa:16:3e:df:1e:85,dst=00:00:00:00:00:00/01:00:00:00:00:00),eth_type(0x0800),ipv4(proto=1,frag=no),icmp(type=8/0xf8), packets:14, bytes:1372, used:0.846s, actions:drop
>   recirc_id(0),in_port(12),eth(src=fa:16:3e:7d:bb:85,dst=fa:16:3e:df:1e:85),eth_type(0x0800),ipv4(src=192.168.236.152/255.255.255.252,dst=10.6.40.9,proto=1,tos=0/0x3,ttl=64,frag=no),icmp(type=0), packets:6, bytes:588, used:8.983s, actions:drop
>
>   When packet goes through.
>
>   recirc_id(0),tunnel(tun_id=0x19aca,src=10.6.30.92,dst=10.6.30.22,geneve({class=0x102,type=0x80,len=4,0x20003/0x7fff}),flags(-df+csum+key)),in_port(3),eth(src=fa:16:3e:df:1e:85,dst=00:00:00:00:00:00/01:00:00:00:00:00),eth_type(0x0800),ipv4(proto=1,frag=no),icmp(type=8/0xf8), packets:3, bytes:294, used:0.104s, actions:12
>   recirc_id(0),in_port(12),eth(src=fa:16:3e:7d:bb:85,dst=fa:16:3e:df:1e:85),eth_type(0x0800),ipv4(src=192.168.236.152/255.255.255.252,dst=10.6.40.9,proto=1,tos=0/0x3,ttl=64,frag=no),icmp(type=0), packets:3, bytes:294, used:0.103s, actions:ct_clear,set(tunnel(tun_id=0x1a8ee,dst=10.6.30.92,ttl=64,tp_dst=6081,geneve({class=0x102,type=0x80,len=4,0x1000b}),flags(df|csum|key))),set(eth(src=fa:16:3e:75:b7:e5,dst=52:54:00:0c:ef:b9)),set(ipv4(ttl=63)),3
>   
> 
>   Is that flow programmed by ovn-controller via ovs-vswitchd?
> 
> What version of OVN and OVS are you using ?

ovn-20.03.0-2.el8.x86_64
openvswitch-2.12.0-1.el8.x86_64

> Can you share your OVN NB DB ?

Yes, I can. Let me know how.

> If I understand correctly the packet is received from the patch port to
> br-int on the gateway node and then tunnelled to the compute node right ?
> And the packet is dropped on the compute node ?

Yes and yes.

> If you could share your NB DB if it's fine with you and tell the
> destination logical port, I can try it out locally.

Here is what I had.
On compute node, ovn-controller is very busy. It keeps saying
"commit failed".

2020-08-05T02:44:23.927Z|04125|reconnect|INFO|tcp:10.6.20.84:6642: connected
2020-08-05T02:44:23.936Z|04126|main|INFO|OVNSB commit failed, force recompute next time.
2020-08-05T02:44:23.938Z|04127|ovsdb_idl|INFO|tcp:10.6.20.84:6642: clustered database server is disconnected from cluster; trying another server
2020-08-05T02:44:23.939Z|04128|reconnect|INFO|tcp:10.6.20.84:6642: connection attempt timed out
2020-08-05T02:44:23.939Z|04129|reconnect|INFO|tcp:10.6.20.84:6642: waiting 2 seconds before reconnect
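
To see how often each kind of event shows up in a log like this, a small Python sketch can tally lines by module and severity. It is only an illustration: the pipe-separated layout (timestamp|sequence|module|level|message) is assumed from the lines above, and the function name is made up.

```python
from collections import Counter


def tally_by_module_and_level(lines):
    """Count log lines per (module, level); skip lines with another layout."""
    counts = Counter()
    for line in lines:
        parts = line.split("|", 4)
        if len(parts) != 5:
            continue
        _timestamp, _seq, module, level, _message = parts
        counts[(module, level)] += 1
    return counts


log_lines = [
    "2020-08-05T02:44:23.927Z|04125|reconnect|INFO|tcp:10.6.20.84:6642: connected",
    "2020-08-05T02:44:23.936Z|04126|main|INFO|OVNSB commit failed, force recompute next time.",
    "2020-08-05T02:44:23.938Z|04127|ovsdb_idl|INFO|tcp:10.6.20.84:6642: clustered database server is disconnected from cluster; trying another server",
    "2020-08-05T02:44:23.939Z|04128|reconnect|INFO|tcp:10.6.20.84:6642: connection attempt timed out",
    "2020-08-05T02:44:23.939Z|04129|reconnect|INFO|tcp:10.6.20.84:6642: waiting 2 seconds before reconnect",
]
for key, count in sorted(tally_by_module_and_level(log_lines).items()):
    print(key, count)
```

Run over a full ovn-controller log, a high count for the reconnect module compared with everything else would point at connection flapping rather than flow computation.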


The connection to the local OVSDB keeps being dropped because there is no
probe response.

2020-08-05T02:47:15.437Z|04351|poll_loop|INFO|wakeup due to [POLLIN] on fd 20 (10.6.20.22:42362<->10.6.20.86:6642) at lib/stream-fd.c:157 (100% CPU usage)
2020-08-05T02:47:15.438Z|04352|reconnect|WARN|tcp:127.0.0.1:6640: connection dropped (Broken pipe)
2020-08-05T02:47:15.438Z|04353|rconn|INFO|unix:/var/run/openvswitch/br-int.mgmt: connecting...
2020-08-05T02:47:15.449Z|04354|rconn|INFO|unix:/var/run/openvswitch/br-int.mgmt: connected


After setting the probe interval to 30s, the problem seems to be gone. I
mentioned this in another thread, "[OVN] no response to inactivity probe".

I can restore the probe interval to the default 5s and see if the problem
can be reproduced. For me, it's important to understand what happens
behind it.

> 
> Thanks
> Numan
> 
> 
> 
> 
> 
>   Thanks!
> 
>   Tony
> 
>   > -----Original Message-----
>   > From: discuss On Behalf Of Tony Liu
>   > Sent: Wednesday, August 5, 2020 2:48 PM
>   > To: ovs-discuss@openvswitch.org; ovs-...@openvswitch.org
>   > Subject: [ovs-discuss] packet drop
>   >
>

Re: [ovs-discuss] [OVN] no response to inactivity probe

2020-08-06 Thread Tony Liu
Inline... (please read with monospaced font:))

Thanks!

Tony
> -----Original Message-----
> From: Han Zhou
> Sent: Wednesday, August 5, 2020 11:48 PM
> To: Tony Liu
> Cc: Han Zhou; Numan Siddique; ovs-dev; ovs-discuss
> Subject: Re: [ovs-discuss] [OVN] no response to inactivity probe
>
> On Wed, Aug 5, 2020 at 9:14 PM Tony Liu wrote:
> 
> 
>   I set the connection target="ptcp:6641:10.6.20.84" for ovn-nb-db
>   and "ptcp:6642:10.6.20.84" for ovn-sb-db. .84 is the first node
>   of cluster. Also ovn-openflow-probe-interval=30 on compute node.
>   It seems helping. Not that many connect/drop/reconnect in logging.
>   That "commit failure" is also gone.
>   The issue I reported in another thread "packet drop" seems gone.
>   And launching VM starts working.
> 
>   How should I set connection table for all ovn-nb-db and ovn-sb-db
>   nodes in the cluster to set inactivity_probe?
>   One row with address 0.0.0.0 seems not working.
> 
> You can simply use 0.0.0.0 in the connection table, but don't specify
> the same connection method on the command line when starting ovsdb-
> server for NB/SB DB. Otherwise, these are conflicting and that's why you
> saw "Address already in use" error.

Could you share a few details on how it works?
I thought the row in the connection table only tells nbdb and sbdb the
probe interval. Isn't that right? Do nbdb and sbdb also create a
socket based on the target column?

> 
>   Is "external_ids:ovn-remote-probe-interval" in ovsdb-server on
>   compute node for ovn-controller to probe ovn-sb-db?
> 
> OVSDB probe is bidirectional, so you need to set this value, too, if you
> don't want too many probes handled by the SB server. (setting the
> connection table for SB only changes the server side).

In that case, how do I set the probe interval for ovn-controller?
My understanding is that ovn-controller reads its configuration from
ovsdb-server on the local compute node. Isn't that right?

>   Is "external_ids:ovn-openflow-probe-interval" in ovsdb-server on
>   compute node for ovn-controller to probe ovsdb-server?
> 
> It is for the OpenFlow connection between ovn-controller and ovs-
> vswitchd, which is part of the OpenFlow protocol.
> 
>   What's probe interval for ovsdb-server to probe ovn-controller?
> 
> The local ovsdb connection uses unix socket, which doesn't send probe by
> default (if I remember correctly).

Here is how ovsdb-server and ovn-controller are invoked on the compute node.

root 41129  0.0  0.0 157556 20532 ?  S  Jul30   1:51
/usr/sbin/ovsdb-server /var/lib/openvswitch/conf.db -vconsole:emer -vsyslog:err
-vfile:info --remote=punix:/run/openvswitch/db.sock
--remote=ptcp:6640:127.0.0.1
--remote=db:Open_vSwitch,Open_vSwitch,manager_options
--log-file=/var/log/kolla/openvswitch/ovsdb-server.log --pidfile

root 63775 55.9  0.4 1477796 1224324 ? Sl   Aug04 1360:55
/usr/bin/ovn-controller --pidfile=/run/ovn/ovn-controller.pid
--log-file=/var/log/kolla/openvswitch/ovn-controller.log tcp:127.0.0.1:6640

Is that OK? Or is the UNIX socket method recommended for ovn-controller
to connect to ovsdb-server?

Here is the configuration in open_vswitch table in ovsdb-server.

external_ids: {ovn-encap-ip="10.6.30.22", ovn-encap-type=geneve, 
ovn-openflow-probe-interval="30", 
ovn-remote="tcp:10.6.20.84:6642,tcp:10.6.20.85:6642,tcp:10.6.20.86:6642", 
ovn-remote-probe-interval="6", system-id="compute-3"}
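
For checking these values across many chassis, a minimal Python sketch can turn the external_ids map, as printed above, into a dictionary. The parser is an illustration only and handles just the quoting seen in this output. One thing worth verifying against the ovn-controller(8) manpage for your version: as far as I recall, it documents ovn-remote-probe-interval in milliseconds and ovn-openflow-probe-interval in seconds, so the two numbers may not be directly comparable.

```python
import re

# key=value pairs where the value is either a double-quoted string or a bare
# token; this covers the format shown in this message, nothing more.
PAIR_RE = re.compile(r'([\w-]+)=("(?:[^"\\]|\\.)*"|[^,}\s]+)')


def parse_external_ids(text):
    """Parse '{k=v, k="v", ...}' into a dict of strings."""
    return {key: value.strip('"') for key, value in PAIR_RE.findall(text)}


external_ids = (
    '{ovn-encap-ip="10.6.30.22", ovn-encap-type=geneve, '
    'ovn-openflow-probe-interval="30", '
    'ovn-remote="tcp:10.6.20.84:6642,tcp:10.6.20.85:6642,tcp:10.6.20.86:6642", '
    'ovn-remote-probe-interval="6", system-id="compute-3"}'
)
ids = parse_external_ids(external_ids)
print(ids["ovn-remote"].split(","))
print(ids["ovn-remote-probe-interval"], ids["ovn-openflow-probe-interval"])
```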

ovn-controller connects to ovsdb-server and reads this configuration,
so it knows how to connect to all sbdb nodes, right?

If it's TCP between ovn-controller and ovsdb-server, will that probe
interval setting also apply to the probe from ovn-controller to
ovsdb-server?

ovn-controller connects to ovs-vswitchd by UNIX socket to program
OpenFlow. ovs-vswitchd and ovsdb-server are connected by UNIX socket too.
So, is ovn-openflow-probe-interval for the probe from ovn-controller
to ovs-vswitchd via the UNIX socket?

As a summary for the probe setting,

+--------------+  driver configuration
|  ovn-driver  |
+--------------+
       ^|
       |v
+--------------+  inactivity_probe in table "Connection"
|  ovn-nb-db   |
+--------------+
       ^|
       |v
+--------------+  options:northd_probe_interval in table "NB_Global"
|  ovn-northd  |  in nbdb.
+--------------+
       ^|
       |v
+--------------+  inactivity_probe in table "Connection"
|  ovn-sb-db   |
+--------------+
       ^|
       |v
++  in table "Open_vSwitch" in ovsdb-server
|ovn-control

Re: [ovs-discuss] [OVN] no response to inactivity probe

2020-08-05 Thread Tony Liu
I set the connection target="ptcp:6641:10.6.20.84" for ovn-nb-db
and "ptcp:6642:10.6.20.84" for ovn-sb-db. .84 is the first node
of the cluster. I also set ovn-openflow-probe-interval=30 on the compute node.
It seems to help. There are not as many connect/drop/reconnect entries in
the log.
That "commit failure" is also gone.
The issue I reported in another thread, "packet drop", seems to be gone.
And launching VMs starts working.

How should I set the connection table for all ovn-nb-db and ovn-sb-db
nodes in the cluster to set inactivity_probe?
One row with address 0.0.0.0 does not seem to work.

Is "external_ids:ovn-remote-probe-interval" in ovsdb-server on the
compute node for ovn-controller to probe ovn-sb-db?

Is "external_ids:ovn-openflow-probe-interval" in ovsdb-server on the
compute node for ovn-controller to probe ovsdb-server?

What is the probe interval for ovsdb-server to probe ovn-controller?


Thanks!

Tony
> -----Original Message-----
> From: discuss On Behalf Of Tony Liu
> Sent: Wednesday, August 5, 2020 4:29 PM
> To: Han Zhou
> Cc: ovs-dev; ovs-discuss
> Subject: Re: [ovs-discuss] [OVN] no response to inactivity probe
>
> Hi Han,
>
> After setting connection target="ptcp:6642:0.0.0.0" for ovn-sb-db, I see
> this error.
>
> 2020-08-05T23:01:26.819Z|06799|ovsdb_jsonrpc_server|ERR|ptcp:6642:0.0.0.0: listen failed: Address already in use
>
> Anything I am missing here?
> 
> 
> Thanks!
> 
> Tony
> > -----Original Message-----
> > From: Han Zhou
> > Sent: Tuesday, August 4, 2020 4:44 PM
> > To: Tony Liu
> > Cc: Numan Siddique; Han Zhou; ovs-discuss; ovs-dev
> > Subject: Re: [ovs-discuss] [OVN] no response to inactivity probe
> >
> > On Tue, Aug 4, 2020 at 2:50 PM Tony Liu wrote:
> >
> >   Hi,
> >
> >   Since I have 3 OVN DB nodes, should I add 3 rows in connection table
> >   for the inactivity_probe? Or put 3 addresses into one row?
> >
> >   "set-connection" set one row only, and there is no "add-connection".
> >   How should I add 3 rows into the table connection?
> >
> > You only need to set one row. Try this command:
> >
> > ovn-nbctl -- --id=@conn_uuid create Connection
> > target="ptcp\:6641\:0.0.0.0" inactivity_probe=0 -- set NB_Global .
> > connections=@conn_uuid
> >
> >   Thanks!
> >
> >   Tony
> >
> >   > -----Original Message-----
> >   > From: Numan Siddique
> >   > Sent: Tuesday, August 4, 2020 12:36 AM
> >   > To: Tony Liu
> >   > Cc: ovs-discuss; ovs-dev
> >   > Subject: Re: [ovs-discuss] [OVN] no response to inactivity probe
> >   >
> >   > On Tue, Aug 4, 2020 at 9:12 AM Tony Liu wrote:
> >   >
> >   >   In my deployment, on each Neutron server, there are 13 Neutron
> >   >   server processes.
> >   >   I see 12 of them (monitor, maintenance, RPC, API) connect to both
> >   >   ovn-nb-db and ovn-sb-db. With 3 Neutron server nodes, that's 36
> >   >   OVSDB clients.
> >   >   Is so many clients OK?
> >   >
> >   >   Any suggestions how to figure out which side doesn't respond the
> >   >   probe, if it's bi-directional? I don't see any activities from
> >   >   logging, other than connect/drop and reconnect...
> >   >
> >   >   BTW, please let me know if this is not the right place to discuss
> >   >   Neutron OVN ML2 driver.
> >   >
> >   >   Thanks!
> >   >
> >   >   Tony
> >   >
> >   >   > -----Original Message-----
> >   >   > From: dev On Behalf Of Tony Liu

Re: [ovs-discuss] [OVN] no response to inactivity probe

2020-08-05 Thread Tony Liu
Hi Han,

After setting connection target="ptcp:6642:0.0.0.0" for ovn-sb-db,
I see this error.

2020-08-05T23:01:26.819Z|06799|ovsdb_jsonrpc_server|ERR|ptcp:6642:0.0.0.0: listen failed: Address already in use

Anything I am missing here?
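
For what it's worth, "Address already in use" is the generic OS error for a second listener on a port that is already bound. A minimal Python sketch, unrelated to the OVS code itself and using an arbitrary free loopback port, reproduces the same errno:

```python
import errno
import socket


def second_bind_errno():
    """Bind one listener, then try to bind a second socket to the same port."""
    first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        first.bind(("127.0.0.1", 0))      # port 0: let the OS pick a free port
        first.listen(1)
        port = first.getsockname()[1]
        try:
            second.bind(("127.0.0.1", port))
        except OSError as exc:
            return exc.errno              # expected: EADDRINUSE
        return None
    finally:
        first.close()
        second.close()


print(second_bind_errno() == errno.EADDRINUSE)
```

So the error suggests that something, possibly the same ovsdb-server process, is already listening on port 6642 when the connection-table target is applied.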


Thanks!

Tony
> -----Original Message-----
> From: Han Zhou
> Sent: Tuesday, August 4, 2020 4:44 PM
> To: Tony Liu
> Cc: Numan Siddique; Han Zhou; ovs-discuss; ovs-dev
> Subject: Re: [ovs-discuss] [OVN] no response to inactivity probe
>
> On Tue, Aug 4, 2020 at 2:50 PM Tony Liu wrote:
>
>   Hi,
>
>   Since I have 3 OVN DB nodes, should I add 3 rows in connection table
>   for the inactivity_probe? Or put 3 addresses into one row?
>
>   "set-connection" set one row only, and there is no "add-connection".
>   How should I add 3 rows into the table connection?
> 
> 
> 
> 
> You only need to set one row. Try this command:
> 
> ovn-nbctl -- --id=@conn_uuid create Connection
> target="ptcp\:6641\:0.0.0.0" inactivity_probe=0 -- set NB_Global .
> connections=@conn_uuid
> 
> 
> 
>   Thanks!
> 
>   Tony
> 
>   > -----Original Message-----
>   > From: Numan Siddique
>   > Sent: Tuesday, August 4, 2020 12:36 AM
>   > To: Tony Liu
>   > Cc: ovs-discuss; ovs-dev
>   > Subject: Re: [ovs-discuss] [OVN] no response to inactivity probe
>   >
>   > On Tue, Aug 4, 2020 at 9:12 AM Tony Liu wrote:
>   >
>   >   In my deployment, on each Neutron server, there are 13 Neutron
>   >   server processes.
>   >   I see 12 of them (monitor, maintenance, RPC, API) connect to both
>   >   ovn-nb-db and ovn-sb-db. With 3 Neutron server nodes, that's 36
>   >   OVSDB clients.
>   >   Is so many clients OK?
>   >
>   >   Any suggestions how to figure out which side doesn't respond the
>   >   probe, if it's bi-directional? I don't see any activities from
>   >   logging, other than connect/drop and reconnect...
>   >
>   >   BTW, please let me know if this is not the right place to discuss
>   >   Neutron OVN ML2 driver.
>   >
>   >   Thanks!
>   >
>   >   Tony
>   >
>   >   > -----Original Message-----
>   >   > From: dev On Behalf Of Tony Liu
>   >   > Sent: Monday, August 3, 2020 7:45 PM
>   >   > To: ovs-discuss; ovs-dev
>   >   > Subject: [ovs-dev] [OVN] no response to inactivity probe
>   >   >
>   >   > Hi,
>   >   >
>   >   > Neutron OVN ML2 driver was disconnected by ovn-nb-db. There are many
>   >   > error messages from ovn-nb-db leader.
>   >   >
>   >   > 2020-08-04T02:31:39.751Z|03138|reconnect|ERR|tcp:10.6.20.81:58620: no response to inactivity probe after 5 seconds, disconnecting
>   >   > 2020-08-04T02:31:42.484Z|03139|reconnect|ERR|tcp:10.6.20.81:58300: no response to inactivity probe after 5 seconds, disconnecting
>   >   > 2020-08-04T02:31:49.858Z|03140|reconnect|ERR|tcp:10.6.20.81:59582: no
>   >   > respo

Re: [ovs-discuss] OVN Scale with RAFT: how to make raft cluster clients to balanced state again

2020-08-05 Thread Tony Liu
Sorry for hijacking this thread, I'd like to get some clarifications.

How is the initial balanced state established, say 100 ovn-controllers
connecting to 3 ovn-sb-db?

The ovn-controller doesn't have to connect to the leader of ovn-sb-db,
does it? In case it connects to the follower, the write request still
needs to be forwarded to the leader, right?
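
To make the balance question concrete, here is a toy simulation of the scenario Winson describes in the quoted message below. It is only a sketch: it assumes each client picks one of the configured servers at random, and the server names are hypothetical. It does not model what ovn-controller really does.

```python
import random
from collections import Counter

# Hypothetical names standing in for the 3 ovn-sb-db cluster members.
SERVERS = ["sb-1", "sb-2", "sb-3"]


def connect_all(n_clients, servers, rng):
    """Assumption: every client independently picks one server at random."""
    return [rng.choice(servers) for _ in range(n_clients)]


def restart_server(assignments, restarted, servers, rng):
    """Clients of the restarted server reconnect to one of the remaining ones.

    Nothing moves them back afterwards, so the restarted server ends up
    with no clients once it rejoins the cluster.
    """
    remaining = [s for s in servers if s != restarted]
    return [rng.choice(remaining) if s == restarted else s for s in assignments]


if __name__ == "__main__":
    rng = random.Random(0)
    before = connect_all(100, SERVERS, rng)
    after = restart_server(before, "sb-3", SERVERS, rng)
    print("before restart:", dict(Counter(before)))
    print("after restart: ", dict(Counter(after)))
```

Under that assumption the initial spread is only roughly even, and a single restart leaves one server idle until something forces clients to reconnect.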

These logs keep showing up.

2020-08-05T22:48:33.141Z|103607|reconnect|INFO|tcp:10.6.20.84:6642: connecting...
2020-08-05T22:48:33.151Z|103608|reconnect|INFO|tcp:127.0.0.1:6640: connected
2020-08-05T22:48:33.151Z|103609|reconnect|INFO|tcp:10.6.20.84:6642: connected
2020-08-05T22:48:33.159Z|103610|main|INFO|OVNSB commit failed, force recompute next time.
2020-08-05T22:48:33.161Z|103611|ovsdb_idl|INFO|tcp:10.6.20.84:6642: clustered database server is disconnected from cluster; trying another server
2020-08-05T22:48:33.161Z|103612|reconnect|INFO|tcp:10.6.20.84:6642: connection attempt timed out
2020-08-05T22:48:33.161Z|103613|reconnect|INFO|tcp:10.6.20.84:6642: waiting 2 seconds before reconnect

What does "clustered database server is disconnected from cluster" mean?
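
When logs like the ones above keep repeating, it can help to count them per module and level. The following is a minimal sketch that assumes each complete log line has the shape timestamp|sequence|module|level|message, as in the excerpt; the function names are made up for this example.

```python
from collections import Counter


def parse_ovs_log_line(line):
    """Split one log line into its five '|'-separated fields.

    Returns None for lines that do not have that shape, such as the
    wrapped continuation lines produced by some mail clients.
    """
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None
    timestamp, seq, module, level, message = parts
    return {
        "timestamp": timestamp,
        "seq": int(seq),
        "module": module,
        "level": level,
        "message": message,
    }


def summarize(lines):
    """Count parsed log lines per (module, level) pair."""
    counts = Counter()
    for line in lines:
        record = parse_ovs_log_line(line)
        if record is not None:
            counts[(record["module"], record["level"])] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "2020-08-05T22:48:33.151Z|103609|reconnect|INFO|tcp:10.6.20.84:6642: connected",
        "2020-08-05T22:48:33.159Z|103610|main|INFO|OVNSB commit failed, force recompute next time.",
    ]
    for key, count in summarize(sample).items():
        print(key, count)
```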


Thanks!

Tony


> -Original Message-
> From: discuss  On Behalf Of Han
> Zhou
> Sent: Wednesday, August 5, 2020 3:05 PM
> To: Winson Wang 
> Cc: winson wang ; ovn-kuberne...@googlegroups.com;
> ovs-discuss@openvswitch.org
> Subject: Re: [ovs-discuss] OVN Scale with RAFT: how to make raft cluster
> clients to balanced state again
> 
> 
> 
> On Wed, Aug 5, 2020 at 12:51 PM Winson Wang   > wrote:
> 
> 
>   Hello OVN Experts:
> 
>   With large scale ovn-k8s cluster,  there are several conditions
> that would make ovn-controller clients connect SB central from a
> balanced state to an unbalanced state.
> 
>   Is there an ongoing project to address this problem?
>   If not,  I have one proposal not sure if it is doable.
>   Please share your thoughts.
> 
>   The issue:
> 
>   OVN SB RAFT 3 node cluster,  at first all the ovn-controller
> clients will connect all the 3 nodes in a balanced state.
> 
>   The following conditions will make the connections become
> unbalanced.
> 
>   *   One RAFT node restart,  all the ovn-controller clients to
> reconnect to the two remaining cluster nodes.
> 
>   *   Ovn-k8s,  after SB raft pods rolling upgrade, the last raft
> pod has no client connections.
> 
> 
>   RAFT clients in an unbalanced state would trigger more stress to
> the raft cluster,  which makes the raft unstable under stress compared
> to a balanced state.
> 
> 
>   The proposal solution:
> 
> 
> 
>   Ovn-controller adds next unix commands “reconnect” with argument of
> preferred SB node IP.
> 
>   When unbalanced state happens,  the UNIX command can trigger ovn-
> controller reconnect
> 
>   To new SB raft node with fast sync which doesn’t trigger the whole
> DB downloading process.
> 
> 
> 
> Thanks Winson. The proposal sounds good to me. Will you implement it?
> 
> Han
> 
> 
> 
> 
> 
>   --
> 
>   Winson
> 
> 
> 
> 

___
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss


Re: [ovs-discuss] packet drop

2020-08-05 Thread Tony Liu


The drop is caused by a flow change.

When the packet is dropped:

recirc_id(0),tunnel(tun_id=0x19aca,src=10.6.30.92,dst=10.6.30.22,geneve({class=0x102,type=0x80,len=4,0x20003/0x7fff}),flags(-df+csum+key)),in_port(3),eth(src=fa:16:3e:df:1e:85,dst=00:00:00:00:00:00/01:00:00:00:00:00),eth_type(0x0800),ipv4(proto=1,frag=no),icmp(type=8/0xf8), packets:14, bytes:1372, used:0.846s, actions:drop
recirc_id(0),in_port(12),eth(src=fa:16:3e:7d:bb:85,dst=fa:16:3e:df:1e:85),eth_type(0x0800),ipv4(src=192.168.236.152/255.255.255.252,dst=10.6.40.9,proto=1,tos=0/0x3,ttl=64,frag=no),icmp(type=0), packets:6, bytes:588, used:8.983s, actions:drop


When the packet goes through:

recirc_id(0),tunnel(tun_id=0x19aca,src=10.6.30.92,dst=10.6.30.22,geneve({class=0x102,type=0x80,len=4,0x20003/0x7fff}),flags(-df+csum+key)),in_port(3),eth(src=fa:16:3e:df:1e:85,dst=00:00:00:00:00:00/01:00:00:00:00:00),eth_type(0x0800),ipv4(proto=1,frag=no),icmp(type=8/0xf8), packets:3, bytes:294, used:0.104s, actions:12
recirc_id(0),in_port(12),eth(src=fa:16:3e:7d:bb:85,dst=fa:16:3e:df:1e:85),eth_type(0x0800),ipv4(src=192.168.236.152/255.255.255.252,dst=10.6.40.9,proto=1,tos=0/0x3,ttl=64,frag=no),icmp(type=0), packets:3, bytes:294, used:0.103s, actions:ct_clear,set(tunnel(tun_id=0x1a8ee,dst=10.6.30.92,ttl=64,tp_dst=6081,geneve({class=0x102,type=0x80,len=4,0x1000b}),flags(df|csum|key))),set(eth(src=fa:16:3e:75:b7:e5,dst=52:54:00:0c:ef:b9)),set(ipv4(ttl=63)),3


Is that flow programmed by ovn-controller via ovs-vswitchd?
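
To watch for the flip between "drop" and forwarding over time, the flow dumps above can be reduced to their counters and actions. This is a rough sketch that assumes each flow is on a single line ending in "packets:N, bytes:N, used:..., actions:...", as in the dumps above; parse_flow and is_drop are names invented here.

```python
import re

# Matches the trailing statistics and actions of one datapath flow line.
FLOW_TAIL = re.compile(
    r"packets:(\d+),\s*bytes:(\d+),\s*used:([^,]+),\s*actions:(.*)$"
)


def parse_flow(line):
    """Return match, counters and actions of one flow line, or None."""
    text = line.strip()
    found = FLOW_TAIL.search(text)
    if found is None:
        return None
    return {
        "match": text[:found.start()].rstrip(", "),
        "packets": int(found.group(1)),
        "bytes": int(found.group(2)),
        "used": found.group(3).strip(),
        "actions": found.group(4).strip(),
    }


def is_drop(flow):
    """True when the only action of the flow is 'drop'."""
    return flow["actions"] == "drop"


if __name__ == "__main__":
    line = ("recirc_id(0),in_port(12),eth_type(0x0800),ipv4(proto=1,frag=no),"
            "icmp(type=0), packets:6, bytes:588, used:8.983s, actions:drop")
    flow = parse_flow(line)
    print(flow["packets"], flow["actions"], is_drop(flow))
```

Running this against dumps taken every second or so would show when a given match switches between the two action sets.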


Thanks!

Tony

> -Original Message-
> From: discuss  On Behalf Of Tony
> Liu
> Sent: Wednesday, August 5, 2020 2:48 PM
> To: ovs-discuss@openvswitch.org; ovs-...@openvswitch.org
> Subject: [ovs-discuss] packet drop
> 
> Hi,
> 
> I am running ping from external to VM via OVN gateway.
> On the compute node, ICMP request packet is consistently coming into
> interface "ovn-gatewa-1". But there is about 10 out of 25 packet loss on
> tap interface. It's like the switch pauses 10s after every 15s.
> 
> Has anyone experienced such an issue?
> Any advice on how to look into it?
> 
> 
> 21fed09f-909e-4efc-b117-f5d5fcb636c9
> Bridge br-int
> fail_mode: secure
> datapath_type: system
> Port "ovn-gatewa-0"
> Interface "ovn-gatewa-0"
> type: geneve
> options: {csum="true", key=flow, remote_ip="10.6.30.91"}
> bfd_status: {diagnostic="No Diagnostic", flap_count="1",
> forwarding="true", remote_diagnostic="No Diagnostic", remote_state=up,
> state=up}
> Port "tap2588bb4e-35"
> Interface "tap2588bb4e-35"
> Port "ovn-gatewa-1"
> Interface "ovn-gatewa-1"
> type: geneve
> options: {csum="true", key=flow, remote_ip="10.6.30.92"}
> bfd_status: {diagnostic="No Diagnostic", flap_count="1",
> forwarding="true", remote_diagnostic="No Diagnostic", remote_state=up,
> state=up}
> Port "tap37f6b2d7-cc"
> Interface "tap37f6b2d7-cc"
> Port "tap2c4b3b0f-8b"
> Interface "tap2c4b3b0f-8b"
> Port "tap23245491-a4"
> Interface "tap23245491-a4"
> Port "tap51660269-2c"
> Interface "tap51660269-2c"
> Port "tap276cd1ef-e1"
> Interface "tap276cd1ef-e1"
> Port "tap138526d3-b3"
> Interface "tap138526d3-b3"
> Port "tapd1ae48a1-2d"
> Interface "tapd1ae48a1-2d"
> Port br-int
> Interface br-int
> type: internal
> Port "tapdd08f476-94"
> Interface "tapdd08f476-94"
> 
> 
> 
> Thanks!
> 
> Tony
> 


[ovs-discuss] packet drop

2020-08-05 Thread Tony Liu
Hi,

I am running ping from external to VM via OVN gateway.
On the compute node, ICMP request packet is consistently coming
into interface "ovn-gatewa-1". But there is about 10 out of 25
packet loss on tap interface. It's like the switch pauses 10s
after every 15s.

Has anyone experienced such an issue?
Any advice on how to look into it?


21fed09f-909e-4efc-b117-f5d5fcb636c9
    Bridge br-int
        fail_mode: secure
        datapath_type: system
        Port "ovn-gatewa-0"
            Interface "ovn-gatewa-0"
                type: geneve
                options: {csum="true", key=flow, remote_ip="10.6.30.91"}
                bfd_status: {diagnostic="No Diagnostic", flap_count="1", forwarding="true", remote_diagnostic="No Diagnostic", remote_state=up, state=up}
        Port "tap2588bb4e-35"
            Interface "tap2588bb4e-35"
        Port "ovn-gatewa-1"
            Interface "ovn-gatewa-1"
                type: geneve
                options: {csum="true", key=flow, remote_ip="10.6.30.92"}
                bfd_status: {diagnostic="No Diagnostic", flap_count="1", forwarding="true", remote_diagnostic="No Diagnostic", remote_state=up, state=up}
        Port "tap37f6b2d7-cc"
            Interface "tap37f6b2d7-cc"
        Port "tap2c4b3b0f-8b"
            Interface "tap2c4b3b0f-8b"
        Port "tap23245491-a4"
            Interface "tap23245491-a4"
        Port "tap51660269-2c"
            Interface "tap51660269-2c"
        Port "tap276cd1ef-e1"
            Interface "tap276cd1ef-e1"
        Port "tap138526d3-b3"
            Interface "tap138526d3-b3"
        Port "tapd1ae48a1-2d"
            Interface "tapd1ae48a1-2d"
        Port br-int
            Interface br-int
                type: internal
        Port "tapdd08f476-94"
            Interface "tapdd08f476-94"



Thanks!

Tony



Re: [ovs-discuss] OVN scale

2020-08-04 Thread Tony Liu
Hi,

Continuing this thread with some updates.

I finally got 4096 networks and 256 routers created, with 16 networks
connected to each router. All routers have an external gateway set.

On underlay, those 256 gateway addresses on the provider network are
reachable. Ping is steady.

I launched 10 VMs on one compute node. One of them failed because network
allocation failed. Didn't look into it.

When pinging from the underlay to a VM, it's bumpy. There is a 1s or 2s
delay about every 10 pings.

I can't launch any more VMs. It always fails.

One of the Neutron nodes is very busy. From the logging at INFO level,
it just keeps connecting to OVN.

The active ovn-northd is busy, but all ovn-nb-db and ovn-sb-db are not.

On compute node, ovn-controller is very busy. It keeps saying
"commit failed".

2020-08-05T02:44:23.927Z|04125|reconnect|INFO|tcp:10.6.20.84:6642: connected
2020-08-05T02:44:23.936Z|04126|main|INFO|OVNSB commit failed, force recompute next time.
2020-08-05T02:44:23.938Z|04127|ovsdb_idl|INFO|tcp:10.6.20.84:6642: clustered database server is disconnected from cluster; trying another server
2020-08-05T02:44:23.939Z|04128|reconnect|INFO|tcp:10.6.20.84:6642: connection attempt timed out
2020-08-05T02:44:23.939Z|04129|reconnect|INFO|tcp:10.6.20.84:6642: waiting 2 seconds before reconnect


The connection to the local OVSDB keeps being dropped because there is no
probe response. The probe interval is already set to 30s.

2020-08-05T02:47:15.437Z|04351|poll_loop|INFO|wakeup due to [POLLIN] on fd 20 (10.6.20.22:42362<->10.6.20.86:6642) at lib/stream-fd.c:157 (100% CPU usage)
2020-08-05T02:47:15.438Z|04352|reconnect|WARN|tcp:127.0.0.1:6640: connection dropped (Broken pipe)
2020-08-05T02:47:15.438Z|04353|rconn|INFO|unix:/var/run/openvswitch/br-int.mgmt: connecting...
2020-08-05T02:47:15.449Z|04354|rconn|INFO|unix:/var/run/openvswitch/br-int.mgmt: connected


There is also an error about the localnet port.

2020-08-05T02:47:15.403Z|04345|patch|ERR|bridge not found for localnet port 'provnet-006baf64-409d-434d-b95b-017a77969b55' with network name 'physnet1'


First of all, this kind of scale should work fine, right?

Any advice on how to look into it?


Thanks!

Tony

> -Original Message-
> From: dev  On Behalf Of Tony Liu
> Sent: Monday, July 27, 2020 10:16 AM
> To: Han Zhou 
> Cc: ovs-...@openvswitch.org; ovs-discuss@openvswitch.org
> Subject: Re: [ovs-dev] [ovs-discuss] OVN scale
> 
> Hi Han,
> 
> Just some updates here.
> 
> I tried with 4K networks on single router. Configuration was done
> without any issues. I checked both nb-db and sb-db, they all look good.
> It's just that router configuration is huge (in Neutron DB, nb-db and
> flow table in sb-db), because it contains all 4K ports. Also, the
> pipeline of router datapath in sb-db is quite big.
> 
> I see ovn-northd master and sb-db leader are busy, taking 90+% CPU.
> There are only 3 compute nodes and 2 gateway nodes. Does that monitor
> setting "ovn-monitor-all" matters in such case? Any idea what they are
> busy with, without any configuration updates from OpenStack? The nb-db
> is not busy though.
> 
> Probably because nb-db is busy, ovn-controller can't connect to it
> consistently. It keeps being disconnected and reconnecting. Restarting
> ovn-controller seems help. I am able to launch a few VMs on different
> networks and they are connected via the router.
> 
> Now, I have problem on external access. The router is set as gateway to
> a provider/underlay network on an interface on the gateway node. The
> router is allocated an underlay address from that provider network. My
> understanding is that, the br-ex on gateway node holding the active
> router will broadcast ARP to announce that router underlay address in
> case of failover. Also, it will respond ARP request for that router
> underlay address. But when I run tcpdump on that underlay interface on
> gateway node, I see ARP request coming in, but no ARP response going out.
> I checked the flow table in sb-db, it seems ok. I also checked flow on
> br-ex by "ovs-ofctl dump-flows br-ex", I don't see anything about ARP
> there.
> How should I look into it?
> 
> Again, the case is to support 4K networks with external access (security
> group is disabled), 4K routers (one for each network), 50 routers (one
> for 80 networks), 1 router (for all 4K networks)...
> All networks are isolated by ACL on the logical router. Which option
> should work better?
> Any comment is appreciated.
> 
> 
> Thanks!
> 
> Tony
> 
> 
> 
> From: discuss  on behalf of Tony
> Liu 
> Sent: July 21, 2020 09:09 PM
> To: Daniel Alvarez 
> Cc: ovs-discuss@openvswitch.org 
> Subject: Re: [ovs-discuss] OVN scale
> 
> [root@

Re: [ovs-discuss] [OVN] no response to inactivity probe

2020-08-04 Thread Tony Liu
In that case, I can use set-connection to set one row.
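
For reference, the probe settings mentioned in the quoted messages below do not all use the same unit. The sketch below builds the four commands from one value in seconds. It rests on an assumption to verify against the man pages of your OVN version: as far as I understand the OVN documentation, inactivity_probe, northd_probe_interval and ovn-remote-probe-interval are in milliseconds, while ovn-openflow-probe-interval is in seconds. The function and key names are made up for this example.

```python
def probe_commands(seconds):
    """Build the probe-interval commands quoted in this thread.

    Assumption: the first three settings take milliseconds and the
    OpenFlow probe interval takes seconds. Check the man pages first.
    """
    millis = int(seconds * 1000)
    return {
        "nb_connection": [
            "ovn-nbctl", "set", "connection", ".",
            f"inactivity_probe={millis}",
        ],
        "northd": [
            "ovn-nbctl", "set", "NB_Global", ".",
            f"options:northd_probe_interval={millis}",
        ],
        "controller_sb": [
            "ovs-vsctl", "set", "open", ".",
            f"external_ids:ovn-remote-probe-interval={millis}",
        ],
        "controller_openflow": [
            "ovs-vsctl", "set", "open", ".",
            f"external_ids:ovn-openflow-probe-interval={int(seconds)}",
        ],
    }


if __name__ == "__main__":
    # Only prints the commands; nothing is executed.
    for name, argv in probe_commands(30).items():
        print(f"{name}: {' '.join(argv)}")
```

The commands are returned as argument lists, so they can be reviewed or printed before being passed to something like subprocess.run.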


Thanks!

Tony

> -Original Message-
> From: Han Zhou 
> Sent: Tuesday, August 4, 2020 4:44 PM
> To: Tony Liu 
> Cc: Numan Siddique ; Han Zhou ; ovs-
> discuss ; ovs-dev 
> Subject: Re: [ovs-discuss] [OVN] no response to inactivity probe
> 
> 
> 
> On Tue, Aug 4, 2020 at 2:50 PM Tony Liu <tonyliu0...@hotmail.com> wrote:
> 
> 
>   Hi,
> 
>   Since I have 3 OVN DB nodes, should I add 3 rows in connection
> table
>   for the inactivity_probe? Or put 3 addresses into one row?
> 
>   "set-connection" set one row only, and there is no "add-connection".
>   How should I add 3 rows into the table connection?
> 
> 
> 
> 
> You only need to set one row. Try this command:
> 
> ovn-nbctl -- --id=@conn_uuid create Connection target="ptcp\:6641\:0.0.0.0" inactivity_probe=0 -- set NB_Global . connections=@conn_uuid
> 
> 
> 
>   Thanks!
> 
>   Tony
> 
>   > -Original Message-
>   > From: Numan Siddique mailto:num...@ovn.org> >
>   > Sent: Tuesday, August 4, 2020 12:36 AM
>   > To: Tony Liu  <mailto:tonyliu0...@hotmail.com> >
>   > Cc: ovs-discuss mailto:ovs-
> disc...@openvswitch.org> >; ovs-dev    > d...@openvswitch.org <mailto:d...@openvswitch.org> >
>   > Subject: Re: [ovs-discuss] [OVN] no response to inactivity probe
>   >
>   >
>   >
>   > On Tue, Aug 4, 2020 at 9:12 AM Tony Liu  <mailto:tonyliu0...@hotmail.com>
>   > <mailto:tonyliu0...@hotmail.com
> <mailto:tonyliu0...@hotmail.com> > > wrote:
>   >
>   >
>   >   In my deployment, on each Neutron server, there are 13
> Neutron
>   > server processes.
>   >   I see 12 of them (monitor, maintenance, RPC, API) connect
> to both
>   > ovn-nb-db
>   >   and ovn-sb-db. With 3 Neutron server nodes, that's 36 OVSDB
> clients.
>   >   Is so many clients OK?
>   >
>   >   Any suggestions how to figure out which side doesn't
> respond the
>   > probe,
>   >   if it's bi-directional? I don't see any activities from
> logging,
>   > other than
>   >   connect/drop and reconnect...
>   >
>   >   BTW, please let me know if this is not the right place to
> discuss
>   > Neutron OVN
>   >   ML2 driver.
>   >
>   >
>   >   Thanks!
>   >
>   >   Tony
>   >
>   >   > -Original Message-
>   >   > From: dev mailto:ovs-
> dev-boun...@openvswitch.org>  <mailto:ovs-dev- <mailto:ovs-dev->
>   > boun...@openvswitch.org <mailto:boun...@openvswitch.org> > > On
> Behalf Of Tony Liu
>   >   > Sent: Monday, August 3, 2020 7:45 PM
>   >   > To: ovs-discuss mailto:ovs-
> disc...@openvswitch.org>  <mailto:ovs- <mailto:ovs->
>   > disc...@openvswitch.org <mailto:disc...@openvswitch.org> > >;
> ovs-dev>   > d...@openvswitch.org <mailto:d...@openvswitch.org>
> <mailto:d...@openvswitch.org <mailto:d...@openvswitch.org> > >
>   >   > Subject: [ovs-dev] [OVN] no response to inactivity probe
>   >   >
>   >   > Hi,
>   >   >
>   >   > Neutron OVN ML2 driver was disconnected by ovn-nb-db.
> There are
>   > many
>   >   > error messages from ovn-nb-db leader.
>   >   > 
>   >   > 2020-08-
> 04T02:31:39.751Z|03138|reconnect|ERR|tcp:10.6.20.81:58620
> <http://10.6.20.81:58620>
>   > <http://10.6.20.81:58620> : no
>   >   > response to inactivity probe after 5 seconds,
> disconnecting
>   >   > 2020-08-
> 04T02:31:42.484Z|03139|reconnect|ERR|tcp:10.6.20.81:58300
> <http://10.6.20.81:58300>
>   > <http://10.6.20.81:58300> : no
>   >   > response to inactivity probe after 5 seconds,
> disconnecting
>   >   > 2020-08-
> 04T02:31:49.858Z|03140|reconnect|ERR|tcp:10.6.20.81:59582
> <http://10.6.20.81:59582>
>   > <http://10.6.20.81:59582> : no
>   >   > response to inactivity probe after 5 seconds,
> disconnecting
>   >   > 2020-08-
> 04T02:31:53.057Z|03141|reconnect|ERR|tcp:10.6.20.83:42626
> <http://10.6.20.83:42626>
>   

Re: [ovs-discuss] [OVN] ovn-northd takes much CPU when no configuration update

2020-08-04 Thread Tony Liu
Hi Han,

Sounds good. I am looking forward to incremental-processing,
and will go from there.

BTW, it would be great if you could let me know how to set probe
interval for 3-node cluster, here or in another thread.


Thanks!

Tony

> -Original Message-
> From: Han Zhou 
> Sent: Tuesday, August 4, 2020 4:02 PM
> To: Tony Liu 
> Cc: Han Zhou ; Numan Siddique ; Ben Pfaff
> ; Leonid Ryzhyk ; ovs-dev  d...@openvswitch.org>; ovs-discuss 
> Subject: Re: [ovs-discuss] [OVN] ovn-northd takes much CPU when no
> configuration update
> 
> Hi Tony,
> 
> I am glad it is more clear now. For your concern regarding taking too
> much time for one round of computing, it is valid, but I guess it is not
> directly related to the IDLE probe any more, right?
> The OVSDB IDL in fact already does some of the work of capping and
> buffering like what you proposed. The IDL will read a limited number of
> messages to get processed in each round (and the remaining messages are
> buffered in the stream). However, sometimes a single notification
> message can contain a huge amount of data. It is hard to split the data
> from one single notification, because the data are internally dependent
> on each other.
> 
> Without incremental-processing, the size of the data change doesn't
> matter much because all data is recomputed anyway. I'd suggest to see
> what's the outcome of incremental-processing, and see if any further
> improvement is still needed for handling big transactions.
> 
> In my opinion, the special cases of a big data change triggered by
> scenarios such as data restore can be handled by operational approaches
> instead of implementation. For example, you could adjust the probe
> interval before doing data restore, and change it back afterwards. But
> of course, if there are good ways to implement we should definitely
> consider.
> 
> 
> Thanks,
> Han
> 
> 
> On Tue, Aug 4, 2020 at 2:00 PM Tony Liu <tonyliu0...@hotmail.com> wrote:
> 
> 
>   Hi Han,
> 
>   Thanks for clarifications! It's crystal clear.
> 
>   My concern, in general, is blocking. For ovn-northd, or any OVSDB
>   client (I assume all OVSDB clients use the same library for
>   connections, probes, etc.?), when handling the current event, it won't
>   be interrupted to handle an incoming event, right? How long does it
>   take to handle a compute event for a big chunk of data? How much data
>   can be buffered for computing? Is there an estimated maximum time for
>   handling that much data?
> 
>   In case it takes more than 5s to process an event, then the peer
> will
>   drop the connection because of probe timeout.
> 
>   With incremental-process, if I restore DB, then that still could be
> a
>   huge incremental, unless the incremental size is controlled. That's
>   probably why you recommend to restore to existing cluster, to avoid
>   huge incremental from restoring to a fresh cluster. Am I right?
> 
>   What I used to do is to chop big data into pieces and to be handled
> by
>   multiple event loops. That way, other events will have a chance to
> get
>   processed. So big chunk of data won't cause blocking.
> 
>   Enlarge probe interval will sort of resolve the issue, but it will
> lose
>   the point of probing. Just like that election timer, enlarge the
> timer
>   avoids often failover, but it also increases the failover time when
> real
>   problem happens. And yes, I agree that it's on control plane and
> doesn't
>   break data plane, but just like in networking world, routing
> convergence
>   is very important.
> 
>   I am thinking, in your incremental-processing, if the time for each
> event
>   loop can be capped or controlled, that would be very helpful. The
> side
>   effect of that option is memory consumption. You will need to
> buffer more
>   data. But today, it's lots easier to increase memory to boost
> performance.
> 
> 
>   Thanks!
> 
>   Tony
> 
>   > -Original Message-
>   > From: Han Zhou mailto:hz...@ovn.org> >
>   > Sent: Tuesday, August 4, 2020 12:34 PM
>   > To: Tony Liu  <mailto:tonyliu0...@hotmail.com> >
>   > Cc: Han Zhou mailto:hz...@ovn.org> >; Numan
> Siddique mailto:num...@ovn.org> >; Ben Pfaff
>   > mailto:b...@ovn.org> >; Leonid Ryzhyk
> mailto:lryz...@vmware.com> >; ovs-dev> d...@openvswitch.org <mailto:d...@openvswitch.org> >; ovs-discuss
> mailto:ovs-discuss@openvswitch.org> >
>   > Subject: Re: [ovs-discuss] [OVN] 

Re: [ovs-discuss] [OVN] no response to inactivity probe

2020-08-04 Thread Tony Liu
Hi,

Since I have 3 OVN DB nodes, should I add 3 rows to the Connection table
for the inactivity_probe? Or put 3 addresses into one row?

"set-connection" sets one row only, and there is no "add-connection".
How should I add 3 rows to the Connection table?


Thanks!

Tony

> -Original Message-
> From: Numan Siddique 
> Sent: Tuesday, August 4, 2020 12:36 AM
> To: Tony Liu 
> Cc: ovs-discuss ; ovs-dev  d...@openvswitch.org>
> Subject: Re: [ovs-discuss] [OVN] no response to inactivity probe
> 
> 
> 
> On Tue, Aug 4, 2020 at 9:12 AM Tony Liu <tonyliu0...@hotmail.com> wrote:
> 
> 
>   In my deployment, on each Neutron server, there are 13 Neutron
> server processes.
>   I see 12 of them (monitor, maintenance, RPC, API) connect to both
> ovn-nb-db
>   and ovn-sb-db. With 3 Neutron server nodes, that's 36 OVSDB clients.
>   Is so many clients OK?
> 
>   Any suggestions how to figure out which side doesn't respond the
> probe,
>   if it's bi-directional? I don't see any activities from logging,
> other than
>   connect/drop and reconnect...
> 
>   BTW, please let me know if this is not the right place to discuss
> Neutron OVN
>   ML2 driver.
> 
> 
>   Thanks!
> 
>       Tony
> 
>   > -Original Message-
>   > From: dev mailto:ovs-dev-
> boun...@openvswitch.org> > On Behalf Of Tony Liu
>   > Sent: Monday, August 3, 2020 7:45 PM
>   > To: ovs-discuss mailto:ovs-
> disc...@openvswitch.org> >; ovs-dev> d...@openvswitch.org <mailto:d...@openvswitch.org> >
>   > Subject: [ovs-dev] [OVN] no response to inactivity probe
>   >
>   > Hi,
>   >
>   > Neutron OVN ML2 driver was disconnected by ovn-nb-db. There are
> many
>   > error messages from ovn-nb-db leader.
>   > 
>   > 2020-08-04T02:31:39.751Z|03138|reconnect|ERR|tcp:10.6.20.81:58620
> <http://10.6.20.81:58620> : no
>   > response to inactivity probe after 5 seconds, disconnecting
>   > 2020-08-04T02:31:42.484Z|03139|reconnect|ERR|tcp:10.6.20.81:58300
> <http://10.6.20.81:58300> : no
>   > response to inactivity probe after 5 seconds, disconnecting
>   > 2020-08-04T02:31:49.858Z|03140|reconnect|ERR|tcp:10.6.20.81:59582
> <http://10.6.20.81:59582> : no
>   > response to inactivity probe after 5 seconds, disconnecting
>   > 2020-08-04T02:31:53.057Z|03141|reconnect|ERR|tcp:10.6.20.83:42626
> <http://10.6.20.83:42626> : no
>   > response to inactivity probe after 5 seconds, disconnecting
>   > 2020-08-04T02:31:53.058Z|03142|reconnect|ERR|tcp:10.6.20.82:45412
> <http://10.6.20.82:45412> : no
>   > response to inactivity probe after 5 seconds, disconnecting
>   > 2020-08-04T02:31:54.067Z|03143|reconnect|ERR|tcp:10.6.20.81:59416
> <http://10.6.20.81:59416> : no
>   > response to inactivity probe after 5 seconds, disconnecting
>   > 2020-08-04T02:31:54.809Z|03144|reconnect|ERR|tcp:10.6.20.81:60004
> <http://10.6.20.81:60004> : no
>   > response to inactivity probe after 5 seconds, disconnecting
> 
>   >
>   > Could anyone share a bit details how this inactivity probe works?
> 
> 
> 
> The inactivity probe is sent by both the server and clients
> independently.
> Meaning ovsdb-server will send an inactivity probe every 'x' configured
> seconds to all its connected clients and if it doesn't get a reply from
> the client within some time, it disconnects the connection.
> 
> The inactivity probe from the server side can be configured. Run
> "ovn-nbctl list connection" and you will see the inactivity_probe column.
> You can set this column to the desired value, e.g.
> ovn-nbctl set connection . inactivity_probe=30000
> (for 30 seconds; the value is in milliseconds)
> 
> The same thing for SB ovsdb-server.
> 
> Similarly each client (ovn-northd, ovn-controller, neutron server) sends
> inactivity probe every 'y' seconds and if the client doesn't get any
> reply from ovsdb-server it will disconnect the connection and reconnect
> again.
> 
> For ovn-northd you can configure this as (value in milliseconds):
> ovn-nbctl set NB_Global . options:northd_probe_interval=30000
> 
> For ovn-controllers (value in milliseconds):
> ovs-vsctl set open . external_ids:ovn-remote-probe-interval=30000
> 
> There is also a probe interval for the OpenFlow connection from
> ovn-controller to ovs-vswitchd, which you can configure as
> ovs-vsctl set open . external_ids:ovn-openflow-probe-interval=30
> (this one is in seconds)
> 
> 
> Regarding the neutron server I think it is set to 60 seconds. Plea

Re: [ovs-discuss] [OVN] ovn-northd takes much CPU when no configuration update

2020-08-04 Thread Tony Liu
Hi Han,

Thanks for clarifications! It's crystal clear.

My concern, in general, is blocking. For ovn-northd, or any OVSDB client
(I assume all OVSDB clients use the same library for connections,
probes, etc.?), when handling the current event, it won't be interrupted to
handle an incoming event, right? How long does it take to handle a
compute event for a big chunk of data? How much data can be buffered
for computing? Is there an estimated maximum time for handling that much data?

In case it takes more than 5s to process an event, then the peer will
drop the connection because of probe timeout.

With incremental processing, if I restore the DB, that could still be a
huge increment, unless the increment size is controlled. That's
probably why you recommend restoring to an existing cluster, to avoid
the huge increment from restoring to a fresh cluster. Am I right?

What I used to do is chop big data into pieces to be handled over
multiple event-loop iterations. That way, other events have a chance to
get processed, so a big chunk of data won't cause blocking.
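
As an illustration of that idea only (this is not how ovn-northd or the OVSDB IDL works, and whether a single OVSDB update can be split this way is a separate question), here is a minimal sketch of an event loop that handles one piece of a large update per iteration and answers pending probes in between. All names are hypothetical.

```python
from collections import deque


def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def run_loop(big_update, chunk_size, probes):
    """Process one chunk per loop iteration, replying to probes in between.

    Returns a trace of what happened in order, so the interleaving of
    probe replies and chunk processing is visible.
    """
    pending = deque(chunked(big_update, chunk_size))
    waiting_probes = deque(probes)
    trace = []
    while pending or waiting_probes:
        # Answer at most one probe first so the peer does not time out.
        if waiting_probes:
            trace.append(("probe-reply", waiting_probes.popleft()))
        # Then do a bounded amount of the heavy work.
        if pending:
            chunk = pending.popleft()
            trace.append(("processed", len(chunk)))
    return trace


if __name__ == "__main__":
    for step in run_loop(list(range(10)), 4, ["probe-1", "probe-2"]):
        print(step)
```

The cost of this approach is that unprocessed pieces have to be buffered until their turn comes.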

Enlarging the probe interval will sort of resolve the issue, but it defeats
the point of probing. Just as with the election timer, enlarging the timer
avoids frequent failovers, but it also increases the failover time when a real
problem happens. And yes, I agree that it's on the control plane and doesn't
break the data plane, but just like in the networking world, routing
convergence is very important.

I am thinking that, in your incremental processing, if the time for each
event-loop iteration can be capped or controlled, that would be very helpful.
The side effect of that option is memory consumption, since you will need to
buffer more data. But today, it's a lot easier to add memory to boost
performance.


Thanks!

Tony

> -Original Message-
> From: Han Zhou 
> Sent: Tuesday, August 4, 2020 12:34 PM
> To: Tony Liu 
> Cc: Han Zhou ; Numan Siddique ; Ben Pfaff
> ; Leonid Ryzhyk ; ovs-dev  d...@openvswitch.org>; ovs-discuss 
> Subject: Re: [ovs-discuss] [OVN] ovn-northd takes much CPU when no
> configuration update
> 
> 
> 
> On Tue, Aug 4, 2020 at 11:40 AM Tony Liu <tonyliu0...@hotmail.com> wrote:
> 
> 
>   Inline...
> 
>   Thanks!
> 
>   Tony
>   > -Original Message-
>   > From: Han Zhou mailto:hz...@ovn.org> >
>   > Sent: Tuesday, August 4, 2020 11:01 AM
>   > To: Numan Siddique mailto:num...@ovn.org> >; Ben
> Pfaff mailto:b...@ovn.org> >; Leonid
>   > Ryzhyk mailto:lryz...@vmware.com> >
>   > Cc: Tony Liu  <mailto:tonyliu0...@hotmail.com> >; Han Zhou  <mailto:hz...@ovn.org> >; ovs-
>   > dev mailto:ovs-...@openvswitch.org> >;
> ovs-discuss mailto:ovs-
> disc...@openvswitch.org> >
>   > Subject: Re: [ovs-discuss] [OVN] ovn-northd takes much CPU when
> no
>   > configuration update
>   >
>   >
>   >
>   > On Tue, Aug 4, 2020 at 12:38 AM Numan Siddique  <mailto:num...@ovn.org>
>   > <mailto:num...@ovn.org <mailto:num...@ovn.org> > > wrote:
>   >
>   >
>   >
>   >
>   >   On Tue, Aug 4, 2020 at 9:02 AM Tony Liu
> mailto:tonyliu0...@hotmail.com>
>   > <mailto:tonyliu0...@hotmail.com
> <mailto:tonyliu0...@hotmail.com> > > wrote:
>   >
>   >
>   >   The probe awakes recomputing?
>   >   There is probe every 5 seconds. Without any
> connection
>   > up/down or failover,
>   >   ovn-northd will recompute everything every 5
> seconds, no
>   > matter what?
>   >   Really?
>   >
>   >   Anyways, I will increase the probe interval for now,
> see if
>   > that helps.
>   >
>   >
>   >
>   >   I think we should optimise this case. I am planning to look
> into
>   > this.
>   >
>   >   Thanks
>   >   Numan
>   >
>   >
>   > Thanks Numan.
>   > I'd like to discuss more on this before we move forward to change
>   > anything.
>   >
>   > 1) Regarding the problem itself, the CPU cost triggered by OVSDB
> IDLE
>   > probe when there is no configuration change to compute, I don't
> think it
>   > matters that much in real production. It simply wastes CPU cycles
> when
>   > there is nothing to do, so what harm would it do here? For ovn-
> northd,
>   > since it is the centralized component, we would always ensure
> there is
>   > enough CPU available for ovn-north when computing is needed, and
> t

Re: [ovs-discuss] [OVN] ovn-northd takes much CPU when no configuration update

2020-08-04 Thread Tony Liu
Inline...

Thanks!

Tony
> -Original Message-
> From: Han Zhou 
> Sent: Tuesday, August 4, 2020 11:01 AM
> To: Numan Siddique ; Ben Pfaff ; Leonid
> Ryzhyk 
> Cc: Tony Liu ; Han Zhou ; ovs-
> dev ; ovs-discuss 
> Subject: Re: [ovs-discuss] [OVN] ovn-northd takes much CPU when no
> configuration update
> 
> 
> 
> On Tue, Aug 4, 2020 at 12:38 AM Numan Siddique <num...@ovn.org> wrote:
> 
> 
> 
> 
>   On Tue, Aug 4, 2020 at 9:02 AM Tony Liu <tonyliu0...@hotmail.com> wrote:
> 
> 
>   The probe awakes recomputing?
>   There is probe every 5 seconds. Without any connection
> up/down or failover,
>   ovn-northd will recompute everything every 5 seconds, no
> matter what?
>   Really?
> 
>   Anyways, I will increase the probe interval for now, see if
> that helps.
> 
> 
> 
>   I think we should optimise this case. I am planning to look into
> this.
> 
>   Thanks
>   Numan
> 
> 
> Thanks Numan.
> I'd like to discuss more on this before we move forward to change
> anything.
> 
> 1) Regarding the problem itself, the CPU cost triggered by OVSDB IDLE
> probe when there is no configuration change to compute, I don't think it
> matters that much in real production. It simply wastes CPU cycles when
> there is nothing to do, so what harm would it do here? For ovn-northd,
> since it is the centralized component, we would always ensure there is
> enough CPU available for ovn-northd when computing is needed, and this
> reservation will be wasted anyway when there is no change to compute. So,
> I'd avoid making any change specifically only to address this issue. I
> could be wrong, though. I'd like to hear what would be the real concern
> if this is not addressed.

Would more vCPUs help here? Is ovn-northd multi-threaded?

I am probably still missing something here. The probe is there at all times,
every 5s. If ovn-northd is in the middle of a computation, is a probe going
to make ovn-northd restart it? Or does the probe only trigger a
computation when ovn-northd is idle? Even in the latter case, what's the
intention of triggering a computation from a probe?

> 
> 2) ovn-northd incremental processing would avoid this CPU problem
> naturally. So let's discuss how to move forward for incremental
> processing, which is much more important because it also solves the CPU
> efficiency when handling the changes, and the IDLE probe problem is just
> a byproduct. I believe the DDlog branch would have solved this problem.
> However, it seems we are not sure about the current status of DDlog. As
> you proposed at the last OVN meeting, an alternative is to implement
> partial incremental-processing using the I-P engine like ovn-controller.
> While I have no objection to this, we'd better check with Ben and Leonid
> on the plan to avoid overlapping and waste of work. @Ben @Leonid, would
> you mind sharing the status here since you were not at the meeting last
> week?

My point is that a probe is not supposed to trigger any computation, whether
it's full or incremental.
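
As a rough illustration of the behaviour being asked for here, the sketch
below models a main loop that answers every probe but runs a translation only
when the database content has changed. It is not ovn-northd code: the
`Northd` class, `on_wakeup`, and the `db_seqno` argument are hypothetical
names made up for this example.

```python
from dataclasses import dataclass


@dataclass
class Northd:
    """Toy model of a main loop; not the real ovn-northd implementation."""
    last_seqno: int = 0     # last database change number that was processed
    recomputes: int = 0     # how many translations were run
    echo_replies: int = 0   # how many probes were answered

    def on_wakeup(self, event: str, db_seqno: int) -> None:
        # Always answer a probe so the peer keeps the connection open.
        if event == "echo":
            self.echo_replies += 1
        # Translate only when the database content actually changed.
        if db_seqno != self.last_seqno:
            self.recomputes += 1
            self.last_seqno = db_seqno


northd = Northd()
northd.on_wakeup("update", db_seqno=1)    # real change: translate once
for _ in range(3):
    northd.on_wakeup("echo", db_seqno=1)  # probes only: nothing to translate
print(northd.recomputes, northd.echo_replies)
```

With a check like this, a wake-up caused only by a probe costs an echo reply
and nothing more.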

> 
> 
> 
> Thanks,
> Han
> 
> 
> 
> 
> 
> 
>   Thanks!
> 
>   Tony
> 
>   > -----Original Message-----
>   > From: Han Zhou <hz...@ovn.org>
>   > Sent: Monday, August 3, 2020 8:22 PM
>   > To: Tony Liu <tonyliu0...@hotmail.com>
>   > Cc: Han Zhou <hz...@ovn.org>; ovs-discuss <ovs-disc...@openvswitch.org>;
>   > ovs-dev <ovs-d...@openvswitch.org>
>   > Subject: Re: [ovs-discuss] [OVN] ovn-northd takes much CPU when no
>   > configuration update
>   >
>   > Sorry that I didn't make it clear enough. The OVSDB probe
> itself doesn't
>   > take much CPU, but the probe awakes ovn-northd main loop,
> which
>   > recompute everything, which is why you see CPU spike.
>   > It will be solved by incremental-processing, when only
> delta is
>   > processed, and in case of probe handling, there is no
> change in
>   > configuration, so the delta is zero.
>   > For now, please follow the steps to adjust probe interval,
> if the CPU of
>   > ovn-northd (when there is no configuration change) is a
> concern for you.
>   > But please remember that this has no impact to the real CPU
> usage for
>   > handling configuration changes.
>   >

Re: [ovs-discuss] [ovs-dev] [OVN] stale data complained by ovn-controller after db restore

2020-08-04 Thread Tony Liu
Is there any difference between restoring the DB on an existing cluster and
on a fresh cluster, in terms of performance?

If I don't have to restore on a fresh cluster, which is recommended?

For now, since ovn-northd always recomputes the whole DB, I guess there is
not much difference?

With incremental processing, would restoring to a fresh cluster be better?

Is it necessary to stop or restart ovn-northd during DB restore?


Thanks!

Tony

> -----Original Message-----
> From: Han Zhou 
> Sent: Tuesday, August 4, 2020 11:13 AM
> To: Tony Liu 
> Cc: ovs-discuss ; ovs-dev  d...@openvswitch.org>
> Subject: Re: [ovs-dev] [OVN] stale data complained by ovn-controller
> after db restore
> 
> 
> 
> On Tue, Aug 4, 2020 at 10:30 AM Tony Liu <tonyliu0...@hotmail.com> wrote:
> 
> 
>   Hi,
> 
>   Here is how I restore OVN DB.
>   * Stop all ovn-nb-db, ovn-sb-db and ovn-northd services.
>   * Clean up all DB files.
>   * Start all DB services. Fresh ovn-nb-db and ovn-sb-db clusters are
> up and
> running.
>   * Set DB election timer to 10s.
>   * Restore DB to ovn-nb-db by ovsdb-client.
>   * Start all ovn-northd services.
> 
>   A few minutes after, ovn-sb-db is fully synced with ovn-nb-db.
> 
>   Now, the client of ovn-sb-db, ovn-controller and nova-compute
> complaint about
>   "stale data". The chassis node is not getting updated.
>   
>   2020-08-04 09:07:45.892 26 INFO ovsdbapp.backend.ovs_idl.vlog [-]
> tcp:10.6.20.84:6642: connected
>   2020-08-04 09:07:45.895 26 WARNING ovsdbapp.backend.ovs_idl.vlog [-]
> tcp:10.6.20.84:6642: clustered database server has stale data; trying
> another server
>   
> 
>   Restarting ovn-controller and nova-compute resolve the issue.
> 
>   Is this expected? As part of the DB restore process, should I
> restart
>   ovn-controller and nova-compute on all chassis node?
> 
> 
> 
> 
> Yes, this is expected if you freshly start a new cluster. (It wouldn't
> happen if you simply restore the old data on the existing cluster.
> However, I understand that the scenario of restoring data on a freshly
> created cluster is a valid use case).
> For this case, you could either restart ovn-controller, or trigger a
> client side raft index reset by:
> ovn-appctl -t ovn-controller sb-cluster-state-reset
> 
> Similarly for ovn-northd:
> ovn-appctl -t ovn-northd nb-cluster-state-reset
> ovn-appctl -t ovn-northd sb-cluster-state-reset
> 
> To use this command, you will need at least 20.06 of OVN and OVS master.
> 
> 
> Thanks,
> Han
> 
> 
> 
> 
>   Thanks!
> 
>   Tony
> 
>   ___
>   dev mailing list
>   d...@openvswitch.org <mailto:d...@openvswitch.org>
>   https://mail.openvswitch.org/mailman/listinfo/ovs-dev
> 
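
The "stale data" warning and the reset commands in Han's reply can be pictured
with a small model. This is a sketch under the assumption that a client of a
clustered database remembers the highest Raft log index it has seen and
refuses a server that reports a lower one; the `ClusterClient` class and its
method names are hypothetical and are not the ovsdb IDL API.

```python
class ClusterClient:
    """Toy model of a clustered-database client; names are hypothetical."""

    def __init__(self) -> None:
        self.min_index = 0  # highest Raft log index seen so far

    def accept(self, server_index: int) -> bool:
        # A server that is behind what we already saw is treated as stale.
        if server_index < self.min_index:
            return False
        self.min_index = server_index
        return True

    def reset(self) -> None:
        # Analogue of the *-cluster-state-reset commands: forget the index.
        self.min_index = 0


client = ClusterClient()
seen_old = client.accept(5000)         # old cluster with a long history
seen_fresh = client.accept(40)         # fresh cluster after the restore
client.reset()
seen_after_reset = client.accept(40)   # same server, after the reset
print(seen_old, seen_fresh, seen_after_reset)
```

In this model a freshly created cluster starts counting from a low index, so a
long-running client rejects it until the client is restarted or its remembered
index is reset.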

___
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss


[ovs-discuss] [OVN] stale data complained by ovn-controller after db restore

2020-08-04 Thread Tony Liu
Hi,

Here is how I restore OVN DB.
* Stop all ovn-nb-db, ovn-sb-db and ovn-northd services.
* Clean up all DB files.
* Start all DB services. Fresh ovn-nb-db and ovn-sb-db clusters are up and
  running.
* Set DB election timer to 10s.
* Restore DB to ovn-nb-db by ovsdb-client.
* Start all ovn-northd services.

A few minutes after, ovn-sb-db is fully synced with ovn-nb-db.

Now the clients of ovn-sb-db, ovn-controller and nova-compute, complain about
"stale data". The chassis node is not getting updated.

2020-08-04 09:07:45.892 26 INFO ovsdbapp.backend.ovs_idl.vlog [-] 
tcp:10.6.20.84:6642: connected
2020-08-04 09:07:45.895 26 WARNING ovsdbapp.backend.ovs_idl.vlog [-] 
tcp:10.6.20.84:6642: clustered database server has stale data; trying another 
server


Restarting ovn-controller and nova-compute resolves the issue.

Is this expected? As part of the DB restore process, should I restart
ovn-controller and nova-compute on all chassis nodes?


Thanks!

Tony



Re: [ovs-discuss] [OVN] ovn-northd takes much CPU when no configuration update

2020-08-04 Thread Tony Liu
Thanks Numan for looking into it!
A probe is for health checking only; it's not supposed to trigger translation,
even with an incremental implementation. Translation should be triggered only
when an ovn-northd becomes active.


Tony

> -----Original Message-----
> From: Numan Siddique 
> Sent: Tuesday, August 4, 2020 12:38 AM
> To: Tony Liu 
> Cc: Han Zhou ; ovs-dev ; ovs-
> discuss 
> Subject: Re: [ovs-discuss] [OVN] ovn-northd takes much CPU when no
> configuration update
> 
> 
> 
> On Tue, Aug 4, 2020 at 9:02 AM Tony Liu <tonyliu0...@hotmail.com> wrote:
> 
> 
>   The probe awakes recomputing?
>   There is probe every 5 seconds. Without any connection up/down or
> failover,
>   ovn-northd will recompute everything every 5 seconds, no matter
> what?
>   Really?
> 
>   Anyways, I will increase the probe interval for now, see if that
> helps.
> 
> 
> 
> I think we should optimise this case. I am planning to look into this.
> 
> Thanks
> Numan
> 
> 
> 
> 
>   Thanks!
> 
>   Tony
> 
>   > -----Original Message-----
>   > From: Han Zhou <hz...@ovn.org>
>   > Sent: Monday, August 3, 2020 8:22 PM
>   > To: Tony Liu <tonyliu0...@hotmail.com>
>   > Cc: Han Zhou <hz...@ovn.org>; ovs-discuss <ovs-discuss@openvswitch.org>;
>   > ovs-dev <ovs-d...@openvswitch.org>
>   > Subject: Re: [ovs-discuss] [OVN] ovn-northd takes much CPU when no
>   > configuration update
>   >
>   > Sorry that I didn't make it clear enough. The OVSDB probe itself
> doesn't
>   > take much CPU, but the probe awakes ovn-northd main loop, which
>   > recompute everything, which is why you see CPU spike.
>   > It will be solved by incremental-processing, when only delta is
>   > processed, and in case of probe handling, there is no change in
>   > configuration, so the delta is zero.
>   > For now, please follow the steps to adjust probe interval, if the
> CPU of
>   > ovn-northd (when there is no configuration change) is a concern
> for you.
>   > But please remember that this has no impact to the real CPU usage
> for
>   > handling configuration changes.
>   >
>   >
>   > Thanks,
>   > Han
>   >
>   >
>   > On Mon, Aug 3, 2020 at 8:11 PM Tony Liu <tonyliu0...@hotmail.com> wrote:
>   >
>   >
>   >   Health check (5 sec interval) taking 30%-100% CPU is
> definitely not
>   > acceptable,
>   >   if that's really the case. There must be some blocking (and
> not
>   > yielding CPU)
>   >   in coding, which is not supposed to be there.
>   >
>   >   Could you point me to the coding for such health check?
>   >   Is it single thread? Does it use any event library?
>   >
>   >
>   >   Thanks!
>   >
>   >   Tony
>   >
>   >   > -----Original Message-----
>   >   > From: Han Zhou <hz...@ovn.org>
>   >   > Sent: Saturday, August 1, 2020 9:11 PM
>   >   > To: Tony Liu <tonyliu0...@hotmail.com>
>   >   > Cc: ovs-discuss <ovs-disc...@openvswitch.org>; ovs-dev
>   >   > <d...@openvswitch.org>
>   >   > Subject: Re: [ovs-discuss] [OVN] ovn-northd takes much CPU when no
>   >   > configuration update
>   >   >
>   >   >
>   >   >
>   >   > On Fri, Jul 31, 2020 at 4:14 PM Tony Liu <tonyliu0...@hotmail.com> wrote:
>   >   >
>   >

Re: [ovs-discuss] [OVN] no response to inactivity probe

2020-08-03 Thread Tony Liu
In my deployment, on each Neutron server, there are 13 Neutron server processes.
I see 12 of them (monitor, maintenance, RPC, API) connect to both ovn-nb-db
and ovn-sb-db. With 3 Neutron server nodes, that's 36 OVSDB clients.
Are that many clients OK?

Any suggestions on how to figure out which side doesn't respond to the probe,
if it's bi-directional? I don't see any activity in the logs, other than
connect/drop and reconnect...

BTW, please let me know if this is not the right place to discuss Neutron OVN
ML2 driver.


Thanks!

Tony

> -----Original Message-----
> From: dev  On Behalf Of Tony Liu
> Sent: Monday, August 3, 2020 7:45 PM
> To: ovs-discuss ; ovs-dev  d...@openvswitch.org>
> Subject: [ovs-dev] [OVN] no response to inactivity probe
> 
> Hi,
> 
> Neutron OVN ML2 driver was disconnected by ovn-nb-db. There are many
> error messages from ovn-nb-db leader.
> 
> 2020-08-04T02:31:39.751Z|03138|reconnect|ERR|tcp:10.6.20.81:58620: no
> response to inactivity probe after 5 seconds, disconnecting
> 2020-08-04T02:31:42.484Z|03139|reconnect|ERR|tcp:10.6.20.81:58300: no
> response to inactivity probe after 5 seconds, disconnecting
> 2020-08-04T02:31:49.858Z|03140|reconnect|ERR|tcp:10.6.20.81:59582: no
> response to inactivity probe after 5 seconds, disconnecting
> 2020-08-04T02:31:53.057Z|03141|reconnect|ERR|tcp:10.6.20.83:42626: no
> response to inactivity probe after 5 seconds, disconnecting
> 2020-08-04T02:31:53.058Z|03142|reconnect|ERR|tcp:10.6.20.82:45412: no
> response to inactivity probe after 5 seconds, disconnecting
> 2020-08-04T02:31:54.067Z|03143|reconnect|ERR|tcp:10.6.20.81:59416: no
> response to inactivity probe after 5 seconds, disconnecting
> 2020-08-04T02:31:54.809Z|03144|reconnect|ERR|tcp:10.6.20.81:60004: no
> response to inactivity probe after 5 seconds, disconnecting 
> 
> Could anyone share a bit details how this inactivity probe works?
> From OVN ML2 driver log, I see it connected to the leader, then the
> connection was closed by leader after 5 or 6 seconds. Is this probe one-
> way or two-ways?
> Both sides are not busy, not taking much CPU cycles. Not sure how this
> could happen. Any thoughts?
> 
> 
> Thanks!
> 
> Tony
> 
> 
> 


Re: [ovs-discuss] [OVN] ovn-northd takes much CPU when no configuration update

2020-08-03 Thread Tony Liu
So the probe triggers a recompute?
There is a probe every 5 seconds. Without any connection up/down or failover,
ovn-northd will recompute everything every 5 seconds, no matter what?
Really?

Anyway, I will increase the probe interval for now and see if that helps.


Thanks!

Tony

> -----Original Message-----
> From: Han Zhou 
> Sent: Monday, August 3, 2020 8:22 PM
> To: Tony Liu 
> Cc: Han Zhou ; ovs-discuss ;
> ovs-dev 
> Subject: Re: [ovs-discuss] [OVN] ovn-northd takes much CPU when no
> configuration update
> 
> Sorry that I didn't make it clear enough. The OVSDB probe itself doesn't
> take much CPU, but the probe awakes ovn-northd main loop, which
> recompute everything, which is why you see CPU spike.
> It will be solved by incremental-processing, when only delta is
> processed, and in case of probe handling, there is no change in
> configuration, so the delta is zero.
> For now, please follow the steps to adjust probe interval, if the CPU of
> ovn-northd (when there is no configuration change) is a concern for you.
> But please remember that this has no impact to the real CPU usage for
> handling configuration changes.
> 
> 
> Thanks,
> Han
> 
> 
> On Mon, Aug 3, 2020 at 8:11 PM Tony Liu <tonyliu0...@hotmail.com> wrote:
> 
> 
>   Health check (5 sec interval) taking 30%-100% CPU is definitely not
> acceptable,
>   if that's really the case. There must be some blocking (and not
> yielding CPU)
>   in coding, which is not supposed to be there.
> 
>   Could you point me to the coding for such health check?
>   Is it single thread? Does it use any event library?
> 
> 
>   Thanks!
> 
>   Tony
> 
>   > -----Original Message-----
>   > From: Han Zhou <hz...@ovn.org>
>   > Sent: Saturday, August 1, 2020 9:11 PM
>   > To: Tony Liu <tonyliu0...@hotmail.com>
>   > Cc: ovs-discuss <ovs-disc...@openvswitch.org>; ovs-dev
>   > <d...@openvswitch.org>
>   > Subject: Re: [ovs-discuss] [OVN] ovn-northd takes much CPU when no
>   > configuration update
>   >
>   >
>   >
>   > On Fri, Jul 31, 2020 at 4:14 PM Tony Liu <tonyliu0...@hotmail.com> wrote:
>   >
>   >
>   >   Hi,
>   >
>   >   I see the active ovn-northd takes much CPU (30% - 100%)
> when there
>   > is no
>   >   configuration from OpenStack, nothing happening on all
> chassis
>   > nodes either.
>   >
>   >   Is this expected? What is it busy with?
>   >
>   >
>   >
>   >
>   > Yes, this is expected. It is due to the OVSDB probe between ovn-
> northd
>   > and NB/SB OVSDB servers, which is used to detect the OVSDB
> connection
>   > failure.
>   > Usually this is not a concern (unlike the probe with a large
> number of
>   > ovn-controller clients), because ovn-northd is a centralized
> component
>   > and the CPU cost when there is no configuration change doesn't
> matter
>   > that much. However, if it is a concern, the probe interval
> (default 5
>   > sec) can be changed.
>   > If you change, remember to change on both server side and client
> side.
>   > For client side (ovn-northd), it is configured in the NB DB's
> NB_Global
>   > table's options:northd_probe_interval. See man page of ovn-nb(5).
>   > For server side (NB and SB), it is configured in the NB and SB
> DB's
>   > Connection table's inactivity_probe column.
>   >
>   > Thanks,
>   > Han
>   >
>   >
>   >
>   >   
>   >   2020-07-31T23:08:09.511Z|04267|poll_loop|DBG|wakeup due to [POLLIN]
>   > on fd 8 (10.6.20.84:44358<->10.6.20.84:6641) at lib/stream-fd.c:157
>   > (68% CPU usage)
>   >   2020-07-31T23:08:09.512Z|04268|jsonrpc|DBG|tcp:10.6.20.84:6641:
>   > received request, method="echo", params=[], id="echo"
>   >   2020-07-31T23:08:09.512Z|04269|jsonrpc|DBG|tcp:10.6.20.84:6641:
>   > send reply

Re: [ovs-discuss] [OVN] ovn-northd takes much CPU when no configuration update

2020-08-03 Thread Tony Liu
A health check (5 sec interval) taking 30%-100% CPU is definitely not acceptable,
if that's really the case. There must be some blocking (without yielding the CPU)
in the code, which is not supposed to be there.

Could you point me to the code for this health check?
Is it single-threaded? Does it use any event library?


Thanks!

Tony

> -----Original Message-----
> From: Han Zhou 
> Sent: Saturday, August 1, 2020 9:11 PM
> To: Tony Liu 
> Cc: ovs-discuss ; ovs-dev  d...@openvswitch.org>
> Subject: Re: [ovs-discuss] [OVN] ovn-northd takes much CPU when no
> configuration update
> 
> 
> 
> On Fri, Jul 31, 2020 at 4:14 PM Tony Liu <tonyliu0...@hotmail.com> wrote:
> 
> 
>   Hi,
> 
>   I see the active ovn-northd takes much CPU (30% - 100%) when there
> is no
>   configuration from OpenStack, nothing happening on all chassis
> nodes either.
> 
>   Is this expected? What is it busy with?
> 
> 
> 
> 
> Yes, this is expected. It is due to the OVSDB probe between ovn-northd
> and NB/SB OVSDB servers, which is used to detect the OVSDB connection
> failure.
> Usually this is not a concern (unlike the probe with a large number of
> ovn-controller clients), because ovn-northd is a centralized component
> and the CPU cost when there is no configuration change doesn't matter
> that much. However, if it is a concern, the probe interval (default 5
> sec) can be changed.
> If you change, remember to change on both server side and client side.
> For client side (ovn-northd), it is configured in the NB DB's NB_Global
> table's options:northd_probe_interval. See man page of ovn-nb(5).
> For server side (NB and SB), it is configured in the NB and SB DB's
> Connection table's inactivity_probe column.
> 
> Thanks,
> Han
> 
> 
> 
>   
>   2020-07-31T23:08:09.511Z|04267|poll_loop|DBG|wakeup due to [POLLIN]
> on fd 8 (10.6.20.84:44358<->10.6.20.84:6641) at lib/stream-fd.c:157
> (68% CPU usage)
>   2020-07-31T23:08:09.512Z|04268|jsonrpc|DBG|tcp:10.6.20.84:6641:
> received request, method="echo", params=[], id="echo"
>   2020-07-31T23:08:09.512Z|04269|jsonrpc|DBG|tcp:10.6.20.84:6641:
> send reply, result=[], id="echo"
>   2020-07-31T23:08:12.777Z|04270|poll_loop|DBG|wakeup due to [POLLIN]
> on fd 9 (10.6.20.84:49158<->10.6.20.85:6642) at lib/stream-fd.c:157
> (34% CPU usage)
>   2020-07-31T23:08:12.777Z|04271|reconnect|DBG|tcp:10.6.20.85:6642:
> idle 5002 ms, sending inactivity probe
>   2020-07-31T23:08:12.777Z|04272|reconnect|DBG|tcp:10.6.20.85:6642:
> entering IDLE
>   2020-07-31T23:08:12.777Z|04273|jsonrpc|DBG|tcp:10.6.20.85:6642:
> send request, method="echo", params=[], id="echo"
>   2020-07-31T23:08:12.777Z|04274|jsonrpc|DBG|tcp:10.6.20.85:6642:
> received request, method="echo", params=[], id="echo"
>   2020-07-31T23:08:12.777Z|04275|reconnect|DBG|tcp:10.6.20.85:6642:
> entering ACTIVE
>   2020-07-31T23:08:12.777Z|04276|jsonrpc|DBG|tcp:10.6.20.85:6642:
> send reply, result=[], id="echo"
>   2020-07-31T23:08:13.635Z|04277|poll_loop|DBG|wakeup due to [POLLIN]
> on fd 9 (10.6.20.84:49158<->10.6.20.85:6642) at lib/stream-fd.c:157
> (34% CPU usage)
>   2020-07-31T23:08:13.635Z|04278|jsonrpc|DBG|tcp:10.6.20.85:6642:
> received reply, result=[], id="echo"
>   2020-07-31T23:08:14.480Z|04279|hmap|DBG|Dropped 129 log messages in
> last 5 seconds (most recently, 0 seconds ago) due to excessive rate
>   2020-07-31T23:08:14.480Z|04280|hmap|DBG|lib/shash.c:112: 2 buckets
> with 6+ nodes, including 2 buckets with 6 nodes (32 nodes total across
> 32 buckets)
>   2020-07-31T23:08:14.513Z|04281|poll_loop|DBG|wakeup due to 27-ms
> timeout at lib/reconnect.c:643 (34% CPU usage)
>   2020-07-31T23:08:14.513Z|04282|reconnect|DBG|tcp:10.6.20.84:6641:
> idle 5001 ms, sending inactivity probe
>   2020-07-31T23:08:14.513Z|04283|reconnect|DBG|tcp:10.6.20.84:6641:
> entering IDLE
>   2020-07-31T23:08:14.513Z|04284|jsonrpc|DBG|tcp:10.6.20.84:6641:
> send request, method="echo", params=[],

[ovs-discuss] [OVN] no response to inactivity probe

2020-08-03 Thread Tony Liu
Hi,

Neutron OVN ML2 driver was disconnected by ovn-nb-db. There are many error
messages from ovn-nb-db leader.

2020-08-04T02:31:39.751Z|03138|reconnect|ERR|tcp:10.6.20.81:58620: no response 
to inactivity probe after 5 seconds, disconnecting
2020-08-04T02:31:42.484Z|03139|reconnect|ERR|tcp:10.6.20.81:58300: no response 
to inactivity probe after 5 seconds, disconnecting
2020-08-04T02:31:49.858Z|03140|reconnect|ERR|tcp:10.6.20.81:59582: no response 
to inactivity probe after 5 seconds, disconnecting
2020-08-04T02:31:53.057Z|03141|reconnect|ERR|tcp:10.6.20.83:42626: no response 
to inactivity probe after 5 seconds, disconnecting
2020-08-04T02:31:53.058Z|03142|reconnect|ERR|tcp:10.6.20.82:45412: no response 
to inactivity probe after 5 seconds, disconnecting
2020-08-04T02:31:54.067Z|03143|reconnect|ERR|tcp:10.6.20.81:59416: no response 
to inactivity probe after 5 seconds, disconnecting
2020-08-04T02:31:54.809Z|03144|reconnect|ERR|tcp:10.6.20.81:60004: no response 
to inactivity probe after 5 seconds, disconnecting


Could anyone share some details on how this inactivity probe works?
From the OVN ML2 driver log, I see it connected to the leader, then the connection
was closed by the leader after 5 or 6 seconds. Is this probe one-way or two-way?
Neither side is busy or taking many CPU cycles. Not sure how this could
happen. Any thoughts?
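
For readers following the question, here is a minimal sketch of an inactivity
timer for one end of a connection. It is modelled only on the log lines quoted
in these threads ("idle 5002 ms, sending inactivity probe", "entering IDLE",
"entering ACTIVE", "no response to inactivity probe after 5 seconds,
disconnecting"). The `Probe` class and its methods are hypothetical names,
and the real reconnect logic has more states than shown.

```python
class Probe:
    """Toy inactivity-probe timer for one side of a connection."""

    def __init__(self, interval_ms: int = 5000) -> None:
        self.interval_ms = interval_ms
        self.state = "ACTIVE"
        self.last_activity_ms = 0  # last time data arrived or state changed

    def received(self, now_ms: int) -> None:
        # Any message from the peer, including an echo reply, counts.
        self.state = "ACTIVE"
        self.last_activity_ms = now_ms

    def tick(self, now_ms: int) -> str:
        idle = now_ms - self.last_activity_ms
        if self.state == "ACTIVE" and idle >= self.interval_ms:
            # Quiet for a full interval: send an echo request and wait.
            self.state = "IDLE"
            self.last_activity_ms = now_ms
            return "send echo"
        if self.state == "IDLE" and idle >= self.interval_ms:
            # Still nothing one interval after the probe: give up.
            self.state = "DISCONNECTED"
            return "disconnect"
        return "wait"


silent = Probe()
silent_actions = [silent.tick(5000), silent.tick(10000)]  # peer never replies

healthy = Probe()
healthy_actions = [healthy.tick(5000)]
healthy.received(5100)                                    # echo reply arrives
healthy_actions.append(healthy.tick(10000))
print(silent_actions, healthy_actions)
```

The ovn-northd debug log elsewhere in this archive shows the same process both
sending echo requests and answering them, which suggests each end runs a timer
like this independently.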


Thanks!

Tony





[ovs-discuss] [OVN] constraint violation error from Neutron OVN ML2 driver

2020-08-03 Thread Tony Liu
Hi,

Any clues about this error? It is reproducible, but not consistently.
There are 3 Neutron nodes and 3 OVN DB nodes (RAFT cluster).
It happened when connecting a network to a router with the OpenStack CLI.
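
The error text says the "name" column of Logical_Switch_Port is covered by a
unique index. As a minimal sketch of what such an index enforces (the `Table`
class is hypothetical and is not how ovsdb-server is implemented), inserting a
second row with an existing name is rejected while the first row stays as it
was:

```python
import uuid


class Table:
    """Toy table with a unique index on "name"; not the OVSDB implementation."""

    def __init__(self) -> None:
        self.rows = {}      # row UUID -> name
        self.by_name = {}   # name -> row UUID (the unique index)

    def insert(self, name: str) -> uuid.UUID:
        # Reject the whole insert if the name is already indexed.
        if name in self.by_name:
            raise ValueError(f"constraint violation: duplicate name {name!r}")
        row_id = uuid.uuid4()
        self.rows[row_id] = name
        self.by_name[name] = row_id
        return row_id


lsp = Table()
port_name = "6ee12e8e-f8e5-46e2-84b7-58b9dc7d9253"
first_row = lsp.insert(port_name)
try:
    lsp.insert(port_name)          # second create for the same port name
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True
print(duplicate_rejected, len(lsp.rows))
```

In the log, the first row already existed before the failing transaction, so
one possibility (an assumption, not something the log proves) is that the same
port was created twice, for example by a retry or by two workers.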

===
2020-08-03 12:30:17.054 22 ERROR ovsdbapp.backend.ovs_idl.transaction 
[req-acf33a39-f8b5-4b9f-91d7-0100f1e7c189 - - - - -] OVSDB Error: 
{"details":"Transaction causes multiple rows in \"Logical_Switch_Port\" table 
to have identical values (\"6ee12e8e-f8e5-46e2-84b7-58b9dc7d9253\") for index 
on column \"name\".  First row, with UUID b5c36d61-ad55-45bf-90f6-70f6649251c3, 
existed in the database before this transaction and was not modified by the 
transaction.  Second row, with UUID 5498876d-55f6-4364-8db7-f6d591dc9ba9, was 
inserted by this transaction.","error":"constraint violation"}
2020-08-03 12:30:17.055 22 ERROR ovsdbapp.backend.ovs_idl.transaction 
[req-8b5c1d93-ba88-4259-ab7e-08ee6c57a884 fb4212bf04404c15a19208ca920c1b1a 
3e9209736c7146bead16e02b0679f3a1 - default default] Traceback (most recent call 
last):
  File 
"/usr/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/connection.py", line 
122, in run
txn.results.put(txn.do_commit())
  File 
"/usr/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/transaction.py", 
line 118, in do_commit
raise RuntimeError(msg)
RuntimeError: OVSDB Error: {"details":"Transaction causes multiple rows in 
\"Logical_Switch_Port\" table to have identical values 
(\"6ee12e8e-f8e5-46e2-84b7-58b9dc7d9253\") for index on column \"name\".  First 
row, with UUID b5c36d61-ad55-45bf-90f6-70f6649251c3, existed in the database 
before this transaction and was not modified by the transaction.  Second row, 
with UUID 5498876d-55f6-4364-8db7-f6d591dc9ba9, was inserted by this 
transaction.","error":"constraint violation"}

2020-08-03 12:30:17.056 22 ERROR neutron.plugins.ml2.managers 
[req-8b5c1d93-ba88-4259-ab7e-08ee6c57a884 fb4212bf04404c15a19208ca920c1b1a 
3e9209736c7146bead16e02b0679f3a1 - default default] Mechanism driver 'ovn' 
failed in create_port_postcommit: RuntimeError: OVSDB Error: 
{"details":"Transaction causes multiple rows in \"Logical_Switch_Port\" table 
to have identical values (\"6ee12e8e-f8e5-46e2-84b7-58b9dc7d9253\") for index 
on column \"name\".  First row, with UUID b5c36d61-ad55-45bf-90f6-70f6649251c3, 
existed in the database before this transaction and was not modified by the 
transaction.  Second row, with UUID 5498876d-55f6-4364-8db7-f6d591dc9ba9, was 
inserted by this transaction.","error":"constraint violation"}
2020-08-03 12:30:17.056 22 ERROR neutron.plugins.ml2.managers Traceback (most 
recent call last):
2020-08-03 12:30:17.056 22 ERROR neutron.plugins.ml2.managers   File 
"/usr/lib/python3.6/site-packages/neutron/plugins/ml2/managers.py", line 477, 
in _call_on_drivers
2020-08-03 12:30:17.056 22 ERROR neutron.plugins.ml2.managers 
getattr(driver.obj, method_name)(context)
2020-08-03 12:30:17.056 22 ERROR neutron.plugins.ml2.managers   File 
"/usr/lib/python3.6/site-packages/neutron/plugins/ml2/drivers/ovn/mech_driver/mech_driver.py",
 line 544, in create_port_postcommit
2020-08-03 12:30:17.056 22 ERROR neutron.plugins.ml2.managers 
self._ovn_client.create_port(context._plugin_context, port)
2020-08-03 12:30:17.056 22 ERROR neutron.plugins.ml2.managers   File 
"/usr/lib/python3.6/site-packages/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovn_client.py",
 line 437, in create_port
2020-08-03 12:30:17.056 22 ERROR neutron.plugins.ml2.managers 
self._qos_driver.create_port(txn, port)
2020-08-03 12:30:17.056 22 ERROR neutron.plugins.ml2.managers   File 
"/usr/lib64/python3.6/contextlib.py", line 88, in __exit__
2020-08-03 12:30:17.056 22 ERROR neutron.plugins.ml2.managers next(self.gen)
2020-08-03 12:30:17.056 22 ERROR neutron.plugins.ml2.managers   File 
"/usr/lib/python3.6/site-packages/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/impl_idl_ovn.py",
 line 184, in transaction
2020-08-03 12:30:17.056 22 ERROR neutron.plugins.ml2.managers yield t
2020-08-03 12:30:17.056 22 ERROR neutron.plugins.ml2.managers   File 
"/usr/lib64/python3.6/contextlib.py", line 88, in __exit__
2020-08-03 12:30:17.056 22 ERROR neutron.plugins.ml2.managers next(self.gen)
2020-08-03 12:30:17.056 22 ERROR neutron.plugins.ml2.managers   File 
"/usr/lib/python3.6/site-packages/ovsdbapp/api.py", line 119, in transaction
2020-08-03 12:30:17.056 22 ERROR neutron.plugins.ml2.managers del 
self._nested_txns_map[cur_thread_id]
2020-08-03 12:30:17.056 22 ERROR neutron.plugins.ml2.managers   File 
"/usr/lib/python3.6/site-packages/ovsdbapp/api.py", line 69, in __exit__
2020-08-03 12:30:17.056 22 ERROR neutron.plugins.ml2.managers self.result = 
self.commit()
2020-08-03 12:30:17.056 22 ERROR neutron.plugins.ml2.managers   File 
"/usr/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/transaction.py", 
line 62, in commit
2020-08-03 12:30:17.056 22 ERROR neutron.plugins.ml2.managers raise 

Re: [ovs-discuss] [OVN] ovn-northd HA

2020-08-01 Thread Tony Liu
When I restore 4096 LS, 4354 LSP, 256 LR and 256 LRP (I clean up
all DBs before the restore), it takes a few seconds to restore the nb-db.
But ovn-northd takes forever to update sb-db.

I changed sb-db election timer from 1s to 10s. Then it takes just a
few minutes for sb-db to get fully synced.
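
A side note on the 1s to 10s change: as far as I recall, ovsdb-server refuses
to raise the election timer to more than twice its current value in a single
`cluster/change-election-timer` call, so a large increase takes several
steps. Treat that limit as an assumption to check against the ovsdb-server
documentation. The helper below is a hypothetical sketch that only computes
the intermediate values for an increase; it does not talk to any server.

```python
from typing import List


def election_timer_steps(current_ms: int, target_ms: int) -> List[int]:
    """Return successive timer values for raising current_ms to target_ms.

    Assumes each step may at most double the current value. Returns an
    empty list when no increase is needed.
    """
    if current_ms <= 0 or target_ms <= 0:
        raise ValueError("timer values must be positive")
    steps = []
    while current_ms < target_ms:
        current_ms = min(current_ms * 2, target_ms)
        steps.append(current_ms)
    return steps


steps = election_timer_steps(1000, 10000)
print(steps)
```

Each value in the returned list would be applied in order, waiting for the
previous change to be committed before issuing the next one.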

How does an sb-db leader switch affect this sync?


Thanks!

Tony

> -----Original Message-----
> From: dev  On Behalf Of Tony Liu
> Sent: Saturday, August 1, 2020 5:26 PM
> To: ovs-discuss ; ovs-dev  d...@openvswitch.org>
> Subject: [ovs-dev] [OVN] ovn-northd HA
> 
> Hi,
> 
> I have a few questions about ovn-northd HA.
> 
> Does the lock for active ovn-northd have to be acquired from the leader
> of sb-db?
> 
> If ovn-northd didn't acquire the lock, it becomes standby. Does it keep
> trying to acquire the lock, or wait for notification, or monitor the
> active ovn-northd?
> 
> If it keeps trying, what's the period?
> 
> Say the active ovn-northd is down, the connection to sb-db is down, sb-
> db releases the lock, so another ovn-northd can acquire it.
> Is that correct?
> 
> When sb-db is busy, the connection from ovn-northd is dropped. Not sure
> from which side it's dropped. And that triggers active ovn-northd switch.
> Is that right?
> 
> In case that sb-db leader switchs, is that going to cause active ovn-
> northd switch as well?
> 
> For whatever reason, in case active ovn-northd switches, is the new
> active ovn-northd going to continue the work left by the previous leader,
> or start all over again?
> 
> 
> Thanks!
> 
> Tony
> 


[ovs-discuss] [OVN] ovn-northd HA

2020-08-01 Thread Tony Liu
Hi,

I have a few questions about ovn-northd HA.

Does the lock for active ovn-northd have to be acquired from the leader
of sb-db?

If ovn-northd didn't acquire the lock, it becomes standby. Does it keep
trying to acquire the lock, or wait for notification, or monitor the
active ovn-northd?

If it keeps trying, what's the period?

Say the active ovn-northd is down, the connection to sb-db is down,
sb-db releases the lock, so another ovn-northd can acquire it.
Is that correct?

When sb-db is busy, the connection from ovn-northd is dropped. Not sure
from which side it's dropped. And that triggers an active ovn-northd switch.
Is that right?

In case the sb-db leader switches, is that going to cause an active
ovn-northd switch as well?

For whatever reason, in case active ovn-northd switches, is the new
active ovn-northd going to continue the work left by the previous leader,
or start all over again?
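
As background for the questions above: the OVSDB protocol (RFC 7047) defines
named locks. A client that asks for a lock that is already held is queued, and
the server sends it a "locked" notification once the lock becomes available,
so a waiting client does not need to poll. The sketch below models that queue
for a single lock. The `LockServer` class is hypothetical, and whether
ovn-northd's active/standby election behaves exactly this way in every
failure case is not established by this thread.

```python
from collections import deque
from typing import Deque, Optional


class LockServer:
    """Toy model of one named lock on a database server."""

    def __init__(self) -> None:
        self.owner: Optional[str] = None
        self.waiters: Deque[str] = deque()

    def lock(self, client: str) -> bool:
        # Grant immediately if free, otherwise queue the request.
        if self.owner is None:
            self.owner = client
            return True
        self.waiters.append(client)
        return False

    def disconnect(self, client: str) -> Optional[str]:
        # A dropped session gives up its lock or its place in the queue.
        if client in self.waiters:
            self.waiters.remove(client)
        if self.owner == client:
            self.owner = self.waiters.popleft() if self.waiters else None
            return self.owner  # this client would be notified "locked"
        return None


server = LockServer()
got_first = server.lock("northd-1")      # becomes active
got_second = server.lock("northd-2")     # queued, stays standby
promoted = server.disconnect("northd-1") # active instance goes away
print(got_first, got_second, promoted)
```

In this model the standby becomes active as soon as the previous owner's
session ends, without any retry period.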


Thanks!

Tony



[ovs-discuss] [OVN] ovn-northd takes much CPU when no configuration update

2020-07-31 Thread Tony Liu
Hi,

I see the active ovn-northd takes much CPU (30% - 100%) when there is no
configuration from OpenStack, nothing happening on all chassis nodes either.

Is this expected? What is it busy with?


2020-07-31T23:08:09.511Z|04267|poll_loop|DBG|wakeup due to [POLLIN] on fd 8 (10.6.20.84:44358<->10.6.20.84:6641) at lib/stream-fd.c:157 (68% CPU usage)
2020-07-31T23:08:09.512Z|04268|jsonrpc|DBG|tcp:10.6.20.84:6641: received request, method="echo", params=[], id="echo"
2020-07-31T23:08:09.512Z|04269|jsonrpc|DBG|tcp:10.6.20.84:6641: send reply, result=[], id="echo"
2020-07-31T23:08:12.777Z|04270|poll_loop|DBG|wakeup due to [POLLIN] on fd 9 (10.6.20.84:49158<->10.6.20.85:6642) at lib/stream-fd.c:157 (34% CPU usage)
2020-07-31T23:08:12.777Z|04271|reconnect|DBG|tcp:10.6.20.85:6642: idle 5002 ms, sending inactivity probe
2020-07-31T23:08:12.777Z|04272|reconnect|DBG|tcp:10.6.20.85:6642: entering IDLE
2020-07-31T23:08:12.777Z|04273|jsonrpc|DBG|tcp:10.6.20.85:6642: send request, method="echo", params=[], id="echo"
2020-07-31T23:08:12.777Z|04274|jsonrpc|DBG|tcp:10.6.20.85:6642: received request, method="echo", params=[], id="echo"
2020-07-31T23:08:12.777Z|04275|reconnect|DBG|tcp:10.6.20.85:6642: entering ACTIVE
2020-07-31T23:08:12.777Z|04276|jsonrpc|DBG|tcp:10.6.20.85:6642: send reply, result=[], id="echo"
2020-07-31T23:08:13.635Z|04277|poll_loop|DBG|wakeup due to [POLLIN] on fd 9 (10.6.20.84:49158<->10.6.20.85:6642) at lib/stream-fd.c:157 (34% CPU usage)
2020-07-31T23:08:13.635Z|04278|jsonrpc|DBG|tcp:10.6.20.85:6642: received reply, result=[], id="echo"
2020-07-31T23:08:14.480Z|04279|hmap|DBG|Dropped 129 log messages in last 5 seconds (most recently, 0 seconds ago) due to excessive rate
2020-07-31T23:08:14.480Z|04280|hmap|DBG|lib/shash.c:112: 2 buckets with 6+ nodes, including 2 buckets with 6 nodes (32 nodes total across 32 buckets)
2020-07-31T23:08:14.513Z|04281|poll_loop|DBG|wakeup due to 27-ms timeout at lib/reconnect.c:643 (34% CPU usage)
2020-07-31T23:08:14.513Z|04282|reconnect|DBG|tcp:10.6.20.84:6641: idle 5001 ms, sending inactivity probe
2020-07-31T23:08:14.513Z|04283|reconnect|DBG|tcp:10.6.20.84:6641: entering IDLE
2020-07-31T23:08:14.513Z|04284|jsonrpc|DBG|tcp:10.6.20.84:6641: send request, method="echo", params=[], id="echo"
2020-07-31T23:08:15.370Z|04285|poll_loop|DBG|wakeup due to [POLLIN] on fd 8 (10.6.20.84:44358<->10.6.20.84:6641) at lib/stream-fd.c:157 (34% CPU usage)
2020-07-31T23:08:15.370Z|04286|jsonrpc|DBG|tcp:10.6.20.84:6641: received request, method="echo", params=[], id="echo"
2020-07-31T23:08:15.370Z|04287|reconnect|DBG|tcp:10.6.20.84:6641: entering ACTIVE
2020-07-31T23:08:15.370Z|04288|jsonrpc|DBG|tcp:10.6.20.84:6641: send reply, result=[], id="echo"
2020-07-31T23:08:16.236Z|04289|poll_loop|DBG|wakeup due to 0-ms timeout at tcp:10.6.20.84:6641 (100% CPU usage)
2020-07-31T23:08:16.236Z|04290|jsonrpc|DBG|tcp:10.6.20.84:6641: received reply, result=[], id="echo"
2020-07-31T23:08:17.778Z|04291|poll_loop|DBG|wakeup due to [POLLIN] on fd 9 (10.6.20.84:49158<->10.6.20.85:6642) at lib/stream-fd.c:157 (100% CPU usage)
2020-07-31T23:08:17.778Z|04292|jsonrpc|DBG|tcp:10.6.20.85:6642: received request, method="echo", params=[], id="echo"
2020-07-31T23:08:17.778Z|04293|jsonrpc|DBG|tcp:10.6.20.85:6642: send reply, result=[], id="echo"
2020-07-31T23:08:20.372Z|04294|poll_loop|DBG|wakeup due to [POLLIN] on fd 8 (10.6.20.84:44358<->10.6.20.84:6641) at lib/stream-fd.c:157 (41% CPU usage)
2020-07-31T23:08:20.372Z|04295|reconnect|DBG|tcp:10.6.20.84:6641: idle 5002 ms, sending inactivity probe
2020-07-31T23:08:20.372Z|04296|reconnect|DBG|tcp:10.6.20.84:6641: entering IDLE
2020-07-31T23:08:20.372Z|04297|jsonrpc|DBG|tcp:10.6.20.84:6641: send request, method="echo", params=[], id="echo"
2020-07-31T23:08:20.372Z|04298|jsonrpc|DBG|tcp:10.6.20.84:6641: received request, method="echo", params=[], id="echo"
2020-07-31T23:08:20.372Z|04299|reconnect|DBG|tcp:10.6.20.84:6641: entering ACTIVE
2020-07-31T23:08:20.372Z|04300|jsonrpc|DBG|tcp:10.6.20.84:6641: send reply, result=[], id="echo"
2020-07-31T23:08:20.376Z|04301|hmap|DBG|Dropped 181 log messages in last 6 seconds (most recently, 1 seconds ago) due to excessive rate
2020-07-31T23:08:20.376Z|04302|hmap|DBG|northd/ovn-northd.c:595: 2 buckets with 6+ nodes, including 2 buckets with 6 nodes (256 nodes total across 256 buckets)
2020-07-31T23:08:21.222Z|04303|poll_loop|DBG|wakeup due to [POLLIN] on fd 8 (10.6.20.84:44358<->10.6.20.84:6641) at lib/stream-fd.c:157 (41% CPU usage)
2020-07-31T23:08:21.223Z|04304|jsonrpc|DBG|tcp:10.6.20.84:6641: received reply, result=[], id="echo"
2020-07-31T23:08:22.779Z|04305|poll_loop|DBG|wakeup due to 706-ms timeout at lib/reconnect.c:643 (41% CPU usage)
2020-07-31T23:08:22.779Z|04306|reconnect|DBG|tcp:10.6.20.85:6642: idle 5001 ms, sending inactivity probe
2020-07-31T23:08:22.779Z|04307|reconnect|DBG|tcp:10.6.20.85:6642: entering IDLE
2020-07-31T23:08:22.779Z|04308|jsonrpc|DBG|tcp:10.6.20.85:6642: 
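
One way to get an overview of a debug log like the one above is to pull out only the poll_loop wakeup lines and tally what woke the process and the CPU figure each line reports. The following is a minimal sketch, not part of OVN: the function name and the two cause labels are made up for illustration, and the sample lines are copied from the log in this message. It does not explain what ovn-northd is busy with; it only summarizes the lines.

```python
import re
from collections import Counter

# Matches poll_loop wakeup entries even when the mail client wrapped
# them over two lines (\s+ also matches a newline).
WAKEUP = re.compile(
    r"\|poll_loop\|DBG\|wakeup due to\s+(?P<cause>.+?)\s+at\s+(?P<where>\S+)"
    r"\s+\((?P<cpu>\d+)% CPU usage\)",
    re.S,
)


def summarize_wakeups(log_text):
    """Return (Counter of wakeup causes, list of reported CPU percentages)."""
    causes = Counter()
    cpu = []
    for match in WAKEUP.finditer(log_text):
        # Hypothetical labels: anything mentioning "timeout" vs. the rest.
        label = "timeout" if "timeout" in match.group("cause") else "socket activity"
        causes[label] += 1
        cpu.append(int(match.group("cpu")))
    return causes, cpu


SAMPLE = """\
2020-07-31T23:08:09.511Z|04267|poll_loop|DBG|wakeup due to [POLLIN] on fd 8
(10.6.20.84:44358<->10.6.20.84:6641) at lib/stream-fd.c:157 (68% CPU usage)
2020-07-31T23:08:14.513Z|04281|poll_loop|DBG|wakeup due to 27-ms timeout at
lib/reconnect.c:643 (34% CPU usage)
2020-07-31T23:08:16.236Z|04289|poll_loop|DBG|wakeup due to 0-ms timeout at
tcp:10.6.20.84:6641 (100% CPU usage)
"""

causes, cpu = summarize_wakeups(SAMPLE)
print(dict(causes), cpu)
```

Feeding the full log file to summarize_wakeups() instead of SAMPLE gives the same two summaries for the whole capture.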

Re: [ovs-discuss] [OVN] DB backup and restore

2020-07-31 Thread Tony Liu
Thanks Han! It's clear!

Tony

> -Original Message-
> From: Han Zhou 
> Sent: Friday, July 31, 2020 10:11 AM
> To: Tony Liu 
> Cc: Han Zhou ; Numan Siddique ; ovs-
> dev ; ovs-discuss 
> Subject: Re: [ovs-discuss] [OVN] DB backup and restore
> 
> 
> 
> On Thu, Jul 30, 2020 at 11:07 PM Tony Liu <tonyliu0...@hotmail.com> wrote:
> 
> 
>   Hi Han,
> 
>   ovsdb-client backup and restore work as expected. Sorry for the false
>   alarm. I had messed up the container. When the snapshot is restored for
>   nb-db, sb-db is updated accordingly by ovn-northd.
> 
> 
> 
> Great!
> 
> 
> 
>   I think this man page should be updated to say that RAFT clusters are
>   also supported by backup and restore.
>   http://www.openvswitch.org/support/dist-docs/ovsdb-client.1.txt
> 
> 
> 
> I didn't see it saying RAFT cluster is not supported in the above
> document. Probably you misunderstood this statement:
> "Reads  snapshot,  which  must  be  a OVSDB standalone or active-backup
> database"
> The backup file you generated from ovsdb-client backup command is in
> OVSDB standalone format, which is mentioned in the "backup" document.
> 
> 
> In addition, the document ovsdb(7) also made it clear that this is the
> right way to backup/restore clustered DB.
> Did this clarify?
> 
> 
> 
>   Thanks!
> 
>   Tony
>   > -Original Message-
>   > From: Han Zhou <hz...@ovn.org>
>   > Sent: Thursday, July 30, 2020 7:19 PM
>   > To: Tony Liu <tonyliu0...@hotmail.com>
>   > Cc: Numan Siddique <nusid...@redhat.com>; Han Zhou <hz...@ovn.org>;
>   > ovs-dev <ovs-...@openvswitch.org>; ovs-discuss <ovs-disc...@openvswitch.org>
>   > Subject: Re: [ovs-discuss] [OVN] DB backup and restore
>   >
>   > On Thu, Jul 30, 2020 at 7:04 PM Tony Liu <tonyliu0...@hotmail.com> wrote:
>   >
>   >   Hi,
>   >
>   >   Just an update: I finally made this snapshot/rollback work.
>   >   The rollback is not live though. Here is what I did.
>   >
>   >   1. Make a snapshot with ovsdb-client, assuming there are no ongoing
>   >      transactions and the data is consistent on all nodes. The
>   >      snapshot can be taken on any node. It doesn't include any
>   >      cluster info. That's probably why the man page says this is
>   >      for standalone and A/B only. But that cluster info doesn't seem
>   >      to be required for a restore.
>   >
>   >   2. To roll back/restore, stop services on all nodes, starting
>   >      with the followers and ending with the leader.
>   >
>   >   3. Pick a node as the new leader and copy the snapshot to be the DB
>   >      file. Then start the service. A cluster with a new cluster ID
>   >      will be created. The node will be allocated a new server ID
>   >      as well.
>   >
>   >   4. On the other two nodes, remove the DB file and restart the service
>   >      with remote-address pointing to the leader.
>   >
>   >   Now the new cluster starts working with the rolled-back data.
>   >
>   >
>   > The steps you gave may work, but it is weird. It is better to just
>   > follow the steps mentioned in this section:
>   >
>   > https://github.com/openvswitch/ovs/blob/master/Documentation/ref/ovsdb.7.rst#backing-up-and-restoring-a-database
>   >
>   >
>   >
>   >
>   >
>   >
>   >   "ovsdb-client restore" doesn't work for me, not sure why.
>   >
>   >   ovsdb-client: ovsdb error: /dev/stdin: cannot identify file type
>   >
>   >   I tried to restore the snapshot cre

Re: [ovs-discuss] [OVN] DB backup and restore

2020-07-31 Thread Tony Liu
Hi Han,

ovsdb-client backup and restore work as expected. Sorry for the false alarm.
I had messed up the container. When the snapshot is restored for nb-db, sb-db
is updated accordingly by ovn-northd.
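
For reference, the backup/restore round trip can be sketched as follows. This is only an illustration under stated assumptions: the helper names are made up, the server address is the 6641 endpoint that shows up in the ovn-northd log elsewhere in this archive (used here purely as an example), and the run_* helpers are not called because they need a live ovsdb-server. The command forms are "ovsdb-client backup [server] [database] > snapshot" and "ovsdb-client restore [server] [database] < snapshot" as given in ovsdb-client(1).

```python
import subprocess


def backup_cmd(server, database):
    """argv for `ovsdb-client backup`; the snapshot is written to stdout."""
    return ["ovsdb-client", "backup", server, database]


def restore_cmd(server, database):
    """argv for `ovsdb-client restore`; the snapshot is read from stdin."""
    return ["ovsdb-client", "restore", server, database]


def run_backup(server, database, snapshot_path):
    """Redirect stdout into the snapshot file (not executed in this sketch)."""
    with open(snapshot_path, "wb") as out:
        subprocess.run(backup_cmd(server, database), stdout=out, check=True)


def run_restore(server, database, snapshot_path):
    """Feed the snapshot file on stdin (not executed in this sketch)."""
    with open(snapshot_path, "rb") as src:
        subprocess.run(restore_cmd(server, database), stdin=src, check=True)


# Illustrative values only; adjust to the deployment.
NB_SERVER = "tcp:10.6.20.84:6641"
print(" ".join(backup_cmd(NB_SERVER, "OVN_Northbound")), "> nb-snapshot.db")
print(" ".join(restore_cmd(NB_SERVER, "OVN_Northbound")), "< nb-snapshot.db")
```

Only nb-db is restored here; as noted above, sb-db is then brought in line by ovn-northd.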

I think this man page should be updated to say that RAFT clusters are also
supported by backup and restore.
http://www.openvswitch.org/support/dist-docs/ovsdb-client.1.txt


Thanks!

Tony
> -Original Message-
> From: Han Zhou 
> Sent: Thursday, July 30, 2020 7:19 PM
> To: Tony Liu 
> Cc: Numan Siddique ; Han Zhou ; ovs-
> dev ; ovs-discuss 
> Subject: Re: [ovs-discuss] [OVN] DB backup and restore
> 
> 
> 
> On Thu, Jul 30, 2020 at 7:04 PM Tony Liu <tonyliu0...@hotmail.com> wrote:
> 
> 
>   Hi,
>
>   Just an update: I finally made this snapshot/rollback work.
>   The rollback is not live though. Here is what I did.
>
>   1. Make a snapshot with ovsdb-client, assuming there are no ongoing
>      transactions and the data is consistent on all nodes. The
>      snapshot can be taken on any node. It doesn't include any
>      cluster info. That's probably why the man page says this is
>      for standalone and A/B only. But that cluster info doesn't seem
>      to be required for a restore.
>
>   2. To roll back/restore, stop services on all nodes, starting
>      with the followers and ending with the leader.
>
>   3. Pick a node as the new leader and copy the snapshot to be the DB
>      file. Then start the service. A cluster with a new cluster ID
>      will be created. The node will be allocated a new server ID
>      as well.
>
>   4. On the other two nodes, remove the DB file and restart the service
>      with remote-address pointing to the leader.
>
>   Now the new cluster starts working with the rolled-back data.
> 
> 
> The steps you gave may work, but it is weird. It is better to just
> follow the steps mentioned in this section:
> 
> https://github.com/openvswitch/ovs/blob/master/Documentation/ref/ovsdb.7.rst#backing-up-and-restoring-a-database
> 
> 
> 
> 
> 
> 
>   "ovsdb-client restore" doesn't work for me, not sure why.
>
>   ovsdb-client: ovsdb error: /dev/stdin: cannot identify file type
>
>   I tried to restore the snapshot created by backup, and also the
>   directly copied DB file; neither of them works. Has anyone
>   experienced this issue?
> 
> 
> 
> Maybe your command was wrong. Could you share your command line, and the
> version used?
> 
> 
> 
> 
> 
>   To Numan, it would be great if you could share the details of how to use
>   neutron-ovn-db-sync-util.
> 
> 
> 
> 
> 
>   Thanks!
> 
> 
> 
>   Tony
> 
> 
> 
>   From: Tony Liu <mailto:tonyliu0...@hotmail.com>
>   Sent: Thursday, July 30, 2020 4:51 PM
>   To: Numan Siddique <mailto:nusid...@redhat.com> ; Han Zhou
> <mailto:hz...@ovn.org>
>   Cc: Han Zhou <mailto:hz...@ovn.org> ; ovs-dev <mailto:ovs-
> d...@openvswitch.org> ; ovs-discuss <mailto:ovs-discuss@openvswitch.org>
>   Subject: Re: [ovs-dev] [ovs-discuss] [OVN] DB backup and restore
> 
> 
> 
>   Hi Numan,
> 
>   I found this comment you made a few years back.
> 
>   - At neutron-server startup, OVN ML2 driver syncs the neutron
>   DB and OVN DB if sync mode is set to repair.
>   - Admin can run the "neutron-ovn-db-sync-util" to sync the DBs.
> 
>   Could you share the details to try those two options?
> 
> 
>   Thanks!
> 
>   Tony
> 
>   From: Tony Liu<mailto:tonyliu0...@hotmail.com>
>   Sent: Thursday, July 30, 2020 4:38 PM
>   To: Han Zhou<mailto:hz...@ovn.org>
>   Cc: Han Zhou<mailto:hz...@ovn.org>; ovs-dev<mailto:ovs-
> d...@openvswitch.org>; ovs-discuss<mailto:ovs-discuss@openvswitch.org>
>   Subject: Re: [ovs-dev] [ovs-discuss] [OVN] DB backup and restore
> 
>   Hi,
> 
>   I have another thought after some digging. Since I am with
>   OpenStack, all networking configuration comes from OpenStack.
>   I could snapshot the OpenStack MariaDB, restore it and run
>   neutron-ovn-db-sync-util to update the OVN DB. Would that be a cleaner
>   solution?
>
>   BTW, I got this error when restoring the OVN DB.
>   ovsdb-client: ovsdb error: /dev/stdin: cannot identify file type
> 
>   The file was created by "backup" command.
> 
> 
>   Thanks!
>

Re: [ovs-discuss] [ovs-dev] OVN: Two datapath-bindings are created for the same logical-switch

2020-07-30 Thread Tony Liu
I agree, that will stop the duplication from being created.


Thanks!

Tony

From: Han Zhou<mailto:zhou...@gmail.com>
Sent: Thursday, July 30, 2020 7:30 PM
To: Tony Liu<mailto:tonyliu0...@hotmail.com>
Cc: Ben Pfaff<mailto:b...@ovn.org>; ovs-dev<mailto:ovs-...@openvswitch.org>; 
ovs-discuss@openvswitch.org<mailto:ovs-discuss@openvswitch.org>
Subject: Re: [ovs-dev] OVN: Two datapath-bindings are created for the same 
logical-switch



On Thu, Jul 30, 2020 at 7:24 PM Tony Liu <tonyliu0...@hotmail.com> wrote:
Hi Han,

Continuing with this thread, regarding your comment in another thread:
===
2) OVSDB clients usually monitor and sync all (interested) data from the server
to local, so when they do declarative processing, they could correct problems
by themselves. In fact, ovn-northd does the check and deletes duplicated
datapaths. I did a simple test and it did clean up by itself:
2020-07-30T18:55:53.057Z|6|ovn_northd|INFO|ovn-northd lock acquired. This 
ovn-northd instance is now active.
2020-07-30T19:02:10.465Z|7|ovn_northd|INFO|deleting Datapath_Binding 
abef9503-445e-4a52-ae88-4c826cbad9d6 with duplicate 
external-ids:logical-switch/router ee80c38b-2016-4cbc-9437-f73e3a59369e

I am not sure why in your case northd was stuck, but I agree there must be
something wrong. Please collect northd logs if you encounter this again so we
can dig further.
===

You are right that ovn-northd will try to clean up the duplication, but
there are ports in port-binding referencing this datapath-binding, so
ovn-northd fails to delete the datapath-binding. I have to manually delete
those ports to be able to delete the datapath-binding. I believe it’s not
supported for ovn-northd to delete a configuration that is being
referenced. Is that right? If yes, should we fix it, or is it intended?


Yes, good point!
It is definitely a bug and we should fix it. I think the best fix is to change
the schema and add "logical_datapath" as an index, but we'll need to make it
backward compatible to avoid upgrade issues.
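
To illustrate why an index would stop the duplicates: in an OVSDB schema, an index on a column means no two rows of the table may share a value in that column, so the second insert is rejected instead of silently creating a twin row. The following is a toy model in plain Python, not OVSDB code; the class is hypothetical, the column name is the one suggested above, and the UUID is the one from the northd log quoted in this thread.

```python
class IndexedTable:
    """Toy table that rejects a second row with the same indexed value."""

    def __init__(self, index_column):
        self.index_column = index_column
        self.rows = {}

    def insert(self, row):
        key = row[self.index_column]
        if key in self.rows:
            # Stand-in for the constraint violation a real server would report.
            raise ValueError(f"duplicate {self.index_column}: {key}")
        self.rows[key] = dict(row)


bindings = IndexedTable("logical_datapath")
ls_uuid = "ee80c38b-2016-4cbc-9437-f73e3a59369e"

bindings.insert({"logical_datapath": ls_uuid, "tunnel_key": 1})
try:
    # A second binding for the same logical switch, with another tunnel key.
    bindings.insert({"logical_datapath": ls_uuid, "tunnel_key": 2})
    rejected = False
except ValueError:
    rejected = True

print(len(bindings.rows), rejected)
```

With the constraint enforced on the server side, the cleanup problem with referenced rows would not arise in the first place, since the duplicate never gets committed.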


Thanks!

Tony

From: Tony Liu<mailto:tonyliu0...@hotmail.com>
Sent: Thursday, July 23, 2020 7:51 PM
To: Han Zhou<mailto:zhou...@gmail.com>; Ben Pfaff<mailto:b...@ovn.org>
Cc: ovs-dev<mailto:ovs-...@openvswitch.org>; 
ovs-discuss@openvswitch.org<mailto:ovs-discuss@openvswitch.org>
Subject: Re: [ovs-discuss] [ovs-dev] OVN: Two datapath-bindings are created for 
the same logical-switch

Hi Han,

Thanks for taking the time to look into this. This problem is not consistently
reproducible. Developers normally ignore it :) I think we collected enough
context and we can let it go for now. I will rebuild the setup, tune that RAFT
heartbeat timer and rerun the test. Will keep you posted.


Thanks again!

Tony


From: Han Zhou <zhou...@gmail.com>
Sent: July 23, 2020 06:53 PM
To: Tony Liu <tonyliu0...@hotmail.com>; Ben Pfaff <b...@ovn.org>
Cc: Numan Siddique <num...@ovn.org>; ovs-dev <ovs-...@openvswitch.org>;
ovs-discuss@openvswitch.org <ovs-discuss@openvswitch.org>
Subject: Re: [ovs-dev] OVN: Two datapath-bindings are created for the same 
logical-switch


On Thu, Jul 23, 2020 at 10:33 AM Tony Liu <tonyliu0...@hotmail.com> wrote:
>
> Changed the title for this specific problem.
> I looked into logs and have more findings.
> The problem was happening when sb-db leader switched.

Hi Tony,

Thanks for this detailed information. Could you confirm which version of OVS is
used (to understand OVSDB behavior)?

>
> For ovsdb cluster, what may trigger the leader switch? Given the log,
> 2020-07-21T01:08:38.119Z|00074|raft|INFO|term 2: 1135 ms timeout expired, 
> starting election
> The election is requested by a follower node. Is that because the connection
> from follower to leader timed out, so the follower assumes the leader is dead
> and starts an election?

You are right, the RAFT heartbeat would time out when the server is too busy and
the election timer is too small (default 1s). For a large-scale test, please
increase the election timer by:
ovn-appctl -t  cluster/change-election-timer OVN_Southbound 

I suggest to set  to be at least bigger than 1 or more in your case. 
(you need to increase the value gradually - 2000, 4000, 8000, 16000 - so it 
will take you 4 commands to reach this from the initial default value 1000, not 
very convenient, I know :)

 here is the path to the socket ctl file of ovn-sb, usually under 
/var/run/ovn.
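
The gradual increase described above can be worked out mechanically. The sketch below assumes that each change may at most double the current value, which is what the 2000, 4000, 8000, 16000 sequence implies; the function names are made up, and the control socket path is only a hypothetical example of a file under /var/run/ovn.

```python
def election_timer_steps(current_ms, target_ms):
    """Values to request in order, at most doubling each time."""
    if current_ms <= 0:
        raise ValueError("current_ms must be positive")
    steps = []
    value = current_ms
    while value < target_ms:
        value = min(value * 2, target_ms)
        steps.append(value)
    return steps


def timer_commands(ctl_path, database, steps):
    """Render one ovn-appctl invocation per step."""
    return [
        f"ovn-appctl -t {ctl_path} cluster/change-election-timer {database} {ms}"
        for ms in steps
    ]


steps = election_timer_steps(1000, 16000)
print(steps)  # [2000, 4000, 8000, 16000]

# Hypothetical socket path; use the actual ctl file of the ovn-sb server.
for line in timer_commands("/var/run/ovn/ovnsb_db.ctl", "OVN_Southbound", steps):
    print(line)
```

This reproduces the four commands mentioned above when starting from the default of 1000 ms.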

>
> For ovn-northd (3 instances), they all connect to the sb-db leader; whoever
> has the lock is the master.
> When sb-db leader switches, all ovn-northd instances look for the new leader. 
> In this case, there is no
> guarantee that the old ovn-northd master remains the role, other ovn-nort

Re: [ovs-discuss] [ovs-dev] OVN: Two datapath-bindings are created for the same logical-switch

2020-07-30 Thread Tony Liu
Hi Han,

Continuing with this thread, regarding your comment in another thread:
===
2) OVSDB clients usually monitor and sync all (interested) data from the server
to local, so when they do declarative processing, they could correct problems
by themselves. In fact, ovn-northd does the check and deletes duplicated
datapaths. I did a simple test and it did clean up by itself:
2020-07-30T18:55:53.057Z|6|ovn_northd|INFO|ovn-northd lock acquired. This 
ovn-northd instance is now active.
2020-07-30T19:02:10.465Z|7|ovn_northd|INFO|deleting Datapath_Binding 
abef9503-445e-4a52-ae88-4c826cbad9d6 with duplicate 
external-ids:logical-switch/router ee80c38b-2016-4cbc-9437-f73e3a59369e

I am not sure why in your case northd was stuck, but I agree there must be
something wrong. Please collect northd logs if you encounter this again so we
can dig further.
===

You are right that ovn-northd will try to clean up the duplication, but
there are ports in port-binding referencing this datapath-binding, so
ovn-northd fails to delete the datapath-binding. I have to manually delete
those ports to be able to delete the datapath-binding. I believe it’s not
supported for ovn-northd to delete a configuration that is being
referenced. Is that right? If yes, should we fix it, or is it intended?


Thanks!

Tony

From: Tony Liu<mailto:tonyliu0...@hotmail.com>
Sent: Thursday, July 23, 2020 7:51 PM
To: Han Zhou<mailto:zhou...@gmail.com>; Ben Pfaff<mailto:b...@ovn.org>
Cc: ovs-dev<mailto:ovs-...@openvswitch.org>; 
ovs-discuss@openvswitch.org<mailto:ovs-discuss@openvswitch.org>
Subject: Re: [ovs-discuss] [ovs-dev] OVN: Two datapath-bindings are created for 
the same logical-switch

Hi Han,

Thanks for taking the time to look into this. This problem is not consistently
reproducible. Developers normally ignore it :) I think we collected enough
context and we can let it go for now. I will rebuild the setup, tune that RAFT
heartbeat timer and rerun the test. Will keep you posted.


Thanks again!

Tony


From: Han Zhou 
Sent: July 23, 2020 06:53 PM
To: Tony Liu ; Ben Pfaff 
Cc: Numan Siddique ; ovs-dev ; 
ovs-discuss@openvswitch.org 
Subject: Re: [ovs-dev] OVN: Two datapath-bindings are created for the same 
logical-switch


On Thu, Jul 23, 2020 at 10:33 AM Tony Liu <tonyliu0...@hotmail.com> wrote:
>
> Changed the title for this specific problem.
> I looked into logs and have more findings.
> The problem was happening when sb-db leader switched.

Hi Tony,

Thanks for this detailed information. Could you confirm which version of OVS is
used (to understand OVSDB behavior)?

>
> For ovsdb cluster, what may trigger the leader switch? Given the log,
> 2020-07-21T01:08:38.119Z|00074|raft|INFO|term 2: 1135 ms timeout expired, 
> starting election
> The election is requested by a follower node. Is that because the connection
> from follower to leader timed out, so the follower assumes the leader is dead
> and starts an election?

You are right, the RAFT heartbeat would time out when the server is too busy and
the election timer is too small (default 1s). For a large-scale test, please
increase the election timer by:
ovn-appctl -t  cluster/change-election-timer OVN_Southbound 

I suggest to set  to be at least bigger than 1 or more in your case. 
(you need to increase the value gradually - 2000, 4000, 8000, 16000 - so it 
will take you 4 commands to reach this from the initial default value 1000, not 
very convenient, I know :)

 here is the path to the socket ctl file of ovn-sb, usually under 
/var/run/ovn.

>
> For ovn-northd (3 instances), they all connect to the sb-db leader; whoever
> has the lock is the master.
> When the sb-db leader switches, all ovn-northd instances look for the new
> leader. In this case, there is no guarantee that the old ovn-northd master
> keeps the role; another ovn-northd instance may find the leader and acquire
> the lock first. So the sb-db leader switch may also cause an ovn-northd master
> switch.
> Such a switch may happen in the middle of an ovn-northd transaction. In that
> case, is there any guarantee of transaction completeness? My guess is that the
> old master created a datapath-binding for a logical-switch, the switch happened
> before this transaction completed, and then the new master/leader created
> another datapath-binding for the same logical-switch. Does that make any sense?

I agree with you it could be related to the failover and the lock behavior 
during the failover. It could be a lock problem causing 2 northds to become
active at the same time for a short moment. However, I still can't imagine how
the duplicated entries are created with different tunnel keys. If both northds
create the datapath binding for the same LS at the same time, they should
allocate the same tunnel key, and then one of them should fail during the 
transa

Re: [ovs-discuss] [OVN] DB backup and restore

2020-07-30 Thread Tony Liu
Mmm... nb-db rolled back, but sb-db is not re-synced; ovn-northd
complains "clustered database server has stale data; trying
another server". Is there any way to work around it, or do I need to
snapshot and roll back sb-db as well?


Thanks!

Tony

From: Tony Liu<mailto:tonyliu0...@hotmail.com>
Sent: Thursday, July 30, 2020 7:04 PM
To: Numan Siddique<mailto:nusid...@redhat.com>; Han Zhou<mailto:hz...@ovn.org>
Cc: Han Zhou<mailto:hz...@ovn.org>; ovs-dev<mailto:ovs-...@openvswitch.org>; 
ovs-discuss<mailto:ovs-discuss@openvswitch.org>
Subject: Re: [ovs-dev] [ovs-discuss] [OVN] DB backup and restore

Hi,

Just an update: I finally made this snapshot/rollback work.
The rollback is not live though. Here is what I did.

1. Make a snapshot with ovsdb-client, assuming there are no ongoing
   transactions and the data is consistent on all nodes. The
   snapshot can be taken on any node. It doesn't include any
   cluster info. That's probably why the man page says this is
   for standalone and A/B only. But that cluster info doesn't seem
   to be required for a restore.

2. To roll back/restore, stop services on all nodes, starting
   with the followers and ending with the leader.

3. Pick a node as the new leader and copy the snapshot to be the DB
   file. Then start the service. A cluster with a new cluster ID
   will be created. The node will be allocated a new server ID
   as well.

4. On the other two nodes, remove the DB file and restart the service
   with remote-address pointing to the leader.

Now the new cluster starts working with the rolled-back data.

"ovsdb-client restore" doesn't work for me, not sure why.

ovsdb-client: ovsdb error: /dev/stdin: cannot identify file type

I tried to restore the snapshot created by backup, and also the
directly copied DB file; neither of them works. Has anyone
experienced this issue?

To Numan, it would be great if you could share the details of how to use
neutron-ovn-db-sync-util.


Thanks!

Tony

From: Tony Liu<mailto:tonyliu0...@hotmail.com>
Sent: Thursday, July 30, 2020 4:51 PM
To: Numan Siddique<mailto:nusid...@redhat.com>; Han Zhou<mailto:hz...@ovn.org>
Cc: Han Zhou<mailto:hz...@ovn.org>; ovs-dev<mailto:ovs-...@openvswitch.org>; 
ovs-discuss<mailto:ovs-discuss@openvswitch.org>
Subject: Re: [ovs-dev] [ovs-discuss] [OVN] DB backup and restore

Hi Numan,

I found this comment you made a few years back.

- At neutron-server startup, OVN ML2 driver syncs the neutron
DB and OVN DB if sync mode is set to repair.
- Admin can run the "neutron-ovn-db-sync-util" to sync the DBs.

Could you share the details to try those two options?


Thanks!

Tony

From: Tony Liu<mailto:tonyliu0...@hotmail.com>
Sent: Thursday, July 30, 2020 4:38 PM
To: Han Zhou<mailto:hz...@ovn.org>
Cc: Han Zhou<mailto:hz...@ovn.org>; ovs-dev<mailto:ovs-...@openvswitch.org>; 
ovs-discuss<mailto:ovs-discuss@openvswitch.org>
Subject: Re: [ovs-dev] [ovs-discuss] [OVN] DB backup and restore

Hi,

I have another thought after some digging. Since I am with
OpenStack, all networking configuration comes from OpenStack.
I could snapshot the OpenStack MariaDB, restore it and run
neutron-ovn-db-sync-util to update the OVN DB. Would that be a cleaner
solution?

BTW, I got this error when restoring the OVN DB.
ovsdb-client: ovsdb error: /dev/stdin: cannot identify file type

The file was created by "backup" command.


Thanks!

Tony

From: Tony Liu<mailto:tonyliu0...@hotmail.com>
Sent: Thursday, July 30, 2020 3:41 PM
To: Han Zhou<mailto:hz...@ovn.org>
Cc: Han Zhou<mailto:hz...@ovn.org>; ovs-dev<mailto:ovs-...@openvswitch.org>; 
ovs-discuss<mailto:ovs-discuss@openvswitch.org>
Subject: Re: [ovs-dev] [ovs-discuss] [OVN] DB backup and restore

Hi,

A quick question here. Given this man page.
http://www.openvswitch.org/support/dist-docs/ovsdb-client.1.txt

It says backup and restore commands are for OVSDB standalone and
active-backup databases.

Can they be used for a RAFT cluster? If not, what would be the concern,
like inconsistency?

If I restore to a follower, is the request going to be forwarded to the
leader to restore the DB for the whole cluster? But I believe it's recommended
to restore to the leader directly for performance's sake.

I am going to give it a try anyway and see how it works. I will make sure
there is no configuration update from the OpenStack side while running such a
snapshot and restore process.





Thanks!



Tony

From: Han Zhou<mailto:hz...@ovn.org>
Sent: Thursday, July 30, 2020 12:23 PM
To: Tony Liu<mailto:tonyliu0...@hotmail.com>
Cc: Han Zhou<mailto:hz...@ovn.org>; 
ovs-discuss<mailto:ovs-discuss@openvswitch.org>; 
ovs-dev<mailto:ovs-...@openvswitch.org>
Subject: Re: [ovs-discuss] [OVN] DB backup and restore



On Thu, Jul 30, 2020 at 10:56 AM Tony Liu <tonyliu0...@hotmail.com> wrote:
Hi Han,

That doc helps. I will run some tests and update here. The use case I want
to cover is snap

Re: [ovs-discuss] [OVN] DB backup and restore

2020-07-30 Thread Tony Liu
Hi,

Just an update: I finally made this snapshot/rollback work.
The rollback is not live though. Here is what I did.

1. Make a snapshot with ovsdb-client, assuming there are no ongoing
   transactions and the data is consistent on all nodes. The
   snapshot can be taken on any node. It doesn't include any
   cluster info. That's probably why the man page says this is
   for standalone and A/B only. But that cluster info doesn't seem
   to be required for a restore.

2. To roll back/restore, stop services on all nodes, starting
   with the followers and ending with the leader.

3. Pick a node as the new leader and copy the snapshot to be the DB
   file. Then start the service. A cluster with a new cluster ID
   will be created. The node will be allocated a new server ID
   as well.

4. On the other two nodes, remove the DB file and restart the service
   with remote-address pointing to the leader.

Now the new cluster starts working with the rolled-back data.

"ovsdb-client restore" doesn't work for me, not sure why.

ovsdb-client: ovsdb error: /dev/stdin: cannot identify file type

I tried to restore the snapshot created by backup, and also the
directly copied DB file; neither of them works. Has anyone
experienced this issue?

To Numan, it would be great if you could share the details of how to use
neutron-ovn-db-sync-util.
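
For steps 3 and 4 above, one explicit way to do the equivalent by hand is with ovsdb-tool, whose create-cluster command takes a standalone database as the initial contents and whose join-cluster command creates an empty local file that will sync from an existing member. The sketch below only builds the command lines; the helper names, file paths, IP addresses and port are hypothetical examples, and whether the service scripts in this deployment run these exact commands is not stated in the thread.

```python
def create_cluster_cmd(db_file, snapshot, local_addr):
    """argv for `ovsdb-tool create-cluster DB CONTENTS ADDRESS` (first node)."""
    return ["ovsdb-tool", "create-cluster", db_file, snapshot, local_addr]


def join_cluster_cmd(db_file, db_name, local_addr, remote_addrs):
    """argv for `ovsdb-tool join-cluster DB NAME LOCAL REMOTE...` (other nodes)."""
    return ["ovsdb-tool", "join-cluster", db_file, db_name, local_addr, *remote_addrs]


# Hypothetical paths and addresses, for illustration only.
DB_FILE = "/var/lib/ovn/ovnnb_db.db"
SNAPSHOT = "/tmp/nb-snapshot.db"
NODES = ["tcp:10.6.20.84:6643", "tcp:10.6.20.85:6643", "tcp:10.6.20.86:6643"]

leader, followers = NODES[0], NODES[1:]

# Step 3: the chosen node starts a new cluster from the snapshot.
plan = [create_cluster_cmd(DB_FILE, SNAPSHOT, leader)]
# Step 4: the other nodes drop their old file and join via the new leader.
for addr in followers:
    plan.append(join_cluster_cmd(DB_FILE, "OVN_Northbound", addr, [leader]))

for argv in plan:
    print(" ".join(argv))
```

The old DB file has to be moved away on every node before these commands are run, as in step 4 above, and the servers are started afterwards.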


Thanks!

Tony

From: Tony Liu<mailto:tonyliu0...@hotmail.com>
Sent: Thursday, July 30, 2020 4:51 PM
To: Numan Siddique<mailto:nusid...@redhat.com>; Han Zhou<mailto:hz...@ovn.org>
Cc: Han Zhou<mailto:hz...@ovn.org>; ovs-dev<mailto:ovs-...@openvswitch.org>; 
ovs-discuss<mailto:ovs-discuss@openvswitch.org>
Subject: Re: [ovs-dev] [ovs-discuss] [OVN] DB backup and restore

Hi Numan,

I found this comment you made a few years back.

- At neutron-server startup, OVN ML2 driver syncs the neutron
DB and OVN DB if sync mode is set to repair.
- Admin can run the "neutron-ovn-db-sync-util" to sync the DBs.

Could you share the details to try those two options?


Thanks!

Tony

From: Tony Liu<mailto:tonyliu0...@hotmail.com>
Sent: Thursday, July 30, 2020 4:38 PM
To: Han Zhou<mailto:hz...@ovn.org>
Cc: Han Zhou<mailto:hz...@ovn.org>; ovs-dev<mailto:ovs-...@openvswitch.org>; 
ovs-discuss<mailto:ovs-discuss@openvswitch.org>
Subject: Re: [ovs-dev] [ovs-discuss] [OVN] DB backup and restore

Hi,

I have another thought after some digging. Since I am with
OpenStack, all networking configuration comes from OpenStack.
I could snapshot the OpenStack MariaDB, restore it and run
neutron-ovn-db-sync-util to update the OVN DB. Would that be a cleaner
solution?

BTW, I got this error when restoring the OVN DB.
ovsdb-client: ovsdb error: /dev/stdin: cannot identify file type

The file was created by "backup" command.


Thanks!

Tony

From: Tony Liu<mailto:tonyliu0...@hotmail.com>
Sent: Thursday, July 30, 2020 3:41 PM
To: Han Zhou<mailto:hz...@ovn.org>
Cc: Han Zhou<mailto:hz...@ovn.org>; ovs-dev<mailto:ovs-...@openvswitch.org>; 
ovs-discuss<mailto:ovs-discuss@openvswitch.org>
Subject: Re: [ovs-dev] [ovs-discuss] [OVN] DB backup and restore

Hi,

A quick question here. Given this man page.
http://www.openvswitch.org/support/dist-docs/ovsdb-client.1.txt

It says backup and restore commands are for OVSDB standalone and
active-backup databases.

Can they be used for a RAFT cluster? If not, what would be the concern,
like inconsistency?

If I restore to a follower, is the request going to be forwarded to the
leader to restore the DB for the whole cluster? But I believe it's recommended
to restore to the leader directly for performance's sake.

I am going to give it a try anyway and see how it works. I will make sure
there is no configuration update from the OpenStack side while running such a
snapshot and restore process.





Thanks!



Tony

From: Han Zhou<mailto:hz...@ovn.org>
Sent: Thursday, July 30, 2020 12:23 PM
To: Tony Liu<mailto:tonyliu0...@hotmail.com>
Cc: Han Zhou<mailto:hz...@ovn.org>; 
ovs-discuss<mailto:ovs-discuss@openvswitch.org>; 
ovs-dev<mailto:ovs-...@openvswitch.org>
Subject: Re: [ovs-discuss] [OVN] DB backup and restore



On Thu, Jul 30, 2020 at 10:56 AM Tony Liu <tonyliu0...@hotmail.com> wrote:
Hi Han,

That doc helps. I will run some tests and update here. The use case I want
to cover is snapshot/rollback and backup/restore.


Actually, "at-least-once" consistency, because OVSDB does not have a session
mechanism to drop duplicate transactions if a connection drops after the server
commits it but before the client receives the result.

I saw duplicated datapath bindings for the same logical switch once, if you
recall. This may explain that. The ovn-northd connection to sb-db is dropped
before receiving the result. So ovn-northd initiates another transaction to
create datapath binding for the same logical switch.

Yes, this is a possibility.
However, in reality, this is usually not a problem
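
The "at-least-once" effect discussed above can be reproduced with a toy model: the server commits the insert, the reply is lost, and the client, unable to tell whether the commit happened, sends the same insert again. This is plain Python with made-up class and function names, not OVSDB or ovn-northd code, and it leaves out tunnel key allocation entirely.

```python
class ToyDatabase:
    """Stand-in for a server that commits every insert it receives."""

    def __init__(self):
        self.rows = []

    def insert(self, row):
        self.rows.append(dict(row))
        return {"committed": True}


def insert_with_retry(db, row, replies_lost):
    """Send the insert, resending as long as the reply gets lost."""
    attempts = 0
    while True:
        attempts += 1
        reply = db.insert(row)        # the server commits first ...
        if attempts <= replies_lost:  # ... then the connection drops
            continue                  # client cannot tell, so it retries
        return reply, attempts


db = ToyDatabase()
_, attempts = insert_with_retry(db, {"logical_switch": "ls1"}, replies_lost=1)
print(attempts, len(db.rows))  # 2 2
```

One lost reply is enough to leave two rows for the same logical switch, which matches the duplicated datapath bindings described above.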

[ovs-discuss] [OVN] Where is DB cluster info stored?

2020-07-30 Thread Tony Liu

Hi,

Where is RAFT DB cluster info stored?
I see cluster ID in the DB file, but where is server ID stored?


Thanks!

Tony



Re: [ovs-discuss] [OVN] Where is DB cluster info stored?

2020-07-30 Thread Tony Liu
Sorry for bothering, please discard this question.
Server ID is in DB file as well.
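
For anyone finding this in the archive later: one way to read both IDs
back from the file is ovsdb-tool's db-cid and db-sid commands. A minimal
sketch, assuming the NB database sits at the path mentioned elsewhere in
this thread; print_cluster_ids and the OVSDB_TOOL override are only
illustrative names, not part of OVS.

```shell
#!/bin/sh
# Print the cluster ID and server ID stored in a clustered OVSDB file.
# OVSDB_TOOL can be overridden (for example with "echo" for a dry run).
# The default database path is an assumption; adjust it for your install.
OVSDB_TOOL=${OVSDB_TOOL:-ovsdb-tool}

print_cluster_ids() {
    db=${1:-/var/lib/openvswitch/ovn-nb/ovnnb.db}
    echo "cluster ID: $($OVSDB_TOOL db-cid "$db")"
    echo "server ID:  $($OVSDB_TOOL db-sid "$db")"
}
```

Calling print_cluster_ids with no argument reads the default NB path;
pass the SB file path to inspect the southbound database instead.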

Tony

From: Tony Liu<mailto:tonyliu0...@hotmail.com>
Sent: Thursday, July 30, 2020 5:22 PM
To: ovs-discuss<mailto:ovs-discuss@openvswitch.org>
Subject: [OVN] Where is DB cluster info stored?


Hi,

Where is RAFT DB cluster info stored?
I see cluster ID in the DB file, but where is server ID stored?


Thanks!

Tony




Re: [ovs-discuss] [OVN] DB backup and restore

2020-07-30 Thread Tony Liu
Hi Numan,

I found this comment you made a few years back.

- At neutron-server startup, OVN ML2 driver syncs the neutron
DB and OVN DB if sync mode is set to repair.
- Admin can run the "neutron-ovn-db-sync-util" to sync the DBs.

Could you share the details to try those two options?
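
For context, here is how I understand the second option would be invoked.
This is only a sketch: the config file paths and the exact spelling of
the sync-mode option are my assumptions and should be checked against
"neutron-ovn-db-sync-util --help" on the installed release. sync_cmd is
a hypothetical helper that just prints the command line.

```shell
#!/bin/sh
# Build the neutron-ovn-db-sync-util command line without running it.
# Assumptions: config file locations and the --ovn-neutron_sync_mode
# option name; verify both on your release before executing anything.
sync_cmd() {
    mode=${1:-log}    # "log" reports differences, "repair" fixes them
    echo neutron-ovn-db-sync-util \
        --config-file /etc/neutron/neutron.conf \
        --config-file /etc/neutron/plugins/ml2/ml2_conf.ini \
        --ovn-neutron_sync_mode "$mode"
}

# Dry run: show what would be executed for a repair.
sync_cmd repair
```

Running with "log" first and reading the reported differences before
switching to "repair" seems the safer order.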


Thanks!

Tony

From: Tony Liu<mailto:tonyliu0...@hotmail.com>
Sent: Thursday, July 30, 2020 4:38 PM
To: Han Zhou<mailto:hz...@ovn.org>
Cc: Han Zhou<mailto:hz...@ovn.org>; ovs-dev<mailto:ovs-...@openvswitch.org>; 
ovs-discuss<mailto:ovs-discuss@openvswitch.org>
Subject: Re: [ovs-dev] [ovs-discuss] [OVN] DB backup and restore

Hi,

I have another thought after some digging. Since I am on
OpenStack, all networking configuration comes from OpenStack.
I could snapshot the OpenStack MariaDB, restore it and run
neutron-ovn-db-sync to update the OVN DB. Would that be a cleaner
solution?

BTW, I got this error when restoring the OVN DB:
ovsdb-client: ovsdb error: /dev/stdin: cannot identify file type

The file was created by the "backup" command.


Thanks!

Tony

From: Tony Liu<mailto:tonyliu0...@hotmail.com>
Sent: Thursday, July 30, 2020 3:41 PM
To: Han Zhou<mailto:hz...@ovn.org>
Cc: Han Zhou<mailto:hz...@ovn.org>; ovs-dev<mailto:ovs-...@openvswitch.org>; 
ovs-discuss<mailto:ovs-discuss@openvswitch.org>
Subject: Re: [ovs-dev] [ovs-discuss] [OVN] DB backup and restore

Hi,

A quick question here. Given this man page.
http://www.openvswitch.org/support/dist-docs/ovsdb-client.1.txt

It says backup and restore commands are for OVSDB standalone and
active-backup databases.

Can they be used for RAFT cluster? If not, what would be the concern,
like inconsistency?

If I restore to a follower, is the request going to be forwarded to the
leader to restore DB for the whole cluster? But I believe it's recommended
to restore to the leader directly for performance sake.

I am going to give it a try anyways, see how it works. Will make sure
there is no configuration update from OpenStack side while running such
snapshot and restore process.


Thanks!

Tony

From: Han Zhou<mailto:hz...@ovn.org>
Sent: Thursday, July 30, 2020 12:23 PM
To: Tony Liu<mailto:tonyliu0...@hotmail.com>
Cc: Han Zhou<mailto:hz...@ovn.org>; 
ovs-discuss<mailto:ovs-discuss@openvswitch.org>; 
ovs-dev<mailto:ovs-...@openvswitch.org>
Subject: Re: [ovs-discuss] [OVN] DB backup and restore



On Thu, Jul 30, 2020 at 10:56 AM Tony Liu <tonyliu0...@hotmail.com> wrote:
Hi Han,

That doc helps. I will run some tests and update here. The use case I want
to cover is snapshot/rollback and backup/restore.


Actually, "at-least-once" consistency, because OVSDB does not have a session
mechanism to drop duplicate transactions if a connection drops after the server
commits it but before the client receives the result.

I saw duplicated datapath bindings for the same logical switch once, if you
recall. This may explain that. The ovn-northd connection to sb-db is dropped
before receiving the result. So ovn-northd initiates another transaction to
create datapath binding for the same logical switch.

Yes, this is a possibility.
However, in reality, this is usually not a problem:

1) If DB schema has table keys properly defined, the redundant transaction from 
clients would be rejected by DB server because of key constraint check. In the 
datapath binding case, this doesn't work because of the poor definition of the 
datapath_binding table. It should have had "logical_switch_router" column 
defined and set as a key (in addition to the "tunnel_key") instead of storing 
it in external_ids. The duplicated entries would have been avoided. The other 
tables such as port_binding would never have such problem.

2) OVSDB clients usually monitor and sync all (interested) data from the server
to local, so when they do declarative processing, they can correct problems
by themselves. In fact, ovn-northd does the check and deletes duplicated
datapaths. I did a simple test and it did clean up by itself:
2020-07-30T18:55:53.057Z|6|ovn_northd|INFO|ovn-northd lock acquired. This ovn-northd instance is now active.
2020-07-30T19:02:10.465Z|7|ovn_northd|INFO|deleting Datapath_Binding abef9503-445e-4a52-ae88-4c826cbad9d6 with duplicate external-ids:logical-switch/router ee80c38b-2016-4cbc-9437-f73e3a59369e

I am not sure why in your case ovn-northd was stuck, but I agree there must be
something wrong. Please collect northd logs if you encounter this again so we
can dig further.

I see two ways to improve it.
1) On client side, if the connection is broken while waiting for the result
   of a transaction, the client checks the transaction state, committed or not,
   when it reconnects to the leader (maybe a different node).
   Do we have such check today?

Clients do check. In this case, when the transaction was actually successful but
ap

Re: [ovs-discuss] [OVN] DB backup and restore

2020-07-30 Thread Tony Liu
Hi,

I have another thought after some digging. Since I am on
OpenStack, all networking configuration comes from OpenStack.
I could snapshot the OpenStack MariaDB, restore it and run
neutron-ovn-db-sync to update the OVN DB. Would that be a cleaner
solution?

BTW, I got this error when restoring the OVN DB:
ovsdb-client: ovsdb error: /dev/stdin: cannot identify file type

The file was created by the "backup" command.
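
One check I could run before feeding a file to "ovsdb-client restore" is
to look at its first line. As far as I know, OVSDB files start with a
magic string, "OVSDB JSON" for the standalone format and "OVSDB CLUSTER"
for a clustered file. The sketch below only classifies the file; it does
not explain the error above, and snapshot_kind is a made-up helper name.

```shell
#!/bin/sh
# Classify a snapshot file by the magic string on its first line.
# Prints one of: empty, standalone, clustered, unknown.
snapshot_kind() {
    if [ ! -s "$1" ]; then
        # Missing file or zero bytes, e.g. a backup that failed silently.
        echo empty
        return
    fi
    case $(head -n 1 "$1") in
        "OVSDB JSON "*)    echo standalone ;;
        "OVSDB CLUSTER "*) echo clustered ;;
        *)                 echo unknown ;;
    esac
}
```

A restore would then be attempted only when snapshot_kind reports
"standalone", since that is the format the backup command is documented
to produce.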


Thanks!

Tony

From: Tony Liu<mailto:tonyliu0...@hotmail.com>
Sent: Thursday, July 30, 2020 3:41 PM
To: Han Zhou<mailto:hz...@ovn.org>
Cc: Han Zhou<mailto:hz...@ovn.org>; ovs-dev<mailto:ovs-...@openvswitch.org>; 
ovs-discuss<mailto:ovs-discuss@openvswitch.org>
Subject: Re: [ovs-dev] [ovs-discuss] [OVN] DB backup and restore

Hi,

A quick question here. Given this man page.
http://www.openvswitch.org/support/dist-docs/ovsdb-client.1.txt

It says backup and restore commands are for OVSDB standalone and
active-backup databases.

Can they be used for RAFT cluster? If not, what would be the concern,
like inconsistency?

If I restore to a follower, is the request going to be forwarded to the
leader to restore DB for the whole cluster? But I believe it's recommended
to restore to the leader directly for performance sake.

I am going to give it a try anyways, see how it works. Will make sure
there is no configuration update from OpenStack side while running such
snapshot and restore process.


Thanks!

Tony

From: Han Zhou<mailto:hz...@ovn.org>
Sent: Thursday, July 30, 2020 12:23 PM
To: Tony Liu<mailto:tonyliu0...@hotmail.com>
Cc: Han Zhou<mailto:hz...@ovn.org>; 
ovs-discuss<mailto:ovs-discuss@openvswitch.org>; 
ovs-dev<mailto:ovs-...@openvswitch.org>
Subject: Re: [ovs-discuss] [OVN] DB backup and restore



On Thu, Jul 30, 2020 at 10:56 AM Tony Liu <tonyliu0...@hotmail.com> wrote:
Hi Han,

That doc helps. I will run some tests and update here. The use case I want
to cover is snapshot/rollback and backup/restore.


Actually, "at-least-once" consistency, because OVSDB does not have a session
mechanism to drop duplicate transactions if a connection drops after the server
commits it but before the client receives the result.

I saw duplicated datapath bindings for the same logical switch once, if you
recall. This may explain that. The ovn-northd connection to sb-db is dropped
before receiving the result. So ovn-northd initiates another transaction to
create datapath binding for the same logical switch.

Yes, this is a possibility.
However, in reality, this is usually not a problem:

1) If DB schema has table keys properly defined, the redundant transaction from 
clients would be rejected by DB server because of key constraint check. In the 
datapath binding case, this doesn't work because of the poor definition of the 
datapath_binding table. It should have had "logical_switch_router" column 
defined and set as a key (in addition to the "tunnel_key") instead of storing 
it in external_ids. The duplicated entries would have been avoided. The other 
tables such as port_binding would never have such problem.

2) OVSDB clients usually monitor and sync all (interested) data from the server
to local, so when they do declarative processing, they can correct problems
by themselves. In fact, ovn-northd does the check and deletes duplicated
datapaths. I did a simple test and it did clean up by itself:
2020-07-30T18:55:53.057Z|6|ovn_northd|INFO|ovn-northd lock acquired. This ovn-northd instance is now active.
2020-07-30T19:02:10.465Z|7|ovn_northd|INFO|deleting Datapath_Binding abef9503-445e-4a52-ae88-4c826cbad9d6 with duplicate external-ids:logical-switch/router ee80c38b-2016-4cbc-9437-f73e3a59369e

I am not sure why in your case ovn-northd was stuck, but I agree there must be
something wrong. Please collect northd logs if you encounter this again so we
can dig further.

I see two ways to improve it.
1) On client side, if the connection is broken while waiting for the result
   of a transaction, the client checks the transaction state, committed or not,
   when it reconnects to the leader (maybe a different node).
   Do we have such check today?

Clients do check. In this case, when the transaction was actually successful but
appears to have failed from the client's point of view, the check doesn't help.

2) I see client connection is dropped by the leader when it's busy. I don't
   think this is a good way to control the traffic. The server can cache and
   hold the request when it's busy, or even push back. Dropping connection
   is not a good option. Any thoughts here?

The server doesn't make this kind of decision. It could simply be overloaded
and disconnected from the cluster, or even worse, a node could crash after
committing the transaction.

Thanks,
Han


Thanks!

Tony

From: Han Zhou<mailto:hz...@ovn.org>
Sent: Wednesday, July 29, 2020 11:38 PM
To: T

Re: [ovs-discuss] [OVN] DB backup and restore

2020-07-30 Thread Tony Liu
Hi,

A quick question here. Given this man page.
http://www.openvswitch.org/support/dist-docs/ovsdb-client.1.txt

It says backup and restore commands are for OVSDB standalone and
active-backup databases.

Can they be used for RAFT cluster? If not, what would be the concern,
like inconsistency?

If I restore to a follower, is the request going to be forwarded to the
leader to restore DB for the whole cluster? But I believe it's recommended
to restore to the leader directly for performance sake.

I am going to give it a try anyways, see how it works. Will make sure
there is no configuration update from OpenStack side while running such
snapshot and restore process.


Thanks!

Tony

From: Han Zhou<mailto:hz...@ovn.org>
Sent: Thursday, July 30, 2020 12:23 PM
To: Tony Liu<mailto:tonyliu0...@hotmail.com>
Cc: Han Zhou<mailto:hz...@ovn.org>; 
ovs-discuss<mailto:ovs-discuss@openvswitch.org>; 
ovs-dev<mailto:ovs-...@openvswitch.org>
Subject: Re: [ovs-discuss] [OVN] DB backup and restore



On Thu, Jul 30, 2020 at 10:56 AM Tony Liu <tonyliu0...@hotmail.com> wrote:
Hi Han,

That doc helps. I will run some tests and update here. The use case I want
to cover is snapshot/rollback and backup/restore.


Actually, "at-least-once" consistency, because OVSDB does not have a session
mechanism to drop duplicate transactions if a connection drops after the server
commits it but before the client receives the result.

I saw duplicated datapath bindings for the same logical switch once, if you
recall. This may explain that. The ovn-northd connection to sb-db is dropped
before receiving the result. So ovn-northd initiates another transaction to
create datapath binding for the same logical switch.

Yes, this is a possibility.
However, in reality, this is usually not a problem:

1) If DB schema has table keys properly defined, the redundant transaction from 
clients would be rejected by DB server because of key constraint check. In the 
datapath binding case, this doesn't work because of the poor definition of the 
datapath_binding table. It should have had "logical_switch_router" column 
defined and set as a key (in addition to the "tunnel_key") instead of storing 
it in external_ids. The duplicated entries would have been avoided. The other 
tables such as port_binding would never have such problem.

2) OVSDB clients usually monitor and sync all (interested) data from the server
to local, so when they do declarative processing, they can correct problems
by themselves. In fact, ovn-northd does the check and deletes duplicated
datapaths. I did a simple test and it did clean up by itself:
2020-07-30T18:55:53.057Z|6|ovn_northd|INFO|ovn-northd lock acquired. This ovn-northd instance is now active.
2020-07-30T19:02:10.465Z|7|ovn_northd|INFO|deleting Datapath_Binding abef9503-445e-4a52-ae88-4c826cbad9d6 with duplicate external-ids:logical-switch/router ee80c38b-2016-4cbc-9437-f73e3a59369e

I am not sure why in your case ovn-northd was stuck, but I agree there must be
something wrong. Please collect northd logs if you encounter this again so we
can dig further.
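
To spot such duplicates by hand while collecting logs, something like
the following could be piped from the SB database listing. It is a
sketch under the assumption that the external_ids keys are named
logical-switch and logical-router, as the log line above suggests;
dup_datapaths is a hypothetical helper name.

```shell
#!/bin/sh
# Read "external_ids" lines on stdin, for example from
#   ovn-sbctl --columns=external_ids list Datapath_Binding
# and print every logical switch/router UUID that is referenced by more
# than one datapath binding.
dup_datapaths() {
    grep -oE 'logical-(switch|router)="?[0-9a-f-]{36}"?' |
        tr -d '"' |
        sort |
        uniq -d
}
```

An empty result means no logical switch or router has more than one
datapath binding at the time of the listing.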

I see two ways to improve it.
1) On client side, if the connection is broken while waiting for the result
   of a transaction, the client checks the transaction state, committed or not,
   when it reconnects to the leader (maybe a different node).
   Do we have such check today?

Clients do check. In this case, when the transaction was actually successful but
appears to have failed from the client's point of view, the check doesn't help.

2) I see client connection is dropped by the leader when it's busy. I don't
   think this is a good way to control the traffic. The server can cache and
   hold the request when it's busy, or even push back. Dropping connection
   is not a good option. Any thoughts here?

The server doesn't make this kind of decision. It could simply be overloaded
and disconnected from the cluster, or even worse, a node could crash after
committing the transaction.

Thanks,
Han


Thanks!

Tony

From: Han Zhou<mailto:hz...@ovn.org>
Sent: Wednesday, July 29, 2020 11:38 PM
To: Tony Liu<mailto:tonyliu0...@hotmail.com>
Cc: ovs-discuss<mailto:ovs-discuss@openvswitch.org>; 
ovs-dev<mailto:ovs-...@openvswitch.org>
Subject: Re: [ovs-discuss] [OVN] DB backup and restore



On Wed, Jul 29, 2020 at 10:58 PM Tony Liu <tonyliu0...@hotmail.com> wrote:
>
> Hi,
>
> Is there any guidance on backing up and restoring OVN nb-db and sb-db?
>
> Is /var/lib/openvswitch/ovn-[ns]b/ovn[ns]b.db the only database file?
>
> For a 3-node DB cluster, is the replication factor 3 (the data is
> replicated onto all 3 nodes)?
>
> Are the DB files on the 3 nodes identical?
>
> If I stop a DB follower and empty the DB file on the follower node,
> when I start it back, is the wh

Re: [ovs-discuss] [OVN] DB backup and restore

2020-07-30 Thread Tony Liu
Hi Han,

That doc helps. I will run some tests and update here. The use case I want
to cover is snapshot/rollback and backup/restore.


Actually, "at-least-once" consistency, because OVSDB does not have a session
mechanism to drop duplicate transactions if a connection drops after the server
commits it but before the client receives the result.

I saw duplicated datapath bindings for the same logical switch once, if you
recall. This may explain that. The ovn-northd connection to sb-db is dropped
before receiving the result. So ovn-northd initiates another transaction to
create datapath binding for the same logical switch.

I see two ways to improve it.
1) On client side, if the connection is broken while waiting for the result
   of a transaction, the client checks the transaction state, committed or not,
   when it reconnects to the leader (maybe a different node).
   Do we have such check today?
2) I see client connection is dropped by the leader when it's busy. I don't
   think this is a good way to control the traffic. The server can cache and
   hold the request when it's busy, or even push back. Dropping connection
   is not a good option. Any thoughts here?


Thanks!

Tony

From: Han Zhou<mailto:hz...@ovn.org>
Sent: Wednesday, July 29, 2020 11:38 PM
To: Tony Liu<mailto:tonyliu0...@hotmail.com>
Cc: ovs-discuss<mailto:ovs-discuss@openvswitch.org>; 
ovs-dev<mailto:ovs-...@openvswitch.org>
Subject: Re: [ovs-discuss] [OVN] DB backup and restore



On Wed, Jul 29, 2020 at 10:58 PM Tony Liu <tonyliu0...@hotmail.com> wrote:
>
> Hi,
>
> Is there any guidance on backing up and restoring OVN nb-db and sb-db?
>
> Is /var/lib/openvswitch/ovn-[ns]b/ovn[ns]b.db the only database file?
>
> For a 3-node DB cluster, is the replication factor 3 (the data is
> replicated onto all 3 nodes)?
>
> Are the DB files on the 3 nodes identical?
>
> If I stop a DB follower and empty the DB file on the follower node,
> when I start it back, is the whole DB going to be replicated to it?
>
> To back up the DB, is it OK to copy the DB file from any node, assuming
> no transaction is ongoing?
>
> Is the following going to work to restore the DB?
> * Stop all 3 DBs.
> * Copy the backup DB file to one node, empty the DB file on the other two nodes.
> * Bootstrap the node with the DB file.
> * Start the other two nodes to join the cluster.
>

For ovsdb operations, please refer to "man 7 ovsdb", or here: 
https://github.com/openvswitch/ovs/blob/master/Documentation/ref/ovsdb.7.rst

>
> Do I need to restore sb-db as well? Or restore nb-db only and let
> ovn-northd sync data from nb-db to sb-db? Chassis data should be
> updated by ovn-controller?
>

You don't have to restore sb-db. ovn-northd and ovn-controllers will sync the 
data in SB DB.
However, it may take quite some time to sync if the scale is large.
Also, remember that the mac_binding table in SB will not be restored by 
ovn-controller because it is populated as a result of ARP packets handling by 
ovn-controller. The entries will be generated again only if new ARP packets are 
observed by ovn-controller.
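
As a rough illustration of the NB-only approach, based on my reading of
ovsdb(7): take a snapshot from the running server with "ovsdb-client
backup", then seed a fresh cluster from that standalone file with
"ovsdb-tool create-cluster". The sketch below only prints the commands.
The socket path, database path and RAFT address are assumptions, and
backup_cmd / seed_cluster_cmd are hypothetical helper names.

```shell
#!/bin/sh
# Dry-run sketch: print the commands for snapshotting the NB database
# and for creating the first node of a new cluster from that snapshot.
# NB_SOCK, NB_DB and the tcp: address below are assumptions.
NB_SOCK=${NB_SOCK:-unix:/var/run/ovn/ovnnb_db.sock}
NB_DB=${NB_DB:-/var/lib/openvswitch/ovn-nb/ovnnb.db}

backup_cmd() {
    # "ovsdb-client backup" writes a standalone-format snapshot to stdout.
    echo "ovsdb-client backup $NB_SOCK OVN_Northbound > $1"
}

seed_cluster_cmd() {
    # "ovsdb-tool create-cluster" accepts a standalone database as contents.
    echo "ovsdb-tool create-cluster $NB_DB $1 $2"
}

backup_cmd /tmp/ovnnb.backup
seed_cluster_cmd /tmp/ovnnb.backup tcp:192.0.2.10:6643
```

The other nodes would then join the new cluster (ovsdb-tool
join-cluster, or whatever the deployment tooling wraps around it), and
ovn-northd repopulates the SB DB as described above.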

>
> I am running a scaling test. It takes quite a lot of time to build
> configurations. I am wondering if I can back up and restore the DB to
> roll back to some checkpoint, to avoid starting all over.
>
> Thanks!
>
> Tony
>
> ___
> discuss mailing list
> disc...@openvswitch.org<mailto:disc...@openvswitch.org>
> https://mail.openvswitch.org/mailman/listinfo/ovs-discuss



[ovs-discuss] [OVN] DB backup and restore

2020-07-29 Thread Tony Liu
Hi,

Is there any guidance on backing up and restoring OVN nb-db and sb-db?

Is /var/lib/openvswitch/ovn-[ns]b/ovn[ns]b.db the only database file?

For a 3-node DB cluster, is the replication factor 3 (the data is
replicated onto all 3 nodes)?

Are the DB files on the 3 nodes identical?

If I stop a DB follower and empty the DB file on the follower node,
when I start it back, is the whole DB going to be replicated to it?

To back up the DB, is it OK to copy the DB file from any node, assuming
no transaction is ongoing?

Is the following going to work to restore the DB?
* Stop all 3 DBs.
* Copy the backup DB file to one node, empty the DB file on the other two nodes.
* Bootstrap the node with the DB file.
* Start the other two nodes to join the cluster.

Do I need to restore sb-db as well? Or restore nb-db only and let
ovn-northd sync data from nb-db to sb-db? Chassis data should be
updated by ovn-controller?

I am running a scaling test. It takes quite a lot of time to build
configurations. I am wondering if I can back up and restore the DB to
roll back to some checkpoint, to avoid starting all over.


Thanks!

Tony



Re: [ovs-discuss] [ovs-dev] OVN: configuration in Neutron DB?

2020-07-29 Thread Tony Liu
Hi Lucas,

Is it OK to discuss OpenStack integration here? Otherwise, please let
me know which OpenStack email list we can use.

I am running a networking scaling test. The target is 4K isolated private
networks with external/public access via logical routers. The test in
my previous email adds 256 routers and sets them as gateways. When I
use the openstack CLI to create a router and set it as gateway immediately,
it sometimes doesn’t work; Neutron complains that the router is not
found.
for c in $(seq 0 255); do
    echo "INFO: $op router-$c..."
    openstack router "$op" "router-$c"
    if [ "$op" = "create" ]; then
        openstack router set \
            --external-gateway public \
            --fixed-ip ip-address="10.6.33.$c" \
            --disable-snat \
            "router-$c"
    fi
done

I thought the Neutron API call is synchronous, which means that when the
client gets the response for a creation request, the object is created. But
the OVN part is async. Is that right? That’s why I was asking whether reads
come from the Neutron DB or the OVN DB. If reads come from the Neutron DB,
I don’t understand why the router cannot be found when the client sets it.
Any thoughts here?

I changed the test script to create all 256 routers first, then set them.
It works fine. It kind of proves that it’s not guaranteed that the
object is ready after receiving the response. I don’t believe this is
expected.
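
As a stopgap while the root cause is unclear, the create-then-set
sequence could poll until the router is visible before setting the
gateway. This is only a workaround sketch, not a fix; wait_for is a
hypothetical helper, and the attempt count and delay are arbitrary.

```shell
#!/bin/sh
# Retry a command until it succeeds, up to a number of attempts.
# Usage: wait_for <attempts> <delay-seconds> <command> [args...]
# Returns 0 as soon as the command succeeds, 1 if it never does.
wait_for() {
    attempts=$1
    delay=$2
    shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        sleep "$delay"
        i=$((i + 1))
    done
    return 1
}

# Intended use inside the loop (not executed here):
#   openstack router create "router-$c" &&
#       wait_for 10 1 openstack router show "router-$c" &&
#       openstack router set --external-gateway public "router-$c"
```

If "router show" succeeds but "router set" still fails, that would
point away from simple read-after-write lag.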

Now, I am adding subnets to routers, 16 subnets per router. That’s 4096
networks on 256 routers. It’s quite slow and I am getting some errors.

INFO: Add subnet-1-0 to router-1...
BadRequestException: 400: Client Error for url: 
http://10.6.20.200:9696/v2.0/routers/3d9bff46-4eba-4621-b7c4-3c3062ab8d6e/add_router_interface,
 Bad router request: Router already has a port on subnet 
9d8d1f2d-a401-4ad9-9a72-0d530fba2085.
INFO: Add subnet-1-16 to router-1...
INFO: Add subnet-1-32 to router-1...
ConflictException: 409: Client Error for url: 
http://10.6.20.200:9696/v2.0/routers/3d9bff46-4eba-4621-b7c4-3c3062ab8d6e/add_router_interface,
 IP address 192.168.1.33 already allocated in subnet 
e44dee36-0ee4-4c23-91cf-1ac5b9ee07e0
INFO: Add subnet-1-48 to router-1...
INFO: Add subnet-1-64 to router-1...
HttpException: 500: Server Error for url: 
http://10.6.20.200:9696/v2.0/routers/3d9bff46-4eba-4621-b7c4-3c3062ab8d6e/add_router_interface,
 Request Failed: internal server error while processing your request.

I am not sure if that’s because I add subnets to one router back to back.
I will try another way to add subnet to avoid such back to back requests
to the same router. Will see if that helps.
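
What I have in mind is to reorder the loops so that consecutive requests
hit different routers. A sketch of that ordering follows. The names
follow the pattern in the log above (subnet-<router>-<slot*16>), which
is my own naming and an assumption for anyone else; interleaved_plan is
a hypothetical helper that only prints the plan.

```shell
#!/bin/sh
# Print "router subnet" pairs ordered so that consecutive requests go to
# different routers: the outer loop walks the subnet slot, the inner
# loop walks the routers.
# Usage: interleaved_plan <number-of-routers> <subnets-per-router>
interleaved_plan() {
    routers=$1
    per_router=$2
    slot=0
    while [ "$slot" -lt "$per_router" ]; do
        r=0
        while [ "$r" -lt "$routers" ]; do
            echo "router-$r subnet-$r-$((slot * 16))"
            r=$((r + 1))
        done
        slot=$((slot + 1))
    done
}

# Intended use (not executed here):
#   interleaved_plan 256 16 | while read -r router subnet; do
#       openstack router add subnet "$router" "$subnet"
#   done
```

With 256 routers, any two requests to the same router are then
separated by 255 requests to other routers.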


Thanks!

Tony

From: Lucas Alvares Gomes<mailto:lucasago...@gmail.com>
Sent: Wednesday, July 29, 2020 2:22 AM
To: Numan Siddique<mailto:num...@ovn.org>
Cc: Tony Liu<mailto:tonyliu0...@hotmail.com>; 
ovs-...@openvswitch.org<mailto:ovs-...@openvswitch.org>; Lucas Alvares Gomes 
Martins<mailto:lmart...@redhat.com>; 
ovs-discuss@openvswitch.org<mailto:ovs-discuss@openvswitch.org>
Subject: Re: [ovs-dev] [ovs-discuss] OVN: configuration in Neutron DB?

Hi,

On Wed, Jul 29, 2020 at 7:42 AM Numan Siddique  wrote:
>
> Adding Daniel and Lucas. Maybe you can also include opendev ML to get
> appropriate responses from the OpenStack side.
>
> Please see below for few comments.
>
>
> On Wed, Jul 29, 2020 at 12:02 PM Tony Liu  wrote:
>
> > Quick update. I changed the script to create 256 routers first, then set
> > each of them as gateway.
> > There is no create and set back to back. It seems to be working fine now.
> >
> > It would be good if someone could clarify my questions. It seems that
> > it's not guaranteed that the object is ready when the client gets an OK
> > response to a creation request. Is this expected?
> >
> >
> > Thanks!
> >
> > Tony
> >
> > --
> > *From:* dev  on behalf of Tony Liu <
> > tonyliu0...@hotmail.com>
> > *Sent:* July 28, 2020 10:37 PM
> > *To:* ovs-discuss@openvswitch.org ;
> > ovs-...@openvswitch.org 
> > *Subject:* [ovs-dev] OVN: configuration in Neutron DB?
> >
> > Hi,
> >
> > In case of integration with OpenStack, for example, when a client requests
> > to create a network,
> > is this network configuration saved in both Neutron DB and OVN DB, or OVN
> > DB only?
> >
>
> The neutron API first saves in the neutron db and the neutron OVN mechanism
> driver will talk
> to the Northbound ovsdb-server and create corresponding OVN logical
> resources.
>

Both. As Numan said, it's first created in the Neutron database and
then the OVN plugin is invoked to create that resource in the OVN NB
database. We usually use the "name" column in the OVN NB DB to store
the corresponding Neutron UUID, if the re

Re: [ovs-discuss] OVN: configuration in Neutron DB?

2020-07-29 Thread Tony Liu
Thanks Numan!

Tony

From: Numan Siddique 
Sent: July 28, 2020 11:41 PM
To: Tony Liu 
Cc: ovs-discuss@openvswitch.org ; 
ovs-...@openvswitch.org ; Daniel Alvarez Sanchez 
; Lucas Alvares Gomes Martins 
Subject: Re: [ovs-discuss] OVN: configuration in Neutron DB?


Adding Daniel and Lucas. Maybe you can also include opendev ML to get 
appropriate responses from the OpenStack side.

Please see below for few comments.


On Wed, Jul 29, 2020 at 12:02 PM Tony Liu <tonyliu0...@hotmail.com> wrote:
Quick update. I changed the script to create 256 routers first, then set
each of them as gateway.
There is no create and set back to back. It seems to be working fine now.

It would be good if someone could clarify my questions. It seems that
it's not guaranteed that the object is ready when the client gets an OK
response to a creation request. Is this expected?


Thanks!

Tony


From: dev <ovs-dev-boun...@openvswitch.org> on behalf of Tony Liu <tonyliu0...@hotmail.com>
Sent: July 28, 2020 10:37 PM
To: ovs-discuss@openvswitch.org; ovs-...@openvswitch.org
Subject: [ovs-dev] OVN: configuration in Neutron DB?

Hi,

In case of integration with OpenStack, for example, when a client requests to 
create a network,
is this network configuration saved in both Neutron DB and OVN DB, or OVN DB 
only?

The neutron API first saves in the neutron db and the neutron OVN mechanism 
driver will talk
to the Northbound ovsdb-server and create corresponding OVN logical resources.

Also, when a client gets a network from the Neutron API, is the configuration
read from the Neutron DB or the OVN DB?

I think it's read from the neutron DB.



Other than coding, is there any doc about how Neutron OVN ML2 driver works?

You can refer here - 
https://docs.openstack.org/neutron/latest/admin/ovn/refarch/refarch.html



I have this script to create 256 routers and set each of them as gateway.
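# Run "openstack router <op>" for router-0..router-255; on "create",
# also attach each router to the external network "public".
router()
{
    local op=$1

    for c in $(seq 0 255); do
        echo "INFO: $op router-$c..."
        openstack router "$op" "router-$c"
        if [ "$op" = "create" ]; then
            openstack router set \
                --external-gateway public \
                --fixed-ip ip-address="10.6.33.$c" \
                --disable-snat \
                "router-$c"
        fi
    done
}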
I see lots of failures in the Neutron log when getting/showing a router. It
seems that, when setting a router, the router is not completely ready yet.
Is that possible?

After running that script, I see some logical routers in ovn-nb-db don't have
gw_port_id, and there are some duplicates. Here is an example. Each of them
has a unique UUID.

external_ids: {"neutron:gw_port_id"="", "neutron:revision_number"="1", 
"neutron:router_name"=router-255}
external_ids: {"neutron:gw_port_id"="", "neutron:revision_number"="1", 
"neutron:router_name"=router-232}
external_ids: {"neutron:gw_port_id"="", "neutron:revision_number"="0", 
"neutron:router_name"=router-158}
external_ids: {"neutron:gw_port_id"="", "neutron:revision_number"="0", 
"neutron:router_name"=router-158}
external_ids: 
{"neutron:gw_port_id"="e52dda53-c914-4ea7-840b-8632a5770680", 
"neutron:revision_number"="2", "neutron:router_name"=router-158}

I enabled nb-db debug logging and searched for, e.g., router-158; it only
shows up in a jsonrpc reply message that includes 3 router-158 entries, as
above.

Any clues?


Maybe Daniel/Lucas can comment.

Thanks
Numan


Thanks!

Tony



___
dev mailing list
d...@openvswitch.org<mailto:d...@openvswitch.org>
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-discuss] OVN: configuration in Neutron DB?

2020-07-29 Thread Tony Liu
Quick update. I changed the script to create 256 routers first, then set
each of them as gateway.
There is no create and set back to back. It seems to be working fine now.

It would be good if someone could clarify my questions. It seems that
it's not guaranteed that the object is ready when the client gets an OK
response to a creation request. Is this expected?


Thanks!

Tony


From: dev  on behalf of Tony Liu 

Sent: July 28, 2020 10:37 PM
To: ovs-discuss@openvswitch.org ; 
ovs-...@openvswitch.org 
Subject: [ovs-dev] OVN: configuration in Neutron DB?

Hi,

In case of integration with OpenStack, for example, when a client requests to 
create a network,
is this network configuration saved in both Neutron DB and OVN DB, or OVN DB 
only?
Also, when a client gets a network from Neutron API, is the configuration read 
from Neutron DB
or OVN DB?

Other than coding, is there any doc about how Neutron OVN ML2 driver works?

I have this script to create 256 routers and set each of them as gateway.
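# Run "openstack router <op>" for router-0..router-255; on "create",
# also attach each router to the external network "public".
router()
{
    local op=$1

    for c in $(seq 0 255); do
        echo "INFO: $op router-$c..."
        openstack router "$op" "router-$c"
        if [ "$op" = "create" ]; then
            openstack router set \
                --external-gateway public \
                --fixed-ip ip-address="10.6.33.$c" \
                --disable-snat \
                "router-$c"
        fi
    done
}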
I see lots of failures in the Neutron log when getting/showing a router. It
seems that, when setting a router, the router is not completely ready yet.
Is that possible?

After running that script, I see some logical routers in ovn-nb-db don't have
gw_port_id, and there are some duplicates. Here is an example. Each of them
has a unique UUID.

external_ids: {"neutron:gw_port_id"="", "neutron:revision_number"="1", 
"neutron:router_name"=router-255}
external_ids: {"neutron:gw_port_id"="", "neutron:revision_number"="1", 
"neutron:router_name"=router-232}
external_ids: {"neutron:gw_port_id"="", "neutron:revision_number"="0", 
"neutron:router_name"=router-158}
external_ids: {"neutron:gw_port_id"="", "neutron:revision_number"="0", 
"neutron:router_name"=router-158}
external_ids: 
{"neutron:gw_port_id"="e52dda53-c914-4ea7-840b-8632a5770680", 
"neutron:revision_number"="2", "neutron:router_name"=router-158}

I enabled nb-db debug logging and searched for, e.g., router-158; it only
shows up in a jsonrpc reply message that includes 3 router-158 entries, as
above.

Any clues?


Thanks!

Tony





[ovs-discuss] OVN: configuration in Neutron DB?

2020-07-29 Thread Tony Liu
Hi,

In case of integration with OpenStack, for example, when a client requests to 
create a network,
is this network configuration saved in both Neutron DB and OVN DB, or OVN DB 
only?
Also, when a client gets a network from Neutron API, is the configuration read 
from Neutron DB
or OVN DB?

Other than coding, is there any doc about how Neutron OVN ML2 driver works?

I have this script to create 256 routers and set each of them as gateway.
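# Run "openstack router <op>" for router-0..router-255; on "create",
# also attach each router to the external network "public".
router()
{
    local op=$1

    for c in $(seq 0 255); do
        echo "INFO: $op router-$c..."
        openstack router "$op" "router-$c"
        if [ "$op" = "create" ]; then
            openstack router set \
                --external-gateway public \
                --fixed-ip ip-address="10.6.33.$c" \
                --disable-snat \
                "router-$c"
        fi
    done
}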
I see lots of failures in the Neutron log when getting/showing a router. It
seems that, when the router is being set, it is not completely ready yet.
Is that possible?

After running that script, I see that some logical routers in ovn-nb-db don't
have gw_port_id, and there are some duplications. Here is an example. Each of
them has a unique UUID.

external_ids: {"neutron:gw_port_id"="", "neutron:revision_number"="1", 
"neutron:router_name"=router-255}
external_ids: {"neutron:gw_port_id"="", "neutron:revision_number"="1", 
"neutron:router_name"=router-232}
external_ids: {"neutron:gw_port_id"="", "neutron:revision_number"="0", 
"neutron:router_name"=router-158}
external_ids: {"neutron:gw_port_id"="", "neutron:revision_number"="0", 
"neutron:router_name"=router-158}
external_ids: 
{"neutron:gw_port_id"="e52dda53-c914-4ea7-840b-8632a5770680", 
"neutron:revision_number"="2", "neutron:router_name"=router-158}

I enabled nb-db debug logging and searched for, e.g., router-158; it only shows
up in a jsonrpc reply message that includes the 3 router-158 entries above.

Any clues?


Thanks!

Tony





[ovs-discuss] OVN: resync nb-db to sb-db

2020-07-28 Thread Tony Liu
Hi,

When I run a script to create a bunch of networks and routers from OpenStack,
for whatever reason, nb-db is fully updated, but sb-db is only partially
updated. For example, there are 500 logical routers in nb-db, but only 218
datapath bindings in sb-db. In this case, is there any way to resync nb-db to
sb-db? Any advice on how to avoid such a failure?


Thanks!

Tony



Re: [ovs-discuss] Display OpenFlow port number, interface name and bridge name in single OvS CLI command

2020-07-28 Thread Tony Liu

for p in $(ovs-vsctl list-ports br-int); do \
ovs-vsctl -f table --columns=ofport,name list interface $p; \
done
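
The loop above prints the port number and name, but not the bridge. A minimal sketch that emits the full triplet (bridge name, interface name, OpenFlow port number) could look like the following. It relies only on `ovs-vsctl list-br`, `list-ifaces` and `get`; the `OVS_VSCTL` override variable and the function name are hypothetical, added so the loop can be exercised without a running switch.

```shell
# Print "bridge interface ofport" for every interface on one bridge,
# or on all bridges when no bridge name is given.
# OVS_VSCTL is a hypothetical override for the ovs-vsctl binary.
list_bridge_ofports() {
    vsctl=${OVS_VSCTL:-ovs-vsctl}
    bridges=${1:-$($vsctl list-br)}
    for br in $bridges; do
        # list-ifaces returns interfaces, so bonded ports are expanded too
        for iface in $($vsctl list-ifaces "$br"); do
            ofport=$($vsctl get Interface "$iface" ofport)
            printf '%s %s %s\n' "$br" "$iface" "$ofport"
        done
    done
}

# Example (requires a running Open vSwitch):
#   list_bridge_ofports br-int
```

This still runs one `ovs-vsctl` call per interface, so it is not the single command the question asks for, only a way to get all three columns together.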

Tony

From: discuss  on behalf of Matteo Olivi 

Sent: July 28, 2020 11:14 AM
To: ovs-discuss@openvswitch.org 
Subject: [ovs-discuss] Display OpenFlow port number, interface name and bridge 
name in single OvS CLI command

Hello everyone,
I have an OvS bridge X and some network interfaces connected to it via OpenFlow 
ports.
For each interface connected to X, I want to display the name and the number of 
its OpenFlow port.
I've been using the following command:
"ovs-vsctl -f table -- --columns=ofport,name list Interface"

The problem with the command above is that it lists the interface name and 
OpenFlow port
number for the interfaces connected to all the bridges on the host, while I 
only want the interfaces
connected to bridge X. Is there a single command to obtain what I need? Or, is 
there a single
database table where rows store the information that I need, i.e. the triplet 
(bridge name, interface name, OpenFlow port number) ?

Thanks,
Matteo.


[ovs-discuss] OVN nb-db and sb-db election timer

2020-07-27 Thread Tony Liu
Hi,

During a scaling test, when sb-db is busy, followers believe the leader is dead
and start an election. Some inconsistency happens during such a leader switch:
two datapath bindings are created for the same logical switch. To avoid this,
I was advised to increase the election timer by 10x. 4K networks were created
successfully with that setting.

Is it necessary to set a big election timer for nb-db as well? The nb-db
doesn't seem very busy during the test, while sb-db is always busy, taking
90+% CPU.

With such a big election timer, if a real problem happens, like the leader
node going down, is it going to take a while for the new leader to be elected?


Thanks!

Tony



Re: [ovs-discuss] OVN scale

2020-07-27 Thread Tony Liu
Hi Han,

Just some updates here.

I tried with 4K networks on single router. Configuration was done without any 
issues. I checked both
nb-db and sb-db, they all look good. It's just that router configuration is 
huge (in Neutron DB, nb-db
and flow table in sb-db), because it contains all 4K ports. Also, the pipeline 
of router datapath in sb-db
is quite big.

I see that the ovn-northd master and the sb-db leader are busy, taking 90+%
CPU. There are only 3 compute nodes and 2 gateway nodes. Does the monitor
setting "ovn-monitor-all" matter in such a case? Any idea what they are busy
with, without any configuration updates from OpenStack? The nb-db is not busy
though.

Probably because sb-db is busy, ovn-controller can't connect to it
consistently. It keeps getting disconnected and reconnecting. Restarting
ovn-controller seems to help. I am able to launch a few VMs on different
networks and they are connected via the router.

Now, I have a problem with external access. The router is set as gateway to a
provider/underlay network on an interface on the gateway node. The router is
allocated an underlay address from that provider network. My understanding is
that the br-ex on the gateway node holding the active router will broadcast
ARP to announce that router underlay address in case of failover. Also, it
will respond to ARP requests for that router underlay address. But when I run
tcpdump on that underlay interface on the gateway node, I see ARP requests
coming in, but no ARP responses going out. I checked the flow table in sb-db,
and it seems OK. I also checked the flows on br-ex with
"ovs-ofctl dump-flows br-ex", and I don't see anything about ARP there.
How should I look into it?

Again, the case is to support 4K networks with external access (security group 
is disabled),
4K routers (one for each network), 50 routers (one for 80 networks), 1 router 
(for all 4K networks)...
All networks are isolated by ACL on the logical router. Which option should 
work better?
Any comment is appreciated.


Thanks!

Tony



From: discuss  on behalf of Tony Liu 

Sent: July 21, 2020 09:09 PM
To: Daniel Alvarez 
Cc: ovs-discuss@openvswitch.org 
Subject: Re: [ovs-discuss] OVN scale

[root@ovn-db-2 ~]# ovn-nbctl list nb_global
_uuid   : b7b3aa05-f7ed-4dbc-979f-10445ac325b8
connections : []
external_ids: {"neutron:liveness_check_at"="2020-07-22 
04:03:17.726917+00:00"}
hv_cfg  : 312
ipsec   : false
name: ""
nb_cfg  : 2636
options : {mac_prefix="ca:e8:07", 
svc_monitor_mac="4e:d0:3a:80:d4:b7"}
sb_cfg  : 2005
ssl : []

[root@ovn-db-2 ~]# ovn-sbctl list sb_global
_uuid   : 3720bc1d-b0da-47ce-85ca-96fa8d398489
connections : []
external_ids: {}
ipsec   : false
nb_cfg  : 312
options : {mac_prefix="ca:e8:07", 
svc_monitor_mac="4e:d0:3a:80:d4:b7"}
ssl : []

The NBDB and SBDB are definitely out of sync. Is there any way to force
ovn-northd to sync them?
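
As a rough way to quantify the gap shown in the listings above, the sketch below compares `nb_cfg` with `sb_cfg` from the NB_Global row (ovn-northd sets `sb_cfg` to the `nb_cfg` value it has finished applying to the southbound database). The function names are hypothetical; `ovn-nbctl get NB_Global . <column>` is the standard database `get` command. This only measures the lag and does not repair a northd that is stuck.

```shell
# Difference between NB_Global nb_cfg and sb_cfg: how many configuration
# sequence numbers ovn-northd has not yet reported as applied.
cfg_lag() {
    echo $(( $1 - $2 ))
}

# Read the counters from a live database (requires ovn-nbctl access).
report_nb_sb_lag() {
    nb=$(ovn-nbctl get NB_Global . nb_cfg)
    sb=$(ovn-nbctl get NB_Global . sb_cfg)
    echo "nb_cfg=$nb sb_cfg=$sb lag=$(cfg_lag "$nb" "$sb")"
}

# Values taken from the nb_global listing above; prints 631.
cfg_lag 2636 2005
```

On a healthy deployment, `ovn-nbctl --wait=sb sync` blocks until ovn-northd has caught up with all prior northbound changes, so it can serve as a checkpoint between batches of a scale test.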

Thanks!

Tony


From: Tony Liu 
Sent: July 21, 2020 08:39 PM
To: Daniel Alvarez 
Cc: Cory Hawkless ; ovs-discuss@openvswitch.org 
; Dumitru Ceara 
Subject: Re: [ovs-discuss] OVN scale

When creating a network (and subnet) on OpenStack, a GW port and a service port
(for DHCP and metadata) are also created. They are created in Neutron and
ovn-nb-db by the ML2 driver. Then ovn-northd translates such updates from NBDB
to SBDB. My question here is: with 20.03, is this translation incremental?

After creating 4000 networks successfully on OpenStack, I see 4000 logical
switches and 8000 LS ports in NBDB. But in SBDB, there are only 1567
port-bindings. The break happened when translating the 1568th port. If
ovn-northd recompiles the whole DB for every update, this problem can be
explained: the DB is too big for ovn-northd to compile in time, so all the
following updates are lost. Does that make sense?

I recall that DB updates are coordinated by some "version": when changes happen
in NBDB, the version bumps up; ovn-northd updates SBDB and bumps up its version
as well, so they match. So, if the NBDB version bumps up more than once while
ovn-northd is updating SBDB, is that still going to work? If yes, then it's
just a matter of time: no matter how fast updates happen in NBDB, ovn-northd
will catch up eventually. Am I right about that?

Any comment is welcome.


Thanks!

Tony



From: Tony Liu 
Sent: July 21, 2020 10:22 AM
To: Daniel Alvarez 
Cc: Cory Hawkless ; ovs-discuss@openvswitch.org 
; Dumitru Ceara 
Subject: Re: [ovs-discuss] OVN scale

Hi Daniel, all

4000 networks and 50 routers, 200 networks on each router, they are all created.
CPU usage of Neutron server, ovn-nb-db, ovn-northd, ovn-sb-db, ovn-controller 
and ovs-vswitchd is OK,
n

Re: [ovs-discuss] [ovs-dev] OVN: Two datapath-bindings are created for the same logical-switch

2020-07-23 Thread Tony Liu
Hi Han,

Thanks for taking the time to look into this. This problem is not consistently 
reproduced.
Developers normally ignore it:) I think we collected enough context and we can 
let it go for now.
I will rebuild setup, tune that RAFT heartbeat timer and rerun the test. Will 
keep you posted.


Thanks again!

Tony


From: Han Zhou 
Sent: July 23, 2020 06:53 PM
To: Tony Liu ; Ben Pfaff 
Cc: Numan Siddique ; ovs-dev ; 
ovs-discuss@openvswitch.org 
Subject: Re: [ovs-dev] OVN: Two datapath-bindings are created for the same 
logical-switch


On Thu, Jul 23, 2020 at 10:33 AM Tony Liu <tonyliu0...@hotmail.com> wrote:
>
> Changed the title for this specific problem.
> I looked into logs and have more findings.
> The problem was happening when sb-db leader switched.

Hi Tony,

Thanks for this detailed information. Could you confirm which version of OVS is 
used (to understand OVSDB behavior).

>
> For ovsdb cluster, what may trigger the leader switch? Given the log,
> 2020-07-21T01:08:38.119Z|00074|raft|INFO|term 2: 1135 ms timeout expired, 
> starting election
> The election is asked by a follower node. Is that because the connection from 
> follower to leader timeout,
> then follower assumes the leader is dead and starts an election?

You are right, the RAFT heartbeat would time out when the server is too busy
and the election timer is too small (default 1s). For large scale tests, please
increase the election timer with:
ovn-appctl -t <target> cluster/change-election-timer OVN_Southbound <time>

I suggest setting <time> to be at least bigger than 1 or more in your case.
(You need to increase the value gradually - 2000, 4000, 8000, 16000 - so it
will take you 4 commands to reach this from the initial default value 1000. Not
very convenient, I know :)

<target> here is the path to the socket ctl file of ovn-sb, usually under
/var/run/ovn.
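
The gradual increase described above can be sketched as a small loop. The helper below computes the intermediate values, at most doubling on each step, and prints the corresponding `ovn-appctl` commands as a dry run instead of executing them. The function name and the `CTL` variable are hypothetical, and the socket file name `ovnsb_db.ctl` is an assumption about the deployment.

```shell
# Print the intermediate values needed to raise a Raft election timer
# from $1 to $2 milliseconds, at most doubling on every step.
election_timer_steps() {
    cur=$1
    target=$2
    while [ "$cur" -lt "$target" ]; do
        cur=$(( cur * 2 ))
        if [ "$cur" -gt "$target" ]; then
            cur=$target
        fi
        echo "$cur"
    done
}

# Dry run: print one command per step instead of running it.
# CTL is an assumed control-socket path; adjust it to the deployment.
CTL=${CTL:-/var/run/ovn/ovnsb_db.ctl}
for ms in $(election_timer_steps 1000 16000); do
    echo ovn-appctl -t "$CTL" cluster/change-election-timer OVN_Southbound "$ms"
done
```

Per the ovsdb-server documentation, the change has to be issued on the current leader, and `cluster/status` shows whether a step has been committed, so a real run would likely check that between steps rather than firing all commands back to back.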

>
> For ovn-northd (3 instances), they all connect to the sb-db leader, and
> whoever has the lock is the master.
> When the sb-db leader switches, all ovn-northd instances look for the new
> leader. In this case, there is no guarantee that the old ovn-northd master
> keeps the role; another ovn-northd instance may find the leader and acquire
> the lock first. So, the sb-db leader switch may also cause an ovn-northd
> master switch.
> Such a switch may happen in the middle of an ovn-northd transaction. In that
> case, is there any guarantee of transaction completeness? My guess is that
> the old master created a datapath-binding for a logical-switch, the switch
> happened before this transaction completed, and then the new master/leader
> created another datapath-binding for the same logical-switch. Does that make
> any sense?

I agree with you that it could be related to the failover and the lock behavior
during the failover. It could be a lock problem causing 2 northds to become
active at the same time for a short moment. However, I still can't imagine how
the duplicated entries are created with different tunnel keys. If both northds
create the datapath binding for the same LS at the same time, they should
allocate the same tunnel key, and then one of them should fail during the
transaction commit because of an index conflict in the DB. But here they have
different keys, so both were inserted in the DB.

(OVSDB transaction is atomic even during failover and no client should see 
partial data of a transaction.)

(cc Ben to comment more on the possibility of both clients acquiring the lock 
during failover)

>
> From the log, when sb-db switched, ovn-northd master connected to the new 
> leader and lost the master,
> but immediately, it acquired the lock and become master again. Not sure how 
> this happened.

From the ovn-northd logs, the ovn-northd on .86 first connected to the SB DB on
.85, which suggests that it regarded .85 as the leader (otherwise it would
disconnect and retry another server). Then, immediately after connecting to
.85 and acquiring the lock, it disconnected because it somehow noticed that
.85 is not the leader. It then retried and connected to .86 (the new leader)
and found out that the lock was already acquired by the .85 northd, so it
switched to standby. The .85 northd luckily connected to .86 on the first try,
so it was able to acquire the lock on the leader node first. Maybe the key
thing is to figure out why the .86 northd initially connected to the .85 DB,
which is not the leader, and acquired the lock.

Thanks,
Han

>
> Here are some loggings.
>  .84 sb-db leader =
> 2020-07-21T01:08:20.221Z|01408|raft|INFO|current entry eid 
> 639238ba-bc00-4efe-bb66-6ac766bb5f4b does not match prerequisite 
> 78e8e167-8b4c-4292-8e25-d9975631b010 in execute_command_request
>
> 2020-07-21T01:08:38.450Z|01409|timeval|WARN|Unreasonably long 1435ms poll 
> interval (1135ms user, 43ms system

Re: [ovs-discuss] [ovs-dev] OVN Controller Incremental Processing

2020-07-23 Thread Tony Liu
Hi Han,

Now, I have 4000 records in logical-switch table in nb-db, only 1567 records in 
datapath-binding
table in sb-db. The translation was broken by a duplication (2 datapath 
bindings point to the same
logical-switch). Not sure how that happened. Anyways, I manually removed this 
duplication.
How can I trigger ovn-northd to finish the translation for all the rest logical 
switches?


Thanks!

Tony


From: Han Zhou 
Sent: July 23, 2020 04:19 PM
To: Tony Liu 
Cc: Han Zhou ; ovs-dev ; ovs-discuss 

Subject: Re: [ovs-dev] OVN Controller Incremental Processing



On Thu, Jul 23, 2020 at 4:12 PM <tonyliu0...@hotmail.com> wrote:
>
> Thanks Han for the quick confirmation!
> That means, when changes are made in nb-db, ovn-northd doesn't recompile the
> whole db; instead, it only updates the increment into sb-db. I am currently
> running some scaling tests and seeing 100% CPU usage, hence asking.
>
Oh, no. The talk was about "ovn-controller", which is the component running on
hypervisors to translate SB data into OVS flows, and this has been implemented
(although not all scenarios are incrementally processed). ovn-northd runs on
the central node to convert data from NB to SB DB, and it does not use
incremental processing yet, so it is expected to see 100% CPU.
There is currently ongoing work on ovn-northd incremental processing, with
DDlog, by Ben and Leonid.

> Tony
>
> On Jul 23, 2020 4:02 PM, Han Zhou <hz...@ovn.org> wrote:
>
>
>
> On Thu, Jul 23, 2020 at 11:17 AM Tony Liu <tonyliu0...@hotmail.com> wrote:
> >
> > Hi,
> >
> > Is this implemented and released?
> > https://www.slideshare.net/hanzhou1978/ovn-controller-incremental-processing
> > Could anyone share an update on this?
> >
> >
> > Thanks!
> >
> > Tony
> >
>
> Yes, it was released initially in OVS/OVN 2.12 (if I remember correctly), and 
> there have been more improvements added gradually since then.
> (The "future" part which talks about DDlog is not implemented yet.)
>
> Thanks,
> Han
>
>


Re: [ovs-discuss] OVN: ovn-sbctl backoff

2020-07-23 Thread Tony Liu
Appreciate it! You connected dots for me.

Tony


From: Han Zhou 
Sent: July 23, 2020 05:10 PM
To: Tony Liu 
Cc: Han Zhou ; ovs-discuss 
Subject: Re: [ovs-discuss] OVN: ovn-sbctl backoff



On Thu, Jul 23, 2020 at 4:34 PM <tonyliu0...@hotmail.com> wrote:
>
> Good to know! I recall I read somewhere saying only the leader takes write 
> request. I will double check!
>
> Well, in that case, I have another question, why is such leader role 
> required? In a quorum based cluster, all nodes are equal. And why does 
> ovn-northd have to connect to the leader?
> Guess I will need to read more about RAFT:)
>
Quick answer: only the leader node does the actual write, so all writes are
redirected to the leader, but they can be initiated from followers. ovn-northd
connects to the leader for better performance because it writes heavily.
The manpage does provide more details, and yes, the RAFT paper has even more.
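
The two connection styles discussed in this thread can be sketched as follows. The helper joins server addresses into the comma-separated remote list that OVSDB clients accept; its name is hypothetical, the IP addresses are the illustrative ones from the quoted example, and port 6642 is an assumption (it is the conventional southbound port, while the quoted example uses 6641).

```shell
# Join server addresses into an OVSDB remote list:
#   tcp:IP:PORT,tcp:IP:PORT,...
build_remotes() {
    port=$1
    shift
    remotes=
    for ip in "$@"; do
        # Prepend a comma only when the list is already non-empty
        remotes="${remotes:+$remotes,}tcp:$ip:$port"
    done
    echo "$remotes"
}

SB_REMOTES=$(build_remotes 6642 10.0.0.2 10.0.0.3 10.0.0.4)
echo "$SB_REMOTES"

# With the full list, ovn-sbctl can move on to another server when the one
# it reached is not the leader (requires a reachable cluster):
#   ovn-sbctl --db="$SB_REMOTES" show
# Reads that do not need the leader can skip that search:
#   ovn-sbctl --no-leader-only show
```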

>
> Thanks!
>
> Tony
>
> On Jul 23, 2020 4:26 PM, Han Zhou <hz...@ovn.org> wrote:
>
>
>
> On Thu, Jul 23, 2020 at 4:07 PM <tonyliu0...@hotmail.com> wrote:
> >
> > Thanks Han for the prompt responses!
> > That option is ok for reading. If I want to write, I have to connect to the 
> > leader, right? Then my question remains, how does ovn-sbctl find out how to 
> > connect to the leader?
> >
>
> RAFT doesn't require you to connect to the leader for writing. You can
> connect to any node and write.
> However, if for any reason you want to connect to the leader, you need to
> specify the DB connection method as: <remote1>,<remote2>,...,<remoteN>. For
> example: tcp:10.0.0.2:6641,tcp:10.0.0.3:6641,tcp:10.0.0.4:6641.
> You can read more details about OVSDB clustering in manpage ovsdb(7).
>
> Thanks,
> Han
>
> > Thanks again!
> >
> > Tony
> >
> > On Jul 23, 2020 3:57 PM, Han Zhou <hz...@ovn.org> wrote:
> >
> >
> >
> > On Thu, Jul 23, 2020 at 3:43 PM Tony Liu <tonyliu0...@hotmail.com> wrote:
> > >
> > > Hi,
> > >
> > > In case of ovsdb cluster, when I run ovn-sbctl, it connects to the unix 
> > > socket of local sb-db.
> > > If local sb-db is not the leader, ovn-sbctl tries another server to look 
> > > for the leader.
> > > How does ovn-sbctl connect to another server? By which connection?
> > > How does ovn-sbctl know the connection?
> > > Or the local sb-db asks to be the leader to respond ovn-sbctl?
> > > I can't figure out how it works from verbose messages.
> > > Any help to clarify is appreciated!
> > >
> > >
> > > Thanks!
> > >
> > > Tony
> >
> > If you don't intentionally try to connect to the leader, you can use 
> > ovn-sbctl --no-leader-only ... to avoid the retry.
> > If you want to avoid typing this option every time, you can export
> > OVN_SBCTL_OPTIONS="--no-leader-only", and then just run ovn-sbctl ...
> > (does this answer your question?)
> >
> > Thanks,
> > Han
> >
> >
>
>


[ovs-discuss] OVN: ovn-sbctl backoff

2020-07-23 Thread Tony Liu
Hi,

In case of ovsdb cluster, when I run ovn-sbctl, it connects to the unix socket 
of local sb-db.
If local sb-db is not the leader, ovn-sbctl tries another server to look for 
the leader.
How does ovn-sbctl connect to another server? By which connection?
How does ovn-sbctl know the connection?
Or the local sb-db asks to be the leader to respond ovn-sbctl?
I can't figure out how it works from verbose messages.
Any help to clarify is appreciated!


Thanks!

Tony



Re: [ovs-discuss] OVN nb-db and sb-db out of sync

2020-07-23 Thread Tony Liu
Hi Numan,

This is how sb-db is brought up.
```
/usr/share/ovn/scripts/ovn-ctl run_sb_ovsdb --db-sb-create-insecure-remote=yes 
--db-sb-addr=10.6.20.84 --db-sb-cluster-local-addr=10.6.20.84  
--db-sock=/run/ovn/ovnsb_db.sock --db-sb-pid=/run/ovn/ovnsb_db.pid 
--db-sb-file=/var/lib/openvswitch/ovn-sb/ovnsb.db 
--ovn-sb-logfile=/var/log/kolla/openvswitch/ovn-sb-db.log
```
The script you pointed to me starts both nb-db and sb-db without 
"run_sb_ovsdb". But I don't think
that really matters. In this case, I assume "ovnsb.db" will be initialized 
properly?

I checked code, that "stale data" is caused by some index mismatch. Any clues?


Thanks!

Tony


From: Numan Siddique 
Sent: July 23, 2020 11:54 AM
To: Tony Liu 
Cc: ovs-dev ; ovs-discuss@openvswitch.org 

Subject: Re: [ovs-discuss] OVN nb-db and sb-db out of sync



On Thu, Jul 23, 2020 at 11:35 PM Tony Liu <tonyliu0...@hotmail.com> wrote:
Hi Numan,

I did each of the followings on all 3 OVN DB nodes.
```
docker stop ovn_sb_db
mv /var/lib/docker/volumes/ovn_sb_db/_data/ovnsb.db 
/var/lib/docker/volumes/ovn_sb_db/_data/ovnsb.db.bak
docker start ovn_sb_db
docker restart ovn_northd
```

I see new DB file is created, but I got complaints from ovn-northd.
```
2020-07-22T23:37:27.274Z|80540|ovsdb_idl|WARN|tcp:10.6.20.84:6642: clustered database server has stale data; trying another server
```

Should I use ovsdb-tool to initialize the DB, instead of relying on ovn-sb-db, 
or something else I am missing?

I would suggest to use ovn-ctl for initializing/starting the cluster.

Please take a look at this as an example - 
https://github.com/ovn-org/ovn-fake-multinode/blob/master/ovn_cluster.sh#L337

Thanks
Numan


I also tried to use "ovn-sbctl destroy" to remove the record, but ovn-sbctl is
stuck there forever.


Thanks!

Tony


From: Numan Siddique <num...@ovn.org>
Sent: July 23, 2020 03:15 AM
To: Tony Liu <tonyliu0...@hotmail.com>
Cc: ovs-dev <ovs-...@openvswitch.org>; ovs-discuss@openvswitch.org
Subject: Re: [ovs-discuss] OVN nb-db and sb-db out of sync



On Thu, Jul 23, 2020 at 8:22 AM Tony Liu <tonyliu0...@hotmail.com> wrote:
Hi,

I see why sb-db broke at the 1568th port-binding.
The 1568th datapath-binding in sb-db references the same logical-switch as
another one:

_uuid   : 108cf745-db82-43c0-a9d3-afe27a41e4aa
external_ids: {logical-switch="8a5d1d3c-e9fc-4cbe-a461-98ff838e6473", 
name=neutron-e907dc17-f1e8-4217-a37d-86e9a98c86c2, name2=net-97-192}
tunnel_key  : 1567

_uuid   : d934ed92-2f3c-4b31-8a76-2a5047a3bb46
external_ids: {logical-switch="8a5d1d3c-e9fc-4cbe-a461-98ff838e6473", 
name=neutron-e907dc17-f1e8-4217-a37d-86e9a98c86c2, name2=net-97-192}
tunnel_key  : 1568

I don't believe this is supposed to happen. Any idea how could it happen?
Then ovn-northd is stuck in trying to delete this duplication, and it ignores 
all the following updates.
That caused out-of-sync between nb-db and sb-db.
Any way I can fix it manually, like with ovn-sbctl to delete it?

If you delete the ovn sb db resources manually, ovn-northd should sync it up.
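
Before deleting anything by hand, it may help to list which logical switches are referenced by more than one datapath binding. The filter below is a minimal sketch that reads the plain `list Datapath_Binding` output, as shown in the two records above, and prints every duplicated `logical-switch` UUID. The function name is hypothetical, and the pattern assumes the `external_ids` layout seen in this thread.

```shell
# Read "ovn-sbctl list Datapath_Binding" output on stdin and print every
# logical-switch UUID that is referenced by more than one row.
duplicate_logical_switches() {
    grep -o 'logical-switch="[0-9a-f-]*"' | cut -d'"' -f2 | sort | uniq -d
}

# Against a live database (requires ovn-sbctl access):
#   ovn-sbctl --no-leader-only list Datapath_Binding | duplicate_logical_switches
```

Which of the duplicated rows is safe to remove is not something this filter can decide; it only narrows down where to look.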

But I'm surprised why ovn-northd didn't sync earlier. There's something wrong 
related to raft going
on here. Not sure what.

Thanks
Numan




Thanks!

Tony


From: dev <ovs-dev-boun...@openvswitch.org> on behalf of Tony Liu <tonyliu0...@hotmail.com>
Sent: July 22, 2020 11:33 AM
To: ovs-dev <ovs-...@openvswitch.org>
Subject: [ovs-dev] OVN nb-db and sb-db out of sync

Hi,

During a scaling test where 4000 networks are created from OpenStack, I see that
nb-db and sb-db are out of sync. All 4000 logical switches and 8000 LS ports
(GW port and service port of each network) are created in nb-db. In sb-db,
there are only 1567 port-bindings, while 4000 are expected.

[root@ovn-db-2 ~]# ovn-nbctl list nb_global
_uuid   : b7b3aa05-f7ed-4dbc-979f-10445ac325b8
connections : []
external_ids: {"neutron:liveness_check_at"="2020-07-22 
04:03:17.726917+00:00"}
hv_cfg  : 312
ipsec   : false
name: ""
nb_cfg  : 2636
options : {mac_prefix="ca:e8:07", 
svc_monitor_mac="4e:d0:3a:80:d4:b7"}
sb_cfg  : 2005
ssl : []

[root@ovn-db-2 ~]# ovn-sbctl list sb_global
_uuid   : 3720bc1d-b0da-47ce-85ca-96fa8d398489
connections : []
external_ids: {}
ipsec   : false
nb_cfg  : 312
options : {mac_prefix="ca:e8:07", 
svc_monitor_mac="4e:d0:3a:80:d4:b7"}
ssl : []

Is there any way to force ovn-northd to rebuild sb-db t
