Re: [ovs-discuss] active_backup failover issue
On Tue, 27 Apr 2021 at 23:08, Numan Siddique wrote:
> Having 3 chassis will not result in this split brain scenario which you
> have probably observed.

I dug a little deeper. I think what I am seeing is an issue that only shows up when just 2 chassis host gateways. ha_chassis_group_is_active reads:

    if (sset_is_empty(active_tunnels)) {
        /* If active tunnel sset is empty, it means it has lost
         * connectivity with other chassis. */
        return false;
    }

I think the code tries to prevent a split brain scenario here: if no tunnel is working, it concludes that the current chassis is the broken one (although there should still be a tunnel working towards a compute).

When there are 2 chassis in the group and the first chassis goes down, the only tunnel is down, and the port is never claimed.

I can solve that by having 3 chassis in the group, or by returning true above when a_ch_grp->n_ha_chassis == 2.

I don't think anyone would practically run with only 2 chassis acting as gateway though!

Thanks
Francois
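To make the proposal above concrete, here is a minimal sketch of the guard together with the suggested 2-chassis exception. It is not the OVN implementation: the struct, the function name and the `allow_two_chassis_exception` flag are hypothetical, and a plain counter stands in for the `active_tunnels` sset. Only `n_ha_chassis` is a name taken from the message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the HA chassis group state.
 * Only n_ha_chassis is a name that appears in the thread. */
struct ha_group_sketch {
    size_t n_ha_chassis;     /* Number of gateway chassis in the group. */
    size_t n_active_tunnels; /* Tunnels to other chassis that are up
                              * (stands in for the active_tunnels sset). */
};

/* Returns false when the local chassis should treat itself as isolated
 * and therefore not claim the gateway port; true otherwise. */
static bool
group_may_be_active_sketch(const struct ha_group_sketch *grp,
                           bool allow_two_chassis_exception)
{
    assert(grp != NULL);

    if (grp->n_active_tunnels == 0) {
        /* Proposed exception: with exactly 2 chassis, the only tunnel
         * going down is also what a dead peer looks like, so keep going. */
        if (allow_two_chassis_exception && grp->n_ha_chassis == 2) {
            return true;
        }
        /* Existing guard: no live tunnel is read as "this chassis has
         * lost connectivity with the other chassis". */
        return false;
    }
    return true;
}
```

The trade-off in this sketch is that, with the exception enabled, a broken link between two healthy chassis looks the same as a dead peer, so both sides would consider themselves eligible.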
Re: [ovs-discuss] active_backup failover issue
On Tue, Apr 27, 2021 at 6:00 PM Francois wrote:
> I am going to do a bit more research and see what happens on some
> real OpenStack installation, maybe I messed up somewhere.
>
> There is nothing logged in the ovn-controller, and nothing flooding
> the DB (+one line saying port_binding is down). My understanding was
> that the move of gateway (as it happens for chassis-3) happens
> without the involvement of the control plane, in other words in case
> the first gateway fails, the flows to move to the second gateway are
> already installed and can be used straight away.
>
> I am puzzled because if I trace the packet from chassis-2 before and
> after chassis-1 dies, it always ends up in flow
>
> 37. reg15=0x3,metadata=0x4, priority 100, cookie 0x7a15360f
>     set_field:0x4/0xff->tun_id
>     set_field:0x3->tun_metadata0
>     move:NXM_NX_REG14[0..14]->NXM_NX_TUN_METADATA0[16..30]
>      -> NXM_NX_TUN_METADATA0[16..30] is now 0x1
>     bundle(eth_src,0,active_backup,ofport,members:7)
>
> Only difference is, when chassis-1 is up, the added
>      -> output to kernel tunnel
>
> It seems that there is no backup flow for packets not going through a
> tunnel, straight to external.

I think it is expected, because ovn-controller of chassis-1 has claimed the gateway port (i.e cr-

> Before tackling the tricky cases, I would like to make it work when
> it fails "as documented" :), just one chassis dying but traffic being
> quickly dispatched somewhere else.
>
> Thanks
Re: [ovs-discuss] active_backup failover issue
On Tue, 27 Apr 2021 at 23:08, Numan Siddique wrote:
> I see what's going on. So ovn-controller on chassis-2 detects the failover
> and claims the cr-. But ovn-controller on chassis-1 which has
> higher priority claims it back because according to it, BFD is fine.
>
> You can probably monitor the ovn-controller logs on both chassis, and you
> might notice claim/release logs.
>
> Or you can do "tail -f ovnsb_db.db" and see that there are constant updates
> to the cr-.
>
> Having 3 chassis will not result in this split brain scenario which you have
> probably observed.

I am going to do a bit more research and see what happens on some real OpenStack installation, maybe I messed up somewhere.

There is nothing logged in the ovn-controller, and nothing flooding the DB (+one line saying port_binding is down). My understanding was that the move of gateway (as it happens for chassis-3) happens without the involvement of the control plane, in other words in case the first gateway fails, the flows to move to the second gateway are already installed and can be used straight away.

I am puzzled because if I trace the packet from chassis-2 before and after chassis-1 dies, it always ends up in flow

37. reg15=0x3,metadata=0x4, priority 100, cookie 0x7a15360f
    set_field:0x4/0xff->tun_id
    set_field:0x3->tun_metadata0
    move:NXM_NX_REG14[0..14]->NXM_NX_TUN_METADATA0[16..30]
     -> NXM_NX_TUN_METADATA0[16..30] is now 0x1
    bundle(eth_src,0,active_backup,ofport,members:7)

Only difference is, when chassis-1 is up, the added
     -> output to kernel tunnel

It seems that there is no backup flow for packets not going through a tunnel, straight to external.

Before tackling the tricky cases, I would like to make it work when it fails "as documented" :), just one chassis dying but traffic being quickly dispatched somewhere else.

Thanks
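The trace above ends in a `bundle(...,active_backup,...)` action with a single member. As a rough illustration of why that leaves nowhere to send the packet once the member is down, here is a sketch of active_backup selection (pick the first live member in list order). The struct, the function name and the `NO_MEMBER` marker are hypothetical; this is not the OVS source.

```c
#include <stdbool.h>
#include <stddef.h>

#define NO_MEMBER (-1) /* Hypothetical marker for "no output port". */

/* Hypothetical view of one bundle member. */
struct bundle_member_sketch {
    int ofport; /* OpenFlow port number of the tunnel port. */
    bool live;  /* Whether the member is currently considered up. */
};

/* active_backup: return the first live member in list order, or
 * NO_MEMBER when every member is down. */
static int
active_backup_select_sketch(const struct bundle_member_sketch *members,
                            size_t n_members)
{
    for (size_t i = 0; i < n_members; i++) {
        if (members[i].live) {
            return members[i].ofport;
        }
    }
    return NO_MEMBER;
}
```

With `members:7` as the only entry, a dead port 7 yields no output port at all in this model, which would match the missing "output to kernel tunnel" line. With a second member, as in the 3-chassis setup, the selection would simply move on to the next live port.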
Re: [ovs-discuss] active_backup failover issue
On Tue, Apr 27, 2021 at 4:58 PM Francois wrote:
> On Tue, 27 Apr 2021 at 22:20, Numan Siddique wrote:
> > The ovn-controller running on chassis-1 will not detect the BFD failover.
>
> Thanks for your answer! Ok for chassis-1.
>
> What I don't understand is why chassis-2, who is aware that chassis-1
> is down, is not able to act as a gateway for its own ports.

I see what's going on. So ovn-controller on chassis-2 detects the failover and claims the cr-. But ovn-controller on chassis-1 which has higher priority claims it back because according to it, BFD is fine.

You can probably monitor the ovn-controller logs on both chassis, and you might notice claim/release logs.

Or you can do "tail -f ovnsb_db.db" and see that there are constant updates to the cr-.

Having 3 chassis will not result in this split brain scenario which you have probably observed.

Thanks
Numan
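The claim/release ping-pong described above can be pictured with a toy model in which every chassis decides on its own, from its local view of which peers are alive, whether it is the highest-priority candidate. This is only an illustration of the scenario, not OVN's election code; all names in it are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical local view held by one chassis: priorities are shared
 * configuration, liveness is judged locally (BFD in the thread). */
struct chassis_view_sketch {
    size_t self;            /* Index of the local chassis. */
    size_t n_chassis;       /* Number of chassis in the group. */
    const int *priority;    /* Priority per chassis, higher wins. */
    const bool *peer_alive; /* Local belief: is chassis i reachable? */
};

/* A chassis claims the gateway port unless it believes a
 * higher-priority peer is alive. */
static bool
would_claim_sketch(const struct chassis_view_sketch *view)
{
    for (size_t i = 0; i < view->n_chassis; i++) {
        if (i != view->self
            && view->peer_alive[i]
            && view->priority[i] > view->priority[view->self]) {
            return false;
        }
    }
    return true;
}
```

In this model, chassis-1 (highest priority) always wants the port, and chassis-2 also wants it as soon as it believes chassis-1 is dead. If the two views disagree, both claim, which is the split brain described in the message.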
Re: [ovs-discuss] active_backup failover issue
On Tue, 27 Apr 2021 at 22:20, Numan Siddique wrote:
> The ovn-controller running on chassis-1 will not detect the BFD failover.

Thanks for your answer! Ok for chassis-1.

What I don't understand is why chassis-2, who is aware that chassis-1 is down, is not able to act as a gateway for its own ports.

Francois
Re: [ovs-discuss] active_backup failover issue
On Tue, Apr 27, 2021 at 9:11 AM Francois wrote:
> Hello OpenvSwitch!
>
> I have 2 chassis with external connectivity, chassis-1 hosts port-1
> and chassis-2 hosts port-2. SNAT is done through a gateway hosted on
> chassis-1, and both chassis exchange BFD. There is no floating IP.
>
> I see chassis-1 does not have any flow for tunnelling, which is logical
> since it hosts the gateway. Traffic goes straight to the external port
> of the chassis, which is fine.
>
> I see however, chassis-2 having an extra flow:
>
> cookie=0x7a15360f, duration=4116.970s, table=37, n_packets=1471,
> n_bytes=144158, priority=100,reg15=0x3,metadata=0x4
> actions=load:0x4->NXM_NX_TUN_ID[0..23],set_field:0x3->tun_metadata0,move:NXM_NX_REG14[0..14]->NXM_NX_TUN_METADATA0[16..30],bundle(eth_src,0,active_backup,ofport,members:"ovn-chassi-0")
>
> In my case I have only 2 chassis, the bundle only contains a single member.
>
> I am now killing the ovs process from chassis-1. Chassis-2 properly
> detects that chassis-1 is dead, however packets going out are still
> using this flow, and are not sent outside.
>
> If I add a third chassis chassis-3, I see it monitors properly
> chassis-1 and chassis-2, and the bundle members contain both chassis.
> This case is fine and chassis-2 does the SNAT for chassis-3.
>
> I am wondering if there is something wrong with my set-up. I would
> expect that when chassis-1 dies and the gateway fails over to
> chassis-2, traffic from port-2 actually goes out from chassis-2. It
> should not be dropped (or be sent to the next chassis in the list,
> although I did not try this). Any help would be much appreciated!
>
> (this should be the master branch of ovn).

ovn-controller comes to know about BFD failures when ovs-vswitchd detects them and updates the OVS interface BFD information in the local ovs conf.db. In your case, since you killed the ovs process, the BFD status is not updated in the local ovs conf.db. The ovn-controller running on chassis-1 will not detect the BFD failover.

The other issue is that, since ovs-vswitchd is down, ovn-controller will lose connectivity to ovs-vswitchd.

Also, since ovs-vswitchd is down, the traffic originating from the VMs in that chassis will go through fine if there are datapath flows. Any new traffic will be dropped anyway since there is no ovs-vswitchd to handle the upcall.

In my opinion, the correct way to test is to disconnect chassis-1 from your physical network rather than killing ovs-vswitchd.

Thanks
Numan

> Thanks