Re: [ovs-discuss] active_backup failover issue

2021-04-29 Thread Francois
> > On Tue, 27 Apr 2021 at 23:08, Numan Siddique wrote:

> > > Having 3 chassis will not result in this split brain scenario which you
> > > have probably observed.

I dug a little deeper. I think what I am experiencing is an issue that only
occurs when just 2 chassis host gateways.

ha_chassis_group_is_active reads:

    if (sset_is_empty(active_tunnels)) {
        /* If active tunnel sset is empty, it means it has lost
         * connectivity with other chassis. */
        return false;
    }

I think the code is trying to prevent a split-brain scenario here: if no
tunnel is working, it assumes the current chassis is the broken one
(although there should still be a working tunnel towards a compute node).

When there are 2 chassis in the group and the first chassis goes down, the
only tunnel is down, so the port is never claimed. I can solve that either by
having 3 chassis in the group or by returning true (when
a_ch_grp->n_ha_chassis == 2) above.
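
To make the two options concrete, here is a minimal, self-contained sketch of
the early exit and of the two-chassis exception I am suggesting. It is not the
OVN code: the struct, the enum, and the function name are hypothetical, and the
active_tunnels sset is reduced to a simple count.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the HA chassis group record; only the
 * member count used in the discussion is modelled. */
struct ha_group_sketch {
    size_t n_ha_chassis;    /* number of chassis in the group */
};

/* Possible outcomes of the early tunnel check in this sketch. */
enum sketch_decision {
    SKETCH_INACTIVE,        /* "return false": do not claim the port */
    SKETCH_ACTIVE,          /* "return true": claim the port */
    SKETCH_CHECK_PRIORITY   /* tunnels are up: go on to compare priorities */
};

/* n_active_tunnels stands in for the size of the active_tunnels sset.
 * two_chassis_exception enables the workaround suggested above. */
static enum sketch_decision
tunnel_check_sketch(const struct ha_group_sketch *grp,
                    size_t n_active_tunnels,
                    bool two_chassis_exception)
{
    assert(grp != NULL);

    if (n_active_tunnels == 0) {
        /* No working tunnel. With exactly 2 chassis, the only tunnel
         * is the one to the peer, so the peer may simply be dead. */
        if (two_chassis_exception && grp->n_ha_chassis == 2) {
            return SKETCH_ACTIVE;
        }
        return SKETCH_INACTIVE;
    }
    return SKETCH_CHECK_PRIORITY;
}
```

With a group of 2 and no active tunnel, the sketch returns SKETCH_INACTIVE
without the exception (the port is never claimed) and SKETCH_ACTIVE with it.
The exception gives up split-brain protection for 2-chassis groups, since an
isolated chassis would also claim the port.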

In practice, I don't think anyone would run with only 2 chassis acting as
gateways, though!

Thanks
Francois
___
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss


Re: [ovs-discuss] active_backup failover issue

2021-04-27 Thread Numan Siddique
On Tue, Apr 27, 2021 at 6:00 PM Francois wrote:
>
> On Tue, 27 Apr 2021 at 23:08, Numan Siddique wrote:
> >
> > On Tue, Apr 27, 2021 at 4:58 PM Francois wrote:
> > >
> > > On Tue, 27 Apr 2021 at 22:20, Numan Siddique wrote:
> > > >
> > > > On Tue, Apr 27, 2021 at 9:11 AM Francois wrote:
> > > > >
> > >
> > > > The ovn-controller running on chassis-1 will not detect the BFD failover.
> > >
> > > Thanks for your answer! Ok for chassis-1.
> > >
> > > What I don't understand is why chassis-2, who is aware that chassis-1
> > > is down, is not able to act as a gateway for its own ports.
> >
> > I see what's going on.  So ovn-controller on chassis-2 detects the failover
> > and claims the cr-. But ovn-controller on chassis-1 which has
> > higher priority claims it back because according to it, BFD is fine.
> >
> > You can probably monitor the ovn-controller logs on both chassis, and you
> > might notice claim/release logs.
> >
> > Or you can do "tail -f ovnsb_db.db" and see that there are constant updates
> > to the cr-.
> >
> > Having 3 chassis will not result in this split brain scenario which you have
> > probably observed.
>
> I am going to do a bit more research and see what happens on a
> real OpenStack installation; maybe I messed up somewhere.
>
> Nothing is logged by ovn-controller, and nothing is flooding
> the DB (apart from one line saying port_binding is down). My
> understanding was that moving the gateway (as happens for chassis-3)
> does not involve the control plane; in other words, if the first
> gateway fails, the flows for moving to the second gateway are
> already installed and can be used straight away.
>
> I am puzzled because if I trace the packet from chassis-2 before and
> after chassis-1 dies, it always ends up in this flow:
>
> 37. reg15=0x3,metadata=0x4, priority 100, cookie 0x7a15360f
> set_field:0x4/0xff->tun_id
> set_field:0x3->tun_metadata0
> move:NXM_NX_REG14[0..14]->NXM_NX_TUN_METADATA0[16..30]
>  -> NXM_NX_TUN_METADATA0[16..30] is now 0x1
> bundle(eth_src,0,active_backup,ofport,members:7)
>
> The only difference is that, when chassis-1 is up, the trace also shows
>  -> output to kernel tunnel
>
> It seems that there is no backup flow for packets that should go
> straight to the external network rather than through a tunnel.

I think it is expected, because ovn-controller of chassis-1 has claimed
the gateway port (i.e cr-
> Before tackling the tricky cases, I would like to make it work when
> it fails "as documented" :), just one chassis dying but traffic being
> quickly dispatched somewhere else.
>
> Thanks


Re: [ovs-discuss] active_backup failover issue

2021-04-27 Thread Francois
On Tue, 27 Apr 2021 at 23:08, Numan Siddique wrote:
>
> On Tue, Apr 27, 2021 at 4:58 PM Francois wrote:
> >
> > On Tue, 27 Apr 2021 at 22:20, Numan Siddique wrote:
> > >
> > > On Tue, Apr 27, 2021 at 9:11 AM Francois wrote:
> > > >
> >
> > > The ovn-controller running on chassis-1 will not detect the BFD failover.
> >
> > Thanks for your answer! Ok for chassis-1.
> >
> > What I don't understand is why chassis-2, who is aware that chassis-1
> > is down, is not able to act as a gateway for its own ports.
>
> I see what's going on.  So ovn-controller on chassis-2 detects the failover
> and claims the cr-. But ovn-controller on chassis-1 which has
> higher priority claims it back because according to it, BFD is fine.
>
> You can probably monitor the ovn-controller logs on both chassis, and you
> might notice claim/release logs.
>
> Or you can do "tail -f ovnsb_db.db" and see that there are constant updates
> to the cr-.
>
> Having 3 chassis will not result in this split brain scenario which you have
> probably observed.

I am going to do a bit more research and see what happens on a
real OpenStack installation; maybe I messed up somewhere.

Nothing is logged by ovn-controller, and nothing is flooding
the DB (apart from one line saying port_binding is down). My
understanding was that moving the gateway (as happens for chassis-3)
does not involve the control plane; in other words, if the first
gateway fails, the flows for moving to the second gateway are
already installed and can be used straight away.

I am puzzled because if I trace the packet from chassis-2 before and
after chassis-1 dies, it always ends up in this flow:

37. reg15=0x3,metadata=0x4, priority 100, cookie 0x7a15360f
set_field:0x4/0xff->tun_id
set_field:0x3->tun_metadata0
move:NXM_NX_REG14[0..14]->NXM_NX_TUN_METADATA0[16..30]
 -> NXM_NX_TUN_METADATA0[16..30] is now 0x1
bundle(eth_src,0,active_backup,ofport,members:7)

The only difference is that, when chassis-1 is up, the trace also shows
 -> output to kernel tunnel

It seems that there is no backup flow for packets that should go
straight to the external network rather than through a tunnel.
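
For reference, here is how I read the active_backup algorithm of the bundle
action, as a small self-contained sketch: all flows go to the first live member
in the list. The function name and the NO_MEMBER sentinel are hypothetical, and
"live" is reduced to a boolean per member. With a single member (members:7 in
the trace above) whose tunnel is down, nothing is selected, which would match
the missing "output to kernel tunnel" line.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical sentinel meaning "no live member, nothing to output to". */
#define NO_MEMBER ((size_t) -1)

/* Returns the index of the first live member, or NO_MEMBER if every
 * member is down. member_live[i] is true when member i is considered
 * live (for a tunnel port, when its BFD session is up). */
static size_t
active_backup_pick_sketch(const bool *member_live, size_t n_members)
{
    assert(member_live != NULL || n_members == 0);

    for (size_t i = 0; i < n_members; i++) {
        if (member_live[i]) {
            return i;
        }
    }
    return NO_MEMBER;
}
```

With two members where only the second is live, the sketch picks index 1,
which is the chassis-3 case. With one dead member it returns NO_MEMBER, which
is my case on chassis-2.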

Before tackling the tricky cases, I would like to make it work when
it fails "as documented" :), just one chassis dying but traffic being
quickly dispatched somewhere else.

Thanks



Re: [ovs-discuss] active_backup failover issue

2021-04-27 Thread Numan Siddique
On Tue, Apr 27, 2021 at 4:58 PM Francois wrote:
>
> On Tue, 27 Apr 2021 at 22:20, Numan Siddique wrote:
> >
> > On Tue, Apr 27, 2021 at 9:11 AM Francois wrote:
> > >
>
> > The ovn-controller running on chassis-1 will not detect the BFD failover.
>
> Thanks for your answer! Ok for chassis-1.
>
> What I don't understand is why chassis-2, who is aware that chassis-1
> is down, is not able to act as a gateway for its own ports.

I see what's going on.  So ovn-controller on chassis-2 detects the failover
and claims the cr-. But ovn-controller on chassis-1 which has
higher priority claims it back because according to it, BFD is fine.
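
As a rough illustration of that claim/release loop, here is a toy model in
which each chassis decides from its own BFD view only. This is not
ovn-controller's logic; the struct, the function names, and the round-based
loop are all assumptions made for the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-chassis view: does this chassis believe that some
 * higher-priority chassis in the group is still alive? */
struct chassis_view_sketch {
    int id;                          /* positive chassis number */
    bool sees_higher_prio_alive;     /* taken from local BFD state only */
};

/* A chassis wants the port when, in its own view, nobody with a
 * higher priority is alive. */
static bool
wants_port_sketch(const struct chassis_view_sketch *c)
{
    return !c->sees_higher_prio_alive;
}

/* Lets every chassis act once per round and returns how many times
 * ownership of the port changed over all rounds. */
static size_t
count_claims_sketch(const struct chassis_view_sketch *chassis,
                    size_t n_chassis, size_t rounds)
{
    int owner = -1;                  /* nobody owns the port at first */
    size_t changes = 0;

    assert(chassis != NULL || n_chassis == 0);

    for (size_t r = 0; r < rounds; r++) {
        for (size_t i = 0; i < n_chassis; i++) {
            if (wants_port_sketch(&chassis[i]) && owner != chassis[i].id) {
                owner = chassis[i].id;   /* claim, implicitly releasing */
                changes++;
            }
        }
    }
    return changes;
}
```

If chassis-1 (highest priority) and chassis-2 (which believes chassis-1 is
dead) both want the port, ownership changes twice per round, which is the kind
of constant update you would see. If chassis-2 sees chassis-1 as alive, there
is a single claim and nothing afterwards.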

You can probably monitor the ovn-controller logs on both chassis, and you
might notice claim/release logs.

Or you can do "tail -f ovnsb_db.db" and see that there are constant updates
to the cr-.

Having 3 chassis will not result in this split brain scenario which you have
probably observed.

Thanks
Numan




Re: [ovs-discuss] active_backup failover issue

2021-04-27 Thread Francois
On Tue, 27 Apr 2021 at 22:20, Numan Siddique wrote:
>
> On Tue, Apr 27, 2021 at 9:11 AM Francois wrote:
> >

> The ovn-controller running on chassis-1 will not detect the BFD failover.

Thanks for your answer! Ok for chassis-1.

What I don't understand is why chassis-2, who is aware that chassis-1
is down, is not able to act as a gateway for its own ports.

Francois


Re: [ovs-discuss] active_backup failover issue

2021-04-27 Thread Numan Siddique
On Tue, Apr 27, 2021 at 9:11 AM Francois wrote:
>
> Hello OpenvSwitch!
> I have 2 chassis with external connectivity, chassis-1 hosts port-1
> and chassis-2 hosts port-2. SNAT is done through a gateway hosted on
> chassis-1, and both chassis exchange BFD. There is no floating IP.
>
> I see chassis-1 does not have any flow for tunnelling, which is logical
> since it hosts the gateway. Traffic goes straight to the external port
> of the chassis, which is fine.
> However, I see chassis-2 has an extra flow:
>
>  cookie=0x7a15360f, duration=4116.970s, table=37, n_packets=1471,
> n_bytes=144158, priority=100,reg15=0x3,metadata=0x4
> actions=load:0x4->NXM_NX_TUN_ID[0..23],set_field:0x3->tun_metadata0,move:NXM_NX_REG14[0..14]->NXM_NX_TUN_METADATA0[16..30],bundle(eth_src,0,active_backup,ofport,members:"ovn-chassi-0")
>
> Since I have only 2 chassis, the bundle contains only a single member.
>
> I then kill the ovs process on chassis-1. Chassis-2 properly
> detects that chassis-1 is dead; however, outgoing packets still
> use this flow and are not sent outside.
>
> If I add a third chassis chassis-3, I see it monitors properly
> chassis-1 and chassis-2, and the bundle members contain both chassis.
> This case is fine and chassis-2 does the SNAT for chassis-3.
>
> I am wondering if there is something wrong with my set-up. I would
> expect that when chassis-1 dies and the gateway fails over to
> chassis-2, traffic from port-2 actually goes out from chassis-2. It
> should not be dropped (or be sent to the next chassis in the list,
> although I did not try this). Any help would be much appreciated!
>
> (this should be the master branch of ovn).

ovn-controller learns about BFD failures when ovs-vswitchd detects them and
updates the BFD information of the OVS interface in the local ovs conf.db.
In your case, since you killed the ovs process, the BFD status is not updated
in the local ovs conf.db, so the ovn-controller running on chassis-1 will not
detect the BFD failover.

The other issue is that, since ovs-vswitchd is down, ovn-controller loses its
connection to ovs-vswitchd. Also, with ovs-vswitchd down, traffic originating
from the VMs on that chassis will still go through fine if there are datapath
flows. Any new traffic will be dropped anyway, since there is no ovs-vswitchd
to handle the upcall.

In my opinion, the correct way to test is to disconnect chassis-1 from your
physical network rather than killing ovs-vswitchd.

Thanks
Numan

> Thanks