Re: [ovs-discuss] OVN at scale in production

2021-10-15 Thread Han Zhou
On Fri, Oct 15, 2021 at 3:53 AM Seena Fallah  wrote:

> In the case of having many projects each project has at least 2 security
> groups and each security group has 5 ACLs this ACL number should be not
> very high I think.
>

OK, assume each project has 2 x 5 = 10 ACLs; 100k ACLs then means you have
about 10k projects. That is not a small number. If each project has its own
LRs and LSes and 10 ~ 100 workloads, that is a really big deployment. If the
projects share the LRs and LSes and each has only a few workloads, it may be
OK. Still, regardless of the scale, I am surprised that you hit scale
problems in NB but not in SB.
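
To make the arithmetic explicit, here is a minimal sketch. The function name
and defaults are only illustrative, and it assumes every project has exactly
2 security groups with 5 ACLs each, which a real deployment will not match
exactly.

```python
def projects_for(total_acls: int,
                 groups_per_project: int = 2,
                 acls_per_group: int = 5) -> int:
    """Number of projects implied by a total ACL count.

    Assumes every project has the same number of security groups and
    every group the same number of ACLs (a simplification).
    """
    acls_per_project = groups_per_project * acls_per_group
    return total_acls // acls_per_project


# 100k ACLs at 10 ACLs per project
print(projects_for(100_000))
```

Multiplying that project count by the LRs, LSes and workloads per project
gives a rough idea of how much data ends up in the SB DB.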


> In ovs scenario, I have 250K ACLs and everything works fine!
>

What do you mean by ACLs in ovs? Does 250k ACLs mean 250k OVS flows? A single
OVN ACL can easily be translated into thousands of OVS flows if it references
a big address set. So I am not sure what these numbers mean exactly in your
deployment.
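
As a rough illustration of why the two numbers are hard to compare, the
sketch below uses a deliberately crude model: one OVS flow per address for
each ACL that references an address set. That model is an assumption made for
illustration only; the real flow count depends on the match expression,
conjunctive matches, port groups and the number of datapaths involved.

```python
def estimated_flows(acls: int, addresses_per_set: int) -> int:
    """Crude estimate: one OVS flow per referenced address, per ACL.

    This is NOT how ovn-northd/ovn-controller compute flows; it only
    shows how quickly ACLs can multiply into flows.
    """
    return acls * addresses_per_set


# Under this model, 250 OVN ACLs that each reference a 1000-entry
# address set already correspond to 250k OVS flows.
print(estimated_flows(250, 1000))
```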

> Do you think OVN is not ready for this number of ACLs?
> I'm switching from ovs to ovn.
>
> On Fri, Oct 15, 2021 at 4:41 AM Han Zhou  wrote:
>
>>
>>
>> On Thu, Oct 14, 2021 at 7:25 AM Seena Fallah 
>> wrote:
>>
>>> It's mostly on nb.
>>>
>> I am surprised since we usually don't see any scale problem for the NB DB
>> servers, because usually SB data size is much bigger and also number of
>> clients are much bigger than NB DB. So if there are scale problems it would
>> always happen on SB already before NB hits any limit.
>> You would see NB scale problem but not on SB probably because ovn-northd
>> couldn't even translate the NB data to SB yet because of the NB problem you
>> hit. I'd suggest to start with smaller scale, and make sure it works end to
>> end, and then enlarge it gradually, then you would see the real limit.
>> Somehow 100k ACLs sound scary to me. Usually the number of ACLs is not so
>> big but each ACL could reference big address-sets and port-groups. You
>> could probably give more details about your topology and what your typical
>> ACLs look like.
>>
>>
>>> Yes, I set that value before to 6 but it didn't help!
>>>
>>> On Sun, Oct 10, 2021 at 10:34 PM Han Zhou  wrote:
>>>


>>>> On Sat, Oct 9, 2021 at 12:02 PM Seena Fallah
>>>> wrote:
>>>> >
>>>> > Also I get many logs like this in ovn:
>>>> >
>>>> > 2021-10-09T18:54:45.263Z|01151|jsonrpc|WARN|Dropped 6 log messages in
>>>> last 8 seconds (most recently, 3 seconds ago) due to excessive rate
>>>> > 2021-10-09T18:54:45.263Z|01152|jsonrpc|WARN|tcp:10.0.0.1:44454:
>>>> receive error: Connection reset by peer
>>>> > 2021-10-09T18:54:45.263Z|01153|reconnect|WARN|tcp:10.0.01:44454:
>>>> connection dropped (Connection reset by peer)
>>>> > 2021-10-09T18:54:46.798Z|01154|reconnect|WARN|tcp:10.0.0.2:50224:
>>>> connection dropped (Connection reset by peer)
>>>> > 2021-10-09T18:54:49.127Z|01155|reconnect|WARN|tcp:10.0.0.3:48514:
>>>> connection dropped (Connection reset by peer)
>>>> > 2021-10-09T18:54:51.241Z|01156|reconnect|WARN|tcp:10.0.0.3:48544:
>>>> connection dropped (Connection reset by peer)
>>>> > 2021-10-09T18:54:53.005Z|01157|reconnect|WARN|tcp:10.0.0.3:48846:
>>>> connection dropped (Connection reset by peer)
>>>> > 2021-10-09T18:54:53.246Z|01158|reconnect|WARN|tcp:10.0.0.3:48796:
>>>> connection dropped (Connection reset by peer)
>>>> >
>>>> > What does it mean about excessive rate? How many req/s is going to be
>>>> an excessive rate?
>>>>
>>>> Don't worry about "excessive rate", which is talking about the log rate
>>>> limit itself.
>>>> The "connection reset by peer" indicates client side inactivity probe
>>>> is enabled and it disconnects when the server hasn't responded for a while.
>>>> What server is this? NB or SB? Usually SB DB would have this problem if
>>>> there are lots of nodes and if the inactivity probe is not adjusted on the
>>>> nodes (ovn-controllers). Try: ovs-vsctl set open .
>>>> external_ids:ovn-remote-probe-interval=10 on each node.
>>>>
>>>> >
>>>> > On Thu, Oct 7, 2021 at 12:46 AM Seena Fallah
>>>> wrote:
>>>> >>
>>>> >> Seems the most leader failure is for NB and the command you said is
>>>> for SB.
>>>> >>
>>>> >> Do you have any benchmarks of how many ACLs can OVN perform normally?
>>>> >> I see many failures after 100k ACLs.
>>>> >>
>>>> >> On Thu, Oct 7, 2021 at 12:14 AM Numan Siddique
>>>> wrote:
>>>> >>>
>>>> >>> On Wed, Oct 6, 2021 at 2:49 PM Seena Fallah
>>>> wrote:
>>>> >>> >
>>>> >>> > I'm using these versions on a centos container:
>>>> >>> > ovsdb-server (Open vSwitch) 2.15.2
>>>> >>> > ovn-nbctl 21.06.0
>>>> >>> > Open vSwitch Library 2.15.90
>>>> >>> > DB Schema 5.32.0
>>>> >>> >
>>>> >>> > Today I see the election timed out too and I should increase
>>>> ovsdb election timeout too. I saw the commits but I didn't find any related
>>>> change to my problem.
>>>> >>> > If I use ovn 21.09 with ovsdb 2.16 Is there still any need to
>>>> increase election timeout and disable the inactivity probe?
>>>> >>>
>>>> >>> Not sure on that

Re: [ovs-discuss] OVN at scale in production

2021-10-15 Thread Seena Fallah
With many projects, where each project has at least 2 security groups and
each security group has 5 ACLs, I don't think this number of ACLs is very
high.
In my ovs scenario I have 250K ACLs and everything works fine!
Do you think OVN is not ready for this number of ACLs?
I'm switching from ovs to ovn.

On Fri, Oct 15, 2021 at 4:41 AM Han Zhou  wrote:

>
>
> On Thu, Oct 14, 2021 at 7:25 AM Seena Fallah 
> wrote:
>
>> It's mostly on nb.
>>
> I am surprised since we usually don't see any scale problem for the NB DB
> servers, because usually SB data size is much bigger and also number of
> clients are much bigger than NB DB. So if there are scale problems it would
> always happen on SB already before NB hits any limit.
> You would see NB scale problem but not on SB probably because ovn-northd
> couldn't even translate the NB data to SB yet because of the NB problem you
> hit. I'd suggest to start with smaller scale, and make sure it works end to
> end, and then enlarge it gradually, then you would see the real limit.
> Somehow 100k ACLs sound scary to me. Usually the number of ACLs is not so
> big but each ACL could reference big address-sets and port-groups. You
> could probably give more details about your topology and what your typical
> ACLs look like.
>
>
>> Yes, I set that value before to 6 but it didn't help!
>>
>> On Sun, Oct 10, 2021 at 10:34 PM Han Zhou  wrote:
>>
>>>
>>>
>>> On Sat, Oct 9, 2021 at 12:02 PM Seena Fallah 
>>> wrote:
>>> >
>>> > Also I get many logs like this in ovn:
>>> >
>>> > 2021-10-09T18:54:45.263Z|01151|jsonrpc|WARN|Dropped 6 log messages in
>>> last 8 seconds (most recently, 3 seconds ago) due to excessive rate
>>> > 2021-10-09T18:54:45.263Z|01152|jsonrpc|WARN|tcp:10.0.0.1:44454:
>>> receive error: Connection reset by peer
>>> > 2021-10-09T18:54:45.263Z|01153|reconnect|WARN|tcp:10.0.01:44454:
>>> connection dropped (Connection reset by peer)
>>> > 2021-10-09T18:54:46.798Z|01154|reconnect|WARN|tcp:10.0.0.2:50224:
>>> connection dropped (Connection reset by peer)
>>> > 2021-10-09T18:54:49.127Z|01155|reconnect|WARN|tcp:10.0.0.3:48514:
>>> connection dropped (Connection reset by peer)
>>> > 2021-10-09T18:54:51.241Z|01156|reconnect|WARN|tcp:10.0.0.3:48544:
>>> connection dropped (Connection reset by peer)
>>> > 2021-10-09T18:54:53.005Z|01157|reconnect|WARN|tcp:10.0.0.3:48846:
>>> connection dropped (Connection reset by peer)
>>> > 2021-10-09T18:54:53.246Z|01158|reconnect|WARN|tcp:10.0.0.3:48796:
>>> connection dropped (Connection reset by peer)
>>> >
>>> > What does it mean about excessive rate? How many req/s is going to be
>>> an excessive rate?
>>>
>>> Don't worry about "excessive rate", which is talking about the log rate
>>> limit itself.
>>> The "connection reset by peer" indicates client side inactivity probe is
>>> enabled and it disconnects when the server hasn't responded for a while.
>>> What server is this? NB or SB? Usually SB DB would have this problem if
>>> there are lots of nodes and if the inactivity probe is not adjusted on the
>>> nodes (ovn-controllers). Try: ovs-vsctl set open .
>>> external_ids:ovn-remote-probe-interval=10 on each node.
>>>
>>> >
>>> > On Thu, Oct 7, 2021 at 12:46 AM Seena Fallah 
>>> wrote:
>>> >>
>>> >> Seems the most leader failure is for NB and the command you said is
>>> for SB.
>>> >>
>>> >> Do you have any benchmarks of how many ACLs can OVN perform normally?
>>> >> I see many failures after 100k ACLs.
>>> >>
>>> >> On Thu, Oct 7, 2021 at 12:14 AM Numan Siddique 
>>> wrote:
>>> >>>
>>> >>> On Wed, Oct 6, 2021 at 2:49 PM Seena Fallah 
>>> wrote:
>>> >>> >
>>> >>> > I'm using these versions on a centos container:
>>> >>> > ovsdb-server (Open vSwitch) 2.15.2
>>> >>> > ovn-nbctl 21.06.0
>>> >>> > Open vSwitch Library 2.15.90
>>> >>> > DB Schema 5.32.0
>>> >>> >
>>> >>> > Today I see the election timed out too and I should increase ovsdb
>>> election timeout too. I saw the commits but I didn't find any related
>>> change to my problem.
>>> >>> > If I use ovn 21.09 with ovsdb 2.16 Is there still any need to
>>> increase election timeout and disable the inactivity probe?
>>> >>>
>>> >>> Not sure on that.  It's worth a try if you have a test environment.
>>> >>>
>>> >>> > Also is there any limitation on the number of ACLs that can OVN
>>> handle?
>>> >>>
>>> >>> I don't think there is any limitation on the number of ACLs.  In
>>> >>> general as the size of the SB DB increases, we have seen issues.
>>> >>>
>>> >>> Can you run the below command on each of your nodes where
>>> >>> ovn-controller runs and see if that helps ?
>>> >>>
>>> >>> ---
>>> >>> ovs-vsctl set open . external_ids:ovn-monitor-all=true
>>> >>> ---
>>> >>>
>>> >>> Thanks
>>> >>> Numan
>>> >>>
>>> >>>
>>> >>> >
>>> >>> > Thanks.
>>> >>> >
>>> >>> > On Wed, Oct 6, 2021 at 9:43 PM Numan Siddique 
>>> wrote:
>>> >>> >>
>>> >>> >> On Wed, Oct 6, 2021 at 12:15 PM Seena Fallah <
>>> seenafal...@gmail.com> wrote:
>>> >>> >> >
>>> >>> >> > Hi,
>>> >>> >> >
>>> >>> >> > I 

Re: [ovs-discuss] OVN at scale in production

2021-10-14 Thread Han Zhou
On Thu, Oct 14, 2021 at 7:25 AM Seena Fallah  wrote:

> It's mostly on nb.
>
I am surprised, since we usually don't see scale problems on the NB DB
servers: the SB data size is usually much bigger, and the SB DB also has many
more clients than the NB DB. So if there are scale problems, they would show
up on SB before NB hits any limit.
You probably see the scale problem on NB but not on SB because ovn-northd
couldn't even translate the NB data to SB yet, due to the NB problem you hit.
I'd suggest starting at a smaller scale, making sure it works end to end, and
then enlarging it gradually; that way you would see the real limit.
Somehow 100k ACLs sounds scary to me. Usually the number of ACLs is not so
big, but each ACL could reference big address sets and port groups. Could you
give more details about your topology and what your typical ACLs look like?
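
The "start small and enlarge gradually" approach can be sketched as a simple
ramp-up loop. Everything here is hypothetical: `find_scale_limit` and the
`healthy_at` callback are names made up for this example, and `healthy_at`
stands in for whatever you do to load N ACLs and verify the deployment end to
end (NB, northd, SB and the ovn-controllers all caught up).

```python
from typing import Callable, Optional


def find_scale_limit(start: int, factor: int, ceiling: int,
                     healthy_at: Callable[[int], bool]) -> Optional[int]:
    """Grow the test size geometrically until the health check fails.

    Returns the largest tested size that was still healthy, or None if
    even the starting size failed.
    """
    if start <= 0 or factor <= 1:
        raise ValueError("start must be > 0 and factor must be > 1")
    last_good = None
    size = start
    while size <= ceiling:
        if not healthy_at(size):
            break
        last_good = size
        size *= factor
    return last_good


# Stand-in health check for the example: pretend things break at 40k ACLs.
limit = find_scale_limit(1_000, 2, 100_000, lambda n: n < 40_000)
print(limit)
```

The returned value only brackets the limit (between the last good size and
the next step), so a finer sweep in that range would be the natural follow-up.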


> Yes, I set that value before to 6 but it didn't help!
>
> On Sun, Oct 10, 2021 at 10:34 PM Han Zhou  wrote:
>
>>
>>
>> On Sat, Oct 9, 2021 at 12:02 PM Seena Fallah 
>> wrote:
>> >
>> > Also I get many logs like this in ovn:
>> >
>> > 2021-10-09T18:54:45.263Z|01151|jsonrpc|WARN|Dropped 6 log messages in
>> last 8 seconds (most recently, 3 seconds ago) due to excessive rate
>> > 2021-10-09T18:54:45.263Z|01152|jsonrpc|WARN|tcp:10.0.0.1:44454:
>> receive error: Connection reset by peer
>> > 2021-10-09T18:54:45.263Z|01153|reconnect|WARN|tcp:10.0.01:44454:
>> connection dropped (Connection reset by peer)
>> > 2021-10-09T18:54:46.798Z|01154|reconnect|WARN|tcp:10.0.0.2:50224:
>> connection dropped (Connection reset by peer)
>> > 2021-10-09T18:54:49.127Z|01155|reconnect|WARN|tcp:10.0.0.3:48514:
>> connection dropped (Connection reset by peer)
>> > 2021-10-09T18:54:51.241Z|01156|reconnect|WARN|tcp:10.0.0.3:48544:
>> connection dropped (Connection reset by peer)
>> > 2021-10-09T18:54:53.005Z|01157|reconnect|WARN|tcp:10.0.0.3:48846:
>> connection dropped (Connection reset by peer)
>> > 2021-10-09T18:54:53.246Z|01158|reconnect|WARN|tcp:10.0.0.3:48796:
>> connection dropped (Connection reset by peer)
>> >
>> > What does it mean about excessive rate? How many req/s is going to be
>> an excessive rate?
>>
>> Don't worry about "excessive rate", which is talking about the log rate
>> limit itself.
>> The "connection reset by peer" indicates client side inactivity probe is
>> enabled and it disconnects when the server hasn't responded for a while.
>> What server is this? NB or SB? Usually SB DB would have this problem if
>> there are lots of nodes and if the inactivity probe is not adjusted on the
>> nodes (ovn-controllers). Try: ovs-vsctl set open .
>> external_ids:ovn-remote-probe-interval=10 on each node.
>>
>> >
>> > On Thu, Oct 7, 2021 at 12:46 AM Seena Fallah 
>> wrote:
>> >>
>> >> Seems the most leader failure is for NB and the command you said is
>> for SB.
>> >>
>> >> Do you have any benchmarks of how many ACLs can OVN perform normally?
>> >> I see many failures after 100k ACLs.
>> >>
>> >> On Thu, Oct 7, 2021 at 12:14 AM Numan Siddique  wrote:
>> >>>
>> >>> On Wed, Oct 6, 2021 at 2:49 PM Seena Fallah 
>> wrote:
>> >>> >
>> >>> > I'm using these versions on a centos container:
>> >>> > ovsdb-server (Open vSwitch) 2.15.2
>> >>> > ovn-nbctl 21.06.0
>> >>> > Open vSwitch Library 2.15.90
>> >>> > DB Schema 5.32.0
>> >>> >
>> >>> > Today I see the election timed out too and I should increase ovsdb
>> election timeout too. I saw the commits but I didn't find any related
>> change to my problem.
>> >>> > If I use ovn 21.09 with ovsdb 2.16 Is there still any need to
>> increase election timeout and disable the inactivity probe?
>> >>>
>> >>> Not sure on that.  It's worth a try if you have a test environment.
>> >>>
>> >>> > Also is there any limitation on the number of ACLs that can OVN
>> handle?
>> >>>
>> >>> I don't think there is any limitation on the number of ACLs.  In
>> >>> general as the size of the SB DB increases, we have seen issues.
>> >>>
>> >>> Can you run the below command on each of your nodes where
>> >>> ovn-controller runs and see if that helps ?
>> >>>
>> >>> ---
>> >>> ovs-vsctl set open . external_ids:ovn-monitor-all=true
>> >>> ---
>> >>>
>> >>> Thanks
>> >>> Numan
>> >>>
>> >>>
>> >>> >
>> >>> > Thanks.
>> >>> >
>> >>> > On Wed, Oct 6, 2021 at 9:43 PM Numan Siddique 
>> wrote:
>> >>> >>
>> >>> >> On Wed, Oct 6, 2021 at 12:15 PM Seena Fallah <
>> seenafal...@gmail.com> wrote:
>> >>> >> >
>> >>> >> > Hi,
>> >>> >> >
>> >>> >> > I use ovn for OpenStack neutron plugin for my production. After
>> days I see issues about losing a leader in ovsdb. It seems it was because
>> of the failing inactivity probe and because I had 17k acls. After I disable
>> the inactivity probe it works fine but when I did a scale test on it (about
>> 40k ACLS) again it fails the leader.
>> >>> >> > I saw many docs about ovn at scale issues that were raised by
>> both RedHat and eBay and seems the solution is to rewrite ovn with ddlog. I
>> checked 

Re: [ovs-discuss] OVN at scale in production

2021-10-14 Thread Seena Fallah
It's mostly on nb.
Yes, I set that value before to 6 but it didn't help!

On Sun, Oct 10, 2021 at 10:34 PM Han Zhou  wrote:

>
>
> On Sat, Oct 9, 2021 at 12:02 PM Seena Fallah 
> wrote:
> >
> > Also I get many logs like this in ovn:
> >
> > 2021-10-09T18:54:45.263Z|01151|jsonrpc|WARN|Dropped 6 log messages in
> last 8 seconds (most recently, 3 seconds ago) due to excessive rate
> > 2021-10-09T18:54:45.263Z|01152|jsonrpc|WARN|tcp:10.0.0.1:44454: receive
> error: Connection reset by peer
> > 2021-10-09T18:54:45.263Z|01153|reconnect|WARN|tcp:10.0.01:44454:
> connection dropped (Connection reset by peer)
> > 2021-10-09T18:54:46.798Z|01154|reconnect|WARN|tcp:10.0.0.2:50224:
> connection dropped (Connection reset by peer)
> > 2021-10-09T18:54:49.127Z|01155|reconnect|WARN|tcp:10.0.0.3:48514:
> connection dropped (Connection reset by peer)
> > 2021-10-09T18:54:51.241Z|01156|reconnect|WARN|tcp:10.0.0.3:48544:
> connection dropped (Connection reset by peer)
> > 2021-10-09T18:54:53.005Z|01157|reconnect|WARN|tcp:10.0.0.3:48846:
> connection dropped (Connection reset by peer)
> > 2021-10-09T18:54:53.246Z|01158|reconnect|WARN|tcp:10.0.0.3:48796:
> connection dropped (Connection reset by peer)
> >
> > What does it mean about excessive rate? How many req/s is going to be an
> excessive rate?
>
> Don't worry about "excessive rate", which is talking about the log rate
> limit itself.
> The "connection reset by peer" indicates client side inactivity probe is
> enabled and it disconnects when the server hasn't responded for a while.
> What server is this? NB or SB? Usually SB DB would have this problem if
> there are lots of nodes and if the inactivity probe is not adjusted on the
> nodes (ovn-controllers). Try: ovs-vsctl set open .
> external_ids:ovn-remote-probe-interval=10 on each node.
>
> >
> > On Thu, Oct 7, 2021 at 12:46 AM Seena Fallah 
> wrote:
> >>
> >> Seems the most leader failure is for NB and the command you said is for
> SB.
> >>
> >> Do you have any benchmarks of how many ACLs can OVN perform normally?
> >> I see many failures after 100k ACLs.
> >>
> >> On Thu, Oct 7, 2021 at 12:14 AM Numan Siddique  wrote:
> >>>
> >>> On Wed, Oct 6, 2021 at 2:49 PM Seena Fallah 
> wrote:
> >>> >
> >>> > I'm using these versions on a centos container:
> >>> > ovsdb-server (Open vSwitch) 2.15.2
> >>> > ovn-nbctl 21.06.0
> >>> > Open vSwitch Library 2.15.90
> >>> > DB Schema 5.32.0
> >>> >
> >>> > Today I see the election timed out too and I should increase ovsdb
> election timeout too. I saw the commits but I didn't find any related
> change to my problem.
> >>> > If I use ovn 21.09 with ovsdb 2.16 Is there still any need to
> increase election timeout and disable the inactivity probe?
> >>>
> >>> Not sure on that.  It's worth a try if you have a test environment.
> >>>
> >>> > Also is there any limitation on the number of ACLs that can OVN
> handle?
> >>>
> >>> I don't think there is any limitation on the number of ACLs.  In
> >>> general as the size of the SB DB increases, we have seen issues.
> >>>
> >>> Can you run the below command on each of your nodes where
> >>> ovn-controller runs and see if that helps ?
> >>>
> >>> ---
> >>> ovs-vsctl set open . external_ids:ovn-monitor-all=true
> >>> ---
> >>>
> >>> Thanks
> >>> Numan
> >>>
> >>>
> >>> >
> >>> > Thanks.
> >>> >
> >>> > On Wed, Oct 6, 2021 at 9:43 PM Numan Siddique 
> wrote:
> >>> >>
> >>> >> On Wed, Oct 6, 2021 at 12:15 PM Seena Fallah 
> wrote:
> >>> >> >
> >>> >> > Hi,
> >>> >> >
> >>> >> > I use ovn for OpenStack neutron plugin for my production. After
> days I see issues about losing a leader in ovsdb. It seems it was because
> of the failing inactivity probe and because I had 17k acls. After I disable
> the inactivity probe it works fine but when I did a scale test on it (about
> 40k ACLS) again it fails the leader.
> >>> >> > I saw many docs about ovn at scale issues that were raised by
> both RedHat and eBay and seems the solution is to rewrite ovn with ddlog. I
> checked it with northd-ddlog but nothing changes.
> >>> >> >
> >>> >> > My question is should I wait more for ovn to be stable for high
> scale or is there any tuning I miss in my deployment?
> >>> >> > Also, will the ovn-nb/sb rewrite with ddlog and can help the
> issues at a high scale? if yes is there any due time?
> >>> >>
> >>> >> What is the ovsdb-server version you're using ?  There are many
> >>> >> improvements in the ovsdb-server in 2.16.
> >>> >> Maybe that would help in your deployment.  And also there were many
> >>> >> improvements which went into OVN 21.09
> >>> >> if you want to test it out.
> >>> >>
> >>> >> Thanks
> >>> >> Numan
> >>> >>
> >>> >> >
> >>> >> > Thanks.
> >>> >> > ___
> >>> >> > discuss mailing list
> >>> >> > disc...@openvswitch.org
> >>> >> > https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
> >>> >
> >>> > ___
> >>> > discuss mailing list
> >>> > disc...@openvswit

Re: [ovs-discuss] OVN at scale in production

2021-10-10 Thread Han Zhou
On Sat, Oct 9, 2021 at 12:02 PM Seena Fallah  wrote:
>
> Also I get many logs like this in ovn:
>
> 2021-10-09T18:54:45.263Z|01151|jsonrpc|WARN|Dropped 6 log messages in
last 8 seconds (most recently, 3 seconds ago) due to excessive rate
> 2021-10-09T18:54:45.263Z|01152|jsonrpc|WARN|tcp:10.0.0.1:44454: receive
error: Connection reset by peer
> 2021-10-09T18:54:45.263Z|01153|reconnect|WARN|tcp:10.0.01:44454:
connection dropped (Connection reset by peer)
> 2021-10-09T18:54:46.798Z|01154|reconnect|WARN|tcp:10.0.0.2:50224:
connection dropped (Connection reset by peer)
> 2021-10-09T18:54:49.127Z|01155|reconnect|WARN|tcp:10.0.0.3:48514:
connection dropped (Connection reset by peer)
> 2021-10-09T18:54:51.241Z|01156|reconnect|WARN|tcp:10.0.0.3:48544:
connection dropped (Connection reset by peer)
> 2021-10-09T18:54:53.005Z|01157|reconnect|WARN|tcp:10.0.0.3:48846:
connection dropped (Connection reset by peer)
> 2021-10-09T18:54:53.246Z|01158|reconnect|WARN|tcp:10.0.0.3:48796:
connection dropped (Connection reset by peer)
>
> What does it mean about excessive rate? How many req/s is going to be an
excessive rate?

Don't worry about "excessive rate"; that message is only about the log rate
limit itself, i.e. some log messages were suppressed.
The "connection reset by peer" indicates that the client-side inactivity
probe is enabled and the client disconnects when the server hasn't responded
for a while. Which server is this, NB or SB? Usually the SB DB has this
problem when there are lots of nodes and the inactivity probe has not been
adjusted on the nodes (ovn-controllers). Try this on each node:
ovs-vsctl set open . external_ids:ovn-remote-probe-interval=10
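
One thing worth checking with the probe interval setting above is the unit.
As far as I read ovn-controller(8), the value is in milliseconds, 0 disables
the probe, and any nonzero value is forced up to at least 1000 ms; please
verify that against the man page of your version. The sketch below encodes
that assumed rule and builds the command string. The helper names are
hypothetical, and nothing is executed.

```python
import shlex

# Assumption taken from ovn-controller(8): nonzero values are clamped
# to at least 1000 ms. Verify for your OVN version.
MIN_NONZERO_PROBE_MS = 1000


def effective_probe_ms(configured_ms: int) -> int:
    """Interval ovn-controller would actually use, under the assumed rule."""
    if configured_ms < 0:
        raise ValueError("probe interval must be >= 0")
    if configured_ms == 0:
        return 0  # 0 disables the inactivity probe
    return max(configured_ms, MIN_NONZERO_PROBE_MS)


def probe_command(interval_ms: int) -> str:
    """Build the ovs-vsctl command line as a string (nothing is run)."""
    args = ["ovs-vsctl", "set", "open", ".",
            f"external_ids:ovn-remote-probe-interval={interval_ms}"]
    return shlex.join(args)


print(effective_probe_ms(10))
print(probe_command(60000))
```

If that reading is right, small values such as 6 or 10 all behave like a
1 second probe, which may be too short for a busy server.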

>
> On Thu, Oct 7, 2021 at 12:46 AM Seena Fallah 
wrote:
>>
>> Seems the most leader failure is for NB and the command you said is for
SB.
>>
>> Do you have any benchmarks of how many ACLs can OVN perform normally?
>> I see many failures after 100k ACLs.
>>
>> On Thu, Oct 7, 2021 at 12:14 AM Numan Siddique  wrote:
>>>
>>> On Wed, Oct 6, 2021 at 2:49 PM Seena Fallah 
wrote:
>>> >
>>> > I'm using these versions on a centos container:
>>> > ovsdb-server (Open vSwitch) 2.15.2
>>> > ovn-nbctl 21.06.0
>>> > Open vSwitch Library 2.15.90
>>> > DB Schema 5.32.0
>>> >
>>> > Today I see the election timed out too and I should increase ovsdb
election timeout too. I saw the commits but I didn't find any related
change to my problem.
>>> > If I use ovn 21.09 with ovsdb 2.16 Is there still any need to
increase election timeout and disable the inactivity probe?
>>>
>>> Not sure on that.  It's worth a try if you have a test environment.
>>>
>>> > Also is there any limitation on the number of ACLs that can OVN
handle?
>>>
>>> I don't think there is any limitation on the number of ACLs.  In
>>> general as the size of the SB DB increases, we have seen issues.
>>>
>>> Can you run the below command on each of your nodes where
>>> ovn-controller runs and see if that helps ?
>>>
>>> ---
>>> ovs-vsctl set open . external_ids:ovn-monitor-all=true
>>> ---
>>>
>>> Thanks
>>> Numan
>>>
>>>
>>> >
>>> > Thanks.
>>> >
>>> > On Wed, Oct 6, 2021 at 9:43 PM Numan Siddique  wrote:
>>> >>
>>> >> On Wed, Oct 6, 2021 at 12:15 PM Seena Fallah 
wrote:
>>> >> >
>>> >> > Hi,
>>> >> >
>>> >> > I use ovn for OpenStack neutron plugin for my production. After
days I see issues about losing a leader in ovsdb. It seems it was because
of the failing inactivity probe and because I had 17k acls. After I disable
the inactivity probe it works fine but when I did a scale test on it (about
40k ACLS) again it fails the leader.
>>> >> > I saw many docs about ovn at scale issues that were raised by both
RedHat and eBay and seems the solution is to rewrite ovn with ddlog. I
checked it with northd-ddlog but nothing changes.
>>> >> >
>>> >> > My question is should I wait more for ovn to be stable for high
scale or is there any tuning I miss in my deployment?
>>> >> > Also, will the ovn-nb/sb rewrite with ddlog and can help the
issues at a high scale? if yes is there any due time?
>>> >>
>>> >> What is the ovsdb-server version you're using ?  There are many
>>> >> improvements in the ovsdb-server in 2.16.
>>> >> Maybe that would help in your deployment.  And also there were many
>>> >> improvements which went into OVN 21.09
>>> >> if you want to test it out.
>>> >>
>>> >> Thanks
>>> >> Numan
>>> >>
>>> >> >
>>> >> > Thanks.
>>> >> > ___
>>> >> > discuss mailing list
>>> >> > disc...@openvswitch.org
>>> >> > https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
>>> >
>>> > ___
>>> > discuss mailing list
>>> > disc...@openvswitch.org
>>> > https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
>
> ___
> discuss mailing list
> disc...@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-discuss

Re: [ovs-discuss] OVN at scale in production

2021-10-09 Thread Seena Fallah
Also I get many logs like this in ovn:

2021-10-09T18:54:45.263Z|01151|jsonrpc|WARN|Dropped 6 log messages in last 8 seconds (most recently, 3 seconds ago) due to excessive rate
2021-10-09T18:54:45.263Z|01152|jsonrpc|WARN|tcp:10.0.0.1:44454: receive error: Connection reset by peer
2021-10-09T18:54:45.263Z|01153|reconnect|WARN|tcp:10.0.01:44454: connection dropped (Connection reset by peer)
2021-10-09T18:54:46.798Z|01154|reconnect|WARN|tcp:10.0.0.2:50224: connection dropped (Connection reset by peer)
2021-10-09T18:54:49.127Z|01155|reconnect|WARN|tcp:10.0.0.3:48514: connection dropped (Connection reset by peer)
2021-10-09T18:54:51.241Z|01156|reconnect|WARN|tcp:10.0.0.3:48544: connection dropped (Connection reset by peer)
2021-10-09T18:54:53.005Z|01157|reconnect|WARN|tcp:10.0.0.3:48846: connection dropped (Connection reset by peer)
2021-10-09T18:54:53.246Z|01158|reconnect|WARN|tcp:10.0.0.3:48796: connection dropped (Connection reset by peer)

What does "excessive rate" mean here? How many req/s counts as an
excessive rate?
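
To see which clients are being dropped, the log above can be tallied with a
short script. This is only a sketch: the regular expression is written
against the line format shown in this mail, and the sample list holds a few
of those lines with the mail wrapping undone.

```python
import re
from collections import Counter

# Matches "...|reconnect|WARN|tcp:<peer>:<port>: connection dropped ..."
DROP_RE = re.compile(
    r"\|reconnect\|WARN\|tcp:(?P<peer>[^:]+):(?P<port>\d+): connection dropped"
)


def drops_per_peer(lines):
    """Count 'connection dropped' warnings per peer address."""
    counts = Counter()
    for line in lines:
        match = DROP_RE.search(line)
        if match:
            counts[match.group("peer")] += 1
    return counts


sample = [
    "2021-10-09T18:54:45.263Z|01152|jsonrpc|WARN|tcp:10.0.0.1:44454: receive error: Connection reset by peer",
    "2021-10-09T18:54:46.798Z|01154|reconnect|WARN|tcp:10.0.0.2:50224: connection dropped (Connection reset by peer)",
    "2021-10-09T18:54:49.127Z|01155|reconnect|WARN|tcp:10.0.0.3:48514: connection dropped (Connection reset by peer)",
    "2021-10-09T18:54:51.241Z|01156|reconnect|WARN|tcp:10.0.0.3:48544: connection dropped (Connection reset by peer)",
]

for peer, count in drops_per_peer(sample).most_common():
    print(peer, count)
```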

On Thu, Oct 7, 2021 at 12:46 AM Seena Fallah  wrote:

> Seems the most leader failure is for NB and the command you said is for SB.
>
> Do you have any benchmarks of how many ACLs can OVN perform normally?
> I see many failures after 100k ACLs.
>
> On Thu, Oct 7, 2021 at 12:14 AM Numan Siddique  wrote:
>
>> On Wed, Oct 6, 2021 at 2:49 PM Seena Fallah 
>> wrote:
>> >
>> > I'm using these versions on a centos container:
>> > ovsdb-server (Open vSwitch) 2.15.2
>> > ovn-nbctl 21.06.0
>> > Open vSwitch Library 2.15.90
>> > DB Schema 5.32.0
>> >
>> > Today I see the election timed out too and I should increase ovsdb
>> election timeout too. I saw the commits but I didn't find any related
>> change to my problem.
>> > If I use ovn 21.09 with ovsdb 2.16 Is there still any need to increase
>> election timeout and disable the inactivity probe?
>>
>> Not sure on that.  It's worth a try if you have a test environment.
>>
>> > Also is there any limitation on the number of ACLs that can OVN handle?
>>
>> I don't think there is any limitation on the number of ACLs.  In
>> general as the size of the SB DB increases, we have seen issues.
>>
>> Can you run the below command on each of your nodes where
>> ovn-controller runs and see if that helps ?
>>
>> ---
>> ovs-vsctl set open . external_ids:ovn-monitor-all=true
>> ---
>>
>> Thanks
>> Numan
>>
>>
>> >
>> > Thanks.
>> >
>> > On Wed, Oct 6, 2021 at 9:43 PM Numan Siddique  wrote:
>> >>
>> >> On Wed, Oct 6, 2021 at 12:15 PM Seena Fallah 
>> wrote:
>> >> >
>> >> > Hi,
>> >> >
>> >> > I use ovn for OpenStack neutron plugin for my production. After days
>> I see issues about losing a leader in ovsdb. It seems it was because of the
>> failing inactivity probe and because I had 17k acls. After I disable the
>> inactivity probe it works fine but when I did a scale test on it (about 40k
>> ACLS) again it fails the leader.
>> >> > I saw many docs about ovn at scale issues that were raised by both
>> RedHat and eBay and seems the solution is to rewrite ovn with ddlog. I
>> checked it with northd-ddlog but nothing changes.
>> >> >
>> >> > My question is should I wait more for ovn to be stable for high
>> scale or is there any tuning I miss in my deployment?
>> >> > Also, will the ovn-nb/sb rewrite with ddlog and can help the issues
>> at a high scale? if yes is there any due time?
>> >>
>> >> What is the ovsdb-server version you're using ?  There are many
>> >> improvements in the ovsdb-server in 2.16.
>> >> Maybe that would help in your deployment.  And also there were many
>> >> improvements which went into OVN 21.09
>> >> if you want to test it out.
>> >>
>> >> Thanks
>> >> Numan
>> >>
>> >> >
>> >> > Thanks.
>> >> > ___
>> >> > discuss mailing list
>> >> > disc...@openvswitch.org
>> >> > https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
>> >
>> > ___
>> > discuss mailing list
>> > disc...@openvswitch.org
>> > https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
>>
>


Re: [ovs-discuss] OVN at scale in production

2021-10-06 Thread Seena Fallah
It seems most of the leader failures are on NB, while the command you
suggested is for SB.

Do you have any benchmarks for how many ACLs OVN can handle normally?
I see many failures after 100k ACLs.

On Thu, Oct 7, 2021 at 12:14 AM Numan Siddique  wrote:

> On Wed, Oct 6, 2021 at 2:49 PM Seena Fallah  wrote:
> >
> > I'm using these versions on a centos container:
> > ovsdb-server (Open vSwitch) 2.15.2
> > ovn-nbctl 21.06.0
> > Open vSwitch Library 2.15.90
> > DB Schema 5.32.0
> >
> > Today I see the election timed out too and I should increase ovsdb
> election timeout too. I saw the commits but I didn't find any related
> change to my problem.
> > If I use ovn 21.09 with ovsdb 2.16 Is there still any need to increase
> election timeout and disable the inactivity probe?
>
> Not sure on that.  It's worth a try if you have a test environment.
>
> > Also is there any limitation on the number of ACLs that can OVN handle?
>
> I don't think there is any limitation on the number of ACLs.  In
> general as the size of the SB DB increases, we have seen issues.
>
> Can you run the below command on each of your nodes where
> ovn-controller runs and see if that helps ?
>
> ---
> ovs-vsctl set open . external_ids:ovn-monitor-all=true
> ---
>
> Thanks
> Numan
>
>
> >
> > Thanks.
> >
> > On Wed, Oct 6, 2021 at 9:43 PM Numan Siddique  wrote:
> >>
> >> On Wed, Oct 6, 2021 at 12:15 PM Seena Fallah 
> wrote:
> >> >
> >> > Hi,
> >> >
> >> > I use ovn for OpenStack neutron plugin for my production. After days
> I see issues about losing a leader in ovsdb. It seems it was because of the
> failing inactivity probe and because I had 17k acls. After I disable the
> inactivity probe it works fine but when I did a scale test on it (about 40k
> ACLS) again it fails the leader.
> >> > I saw many docs about ovn at scale issues that were raised by both
> RedHat and eBay and seems the solution is to rewrite ovn with ddlog. I
> checked it with northd-ddlog but nothing changes.
> >> >
> >> > My question is should I wait more for ovn to be stable for high scale
> or is there any tuning I miss in my deployment?
> >> > Also, will the ovn-nb/sb rewrite with ddlog and can help the issues
> at a high scale? if yes is there any due time?
> >>
> >> What is the ovsdb-server version you're using ?  There are many
> >> improvements in the ovsdb-server in 2.16.
> >> Maybe that would help in your deployment.  And also there were many
> >> improvements which went into OVN 21.09
> >> if you want to test it out.
> >>
> >> Thanks
> >> Numan
> >>
> >> >
> >> > Thanks.
> >> > ___
> >> > discuss mailing list
> >> > disc...@openvswitch.org
> >> > https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
> >
> > ___
> > discuss mailing list
> > disc...@openvswitch.org
> > https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
>


Re: [ovs-discuss] OVN at scale in production

2021-10-06 Thread Numan Siddique
On Wed, Oct 6, 2021 at 2:49 PM Seena Fallah  wrote:
>
> I'm using these versions in a CentOS container:
> ovsdb-server (Open vSwitch) 2.15.2
> ovn-nbctl 21.06.0
> Open vSwitch Library 2.15.90
> DB Schema 5.32.0
>
> Today I saw the election time out as well, so it seems I need to increase the
> ovsdb election timeout too. I looked at the commits but didn't find any
> change related to my problem.
> If I use OVN 21.09 with ovsdb 2.16, is there still any need to increase the
> election timeout and disable the inactivity probe?

Not sure on that.  It's worth a try if you have a test environment.
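
If you stay on the current versions and keep raising the election timer in
the meantime, a rough sketch of the stepping is below. As far as I know,
cluster/change-election-timer only accepts an increase of up to twice the
current value per call and has to be issued on the current leader, so a large
increase takes several calls. The socket path, the 1000 ms starting value and
the 16000 ms target are assumptions for illustration, and the script only
prints the commands instead of running them.

```shell
#!/bin/sh
# Assumed control socket path; it differs between packages and containers.
NB_CTL="${NB_CTL:-/var/run/ovn/ovnnb_db.ctl}"

# Print the intermediate election timer values (in ms) needed to go from
# $1 to $2 when each change may at most double the current value.
election_timer_steps() {
    cur=$1
    target=$2
    while [ "$cur" -lt "$target" ]; do
        cur=$((cur * 2))
        # Do not overshoot the requested target on the last step.
        if [ "$cur" -gt "$target" ]; then
            cur=$target
        fi
        echo "$cur"
    done
}

# Dry run: print one command per step for the NB database.
# 1000 and 16000 are illustrative values, not recommendations.
for ms in $(election_timer_steps 1000 16000); do
    echo "ovs-appctl -t $NB_CTL cluster/change-election-timer OVN_Northbound $ms"
done
```

Checking "ovs-appctl -t <socket> cluster/status OVN_Northbound" between steps
should show whether each change was committed before the next one is issued.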

> Also, is there any limit on the number of ACLs that OVN can handle?

I don't think there is any limit on the number of ACLs.  In
general, though, we have seen issues as the size of the SB DB increases.

Can you run the command below on each of your nodes where
ovn-controller runs and see if that helps?

---
ovs-vsctl set open . external_ids:ovn-monitor-all=true
---
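
With ovn-monitor-all=true, ovn-controller monitors all SB records instead of
using conditional monitoring, which takes the condition processing off the SB
ovsdb-server at the cost of more data held on each chassis. A minimal sketch
for rolling it out and reading the value back is below. The hostnames and the
use of ssh are placeholders, not something from this deployment, and the
script only prints the commands so they can be reviewed first.

```shell
#!/bin/sh
# Hypothetical chassis hostnames; replace with your own inventory.
CHASSIS="compute-1 compute-2 compute-3"

# Print (do not run) the per-node commands: one to set the option and
# one to read it back afterwards.
print_monitor_all_cmds() {
    for host in $CHASSIS; do
        echo "ssh $host ovs-vsctl set open . external_ids:ovn-monitor-all=true"
        echo "ssh $host ovs-vsctl get open . external_ids:ovn-monitor-all"
    done
}

print_monitor_all_cmds
```

Running the "get" line on a node after the change should print the stored
value, which makes it easy to spot chassis that were missed.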

Thanks
Numan




Re: [ovs-discuss] OVN at scale in production

2021-10-06 Thread Seena Fallah
I'm using these versions in a CentOS container:
ovsdb-server (Open vSwitch) 2.15.2
ovn-nbctl 21.06.0
Open vSwitch Library 2.15.90
DB Schema 5.32.0

Today I saw the election time out as well, so it seems I need to increase the
ovsdb election timeout too. I looked at the commits but didn't find any
change related to my problem.
If I use OVN 21.09 with ovsdb 2.16, is there still any need to increase the
election timeout and disable the inactivity probe?
Also, is there any limit on the number of ACLs that OVN can handle?

Thanks.



Re: [ovs-discuss] OVN at scale in production

2021-10-06 Thread Numan Siddique
On Wed, Oct 6, 2021 at 12:15 PM Seena Fallah  wrote:
>
> Hi,
>
> I use OVN as the OpenStack Neutron plugin in production. After some days I
> started seeing ovsdb lose its leader. It seems to be caused by the failing
> inactivity probe and by the fact that I had 17k ACLs. After I disabled the
> inactivity probe it worked fine, but when I ran a scale test (about 40k ACLs)
> it lost the leader again.
> I saw many docs about OVN scale issues raised by both Red Hat and eBay, and
> it seems the solution is to rewrite OVN with ddlog. I tried northd-ddlog but
> nothing changed.
>
> My question is: should I wait for OVN to become stable at high scale, or is
> there some tuning I am missing in my deployment?
> Also, will ovn-nb/sb be rewritten with ddlog, and would that help with the
> issues at high scale? If yes, is there a timeline?

Which ovsdb-server version are you using?  There are many
improvements in ovsdb-server 2.16 that may help in your deployment.
Many improvements also went into OVN 21.09, if you want to test it out.

Thanks
Numan

>
> Thanks.


[ovs-discuss] OVN at scale in production

2021-10-06 Thread Seena Fallah
Hi,

I use OVN as the OpenStack Neutron plugin in production. After some days I
started seeing ovsdb lose its leader. It seems to be caused by the failing
inactivity probe and by the fact that I had 17k ACLs. After I disabled the
inactivity probe it worked fine, but when I ran a scale test (about 40k ACLs)
it lost the leader again.
I saw many docs about OVN scale issues raised by both Red Hat and eBay, and
it seems the solution is to rewrite OVN with ddlog. I tried northd-ddlog but
nothing changed.

My question is: should I wait for OVN to become stable at high scale, or is
there some tuning I am missing in my deployment?
Also, will ovn-nb/sb be rewritten with ddlog, and would that help with the
issues at high scale? If yes, is there a timeline?

Thanks.