Re: [ovs-discuss] OVN at scale in production
On Fri, Oct 15, 2021 at 3:53 AM Seena Fallah wrote:

> In the case of having many projects, where each project has at least 2
> security groups and each security group has 5 ACLs, I think this number of
> ACLs is not very high.

OK, assume each project has 2 x 5 = 10 ACLs; 100k ACLs then means you have 10k projects. That is not a small number. If each project has its own logical routers (LRs) and logical switches (LSes) and 10 to 100 workloads, it sounds like something really big. If the projects just share the LRs and LSes, and each project has only a few workloads, then it may be OK. Still, regardless of the scale, I am surprised that you hit scale problems in NB but not in SB.

> In the OVS scenario, I have 250K ACLs and everything works fine!

What do you mean by ACLs in OVS? Does 250k ACLs mean 250k OVS flows? A single OVN ACL can easily be translated into thousands of OVS flows if the ACL references a big address set. So I am not sure what these numbers mean exactly in your deployment.

> Do you think OVN is not ready for this number of ACLs?
> I'm switching from OVS to OVN.
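
The sizing arithmetic in this message can be sketched in a few lines of Python. This is only a rough illustration, not an OVN tool: the function names are made up for this note, and the flow estimate assumes the worst case of one OVS flow per address in the referenced address set, which ignores any match optimizations OVN may apply and the fact that each chassis only installs the flows it needs.

```python
def acls_per_project(security_groups: int, acls_per_group: int) -> int:
    """ACLs contributed by one project (hypothetical sizing model)."""
    return security_groups * acls_per_group


def projects_for(total_acls: int, security_groups: int, acls_per_group: int) -> int:
    """Number of projects needed to reach total_acls, rounded up."""
    per_project = acls_per_project(security_groups, acls_per_group)
    if per_project <= 0:
        raise ValueError("each project must contribute at least one ACL")
    # Ceiling division without importing math.
    return -(-total_acls // per_project)


def rough_flow_upper_bound(acl_count: int, addresses_per_set: int) -> int:
    """Worst-case guess: every ACL expands to one flow per address.

    This is an assumption made for illustration, not how ovn-controller
    actually computes flows.
    """
    return acl_count * addresses_per_set


if __name__ == "__main__":
    # 2 security groups x 5 ACLs each, 100k ACLs in total.
    print(projects_for(100_000, 2, 5))
    # 100k ACLs, each referencing a 1,000-entry address set.
    print(rough_flow_upper_bound(100_000, 1_000))
```

The point of the second function is only that an ACL count and an OVS flow count are different units, so 250k of one cannot be compared directly with 100k of the other.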
Re: [ovs-discuss] OVN at scale in production
In the case of having many projects, where each project has at least 2 security groups and each security group has 5 ACLs, I think this number of ACLs is not very high.
In the OVS scenario, I have 250K ACLs and everything works fine!
Do you think OVN is not ready for this number of ACLs?
I'm switching from OVS to OVN.

On Fri, Oct 15, 2021 at 4:41 AM Han Zhou wrote:

> On Thu, Oct 14, 2021 at 7:25 AM Seena Fallah wrote:
>
> > It's mostly on NB.
>
> I am surprised, since we usually don't see any scale problem on the NB DB
> servers: the SB data size is usually much bigger, and the number of clients
> is also much bigger than for the NB DB. So if there are scale problems, they
> would always show up on SB before NB hits any limit.
> You probably see an NB scale problem but not an SB one because ovn-northd
> couldn't even translate the NB data to SB yet, because of the NB problem you
> hit. I'd suggest starting at a smaller scale, making sure it works end to
> end, and then enlarging it gradually; then you would see the real limit.
> Somehow 100k ACLs sounds scary to me. Usually the number of ACLs is not so
> big, but each ACL could reference big address sets and port groups. It would
> help if you could give more details about your topology and what your
> typical ACLs look like.
Re: [ovs-discuss] OVN at scale in production
On Thu, Oct 14, 2021 at 7:25 AM Seena Fallah wrote:

> It's mostly on NB.

I am surprised, since we usually don't see any scale problem on the NB DB servers: the SB data size is usually much bigger, and the number of clients is also much bigger than for the NB DB. So if there are scale problems, they would always show up on SB before NB hits any limit.
You probably see an NB scale problem but not an SB one because ovn-northd couldn't even translate the NB data to SB yet, because of the NB problem you hit. I'd suggest starting at a smaller scale, making sure it works end to end, and then enlarging it gradually; then you would see the real limit.
Somehow 100k ACLs sounds scary to me. Usually the number of ACLs is not so big, but each ACL could reference big address sets and port groups. It would help if you could give more details about your topology and what your typical ACLs look like.

> Yes, I set that value to 6 before, but it didn't help!
Re: [ovs-discuss] OVN at scale in production
It's mostly on NB.
Yes, I set that value to 6 before, but it didn't help!

On Sun, Oct 10, 2021 at 10:34 PM Han Zhou wrote:

> On Sat, Oct 9, 2021 at 12:02 PM Seena Fallah wrote:
>
> > What does it mean about excessive rate? How many req/s is going to be an
> > excessive rate?
>
> Don't worry about "excessive rate"; that message is about the log rate
> limit itself.
> The "connection reset by peer" indicates that the client-side inactivity
> probe is enabled and that the client disconnects when the server hasn't
> responded for a while. Which server is this, NB or SB? Usually the SB DB
> has this problem if there are lots of nodes and the inactivity probe has
> not been adjusted on the nodes (ovn-controllers). Try:
>
>     ovs-vsctl set open . external_ids:ovn-remote-probe-interval=10
>
> on each node.
Re: [ovs-discuss] OVN at scale in production
On Sat, Oct 9, 2021 at 12:02 PM Seena Fallah wrote:
>
> Also I get many logs like this in ovn:
>
> 2021-10-09T18:54:45.263Z|01151|jsonrpc|WARN|Dropped 6 log messages in last 8 seconds (most recently, 3 seconds ago) due to excessive rate
> 2021-10-09T18:54:45.263Z|01152|jsonrpc|WARN|tcp:10.0.0.1:44454: receive error: Connection reset by peer
> 2021-10-09T18:54:45.263Z|01153|reconnect|WARN|tcp:10.0.01:44454: connection dropped (Connection reset by peer)
> 2021-10-09T18:54:46.798Z|01154|reconnect|WARN|tcp:10.0.0.2:50224: connection dropped (Connection reset by peer)
> 2021-10-09T18:54:49.127Z|01155|reconnect|WARN|tcp:10.0.0.3:48514: connection dropped (Connection reset by peer)
> 2021-10-09T18:54:51.241Z|01156|reconnect|WARN|tcp:10.0.0.3:48544: connection dropped (Connection reset by peer)
> 2021-10-09T18:54:53.005Z|01157|reconnect|WARN|tcp:10.0.0.3:48846: connection dropped (Connection reset by peer)
> 2021-10-09T18:54:53.246Z|01158|reconnect|WARN|tcp:10.0.0.3:48796: connection dropped (Connection reset by peer)
>
> What does it mean about excessive rate? How many req/s is going to be an excessive rate?

Don't worry about "excessive rate"; that message is about the log rate limit itself.
The "connection reset by peer" indicates that the client-side inactivity probe is enabled and that the client disconnects when the server hasn't responded for a while. Which server is this, NB or SB? Usually the SB DB has this problem if there are lots of nodes and the inactivity probe has not been adjusted on the nodes (ovn-controllers). Try:

    ovs-vsctl set open . external_ids:ovn-remote-probe-interval=10

on each node.
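
To see which clients are being disconnected most often, the "connection dropped" warnings quoted in this message can be tallied per peer address. The following is a minimal sketch, assuming the log lines keep the `timestamp|sequence|module|level|message` layout shown in the thread and that peers are IPv4 `tcp:` connections; the function name and the regular expression are illustrative and not part of any OVS or OVN tool.

```python
import re
from collections import Counter

# Matches lines such as:
# 2021-10-09T18:54:46.798Z|01154|reconnect|WARN|tcp:10.0.0.2:50224: connection dropped (Connection reset by peer)
DROP_RE = re.compile(
    r"^(?P<ts>[^|]+)\|\d+\|reconnect\|WARN\|"
    r"tcp:(?P<host>[^:]+):(?P<port>\d+): "
    r"connection dropped \((?P<reason>[^)]*)\)"
)


def count_drops(lines):
    """Count 'connection dropped' warnings per peer host."""
    counts = Counter()
    for line in lines:
        match = DROP_RE.match(line.strip())
        if match:
            counts[match.group("host")] += 1
    return counts


# A few of the log lines posted in the thread, used as sample input.
SAMPLE = """\
2021-10-09T18:54:45.263Z|01151|jsonrpc|WARN|Dropped 6 log messages in last 8 seconds (most recently, 3 seconds ago) due to excessive rate
2021-10-09T18:54:45.263Z|01152|jsonrpc|WARN|tcp:10.0.0.1:44454: receive error: Connection reset by peer
2021-10-09T18:54:46.798Z|01154|reconnect|WARN|tcp:10.0.0.2:50224: connection dropped (Connection reset by peer)
2021-10-09T18:54:49.127Z|01155|reconnect|WARN|tcp:10.0.0.3:48514: connection dropped (Connection reset by peer)
2021-10-09T18:54:51.241Z|01156|reconnect|WARN|tcp:10.0.0.3:48544: connection dropped (Connection reset by peer)
""".splitlines()


if __name__ == "__main__":
    for host, drops in count_drops(SAMPLE).most_common():
        print(host, drops)
```

Because the server rate-limits its own log output, counts taken this way are a lower bound on the real number of dropped connections.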
Re: [ovs-discuss] OVN at scale in production
Also, I get many logs like this in OVN:

2021-10-09T18:54:45.263Z|01151|jsonrpc|WARN|Dropped 6 log messages in last 8 seconds (most recently, 3 seconds ago) due to excessive rate
2021-10-09T18:54:45.263Z|01152|jsonrpc|WARN|tcp:10.0.0.1:44454: receive error: Connection reset by peer
2021-10-09T18:54:45.263Z|01153|reconnect|WARN|tcp:10.0.01:44454: connection dropped (Connection reset by peer)
2021-10-09T18:54:46.798Z|01154|reconnect|WARN|tcp:10.0.0.2:50224: connection dropped (Connection reset by peer)
2021-10-09T18:54:49.127Z|01155|reconnect|WARN|tcp:10.0.0.3:48514: connection dropped (Connection reset by peer)
2021-10-09T18:54:51.241Z|01156|reconnect|WARN|tcp:10.0.0.3:48544: connection dropped (Connection reset by peer)
2021-10-09T18:54:53.005Z|01157|reconnect|WARN|tcp:10.0.0.3:48846: connection dropped (Connection reset by peer)
2021-10-09T18:54:53.246Z|01158|reconnect|WARN|tcp:10.0.0.3:48796: connection dropped (Connection reset by peer)

What does "excessive rate" mean here? How many req/s counts as an excessive rate?

On Thu, Oct 7, 2021 at 12:46 AM Seena Fallah wrote:

> It seems most of the leader failures are on NB, and the command you gave is
> for SB.
>
> Do you have any benchmarks of how many ACLs OVN can handle normally?
> I see many failures after 100k ACLs.

___
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
Re: [ovs-discuss] OVN at scale in production
It seems most of the leader failures are on NB, and the command you gave is for SB.

Do you have any benchmarks of how many ACLs OVN can handle normally?
I see many failures after 100k ACLs.

On Thu, Oct 7, 2021 at 12:14 AM Numan Siddique wrote:

> > If I use ovn 21.09 with ovsdb 2.16 Is there still any need to increase
> > election timeout and disable the inactivity probe?
>
> Not sure about that. It's worth a try if you have a test environment.
>
> > Also is there any limitation on the number of ACLs that can OVN handle?
>
> I don't think there is any limit on the number of ACLs. In general, we have
> seen issues as the size of the SB DB increases.
>
> Can you run the command below on each of your nodes where ovn-controller
> runs and see if that helps?
>
> ---
> ovs-vsctl set open . external_ids:ovn-monitor-all=true
> ---
>
> Thanks
> Numan
Re: [ovs-discuss] OVN at scale in production
On Wed, Oct 6, 2021 at 2:49 PM Seena Fallah wrote:
>
> I'm using these versions on a centos container:
> ovsdb-server (Open vSwitch) 2.15.2
> ovn-nbctl 21.06.0
> Open vSwitch Library 2.15.90
> DB Schema 5.32.0
>
> Today I see the election timed out too and I should increase ovsdb election
> timeout too. I saw the commits but I didn't find any related change to my
> problem.
> If I use ovn 21.09 with ovsdb 2.16 Is there still any need to increase
> election timeout and disable the inactivity probe?

Not sure on that. It's worth a try if you have a test environment.

> Also is there any limitation on the number of ACLs that can OVN handle?

I don't think there is any limitation on the number of ACLs. In general
as the size of the SB DB increases, we have seen issues.

Can you run the below command on each of your nodes where ovn-controller
runs and see if that helps ?

---
ovs-vsctl set open . external_ids:ovn-monitor-all=true
---

Thanks
Numan

>
> Thanks.
>
> On Wed, Oct 6, 2021 at 9:43 PM Numan Siddique wrote:
>>
>> On Wed, Oct 6, 2021 at 12:15 PM Seena Fallah wrote:
>> >
>> > Hi,
>> >
>> > I use ovn for OpenStack neutron plugin for my production. After days I see
>> > issues about losing a leader in ovsdb. It seems it was because of the
>> > failing inactivity probe and because I had 17k acls. After I disable the
>> > inactivity probe it works fine but when I did a scale test on it (about
>> > 40k ACLS) again it fails the leader.
>> > I saw many docs about ovn at scale issues that were raised by both RedHat
>> > and eBay and seems the solution is to rewrite ovn with ddlog. I checked it
>> > with northd-ddlog but nothing changes.
>> >
>> > My question is should I wait more for ovn to be stable for high scale or
>> > is there any tuning I miss in my deployment?
>> > Also, will the ovn-nb/sb rewrite with ddlog and can help the issues at a
>> > high scale? if yes is there any due time?
>>
>> What is the ovsdb-server version you're using ? There are many
>> improvements in the ovsdb-server in 2.16.
>> Maybe that would help in your deployment. And also there were many
>> improvements which went into OVN 21.09
>> if you want to test it out.
>>
>> Thanks
>> Numan
>>
>> >
>> > Thanks.

___
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
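The ovn-monitor-all setting suggested in the reply above has to be applied on every node that runs ovn-controller, so it can help to wrap the command and read the value back afterwards. The following is only a minimal sketch: the function names and the OVS_VSCTL override variable are hypothetical and exist purely for illustration; the only command taken from the thread is the `ovs-vsctl set` line.

```shell
#!/bin/sh
# Sketch only. OVS_VSCTL is an assumption of this example (not an OVS
# feature); it lets the commands be previewed, e.g. with OVS_VSCTL=echo.
OVS_VSCTL="${OVS_VSCTL:-ovs-vsctl}"

set_monitor_all() {
    # Same command as suggested in the thread: ask ovn-controller to
    # monitor the whole Southbound DB instead of using conditional monitoring.
    $OVS_VSCTL set open . external_ids:ovn-monitor-all=true
}

get_monitor_all() {
    # Read the key back to confirm the value was stored on this node.
    $OVS_VSCTL get open . external_ids:ovn-monitor-all
}
```

On a node, one would source the file and call `set_monitor_all` followed by `get_monitor_all`; running with `OVS_VSCTL=echo` only prints the arguments that would be passed, without touching the local database.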
Re: [ovs-discuss] OVN at scale in production
I'm using these versions on a centos container:
ovsdb-server (Open vSwitch) 2.15.2
ovn-nbctl 21.06.0
Open vSwitch Library 2.15.90
DB Schema 5.32.0

Today I see the election timed out too and I should increase ovsdb election
timeout too. I saw the commits but I didn't find any related change to my
problem.
If I use ovn 21.09 with ovsdb 2.16 Is there still any need to increase
election timeout and disable the inactivity probe?
Also is there any limitation on the number of ACLs that can OVN handle?

Thanks.

On Wed, Oct 6, 2021 at 9:43 PM Numan Siddique wrote:

> On Wed, Oct 6, 2021 at 12:15 PM Seena Fallah wrote:
> >
> > Hi,
> >
> > I use ovn for OpenStack neutron plugin for my production. After days I
> > see issues about losing a leader in ovsdb. It seems it was because of the
> > failing inactivity probe and because I had 17k acls. After I disable the
> > inactivity probe it works fine but when I did a scale test on it (about 40k
> > ACLS) again it fails the leader.
> > I saw many docs about ovn at scale issues that were raised by both
> > RedHat and eBay and seems the solution is to rewrite ovn with ddlog. I
> > checked it with northd-ddlog but nothing changes.
> >
> > My question is should I wait more for ovn to be stable for high scale or
> > is there any tuning I miss in my deployment?
> > Also, will the ovn-nb/sb rewrite with ddlog and can help the issues at a
> > high scale? if yes is there any due time?
>
> What is the ovsdb-server version you're using ? There are many
> improvements in the ovsdb-server in 2.16.
> Maybe that would help in your deployment. And also there were many
> improvements which went into OVN 21.09
> if you want to test it out.
>
> Thanks
> Numan
>
> >
> > Thanks.
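On raising the ovsdb election timeout mentioned above: for clustered databases, ovsdb-server exposes a `cluster/change-election-timer` command through ovs-appctl. It has to be run on the current leader, and each call can at most double the current value, so reaching a larger target takes several steps. The sketch below computes those steps. The control socket path, the starting value of 1000 ms and the function names are assumptions for illustration, not details taken from this thread; check the socket location for your packaging before using anything like this.

```shell
#!/bin/sh
# Print the intermediate election-timer values (in ms) needed to go from
# $1 to $2 when every change may at most double the current value.
election_timer_steps() {
    current=$1
    target=$2
    while [ "$current" -lt "$target" ]; do
        next=$((current * 2))
        if [ "$next" -gt "$target" ]; then
            next=$target
        fi
        echo "$next"
        current=$next
    done
}

# Assumed control socket path for the NB ovsdb-server; it differs between
# packages and containers, so override NB_CTL as needed.
NB_CTL="${NB_CTL:-/var/run/ovn/ovnnb_db.ctl}"

raise_nb_election_timer() {
    # $1 = current timer in ms, $2 = target timer in ms.
    # Must be run on the node that is currently the Raft leader.
    for ms in $(election_timer_steps "$1" "$2"); do
        ovs-appctl -t "$NB_CTL" cluster/change-election-timer OVN_Northbound "$ms"
    done
}
```

For example, `election_timer_steps 1000 8000` lists 2000, 4000 and 8000, one per line, and `raise_nb_election_timer 1000 8000` would issue one ovs-appctl call per step. The current value and the leader can be checked beforehand with `ovs-appctl -t "$NB_CTL" cluster/status OVN_Northbound`.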
Re: [ovs-discuss] OVN at scale in production
On Wed, Oct 6, 2021 at 12:15 PM Seena Fallah wrote:
>
> Hi,
>
> I use ovn for OpenStack neutron plugin for my production. After days I see
> issues about losing a leader in ovsdb. It seems it was because of the failing
> inactivity probe and because I had 17k acls. After I disable the inactivity
> probe it works fine but when I did a scale test on it (about 40k ACLS) again
> it fails the leader.
> I saw many docs about ovn at scale issues that were raised by both RedHat and
> eBay and seems the solution is to rewrite ovn with ddlog. I checked it with
> northd-ddlog but nothing changes.
>
> My question is should I wait more for ovn to be stable for high scale or is
> there any tuning I miss in my deployment?
> Also, will the ovn-nb/sb rewrite with ddlog and can help the issues at a high
> scale? if yes is there any due time?

What is the ovsdb-server version you're using ?
There are many improvements in the ovsdb-server in 2.16.
Maybe that would help in your deployment. And also there were many
improvements which went into OVN 21.09
if you want to test it out.

Thanks
Numan

>
> Thanks.
[ovs-discuss] OVN at scale in production
Hi,

I use ovn for OpenStack neutron plugin for my production. After days I see
issues about losing a leader in ovsdb. It seems it was because of the failing
inactivity probe and because I had 17k acls. After I disable the inactivity
probe it works fine but when I did a scale test on it (about 40k ACLS) again
it fails the leader.
I saw many docs about ovn at scale issues that were raised by both RedHat and
eBay and seems the solution is to rewrite ovn with ddlog. I checked it with
northd-ddlog but nothing changes.

My question is should I wait more for ovn to be stable for high scale or is
there any tuning I miss in my deployment?
Also, will the ovn-nb/sb rewrite with ddlog and can help the issues at a high
scale? if yes is there any due time?

Thanks.