Re: [ovs-dev] raft ovsdb clustering with scale test
Hi:

I was able to get a little further from the log perspective with debug level and see a proper error. However, there were a couple of changes I made to the current setup. For the controller setup, since I am using 3 VMs, to keep the specs in line with the current OVN scale env:

1. Increased the virtual memory of my VMs to 40 GB, as I saw out-of-memory errors with 4 GB of memory.
2. Also set the MTU to 9k instead of 1500, as the scale env BM used 1500.
3. Also changed net.core.somaxconn=2048, as the Ubuntu image uses just 128 and the connection rate is high.

After 1.5k lports were created, another request for 250 lports took ~20+ minutes to go through as per the logs, which is huge. With the above settings in place, active northd, even though it was processing a little slowly, did not show any errors, and stretched to around 2k lports instead of the 1k lports in the previous runs. northd logs when the wait-until command is run in debug mode are for reference @ https://github.com/noah8713/ovn-scale-test/blob/scale_results/results/active_northd_nod3.txt

Finally, with wait-until, the nb db process exited with a segmentation fault on all 3 nodes, e.g.:
ovn-nbctl --db="tcp:192.168.220.101:6641,tcp:192.168.220.102:6641,tcp:192.168.220.103:6641" show
ovn-nbctl: tcp:192.168.220.101:6641,tcp:192.168.220.102:6641,tcp:192.168.220.103:6641: database connection failed (Connection refused)

#node3 ovsdb-server-nb logs
2018-04-23T05:11:34.237Z|00015|raft|INFO|tcp:192.168.10.10:43412: learned server ID 9e1c
2018-04-23T05:11:34.237Z|00016|raft|INFO|tcp:192.168.10.10:43412: learned remote address tcp:192.168.220.101:6643
2018-04-23T05:11:34.240Z|00017|raft|INFO|tcp:192.168.10.10:43006: learned server ID f038
2018-04-23T05:11:34.240Z|00018|raft|INFO|tcp:192.168.10.10:43006: learned remote address tcp:192.168.220.102:6643
2018-04-23T05:11:44.216Z|00019|memory|INFO|2536 kB peak resident set size after 10.0 seconds
2018-04-23T05:11:44.216Z|00020|memory|INFO|cells:28 monitors:0
2018-04-23T05:21:04.276Z|00021|memory|INFO|peak resident set size grew 51% in last 560.1 seconds, from 2536 kB to 3828 kB
2018-04-23T05:21:04.276Z|00022|memory|INFO|cells:5588 monitors:0
2018-04-23T05:24:14.327Z|00023|memory|INFO|peak resident set size grew 76% in last 190.1 seconds, from 3828 kB to 6744 kB
2018-04-23T05:24:14.328Z|00024|memory|INFO|cells:13588 monitors:0
2018-04-23T05:26:54.365Z|00025|memory|INFO|peak resident set size grew 52% in last 160.0 seconds, from 6744 kB to 10220 kB
2018-04-23T05:26:54.365Z|00026|memory|INFO|cells:13588 monitors:0
2018-04-23T05:27:04.370Z|00027|memory|INFO|peak resident set size grew 51% in last 10.0 seconds, from 10220 kB to 15472 kB
2018-04-23T05:27:04.370Z|00028|memory|INFO|cells:13588 monitors:0
2018-04-23T05:28:54.388Z|00029|memory|INFO|peak resident set size grew 60% in last 110.0 seconds, from 15472 kB to 24780 kB
2018-04-23T05:28:54.388Z|00030|memory|INFO|cells:17588 monitors:0
2018-04-23T05:29:44.399Z|00031|memory|INFO|peak resident set size grew 59% in last 50.0 seconds, from 24780 kB to 39480 kB
2018-04-23T05:29:44.400Z|00032|memory|INFO|cells:17588 monitors:0
2018-04-23T05:31:44.470Z|00033|memory|INFO|peak resident set size grew 52% in last 120.1 seconds, from 39480 kB to 6 kB
2018-04-23T05:31:44.470Z|00034|memory|INFO|cells:17588 monitors:0
2018-04-23T05:32:24.474Z|00035|memory|INFO|peak resident set size grew 56% in last 40.0 seconds, from 6 kB to 93528 kB
2018-04-23T05:32:24.475Z|00036|memory|INFO|cells:17588 monitors:0
2018-04-23T05:33:44.482Z|00037|memory|INFO|peak resident set size grew 52% in last 80.0 seconds, from 93528 kB to 142304 kB
2018-04-23T05:33:44.482Z|00038|memory|INFO|cells:17588 monitors:0
2018-04-23T05:37:34.511Z|00039|memory|INFO|peak resident set size grew 52% in last 230.0 seconds, from 142304 kB to 215640 kB
2018-04-23T05:37:34.511Z|00040|memory|INFO|cells:21588 monitors:0
2018-04-23T05:56:30.576Z|2|daemon_unix(monitor)|ERR|1 crashes: pid 16603 died, killed (Segmentation fault), core dumped, restarting

#node 2 ovsdb-server-nb logs
2018-04-23T05:56:30.298Z|00040|reconnect|INFO|tcp:192.168.220.103:6643: connection closed by peer
2018-04-23T05:56:31.298Z|00041|reconnect|INFO|tcp:192.168.220.103:6643: connecting...
2018-04-23T05:56:31.299Z|00042|reconnect|INFO|tcp:192.168.220.103:6643: connection attempt failed (Connection refused)
2018-04-23T05:56:31.299Z|00043|reconnect|INFO|tcp:192.168.220.103:6643: waiting 2 seconds before reconnect
2018-04-23T05:56:33.298Z|00044|reconnect|INFO|tcp:192.168.220.103:6643: connecting...
2018-04-23T05:56:33.298Z|00045|reconnect|INFO|tcp:192.168.220.103:6643: connection attempt failed (Connection refused)
2018-04-23T05:56:33.298Z|00046|reconnect|INFO|tcp:192.168.220.103:6643: waiting 4 seconds before reconnect
2018-04-23T05:56:37.298Z|00047|reconnect|INFO|tcp:192.168.220.103:6643: connecting...
2018-04-23T05:56:37.298Z|00048|reconnect|INFO|tcp:192.168.220.103:6643: connection attempt failed (Connection refused)
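The host tuning described in the setup changes (larger backlog, jumbo MTU) can be sketched as shell commands. This is a minimal sketch under stated assumptions: the interface name `ens3` and the sysctl.d file name are placeholders for illustration, not from the original setup.

```shell
# Raise the listen backlog; Ubuntu's default of 128 is too small
# for the connection rate generated at this scale.
sysctl -w net.core.somaxconn=2048

# Persist the setting across reboots (file name is an assumption).
echo 'net.core.somaxconn=2048' > /etc/sysctl.d/99-ovn-scale.conf

# Set a 9000-byte MTU on the controller VM's interface
# ("ens3" is a hypothetical interface name; adjust for your env).
ip link set dev ens3 mtu 9000

# Verify both settings took effect.
sysctl net.core.somaxconn
ip link show dev ens3
```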
Re: [ovs-dev] raft ovsdb clustering with scale test
Thanks Numan: Will also try to dig a little deeper and see if we get a few more hints by increasing the log levels to debug, and will wait for further inputs from you and Ben. Hope we can find the culprit soon :)

Regards,

On Tue, Apr 17, 2018 at 12:13 AM, Numan Siddique wrote:
>
> On Tue, Apr 17, 2018 at 5:52 AM, aginwala wrote:
>
>> Hi Ben/Numan:
>>
>> Just following up again. Let me know for any further
>> improvements/suggestions on this.
>>
>
> Hi Aliasgar,
>
> I need to educate myself first by studying raft and Ben's patches :). I
> hope I will do it soon.
>
> Thanks
> Numan
Re: [ovs-dev] raft ovsdb clustering with scale test
On Tue, Apr 17, 2018 at 5:52 AM, aginwala wrote:

> Hi Ben/Numan:
>
> Just following up again. Let me know for any further
> improvements/suggestions on this.
>

Hi Aliasgar,

I need to educate myself first by studying raft and Ben's patches :). I hope I will do it soon.

Thanks
Numan
Re: [ovs-dev] raft ovsdb clustering with scale test
Hi Ben/Numan:

Just following up again. Let me know for any further improvements/suggestions on this.

Regards,

On Mon, Apr 9, 2018 at 10:53 AM, aginwala wrote:

> Further Analysis indicates that cluster is not able to select a new leader
> and hence every node keeps retrying:
Re: [ovs-dev] raft ovsdb clustering with scale test
Further analysis indicates that the cluster is not able to elect a new leader, and hence every node keeps retrying.

The ovn-northd logs around the same time on all three nodes complain about disconnection from the cluster and failure to choose a leader:

#Node1
2018-04-06T01:19:06.481Z|00466|ovn_northd|INFO|ovn-northd lock lost. This ovn-northd instance is now on standby.
2018-04-06T01:19:07.668Z|00467|ovn_northd|INFO|ovn-northd lock acquired. This ovn-northd instance is now active.
2018-04-06T01:19:07.671Z|00468|ovsdb_idl|INFO|tcp:192.168.220.102:6642: clustered database server is not cluster leader; trying another server
2018-04-06T01:19:15.673Z|00469|reconnect|INFO|tcp:192.168.220.103:6642: connected
2018-04-06T01:19:15.673Z|00470|ovn_northd|INFO|ovn-northd lock lost. This ovn-northd instance is now on standby.

#Node2
2018-04-06T01:19:58.249Z|00487|ovn_northd|INFO|ovn-northd lock acquired. This ovn-northd instance is now active.
2018-04-06T01:19:58.255Z|00488|ovsdb_idl|INFO|tcp:192.168.220.103:6642: clustered database server is disconnected from cluster; trying another server
2018-04-06T01:20:09.261Z|00489|reconnect|INFO|tcp:192.168.220.102:6642: connected
2018-04-06T01:20:09.261Z|00490|ovn_northd|INFO|ovn-northd lock lost. This ovn-northd instance is now on standby.

#Node3
2018-04-06T01:19:01.654Z|00964|ovn_northd|INFO|ovn-northd lock acquired. This ovn-northd instance is now active.
2018-04-06T01:19:01.711Z|00965|ovsdb_idl|INFO|tcp:192.168.220.102:6642: clustered database server is disconnected from cluster; trying another server
2018-04-06T01:19:09.715Z|00966|reconnect|INFO|tcp:192.168.220.103:6642: connected
2018-04-06T01:19:09.716Z|00967|ovn_northd|INFO|ovn-northd lock lost. This ovn-northd instance is now on standby.

Regards,
Aliasgar

On Fri, Apr 6, 2018 at 2:16 PM, aginwala wrote:

> Hi Ben/Numan:
>
> So I went ahead to try out clustered db in scale env and results are not
> as expected.
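When the cluster flaps between leaders like this, each member's view of the RAFT state can be inspected directly. A minimal diagnostic sketch, assuming default-style control socket and database file paths (both vary by install, so adjust for your environment):

```shell
# Ask each ovsdb-server for its RAFT view: role (leader/follower/
# candidate), current term, log indexes, and peer connections.
# Run on each of the three nodes; socket paths are assumptions.
ovs-appctl -t /var/run/openvswitch/ovnnb_db.ctl cluster/status OVN_Northbound
ovs-appctl -t /var/run/openvswitch/ovnsb_db.ctl cluster/status OVN_Southbound

# Offline consistency check of a clustered database file
# (run against a copy, or with the server stopped).
ovsdb-tool check-cluster /etc/openvswitch/ovnsb_db.db
```

Comparing the reported term and leader across the three nodes shows whether elections are completing at all or the members keep timing out and re-voting.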
[ovs-dev] raft ovsdb clustering with scale test
Hi Ben/Numan:

So I went ahead to try out the clustered db in the scale env, and the results are not as expected. The OVS branch used is master; commit-id (2062840612904ff0873d46b2f4f226bc23f2296d).

The setup is using 3 nodes.

Also disabled inactivity_probe:
ovn-nbctl --db="tcp:192.168.220.101:6641,tcp:192.168.220.102:6641,tcp:192.168.220.103:6641" set NB . external_ids:inactivity_probe=0
ovn-sbctl --db="tcp:192.168.220.101:6642,tcp:192.168.220.102:6642,tcp:192.168.220.103:6642" set SB . external_ids:inactivity_probe=0

1. With wait_up = true for 10k lports and 1k HVs, which internally uses wait-until, it was able to create around 1k lports, and rally exited due to a db connection error.
Cause of failure: database connection failed is @ https://raw.githubusercontent.com/noah8713/ovn-scale-test/e398ccdd77b5c0bfed3d1cfe37c7e7ac5d8d8f81/results/output_raft_fail.html

Some data from sb db logs is below:
2018-04-05T06:29:32.628Z|00065|poll_loop|INFO|wakeup due to 9-ms timeout at lib/reconnect.c:643 (90% CPU usage)
2018-04-05T06:30:19.432Z|00066|raft|INFO|current entry eid 23a85453-70a1-49ae-bf8f-22a1eeda6a60 does not match prerequisite 966d3234-961a-4dc5-b2ad-4f12766fd610 in execute_command_request
2018-04-05T06:31:03.428Z|00067|raft|INFO|current entry eid f0e3517a-e59a-4533-9e60-a3ded2ba704b does not match prerequisite 43ed7278-e1de-44b8-83ea-1cc9578c4d41 in execute_command_request
2018-04-05T06:31:03.469Z|00068|raft|INFO|current entry eid f0e3517a-e59a-4533-9e60-a3ded2ba704b does not match prerequisite 43ed7278-e1de-44b8-83ea-1cc9578c4d41 in execute_command_request
2018-04-05T06:31:03.492Z|00069|raft|INFO|current entry eid f0e3517a-e59a-4533-9e60-a3ded2ba704b does not match prerequisite 43ed7278-e1de-44b8-83ea-1cc9578c4d41 in execute_command_request
2018-04-05T06:31:03.505Z|00070|raft|INFO|current entry eid f0e3517a-e59a-4533-9e60-a3ded2ba704b does not match prerequisite 43ed7278-e1de-44b8-83ea-1cc9578c4d41 in execute_command_request
2018-04-05T06:31:03.510Z|00071|raft|INFO|current entry eid f0e3517a-e59a-4533-9e60-a3ded2ba704b does not match prerequisite 43ed7278-e1de-44b8-83ea-1cc9578c4d41 in execute_command_request
2018-04-05T06:32:04.056Z|00072|memory|INFO|peak resident set size grew 51% in last 35031.7 seconds, from 24748 kB to 37332 kB
2018-04-05T06:32:04.056Z|00073|memory|INFO|cells:59908 json-caches:1 monitors:3 sessions:431
2018-04-05T06:32:27.956Z|00074|raft|WARN|ignoring vote request received as leader
2018-04-05T06:32:27.958Z|00075|timeval|WARN|Unreasonably long 1278ms poll interval (1044ms user, 232ms system)
2018-04-05T06:32:27.958Z|00076|timeval|WARN|faults: 61669 minor, 0 major
2018-04-05T06:32:27.958Z|00077|timeval|WARN|context switches: 0 voluntary, 3 involuntary
2018-04-05T06:32:27.958Z|00078|coverage|INFO|Event coverage, avg rate over last: 5 seconds, last minute, last hour, hash=1dd2af20:
2018-04-05T06:32:27.959Z|00079|coverage|INFO|hmap_pathological 25.0/sec 24.633/sec 1.5550/sec total: 19274
2018-04-05T06:32:27.959Z|00080|coverage|INFO|hmap_expand 4202.6/sec 4041.417/sec 578.9233/sec total: 14158936
2018-04-05T06:32:27.959Z|00081|coverage|INFO|lockfile_lock 0.0/sec 0.000/sec 0.0000/sec total: 1
2018-04-05T06:32:27.960Z|00082|coverage|INFO|poll_create_node 50281.4/sec 47195.233/sec 29424.6886/sec total: 1001238470
2018-04-05T06:32:27.960Z|00083|coverage|INFO|poll_zero_timeout 9.8/sec 12.167/sec 8.2392/sec total: 312540
2018-04-05T06:32:27.960Z|00084|coverage|INFO|seq_change 122.8/sec 115.967/sec 68.2761/sec total: 2309640
2018-04-05T06:32:27.961Z|00085|coverage|INFO|pstream_open 0.0/sec 0.000/sec 0.0000/sec total: 4
2018-04-05T06:32:27.961Z|00086|coverage|INFO|stream_open 0.0/sec 0.000/sec 0.0000/sec total: 9
2018-04-05T06:32:27.961Z|00087|coverage|INFO|util_xalloc 229362.2/sec 218086.767/sec 46494.5058/sec total: 1344641131

Some data from nb db logs:
2018-04-05T23:35:33.458Z|00020|raft|INFO|adding b7bd (b7bd at tcp:192.168.220.103:6645) to cluster a36d succeeded (completed)
2018-04-05T23:53:06.360Z|00021|jsonrpc|WARN|tcp:192.168.10.10:57298: receive error: Connection reset by peer
2018-04-05T23:53:06.360Z|00022|reconnect|WARN|tcp:192.168.10.10:57298: connection dropped (Connection reset by peer)
2018-04-05T23:53:09.071Z|00023|jsonrpc|WARN|tcp:192.168.10.10:57348: receive error: Connection reset by peer
2018-04-05T23:53:09.071Z|00024|reconnect|WARN|tcp:192.168.10.10:57348: connection dropped (Connection reset by peer)
2018-04-06T00:02:03.864Z|00076|raft|INFO|current entry eid 0b9eeff2-8377-4f6d-a907-cb4e98396d14 does not match prerequisite 277c09b4-96b3-43e9-b482-a9df29576988
2018-04-06T00:02:06.882Z|00077|poll_loop|INFO|Dropped 6187 log messages in last 6 seconds (most recently, 0 seconds ago) due to excessive rate
2018-04-06T00:02:06.882Z|00078|poll_loop|INFO|wakeup due to [POLLIN] on fd 31 (192.168.220.102:6645<->192.168.10.10:54782) at lib/stream-fd.c:157 (62% CPU usage)
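For reference, the wait_up behavior that rally exercises maps onto ovn-nbctl's wait-until. A minimal sketch of what one iteration does against the cluster; the switch and port names (sw0, sw0-port1) and the 60-second timeout are hypothetical values for illustration:

```shell
# All three cluster members, so the client can fail over.
DB="tcp:192.168.220.101:6641,tcp:192.168.220.102:6641,tcp:192.168.220.103:6641"

# Create a switch and a logical port on it.
ovn-nbctl --db="$DB" ls-add sw0
ovn-nbctl --db="$DB" lsp-add sw0 sw0-port1

# Block until the port's "up" column becomes true, i.e. until a
# chassis has bound it; this is the expensive step at scale.
ovn-nbctl --db="$DB" --timeout=60 wait-until Logical_Switch_Port sw0-port1 'up=true'
```

Each such wait-until holds an open transaction/monitor against whichever server the client lands on, which is why these calls amplify load on the cluster as the lport count grows.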