Re: [ovs-dev] raft ovsdb clustering with scale test

2018-04-23 Thread aginwala
Hi :

With the log level set to debug, I was able to get a little further and see a
proper error. However, there were a couple of changes that I made to the
current setup first.
For the controller setup, since I am using 3 VMs, I did the following to keep
the specs in line with the current OVN scale env:

   1. Increased the memory of my VMs to 40 GB, as I saw out-of-memory
   errors with 4 GB.
   2. Set the MTU to 9000 instead of 1500 (the bare-metal scale env used 1500).
   3. Raised net.core.somaxconn to 2048, since the Ubuntu image defaults to
   just 128 and the connection rate is high (example commands below).
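
For reference, the somaxconn and MTU changes come down to standard Linux
commands along these lines (ens3 is just a placeholder for whichever NIC
carries the DB traffic, and the sysctl needs an entry in /etc/sysctl.conf to
persist across reboots):

# raise the accept-queue backlog ceiling (Ubuntu defaults to 128)
sysctl -w net.core.somaxconn=2048
# raise the MTU on the DB-facing interface
ip link set dev ens3 mtu 9000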



After 1.5k lports were created, another request for 250 lports took 20+
minutes to go through according to the logs, which is huge.

With the above settings in place, the active northd, even though it was
processing a little slowly, did not show any errors and got to around 2k
lports instead of the 1k lports in the previous runs:

The northd logs from when the wait-until command is run in debug mode are
available for reference at
https://github.com/noah8713/ovn-scale-test/blob/scale_results/results/active_northd_nod3.txt


Finally, with wait-until, the NB DB process exited with a segmentation fault
on all 3 nodes.

e.g. ovn-nbctl --db="tcp:192.168.220.101:6641,tcp:192.168.220.102:6641,tcp:
192.168.220.103:6641" show
ovn-nbctl: tcp:192.168.220.101:6641,tcp:192.168.220.102:6641,tcp:
192.168.220.103:6641: database connection failed (Connection refused)
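
In case it helps with debugging, the per-node raft state can be checked with
ovsdb-server's cluster/status appctl command; a minimal example, assuming the
default ctl socket path from my environment (it may differ elsewhere):

ovs-appctl -t /var/run/openvswitch/ovnnb_db.ctl cluster/status OVN_Northbound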

#node3 ovsdb-server-nb logs

2018-04-23T05:11:34.237Z|00015|raft|INFO|tcp:192.168.10.10:43412: learned
server ID 9e1c
2018-04-23T05:11:34.237Z|00016|raft|INFO|tcp:192.168.10.10:43412: learned
remote address tcp:192.168.220.101:6643
2018-04-23T05:11:34.240Z|00017|raft|INFO|tcp:192.168.10.10:43006: learned
server ID f038
2018-04-23T05:11:34.240Z|00018|raft|INFO|tcp:192.168.10.10:43006: learned
remote address tcp:192.168.220.102:6643
2018-04-23T05:11:44.216Z|00019|memory|INFO|2536 kB peak resident set size
after 10.0 seconds
2018-04-23T05:11:44.216Z|00020|memory|INFO|cells:28 monitors:0
2018-04-23T05:21:04.276Z|00021|memory|INFO|peak resident set size grew 51%
in last 560.1 seconds, from 2536 kB to 3828 kB
2018-04-23T05:21:04.276Z|00022|memory|INFO|cells:5588 monitors:0
2018-04-23T05:24:14.327Z|00023|memory|INFO|peak resident set size grew 76%
in last 190.1 seconds, from 3828 kB to 6744 kB
2018-04-23T05:24:14.328Z|00024|memory|INFO|cells:13588 monitors:0
2018-04-23T05:26:54.365Z|00025|memory|INFO|peak resident set size grew 52%
in last 160.0 seconds, from 6744 kB to 10220 kB
2018-04-23T05:26:54.365Z|00026|memory|INFO|cells:13588 monitors:0
2018-04-23T05:27:04.370Z|00027|memory|INFO|peak resident set size grew 51%
in last 10.0 seconds, from 10220 kB to 15472 kB
2018-04-23T05:27:04.370Z|00028|memory|INFO|cells:13588 monitors:0
2018-04-23T05:28:54.388Z|00029|memory|INFO|peak resident set size grew 60%
in last 110.0 seconds, from 15472 kB to 24780 kB
2018-04-23T05:28:54.388Z|00030|memory|INFO|cells:17588 monitors:0
2018-04-23T05:29:44.399Z|00031|memory|INFO|peak resident set size grew 59%
in last 50.0 seconds, from 24780 kB to 39480 kB
2018-04-23T05:29:44.400Z|00032|memory|INFO|cells:17588 monitors:0
2018-04-23T05:31:44.470Z|00033|memory|INFO|peak resident set size grew 52%
in last 120.1 seconds, from 39480 kB to 6 kB
2018-04-23T05:31:44.470Z|00034|memory|INFO|cells:17588 monitors:0
2018-04-23T05:32:24.474Z|00035|memory|INFO|peak resident set size grew 56%
in last 40.0 seconds, from 6 kB to 93528 kB
2018-04-23T05:32:24.475Z|00036|memory|INFO|cells:17588 monitors:0
2018-04-23T05:33:44.482Z|00037|memory|INFO|peak resident set size grew 52%
in last 80.0 seconds, from 93528 kB to 142304 kB
2018-04-23T05:33:44.482Z|00038|memory|INFO|cells:17588 monitors:0
2018-04-23T05:37:34.511Z|00039|memory|INFO|peak resident set size grew 52%
in last 230.0 seconds, from 142304 kB to 215640 kB
2018-04-23T05:37:34.511Z|00040|memory|INFO|cells:21588 monitors:0
2018-04-23T05:56:30.576Z|2|daemon_unix(monitor)|ERR|1 crashes: pid
16603 died, killed (Segmentation fault), core dumped, restarting
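
Since the monitor reports a core dump, a backtrace could be pulled from the
core with gdb once the file is located (the binary and core paths below are
only examples; the actual core location depends on kernel.core_pattern):

gdb --batch -ex 'bt full' /usr/sbin/ovsdb-server /path/to/core.16603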



#node 2 ovsdb-server-nb logs
2018-04-23T05:56:30.298Z|00040|reconnect|INFO|tcp:192.168.220.103:6643:
connection closed by peer
2018-04-23T05:56:31.298Z|00041|reconnect|INFO|tcp:192.168.220.103:6643:
connecting...
2018-04-23T05:56:31.299Z|00042|reconnect|INFO|tcp:192.168.220.103:6643:
connection attempt failed (Connection refused)
2018-04-23T05:56:31.299Z|00043|reconnect|INFO|tcp:192.168.220.103:6643:
waiting 2 seconds before reconnect
2018-04-23T05:56:33.298Z|00044|reconnect|INFO|tcp:192.168.220.103:6643:
connecting...
2018-04-23T05:56:33.298Z|00045|reconnect|INFO|tcp:192.168.220.103:6643:
connection attempt failed (Connection refused)
2018-04-23T05:56:33.298Z|00046|reconnect|INFO|tcp:192.168.220.103:6643:
waiting 4 seconds before reconnect
2018-04-23T05:56:37.298Z|00047|reconnect|INFO|tcp:192.168.220.103:6643:
connecting...
2018-04-23T05:56:37.298Z|00048|reconnect|INFO|tcp:192.168.220.103:6643:
connection attempt failed (Connection refused)

Re: [ovs-dev] raft ovsdb clustering with scale test

2018-04-17 Thread aginwala
Thanks Numan:


I will also try to dig a little deeper and see if we get some more hints by
increasing the log levels to debug, and will wait for further input from you
and Ben. Hope we can find the culprit soon :)
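
For reference, bumping the clustered DB servers to debug logging without a
restart can be done through appctl (the ctl socket paths below are from my
setup and may differ):

ovs-appctl -t /var/run/openvswitch/ovnnb_db.ctl vlog/set dbg
ovs-appctl -t /var/run/openvswitch/ovnsb_db.ctl vlog/set dbg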


Regards,


On Tue, Apr 17, 2018 at 12:13 AM, Numan Siddique 
wrote:

>
>
> On Tue, Apr 17, 2018 at 5:52 AM, aginwala  wrote:
>
>> Hi Ben/Numan:
>>
>>
>> Just following up again. Let me know for any further
>> improvements/suggestions on this.
>>
>>
> Hi Aliasgar,
>
> I need to educate myself first by studying raft and Ben's patches :). I
> hope I will do it soon.
>
> Thanks
> Numan
>
>
>
>
>> Regards,
>>
>>
>> On Mon, Apr 9, 2018 at 10:53 AM, aginwala  wrote:
>>
>> > Further analysis indicates that the cluster is not able to elect a new
>> > leader and hence every node keeps retrying:
>> >
>> > The ovn-northd logs around the same time on all three nodes complain about
>> > being disconnected from the cluster and failing to elect a leader:
>> >
>> >
>> > #Node1
>> > 2018-04-06T01:19:06.481Z|00466|ovn_northd|INFO|ovn-northd lock lost.
>> This
>> > ovn-northd instance is now on standby.
>> > 2018-04-06T01:19:07.668Z|00467|ovn_northd|INFO|ovn-northd lock
>> acquired.
>> > This ovn-northd instance is now active.
>> > 2018-04-06T01:19:07.671Z|00468|ovsdb_idl|INFO|tcp:192.168.220.102:6642:
>> > clustered database server is not cluster leader; trying another server
>> > 2018-04-06T01:19:15.673Z|00469|reconnect|INFO|tcp:192.168.220.103:6642:
>> > connected
>> > 2018-04-06T01:19:15.673Z|00470|ovn_northd|INFO|ovn-northd lock lost.
>> This
>> > ovn-northd instance is now on standby.
>> >
>> > #Node2
>> > 2018-04-06T01:19:58.249Z|00487|ovn_northd|INFO|ovn-northd lock
>> acquired.
>> > This ovn-northd instance is now active.
>> > 2018-04-06T01:19:58.255Z|00488|ovsdb_idl|INFO|tcp:192.168.220.103:6642:
>> > clustered database server is disconnected from cluster; trying another
>> > server
>> > 2018-04-06T01:20:09.261Z|00489|reconnect|INFO|tcp:192.168.220.102:6642:
>> > connected
>> > 2018-04-06T01:20:09.261Z|00490|ovn_northd|INFO|ovn-northd lock lost.
>> This
>> > ovn-northd instance is now on standby.
>> >
>> >
>> > #Node3
>> > 2018-04-06T01:19:01.654Z|00964|ovn_northd|INFO|ovn-northd lock
>> acquired.
>> > This ovn-northd instance is now active.
>> > 2018-04-06T01:19:01.711Z|00965|ovsdb_idl|INFO|tcp:192.168.220.102:6642:
>> > clustered database server is disconnected from cluster; trying another
>> > server
>> > 2018-04-06T01:19:09.715Z|00966|reconnect|INFO|tcp:192.168.220.103:6642:
>> > connected
>> > 2018-04-06T01:19:09.716Z|00967|ovn_northd|INFO|ovn-northd lock lost.
>> This
>> > ovn-northd instance is now on standby.
>> >
>> >
>> >
>> > Regards,
>> > Aliasgar
>> >
>> > On Fri, Apr 6, 2018 at 2:16 PM, aginwala  wrote:
>> >
>> >> Hi Ben/Numan:
>> >>
>> >> So I went ahead to try out clustered db in scale env and results are
>> not
>> >> as expected.
>> >> OVS branch used is master; commit-id(2062840612904ff0873d
>> >> 46b2f4f226bc23f2296d)
>> >>
>> >> Setup is using 3 nodes.
>> >>
>> >> Also disabled inactivity_probe,
>> >> ovn-nbctl --db="tcp:192.168.220.101:6641,tcp:192.168.220.102:6641,tcp:
>> >> 192.168.220.103:6641" set NB . external_ids:inactivity_probe=0
>> >> ovn-sbctl --db="tcp:192.168.220.101:6642,tcp:192.168.220.102:6642,tcp:
>> >> 192.168.220.103:6642" set SB . external_ids:inactivity_probe=0
>> >>
>> >>
>> >> 1. With wait_up = true for 10k lports and 1k HVs which internally uses
>> >> wait-until, it was able to create around 1k lports and rally exited due
>> >> to a db connection error.
>> >> The cause of failure ("database connection failed") is @
>> >> https://raw.githubusercontent.com/noah8713/ovn-scale-test/e3
>> >> 98ccdd77b5c0bfed3d1cfe37c7e7ac5d8d8f81/results/output_raft_fail.html
>> >>
>> >> Some data from sb db logs are as below
>> >> 2018-04-05T06:29:32.628Z|00065|poll_loop|INFO|wakeup due to 9-ms
>> timeout
>> >> at lib/reconnect.c:643 (90% CPU usage)
>> >> 2018-04-05T06:30:19.432Z|00066|raft|INFO|current entry eid
>> >> 23a85453-70a1-49ae-bf8f-22a1eeda6a60 does not match prerequisite
>> >> 966d3234-961a-4dc5-b2ad-4f12766fd610 in execute_command_request
>> >> 2018-04-05T06:31:03.428Z|00067|raft|INFO|current entry eid
>> >> f0e3517a-e59a-4533-9e60-a3ded2ba704b does not match prerequisite
>> >> 43ed7278-e1de-44b8-83ea-1cc9578c4d41 in execute_command_request
>> >> 2018-04-05T06:31:03.469Z|00068|raft|INFO|current entry eid
>> >> f0e3517a-e59a-4533-9e60-a3ded2ba704b does not match prerequisite
>> >> 43ed7278-e1de-44b8-83ea-1cc9578c4d41 in execute_command_request
>> >> 2018-04-05T06:31:03.492Z|00069|raft|INFO|current entry eid
>> >> f0e3517a-e59a-4533-9e60-a3ded2ba704b does not match prerequisite
>> >> 43ed7278-e1de-44b8-83ea-1cc9578c4d41 in execute_command_request
>> >> 2018-04-05T06:31:03.505Z|00070|raft|INFO|current entry eid
>> >> f0e3517a-e59a-4533-9e60-a3ded2ba704b does not match prerequisite
>> >> 

Re: [ovs-dev] raft ovsdb clustering with scale test

2018-04-17 Thread Numan Siddique
On Tue, Apr 17, 2018 at 5:52 AM, aginwala  wrote:

> Hi Ben/Numan:
>
>
> Just following up again. Let me know for any further
> improvements/suggestions on this.
>
>
Hi Aliasgar,

I need to educate myself first by studying raft and Ben's patches :). I
hope I will do it soon.

Thanks
Numan




> Regards,
>
>
> On Mon, Apr 9, 2018 at 10:53 AM, aginwala  wrote:
>
> > Further analysis indicates that the cluster is not able to elect a new
> > leader and hence every node keeps retrying:
> >
> > The ovn-northd logs around the same time on all three nodes complain about
> > being disconnected from the cluster and failing to elect a leader:
> >
> >
> > #Node1
> > 2018-04-06T01:19:06.481Z|00466|ovn_northd|INFO|ovn-northd lock lost.
> This
> > ovn-northd instance is now on standby.
> > 2018-04-06T01:19:07.668Z|00467|ovn_northd|INFO|ovn-northd lock acquired.
> > This ovn-northd instance is now active.
> > 2018-04-06T01:19:07.671Z|00468|ovsdb_idl|INFO|tcp:192.168.220.102:6642:
> > clustered database server is not cluster leader; trying another server
> > 2018-04-06T01:19:15.673Z|00469|reconnect|INFO|tcp:192.168.220.103:6642:
> > connected
> > 2018-04-06T01:19:15.673Z|00470|ovn_northd|INFO|ovn-northd lock lost.
> This
> > ovn-northd instance is now on standby.
> >
> > #Node2
> > 2018-04-06T01:19:58.249Z|00487|ovn_northd|INFO|ovn-northd lock acquired.
> > This ovn-northd instance is now active.
> > 2018-04-06T01:19:58.255Z|00488|ovsdb_idl|INFO|tcp:192.168.220.103:6642:
> > clustered database server is disconnected from cluster; trying another
> > server
> > 2018-04-06T01:20:09.261Z|00489|reconnect|INFO|tcp:192.168.220.102:6642:
> > connected
> > 2018-04-06T01:20:09.261Z|00490|ovn_northd|INFO|ovn-northd lock lost.
> This
> > ovn-northd instance is now on standby.
> >
> >
> > #Node3
> > 2018-04-06T01:19:01.654Z|00964|ovn_northd|INFO|ovn-northd lock acquired.
> > This ovn-northd instance is now active.
> > 2018-04-06T01:19:01.711Z|00965|ovsdb_idl|INFO|tcp:192.168.220.102:6642:
> > clustered database server is disconnected from cluster; trying another
> > server
> > 2018-04-06T01:19:09.715Z|00966|reconnect|INFO|tcp:192.168.220.103:6642:
> > connected
> > 2018-04-06T01:19:09.716Z|00967|ovn_northd|INFO|ovn-northd lock lost.
> This
> > ovn-northd instance is now on standby.
> >
> >
> >
> > Regards,
> > Aliasgar
> >
> > On Fri, Apr 6, 2018 at 2:16 PM, aginwala  wrote:
> >
> >> Hi Ben/Numan:
> >>
> >> So I went ahead to try out clustered db in scale env and results are not
> >> as expected.
> >> OVS branch used is master; commit-id(2062840612904ff0873d
> >> 46b2f4f226bc23f2296d)
> >>
> >> Setup is using 3 nodes.
> >>
> >> Also disabled inactivity_probe,
> >> ovn-nbctl --db="tcp:192.168.220.101:6641,tcp:192.168.220.102:6641,tcp:
> >> 192.168.220.103:6641" set NB . external_ids:inactivity_probe=0
> >> ovn-sbctl --db="tcp:192.168.220.101:6642,tcp:192.168.220.102:6642,tcp:
> >> 192.168.220.103:6642" set SB . external_ids:inactivity_probe=0
> >>
> >>
> >> 1. With wait_up = true for 10k lports and 1k HVs which internally uses
> >> wait-until, it was able to create around 1k lports and rally exited due
> >> to a db connection error.
> >> The cause of failure ("database connection failed") is @
> >> https://raw.githubusercontent.com/noah8713/ovn-scale-test/e3
> >> 98ccdd77b5c0bfed3d1cfe37c7e7ac5d8d8f81/results/output_raft_fail.html
> >>
> >> Some data from sb db logs are as below
> >> 2018-04-05T06:29:32.628Z|00065|poll_loop|INFO|wakeup due to 9-ms
> timeout
> >> at lib/reconnect.c:643 (90% CPU usage)
> >> 2018-04-05T06:30:19.432Z|00066|raft|INFO|current entry eid
> >> 23a85453-70a1-49ae-bf8f-22a1eeda6a60 does not match prerequisite
> >> 966d3234-961a-4dc5-b2ad-4f12766fd610 in execute_command_request
> >> 2018-04-05T06:31:03.428Z|00067|raft|INFO|current entry eid
> >> f0e3517a-e59a-4533-9e60-a3ded2ba704b does not match prerequisite
> >> 43ed7278-e1de-44b8-83ea-1cc9578c4d41 in execute_command_request
> >> 2018-04-05T06:31:03.469Z|00068|raft|INFO|current entry eid
> >> f0e3517a-e59a-4533-9e60-a3ded2ba704b does not match prerequisite
> >> 43ed7278-e1de-44b8-83ea-1cc9578c4d41 in execute_command_request
> >> 2018-04-05T06:31:03.492Z|00069|raft|INFO|current entry eid
> >> f0e3517a-e59a-4533-9e60-a3ded2ba704b does not match prerequisite
> >> 43ed7278-e1de-44b8-83ea-1cc9578c4d41 in execute_command_request
> >> 2018-04-05T06:31:03.505Z|00070|raft|INFO|current entry eid
> >> f0e3517a-e59a-4533-9e60-a3ded2ba704b does not match prerequisite
> >> 43ed7278-e1de-44b8-83ea-1cc9578c4d41 in execute_command_request
> >> 2018-04-05T06:31:03.510Z|00071|raft|INFO|current entry eid
> >> f0e3517a-e59a-4533-9e60-a3ded2ba704b does not match prerequisite
> >> 43ed7278-e1de-44b8-83ea-1cc9578c4d41 in execute_command_request
> >> 2018-04-05T06:32:04.056Z|00072|memory|INFO|peak resident set size grew
> >> 51% in last 35031.7 seconds, from 24748 kB to 37332 kB
> >> 2018-04-05T06:32:04.056Z|00073|memory|INFO|cells:59908 json-caches:1
> >> 

Re: [ovs-dev] raft ovsdb clustering with scale test

2018-04-16 Thread aginwala
Hi Ben/Numan:


Just following up again. Let me know for any further
improvements/suggestions on this.


Regards,


On Mon, Apr 9, 2018 at 10:53 AM, aginwala  wrote:

> Further analysis indicates that the cluster is not able to elect a new
> leader and hence every node keeps retrying:
>
> The ovn-northd logs around the same time on all three nodes complain about
> being disconnected from the cluster and failing to elect a leader:
>
>
> #Node1
> 2018-04-06T01:19:06.481Z|00466|ovn_northd|INFO|ovn-northd lock lost. This
> ovn-northd instance is now on standby.
> 2018-04-06T01:19:07.668Z|00467|ovn_northd|INFO|ovn-northd lock acquired.
> This ovn-northd instance is now active.
> 2018-04-06T01:19:07.671Z|00468|ovsdb_idl|INFO|tcp:192.168.220.102:6642:
> clustered database server is not cluster leader; trying another server
> 2018-04-06T01:19:15.673Z|00469|reconnect|INFO|tcp:192.168.220.103:6642:
> connected
> 2018-04-06T01:19:15.673Z|00470|ovn_northd|INFO|ovn-northd lock lost. This
> ovn-northd instance is now on standby.
>
> #Node2
> 2018-04-06T01:19:58.249Z|00487|ovn_northd|INFO|ovn-northd lock acquired.
> This ovn-northd instance is now active.
> 2018-04-06T01:19:58.255Z|00488|ovsdb_idl|INFO|tcp:192.168.220.103:6642:
> clustered database server is disconnected from cluster; trying another
> server
> 2018-04-06T01:20:09.261Z|00489|reconnect|INFO|tcp:192.168.220.102:6642:
> connected
> 2018-04-06T01:20:09.261Z|00490|ovn_northd|INFO|ovn-northd lock lost. This
> ovn-northd instance is now on standby.
>
>
> #Node3
> 2018-04-06T01:19:01.654Z|00964|ovn_northd|INFO|ovn-northd lock acquired.
> This ovn-northd instance is now active.
> 2018-04-06T01:19:01.711Z|00965|ovsdb_idl|INFO|tcp:192.168.220.102:6642:
> clustered database server is disconnected from cluster; trying another
> server
> 2018-04-06T01:19:09.715Z|00966|reconnect|INFO|tcp:192.168.220.103:6642:
> connected
> 2018-04-06T01:19:09.716Z|00967|ovn_northd|INFO|ovn-northd lock lost. This
> ovn-northd instance is now on standby.
>
>
>
> Regards,
> Aliasgar
>
> On Fri, Apr 6, 2018 at 2:16 PM, aginwala  wrote:
>
>> Hi Ben/Numan:
>>
>> So I went ahead to try out clustered db in scale env and results are not
>> as expected.
>> OVS branch used is master; commit-id(2062840612904ff0873d
>> 46b2f4f226bc23f2296d)
>>
>> Setup is using 3 nodes.
>>
>> Also disabled inactivity_probe,
>> ovn-nbctl --db="tcp:192.168.220.101:6641,tcp:192.168.220.102:6641,tcp:
>> 192.168.220.103:6641" set NB . external_ids:inactivity_probe=0
>> ovn-sbctl --db="tcp:192.168.220.101:6642,tcp:192.168.220.102:6642,tcp:
>> 192.168.220.103:6642" set SB . external_ids:inactivity_probe=0
>>
>>
>> 1. With wait_up = true for 10k lports and 1k HVs which internally uses
>> wait-until, it was able to create around 1k lports and rally exited due to
>> a db connection error.
>> The cause of failure ("database connection failed") is @
>> https://raw.githubusercontent.com/noah8713/ovn-scale-test/e3
>> 98ccdd77b5c0bfed3d1cfe37c7e7ac5d8d8f81/results/output_raft_fail.html
>>
>> Some data from sb db logs are as below
>> 2018-04-05T06:29:32.628Z|00065|poll_loop|INFO|wakeup due to 9-ms timeout
>> at lib/reconnect.c:643 (90% CPU usage)
>> 2018-04-05T06:30:19.432Z|00066|raft|INFO|current entry eid
>> 23a85453-70a1-49ae-bf8f-22a1eeda6a60 does not match prerequisite
>> 966d3234-961a-4dc5-b2ad-4f12766fd610 in execute_command_request
>> 2018-04-05T06:31:03.428Z|00067|raft|INFO|current entry eid
>> f0e3517a-e59a-4533-9e60-a3ded2ba704b does not match prerequisite
>> 43ed7278-e1de-44b8-83ea-1cc9578c4d41 in execute_command_request
>> 2018-04-05T06:31:03.469Z|00068|raft|INFO|current entry eid
>> f0e3517a-e59a-4533-9e60-a3ded2ba704b does not match prerequisite
>> 43ed7278-e1de-44b8-83ea-1cc9578c4d41 in execute_command_request
>> 2018-04-05T06:31:03.492Z|00069|raft|INFO|current entry eid
>> f0e3517a-e59a-4533-9e60-a3ded2ba704b does not match prerequisite
>> 43ed7278-e1de-44b8-83ea-1cc9578c4d41 in execute_command_request
>> 2018-04-05T06:31:03.505Z|00070|raft|INFO|current entry eid
>> f0e3517a-e59a-4533-9e60-a3ded2ba704b does not match prerequisite
>> 43ed7278-e1de-44b8-83ea-1cc9578c4d41 in execute_command_request
>> 2018-04-05T06:31:03.510Z|00071|raft|INFO|current entry eid
>> f0e3517a-e59a-4533-9e60-a3ded2ba704b does not match prerequisite
>> 43ed7278-e1de-44b8-83ea-1cc9578c4d41 in execute_command_request
>> 2018-04-05T06:32:04.056Z|00072|memory|INFO|peak resident set size grew
>> 51% in last 35031.7 seconds, from 24748 kB to 37332 kB
>> 2018-04-05T06:32:04.056Z|00073|memory|INFO|cells:59908 json-caches:1
>> monitors:3 sessions:431
>> 2018-04-05T06:32:27.956Z|00074|raft|WARN|ignoring vote request received
>> as leader
>> 2018-04-05T06:32:27.958Z|00075|timeval|WARN|Unreasonably long 1278ms
>> poll interval (1044ms user, 232ms system)
>> 2018-04-05T06:32:27.958Z|00076|timeval|WARN|faults: 61669 minor, 0 major
>> 2018-04-05T06:32:27.958Z|00077|timeval|WARN|context switches: 0
>> voluntary, 3 involuntary
>> 

Re: [ovs-dev] raft ovsdb clustering with scale test

2018-04-09 Thread aginwala
Further analysis indicates that the cluster is not able to elect a new leader
and hence every node keeps retrying:

The ovn-northd logs around the same time on all three nodes complain about
being disconnected from the cluster and failing to elect a leader:


#Node1
2018-04-06T01:19:06.481Z|00466|ovn_northd|INFO|ovn-northd lock lost. This
ovn-northd instance is now on standby.
2018-04-06T01:19:07.668Z|00467|ovn_northd|INFO|ovn-northd lock acquired.
This ovn-northd instance is now active.
2018-04-06T01:19:07.671Z|00468|ovsdb_idl|INFO|tcp:192.168.220.102:6642:
clustered database server is not cluster leader; trying another server
2018-04-06T01:19:15.673Z|00469|reconnect|INFO|tcp:192.168.220.103:6642:
connected
2018-04-06T01:19:15.673Z|00470|ovn_northd|INFO|ovn-northd lock lost. This
ovn-northd instance is now on standby.

#Node2
2018-04-06T01:19:58.249Z|00487|ovn_northd|INFO|ovn-northd lock acquired.
This ovn-northd instance is now active.
2018-04-06T01:19:58.255Z|00488|ovsdb_idl|INFO|tcp:192.168.220.103:6642:
clustered database server is disconnected from cluster; trying another
server
2018-04-06T01:20:09.261Z|00489|reconnect|INFO|tcp:192.168.220.102:6642:
connected
2018-04-06T01:20:09.261Z|00490|ovn_northd|INFO|ovn-northd lock lost. This
ovn-northd instance is now on standby.


#Node3
2018-04-06T01:19:01.654Z|00964|ovn_northd|INFO|ovn-northd lock acquired.
This ovn-northd instance is now active.
2018-04-06T01:19:01.711Z|00965|ovsdb_idl|INFO|tcp:192.168.220.102:6642:
clustered database server is disconnected from cluster; trying another
server
2018-04-06T01:19:09.715Z|00966|reconnect|INFO|tcp:192.168.220.103:6642:
connected
2018-04-06T01:19:09.716Z|00967|ovn_northd|INFO|ovn-northd lock lost. This
ovn-northd instance is now on standby.



Regards,
Aliasgar

On Fri, Apr 6, 2018 at 2:16 PM, aginwala  wrote:

> Hi Ben/Numan:
>
> So I went ahead to try out clustered db in scale env and results are not
> as expected.
> OVS branch used is master; commit-id(2062840612904ff0873d46b2f4f226
> bc23f2296d)
>
> Setup is using 3 nodes.
>
> Also disabled inactivity_probe,
> ovn-nbctl --db="tcp:192.168.220.101:6641,tcp:192.168.220.102:6641,tcp:
> 192.168.220.103:6641" set NB . external_ids:inactivity_probe=0
> ovn-sbctl --db="tcp:192.168.220.101:6642,tcp:192.168.220.102:6642,tcp:
> 192.168.220.103:6642" set SB . external_ids:inactivity_probe=0
>
>
> 1. With wait_up = true for 10k lports and 1k HVs which internally uses
> wait-until, it was able to create around 1k lports and rally exited due to
> a db connection error.
> The cause of failure ("database connection failed") is @
> https://raw.githubusercontent.com/noah8713/ovn-scale-test/
> e398ccdd77b5c0bfed3d1cfe37c7e7ac5d8d8f81/results/output_raft_fail.html
>
> Some data from sb db logs are as below
> 2018-04-05T06:29:32.628Z|00065|poll_loop|INFO|wakeup due to 9-ms timeout
> at lib/reconnect.c:643 (90% CPU usage)
> 2018-04-05T06:30:19.432Z|00066|raft|INFO|current entry eid
> 23a85453-70a1-49ae-bf8f-22a1eeda6a60 does not match prerequisite
> 966d3234-961a-4dc5-b2ad-4f12766fd610 in execute_command_request
> 2018-04-05T06:31:03.428Z|00067|raft|INFO|current entry eid
> f0e3517a-e59a-4533-9e60-a3ded2ba704b does not match prerequisite
> 43ed7278-e1de-44b8-83ea-1cc9578c4d41 in execute_command_request
> 2018-04-05T06:31:03.469Z|00068|raft|INFO|current entry eid
> f0e3517a-e59a-4533-9e60-a3ded2ba704b does not match prerequisite
> 43ed7278-e1de-44b8-83ea-1cc9578c4d41 in execute_command_request
> 2018-04-05T06:31:03.492Z|00069|raft|INFO|current entry eid
> f0e3517a-e59a-4533-9e60-a3ded2ba704b does not match prerequisite
> 43ed7278-e1de-44b8-83ea-1cc9578c4d41 in execute_command_request
> 2018-04-05T06:31:03.505Z|00070|raft|INFO|current entry eid
> f0e3517a-e59a-4533-9e60-a3ded2ba704b does not match prerequisite
> 43ed7278-e1de-44b8-83ea-1cc9578c4d41 in execute_command_request
> 2018-04-05T06:31:03.510Z|00071|raft|INFO|current entry eid
> f0e3517a-e59a-4533-9e60-a3ded2ba704b does not match prerequisite
> 43ed7278-e1de-44b8-83ea-1cc9578c4d41 in execute_command_request
> 2018-04-05T06:32:04.056Z|00072|memory|INFO|peak resident set size grew
> 51% in last 35031.7 seconds, from 24748 kB to 37332 kB
> 2018-04-05T06:32:04.056Z|00073|memory|INFO|cells:59908 json-caches:1
> monitors:3 sessions:431
> 2018-04-05T06:32:27.956Z|00074|raft|WARN|ignoring vote request received
> as leader
> 2018-04-05T06:32:27.958Z|00075|timeval|WARN|Unreasonably long 1278ms poll
> interval (1044ms user, 232ms system)
> 2018-04-05T06:32:27.958Z|00076|timeval|WARN|faults: 61669 minor, 0 major
> 2018-04-05T06:32:27.958Z|00077|timeval|WARN|context switches: 0
> voluntary, 3 involuntary
> 2018-04-05T06:32:27.958Z|00078|coverage|INFO|Event coverage, avg rate
> over last: 5 seconds, last minute, last hour,  hash=1dd2af20:
> 2018-04-05T06:32:27.959Z|00079|coverage|INFO|hmap_pathological
>  25.0/sec24.633/sec1.5550/sec   total: 19274
> 2018-04-05T06:32:27.959Z|00080|coverage|INFO|hmap_expand
> 4202.6/sec  4041.417/sec 

[ovs-dev] raft ovsdb clustering with scale test

2018-04-06 Thread aginwala
Hi Ben/Numan:

So I went ahead to try out clustered db in scale env and results are not as
expected.
OVS branch used is master;
commit-id(2062840612904ff0873d46b2f4f226bc23f2296d)

Setup is using 3 nodes.
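
For completeness, a 3-node NB cluster like this is typically created with
ovsdb-tool's create-cluster/join-cluster commands, roughly as sketched below
(the DB/schema paths and the 6645 raft port here are just examples; the SB
cluster is analogous):

# on the first node (192.168.220.101)
ovsdb-tool create-cluster /etc/openvswitch/ovnnb_db.db \
/usr/share/openvswitch/ovn-nb.ovsschema tcp:192.168.220.101:6645
# on each of the other two nodes, e.g. 192.168.220.102
ovsdb-tool join-cluster /etc/openvswitch/ovnnb_db.db OVN_Northbound \
tcp:192.168.220.102:6645 tcp:192.168.220.101:6645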

Also disabled inactivity_probe,
ovn-nbctl --db="tcp:192.168.220.101:6641,tcp:192.168.220.102:6641,tcp:
192.168.220.103:6641" set NB . external_ids:inactivity_probe=0
ovn-sbctl --db="tcp:192.168.220.101:6642,tcp:192.168.220.102:6642,tcp:
192.168.220.103:6642" set SB . external_ids:inactivity_probe=0


1. With wait_up = true for 10k lports and 1k HVs which internally uses
wait-until, it was able to create around 1k lports and rally exited due to a
db connection error.
The cause of failure ("database connection failed") is @
https://raw.githubusercontent.com/noah8713/ovn-scale-test/e398ccdd77b5c0bfed3d1cfe37c7e7ac5d8d8f81/results/output_raft_fail.html
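
For reference, the wait_up check that rally performs per lport is essentially
an ovn-nbctl wait-until against the clustered NB DB, something like the
following (lport_001 is a placeholder name; the exact invocation inside
ovn-scale-test may differ):

ovn-nbctl --db="tcp:192.168.220.101:6641,tcp:192.168.220.102:6641,tcp:192.168.220.103:6641" \
wait-until Logical_Switch_Port lport_001 up=true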

Some data from sb db logs are as below
2018-04-05T06:29:32.628Z|00065|poll_loop|INFO|wakeup due to 9-ms timeout at
lib/reconnect.c:643 (90% CPU usage)
2018-04-05T06:30:19.432Z|00066|raft|INFO|current entry eid
23a85453-70a1-49ae-bf8f-22a1eeda6a60 does not match prerequisite
966d3234-961a-4dc5-b2ad-4f12766fd610 in execute_command_request
2018-04-05T06:31:03.428Z|00067|raft|INFO|current entry eid
f0e3517a-e59a-4533-9e60-a3ded2ba704b does not match prerequisite
43ed7278-e1de-44b8-83ea-1cc9578c4d41 in execute_command_request
2018-04-05T06:31:03.469Z|00068|raft|INFO|current entry eid
f0e3517a-e59a-4533-9e60-a3ded2ba704b does not match prerequisite
43ed7278-e1de-44b8-83ea-1cc9578c4d41 in execute_command_request
2018-04-05T06:31:03.492Z|00069|raft|INFO|current entry eid
f0e3517a-e59a-4533-9e60-a3ded2ba704b does not match prerequisite
43ed7278-e1de-44b8-83ea-1cc9578c4d41 in execute_command_request
2018-04-05T06:31:03.505Z|00070|raft|INFO|current entry eid
f0e3517a-e59a-4533-9e60-a3ded2ba704b does not match prerequisite
43ed7278-e1de-44b8-83ea-1cc9578c4d41 in execute_command_request
2018-04-05T06:31:03.510Z|00071|raft|INFO|current entry eid
f0e3517a-e59a-4533-9e60-a3ded2ba704b does not match prerequisite
43ed7278-e1de-44b8-83ea-1cc9578c4d41 in execute_command_request
2018-04-05T06:32:04.056Z|00072|memory|INFO|peak resident set size grew 51%
in last 35031.7 seconds, from 24748 kB to 37332 kB
2018-04-05T06:32:04.056Z|00073|memory|INFO|cells:59908 json-caches:1
monitors:3 sessions:431
2018-04-05T06:32:27.956Z|00074|raft|WARN|ignoring vote request received as
leader
2018-04-05T06:32:27.958Z|00075|timeval|WARN|Unreasonably long 1278ms poll
interval (1044ms user, 232ms system)
2018-04-05T06:32:27.958Z|00076|timeval|WARN|faults: 61669 minor, 0 major
2018-04-05T06:32:27.958Z|00077|timeval|WARN|context switches: 0 voluntary,
3 involuntary
2018-04-05T06:32:27.958Z|00078|coverage|INFO|Event coverage, avg rate over
last: 5 seconds, last minute, last hour,  hash=1dd2af20:
2018-04-05T06:32:27.959Z|00079|coverage|INFO|hmap_pathological
 25.0/sec24.633/sec1.5550/sec   total: 19274
2018-04-05T06:32:27.959Z|00080|coverage|INFO|hmap_expand
4202.6/sec  4041.417/sec  578.9233/sec   total: 14158936
2018-04-05T06:32:27.959Z|00081|coverage|INFO|lockfile_lock
0.0/sec 0.000/sec0./sec   total: 1
2018-04-05T06:32:27.960Z|00082|coverage|INFO|poll_create_node
 50281.4/sec 47195.233/sec29424.6886/sec   total: 1001238470
2018-04-05T06:32:27.960Z|00083|coverage|INFO|poll_zero_timeout
9.8/sec12.167/sec8.2392/sec   total: 312540
2018-04-05T06:32:27.960Z|00084|coverage|INFO|seq_change
 122.8/sec   115.967/sec   68.2761/sec   total: 2309640
2018-04-05T06:32:27.961Z|00085|coverage|INFO|pstream_open
 0.0/sec 0.000/sec0./sec   total: 4
2018-04-05T06:32:27.961Z|00086|coverage|INFO|stream_open
0.0/sec 0.000/sec0./sec   total: 9
2018-04-05T06:32:27.961Z|00087|coverage|INFO|util_xalloc
229362.2/sec 218086.767/sec46494.5058/sec   total: 1344641131



some data from nb db logs:
2018-04-05T23:35:33.458Z|00020|raft|INFO|adding b7bd (b7bd at tcp:
192.168.220.103:6645) to cluster a36d succeeded (completed)
2018-04-05T23:53:06.360Z|00021|jsonrpc|WARN|tcp:192.168.10.10:57298:
receive error: Connection reset by peer
2018-04-05T23:53:06.360Z|00022|reconnect|WARN|tcp:192.168.10.10:57298:
connection dropped (Connection reset by peer)
2018-04-05T23:53:09.071Z|00023|jsonrpc|WARN|tcp:192.168.10.10:57348:
receive error: Connection reset by peer
2018-04-05T23:53:09.071Z|00024|reconnect|WARN|tcp:192.168.10.10:57348:
connection dropped (Connection reset by peer)
2018-04-06T00:02:03.864Z|00076|raft|INFO|current entry eid
0b9eeff2-8377-4f6d-a907-cb4e98396d14 does not match prerequisite
277c09b4-96b3-43e9-b482-a9df29576988
2018-04-06T00:02:06.882Z|00077|poll_loop|INFO|Dropped 6187 log messages in
last 6 seconds (most recently, 0 seconds ago) due to excessive rate
2018-04-06T00:02:06.882Z|00078|poll_loop|INFO|wakeup due to [POLLIN] on fd
31 (192.168.220.102:6645<->192.168.10.10:54782) at lib/stream-fd.c:157 (62%
CPU usage)