On Thu, Aug 13, 2020 at 5:26 PM Yun Zhou <y...@nvidia.com> wrote:

> Hi,
>
> We would like an expert's view on a problem we see now and then: an
> ovsdb-server node in a 3-node Raft cluster keeps printing the
> "raft_is_connected: false" message, and the "connected" state in its
> _Server DB stays false.
>
> According to the ovsdb-server(5) manpage, this means the server is not
> in contact with a majority of its cluster.
>
> Apart from its "connected" state, from what we can see, this server is in
> the follower state and works fine, and the connections between it and the
> other two servers appear healthy as well.
>
> Below is its raft structure snapshot at the time of the problem. Note that
> its candidate_retrying field stays as true.
>
> Hopefully the provided information can help figure out what goes wrong
> here. Unfortunately, we don't have a solid case to reproduce it:
>

Thanks for reporting the issue. This looks really strange. In the state
below, leader_sid is non-zero, but candidate_retrying is true.
According to the latest code, whenever leader_sid is set to a non-zero value
(in raft_set_leader()), candidate_retrying is set to false; whenever
candidate_retrying is set to true (in raft_start_election()), leader_sid
is set to UUID_ZERO. The data structure is also allocated with xzalloc(),
which ensures candidate_retrying is false initially. So, sorry, I
can't explain how it ends up in this conflicting state. It would be
helpful if there were a way to reproduce it. How often does it happen?
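
To make the expected invariant concrete, here is a minimal C sketch of the
reasoning above. It is not the code from ovsdb/raft.c: the struct and every
sketch_* / uuid_sketch name are hypothetical stand-ins that keep only the two
fields in question, and the two setters only mirror the behavior described in
this message.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a UUID; not the real struct uuid. */
struct uuid_sketch {
    uint32_t parts[4];
};

/* Hypothetical, trimmed-down stand-in for struct raft that keeps only
 * the two fields discussed here. */
struct raft_sketch {
    struct uuid_sketch leader_sid;
    bool candidate_retrying;
};

static bool
uuid_sketch_is_zero(const struct uuid_sketch *u)
{
    return !(u->parts[0] || u->parts[1] || u->parts[2] || u->parts[3]);
}

/* Mirrors the described effect of raft_set_leader(): recording a leader
 * also clears candidate_retrying. */
static void
sketch_set_leader(struct raft_sketch *r, const struct uuid_sketch *sid)
{
    r->leader_sid = *sid;
    r->candidate_retrying = false;
}

/* Mirrors the described effect of raft_start_election(): the leader is
 * reset to all zeros whenever candidate_retrying may become true. */
static void
sketch_start_election(struct raft_sketch *r, bool retrying)
{
    r->leader_sid = (struct uuid_sketch) { { 0, 0, 0, 0 } };
    r->candidate_retrying = retrying;
}

/* The invariant implied above: candidate_retrying must never be true
 * while leader_sid is non-zero. */
static bool
sketch_invariant_holds(const struct raft_sketch *r)
{
    return !(r->candidate_retrying && !uuid_sketch_is_zero(&r->leader_sid));
}
```

Under these assumptions, a zero-initialized struct and any sequence of the two
setters satisfy the check, while the dumped state (non-zero leader_sid with
candidate_retrying = true) does not, which is why the dump is surprising.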

Thanks,
Han


> (gdb) print *(struct raft *)0xa872c0
> $19 = {
>   hmap_node = {
>     hash = 2911123117,
>     next = 0x0
>   },
>   log = 0xa83690,
>   cid = {
>     parts = {2699238234, 2258650653, 3035282424, 813064186}
>   },
>   sid = {
>     parts = {1071328836, 400573240, 2626104521, 1746414343}
>   },
>   local_address = 0xa874e0 "tcp:10.8.51.55:6643",
>   local_nickname = 0xa876d0 "3fdb",
>   name = 0xa876b0 "OVN_Northbound",
>   servers = {
>     buckets = 0xad4bc0,
>     one = 0x0,
>     mask = 3,
>     n = 3
>   },
>   election_timer = 1000,
>   election_timer_new = 0,
>   term = 3,
>   vote = {
>     parts = {1071328836, 400573240, 2626104521, 1746414343}
>   },
>   synced_term = 3,
>   synced_vote = {
>     parts = {1071328836, 400573240, 2626104521, 1746414343}
>   },
>   entries = 0xbf0fe0,
>   log_start = 2,
>   log_end = 312,
>   log_synced = 311,
>   allocated_log = 512,
>   snap = {
>     term = 1,
>     data = 0xaafb10,
>     eid = {
>       parts = {1838862864, 1569866528, 2969429118, 3021055395}
>     },
>     servers = 0xaafa70,
>     election_timer = 1000
>   },
>   role = RAFT_FOLLOWER,
>   commit_index = 311,
>   last_applied = 311,
>   leader_sid = {
>     parts = {642765114, 43797788, 2533161504, 3088745929}
>   },
>   election_base = 6043283367,
>   election_timeout = 6043284593,
>   joining = false,
>   remote_addresses = {
>     map = {
>       buckets = 0xa87410,
>       one = 0xa879c0,
>       mask = 0,
>       n = 1
>     }
>   },
>   join_timeout = 6037634820,
>   leaving = false,
>   left = false,
>   leave_timeout = 0,
>   failed = false,
>   waiters = {
>     prev = 0xa87448,
>     next = 0xa87448
>   },
>   listener = 0xaafad0,
>   listen_backoff = -9223372036854775808,
>   conns = {
>     prev = 0xbcd660,
>     next = 0xaafc20
>   },
>   add_servers = {
>     buckets = 0xa87480,
>     one = 0x0,
>     mask = 0,
>     n = 0
>   },
>   remove_server = 0x0,
>   commands = {
>     buckets = 0xa874a8,
>     one = 0x0,
>     mask = 0,
>     n = 0
>   },
>   ping_timeout = 6043283700,
>   n_votes = 1,
>   candidate_retrying = true,
>   had_leader = false,
>   ever_had_leader = true
> }
>
> Thanks
> - Yun