[jira] [Updated] (KUDU-2962) Fix kudu::itest::FindTabletFollowers() test utility function

     [ https://issues.apache.org/jira/browse/KUDU-2962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bankim Bhavsar updated KUDU-2962:
---------------------------------
    Status: In Review  (was: Open)

> Fix kudu::itest::FindTabletFollowers() test utility function
> ------------------------------------------------------------
>
>                 Key: KUDU-2962
>                 URL: https://issues.apache.org/jira/browse/KUDU-2962
>             Project: Kudu
>          Issue Type: Improvement
>          Components: test
>            Reporter: Alexey Serbin
>            Assignee: Bankim Bhavsar
>            Priority: Minor
>              Labels: newbie
>
> The {{kudu::itest::FindTabletFollowers()}} function is unsafe: it uses
> {{kudu::itest::FindTabletLeader()}} to generate the result as a complement to
> tablet servers hosting the leader replica, but it doesn't sanitize the set of
> tablet servers to make sure it contains only tablet servers hosting replicas
> of the specified tablet.
> For example, if you have a cluster with 10 tablet servers, and a tablet with
> 3 tablet replicas, passing the map for all tablet servers in the 10-node
> cluster would result in {{FindTabletFollowers()}} reporting 9 followers.
> Whoops!
> It's necessary to either fix the implementation of this utility function to
> sanitize its first argument, or simply get rid of it.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
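The failure mode described in KUDU-2962 can be illustrated with a small standalone sketch. This is not the Kudu test utility itself: `ServerMap`, `ReplicaSet`, `FollowersByComplement()` and `FollowersSanitized()` are hypothetical names invented here, and the real function works on Kudu's own tablet server types. The sketch only assumes what the issue states, namely that followers are computed as "every server in the map except the leader".

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-ins for illustration only; NOT the Kudu itest types.
using ServerMap = std::map<std::string, std::string>;  // server uuid -> address
using ReplicaSet = std::set<std::string>;              // uuids hosting the tablet

// Unsafe variant: every server in the map other than the leader is
// reported as a follower, whether or not it hosts a replica.
std::vector<std::string> FollowersByComplement(const ServerMap& servers,
                                               const std::string& leader_uuid) {
  std::vector<std::string> followers;
  for (const auto& entry : servers) {
    if (entry.first != leader_uuid) followers.push_back(entry.first);
  }
  return followers;
}

// Sanitized variant: a server counts as a follower only if it hosts a
// replica of the tablet and is not the leader.
std::vector<std::string> FollowersSanitized(const ServerMap& servers,
                                            const ReplicaSet& replica_hosts,
                                            const std::string& leader_uuid) {
  // The leader itself must be one of the replica hosts.
  assert(replica_hosts.count(leader_uuid) == 1);
  std::vector<std::string> followers;
  for (const auto& entry : servers) {
    const std::string& uuid = entry.first;
    if (uuid != leader_uuid && replica_hosts.count(uuid) > 0) {
      followers.push_back(uuid);
    }
  }
  return followers;
}
```

With a map of 10 servers and a tablet replicated on 3 of them, the first function returns 9 entries and the second returns 2, which matches the example in the issue description.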
[jira] [Updated] (KUDU-2962) Fix kudu::itest::FindTabletFollowers() test utility function

     [ https://issues.apache.org/jira/browse/KUDU-2962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bankim Bhavsar updated KUDU-2962:
---------------------------------
    Code Review: https://gerrit.cloudera.org/c/14533/

> Fix kudu::itest::FindTabletFollowers() test utility function
> ------------------------------------------------------------
>
>                 Key: KUDU-2962
>                 URL: https://issues.apache.org/jira/browse/KUDU-2962
[jira] [Resolved] (KUDU-2982) Document that kudu-backup (without incremental backup) can be used against older versions of Kudu

     [ https://issues.apache.org/jira/browse/KUDU-2982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Henke resolved KUDU-2982.
-------------------------------
    Fix Version/s: 1.11.0
       Resolution: Fixed

resolved via [https://github.com/apache/kudu/commit/f54ac28bc4f21da19daae6108f4323c5577d534c]

> Document that kudu-backup (without incremental backup) can be used against
> older versions of Kudu
> ---------------------------------------------------------------------------
>
>                 Key: KUDU-2982
>                 URL: https://issues.apache.org/jira/browse/KUDU-2982
>             Project: Kudu
>          Issue Type: Bug
>          Components: documentation
>    Affects Versions: 1.11.0
>            Reporter: Adar Dembo
>            Assignee: Grant Henke
>            Priority: Major
>             Fix For: 1.11.0
>
>
> Ignoring incremental backup (which requires diff scans introduced in 1.10.0),
> kudu-backup is a client-side application that can be safely used against
> older versions of Kudu, probably all the way back to 1.0 (if not older). We
> should doc that for the sake of users of Kudu who are stuck on older versions
> but would still like a simple backup/restore solution.
[jira] [Commented] (KUDU-2452) Prevent follower from causing pre-elections when UpdateConsensus is slow

     [ https://issues.apache.org/jira/browse/KUDU-2452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16957339#comment-16957339 ]

Todd Lipcon commented on KUDU-2452:
-----------------------------------

Another idea here that would be more complicated but have a much bigger positive impact would be to exploit the fact that most heartbeats are simple "lease renewals" with no new tablet-specific information. In other words, the tablet has no operations to replicate, and the safetime advancement is only due to the server-wide clock advancing. In this case, it is somewhat wasteful that we are actually sending such heartbeats once per tablet instead of once per server.

> Prevent follower from causing pre-elections when UpdateConsensus is slow
> ------------------------------------------------------------------------
>
>                 Key: KUDU-2452
>                 URL: https://issues.apache.org/jira/browse/KUDU-2452
>             Project: Kudu
>          Issue Type: Improvement
>    Affects Versions: 1.7.0
>            Reporter: William Berkeley
>            Priority: Major
>              Labels: stability
>
> Thanks to pre-elections (KUDU-1365), slow UpdateConsensus calls on a single
> follower don't disturb the whole tablet by calling elections. However,
> sometimes I see situations where one or more followers are constantly calling
> pre-elections, and only rarely, if ever, overflowing their service queues.
> Occasionally, in 3x replicated tablets, the followers will get "lucky" and
> detect a leader failure at around the same time, and an election will happen.
> This background instability has caused bugs like KUDU-2343 that should be
> rare to occur pretty frequently, plus the extra RequestConsensusVote RPCs add
> a little more stress on the consensus service and on replicas' consensus
> locks. It also spams the logs, since there's generally no exponential
> backoff for these pre-elections because there's a successful heartbeat in
> between them.
> It seems like we can get into the situation where the average number of
> in-flight consensus requests is constant over time, so on average we are
> processing each heartbeat in less than the heartbeat interval, however some
> heartbeats take longer. Since UpdateConsensus calls to a replica are
> serialized, a few of these in a row trigger the failure detector, despite the
> follower receiving every heartbeat in a timely manner and responding
> successfully eventually (and on average in a timely manner).
> It'd be nice to prevent these worthless pre-elections. A couple of ideas:
> 1. Separately calculate a backoff for failed pre-elections, and reset it when
> a pre-election succeeds or more generally when there's an election.
> 2. Don't count the time the follower is executing UpdateConsensus against the
> failure detector. [~mpercy] suggested stopping the failure detector during
> UpdateReplica() and resuming it when the function returns.
> 3. Move leader failure detection out-of-band of UpdateConsensus entirely.
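Ideas 1 and 2 from the KUDU-2452 description can be sketched in a few lines. Everything below is an assumption made for illustration: `PausableFailureDetector`, `PreElectionBackoff`, their methods, and the millisecond timestamps passed in by the caller are hypothetical and do not reflect Kudu's actual consensus classes. The first class stops counting elapsed time while the follower is busy in UpdateConsensus, and the second keeps a separate exponential backoff that grows on failed pre-elections and resets on success.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical sketch; these classes are not part of Kudu. Time is passed
// in explicitly (milliseconds) so the logic stays deterministic.
using Millis = std::int64_t;

// Idea 2: time spent between Pause() and Resume() is not counted against
// the leader failure timeout.
class PausableFailureDetector {
 public:
  explicit PausableFailureDetector(Millis timeout_ms) : timeout_ms_(timeout_ms) {
    assert(timeout_ms > 0);
  }

  // A heartbeat from the leader restarts the countdown.
  void Heartbeat(Millis now_ms) {
    last_heartbeat_ms_ = now_ms;
    paused_ms_ = 0;
    if (pause_start_ms_ >= 0) pause_start_ms_ = now_ms;
  }

  // Call when the follower starts processing an update.
  void Pause(Millis now_ms) {
    if (pause_start_ms_ < 0) pause_start_ms_ = now_ms;
  }

  // Call when the follower finishes processing the update.
  void Resume(Millis now_ms) {
    if (pause_start_ms_ < 0) return;
    paused_ms_ += now_ms - pause_start_ms_;
    pause_start_ms_ = -1;
  }

  // True once the unpaused time since the last heartbeat reaches the timeout.
  bool LeaderSuspected(Millis now_ms) const {
    if (pause_start_ms_ >= 0) return false;  // never fire while paused
    return now_ms - last_heartbeat_ms_ - paused_ms_ >= timeout_ms_;
  }

 private:
  Millis timeout_ms_;
  Millis last_heartbeat_ms_ = 0;
  Millis paused_ms_ = 0;
  Millis pause_start_ms_ = -1;  // -1 means "not paused"
};

// Idea 1: a backoff tracked separately for failed pre-elections.
class PreElectionBackoff {
 public:
  PreElectionBackoff(Millis base_ms, Millis max_ms)
      : base_ms_(base_ms), max_ms_(max_ms) {}

  void RecordFailure() { ++failures_; }  // a pre-election was lost
  void Reset() { failures_ = 0; }        // a pre-election or election succeeded

  // Delay doubles with each consecutive failure, capped at max_ms.
  Millis NextDelayMs() const {
    const int shift = std::min(failures_, 20);  // bound the shift width
    return std::min(max_ms_, base_ms_ << shift);
  }

 private:
  Millis base_ms_;
  Millis max_ms_;
  int failures_ = 0;
};
```

In this sketch a follower would call `Pause()` on entry to its update handler and `Resume()` on return, so a slow but successful update no longer triggers the detector by itself. Idea 3, moving failure detection out-of-band entirely, is not covered here.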