[jira] [Updated] (KUDU-2962) Fix kudu::itest::FindTabletFollowers() test utility function

2019-10-22 Thread Bankim Bhavsar (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-2962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bankim Bhavsar updated KUDU-2962:
-
Status: In Review  (was: Open)

> Fix kudu::itest::FindTabletFollowers() test utility function
> 
>
> Key: KUDU-2962
> URL: https://issues.apache.org/jira/browse/KUDU-2962
> Project: Kudu
> Issue Type: Improvement
> Components: test
> Reporter: Alexey Serbin
> Assignee: Bankim Bhavsar
> Priority: Minor
> Labels: newbie
>
> The {{kudu::itest::FindTabletFollowers()}} function is unsafe: it uses 
> {{kudu::itest::FindTabletLeader()}} to generate the result as a complement to 
> tablet servers hosting the leader replica, but it doesn't sanitize the set of 
> tablet servers to make sure it contains only tablet servers hosting replicas 
> of the specified tablet.
> For example, if you have a cluster with 10 tablet servers, and a tablet with 
> 3 tablet replicas, passing the map for all tablet servers in the 10-node 
> cluster would result in {{FindTabletFollowers()}} reporting 9 followers.  
> Whoops!
> It's necessary to either fix the implementation of this utility function to 
> sanitize its first argument, or simply get rid of it.
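
The sanitizing fix described above can be sketched roughly as follows. This is a minimal illustration, not Kudu's actual code: the TServer struct, its hosted_tablets field, and the FindFollowersSanitized() name are hypothetical stand-ins, and the real utility works with Kudu's own tablet server map type and RPCs.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for a tablet server handle; not Kudu's real type.
struct TServer {
  std::string uuid;
  std::set<std::string> hosted_tablets;  // ids of tablets with a replica here
};

// Returns the UUIDs of servers hosting a non-leader replica of `tablet_id`.
// Servers that host no replica of the tablet are filtered out first, so
// passing the map of every server in the cluster is safe.
std::vector<std::string> FindFollowersSanitized(
    const std::map<std::string, TServer>& servers,
    const std::string& tablet_id,
    const std::string& leader_uuid) {
  std::vector<std::string> followers;
  for (const auto& entry : servers) {
    const TServer& ts = entry.second;
    if (ts.hosted_tablets.count(tablet_id) == 0) continue;  // hosts no replica
    if (entry.first == leader_uuid) continue;               // leader, not follower
    followers.push_back(entry.first);
  }
  return followers;
}
```

With the 10-server, 3-replica example from the description, a function shaped like this reports 2 followers rather than 9.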



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KUDU-2962) Fix kudu::itest::FindTabletFollowers() test utility function

2019-10-22 Thread Bankim Bhavsar (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-2962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bankim Bhavsar updated KUDU-2962:
-
Code Review: https://gerrit.cloudera.org/c/14533/



[jira] [Resolved] (KUDU-2982) Document that kudu-backup (without incremental backup) can be used against older versions of Kudu

2019-10-22 Thread Grant Henke (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-2982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Henke resolved KUDU-2982.
---
Fix Version/s: 1.11.0
   Resolution: Fixed

Resolved via 
[https://github.com/apache/kudu/commit/f54ac28bc4f21da19daae6108f4323c5577d534c]

> Document that kudu-backup (without incremental backup) can be used against 
> older versions of Kudu
> -
>
> Key: KUDU-2982
> URL: https://issues.apache.org/jira/browse/KUDU-2982
> Project: Kudu
> Issue Type: Bug
> Components: documentation
> Affects Versions: 1.11.0
> Reporter: Adar Dembo
> Assignee: Grant Henke
> Priority: Major
> Fix For: 1.11.0
>
>
> Ignoring incremental backup (which requires diff scans introduced in 1.10.0), 
> kudu-backup is a client-side application that can be safely used against 
> older versions of Kudu, probably all the way back to 1.0 (if not older). We 
> should document that for the sake of users of Kudu who are stuck on older 
> versions but would still like a simple backup/restore solution.





[jira] [Commented] (KUDU-2452) Prevent follower from causing pre-elections when UpdateConsensus is slow

2019-10-22 Thread Todd Lipcon (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-2452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16957339#comment-16957339
 ] 

Todd Lipcon commented on KUDU-2452:
---

Another idea here, more complicated but with a much bigger positive impact, 
would be to exploit the fact that most heartbeats are simple "lease renewals" 
with no new tablet-specific information. In other words, the tablet has no 
operations to replicate, and the safetime advancement is only due to the 
server-wide clock advancing. In this case, it is somewhat wasteful that we are 
actually sending such heartbeats once per tablet instead of once per server.
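
To make the per-server idea concrete, here is a rough sketch of how idle heartbeats might be coalesced by destination server. Everything in it is an assumption for illustration: PendingHeartbeat, ServerLeaseRenewal, HeartbeatPlan, and CoalesceIdleHeartbeats() are hypothetical names, and nothing here reflects Kudu's actual consensus RPC structures.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical description of one heartbeat a leader replica wants to send.
struct PendingHeartbeat {
  std::string tablet_id;
  std::string peer_uuid;  // destination tablet server
  bool has_ops;           // true if there are operations to replicate
};

// Hypothetical batched "lease renewal" covering many tablets on one server.
struct ServerLeaseRenewal {
  std::string peer_uuid;
  std::vector<std::string> tablet_ids;
};

struct HeartbeatPlan {
  std::vector<PendingHeartbeat> per_tablet;    // still sent individually
  std::vector<ServerLeaseRenewal> per_server;  // one message per destination
};

// Heartbeats carrying operations stay per-tablet; idle ones are grouped
// by destination server so each server gets a single renewal message.
HeartbeatPlan CoalesceIdleHeartbeats(const std::vector<PendingHeartbeat>& pending) {
  HeartbeatPlan plan;
  std::map<std::string, std::vector<std::string>> idle_by_peer;
  for (const PendingHeartbeat& hb : pending) {
    if (hb.has_ops) {
      plan.per_tablet.push_back(hb);
    } else {
      idle_by_peer[hb.peer_uuid].push_back(hb.tablet_id);
    }
  }
  for (const auto& entry : idle_by_peer) {
    ServerLeaseRenewal renewal;
    renewal.peer_uuid = entry.first;
    renewal.tablet_ids = entry.second;
    plan.per_server.push_back(renewal);
  }
  return plan;
}
```

In this sketch the number of idle messages scales with the number of destination servers rather than the number of tablets, which is the saving the comment points at.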

> Prevent follower from causing pre-elections when UpdateConsensus is slow
> 
>
> Key: KUDU-2452
> URL: https://issues.apache.org/jira/browse/KUDU-2452
> Project: Kudu
> Issue Type: Improvement
> Affects Versions: 1.7.0
> Reporter: William Berkeley
> Priority: Major
> Labels: stability
>
> Thanks to pre-elections (KUDU-1365), slow UpdateConsensus calls on a single 
> follower don't disturb the whole tablet by calling elections. However, 
> sometimes I see situations where one or more followers are constantly calling 
> pre-elections, and only rarely, if ever, overflowing their service queues. 
> Occasionally, in 3x replicated tablets, the followers will get "lucky" and 
> detect a leader failure at around the same time, and an election will happen.
> This background instability has caused bugs like KUDU-2343 that should be 
> rare to occur pretty frequently, plus the extra RequestConsensusVote RPCs add 
> a little more stress on the consensus service and on replicas' consensus 
> locks. It also spams the logs, since there's generally no exponential 
> backoff for these pre-elections because there's a successful heartbeat in 
> between them.
> It seems like we can get into the situation where the average number of 
> in-flight consensus requests is constant over time, so on average we are 
> processing each heartbeat in less than the heartbeat interval, however some 
> heartbeats take longer. Since UpdateConsensus calls to a replica are 
> serialized, a few of these in a row trigger the failure detector, despite the 
> follower receiving every heartbeat in a timely manner and responding 
> successfully eventually (and on average in a timely manner).
> It'd be nice to prevent these worthless pre-elections. A couple of ideas:
> 1. Separately calculate a backoff for failed pre-elections, and reset it when 
> a pre-election succeeds or more generally when there's an election.
> 2. Don't count the time the follower is executing UpdateConsensus against the 
> failure detector. [~mpercy] suggested stopping the failure detector during 
> UpdateReplica() and resuming it when the function returns.
> 3. Move leader failure detection out-of-band of UpdateConsensus entirely.
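
Ideas 1 and 2 above can be sketched as two small classes. This is a hedged illustration only: PreElectionBackoff and PausableFailureDetector are hypothetical names, time is passed in explicitly as milliseconds to keep the example deterministic, and none of this mirrors Kudu's real failure detector or election code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Idea 1 (hypothetical): a backoff tracked separately for failed
// pre-elections, reset when a pre-election succeeds or an election happens.
class PreElectionBackoff {
 public:
  PreElectionBackoff(int64_t base_ms, int64_t max_ms)
      : base_ms_(base_ms), max_ms_(max_ms) {
    assert(base_ms > 0 && max_ms >= base_ms);
  }
  void OnPreElectionFailed() { ++failures_; }
  void OnPreElectionSucceededOrElection() { failures_ = 0; }
  // Delay before the next pre-election: base * 2^failures, capped at max.
  int64_t NextDelayMs() const {
    int64_t delay = base_ms_;
    for (int i = 0; i < failures_ && delay < max_ms_; ++i) delay *= 2;
    return std::min(delay, max_ms_);
  }

 private:
  int64_t base_ms_;
  int64_t max_ms_;
  int failures_ = 0;
};

// Idea 2 (hypothetical): a failure detector that does not count time spent
// while paused, e.g. while the follower is executing UpdateConsensus.
class PausableFailureDetector {
 public:
  explicit PausableFailureDetector(int64_t timeout_ms) : timeout_ms_(timeout_ms) {}
  void OnHeartbeat(int64_t now_ms) {
    last_heartbeat_ms_ = now_ms;
    paused_total_ms_ = 0;
    if (paused_) pause_start_ms_ = now_ms;  // only count the pause from here on
  }
  void Pause(int64_t now_ms) {
    if (!paused_) { paused_ = true; pause_start_ms_ = now_ms; }
  }
  void Resume(int64_t now_ms) {
    if (paused_) { paused_total_ms_ += now_ms - pause_start_ms_; paused_ = false; }
  }
  // True if the unpaused time since the last heartbeat exceeds the timeout.
  bool LeaderSuspected(int64_t now_ms) const {
    if (paused_) return false;
    return now_ms - last_heartbeat_ms_ - paused_total_ms_ > timeout_ms_;
  }

 private:
  int64_t timeout_ms_;
  int64_t last_heartbeat_ms_ = 0;
  int64_t pause_start_ms_ = 0;
  int64_t paused_total_ms_ = 0;
  bool paused_ = false;
};
```

A caller would wrap the UpdateConsensus handler in Pause()/Resume() and consult NextDelayMs() before starting another pre-election; whether either approach is appropriate for Kudu is exactly what the issue leaves open.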


