[jira] [Updated] (KUDU-1788) Raft UpdateConsensus retry behavior on timeout is counter-productive

Jean-Daniel Cryans (JIRA) Fri, 25 Aug 2017 14:27:26 -0700

     [ 
https://issues.apache.org/jira/browse/KUDU-1788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Jean-Daniel Cryans updated KUDU-1788:
-------------------------------------
    Target Version/s: 1.6.0  (was: 1.5.0)

> Raft UpdateConsensus retry behavior on timeout is counter-productive
> --------------------------------------------------------------------
>
>                 Key: KUDU-1788
>                 URL: https://issues.apache.org/jira/browse/KUDU-1788
>             Project: Kudu
>          Issue Type: Bug
>          Components: consensus
>    Affects Versions: 1.1.0
>            Reporter: Todd Lipcon
>            Priority: Critical
>
> In a stress test, I've seen the following counter-productive behavior:
> - a leader is trying to send operations to a replica (e.g. a 10MB batch)
> - the network is constrained due to other activity, so sending 10MB may take 
> more than 1 second
> - the request times out on the client side, likely while it was still in the 
> process of sending the batch
> - by the time the server receives the request, it is likely to have timed 
> out while waiting in the queue. Or, the server will process it and find that 
> all of its ops are duplicates from the previous attempt
> - the client has no idea whether the server received it or not, and thus 
> keeps retrying the same batch (triggering the same timeout)
> This tends to be a "sticky"/cascading sort of state: after one such timeout, 
> the follower will be lagging behind more, and the next batch will be larger 
> (up to the configured max batch size). The client neither backs off nor 
> increases its timeout, so it will basically just keep the network pipe full 
> of useless redundant updates.
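
The missing behavior described above (backing off and increasing the timeout
after each consecutive timeout) could be sketched roughly as follows. This is
only an illustration, not Kudu's code or the fix chosen for this issue:
RetryPolicy, its fields, and all default values are hypothetical names and
numbers picked for the example. Shrinking the batch after a timeout is an
extra assumption that the description does not propose.

```python
from dataclasses import dataclass


@dataclass
class RetryPolicy:
    """Hypothetical per-follower retry policy; all defaults are illustrative."""

    base_timeout_s: float = 1.0        # timeout used when no timeouts have occurred
    max_timeout_s: float = 30.0        # cap on the grown timeout
    base_backoff_s: float = 0.1        # wait before the first retry
    max_backoff_s: float = 5.0         # cap on the wait between retries
    max_batch_bytes: int = 10 * 1024 * 1024  # e.g. the 10MB batch from the report
    min_batch_bytes: int = 64 * 1024   # never shrink the batch below this

    def timeout_s(self, consecutive_timeouts: int) -> float:
        # Double the RPC timeout for each consecutive timeout, up to a cap,
        # so a slow link eventually gets enough time to deliver the batch.
        return min(self.max_timeout_s,
                   self.base_timeout_s * 2 ** consecutive_timeouts)

    def backoff_s(self, consecutive_timeouts: int) -> float:
        # No wait on the first attempt; exponential wait afterwards, so the
        # sender stops filling the pipe with redundant copies of the batch.
        if consecutive_timeouts == 0:
            return 0.0
        return min(self.max_backoff_s,
                   self.base_backoff_s * 2 ** (consecutive_timeouts - 1))

    def batch_bytes(self, consecutive_timeouts: int) -> int:
        # Assumption beyond the report: halve the batch after each timeout
        # so the next attempt is more likely to fit within the timeout.
        return max(self.min_batch_bytes,
                   self.max_batch_bytes >> consecutive_timeouts)
```

With these illustrative defaults, a sender that has seen three consecutive
timeouts would wait 0.4 seconds, then retry with an 8 second timeout and a
batch one eighth of the maximum size, instead of resending the same 10MB
batch with the same 1 second timeout.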



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
