[ https://issues.apache.org/jira/browse/KUDU-1788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jean-Daniel Cryans updated KUDU-1788:
-------------------------------------
    Target Version/s: 1.6.0  (was: 1.5.0)

> Raft UpdateConsensus retry behavior on timeout is counter-productive
> --------------------------------------------------------------------
>
>                 Key: KUDU-1788
>                 URL: https://issues.apache.org/jira/browse/KUDU-1788
>             Project: Kudu
>          Issue Type: Bug
>          Components: consensus
>    Affects Versions: 1.1.0
>            Reporter: Todd Lipcon
>            Priority: Critical
>
> In a stress test, I've seen the following counter-productive behavior:
> - a leader is trying to send operations to a replica (e.g. a 10MB batch)
> - the network is constrained due to other activity, so sending 10MB may
>   take >1sec
> - the request times out on the client side, likely while it was still in
>   the process of sending the batch
> - by the time the server receives the request, it is likely to have
>   already timed out while waiting in the queue. Or, the server will
>   accept it and, upon processing, find that all of its ops are duplicates
>   from the previous attempt
> - the client has no idea whether the server received it or not, and thus
>   keeps retrying the same batch (triggering the same timeout)
>
> This tends to be a "sticky"/cascading sort of state: after one such
> timeout, the follower will be lagging behind more, and the next batch will
> be larger (up to the configured max batch size). The client neither backs
> off nor increases its timeout, so it will basically just keep the network
> pipe full of useless redundant updates.
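
The last point (no backoff, no timeout increase) can be illustrated with a minimal sketch. This is not Kudu's code and not a proposed patch: it is written in Python for brevity, and `RetryPolicy`, `send_with_retries`, the `send` callback, and every default value are hypothetical names and numbers chosen for illustration. It only shows the general idea of pausing between attempts while letting the per-attempt timeout grow, so that a large batch on a slow link eventually gets enough time to complete.

```python
import time
from dataclasses import dataclass


@dataclass
class RetryPolicy:
    """Hypothetical retry policy; names and defaults are illustrative only."""
    base_timeout_s: float = 1.0    # timeout for the first attempt
    max_timeout_s: float = 30.0    # cap on the per-attempt timeout
    base_backoff_s: float = 0.1    # pause before the first retry
    max_backoff_s: float = 5.0     # cap on the pause between attempts
    factor: float = 2.0            # growth factor per consecutive timeout

    def timeout_for(self, consecutive_timeouts: int) -> float:
        """Per-attempt timeout after the given number of consecutive timeouts."""
        grown = self.base_timeout_s * self.factor ** consecutive_timeouts
        return min(self.max_timeout_s, grown)

    def backoff_for(self, consecutive_timeouts: int) -> float:
        """Pause before the next attempt; zero before the very first one."""
        if consecutive_timeouts == 0:
            return 0.0
        grown = self.base_backoff_s * self.factor ** (consecutive_timeouts - 1)
        return min(self.max_backoff_s, grown)


def send_with_retries(send, batch, policy, max_attempts=5, sleep=time.sleep):
    """Try to deliver `batch`, backing off and raising the timeout each time.

    `send(batch, timeout_s)` is an assumed callback that returns True when the
    peer acknowledged the batch and False when the attempt timed out.
    Returns the number of attempts used, or None if every attempt timed out.
    """
    for attempt in range(max_attempts):
        pause = policy.backoff_for(attempt)
        if pause > 0:
            sleep(pause)  # let the congested link drain before resending
        if send(batch, policy.timeout_for(attempt)):
            return attempt + 1
    return None
```

With these defaults the successive timeouts are 1s, 2s, 4s and so on up to 30s, and the pauses before retries are 0.1s, 0.2s, 0.4s and so on up to 5s. A real fix would also have to decide how this interacts with the growing batch size described above, which the sketch does not attempt.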