[ https://issues.apache.org/jira/browse/KUDU-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Todd Lipcon reassigned KUDU-1365:
---------------------------------

    Assignee: Todd Lipcon  (was: Mike Percy)

> Leader flapping when one machine has a very slow disk
> -----------------------------------------------------
>
>                 Key: KUDU-1365
>                 URL: https://issues.apache.org/jira/browse/KUDU-1365
>             Project: Kudu
>          Issue Type: Bug
>          Components: consensus
>    Affects Versions: 0.7.0
>            Reporter: Mike Percy
>            Assignee: Todd Lipcon
>
> There is an issue that [~decster] ran into in production where a machine had
> a bad disk. Unfortunately, the disk was not failing writes outright; it was
> just very slow. This caused the UpdateConsensus RPC queue to fill up on that
> machine while it was a follower, which looked like this in the log (from the
> leader's perspective):
> {code}
> W0229 00:07:14.332468 18148 consensus_peers.cc:316] T
> 41e637c3c2b34e8db36d138c4d37d032 P 6434970484be4d29855f05e1f6aed1b8 -> Peer
> 7f331507718d477f96d60eb1bc573baa (st128:18700): Couldn't send request to peer
> 7f331507718d477f96d60eb1bc573baa for tablet 41e637c3c2b34e8db36d138c4d37d032
> Status: Remote error: Service unavailable: UpdateConsensus request on
> kudu.consensus.ConsensusService dropped due to backpressure. The service
> queue is full; it has 50 items.. Retrying in the next heartbeat period.
> Already tried 25 times.
> {code}
> As a result, the follower could no longer receive heartbeat messages from
> the leader. The follower (st128) would decide that the leader was dead and
> start an election. Because it had as much data as the rest of the cluster,
> it won the election. Then, for reasons we still need to investigate, it
> would not heartbeat to its own followers. After some timeout, a different
> node (typically the previous leader) would start an election, get elected,
> and the flapping cycle would continue.
> It's possible that the bad node, when it became leader, had only partially
> transitioned to leadership and was blocking on some disk operation before it
> started to heartbeat. Hopefully we can get logs from the bad node so we can
> better understand what was happening from its perspective.
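For illustration, here is a minimal, self-contained sketch of the backpressure mechanism described above. This is not Kudu's actual implementation: only the 50-item queue capacity is taken from the log message, all class and function names (BoundedServiceQueue, TryPut, Take) are invented, and the timings are scaled down so the effect shows up in a couple of seconds. Inbound UpdateConsensus calls land in a bounded service queue, handler threads drain it, and each handler blocks on a simulated slow log write; once the queue is full, further requests, heartbeats included, are rejected immediately.

{code}
#include <chrono>
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <queue>
#include <string>
#include <thread>

constexpr size_t kQueueCapacity = 50;  // "The service queue is full; it has 50 items."

class BoundedServiceQueue {
 public:
  // Rejects (returns false) instead of blocking when the queue is full;
  // this rejection is what the leader logs as "dropped due to backpressure".
  bool TryPut(std::string req) {
    std::lock_guard<std::mutex> l(mu_);
    if (q_.size() >= kQueueCapacity) return false;
    q_.push(std::move(req));
    cv_.notify_one();
    return true;
  }

  std::string Take() {
    std::unique_lock<std::mutex> l(mu_);
    cv_.wait(l, [this] { return !q_.empty(); });
    std::string req = std::move(q_.front());
    q_.pop();
    return req;
  }

 private:
  std::mutex mu_;
  std::condition_variable cv_;
  std::queue<std::string> q_;
};

int main() {
  BoundedServiceQueue queue;

  // One handler thread stands in for the service pool; the 100ms sleep
  // stands in for a log append on the slow disk.
  std::thread handler([&queue] {
    for (;;) {
      queue.Take();
      std::this_thread::sleep_for(std::chrono::milliseconds(100));
    }
  });
  handler.detach();

  // The "leader" sends a request every 10ms while the handler drains one
  // per 100ms, so the queue fills within a second and everything after
  // that is dropped.
  int dropped = 0;
  for (int i = 0; i < 300; ++i) {
    if (!queue.TryPut("UpdateConsensus " + std::to_string(i))) {
      ++dropped;
    }
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
  }
  std::printf("%d of 300 requests dropped due to backpressure\n", dropped);
  return 0;
}
{code}

Note that rejecting new work rather than blocking the sender is the point of such a queue, but heartbeats get no special treatment: once the handlers back up on the disk, nothing from the leader gets through.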
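An equally hypothetical sketch of the follower-side failure detector closes the loop. In the sketch below, Snooze() is only called for heartbeats that are actually processed; a heartbeat rejected at the full service queue never reaches it, so the timer expires and the follower starts an election even though the leader is alive. The 1.5s timeout and all names here are invented stand-ins for (heartbeat interval) x (max missed heartbeats).

{code}
#include <chrono>
#include <cstdio>
#include <thread>

using Clock = std::chrono::steady_clock;

class LeaderFailureDetector {
 public:
  explicit LeaderFailureDetector(std::chrono::milliseconds timeout)
      : timeout_(timeout), last_heartbeat_(Clock::now()) {}

  // Called when an UpdateConsensus heartbeat is successfully handled.
  void Snooze() { last_heartbeat_ = Clock::now(); }

  // Polled periodically; true means "assume the leader is dead, run".
  bool LeaderPresumedDead() const {
    return Clock::now() - last_heartbeat_ > timeout_;
  }

 private:
  const std::chrono::milliseconds timeout_;
  Clock::time_point last_heartbeat_;
};

int main() {
  LeaderFailureDetector detector(std::chrono::milliseconds(1500));

  // Simulate the bad node: the leader is heartbeating, but every heartbeat
  // is dropped at the full service queue, so Snooze() is never invoked.
  std::this_thread::sleep_for(std::chrono::seconds(2));

  if (detector.LeaderPresumedDead()) {
    std::printf("no heartbeat within timeout: starting election\n");
  }
  return 0;
}
{code}

This also suggests why the flapping repeats: once the bad node wins an election and then stalls on its own disk before heartbeating, the same timer fires on the healthy nodes, and a different node (typically the previous leader) gets elected again.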