[ https://issues.apache.org/jira/browse/ZOOKEEPER-3701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andor Molnar reassigned ZOOKEEPER-3701:
---------------------------------------

    Assignee: Andor Molnar

> Split brain on log disk full
> ----------------------------
>
>                 Key: ZOOKEEPER-3701
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3701
>             Project: ZooKeeper
>          Issue Type: Bug
>    Affects Versions: 3.4.13, 3.5.6
>            Reporter: Ivan Kelly
>            Assignee: Andor Molnar
>            Priority: Blocker
>
> We ran into a situation where the cluster ended up with split brain when the log disk filled up on a node.
>
> The ZK cluster (3 nodes) in question was being used as the metadata store for Pulsar. There was an outage in the Pulsar cluster, in which two of the ZK nodes filled up their log disks, causing the cluster to lose quorum. Once we rectified the full-disk situation and restarted the nodes, everything seemed to work, but we started getting a lot of log messages about UpdateMetadataLoop retrying. UpdateMetadataLoop is used to update BookKeeper ledger metadata: if it sees a write conflict, it rereads the znode, checks whether the update still needs to happen, applies it and writes. These retries were flooding the logs on a subset of the brokers. It turned out that UpdateMetadataLoop was reading a znode with version 0, but when it tried setData with the version set to 0, the write failed because the znode had a version of 2 (there were many instances of this). After investigating, we saw that the znode had a different stat and value on ZK-1 than on ZK-0 & ZK-2.
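> To make the retry pattern concrete, here is a minimal sketch of a versioned read-modify-write loop against ZooKeeper. It is a hypothetical illustration, not the actual Pulsar code: the MetadataUpdater class and update method are made-up names, though getData/setData are the standard ZooKeeper client API.
> {code}
> import java.util.Arrays;
>
> import org.apache.zookeeper.KeeperException;
> import org.apache.zookeeper.ZooKeeper;
> import org.apache.zookeeper.data.Stat;
>
> // Versioned read-modify-write against a znode. setData(path, data, version)
> // succeeds only if the znode's current version matches the one passed in;
> // otherwise it throws BadVersionException and the loop rereads. On ZK-1 the
> // read kept returning version 0 while the quorum's real version was 2, so
> // the loop could never terminate, hence the flood of retry messages.
> class MetadataUpdater {
>     private final ZooKeeper zk;
>
>     MetadataUpdater(ZooKeeper zk) {
>         this.zk = zk;
>     }
>
>     void update(String path, byte[] newData) throws KeeperException, InterruptedException {
>         while (true) {
>             Stat stat = new Stat();
>             byte[] current = zk.getData(path, false, stat); // reread znode and its version
>             if (Arrays.equals(current, newData)) {
>                 return; // update already applied, nothing to do
>             }
>             try {
>                 zk.setData(path, newData, stat.getVersion()); // conditional write
>                 return;
>             } catch (KeeperException.BadVersionException e) {
>                 // write conflict: someone else updated the znode; reread and retry
>             }
>         }
>     }
> }
> {code}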
> We resolved the situation by deleting the logs and snapshots from ZK-1 and restarting, at which point everything went back to normal. Had ZK-1 managed to become leader, we would have been in a lot of trouble, but thankfully this didn't happen.
>
> For the sequence of events that led to the split brain, I'll refer to the following code.
> {code}
> public class FileTxnSnapLog {
>     ...
>     public boolean truncateLog(long zxid) throws IOException {
>         // close the existing txnLog and snapLog
>         close();
>
>         // truncate it
>         FileTxnLog truncLog = new FileTxnLog(dataDir);
>         boolean truncated = truncLog.truncate(zxid);
>         truncLog.close();
>
>         // re-open the txnLog and snapLog
>         // I'd rather just close/reopen this object itself, however that
>         // would have a big impact outside ZKDatabase as there are other
>         // objects holding a reference to this object.
>         txnLog = new FileTxnLog(dataDir);
>         snapLog = new FileSnap(snapDir);
>
>         return truncated;
>     }
>
>     public void close() throws IOException {
>         txnLog.close();
>         snapLog.close();
>     }
> }
>
> public class FileSnap implements SnapShot {
>     ...
>     public synchronized void serialize(DataTree dt, Map<Long, Integer> sessions, File snapShot)
>             throws IOException {
>         if (!close) {
>             // actual snapshot code
>         }
>     }
>
>     @Override
>     public synchronized void close() throws IOException {
>         close = true;
>     }
> }
> {code}
> The sequence of events that led to the failure is as follows:
> | 2020-01-04 01:56:56Z | ZK-2 fails to write to its transaction log due to disk full. ZK-2 is still participating in leader election. ZK-2 becomes a follower of ZK-1. ZK-1 sends TRUNC to ZK-2. truncLog.truncate on ZK-2 throws an exception because the disk is full, and leaves the process in a broken state. |
> | 2020-01-04 02:35:23Z | ZK-2 removes 9 transaction logs from disk (bringing it from 100% to 19%). It doesn't recover because it's in a broken state. |
> | 2020-01-09 08:57:33Z | ZK-1 fails to write to its transaction log due to disk full. Restarts as follower. Goes into a loop of dropping from the quorum (because it can't update its transaction log). |
> | 2020-01-09 08:59:33Z | ZK-1 receives a snapshot from the leader (ZK-0) (at 1e00000000). ZK-1 persists the snapshot, but fails to add subsequent transactions to the log due to lack of space. ZK-1 drops from the quorum. |
> | 2020-01-09 09:00:12Z | ZK-1 joins the quorum as follower. 1e00000000 is close enough to the leader to receive TRUNC(1d0000001d). TRUNC fails because txnLog can't flush on close() in truncateLog. ZK-1 goes into a loop, dropping from and rejoining the quorum. |
> | 2020-01-09 09:39:00Z | ZK-1 runs purgeTxnLog. The process doesn't recover, because the truncation exception has broken FileTxnSnapLog. |
> | 2020-01-09 19:28:37Z | ZK-1 is restarted. ZK-1 joins the quorum as follower. ZK-1 receives TRUNC(1d0000001d). In this case, txnLog.close() can succeed because there's nothing to flush. snapLog is closed. truncLog.truncate fails with "java.io.IOException: No log files found to truncate! This could happen if you still have snapshots from an old setup or log files were deleted accidentally or dataLogDir was changed in zoo.cfg.". It's true that there are no log files to truncate, because the snapshot is at 1e00000000, which was received from the leader at 08:59, and nothing has been logged since. In any case, FileTxnSnapLog is in another inconsistent state. snapLog is closed. txnLog is closed, but nothing was ever written to it, so it looks brand new. |
> | 2020-01-09 19:29:04Z | ZK-2 is restarted. ZK-2 & ZK-0 are now in a good state, so they can make progress. Transactions start to be logged. |
> | 2020-01-09 19:33:16Z | ZK-1 joins the quorum. As progress has been made, it receives a SNAP from the leader at 6b30001183a. It writes a snapshot, which ultimately calls FileSnap#serialize. Nothing hits the snapshot disk, because FileSnap has been in the closed state since 19:28. ZK-1 starts logging transactions to its log disk. |
> | 2020-01-09 19:42:00Z | We do a rolling restart of the cluster. |
> | 2020-01-09 19:45:11Z | ZK-1 loads the last snapshot that was persisted to disk (1e00000000), and applies all log entries with zxid greater than the snapshot (6b30001183a onwards). |
> | 2020-01-09 19:47:35Z | ZK-2 & ZK-1 form a quorum, ZK-2 leading. ZK-1 reports its lastZxid as 6b30001b32f and gets a DIFF from ZK-2. |
> From this point on, the cluster has split brain. ZK-1 is missing all transactions between 1e00000000 and 6b30001183a.
>
> There are a couple of places where failing fast would have stopped this problem (a rough sketch of the fixes follows at the end of this description):
> - An exception in truncateLog should nuke the process. Even without the split brain occurring, the processes limped on in a broken state for days and required human intervention to get going again.
> - snapLog and txnLog should be defensively nulled after they're closed.
> - FileSnap#serialize should not fail silently if close=true. This is really bad. It should at least throw an exception.
>
> The issue occurred with 3.4.13 running on a Kubernetes cluster. The bad code paths still exist on master.
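> As an illustration of those three fixes (a sketch against the excerpt above, reusing its field names; untested and not a patch), the hardened code could look something like this:
> {code}
> public class FileTxnSnapLog {
>     ...
>     public boolean truncateLog(long zxid) throws IOException {
>         // close the existing txnLog and snapLog
>         close();
>         try {
>             FileTxnLog truncLog = new FileTxnLog(dataDir);
>             boolean truncated = truncLog.truncate(zxid);
>             truncLog.close();
>             txnLog = new FileTxnLog(dataDir);
>             snapLog = new FileSnap(snapDir);
>             return truncated;
>         } catch (IOException ioe) {
>             // A failed truncation leaves the on-disk state unknowable;
>             // dying here is safer than limping on with a broken FileTxnSnapLog.
>             System.err.println("Fatal error truncating log to 0x" + Long.toHexString(zxid));
>             ioe.printStackTrace();
>             System.exit(1);
>             return false; // unreachable
>         }
>     }
>
>     public void close() throws IOException {
>         if (txnLog != null) {
>             txnLog.close();
>             txnLog = null; // any later use fails fast instead of silently no-op-ing
>         }
>         if (snapLog != null) {
>             snapLog.close();
>             snapLog = null;
>         }
>     }
> }
>
> public class FileSnap implements SnapShot {
>     ...
>     public synchronized void serialize(DataTree dt, Map<Long, Integer> sessions, File snapShot)
>             throws IOException {
>         if (close) {
>             // Surface the problem instead of silently writing nothing to disk.
>             throw new IOException("FileSnap is closed, cannot serialize " + snapShot);
>         }
>         // actual snapshot code
>     }
> }
> {code}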