Hi Team,

In one of our environments, during a helm upgrade of our charts (which include the Strimzi Kafka chart), the zookeeper-1 pod went into the CrashLoopBackOff state. zookeeper-0 and zookeeper-2 are running fine. We have never seen this behaviour before.

srini-kf-op-sz-zookeeper-1   0/1   CrashLoopBackOff   146 (66s ago)   24h

When we inspected the zookeeper-1 logs, we observed the following:

...........
...........
...........
2023-10-01 11:19:29,673 INFO Using org.apache.zookeeper.server.watch.WatchManager as watch manager (org.apache.zookeeper.server.watch.WatchManagerFactory) [main]
2023-10-01 11:19:29,673 INFO Using org.apache.zookeeper.server.watch.WatchManager as watch manager (org.apache.zookeeper.server.watch.WatchManagerFactory) [main]
2023-10-01 11:19:29,676 INFO zookeeper.snapshotSizeFactor = 0.33 (org.apache.zookeeper.server.ZKDatabase) [main]
2023-10-01 11:19:29,676 INFO zookeeper.commitLogCount=500 (org.apache.zookeeper.server.ZKDatabase) [main]
2023-10-01 11:19:29,734 INFO Using TLS encrypted quorum communication (org.apache.zookeeper.server.quorum.QuorumPeer) [main]
2023-10-01 11:19:29,734 INFO Port unification disabled (org.apache.zookeeper.server.quorum.QuorumPeer) [main]
2023-10-01 11:19:29,734 INFO multiAddress.enabled set to false (org.apache.zookeeper.server.quorum.QuorumPeer) [main]
2023-10-01 11:19:29,734 INFO multiAddress.reachabilityCheckEnabled set to true (org.apache.zookeeper.server.quorum.QuorumPeer) [main]
2023-10-01 11:19:29,734 INFO multiAddress.reachabilityCheckTimeoutMs set to 1000 (org.apache.zookeeper.server.quorum.QuorumPeer) [main]
2023-10-01 11:19:29,734 INFO QuorumPeer communication is not secured! (SASL auth disabled) (org.apache.zookeeper.server.quorum.QuorumPeer) [main]
2023-10-01 11:19:29,734 INFO quorum.cnxn.threads.size set to 20 (org.apache.zookeeper.server.quorum.QuorumPeer) [main]
2023-10-01 11:19:29,735 INFO Reading snapshot /var/lib/zookeeper/data/version-2/snapshot.2000004d8.tmp (org.apache.zookeeper.server.persistence.FileSnap) [main]
2023-10-01 11:19:29,824 INFO The digest value is empty in snapshot (org.apache.zookeeper.server.DataTree) [main]
2023-10-01 11:19:29,940 INFO Snapshot loaded in 206 ms, highest zxid is 0x2000004d8, digest is 1738427548192 (org.apache.zookeeper.server.ZKDatabase) [main]
2023-10-01 11:19:29,942 ERROR Unable to load database on disk (org.apache.zookeeper.server.quorum.QuorumPeer) [main]
java.io.IOException: The current epoch, 1, is older than the last zxid, 8589935832
    at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:1123)
    at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:1079)
    at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:227)
    at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:136)
    at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:90)
2023-10-01 11:19:29,946 ERROR Unexpected exception, exiting abnormally (org.apache.zookeeper.server.quorum.QuorumPeerMain) [main]
java.lang.RuntimeException: Unable to run quorum server
    at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:1149)
    at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:1079)
    at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:227)
    at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:136)
    at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:90)
Caused by: java.io.IOException: The current epoch, 1, is older than the last zxid, 8589935832
    at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:1123)
    ... 4 more
2023-10-01 11:19:29,948 INFO ZooKeeper audit is disabled. (org.apache.zookeeper.audit.ZKAuditProvider) [main]
2023-10-01 11:19:29,951 ERROR Exiting JVM with code 1 (org.apache.zookeeper.util.ServiceUtils) [main]

Could you please help us with the following questions?

1. What is the reason for this behaviour?
2. This is happening in a pipeline environment. If it is a disk corruption issue (as suggested by what we found on the Internet), how can we ensure that disk corruption does not happen again on the ZooKeeper side?
3. What is the solution for this behaviour?

Thanks,
Srinivas
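
P.S. For reference, the two numbers in the exception can be related with a little arithmetic. A ZooKeeper zxid is a 64-bit value whose upper 32 bits are the leader epoch and whose lower 32 bits are a counter, so the last zxid 8589935832 (logged as 0x2000004d8) carries epoch 2, while the error reports a current epoch of 1. The sketch below only decodes the values from the log above. The helper name split_zxid is made up for illustration, and it does not explain how the mismatch arose.

```python
def split_zxid(zxid: int) -> tuple[int, int]:
    """Split a 64-bit ZooKeeper zxid into (epoch, counter).

    split_zxid is a hypothetical helper for this note, not a ZooKeeper API.
    """
    epoch = zxid >> 32            # upper 32 bits: leader epoch
    counter = zxid & 0xFFFFFFFF   # lower 32 bits: per-epoch transaction counter
    return epoch, counter


# Values copied from the exception message in the log above.
last_zxid = 8589935832
current_epoch = 1

epoch, counter = split_zxid(last_zxid)
print(hex(last_zxid))          # 0x2000004d8, matching "highest zxid" in the log
print(epoch, counter)          # 2 1240
print(epoch > current_epoch)   # True: the snapshot's epoch is ahead of the current epoch
```

So the snapshot on zookeeper-1 records transactions from epoch 2, but the epoch the server considers current is still 1, which appears to be the condition the exception is reporting.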