Hi Team,

In one of our environments, during a helm upgrade of our charts (which include
the Strimzi Kafka chart), the zookeeper-1 pod went into the CrashLoopBackOff
state. zookeeper-0 and zookeeper-2 are running fine. We have never seen this
behaviour until now.

srini-kf-op-sz-zookeeper-1   0/1   CrashLoopBackOff   146 (66s ago)   24h

When we inspected the zookeeper-1 logs, we observed the following:

...........
...........
...........
2023-10-01 11:19:29,673 INFO Using 
org.apache.zookeeper.server.watch.WatchManager as watch manager 
(org.apache.zookeeper.server.watch.WatchManagerFactory) [main]
2023-10-01 11:19:29,673 INFO Using 
org.apache.zookeeper.server.watch.WatchManager as watch manager 
(org.apache.zookeeper.server.watch.WatchManagerFactory) [main]
2023-10-01 11:19:29,676 INFO zookeeper.snapshotSizeFactor = 0.33 
(org.apache.zookeeper.server.ZKDatabase) [main]
2023-10-01 11:19:29,676 INFO zookeeper.commitLogCount=500 
(org.apache.zookeeper.server.ZKDatabase) [main]
2023-10-01 11:19:29,734 INFO Using TLS encrypted quorum communication 
(org.apache.zookeeper.server.quorum.QuorumPeer) [main]
2023-10-01 11:19:29,734 INFO Port unification disabled 
(org.apache.zookeeper.server.quorum.QuorumPeer) [main]
2023-10-01 11:19:29,734 INFO multiAddress.enabled set to false 
(org.apache.zookeeper.server.quorum.QuorumPeer) [main]
2023-10-01 11:19:29,734 INFO multiAddress.reachabilityCheckEnabled set to true 
(org.apache.zookeeper.server.quorum.QuorumPeer) [main]
2023-10-01 11:19:29,734 INFO multiAddress.reachabilityCheckTimeoutMs set to 
1000 (org.apache.zookeeper.server.quorum.QuorumPeer) [main]
2023-10-01 11:19:29,734 INFO QuorumPeer communication is not secured! (SASL 
auth disabled) (org.apache.zookeeper.server.quorum.QuorumPeer) [main]
2023-10-01 11:19:29,734 INFO quorum.cnxn.threads.size set to 20 
(org.apache.zookeeper.server.quorum.QuorumPeer) [main]
2023-10-01 11:19:29,735 INFO Reading snapshot 
/var/lib/zookeeper/data/version-2/snapshot.2000004d8.tmp 
(org.apache.zookeeper.server.persistence.FileSnap) [main]
2023-10-01 11:19:29,824 INFO The digest value is empty in snapshot 
(org.apache.zookeeper.server.DataTree) [main]
2023-10-01 11:19:29,940 INFO Snapshot loaded in 206 ms, highest zxid is 
0x2000004d8, digest is 1738427548192 (org.apache.zookeeper.server.ZKDatabase) 
[main]
2023-10-01 11:19:29,942 ERROR Unable to load database on disk 
(org.apache.zookeeper.server.quorum.QuorumPeer) [main]
java.io.IOException: The current epoch, 1, is older than the last zxid, 8589935832
at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:1123)
at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:1079)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:227)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:136)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:90)
2023-10-01 11:19:29,946 ERROR Unexpected exception, exiting abnormally 
(org.apache.zookeeper.server.quorum.QuorumPeerMain) [main]
java.lang.RuntimeException: Unable to run quorum server
at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:1149)
at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:1079)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:227)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:136)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:90)
Caused by: java.io.IOException: The current epoch, 1, is older than the last zxid, 8589935832
at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:1123)
... 4 more
2023-10-01 11:19:29,948 INFO ZooKeeper audit is disabled. 
(org.apache.zookeeper.audit.ZKAuditProvider) [main]
2023-10-01 11:19:29,951 ERROR Exiting JVM with code 1 
(org.apache.zookeeper.util.ServiceUtils) [main]
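
For reference, the two numbers in the exception can be compared by decoding
the zxid. The sketch below assumes the usual ZooKeeper zxid layout (epoch in
the upper 32 bits, counter in the lower 32 bits); the helper name split_zxid
is made up for illustration and is not part of any ZooKeeper tooling. It only
shows that the loaded snapshot belongs to epoch 2 while the epoch reported as
current is 1. It does not explain how the two got out of sync.

```python
def split_zxid(zxid: int) -> tuple[int, int]:
    """Split a 64-bit zxid into (epoch, counter).

    Assumes the standard layout: high 32 bits = epoch, low 32 bits = counter.
    """
    epoch = zxid >> 32
    counter = zxid & 0xFFFFFFFF
    return epoch, counter


# Values copied from the IOException in the log above.
last_zxid = 8589935832
current_epoch = 1

epoch, counter = split_zxid(last_zxid)
print(hex(last_zxid))          # 0x2000004d8, same as "highest zxid" in the log
print(epoch, counter)          # 2 1240
print(epoch > current_epoch)   # True: the condition the error message describes
```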

Could you please help us with the following questions?

  1.  What is the reason for this behaviour?
  2.  This is happening in a pipeline environment. If it is a disk
corruption issue (based on what we found on the Internet), how can we ensure
that disk corruption does not happen on the ZooKeeper side again?
  3.  What is the solution for this behaviour?

Thanks,
Srinivas