[ https://issues.apache.org/jira/browse/ZOOKEEPER-3975 ]
Xin Chen deleted comment on ZOOKEEPER-3975:
-------------------------------------

was (Author: JIRAUSER298666):
log

> Zookeeper crashes: Unable to load database on disk java.io.IOException: Unreasonable length
> -------------------------------------------------------------------------------------------
>
> Key: ZOOKEEPER-3975
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3975
> Project: ZooKeeper
> Issue Type: Bug
> Components: jute
> Affects Versions: 3.6.2
> Environment: Debian 10 x64
> openjdk version "11.0.8" 2020-07-14
> OpenJDK Runtime Environment (build 11.0.8+10-post-Debian-1deb10u1)
> OpenJDK 64-Bit Server VM (build 11.0.8+10-post-Debian-1deb10u1, mixed mode, sharing)
> Reporter: Diego Lucas Jiménez
> Priority: Critical
>
> After running for a while, the entire cluster (3 zookeeper) crash suddenly,
> all of them logging:
>
> {code:java}
> 2020-10-16 10:37:00,459 [myid:2] - WARN [NIOWorkerThread-4:NIOServerCnxn@373] - Close of session 0x0
> java.io.IOException: ZooKeeperServer not running
>     at org.apache.zookeeper.server.NIOServerCnxn.readLength(NIOServerCnxn.java:544)
>     at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:332)
>     at org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:522)
>     at org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:154)
>     at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>     at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>     at java.base/java.lang.Thread.run(Thread.java:834)
> 2020-10-16 10:37:00,475 [myid:2] - ERROR [QuorumPeer[myid=2](plain=0.0.0.0:2181)(secure=disabled):QuorumPeer@1139] - Unable to load database on disk
> java.io.IOException: Unreasonable length = 5089607
>     at org.apache.jute.BinaryInputArchive.checkLength(BinaryInputArchive.java:166)
>     at org.apache.jute.BinaryInputArchive.readBuffer(BinaryInputArchive.java:127)
>     at org.apache.zookeeper.server.persistence.Util.readTxnBytes(Util.java:159)
>     at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.next(FileTxnLog.java:768)
>     at org.apache.zookeeper.server.persistence.FileTxnSnapLog.fastForwardFromEdits(FileTxnSnapLog.java:352)
>     at org.apache.zookeeper.server.persistence.FileTxnSnapLog.lambda$restore$0(FileTxnSnapLog.java:258)
>     at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:303)
>     at org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:285)
>     at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:1093)
>     at org.apache.zookeeper.server.quorum.QuorumPeer.getLastLoggedZxid(QuorumPeer.java:1249)
>     at org.apache.zookeeper.server.quorum.FastLeaderElection.getInitLastLoggedZxid(FastLeaderElection.java:868)
>     at org.apache.zookeeper.server.quorum.FastLeaderElection.lookForLeader(FastLeaderElection.java:941)
>     at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1428)
> {code}
> Apparently the "corrupted" file appears in all the servers, so no solution
> such as "removing version-2 on the faulty server and letting replicate from a
> healthy one" :(.
> The entire cluster goes down, we have downtime, every-single-day since we
> upgraded from 3.4.9.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)