[ https://issues.apache.org/jira/browse/ZOOKEEPER-3975 ]


    Xin Chen deleted comment on ZOOKEEPER-3975:
    -------------------------------------

was (Author: JIRAUSER298666):
log

> Zookeeper crashes: Unable to load database on disk java.io.IOException: 
> Unreasonable length
> -------------------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-3975
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3975
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: jute
>    Affects Versions: 3.6.2
>         Environment: Debian 10 x64
> openjdk version "11.0.8" 2020-07-14
> OpenJDK Runtime Environment (build 11.0.8+10-post-Debian-1deb10u1)
> OpenJDK 64-Bit Server VM (build 11.0.8+10-post-Debian-1deb10u1, mixed mode, 
> sharing)
>            Reporter: Diego Lucas Jiménez
>            Priority: Critical
>
> After running for a while, the entire cluster (3 ZooKeeper servers) crashes 
> suddenly, with every server logging:
>  
> {code:java}
> 2020-10-16 10:37:00,459 [myid:2] - WARN [NIOWorkerThread-4:NIOServerCnxn@373] - Close of session 0x0
> java.io.IOException: ZooKeeperServer not running
>         at org.apache.zookeeper.server.NIOServerCnxn.readLength(NIOServerCnxn.java:544)
>         at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:332)
>         at org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:522)
>         at org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:154)
>         at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>         at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>         at java.base/java.lang.Thread.run(Thread.java:834)
> 2020-10-16 10:37:00,475 [myid:2] - ERROR [QuorumPeer[myid=2](plain=0.0.0.0:2181)(secure=disabled):QuorumPeer@1139] - Unable to load database on disk
> java.io.IOException: Unreasonable length = 5089607
>         at org.apache.jute.BinaryInputArchive.checkLength(BinaryInputArchive.java:166)
>         at org.apache.jute.BinaryInputArchive.readBuffer(BinaryInputArchive.java:127)
>         at org.apache.zookeeper.server.persistence.Util.readTxnBytes(Util.java:159)
>         at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.next(FileTxnLog.java:768)
>         at org.apache.zookeeper.server.persistence.FileTxnSnapLog.fastForwardFromEdits(FileTxnSnapLog.java:352)
>         at org.apache.zookeeper.server.persistence.FileTxnSnapLog.lambda$restore$0(FileTxnSnapLog.java:258)
>         at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:303)
>         at org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:285)
>         at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:1093)
>         at org.apache.zookeeper.server.quorum.QuorumPeer.getLastLoggedZxid(QuorumPeer.java:1249)
>         at org.apache.zookeeper.server.quorum.FastLeaderElection.getInitLastLoggedZxid(FastLeaderElection.java:868)
>         at org.apache.zookeeper.server.quorum.FastLeaderElection.lookForLeader(FastLeaderElection.java:941)
>         at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1428)
> {code}
> Apparently the "corrupted" file is present on all the servers, so a 
> workaround such as "removing version-2 on the faulty server and letting it 
> replicate from a healthy one" is not an option :(.
> The entire cluster goes down, and we have had downtime every single day 
> since we upgraded from 3.4.9. 
>  
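For readers trying to interpret the `Unreasonable length = 5089607` message in the trace above, here is a minimal sketch of the kind of bound check that `BinaryInputArchive.checkLength` performs. It is not the ZooKeeper source: the class name `JuteLengthCheck` and its method signature are hypothetical, and it assumes the limit comes from the `jute.maxbuffer` system property with a default of 0xfffff bytes, ignoring any extra margin a given release may add.

```java
import java.io.IOException;

public class JuteLengthCheck {
    // Assumed default for the jute.maxbuffer system property (0xfffff = 1048575 bytes).
    static final int DEFAULT_MAX_BUFFER = 0xfffff;

    // Hypothetical, simplified stand-in for the length check in the stack trace:
    // reject negative lengths and lengths above the configured limit.
    static void checkLength(int len, int maxBuffer) throws IOException {
        if (len < 0 || len > maxBuffer) {
            throw new IOException("Unreasonable length = " + len);
        }
    }

    public static void main(String[] args) {
        // Read the limit the same way a JVM flag such as -Djute.maxbuffer=... would set it.
        int maxBuffer = Integer.getInteger("jute.maxbuffer", DEFAULT_MAX_BUFFER);
        int reportedLength = 5089607; // value taken from the log in this issue
        try {
            checkLength(reportedLength, maxBuffer);
            System.out.println("accepted: " + reportedLength + " <= " + maxBuffer);
        } catch (IOException e) {
            System.out.println(e.getMessage() + " (limit " + maxBuffer + ")");
        }
    }
}
```

Under these assumptions, the reported length exceeds the default limit, which would be consistent with a transaction log record larger than the reader is willing to accept. The report does not establish whether the record is actually corrupted or simply larger than the configured limit.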



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
