I recently lost quorum on a ZooKeeper 3.8.3 ensemble when several of the 
nodes started reporting “Unreasonable length” while trying to load the snapshot 
from disk (see the stack trace at the bottom of this message). This brought down 
the service and our application.

Searching for the error online, I came across the suggestion to increase the 
jute.maxbuffer size. Since that was a simple thing to try, we bumped it from 
the default up to 64MB and restarted ZooKeeper. This worked and brought the 
quorum back online.

According to the admin guide, jute.maxbuffer is supposed to control the 
maximum znode size (with a default of 1MB). In our application, znodes are a 
few kilobytes at most and never come close to that 1MB default.

Note that I have had to increase the jute.maxbuffer size on our CLIENTS (also 
to 64MB), because we were hitting buffer size limitations while just trying to 
list all the child znodes in a large tree (we have millions of znodes in a 
quorum). I do see in the documentation that the recommendation is to set the 
client and server sides to the same value to ensure there aren’t any issues 
when dealing with large znodes, but to reiterate, we don’t have large 
individual znodes.

It therefore seems that the scope of the jute.maxbuffer setting extends beyond 
just controlling the size of individual znodes.
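
To illustrate why a child listing can hit the limit even when every znode is tiny, here is a rough back-of-the-envelope sketch. It is not ZooKeeper code. It assumes that the reply to a child listing is serialized as a 16-byte header plus a vector of strings (a 4-byte count, then a 4-byte length and the UTF-8 bytes for each name), and the parent with 100,000 children is hypothetical.

```python
# Rough estimate of the serialized size of a child-listing reply.
# Assumptions (my reading of the wire format, not taken from this thread):
#   - reply header = xid (4 bytes) + zxid (8 bytes) + err (4 bytes)
#   - vector of strings = 4-byte count, then per string a 4-byte length
#     followed by the UTF-8 bytes of the name

DEFAULT_MAX_BUFFER = 0xFFFFF  # 1,048,575 bytes, the documented default


def children_reply_size(child_names):
    """Return the estimated reply size in bytes for the given child names."""
    header = 4 + 8 + 4
    body = 4  # element count of the vector
    for name in child_names:
        body += 4 + len(name.encode("utf-8"))
    return header + body


# Hypothetical parent znode with 100,000 children and 15-character names.
names = [f"node-{i:010d}" for i in range(100_000)]
size = children_reply_size(names)
print(size, size > DEFAULT_MAX_BUFFER)
```

Under these assumptions the reply is about 1.9MB, well past the default, even though no znode holds any data at all.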

The exception below was encountered without any client involved. This quorum 
has been running for several years and only started experiencing this problem 
yesterday, when several nodes were suddenly unable to read their own snapshots. 
(As a sanity check to see if the snapshot file was corrupted, I used the 
snapshot tool to extract the data as JSON, around 500MB.)
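
For reference, this is my mental model of the check that produces the error, written as a small sketch rather than ZooKeeper's actual implementation. It assumes the limit is jute.maxbuffer plus a small extra allowance (1024 bytes here, which is an assumption) and that the same check applies to any length-prefixed buffer, including records read back from disk. The frames in the trace below (readTxnBytes, FileTxnIterator.next) suggest the failing read was in the transaction log replay path.

```python
# Sketch of a length bound check like the one behind "Unreasonable length".
# Assumptions: limit = jute.maxbuffer + a small extra allowance; the value
# of the allowance and the function name are illustrative only.

DEFAULT_MAX_BUFFER = 0xFFFFF  # default jute.maxbuffer (just under 1MB)
EXTRA_ALLOWANCE = 1024        # assumed extra headroom


def check_length(length, max_buffer=DEFAULT_MAX_BUFFER, extra=EXTRA_ALLOWANCE):
    """Raise if a length prefix is negative or exceeds the configured limit."""
    if length < 0 or length > max_buffer + extra:
        raise IOError(f"Unreasonable length = {length}")
    return length


observed = 2164400  # length reported in the exception in this message
for limit in (DEFAULT_MAX_BUFFER, 64 * 1024 * 1024):
    try:
        check_length(observed, max_buffer=limit)
        print(limit, "accepted")
    except IOError as exc:
        print(limit, "rejected:", exc)
```

With the default limit the 2,164,400-byte record is rejected, and with 64MB it is accepted, which matches what we saw after the restart.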

Can anyone provide more information as to how this setting impacts the server 
and this particular behavior, and possibly make some best practice 
recommendations?

Thanks!

/Ryan

Exception sample:
2025-04-25 06:50:28,808 [QuorumPeer[myid=12](plain=disabled)(secure=[0:0:0:0:0:0:0:0]:2281)] INFO  server.DataTree - The digest in the snapshot has digest version of 2, with zxid as 0xb7000139cd, and digest value as 14144770427753881
2025-04-25 06:50:28,913 [QuorumPeer[myid=12](plain=disabled)(secure=[0:0:0:0:0:0:0:0]:2281)] ERROR quorum.QuorumPeer - Unable to load database on disk
java.io.IOException: Unreasonable length = 2164400
        at org.apache.jute.BinaryInputArchive.checkLength(BinaryInputArchive.java:166) ~[zookeeper-jute-3.8.3.jar:3.8.3]
        at org.apache.jute.BinaryInputArchive.readBuffer(BinaryInputArchive.java:127) ~[zookeeper-jute-3.8.3.jar:3.8.3]
        at org.apache.zookeeper.server.persistence.Util.readTxnBytes(Util.java:159) ~[mdtzookeeper.dist.jar:3.8.3]
        at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.next(FileTxnLog.java:750) ~[mdtzookeeper.dist.jar:3.8.3]
        at org.apache.zookeeper.server.persistence.FileTxnSnapLog.fastForwardFromEdits(FileTxnSnapLog.java:361) ~[mdtzookeeper.dist.jar:3.8.3]
        at org.apache.zookeeper.server.persistence.FileTxnSnapLog.lambda$restore$0(FileTxnSnapLog.java:267) ~[mdtzookeeper.dist.jar:3.8.3]
        at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:312) ~[mdtzookeeper.dist.jar:3.8.3]
        at org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:285) ~[mdtzookeeper.dist.jar:3.8.3]
        at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:1146) [mdtzookeeper.dist.jar:3.8.3]
        at org.apache.zookeeper.server.quorum.QuorumPeer.getLastLoggedZxid(QuorumPeer.java:1326) [mdtzookeeper.dist.jar:3.8.3]
        at org.apache.zookeeper.server.quorum.FastLeaderElection.getInitLastLoggedZxid(FastLeaderElection.java:870) [mdtzookeeper.dist.jar:3.8.3]
        at org.apache.zookeeper.server.quorum.FastLeaderElection.lookForLeader(FastLeaderElection.java:943) [mdtzookeeper.dist.jar:3.8.3]
        at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1509) [mdtzookeeper.dist.jar:3.8.3]
