I recently lost quorum on a ZooKeeper 3.8.3 ensemble when several of the
nodes started reporting “Unreasonable length” while trying to load the snapshot
from disk (see the stack trace at the bottom of this message). This brought down
the service and our application.
Searching for the error online, I came across the suggestion to increase the
jute.maxbuffer size. Since that was a simple thing to try, we bumped it from
the default up to 64MB and restarted ZooKeeper. This worked and brought the
quorum back online.
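To make the numbers concrete, here is a minimal sketch that compares the length reported in the exception at the bottom of this message (2164400) against the old and new limits, and prints the JVM flag for the 64MB value. The class and method names are hypothetical, the default is taken as 0xfffff (which I believe is the documented value, just under 1MB), and the plain comparison is only a simplified stand-in for whatever BinaryInputArchive.checkLength actually does.

```java
public class JuteMaxBufferCheck {
    // Assumed default for jute.maxbuffer: 0xfffff bytes, just under 1 MB.
    static final int DEFAULT_MAX_BUFFER = 0xfffff;

    // The value we raised it to: 64 MB expressed in bytes.
    static final int RAISED_MAX_BUFFER = 64 * 1024 * 1024;

    // Simplified stand-in for the length check (an assumption, not the
    // real ZooKeeper code): a record fits if it is non-negative and no
    // larger than the configured limit.
    static boolean fits(int length, int maxBuffer) {
        return length >= 0 && length <= maxBuffer;
    }

    // jute.maxbuffer is a Java system property, so it is passed as a -D flag.
    static String jvmFlag(int maxBuffer) {
        return "-Djute.maxbuffer=" + maxBuffer;
    }

    public static void main(String[] args) {
        int reported = 2164400; // length from the "Unreasonable length" message

        System.out.println("fits default limit: " + fits(reported, DEFAULT_MAX_BUFFER));
        System.out.println("fits raised limit:  " + fits(reported, RAISED_MAX_BUFFER));
        System.out.println("JVM flag:           " + jvmFlag(RAISED_MAX_BUFFER));
    }
}
```

Under these assumptions the reported record is roughly twice the default limit, which is consistent with the restart succeeding once the limit was raised.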
From the admin guide, increasing jute.maxbuffer is supposed to change the
maximum znode size (with a default of 1MB). In our application, znodes are a
few kilobytes at most, and never come close to that 1MB default.
Note that I have had to increase the jute.maxbuffer size on our CLIENTS (also
to 64MB), because we were hitting buffer size limitations while just trying to
list all the child znodes in a large tree (we have millions of znodes in a
quorum). I do see in the documentation that the recommendation is to set the
client and server sides to the same value to ensure there aren’t any issues
when dealing with large znodes, but to reiterate, we don’t have large
individual znodes.
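As a rough illustration of how a child listing can exceed the buffer even when every znode is tiny, here is a back-of-the-envelope sketch. It assumes the list of child names is serialized as a 4-byte count followed by, for each name, a 4-byte length prefix and the UTF-8 bytes; it ignores any response headers. The class name, method names, and the "node-" naming scheme are hypothetical and only serve the estimate.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class ChildListSizeEstimate {
    // Estimated serialized size of a list of child names, assuming a
    // 4-byte element count plus (4-byte length + UTF-8 bytes) per name.
    static long estimateBytes(List<String> childNames) {
        long total = 4; // element count
        for (String name : childNames) {
            total += 4 + name.getBytes(StandardCharsets.UTF_8).length;
        }
        return total;
    }

    // Hypothetical child names such as "node-0000000042" (15 bytes each).
    static List<String> sampleNames(int count) {
        List<String> names = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            names.add(String.format("node-%010d", i));
        }
        return names;
    }

    public static void main(String[] args) {
        int assumedDefaultLimit = 0xfffff; // just under 1 MB
        long estimate = estimateBytes(sampleNames(100_000));

        System.out.println("estimated bytes for 100000 children: " + estimate);
        System.out.println("exceeds assumed default limit: " + (estimate > assumedDefaultLimit));
    }
}
```

With 15-byte names, 100,000 children already come to about 1.9MB under these assumptions, so a single parent with many children can exceed the default without any individual znode being large.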
It therefore seems that the scope of the jute.maxbuffer setting extends beyond
just controlling the size of individual znodes.
The exception below was encountered without any client involved. This quorum
has been running for several years and only started experiencing this problem
yesterday, when several nodes were suddenly unable to read their own snapshots
(I did use the snapshot tool to extract the data as JSON, around 500MB, as a
sanity check to see if the snapshot file was corrupted).
Can anyone provide more information as to how this setting impacts the server
and this particular behavior, and possibly make some best practice
recommendations?
Thanks!
/Ryan
Exception sample:
2025-04-25 06:50:28,808 [QuorumPeer[myid=12](plain=disabled)(secure=[0:0:0:0:0:0:0:0]:2281)] INFO server.DataTree - The digest in the snapshot has digest version of 2, with zxid as 0xb7000139cd, and digest value as 14144770427753881
2025-04-25 06:50:28,913 [QuorumPeer[myid=12](plain=disabled)(secure=[0:0:0:0:0:0:0:0]:2281)] ERROR quorum.QuorumPeer - Unable to load database on disk
java.io.IOException: Unreasonable length = 2164400
    at org.apache.jute.BinaryInputArchive.checkLength(BinaryInputArchive.java:166) ~[zookeeper-jute-3.8.3.jar:3.8.3]
    at org.apache.jute.BinaryInputArchive.readBuffer(BinaryInputArchive.java:127) ~[zookeeper-jute-3.8.3.jar:3.8.3]
    at org.apache.zookeeper.server.persistence.Util.readTxnBytes(Util.java:159) ~[mdtzookeeper.dist.jar:3.8.3]
    at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.next(FileTxnLog.java:750) ~[mdtzookeeper.dist.jar:3.8.3]
    at org.apache.zookeeper.server.persistence.FileTxnSnapLog.fastForwardFromEdits(FileTxnSnapLog.java:361) ~[mdtzookeeper.dist.jar:3.8.3]
    at org.apache.zookeeper.server.persistence.FileTxnSnapLog.lambda$restore$0(FileTxnSnapLog.java:267) ~[mdtzookeeper.dist.jar:3.8.3]
    at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:312) ~[mdtzookeeper.dist.jar:3.8.3]
    at org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:285) ~[mdtzookeeper.dist.jar:3.8.3]
    at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:1146) [mdtzookeeper.dist.jar:3.8.3]
    at org.apache.zookeeper.server.quorum.QuorumPeer.getLastLoggedZxid(QuorumPeer.java:1326) [mdtzookeeper.dist.jar:3.8.3]
    at org.apache.zookeeper.server.quorum.FastLeaderElection.getInitLastLoggedZxid(FastLeaderElection.java:870) [mdtzookeeper.dist.jar:3.8.3]
    at org.apache.zookeeper.server.quorum.FastLeaderElection.lookForLeader(FastLeaderElection.java:943) [mdtzookeeper.dist.jar:3.8.3]
    at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1509) [mdtzookeeper.dist.jar:3.8.3]