I have been seeing an issue in my KRaft-based, five-broker Kafka 3.4.0 clusters, where broker 2 reports something like the following (full logs below): *"The latest computed high watermark 16691349 is smaller than the current value 16691350, which suggests that one of the voters has lost committed data."*
This has happened at least 5 times in the past month, on different days, and in two different clusters. It always happens on broker 2. The error always shows *lastFetchTimestamp=-1, lastCaughtUpTimestamp=-1* for broker 2, while every other broker's timestamps match the log's timestamp (to the second, at least). The error is always off-by-one: the current value is exactly one higher than the computed high watermark. It is always preceded by the log cleaner compacting partitions, and it is always logged 5-6 times, after which nothing more is mentioned about it until the next occurrence days or weeks later.

I read through the ticket names included in the Kafka 3.5.0 and 3.6.0 changelogs and didn't see anything that looked related to this issue. I have attached the server.properties for my brokers (the attached copy is from broker 2, but it is identical on the other brokers except for node.id and advertised.listeners). The clusters are deployed via the Bitnami Kafka chart, version 22.1.3, though I don't believe that is related.

Has anyone encountered this issue before? The fact that it duplicates on different clusters (we do have MirrorMaker replicating certain topics between the clusters, but the errors occur on different days in the two clusters, so I don't think it's related to that) implies to me that this is a bug in Kafka itself rather than something specific to either environment.

Example 1:

[2023-09-28 17:08:10,839] *ERROR* [RaftManager nodeId=2] *The latest computed high watermark 16691349 is smaller than the current value 16691350, which suggests that one of the voters has lost committed data.*
Full voter replication state: [
ReplicaState(nodeId=0, endOffset=Optional[LogOffsetMetadata(offset=16691349, metadata=Optional[(segmentBaseOffset=16588205,relativePositionInSegment=7432590)])], lastFetchTimestamp=1695920890339, lastCaughtUpTimestamp=1695920890339, hasAcknowledgedLeader=true),
ReplicaState(nodeId=1, endOffset=Optional[LogOffsetMetadata(offset=16691349, metadata=Optional[(segmentBaseOffset=16588205,relativePositionInSegment=7432590)])], lastFetchTimestamp=1695920890839, lastCaughtUpTimestamp=1695920890839, hasAcknowledgedLeader=true),
*ReplicaState(nodeId=2, endOffset=Optional[LogOffsetMetadata(offset=16691350, metadata=Optional[(segmentBaseOffset=16588205,relativePositionInSegment=7432662)])], lastFetchTimestamp=-1, lastCaughtUpTimestamp=-1, hasAcknowledgedLeader=true),*
ReplicaState(nodeId=3, endOffset=Optional[LogOffsetMetadata(offset=16691349, metadata=Optional[(segmentBaseOffset=16588205,relativePositionInSegment=7432590)])], lastFetchTimestamp=1695920890838, lastCaughtUpTimestamp=1695920890838, hasAcknowledgedLeader=true),
ReplicaState(nodeId=4, endOffset=Optional[LogOffsetMetadata(offset=16691349, metadata=Optional[(segmentBaseOffset=16588205,relativePositionInSegment=7432590)])], lastFetchTimestamp=1695920890839, lastCaughtUpTimestamp=1695920890338, hasAcknowledgedLeader=true)
] (org.apache.kafka.raft.LeaderState)

Example 2, with some of the preceding logs (newest first), in case it helps:

[2023-10-18 23:30:13,437] INFO Cleaner 0: Swapping in cleaned segment LogSegment(baseOffset=0, size=716, lastModifiedTime=1697663101117, largestRecordTimestamp=Some(1697663100871)) for segment(s) List(LogSegment(baseOffset=0, size=837, lastModifiedTime=1697662948877, largestRecordTimestamp=Some(1697662948631)), LogSegment(baseOffset=28207, size=1711, lastModifiedTime=1697663101117, largestRecordTimestamp=Some(1697663100871))) in log Log(dir=/bitnami/kafka/data/arbitrary.topic.name-0, topicId=0_yeqO7AQWykvy07thtDDQ, topic=arbitrary.topic.name, partition=0, highWatermark=28215, lastStableOffset=28215, logStartOffset=0, logEndOffset=28215) (kafka.log.LogCleaner)
[2023-10-18 23:30:13,432] INFO Cleaner 0: Cleaning LogSegment(baseOffset=28207, size=1711, lastModifiedTime=1697663101117, largestRecordTimestamp=Some(1697663100871)) in log arbitrary.topic.name-0 into 0 with an upper bound deletion horizon 1697576548877 computed from the segment last modified time of 1697663101117,retaining deletes. (kafka.log.LogCleaner)
[2023-10-18 23:30:13,432] INFO Cleaner 0: Cleaning LogSegment(baseOffset=0, size=837, lastModifiedTime=1697662948877, largestRecordTimestamp=Some(1697662948631)) in log arbitrary.topic.name-0 into 0 with an upper bound deletion horizon 1697576548877 computed from the segment last modified time of 1697662948877,retaining deletes. (kafka.log.LogCleaner)
[2023-10-18 23:30:13,432] INFO Cleaner 0: Cleaning log arbitrary.topic.name-0 (cleaning prior to Wed Oct 18 21:05:01 UTC 2023, discarding tombstones prior to upper bound deletion horizon Tue Oct 17 21:02:28 UTC 2023)... (kafka.log.LogCleaner)
[2023-10-18 23:30:13,432] INFO Cleaner 0: Offset map for log arbitrary.topic.name-0 complete. (kafka.log.LogCleaner)
[2023-10-18 23:30:13,432] INFO Cleaner 0: Building offset map for log arbitrary.topic.name-0 for 1 segments in offset range [28207, 28214). (kafka.log.LogCleaner)
[2023-10-18 23:30:13,423] INFO Cleaner 0: Building offset map for arbitrary.topic.name-0... (kafka.log.LogCleaner)
[2023-10-18 23:30:13,423] INFO Cleaner 0: Beginning cleaning of log arbitrary.topic.name-0 (kafka.log.LogCleaner)
[2023-10-18 23:30:09,828] INFO [ProducerStateManager partition=arbitrary.topic.name-0] Wrote producer snapshot at offset 28214 with 0 producer ids in 3 ms. (kafka.log.ProducerStateManager)
[2023-10-18 23:30:09,825] INFO [LocalLog partition=arbitrary.topic.name-0, dir=/bitnami/kafka/data] Rolled new log segment at offset 28214 in 0 ms. (kafka.log.LocalLog)
[2023-10-18 23:30:09,352] INFO [ProducerStateManager partition=another.topic.name-0] Wrote producer snapshot at offset 34488 with 0 producer ids in 3 ms. (kafka.log.ProducerStateManager)
[2023-10-18 23:30:09,349] INFO [LocalLog partition=another.topic.name-0, dir=/bitnami/kafka/data] Rolled new log segment at offset 34488 in 0 ms. (kafka.log.LocalLog)
[2023-10-18 23:26:11,462] *ERROR* [RaftManager nodeId=2] *The latest computed high watermark 345450 is smaller than the current value 345451*, which suggests that one of the voters has lost committed data.
Full voter replication state: [
ReplicaState(nodeId=0, endOffset=Optional[LogOffsetMetadata(offset=345450, metadata=Optional[(segmentBaseOffset=0,relativePositionInSegment=24824347)])], lastFetchTimestamp=1697671571461, lastCaughtUpTimestamp=1697671571459, hasAcknowledgedLeader=true),
ReplicaState(nodeId=1, endOffset=Optional[LogOffsetMetadata(offset=345450, metadata=Optional[(segmentBaseOffset=0,relativePositionInSegment=24824347)])], lastFetchTimestamp=1697671571459, lastCaughtUpTimestamp=1697671570957, hasAcknowledgedLeader=true),
ReplicaState(nodeId=2, endOffset=Optional[LogOffsetMetadata(offset=345451, metadata=Optional[(segmentBaseOffset=0,relativePositionInSegment=24824419)])], *lastFetchTimestamp=-1, lastCaughtUpTimestamp=-1,* hasAcknowledgedLeader=true),
ReplicaState(nodeId=3, endOffset=Optional[LogOffsetMetadata(offset=345450, metadata=Optional[(segmentBaseOffset=0,relativePositionInSegment=24824347)])], lastFetchTimestamp=1697671571459, lastCaughtUpTimestamp=1697671570957, hasAcknowledgedLeader=true),
ReplicaState(nodeId=4, endOffset=Optional[LogOffsetMetadata(offset=345451, metadata=Optional[(segmentBaseOffset=0,relativePositionInSegment=24824419)])], lastFetchTimestamp=1697671571462, lastCaughtUpTimestamp=1697671571462, hasAcknowledgedLeader=true)
] (org.apache.kafka.raft.LeaderState)
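For context, here is my (possibly wrong) mental model of what the leader is computing when it logs this. This is a minimal, self-contained sketch, not Kafka's actual org.apache.kafka.raft.LeaderState code: the candidate high watermark is the largest offset that a majority of the five voters has replicated. Plugging in the end offsets from Example 1 reproduces the exact off-by-one from the error message.

```java
import java.util.Arrays;

// Illustrative sketch of the quorum high-watermark computation as I
// understand it; class and variable names are mine, not Kafka's.
public class HighWatermarkSketch {
    public static void main(String[] args) {
        // Voter end offsets from Example 1 (nodes 0-4). The leader
        // (node 2, the entry with lastFetchTimestamp=-1) is one ahead.
        long[] endOffsets = {16691349L, 16691349L, 16691350L, 16691349L, 16691349L};

        long[] sorted = endOffsets.clone();
        Arrays.sort(sorted); // ascending

        // With n=5 voters, the majority-replicated offset is the 3rd
        // highest, i.e. index n/2 counting down from the top.
        int n = sorted.length;
        long computed = sorted[n - 1 - n / 2];

        long current = 16691350L; // the value the leader already holds

        System.out.printf("computed=%d, current=%d%n", computed, current);
        // Prints computed=16691349, current=16691350. The condition
        // computed < current is exactly what produces the ERROR above.
    }
}
```

If that model is right, the puzzle isn't the computation itself (16691349 really is the majority-replicated offset), but how the current high watermark ever advanced to 16691350 when only broker 2 had that record, and why broker 2's own ReplicaState shows lastFetchTimestamp=-1 as if it had just been reset.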