Johnny Malizia created KAFKA-10207:
--------------------------------------

             Summary: Untrimmed Index files cause premature log segment 
deletions on startup
                 Key: KAFKA-10207
                 URL: https://issues.apache.org/jira/browse/KAFKA-10207
             Project: Kafka
          Issue Type: Bug
          Components: log
    Affects Versions: 2.4.1, 2.3.1, 2.4.0
            Reporter: Johnny Malizia


[KIP-263|https://cwiki.apache.org/confluence/display/KAFKA/KIP-263%3A+Allow+broker+to+skip+sanity+check+of+inactive+segments+on+broker+startup#KIP263:Allowbrokertoskipsanitycheckofinactivesegmentsonbrokerstartup-Evaluation]
 appears to have introduced a change that deliberately skips calling the 
sanityCheck method on the time and offset index files that Kafka loads at 
startup. I ran into a particularly nasty bug with the following configuration
{code:java}
jvm=1.8.0_191 zfs=0.6.5.6 kernel=4.4.0-1013-aws kafka=2.4.1{code}
The bug was that the retention period, whether set at the topic level or via 
the broker-level configuration, was not respected: no matter what, when the 
broker started up it decided that every log segment on disk had breached the 
retention window, and the data was purged away.

 
{code:java}
Found deletable segments with base offsets [11610665,12130396,12650133] due to 
retention time 86400000ms breach {code}
{code:java}
Rolled new log segment at offset 12764291 in 1 ms. (kafka.log.Log)
Scheduling segments for deletion List(LogSegment(baseOffset=11610665, 
size=1073731621, lastModifiedTime=1592532125000, largestTime=0), 
LogSegment(baseOffset=12130396, size=1073727967, 
lastModifiedTime=1592532462000, largestTime=0), LogSegment(baseOffset=12650133, 
size=235891971, lastModifiedTime=1592532531000, largestTime=0)) {code}
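For reference, the deletion decision here reduces to comparing each segment's 
largest timestamp against the retention window. The following is a minimal 
sketch (illustrative names, not Kafka's actual API) of why largestTime=0 makes 
every segment look deletable:
{code:java}
// Minimal sketch of the retention check; names are illustrative,
// not Kafka's actual API.
public class RetentionCheckDemo {

    static boolean breachesRetention(long largestTimestamp, long nowMs, long retentionMs) {
        // With largestTimestamp corrupted to 0, nowMs - 0 exceeds any
        // sane retention window, so the segment always looks deletable.
        return nowMs - largestTimestamp > retentionMs;
    }

    public static void main(String[] args) {
        long retentionMs = 86400000L; // the 24h window from the log line above
        long now = System.currentTimeMillis();
        System.out.println(breachesRetention(0L, now, retentionMs));          // true: purged
        System.out.println(breachesRetention(now - 1000L, now, retentionMs)); // false: kept
    }
}
{code}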
Further logging showed that the issue occurred while loading the index files, 
indicating that the final writes to trim the index had not succeeded
{code:java}
DEBUG Loaded index file 
/mnt/kafka-logs/test_topic-0/00000000000017221277.timeindex with maxEntries = 
873813, maxIndexSize = 10485760, entries = 873813, lastOffset = 
TimestampOffset(0,17221277), file position = 10485756 
(kafka.log.TimeIndex){code}
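For context, each time index entry is 12 bytes, an 8-byte timestamp followed 
by a 4-byte offset relative to the segment's base offset, which is why 873813 
entries put the file position at 10485756. The simplified illustration below 
(not Kafka's actual index code) shows how a zero-filled slot decodes as 
exactly the TimestampOffset(0,17221277) in the DEBUG line:
{code:java}
import java.nio.ByteBuffer;

// Simplified illustration, not Kafka's index code: a slot in the
// preallocated tail of the file was never written, so it is all zeros.
public class UntrimmedIndexDemo {
    public static void main(String[] args) {
        long baseOffset = 17221277L;                   // segment base offset from the log line
        ByteBuffer lastSlot = ByteBuffer.allocate(12); // zero-filled, like the untrimmed tail

        long timestamp = lastSlot.getLong(0);    // decodes as 0
        int relativeOffset = lastSlot.getInt(8); // decodes as 0

        // Prints "TimestampOffset(0,17221277)", matching the DEBUG line above
        System.out.println("TimestampOffset(" + timestamp + "," + (baseOffset + relativeOffset) + ")");
    }
}
{code}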
 

So, because the untrimmed index keeps the preallocated zero bytes at its tail, 
when the index is loaded again after restarting Kafka the largest timestamp is 
read back as 0, and this leads to premature retention-based deletion of the 
log segments.
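
This state would be cheap to detect on load. The sketch below is illustrative 
only, not Kafka's actual sanityCheck (the check that KIP-263 skips on 
startup); it just shows that a non-empty time index ending in a zero entry can 
be rejected before retention ever runs:
{code:java}
// Illustrative only; NOT Kafka's actual sanityCheck. It shows that the
// broken state is cheap to detect before trusting the index.
public class IndexSanity {
    static void sanityCheck(long fileLength, int entrySize, int entries, long lastTimestamp) {
        if (fileLength % entrySize != 0)
            throw new IllegalStateException("index length is not a multiple of the entry size");
        if (entries > 0 && lastTimestamp == 0)
            throw new IllegalStateException("non-empty time index ends in a zero entry; it was never trimmed");
    }

    public static void main(String[] args) {
        // Values from the DEBUG line above: the check would fail fast
        // instead of letting retention delete the segments.
        sanityCheck(10485756L, 12, 873813, 0L);
    }
}
{code}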

 

I tracked the issue down to the JVM version in use, as upgrading the JVM 
resolved it. Still, Kafka should never delete data by mistake like this: 
performing a rolling restart with this bug in place would cause complete data 
loss across the cluster.
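
One possible defensive approach (a sketch under the assumed 12-byte entry 
layout, not Kafka's actual fix) would be for the loader to scan backwards past 
the zero-filled tail and only trust the slots that were actually written:
{code:java}
import java.nio.ByteBuffer;

// Sketch of one possible defensive load, not Kafka's actual fix: ignore
// the zero-filled tail instead of trusting the raw file length.
public class DefensiveIndexLoad {
    static int countWrittenEntries(ByteBuffer index, int entrySize) {
        int entries = index.limit() / entrySize;
        while (entries > 0) {
            int pos = (entries - 1) * entrySize;
            // A slot that is all zeros was preallocated but never written.
            if (index.getLong(pos) != 0 || index.getInt(pos + 8) != 0)
                break;
            entries--;
        }
        return entries;
    }

    public static void main(String[] args) {
        ByteBuffer index = ByteBuffer.allocate(48); // four 12-byte slots, zero-filled
        index.putLong(0, 1592532125000L);           // only the first slot was written
        index.putInt(8, 42);
        System.out.println(countWrittenEntries(index, 12)); // 1
    }
}
{code}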

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
