Hi all,

This morning I woke up to a very high max replica lag on one of my brokers. I looked at the logs, and it seems that one of the replica fetchers for a partition just decided that its offset was out of range, so it reset its offset to the beginning of the leader’s log and started replicating from there. The broker is currently catching back up, so things will be fine.
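If anyone wants to poke at something similar, GetOffsetShell will report a partition's earliest/latest offsets from the leader (broker14:9092 below is just a placeholder for the leader, broker 14, not a real hostname):

  # Earliest (-2) and latest (-1) offsets on the leader for partition 0:
  bin/kafka-run-class.sh kafka.tools.GetOffsetShell \
    --broker-list broker14:9092 --topic webrequest_upload --partitions 0 --time -2
  bin/kafka-run-class.sh kafka.tools.GetOffsetShell \
    --broker-list broker14:9092 --topic webrequest_upload --partitions 0 --time -1

The -2 number should correspond to the leader's start offset, i.e. the 28493996399 the fetcher reset to in the logs below.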
But I’m curious: has anyone seen this before? Why would this just happen? The logs show that many segments for this partition were scheduled for deletion all at once, right before the fetcher reset its offset:

  [2015-10-29 09:27:11,899] 5421994218 [ReplicaFetcherThread-5-14] INFO kafka.log.Log - Scheduling log segment 28493996399 for log webrequest_upload-0 for deletion.
  … (repeats for about 950 segments) …
  [2015-10-29 09:27:12,606] 5421994925 [ReplicaFetcherThread-5-14] WARN kafka.server.ReplicaFetcherThread - [ReplicaFetcherThread-5-14], Replica 18 for partition [webrequest_upload,0] reset its fetch offset from 28493996399 to current leader 14's start offset 28493996399
  [2015-10-29 09:27:12,606] 5421994925 [ReplicaFetcherThread-5-14] ERROR kafka.server.ReplicaFetcherThread - [ReplicaFetcherThread-5-14], Current offset 31062784634 for partition [webrequest_upload,0] out of range; reset offset to 28493996399
  …

A more complete capture of this log is here: https://gist.github.com/ottomata/033ddef8f699ca09cfa8

Thanks!
-Ao
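PS: if I'm reading the 0.8.x fetcher code right, the burst of deletions is the follower wiping its own local copy before restarting from the leader's start offset. A rough paraphrase of ReplicaFetcherThread's out-of-range handling as I understand it (not the verbatim source; the helper names here are mine):

  // Paraphrase of the follower's out-of-range handling, as I understand it.
  trait OutOfRangeHandling {
    def leaderOffset(earliest: Boolean): Long        // OffsetRequest to the leader
    def localLogEndOffset: Long
    def truncateTo(offset: Long): Unit
    def truncateFullyAndStartAt(offset: Long): Unit  // deletes every local segment

    def handleOffsetOutOfRange(): Long = {
      val leaderEnd = leaderOffset(earliest = false)
      if (leaderEnd < localLogEndOffset) {
        // Follower is ahead of the leader: truncate back to the leader's end offset.
        truncateTo(leaderEnd)
        leaderEnd
      } else {
        // Follower's offset is no longer on the leader at all: drop the whole
        // local log (hence the ~950 "Scheduling log segment ... for deletion"
        // lines) and start over from the leader's start offset.
        val leaderStart = leaderOffset(earliest = true)
        truncateFullyAndStartAt(leaderStart)
        leaderStart
      }
    }
  }

That would explain the flood of deletion lines, though not why the offset went out of range in the first place.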