It might be related to KAFKA-2477.

On Thu, Oct 29, 2015 at 6:44 AM, Andrew Otto <ao...@wikimedia.org> wrote:
> Hi all,
>
> This morning I woke up to see a very high max replica lag on one of my brokers. I looked at logs, and it seems that one of the replica fetchers for a partition just decided that its offset was out of range, so it reset its offset to the beginning of the leader’s log and started replicating from there. This broker is currently catching back up, so things will be fine.
>
> But, I’m curious. Has anyone seen this before? Why would this just happen?
>
> The logs show that many segments for this partition were scheduled for deletion all at once, right before the fetcher reset its offset:
>
> [2015-10-29 09:27:11,899] 5421994218 [ReplicaFetcherThread-5-14] INFO kafka.log.Log - Scheduling log segment 28493996399 for log webrequest_upload-0 for deletion.
> …
> (repeats for about 950 segments…)
> …
> [2015-10-29 09:27:12,606] 5421994925 [ReplicaFetcherThread-5-14] WARN kafka.server.ReplicaFetcherThread - [ReplicaFetcherThread-5-14], Replica 18 for partition [webrequest_upload,0] reset its fetch offset from 28493996399 to current leader 14's start offset 28493996399
> [2015-10-29 09:27:12,606] 5421994925 [ReplicaFetcherThread-5-14] ERROR kafka.server.ReplicaFetcherThread - [ReplicaFetcherThread-5-14], Current offset 31062784634 for partition [webrequest_upload,0] out of range; reset offset to 28493996399
> …
>
> A more complete capture of this log is here:
> https://gist.github.com/ottomata/033ddef8f699ca09cfa8
>
> Thanks!
> -Ao
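For readers who hit the same symptom later: the WARN/ERROR lines above are the follower handling an OffsetOutOfRange response from the leader. Below is a minimal Scala sketch of that decision, assuming the fetcher compares its own offset against the leader's start and end offsets; the names are hypothetical and this is not the actual kafka.server.ReplicaFetcherThread code.

    // Hypothetical, simplified sketch of a follower's OffsetOutOfRange handling.
    // Not the real broker code; names are illustrative only.
    object OffsetOutOfRangeSketch {

      // Returns the offset the follower should resume fetching from.
      def newFetchOffset(followerOffset: Long,
                         leaderStartOffset: Long,
                         leaderEndOffset: Long): Long = {
        if (followerOffset > leaderEndOffset)
          // Follower is ahead of the leader (e.g. after an unclean leader
          // election): truncate the local log back to the leader's end offset.
          leaderEndOffset
        else
          // Follower has fallen behind the leader's earliest retained offset
          // (e.g. old segments were deleted by retention): drop the local log
          // and start again from the leader's start offset.
          leaderStartOffset
      }

      def main(args: Array[String]): Unit = {
        // Offsets from the log excerpt: the fetcher's offset 31062784634 was
        // reported out of range and reset to the leader's start offset
        // 28493996399. The leader end offset here is a made-up placeholder.
        println(newFetchOffset(31062784634L, 28493996399L, 31500000000L)) // 28493996399
      }
    }

When the second branch is taken, the follower discards its local copy of the partition and refetches from the leader's start offset, which would line up with the ~950 "Scheduling log segment ... for deletion" messages logged just before the reset.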