Oh interesting. I didn’t know about that log file until now. The only error that appears in it on all of the brokers showing this behavior is:
ERROR [kafka-log-cleaner-thread-0], Error due to (kafka.log.LogCleaner)

Then we see many messages like this:

INFO Compaction for partition [__consumer_offsets,30] is resumed (kafka.log.LogCleaner)
INFO The cleaning for partition [__consumer_offsets,30] is aborted (kafka.log.LogCleaner)

Using VisualVM, I do not see any log-cleaner threads in those brokers. I do see them in the brokers not showing this behavior, though.

Any idea why the LogCleaner failed? As a temporary fix, should we restart the affected brokers?

Thanks again!

Lawrence Weikum

On 7/13/16, 10:34 AM, "Manikumar Reddy" <manikumar.re...@gmail.com> wrote:

Hi,

Are you seeing any errors in log-cleaner.log? The log-cleaner thread can crash on certain errors.

Thanks
Manikumar

On Wed, Jul 13, 2016 at 9:54 PM, Lawrence Weikum <lwei...@pandora.com> wrote:

> Hello,
>
> We’re seeing a strange behavior in Kafka 0.9.0.1 which occurs about every
> other week. I’m curious if others have seen it and know of a solution.
>
> Setup and Scenario:
>
> - Brokers initially set up with log compaction turned off
>
> - After 30 days, log compaction was turned on
>
> - At this time, the number of open FDs was ~30K per broker.
>
> - After 2 days, the __consumer_offsets topic was compacted
>   fully. Open FDs reduced to ~5K per broker.
>
> - Cluster has been under normal load for roughly 7 days.
>
> - At the 7-day mark, the __consumer_offsets topic seems to have
>   stopped compacting on two of the brokers, and on those brokers, the FD
>   count is up to ~25K.
>
> We have tried rebalancing the partitions before. The first time, the
> destination broker compacted the data fine and open FDs were low. The
> second time, the destination broker kept the FDs open.
>
> In all the broker logs, we’re seeing these messages:
>
> INFO [Group Metadata Manager on Broker 8]: Removed 0 expired offsets in 0
> milliseconds. (kafka.coordinator.GroupMetadataManager)
>
> There are only 4 consumers at the moment on the cluster; one topic with 92
> partitions.
>
> Is there a reason why log compaction may stop working or why the
> __consumer_offsets topic would start holding thousands of FDs?
>
> Thank you all for your help!
>
> Lawrence Weikum
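
A quick way to confirm from the shell whether the cleaner thread is actually gone on a broker, and to watch the open-FD count, is something like the following (a minimal sketch assuming a Linux host; BROKER_PID is a placeholder for the broker's process id):

    # Thread dump of the broker JVM; the cleaner shows up as kafka-log-cleaner-thread-N
    jstack BROKER_PID | grep -i "log-cleaner"

    # Count of file descriptors currently held open by the broker process
    ls /proc/BROKER_PID/fd | wc -l

If the grep comes back empty on the affected brokers, the cleaner thread has crashed and, in 0.9.0.1, it is not restarted automatically, so restarting those brokers is the usual way to get compaction going again until the underlying error is found.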