Hi Tom,

Thanks for your answer. No, I haven't checked the Kafka metrics yet; I wasn't actually aware of them. I will turn them on and monitor what happens on the next failure.
In the meantime I've also upgraded to 0.9, so according to what you said there is a chance that it won't happen anymore ;) It really felt like Kafka was stuck somewhere: it would accept some connections but was otherwise idle, with no CPU usage and no endlessly repeated messages in the logs, nothing like that.

Anyway, your answer will help me a lot in getting to the bottom of this. Failures on my laptop obviously aren't critical; as you noticed, I'm just wiping all the data. I just felt it might be important to note that this problem happens in 2 different environments.

On 18/05/16 15:50, Tom Crayford wrote:
> Hi,
>
> Did you check the critical Kafka metrics?
> http://docs.confluent.io/2.0.1/kafka/monitoring.html has a good set of
> them. We've seen a few issues with 0.8.X where the Controller gets stuck in
> an infinite loop (even on single broker clusters), which would possibly
> result in the case you see. Look at
> kafka.controller:type=KafkaController,name=ActiveControllerCount next time.
> Putting Kafka to sleep on a laptop seems likely to cause this kind of
> issue. I'd recommend either pausing the broker before sleep, or just wiping
> the data when this happens (if the data is critical, it likely doesn't
> belong on a laptop that goes to sleep).
>
> Lastly, because of this and many other issues we've seen in production with
> 0.8.X, I'd recommend upgrading to 0.9 ASAP - in our experience (running
> thousands of production clusters), 0.9 is much more stable than 0.8.
>
> Thanks
>
> Tom Crayford,
> Heroku Kafka
>
> On Wed, May 18, 2016 at 11:31 AM, Kamil Burzynski <ka...@nopik.net> wrote:
>
>> Hello,
>>
>> I'm trying to run a single Kafka broker, with a few topics. Basically 1
>> broker, 1 partition per topic, 1 replica, few topics. I've been using the
>> spotify/kafka dockerhub image which apparently just downloads a Kafka
>> release (0.8.2.1 in my case) and starts it with default config +
>> advertised host settings added.
>>
>> When I start Kafka like this it works fine, for a number of days.
>> Occasionally, and seemingly at random, it however enters some state where
>> my clients are receiving LeaderNotAvailable exception, for all topics.
>>
>> Once the Kafka server enters this state, I haven't found any way to get it
>> back to a healthy state. If I restart the server, it immediately works
>> fine again, for a few days. This is identical whether running on my
>> development laptop or on Amazon's ECS service. I have a feeling that it
>> happens often on my laptop when I put it to sleep (so virtualbox and
>> docker inside might be affected somehow), but over the past few weeks such
>> a failure didn't happen, despite daily usage and laptop sleeping.
>>
>> I googled a bit, it seems to happen when Kafka can't access itself through
>> the address specified in advertised host. I've verified that the host is
>> available (i.e. I can connect to itself using those settings), all
>> dns/networking/etc seem to work fine. Like, I can docker exec to the
>> docker container, and with telnet access zookeeper's 2181 or Kafka's
>> 9092 ports, using the addresses from server.properties file.
>>
>> I also tried to run kafka-preferred-replica-election, which succeeds on
>> the first try and says that the election process has started for all
>> topics. But that process apparently continues indefinitely, so subsequent
>> executions of that command abort due to the running election process.
>>
>> I've checked all the logs from Kafka and Zookeeper, nothing alarming
>> there, either.
>>
>> Any idea where I could dig next? How to troubleshoot it when it
>> happens? What to check/execute?
>>
>> PS. While I consider myself to be relatively strong in the devops area, my
>> experience with Kafka is very minimal, so please comment even on the most
>> novice details, as I'm likely to miss them.
>>
>> --
>> Best regards from
>> Kamil Burzynski
>>
>>

--
Best regards from
Kamil Burzynski
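
PS. The manual telnet check described in the quoted message (ZooKeeper's 2181 and Kafka's 9092, using the addresses from server.properties) could be scripted so it is easy to repeat on the next failure. This is only a minimal sketch using the Python standard library: the function name port_is_reachable and the "localhost" host are assumptions for illustration, not anything from the Kafka tooling. An open port only shows that something accepts TCP connections; as noted above, the broker kept accepting connections while it was stuck, so this does not prove it is healthy.

```python
import socket


def port_is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection resolves the host name and connects with a timeout
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # covers refused connections, timeouts and DNS failures
        return False


if __name__ == "__main__":
    # Assumed value: replace with the advertised host from server.properties
    advertised_host = "localhost"
    for name, port in (("zookeeper", 2181), ("kafka", 9092)):
        reachable = port_is_reachable(advertised_host, port)
        state = "reachable" if reachable else "NOT reachable"
        print(f"{name} {advertised_host}:{port} is {state}")
```

Run from inside the container (for example via docker exec), this repeats the same check that telnet was used for.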