Hi,

Did you check the critical Kafka metrics?
http://docs.confluent.io/2.0.1/kafka/monitoring.html has a good set of them.

We've seen a few issues with 0.8.X where the Controller gets stuck in an
infinite loop (even on single-broker clusters), which could plausibly produce
the state you're seeing. Next time it happens, look at
kafka.controller:type=KafkaController,name=ActiveControllerCount.

Putting Kafka to sleep on a laptop seems likely to cause this kind of issue.
I'd recommend either pausing the broker before sleep, or just wiping the data
when this happens (if the data is critical, it likely doesn't belong on a
laptop that goes to sleep).
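
In case it helps, here is a minimal sketch of reading that metric over JMX from
a small Java program. It rests on several assumptions: the broker was started
with remote JMX enabled (for example via the JMX_PORT environment variable),
port 9999 is only a placeholder, the gauge is exposed through an attribute
named "Value", and the class and helper names are made up for illustration.
On a single-broker cluster I'd expect the value to be 1; a value of 0 while
clients see LeaderNotAvailable would point at the controller.

```java
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Hypothetical helper class; not part of Kafka itself.
class ActiveControllerCheck {
    // MBean name taken from the metric mentioned above.
    static final String MBEAN =
        "kafka.controller:type=KafkaController,name=ActiveControllerCount";

    // Builds the standard RMI-based JMX service URL for a host and port.
    static String jmxUrl(String host, int port) {
        return "service:jmx:rmi:///jndi/rmi://" + host + ":" + port + "/jmxrmi";
    }

    // Turns the raw gauge value into a short status line
    // (interpretation assumes a single-broker cluster).
    static String describe(int count) {
        if (count == 1) {
            return "OK: this broker is the active controller";
        }
        if (count == 0) {
            return "WARN: this broker is not an active controller";
        }
        return "ERROR: unexpected ActiveControllerCount " + count;
    }

    // Connects to the broker's JMX endpoint and reads the gauge.
    // Assumes the gauge attribute is called "Value".
    static int readActiveControllerCount(String host, int port) throws Exception {
        JMXServiceURL url = new JMXServiceURL(jmxUrl(host, port));
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbeans = connector.getMBeanServerConnection();
            Object value = mbeans.getAttribute(new ObjectName(MBEAN), "Value");
            return ((Number) value).intValue();
        }
    }

    public static void main(String[] args) throws Exception {
        // Host and port default to placeholder values.
        String host = args.length > 0 ? args[0] : "localhost";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 9999;
        System.out.println(describe(readActiveControllerCount(host, port)));
    }
}
```

Run it against the broker's JMX host and port while the cluster is healthy
and again once it is in the bad state, then compare the two results.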

Lastly, because of this and many other issues we've seen in production with
0.8.X, I'd recommend upgrading to 0.9 ASAP - in our experience (running
thousands of production clusters), 0.9 is much more stable than 0.8.

Thanks

Tom Crayford
Heroku Kafka

On Wed, May 18, 2016 at 11:31 AM, Kamil Burzynski <ka...@nopik.net> wrote:

> Hello,
>
> I'm trying to run a single Kafka broker with a few topics. Basically 1
> broker, 1 partition per topic, 1 replica, a few topics. I've been using
> the spotify/kafka dockerhub image, which apparently just downloads a
> Kafka release (0.8.2.1 in my case) and starts it with the default config
> plus advertised host settings.
>
> When I start Kafka like this it works fine for a number of days.
> Occasionally, and seemingly at random, it enters a state where my
> clients receive LeaderNotAvailable exceptions for all topics.
>
> Once the Kafka server enters this state, I haven't found any way to get
> it back to a healthy state. If I restart the server, it immediately
> works fine again, for a few days. This is identical whether running on
> my development laptop or on Amazon's ECS service. I have a feeling that
> it happens often on my laptop when I put it to sleep (so VirtualBox and
> Docker inside might be affected somehow), but over the past few weeks
> such a failure didn't happen, despite daily usage and laptop sleeping.
>
> I googled a bit; it seems to happen when Kafka can't reach itself
> through the address specified in the advertised host. I've verified that
> the host is available (i.e. I can connect to it using those settings),
> and all DNS/networking/etc. seems to work fine. For example, I can
> docker exec into the container and use telnet to reach ZooKeeper's 2181
> or Kafka's 9092 ports, using the addresses from the server.properties
> file.
>
> I also tried to run kafka-preferred-replica-election, which succeeds on
> the first try and says that the election process has started for all
> topics. But that process apparently continues indefinitely, so
> subsequent executions of that command abort because an election is
> already running.
>
> I've checked all the logs from Kafka and ZooKeeper; nothing alarming
> there, either.
>
> Any idea where I could dig next? How do I troubleshoot it when it
> happens? What should I check/execute?
>
> PS. While I consider myself relatively strong in the devops area, my
> experience with Kafka is very minimal, so please comment on even the
> most novice details, as I'm likely to miss them.
>
> --
> Best regards from
> Kamil Burzynski