Hi Sandor,

Thanks for your reply. I am not at work right now, but I am still a bit confused about what happened there:

1- One thing that I confirmed was that one of the 3 nodes was definitely down. We were unable to telnet into its Kafka port from anywhere. The other two nodes were up and we could telnet into their Kafka ports.

2- I modified my app a bit and implemented a means of sending DescribeCluster requests to the cluster, setting bootstrap-servers to all 3 nodes. The result indicated that the controller node ( https://kafka.apache.org/0110/javadoc/org/apache/kafka/clients/admin/DescribeClusterResult.html#controller()) had an id that was not amongst the nodes ( https://kafka.apache.org/0110/javadoc/org/apache/kafka/clients/admin/DescribeClusterResult.html#nodes()). It was the same node that was down (i.e. I could telnet into the other nodes but not the controller node). This did not change over time: even after a few minutes, the controller node's id was still the same.

3- Despite that, when running my app from my machine, I could get records from the topics I had subscribed to, but when running it from another machine, no records were delivered to the app. The app running on the other machine had a different consumer group, though.

4- The cluster had three nodes, and while the controller node was down, most of the time I was getting a message like this:

*"Connection to node N could not be established. Broker may not be available."*

where N was -1, -2, or -3, but at one point in my app's logs I found a handful of entries in which N was a very large number (e.g. 2156987456).

I assume our cluster was misbehaving, but I still can't explain why my app was behaving like this.

Best regards,
Behrang Saeedzadeh

On 20 February 2018 at 19:22, Sandor Murakozi <smurak...@gmail.com> wrote:

> Hi Behrang,
>
> All reads and writes of a partition go through the leader of that
> partition.
> If the leader of a partition is down you will not be able to
> produce/consume data in it until a new leader is elected. Typically it
> happens in a few seconds, after that you should be able to use that
> partition again. If your problem persists I recommend figuring out why
> leader election does not happen.
> You might be able to work with other partitions, at least those that have
> leaders on brokers that are up.
>
> Cheers,
> Sandor Murakozi
>
> On Tue, Feb 20, 2018 at 9:00 AM, Behrang <behran...@gmail.com> wrote:
>
> > Hi,
> >
> > I have a Kafka cluster with 3 nodes.
> >
> > I pass the nodes in the cluster to a consumer app I am building as
> > bootstrap servers.
> >
> > When one of the nodes in the cluster is down, the consumer group
> > sometimes CAN read records from the server but sometimes CAN NOT.
> >
> > In both cases, the same Kafka node is down.
> >
> > Is this behavior normal? Isn't it enough to only have one of the nodes in
> > the Kafka cluster be up and running? I have not delved much into setup
> > and administration of Kafka clusters, but I thought Kafka uses the nodes
> > for HA and as long as one node is up and running, the cluster remains
> > healthy and working.
> >
> > Best regards,
> > Behrang Saeedzadeh
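
P.S. In case it helps anyone reproduce the telnet check from point 1 without telnet, here is a minimal sketch using only the Java standard library. The class name BrokerPortCheck and the 3-second timeout are my own choices for illustration, and the host names in the comments are placeholders, not our real brokers. Note that a successful connection only shows that the port accepts TCP connections; it says nothing about whether the broker behind it is healthy.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper (not part of Kafka): checks whether a TCP port accepts connections.
class BrokerPortCheck {

    // Returns true if a TCP connection to host:port is established within timeoutMs.
    static boolean isReachable(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            // Connection refused, timed out, or host could not be resolved.
            return false;
        }
    }

    public static void main(String[] args) {
        // Each argument is expected in host:port form,
        // e.g. kafka-1.example.com:9092 (placeholder host name).
        for (String arg : args) {
            int sep = arg.lastIndexOf(':');
            String host = arg.substring(0, sep);
            int port = Integer.parseInt(arg.substring(sep + 1));
            System.out.println(arg + " reachable: " + isReachable(host, port, 3000));
        }
    }
}
```

Running it with the three broker addresses as arguments prints one line per broker, which should match what we saw with telnet: two reachable, one not.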