Hi,

Check the replication factor of the __consumer_offsets topic. If it is set to one, that's the issue: increase the topic's replication factor.
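If it helps, here is a rough sketch of how you could check it with the 0.11 AdminClient that is linked further down in this thread. The broker addresses are just placeholders; adjust them for your cluster.

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartitionInfo;

import java.util.Collections;
import java.util.Properties;

public class CheckOffsetsTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Placeholder broker addresses - replace with your own nodes.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG,
                "broker1:9092,broker2:9092,broker3:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            TopicDescription desc = admin
                    .describeTopics(Collections.singleton("__consumer_offsets"))
                    .all().get()
                    .get("__consumer_offsets");

            for (TopicPartitionInfo p : desc.partitions()) {
                // A replica count of 1 means the offsets partition becomes
                // unavailable whenever the broker hosting it goes down.
                System.out.printf("partition %d: leader=%s replicas=%d isr=%d%n",
                        p.partition(), p.leader(), p.replicas().size(), p.isr().size());
            }
        }
    }
}

Note that increasing the replication factor of an existing topic is usually done with a partition reassignment (e.g. the kafka-reassign-partitions.sh tool), not through the AdminClient.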
Thanks,
Siva

On Feb 21, 2018 1:35 PM, "Sandor Murakozi" <smurak...@gmail.com> wrote:

> Hi Behrang,
> I recommend you check out some docs that explain how partitions and
> replication work (e.g. https://sookocheff.com/post/kafka/kafka-in-a-nutshell/).
>
> What I'd highlight is that the partition leader and the controller are two
> different concepts. Each partition has its own leader, and it's the leader,
> not the controller, that is responsible for dealing with producers and
> consumers.
>
> Cheers,
> Sandor
>
> On Tue, Feb 20, 2018 at 12:50 PM, Behrang <behran...@gmail.com> wrote:
>
> > Hi Sandor,
> >
> > Thanks for your reply. I am not at work right now, but I am still a bit
> > confused about what happened at work:
> >
> > 1- One thing that I confirmed was that one of the 3 nodes was definitely
> > down. We were unable to telnet to its Kafka port from anywhere. The other
> > two nodes were up and we could telnet to their Kafka ports.
> >
> > 2- I modified my app a bit and implemented a means of sending
> > DescribeCluster requests to the cluster, setting bootstrap-servers to all
> > 3 nodes. The result indicated that the controller node
> > (https://kafka.apache.org/0110/javadoc/org/apache/kafka/clients/admin/DescribeClusterResult.html#controller())
> > had an id that was not among the nodes
> > (https://kafka.apache.org/0110/javadoc/org/apache/kafka/clients/admin/DescribeClusterResult.html#nodes()).
> > It was the same node that was down (i.e. I could telnet to the other
> > nodes but not to the controller node). And this stayed the same: even
> > after a few minutes, the controller node's id had not changed.
> >
> > 3- Despite that, when running my app from my machine, I could get records
> > from the topics I had subscribed to, but from another machine, no records
> > were getting sent to the app. The app running on the other machine used a
> > different consumer group, though.
> >
> > 4- The cluster had three nodes, and while the controller node was down,
> > most of the time I was getting a message like this: *"Connection to node
> > -N could not be established. Broker may not be available."* where N was
> > either -1, -2, or -3, but at one point in my app's logs I found a handful
> > of entries in which N was a very large number (e.g. 2156987456).
> >
> > I assume our cluster was misbehaving, but I still can't explain why my
> > app was working like this.
> >
> > Best regards,
> > Behrang Saeedzadeh
> >
> > On 20 February 2018 at 19:22, Sandor Murakozi <smurak...@gmail.com> wrote:
> >
> > > Hi Behrang,
> > >
> > > All reads and writes of a partition go through the leader of that
> > > partition. If the leader of a partition is down, you will not be able
> > > to produce/consume data in it until a new leader is elected. Typically
> > > this happens within a few seconds, after which you should be able to
> > > use that partition again. If your problem persists, I recommend
> > > figuring out why leader election does not happen.
> > > You might be able to work with other partitions, at least those that
> > > have leaders on brokers that are up.
> > >
> > > Cheers,
> > > Sandor Murakozi
> > >
> > > On Tue, Feb 20, 2018 at 9:00 AM, Behrang <behran...@gmail.com> wrote:
> > >
> > > > Hi,
> > > >
> > > > I have a Kafka cluster with 3 nodes.
> > > >
> > > > I pass the nodes in the cluster to a consumer app I am building as
> > > > bootstrap servers.
> > > >
> > > > When one of the nodes in the cluster is down, the consumer group
> > > > sometimes CAN read records from the server but sometimes CANNOT.
> > > >
> > > > In both cases, the same Kafka node is down.
> > > >
> > > > Is this behavior normal? Isn't it enough to have only one of the
> > > > nodes in the Kafka cluster up and running? I have not delved much
> > > > into the setup and administration of Kafka clusters, but I thought
> > > > Kafka uses the nodes for HA, and that as long as one node is up and
> > > > running, the cluster remains healthy and working.
> > > >
> > > > Best regards,
> > > > Behrang Saeedzadeh
> > > >
> > >
> >