Hi Sandor,

Thanks for your reply. I am not at work right now, but I am still a bit
confused about what happened there:

1- One thing that I confirmed was that one of the 3 nodes was definitely down.
We were unable to telnet into its Kafka port from anywhere. The other two
nodes were up and we could telnet into their Kafka port.

2- I modified my app a bit and implemented a means of sending
DescribeCluster requests to the cluster, setting bootstrap-servers to all
3 nodes. The result indicated that the controller node (
https://kafka.apache.org/0110/javadoc/org/apache/kafka/clients/admin/DescribeClusterResult.html#controller())
had an id that was not amongst the nodes (
https://kafka.apache.org/0110/javadoc/org/apache/kafka/clients/admin/DescribeClusterResult.html#nodes()).
It was the same node that was down (i.e. I could telnet into the other
nodes but not the controller node). This did not change over time: even
after a few minutes, the reported controller id was still the same.

3- Despite that, when running my app from my machine, I could get records
from the topics I had subscribed to, but from another machine, no records
were getting sent to the app. The app running on the other machine used a
different consumer group, though.

4- The cluster had three nodes and when the controller node was down, most
of the time I was getting a message like this: *"Connection to node N
could not be established. Broker may not be available."* where N was either
-1, -2, or -3, but at one point in my app's logs I found a handful of
entries in which N was a very large number (e.g. 2156987456).
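
A possible reading of the node ids in point 4, sketched under two assumptions
about the Kafka Java client: (a) before metadata is fetched, the bootstrap
servers are given placeholder ids -1, -2, ... in the order they are listed, and
(b) the consumer's connection to its group coordinator uses an id of
Integer.MAX_VALUE minus the coordinator's broker id. The class and method names
below are hypothetical and only illustrate the arithmetic; they are not part of
the Kafka API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not Kafka code: it only reproduces the id arithmetic
// assumed in the paragraph above.
class NodeIdSketch {

    // Assumption (a): bootstrap servers get ids -1, -2, ... in listing order.
    static List<Integer> bootstrapIds(int bootstrapServerCount) {
        List<Integer> ids = new ArrayList<>();
        for (int i = 1; i <= bootstrapServerCount; i++) {
            ids.add(-i);
        }
        return ids;
    }

    // Assumption (b): the coordinator connection id is derived from the
    // broker id so that it cannot collide with a regular broker id.
    static int coordinatorConnectionId(int brokerId) {
        return Integer.MAX_VALUE - brokerId;
    }

    public static void main(String[] args) {
        System.out.println(bootstrapIds(3));            // prints [-1, -2, -3]
        System.out.println(coordinatorConnectionId(1)); // prints 2147483646
    }
}
```

Under these assumptions, a negative N would mean the client was still trying
the bootstrap addresses, and a very large N close to 2147483647 would mean it
was trying to reach the group coordinator.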

I assume our cluster was misbehaving, but I still can't explain why my app
was behaving like this.
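
For reference, the checks from points 1 and 2 amount to roughly the following.
This is a minimal sketch using only the JDK, not the code from the app: the
class name, host names, port, and ids are hypothetical. With the real
AdminClient, the node ids and the controller id would come from
DescribeClusterResult.nodes() and DescribeClusterResult.controller(), which
both return futures.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.Set;

// Hypothetical helper illustrating the two checks; not part of the Kafka API.
class BrokerCheckSketch {

    // Equivalent of the telnet test: can a TCP connection be opened at all?
    static boolean canConnect(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            // Refused, timed out, or host not resolvable.
            return false;
        }
    }

    // Is the reported controller id among the ids of the reported nodes?
    static boolean controllerIsListed(int controllerId, Set<Integer> nodeIds) {
        return nodeIds.contains(controllerId);
    }

    public static void main(String[] args) {
        // Hypothetical broker addresses; 9092 is only the conventional port.
        String[] hosts = {"kafka-1.example", "kafka-2.example", "kafka-3.example"};
        for (String host : hosts) {
            System.out.println(host + " reachable: " + canConnect(host, 9092, 2000));
        }
        // Hypothetical ids: controller 3 is reported, but only 1 and 2 are listed.
        System.out.println(controllerIsListed(3, Set.of(1, 2)));
    }
}
```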


Best regards,
Behrang Saeedzadeh

On 20 February 2018 at 19:22, Sandor Murakozi <smurak...@gmail.com> wrote:

> Hi Behrang,
>
> All reads and writes of a partition go through the leader of that
> partition.
> If the leader of a partition is down you will not be able to
> produce/consume data in it until a new leader is elected. Typically it
> happens in a few seconds, after that you should be able to use that
> partition again. If your problem persists I recommend figuring out why
> leader election does not happen.
> You might be able to work with other partitions, at least those that have
> leaders on brokers that are up.
>
> Cheers,
> Sandor Murakozi
>
> On Tue, Feb 20, 2018 at 9:00 AM, Behrang <behran...@gmail.com> wrote:
>
> > Hi,
> >
> > I have a Kafka cluster with 3 nodes.
> >
> > I pass the nodes in the cluster to a consumer app I am building as
> > bootstrap servers.
> >
> > When one of the nodes in the cluster is down, the consumer group sometimes
> > CAN read records from the server but sometimes CAN NOT.
> >
> > In both cases, the same Kafka node is down.
> >
> > Is this behavior normal? Isn't it enough to only have one of the nodes in
> > the Kafka cluster be up and running? I have not delved much into setup and
> > administration of Kafka clusters, but I thought Kafka uses the nodes for HA
> > and as long as one node is up and running, the cluster remains healthy and
> > working.
> >
> > Best regards,
> > Behrang Saeedzadeh
> >
>
