Re: Consumer group intermittently can not read any records from a cluster with 3 nodes that has one node down

2018-03-01 Thread Siva A
Hi

Check the replication factor of the __consumer_offsets topic. If it's set
to 1, that's the issue. Increase the replication factor of the topic.
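Here is a minimal sketch of why this can hit one consumer group and not another. To my understanding, a group's coordinator is the leader of the __consumer_offsets partition chosen by hashing its group.id: abs(hashCode) modulo offsets.topic.num.partitions, whose default is 50. The class name, method names, broker ids and round-robin placement of partitions below are all hypothetical. The placement in particular is a simplification of how Kafka really assigns replicas.

```java
import java.util.Set;

/** Hypothetical sketch: which consumer groups lose their coordinator when a broker is down. */
public class CoordinatorCheck {

    /** Maps a group.id to a __consumer_offsets partition: abs(hashCode) % partitionCount. */
    static int partitionFor(String groupId, int offsetsPartitions) {
        int h = groupId.hashCode();
        // Math.abs(Integer.MIN_VALUE) is still negative, so treat it as 0.
        int abs = (h == Integer.MIN_VALUE) ? 0 : Math.abs(h);
        return abs % offsetsPartitions;
    }

    /** Assumption: replication factor 1, partition p placed only on broker p % brokerCount. */
    static int onlyReplicaOf(int partition, int brokerCount) {
        return partition % brokerCount;
    }

    /** The group can reach its coordinator only if that single replica is alive. */
    static boolean coordinatorAvailable(String groupId, int offsetsPartitions,
                                        int brokerCount, Set<Integer> liveBrokers) {
        int partition = partitionFor(groupId, offsetsPartitions);
        return liveBrokers.contains(onlyReplicaOf(partition, brokerCount));
    }

    public static void main(String[] args) {
        Set<Integer> live = Set.of(0, 1); // broker 2 is down
        for (String group : new String[] {"a", "b", "c"}) {
            int p = partitionFor(group, 50);
            System.out.printf("group %s -> partition %d on broker %d, available=%b%n",
                    group, p, onlyReplicaOf(p, 3), coordinatorAvailable(group, 50, 3, live));
        }
    }
}
```

With broker 2 down and a replication factor of 1, running main reports group a (partition 47, broker 2) as unavailable, while groups b and c still have coordinators. That is the "works for one group, not for the other" pattern. With a replication factor of 3, another in-sync replica could take over the partition.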

Thanks
Siva

On Feb 21, 2018 1:35 PM, "Sandor Murakozi" wrote:

> Hi Behrang,
> I recommend checking out some docs that explain how partitions and
> replication work (e.g.
> https://sookocheff.com/post/kafka/kafka-in-a-nutshell/).
>
> What I'd highlight is that the partition leader and the controller are two
> different concepts. Each partition has its own leader, and it's the leader,
> not the controller, that is responsible for dealing with producers and
> consumers.
>
> Cheers,
> Sandor
>
> On Tue, Feb 20, 2018 at 12:50 PM, Behrang wrote:
>
> > Hi Sandor,
> >
> > Thanks for your reply. I am not at work right now, but I am still a bit
> > confused about what happened at work:
> >
> > 1- One thing that I confirmed was that one of the 3 nodes was definitely
> > down. We were unable to telnet into its Kafka port from anywhere. The
> > other two nodes were up and we could telnet into their Kafka ports.
> >
> > 2- I modified my app a bit and implemented a means of sending
> > DescribeCluster requests to the cluster, with the bootstrap servers set
> > to all 3 nodes. The result indicated that the controller node (
> > https://kafka.apache.org/0110/javadoc/org/apache/kafka/clients/admin/DescribeClusterResult.html#controller())
> > had an id that was not among the nodes (
> > https://kafka.apache.org/0110/javadoc/org/apache/kafka/clients/admin/DescribeClusterResult.html#nodes()).
> > It was the same node that was down (i.e. I could telnet into the other
> > nodes but not the controller node). This did not change: even after a
> > few minutes, the controller node's id was still the same.
> >
> > 3- Despite that, when running my app from my machine, I could get records
> > from the topics I had subscribed to, but from another machine, no records
> > were getting sent to the app. The app running on the other machine used
> > a different consumer group, though.
> >
> > 4- The cluster had three nodes, and when the controller node was down,
> > most of the time I was getting a message like this: *"Connection to node
> > N could not be established. Broker may not be available."* where N was
> > either -1, -2, or -3, but at one point in my app's logs I found a handful
> > of entries in which N was a very large number (e.g. 2156987456).
> >
> > I assume our cluster was misbehaving, but I still can't explain why my
> > app was behaving like this.
> >
> >
> > Best regards,
> > Behrang Saeedzadeh
> >
> > On 20 February 2018 at 19:22, Sandor Murakozi wrote:
> >
> > > Hi Behrang,
> > >
> > > All reads and writes of a partition go through the leader of that
> > > partition.
> > > If the leader of a partition is down, you will not be able to
> > > produce/consume data in it until a new leader is elected. Typically this
> > > happens within a few seconds; after that you should be able to use that
> > > partition again. If your problem persists, I recommend figuring out why
> > > leader election does not happen.
> > > You might be able to work with other partitions, at least those that
> > > have leaders on brokers that are up.
> > >
> > > Cheers,
> > > Sandor Murakozi
> > >
> > > On Tue, Feb 20, 2018 at 9:00 AM, Behrang wrote:
> > >
> > > > Hi,
> > > >
> > > > I have a Kafka cluster with 3 nodes.
> > > >
> > > > I pass the nodes in the cluster to a consumer app I am building as
> > > > bootstrap servers.
> > > >
> > > > When one of the nodes in the cluster is down, the consumer group
> > > > sometimes CAN read records from the server but sometimes CANNOT.
> > > >
> > > > In both cases, the same Kafka node is down.
> > > >
> > > > Is this behavior normal? Isn't it enough for only one of the nodes in
> > > > the Kafka cluster to be up and running? I have not delved much into
> > > > the setup and administration of Kafka clusters, but I thought Kafka
> > > > used the nodes for HA, and that as long as one node was up and
> > > > running, the cluster would remain healthy and working.
> > > >
> > > > Best regards,
> > > > Behrang Saeedzadeh
> > > >
> > >
> >
>


Re: Consumer group intermittently can not read any records from a cluster with 3 nodes that has one node down

2018-02-21 Thread Sandor Murakozi
Hi Behrang,
I recommend checking out some docs that explain how partitions and
replication work (e.g.
https://sookocheff.com/post/kafka/kafka-in-a-nutshell/).

What I'd highlight is that the partition leader and the controller are two
different concepts. Each partition has its own leader, and it's the leader,
not the controller, that is responsible for dealing with producers and
consumers.
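The distinction can be shown with a toy model that has no Kafka dependency. The class and topic names here are hypothetical. Clients fetch from whichever broker leads a partition, so partitions led by live brokers stay usable even when the controller's broker is down. The model deliberately stops before any re-election of partition leaders or of the controller.

```java
import java.util.List;
import java.util.Set;

/** Hypothetical toy model: partition leaders and the controller are separate roles. */
public class LeaderVsController {

    static final class Partition {
        final String topic;
        final int id;
        final int leader; // broker id of this partition's current leader

        Partition(String topic, int id, int leader) {
            this.topic = topic;
            this.id = id;
            this.leader = leader;
        }
    }

    /** Produce and fetch requests go to the partition leader, not to the controller. */
    static boolean servable(Partition p, Set<Integer> liveBrokers) {
        return liveBrokers.contains(p.leader);
    }

    public static void main(String[] args) {
        int controller = 3;               // broker 3 is the controller...
        Set<Integer> live = Set.of(1, 2); // ...and broker 3 is the one that is down
        List<Partition> partitions = List.of(
                new Partition("orders", 0, 1),
                new Partition("orders", 1, 2),
                new Partition("orders", 2, 3));
        for (Partition p : partitions) {
            System.out.printf("%s-%d leader=%d servable=%b%n",
                    p.topic, p.id, p.leader, servable(p, live));
        }
        System.out.println("controller alive: " + live.contains(controller));
    }
}
```

Running main reports orders-0 and orders-1 as servable and orders-2 as not servable. The controller is reported as not alive, which by itself does not block the first two partitions.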

Cheers,
Sandor

On Tue, Feb 20, 2018 at 12:50 PM, Behrang wrote:

> Hi Sandor,
>
> Thanks for your reply. I am not at work right now, but I am still a bit
> confused about what happened at work:
>
> 1- One thing that I confirmed was that one of the 3 nodes was definitely
> down. We were unable to telnet into its Kafka port from anywhere. The other
> two nodes were up and we could telnet into their Kafka ports.
>
> 2- I modified my app a bit and implemented a means of sending
> DescribeCluster requests to the cluster, with the bootstrap servers set to
> all 3 nodes. The result indicated that the controller node (
> https://kafka.apache.org/0110/javadoc/org/apache/kafka/clients/admin/DescribeClusterResult.html#controller())
> had an id that was not among the nodes (
> https://kafka.apache.org/0110/javadoc/org/apache/kafka/clients/admin/DescribeClusterResult.html#nodes()).
> It was the same node that was down (i.e. I could telnet into the other
> nodes but not the controller node). This did not change: even after a few
> minutes, the controller node's id was still the same.
>
> 3- Despite that, when running my app from my machine, I could get records
> from the topics I had subscribed to, but from another machine, no records
> were getting sent to the app. The app running on the other machine used a
> different consumer group, though.
>
> 4- The cluster had three nodes, and when the controller node was down, most
> of the time I was getting a message like this: *"Connection to node N
> could not be established. Broker may not be available."* where N was either
> -1, -2, or -3, but at one point in my app's logs I found a handful of
> entries in which N was a very large number (e.g. 2156987456).
>
> I assume our cluster was misbehaving, but I still can't explain why my app
> was behaving like this.
>
>
> Best regards,
> Behrang Saeedzadeh
>
> On 20 February 2018 at 19:22, Sandor Murakozi wrote:
>
> > Hi Behrang,
> >
> > All reads and writes of a partition go through the leader of that
> > partition.
> > If the leader of a partition is down, you will not be able to
> > produce/consume data in it until a new leader is elected. Typically this
> > happens within a few seconds; after that you should be able to use that
> > partition again. If your problem persists, I recommend figuring out why
> > leader election does not happen.
> > You might be able to work with other partitions, at least those that have
> > leaders on brokers that are up.
> >
> > Cheers,
> > Sandor Murakozi
> >
> > On Tue, Feb 20, 2018 at 9:00 AM, Behrang wrote:
> >
> > > Hi,
> > >
> > > I have a Kafka cluster with 3 nodes.
> > >
> > > I pass the nodes in the cluster to a consumer app I am building as
> > > bootstrap servers.
> > >
> > > When one of the nodes in the cluster is down, the consumer group
> > > sometimes CAN read records from the server but sometimes CANNOT.
> > >
> > > In both cases, the same Kafka node is down.
> > >
> > > Is this behavior normal? Isn't it enough for only one of the nodes in
> > > the Kafka cluster to be up and running? I have not delved much into the
> > > setup and administration of Kafka clusters, but I thought Kafka used the
> > > nodes for HA, and that as long as one node was up and running, the
> > > cluster would remain healthy and working.
> > >
> > > Best regards,
> > > Behrang Saeedzadeh
> > >
> >
>


Re: Consumer group intermittently can not read any records from a cluster with 3 nodes that has one node down

2018-02-20 Thread Behrang
Hi Sandor,

Thanks for your reply. I am not at work right now, but I am still a bit
confused about what happened at work:

1- One thing that I confirmed was that one of the 3 nodes was definitely down.
We were unable to telnet into its Kafka port from anywhere. The other two
nodes were up and we could telnet into their Kafka ports.

2- I modified my app a bit and implemented a means of sending
DescribeCluster requests to the cluster, with the bootstrap servers set to all
3 nodes. The result indicated that the controller node (
https://kafka.apache.org/0110/javadoc/org/apache/kafka/clients/admin/DescribeClusterResult.html#controller())
had an id that was not among the nodes (
https://kafka.apache.org/0110/javadoc/org/apache/kafka/clients/admin/DescribeClusterResult.html#nodes()).
It was the same node that was down (i.e. I could telnet into the other
nodes but not the controller node). This did not change: even after a few
minutes, the controller node's id was still the same.

3- Despite that, when running my app from my machine, I could get records
from the topics I had subscribed to, but from another machine, no records
were getting sent to the app. The app running on the other machine used a
different consumer group, though.

4- The cluster had three nodes, and when the controller node was down, most
of the time I was getting a message like this: *"Connection to node N
could not be established. Broker may not be available."* where N was either
-1, -2, or -3, but at one point in my app's logs I found a handful of
entries in which N was a very large number (e.g. 2156987456).
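One possible reading of item 4 depends on two conventions of the Java client from this era. Both are assumptions here and should be checked against the client version in use. First, bootstrap addresses get placeholder node ids -1, -2, -3, ... until real cluster metadata arrives. Second, the consumer tracks its group coordinator under the id Integer.MAX_VALUE (2147483647) minus the broker id, so that it gets a separate connection. The hypothetical helper below decodes ids under those assumptions. Treating any id above Integer.MAX_VALUE / 2 as a coordinator id is a rough heuristic, not a rule from Kafka.

```java
/** Hypothetical helper for reading node ids in Java-client connection log messages. */
public class NodeIdDecoder {

    static String describe(int nodeId) {
        if (nodeId < 0) {
            // Assumption: bootstrap addresses get placeholder ids -1, -2, ... in list order.
            return "bootstrap server #" + (-nodeId);
        }
        if (nodeId > Integer.MAX_VALUE / 2) {
            // Assumption: the consumer registers its group coordinator as MAX_VALUE - brokerId.
            return "group coordinator on broker " + (Integer.MAX_VALUE - nodeId);
        }
        return "broker " + nodeId;
    }

    public static void main(String[] args) {
        for (int id : new int[] {-1, -3, 2, Integer.MAX_VALUE - 2}) {
            System.out.println(id + " -> " + describe(id));
        }
    }
}
```

Under these assumptions, "node -1" through "node -3" would be the three bootstrap addresses. An id just below 2147483647 would be the connection to the group coordinator, which ties the log entries back to the consumer-group side of the problem.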

I assume our cluster was misbehaving, but I still can't explain why my app
was behaving like this.


Best regards,
Behrang Saeedzadeh

On 20 February 2018 at 19:22, Sandor Murakozi wrote:

> Hi Behrang,
>
> All reads and writes of a partition go through the leader of that
> partition.
> If the leader of a partition is down, you will not be able to
> produce/consume data in it until a new leader is elected. Typically this
> happens within a few seconds; after that you should be able to use that
> partition again. If your problem persists, I recommend figuring out why
> leader election does not happen.
> You might be able to work with other partitions, at least those that have
> leaders on brokers that are up.
>
> Cheers,
> Sandor Murakozi
>
> On Tue, Feb 20, 2018 at 9:00 AM, Behrang wrote:
>
> > Hi,
> >
> > I have a Kafka cluster with 3 nodes.
> >
> > I pass the nodes in the cluster to a consumer app I am building as
> > bootstrap servers.
> >
> > When one of the nodes in the cluster is down, the consumer group
> > sometimes CAN read records from the server but sometimes CANNOT.
> >
> > In both cases, the same Kafka node is down.
> >
> > Is this behavior normal? Isn't it enough for only one of the nodes in the
> > Kafka cluster to be up and running? I have not delved much into the setup
> > and administration of Kafka clusters, but I thought Kafka used the nodes
> > for HA, and that as long as one node was up and running, the cluster
> > would remain healthy and working.
> >
> > Best regards,
> > Behrang Saeedzadeh
> >
>


Re: Consumer group intermittently can not read any records from a cluster with 3 nodes that has one node down

2018-02-20 Thread Sandor Murakozi
Hi Behrang,

All reads and writes of a partition go through the leader of that
partition.
If the leader of a partition is down, you will not be able to
produce/consume data in it until a new leader is elected. Typically this
happens within a few seconds; after that you should be able to use that
partition again. If your problem persists, I recommend figuring out why
leader election does not happen.
You might be able to work with other partitions, at least those that have
leaders on brokers that are up.
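Here is a sketch of the failover step described above. It assumes the rule that, as far as I know, the controller applies for a clean election: the new leader is the first replica in the partition's assignment list that is both alive and in the in-sync replica set (ISR). Unclean leader election is not modelled, and the class and method names are hypothetical.

```java
import java.util.List;
import java.util.Optional;
import java.util.Set;

/** Hypothetical sketch of leader failover for a single partition. */
public class LeaderElection {

    /**
     * Returns the first replica, in assignment order, that is both alive and in the ISR.
     * An empty result means the partition stays offline (unclean election not modelled).
     */
    static Optional<Integer> electLeader(List<Integer> replicas, Set<Integer> isr,
                                         Set<Integer> live) {
        for (int r : replicas) {
            if (live.contains(r) && isr.contains(r)) {
                return Optional.of(r);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Set<Integer> live = Set.of(1, 2); // broker 3 is down
        // Replication factor 3, old leader 3, replicas 1 and 2 in sync: broker 1 takes over.
        System.out.println(electLeader(List.of(3, 1, 2), Set.of(3, 1, 2), live));
        // Replication factor 1 on broker 3: no surviving replica, so no new leader.
        System.out.println(electLeader(List.of(3), Set.of(3), live));
    }
}
```

The first call prints Optional[1] and the second prints Optional.empty. The second case is what a replication factor of 1 means for a partition whose only broker is down: no election can bring it back until that broker returns.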

Cheers,
Sandor Murakozi

On Tue, Feb 20, 2018 at 9:00 AM, Behrang wrote:

> Hi,
>
> I have a Kafka cluster with 3 nodes.
>
> I pass the nodes in the cluster to a consumer app I am building as
> bootstrap servers.
>
> When one of the nodes in the cluster is down, the consumer group sometimes
> CAN read records from the server but sometimes CANNOT.
>
> In both cases, the same Kafka node is down.
>
> Is this behavior normal? Isn't it enough for only one of the nodes in the
> Kafka cluster to be up and running? I have not delved much into the setup
> and administration of Kafka clusters, but I thought Kafka used the nodes for
> HA, and that as long as one node was up and running, the cluster would
> remain healthy and working.
>
> Best regards,
> Behrang Saeedzadeh
>