Hi Dhirendra, after looking into your stack trace further, it shows that
the `AlterIsr` request is failing when trying to interact with ZooKeeper.
Unfortunately it doesn't currently tell us why.

Can you please set some loggers to DEBUG level to help analyse the issue
further?

These ones:

- kafka.zk.KafkaZkClient
- kafka.utils.ReplicationUtils
- kafka.server.ZkIsrManager
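
One way to bump these at runtime, without a broker restart, is the dynamic
broker logger support in kafka-configs.sh (the broker id and bootstrap
address below are placeholders for your setup):

  # Raise the three loggers to DEBUG on broker 1
  bin/kafka-configs.sh --bootstrap-server localhost:9092 \
    --alter --entity-type broker-loggers --entity-name 1 \
    --add-config kafka.zk.KafkaZkClient=DEBUG,kafka.utils.ReplicationUtils=DEBUG,kafka.server.ZkIsrManager=DEBUG

  # Verify the change took effect
  bin/kafka-configs.sh --bootstrap-server localhost:9092 \
    --describe --entity-type broker-loggers --entity-name 1

Alternatively the same levels can be set in log4j.properties, though that
requires a restart to pick up.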


If you can do that and then share any log output, that would be great :)

Cheers,

Liam

On Tue, 8 Mar 2022 at 17:50, Dhirendra Singh <dhirendr...@gmail.com> wrote:

> Hi Thomas,
> I see the IllegalStateException, but as I pasted earlier it is:
> java.lang.IllegalStateException: Failed to enqueue `AlterIsr` request
> with state LeaderAndIsr(leader=1, leaderEpoch=2728, isr=List(1, 2),
> zkVersion=4719) for partition __consumer_offsets-4
>
> I upgraded to version 2.8.1 but the issue is not resolved.
>
> Thanks,
> Dhirendra.
>
> On Mon, Mar 7, 2022 at 5:45 AM Liam Clarke-Hutchinson <lclar...@redhat.com> wrote:
>
>
> > Ah, I may have seen this error before. Dhirendra Singh, if you grep your
> > logs, you may find an IllegalStateException or two.
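> >
> > For example (the log location here is just illustrative):
> >
> >   grep "IllegalStateException" /var/log/kafka/server.log*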
> >
> > https://issues.apache.org/jira/browse/KAFKA-12948
> >
> > You need to upgrade to 2.7.2 if this is the issue you're hitting.
> >
> > Kind regards,
> >
> > Liam Clarke-Hutchinson
> >
> > On Sun, 6 Mar 2022 at 04:30, Mailbox - Dhirendra Kumar Singh <dhirendr...@gmail.com> wrote:
> >
> > > Let me rephrase my issue.
> > > The issue occurs when the brokers lose connectivity to the ZooKeeper
> > > servers. Connectivity loss can happen for many reasons: ZooKeeper
> > > servers getting bounced, a network glitch, etc.
> > >
> > > After the brokers reconnect to the ZooKeeper servers, I expect the Kafka
> > > cluster to come back to a stable state by itself, without any manual
> > > intervention. Instead, a few partitions remain under-replicated due to
> > > the error I pasted earlier.
> > >
> > > I feel this is some kind of bug, so I am going to file a bug report.
> > >
> > >
> > >
> > > Thanks,
> > >
> > > Dhirendra.
> > >
> > >
> > >
> > > From: Thomas Cooper <c...@tomcooper.dev>
> > > Sent: Friday, March 4, 2022 7:01 PM
> > > To: Dhirendra Singh <dhirendr...@gmail.com>
> > > Cc: users@kafka.apache.org
> > > Subject: Re: Few partitions stuck in under replication
> > >
> > >
> > >
> > > Do you roll the controller last?
> > >
> > > I suspect this has more to do with the way you are rolling the cluster
> > > (which I am still not clear on the need for) than with some kind of bug
> > > in Kafka (though that could of course be the case).
> > >
> > > Tom
> > >
> > > On 04/03/2022 01:59, Dhirendra Singh wrote:
> > >
> > > Hi Tom,
> > > During the rolling restart we check, in the readiness probe, that the
> > > under-replicated partition count is zero before restarting the next pod
> > > in order.
> > > This issue never occurred before. It started after we upgraded the Kafka
> > > version from 2.5.0 to 2.7.1, so I suspect a bug introduced in a version
> > > after 2.5.0.
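> > >
> > > Roughly, the probe does the equivalent of the following (the script path
> > > and broker address are illustrative):
> > >
> > >   # Pod is "ready" only when the broker reports no under-replicated partitions
> > >   URP=$(bin/kafka-topics.sh --bootstrap-server localhost:9092 \
> > >     --describe --under-replicated-partitions | wc -l)
> > >   [ "$URP" -eq 0 ]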
> > >
> > >
> > >
> > > Thanks,
> > >
> > > Dhirendra.
> > >
> > >
> > >
> > > On Thu, Mar 3, 2022 at 11:09 PM Thomas Cooper <c...@tomcooper.dev> wrote:
> > >
> > > I suspect this nightly rolling has something to do with your issues.
> > > If you are just rolling the StatefulSet in order, with no regard for
> > > maintaining minISR and other Kafka considerations, you are going to hit
> > > issues.
> > >
> > > If you are running on Kubernetes, I would suggest using an operator like
> > > Strimzi <https://strimzi.io/>, which will do a lot of Kafka admin tasks
> > > like this for you automatically.
> > >
> > > Tom
> > >
> > > On 03/03/2022 16:28, Dhirendra Singh wrote:
> > >
> > > Hi Tom,
> > >
> > > Doing the nightly restart is the decision of the cluster admin; I have
> > > no control over it.
> > > We have an implementation using a StatefulSet; the restart is triggered
> > > by updating an annotation on the pod template.
> > > The issue is not triggered by the Kafka cluster restart but by the
> > > ZooKeeper servers' restart.
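> > >
> > > Concretely, the trigger is roughly the following (the StatefulSet name,
> > > namespace, and annotation key here are illustrative):
> > >
> > >   # Bumping an annotation on the pod template makes the StatefulSet
> > >   # controller roll the pods one by one, in reverse ordinal order
> > >   kubectl patch statefulset kafka -n kafka --type merge -p \
> > >     '{"spec":{"template":{"metadata":{"annotations":{"restarted-at":"2022-03-03T00:00Z"}}}}}'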
> > >
> > > Thanks,
> > >
> > > Dhirendra.
> > >
> > >
> > >
> > > On Thu, Mar 3, 2022 at 7:19 PM Thomas Cooper <c...@tomcooper.dev> wrote:
> > >
> > > Hi Dhirendra,
> > >
> > > Firstly, I am interested in why you are restarting the ZK and Kafka
> > > clusters every night.
> > >
> > > Secondly, how are you doing the restarts? For example, in [Strimzi](
> > > https://strimzi.io/), when we roll the Kafka cluster we leave the
> > > designated controller broker until last. For each of the other brokers,
> > > we wait until all the partitions they lead are above their minISR, and
> > > then we roll the broker. In this way we maintain availability and make
> > > sure leadership can move off the rolling broker temporarily.
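> > >
> > > (A very rough shell sketch of that ordering; the broker ids, the restart
> > > step, and the use of a global under-replication check instead of the real
> > > per-broker minISR checks are all simplifications:)
> > >
> > >   for BROKER in 0 1 2; do
> > >     [ "$BROKER" = "$CONTROLLER_ID" ] && continue   # leave the controller until last
> > >     # wait until no partitions are under-replicated before touching the next broker
> > >     until [ -z "$(bin/kafka-topics.sh --bootstrap-server localhost:9092 \
> > >           --describe --under-replicated-partitions)" ]; do sleep 10; done
> > >     restart_broker "$BROKER"          # placeholder for your restart mechanism
> > >   done
> > >   restart_broker "$CONTROLLER_ID"     # finally roll the designated controller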
> > >
> > > Cheers,
> > >
> > > Tom Cooper
> > >
> > > [@tomncooper](https://twitter.com/tomncooper) | https://tomcooper.dev
> > >
> > > On 03/03/2022 07:38, Dhirendra Singh wrote:
> > >
> > > > Hi All,
> > > >
> > > > We have a Kafka cluster running in Kubernetes. The Kafka version we are
> > > > using is 2.7.1.
> > > > Every night the ZooKeeper servers and Kafka brokers are restarted.
> > > > After the nightly restart of the ZooKeeper servers, some partitions
> > > > remain stuck in under-replication. This happens randomly, but not at
> > > > every nightly restart.
> > > > Partitions remain under-replicated until the Kafka broker with the
> > > > partition leader is restarted.
> > > > For example, partition 4 of the __consumer_offsets topic remains
> > > > under-replicated and we see the following error in the log:
> > > >
> > > > [2022-02-28 04:01:20,217] WARN [Partition __consumer_offsets-4 broker=1] Controller failed to update ISR to PendingExpandIsr(isr=Set(1), newInSyncReplicaId=2) due to unexpected UNKNOWN_SERVER_ERROR. Retrying. (kafka.cluster.Partition)
> > > > [2022-02-28 04:01:20,217] ERROR [broker-1-to-controller] Uncaught error in request completion: (org.apache.kafka.clients.NetworkClient)
> > > > java.lang.IllegalStateException: Failed to enqueue `AlterIsr` request with state LeaderAndIsr(leader=1, leaderEpoch=2728, isr=List(1, 2), zkVersion=4719) for partition __consumer_offsets-4
> > > >     at kafka.cluster.Partition.sendAlterIsrRequest(Partition.scala:1403)
> > > >     at kafka.cluster.Partition.$anonfun$handleAlterIsrResponse$1(Partition.scala:1438)
> > > >     at kafka.cluster.Partition.handleAlterIsrResponse(Partition.scala:1417)
> > > >     at kafka.cluster.Partition.$anonfun$sendAlterIsrRequest$1(Partition.scala:1398)
> > > >     at kafka.cluster.Partition.$anonfun$sendAlterIsrRequest$1$adapted(Partition.scala:1398)
> > > >     at kafka.server.AlterIsrManagerImpl.$anonfun$handleAlterIsrResponse$8(AlterIsrManager.scala:166)
> > > >     at kafka.server.AlterIsrManagerImpl.$anonfun$handleAlterIsrResponse$8$adapted(AlterIsrManager.scala:163)
> > > >     at scala.collection.immutable.List.foreach(List.scala:333)
> > > >     at kafka.server.AlterIsrManagerImpl.handleAlterIsrResponse(AlterIsrManager.scala:163)
> > > >     at kafka.server.AlterIsrManagerImpl.responseHandler$1(AlterIsrManager.scala:94)
> > > >     at kafka.server.AlterIsrManagerImpl.$anonfun$sendRequest$2(AlterIsrManager.scala:104)
> > > >     at kafka.server.BrokerToControllerRequestThread.handleResponse(BrokerToControllerChannelManagerImpl.scala:175)
> > > >     at kafka.server.BrokerToControllerRequestThread.$anonfun$generateRequests$1(BrokerToControllerChannelManagerImpl.scala:158)
> > > >     at org.apache.kafka.clients.ClientResponse.onComplete(ClientResponse.java:109)
> > > >     at org.apache.kafka.clients.NetworkClient.completeResponses(NetworkClient.java:586)
> > > >     at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:578)
> > > >     at kafka.common.InterBrokerSendThread.doWork(InterBrokerSendThread.scala:71)
> > > >     at kafka.server.BrokerToControllerRequestThread.doWork(BrokerToControllerChannelManagerImpl.scala:183)
> > > >     at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:96)
> > > > This looks like some kind of race condition bug... does anyone have any idea?
> > > >
> > > > Thanks,
> > > > Dhirendra
> > >
> > >
> > >
> > > --
> > >
> > > Tom Cooper
> > >
> > > @tomncooper <https://twitter.com/tomncooper> | tomcooper.dev <https://tomcooper.dev>
> > >
> > >
> > >
> > > --
> > >
> > > Tom Cooper
> > >
> > > @tomncooper <https://twitter.com/tomncooper> | tomcooper.dev <https://tomcooper.dev>
> > >
> > >
> >
>
