Thanks for your reply. Replication factor is 2 and min.insync.replicas is default (1) I will get back to you on the producer's ack settings.
Regards, Dhiraj On Sun, Jun 5, 2022 at 1:20 PM Liam Clarke-Hutchinson <lclar...@redhat.com> wrote: > Off the top of my head, it looks like it lost network connectivity to some > extent. > > Question - what settings were used for topics like efGamePlay? What is min > insync replicas, replication factor, and what acks settings is the producer > using? > > Cheers, > > Liam > > On Fri, 3 Jun 2022 at 22:55, dhiraj prajapati <dhirajp...@gmail.com> > wrote: > > > Hi all, > > Recently we faced an issue with one of our production kafka clusters: > > - It is a 3 node cluster > > - kafka server version is 1.0 > > > > *Issue*: > > One of the brokers had some problem resulting in the following: > > 1. The broker lost leadership of all of the topic-parttions > > 2. However the kafka server process did *NOT* get stopped/killed. > > > > I have attached the exception traces from controller and server logs at > > that time at the end of this email. > > > > *Questions:* > > 1. What could be the reason behind this happening? > > 2. How can I reproduce this exact scenario in our test environment? > > > > P.S. We did not see any GC logs, or general network blips at around that > > time. > > > > *Exception trace in controller log:* > > > > [2022-05-21 09:07:45,914] INFO [RequestSendThread controllerId=2] > Starting > > (kafka.controller.RequestSendThread) > > [2022-05-21 09:07:45,914] INFO [RequestSendThread controllerId=2] > Starting > > (kafka.controller.RequestSendThread) > > [2022-05-21 09:07:48,354] ERROR [Controller id=2] Error while electing or > > becoming controller on broker 2 (kafka.controller.KafkaController) > > kafka.zookeeper.ZooKeeperClientExpiredException: Session expired either > > before or while waiting for connection > > at > > > > > kafka.zookeeper.ZooKeeperClient$$anonfun$kafka$zookeeper$ZooKeeperClient$$waitUntilConnected$1.apply$mcV$sp(ZooKeeperClient.scala:233) > > at > > > > > kafka.zookeeper.ZooKeeperClient$$anonfun$kafka$zookeeper$ZooKeeperClient$$waitUntilConnected$1.apply(ZooKeeperClient.scala:221) > > at > > > > > kafka.zookeeper.ZooKeeperClient$$anonfun$kafka$zookeeper$ZooKeeperClient$$waitUntilConnected$1.apply(ZooKeeperClient.scala:221) > > at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:250) > > at > > > > > kafka.zookeeper.ZooKeeperClient.kafka$zookeeper$ZooKeeperClient$$waitUntilConnected(ZooKeeperClient.scala:221) > > at > > > > > kafka.zookeeper.ZooKeeperClient$$anonfun$waitUntilConnected$1.apply$mcV$sp(ZooKeeperClient.scala:215) > > at > > > > > kafka.zookeeper.ZooKeeperClient$$anonfun$waitUntilConnected$1.apply(ZooKeeperClient.scala:215) > > at > > > > > kafka.zookeeper.ZooKeeperClient$$anonfun$waitUntilConnected$1.apply(ZooKeeperClient.scala:215) > > at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:250) > > at > > > > > kafka.zookeeper.ZooKeeperClient.waitUntilConnected(ZooKeeperClient.scala:214) > > at > > > > > kafka.zk.KafkaZkClient.retryRequestsUntilConnected(KafkaZkClient.scala:1461) > > at > > > > > kafka.zk.KafkaZkClient.kafka$zk$KafkaZkClient$$retryRequestUntilConnected(KafkaZkClient.scala:1437) > > at > > kafka.zk.KafkaZkClient.getPartitionReassignment(KafkaZkClient.scala:671) > > at > > > > > kafka.controller.KafkaController.initializePartitionReassignment(KafkaController.scala:698) > > at > > > > > kafka.controller.KafkaController.initializeControllerContext(KafkaController.scala:655) > > at > > > > > kafka.controller.KafkaController.onControllerFailover(KafkaController.scala:232) > > at > > > > > kafka.controller.KafkaController.kafka$controller$KafkaController$$elect(KafkaController.scala:1203) > > at > > > > > kafka.controller.KafkaController$Reelect$.process(KafkaController.scala:1479) > > at > > > > > kafka.controller.ControllerEventManager$ControllerEventThread$$anonfun$doWork$1.apply$mcV$sp(ControllerEventManager.scala:69) > > at > > > > > kafka.controller.ControllerEventManager$ControllerEventThread$$anonfun$doWork$1.apply(ControllerEventManager.scala:69) > > at > > > > > kafka.controller.ControllerEventManager$ControllerEventThread$$anonfun$doWork$1.apply(ControllerEventManager.scala:69) > > at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:31) > > at > > > > > kafka.controller.ControllerEventManager$ControllerEventThread.doWork(ControllerEventManager.scala:68) > > at > kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82) > > [2022-05-21 09:07:48,358] DEBUG [Controller id=2] Resigning > > (kafka.controller.KafkaController) > > [2022-05-21 09:07:48,358] DEBUG [Controller id=2] Unregister > > BrokerModifications handler for Set(2, 1) > > (kafka.controller.KafkaController) > > [2022-05-21 09:07:48,360] INFO [PartitionStateMachine controllerId=2] > > Stopped partition state machine (kafka.controller.PartitionStateMachine) > > [2022-05-21 09:07:48,361] INFO [ReplicaStateMachine controllerId=2] > Stopped > > replica state machine (kafka.controller.ReplicaStateMachine) > > [2022-05-21 09:07:48,361] INFO [RequestSendThread controllerId=2] > Shutting > > down (kafka.controller.RequestSendThread) > > [2022-05-21 09:07:48,364] INFO [RequestSendThread controllerId=2] Stopped > > (kafka.controller.RequestSendThread) > > [2022-05-21 09:07:48,364] INFO [RequestSendThread controllerId=2] > Shutdown > > completed (kafka.controller.RequestSendThread) > > [2022-05-21 09:07:48,364] INFO [RequestSendThread controllerId=2] > Shutting > > down (kafka.controller.RequestSendThread) > > [2022-05-21 09:07:48,364] INFO [RequestSendThread controllerId=2] Stopped > > (kafka.controller.RequestSendThread) > > [2022-05-21 09:07:48,364] INFO [RequestSendThread controllerId=2] > Shutdown > > completed (kafka.controller.RequestSendThread) > > [2022-05-21 09:07:48,365] INFO [Controller id=2] Resigned > > (kafka.controller.KafkaController) > > > > > > *Snippets from server log:* > > > > [2022-05-21 09:06:36,678] INFO [Log partition=efAcquisition-18, > > dir=/home/tomcat/kafka_2.11-0.9.0.1/commitlogs] Rolled new log segment at > > offset 252314842 in 2 ms. (kafka.log.Log) > > [2022-05-21 09:06:53,106] INFO [GroupCoordinator 2]: Preparing to > rebalance > > group > > > spark-kafka-source-800e8ff9-02f4-4c34-bfc9-2d73d723b4e3-1912945116-driver-0 > > with old generation 1 (__consumer_offsets-43) > > (kafka.coordinator.group.GroupCoordinator) > > [2022-05-21 09:06:53,107] INFO [GroupCoordinator 2]: Group > > > spark-kafka-source-800e8ff9-02f4-4c34-bfc9-2d73d723b4e3-1912945116-driver-0 > > with generation 2 is now empty (__consumer_offsets-43) > > (kafka.coordinator.group.GroupCoordinator) > > [2022-05-21 09:06:59,793] INFO [GroupCoordinator 2]: Preparing to > rebalance > > group > > > spark-kafka-source-c2b4063b-baf3-40aa-af8b-1c0ea4246ca8-196930323-driver-0 > > with old generation 1 (__consumer_offsets-37) > > (kafka.coordinator.group.GroupCoordinator) > > [2022-05-21 09:06:59,793] INFO [GroupCoordinator 2]: Group > > > spark-kafka-source-c2b4063b-baf3-40aa-af8b-1c0ea4246ca8-196930323-driver-0 > > with generation 2 is now empty (__consumer_offsets-37) > > (kafka.coordinator.group.GroupCoordinator) > > [2022-05-21 09:07:21,154] INFO [ProducerStateManager > partition=hh_topic-9] > > Writing producer snapshot at offset 7837005056 > > (kafka.log.ProducerStateManager) > > [2022-05-21 09:07:21,155] INFO [Log partition=hh_topic-9, > > dir=/home/tomcat/kafka_2.11-0.9.0.1/commitlogs] Rolled new log segment at > > offset 7837005056 in 2 ms. (kafka.log.Log) > > [2022-05-21 09:07:26,088] INFO [ProducerStateManager > > partition=efGamePlay-11] Writing producer snapshot at offset 2116578569 > > (kafka.log.ProducerStateManager) > > [2022-05-21 09:07:26,089] INFO [Log partition=efGamePlay-11, > > dir=/home/tomcat/kafka_2.11-0.9.0.1/commitlogs] Rolled new log segment at > > offset 2116578569 in 2 ms. (kafka.log.Log) > > [2022-05-21 09:07:27,170] INFO [ProducerStateManager > > partition=efGamePlay-18] Writing producer snapshot at offset 2116607621 > > (kafka.log.ProducerStateManager) > > [2022-05-21 09:07:27,171] INFO [Log partition=efGamePlay-18, > > dir=/home/tomcat/kafka_2.11-0.9.0.1/commitlogs] Rolled new log segment at > > offset 2116607621 in 2 ms. (kafka.log.Log) > > [2022-05-21 09:07:30,444] INFO [ProducerStateManager > > partition=auditTopic-36] Writing producer snapshot at offset 145646657 > > (kafka.log.ProducerStateManager) > > [2022-05-21 09:07:30,445] INFO [Log partition=auditTopic-36, > > dir=/home/tomcat/kafka_2.11-0.9.0.1/commitlogs] Rolled new log segment at > > offset 145646657 in 2 ms. (kafka.log.Log) > > [2022-05-21 09:07:33,189] INFO [ProducerStateManager > > partition=heartbeats-0] Writing producer snapshot at offset 43292127 > > (kafka.log.ProducerStateManager) > > [2022-05-21 09:07:33,190] INFO [Log partition=heartbeats-0, > > dir=/home/tomcat/kafka_2.11-0.9.0.1/commitlogs] Rolled new log segment at > > offset 43292127 in 3 ms. (kafka.log.Log) > > [2022-05-21 09:07:35,760] INFO [GroupMetadataManager brokerId=2] Group > > > spark-kafka-source-c2b4063b-baf3-40aa-af8b-1c0ea4246ca8-196930323-driver-0 > > transitioned to Dead in generation 2 > > (kafka.coordinator.group.GroupMetadataManager) > > [2022-05-21 09:07:35,760] INFO [GroupMetadataManager brokerId=2] Group > > > spark-kafka-source-800e8ff9-02f4-4c34-bfc9-2d73d723b4e3-1912945116-driver-0 > > transitioned to Dead in generation 2 > > (kafka.coordinator.group.GroupMetadataManager) > > [2022-05-21 09:07:35,760] INFO [GroupMetadataManager brokerId=2] Removed > 0 > > expired offsets in 0 milliseconds. > > (kafka.coordinator.group.GroupMetadataManager) > > [2022-05-21 09:07:37,543] INFO [ProducerStateManager > > partition=auditTopic-0] Writing producer snapshot at offset 145649350 > > (kafka.log.ProducerStateManager) > > [2022-05-21 09:07:37,544] INFO [Log partition=auditTopic-0, > > dir=/home/tomcat/kafka_2.11-0.9.0.1/commitlogs] Rolled new log segment at > > offset 145649350 in 2 ms. (kafka.log.Log) > > [2022-05-21 09:07:39,207] INFO [ProducerStateManager > > partition=taxonomyOthers-48] Writing producer snapshot at offset > 162628223 > > (kafka.log.ProducerStateManager) > > [2022-05-21 09:07:39,207] INFO [Log partition=taxonomyOthers-48, > > dir=/home/tomcat/kafka_2.11-0.9.0.1/commitlogs] Rolled new log segment at > > offset 162628223 in 2 ms. (kafka.log.Log) > > [2022-05-21 09:07:40,103] INFO Creating /controller (is it secure? false) > > (kafka.zk.KafkaZkClient) > > [2022-05-21 09:07:40,115] INFO Result of znode creation at /controller > is: > > OK (kafka.zk.KafkaZkClient) > > [2022-05-21 09:07:45,915] WARN Client session timed out, have not heard > > from server in 5138ms for sessionid 0x27b85a3ead1005a > > (org.apache.zookeeper.ClientCnxn) > > [2022-05-21 09:07:45,916] INFO Client session timed out, have not heard > > from server in 5138ms for sessionid 0x27b85a3ead1005a, closing socket > > connection and attempting reconnect (org.apache.zookeeper.ClientCnxn) > > [2022-05-21 09:07:45,933] WARN Attempting to send response via channel > for > > which there is no open connection, connection id 10.200.3.174:9092 > > -10.200.2.18:33360-5632853 (kafka.network.Processor) > > [2022-05-21 09:07:45,936] WARN Attempting to send response via channel > for > > which there is no open connection, connection id 10.200.3.174:9092 > > -10.200.3.29:18284-5535253 (kafka.network.Processor) > > [2022-05-21 09:07:45,995] WARN Attempting to send response via channel > for > > which there is no open connection, connection id 10.200.3.174:9092 > > -10.200.4.128:6078-5535526 (kafka.network.Processor) > > [2022-05-21 09:07:45,996] WARN Attempting to send response via channel > for > > which there is no open connection, connection id 10.200.3.174:9092 > > -10.200.4.129:6664-5535480 (kafka.network.Processor) > > [2022-05-21 09:07:46,028] INFO [ZooKeeperClient] Waiting until connected. > > (kafka.zookeeper.ZooKeeperClient) > > [2022-05-21 09:07:46,028] INFO [ZooKeeperClient] Connected. > > (kafka.zookeeper.ZooKeeperClient) > > [2022-05-21 09:07:46,902] INFO Opening socket connection to server > 10.200/ > > 10.0.0.200:2181. Will not attempt to authenticate using SASL (unknown > > error) (org.apache.zookeeper.ClientCnxn) > > [2022-05-21 09:07:48,027] WARN Client session timed out, have not heard > > from server in 2001ms for sessionid 0x27b85a3ead1005a > > (org.apache.zookeeper.ClientCnxn) > > [2022-05-21 09:07:48,027] INFO Client session timed out, have not heard > > from server in 2001ms for sessionid 0x27b85a3ead1005a, closing socket > > connection and attempting reconnect (org.apache.zookeeper.ClientCnxn) > > [2022-05-21 09:07:48,129] INFO [ZooKeeperClient] Waiting until connected. > > (kafka.zookeeper.ZooKeeperClient) > > [2022-05-21 09:07:46,028] INFO [ZooKeeperClient] Connected. > > (kafka.zookeeper.ZooKeeperClient) > > [2022-05-21 09:07:46,902] INFO Opening socket connection to server > 10.200/ > > 10.0.0.200:2181. Will not attempt to authenticate using SASL (unknown > > error) (org.apache.zookeeper.ClientCnxn) > > [2022-05-21 09:07:48,027] WARN Client session timed out, have not heard > > from server in 2001ms for sessionid 0x27b85a3ead1005a > > (org.apache.zookeeper.ClientCnxn) > > [2022-05-21 09:07:48,027] INFO Client session timed out, have not heard > > from server in 2001ms for sessionid 0x27b85a3ead1005a, closing socket > > connection and attempting reconnect (org.apache.zookeeper.ClientCnxn) > > [2022-05-21 09:07:48,129] INFO [ZooKeeperClient] Waiting until connected. > > (kafka.zookeeper.ZooKeeperClient) > > [2022-05-21 09:07:48,294] INFO Opening socket connection to server > > 10.200.3.174/10.200.3.174:2181. Will not attempt to authenticate using > > SASL > > (unknown error) (org.apache.zookeeper.ClientCnxn) > > [2022-05-21 09:07:48,294] INFO Socket connection established to > > 10.200.3.174/10.200.3.174:2181, initiating session > > (org.apache.zookeeper.ClientCnxn) > > [2022-05-21 09:07:48,313] WARN Unable to reconnect to ZooKeeper service, > > session 0x27b85a3ead1005a has expired (org.apache.zookeeper.ClientCnxn) > > [2022-05-21 09:07:48,314] INFO Unable to reconnect to ZooKeeper service, > > session 0x27b85a3ead1005a has expired, closing socket connection > > (org.apache.zookeeper.ClientCnxn) > > [2022-05-21 09:07:48,314] INFO EventThread shut down for session: > > 0x27b85a3ead1005a (org.apache.zookeeper.ClientCnxn) > > [2022-05-21 09:07:48,318] INFO [ZooKeeperClient] Session expired. > > (kafka.zookeeper.ZooKeeperClient) > > [2022-05-21 09:07:48,396] INFO [ProducerStateManager > > partition=efGamePlay-41] Writing producer snapshot at offset 2116645612 > > (kafka.log.ProducerStateManager) > > [2022-05-21 09:07:48,397] INFO [Log partition=efGamePlay-41, > > dir=/home/tomcat/kafka_2.11-0.9.0.1/commitlogs] Rolled new log segment at > > offset 2116645612 in 2 ms. (kafka.log.Log) > > [2022-05-21 09:07:49,452] INFO [ProducerStateManager > > partition=efAcquisition-47] Writing producer snapshot at offset 252224167 > > (kafka.log.ProducerStateManager) > > [2022-05-21 09:07:49,453] INFO [Log partition=efAcquisition-47, > > dir=/home/tomcat/kafka_2.11-0.9.0.1/commitlogs] Rolled new log segment at > > offset 252224167 in 3 ms. (kafka.log.Log) > > [2022-05-21 09:07:57,475] INFO [ProducerStateManager > > partition=efGamePlay-5] Writing producer snapshot at offset 2116617455 > > (kafka.log.ProducerStateManager) > > [2022-05-21 09:07:57,476] INFO [Log partition=efGamePlay-5, > > dir=/home/tomcat/kafka_2.11-0.9.0.1/commitlogs] Rolled new log segment at > > offset 2116617455 in 2 ms. (kafka.log.Log) > > [2022-05-21 09:08:00,642] INFO [Partition prism_sqldb_topic-55 broker=2] > > Shrinking ISR from 2,1 to 2 (kafka.cluster.Partition) > > [2022-05-21 09:08:00,782] WARN Attempting to send response via channel > for > > which there is no open connection, connection id 10.200.3.174:9092 > > -10.200.2.192:52088-5632843 (kafka.network.Processor) > > [2022-05-21 09:08:05,321] INFO [Log partition=efGamePlay-41, > > dir=/home/tomcat/kafka_2.11-0.9.0.1/commitlogs] Found deletable segments > > with base offsets [2116567157] due to retention time 3600000ms breach > > (kafka.log.Log) > > >