Hi, I have a Kafka cluster with 3 brokers and a Kafka Streams application that processes data from a Kafka topic. Both Kafka and Kafka Streams are version 2.7.0. The setup works fine for some time, but later Kafka Streams starts having issues, and the logs show the errors below:
- Execution error java.util.concurrent.ExecutionException: org.apache.kafka.common.errors.TimeoutException
- Rebalance failed. org.apache.kafka.common.errors.DisconnectException
- Rebalance failed. org.apache.kafka.common.errors.CoordinatorLoadInProgressException: The coordinator is loading and hence can't process requests.

Whenever I restart the Kafka cluster and Kafka Streams, everything works fine again for some time. I am not sure exactly where the problem is, but I looked at the Kafka metrics below. I think the metrics marked with **** mostly have a high error rate, and it looks like the *kafka cluster is unstable*.

network<type=RequestMetrics, name=ErrorsPerSec, request=ApiVersions, error=NONE><>Count)
# TYPE kafka_network_requestmetrics_errors_total untyped
kafka_network_requestmetrics_errors_total{request="ApiVersions, error=NONE",} 510.0
kafka_network_requestmetrics_errors_total{request="Fetch, error=NOT_LEADER_OR_FOLLOWER",} 53.0
**** kafka_network_requestmetrics_errors_total{request="Fetch, error=FENCED_LEADER_EPOCH",} 334507.0
kafka_network_requestmetrics_errors_total{request="JoinGroup, error=NONE",} 9.0
kafka_network_requestmetrics_errors_total{request="JoinGroup, error=COORDINATOR_LOAD_IN_PROGRESS",} 252.0
kafka_network_requestmetrics_errors_total{request="OffsetForLeaderEpoch, error=UNKNOWN_LEADER_EPOCH",} 42.0
kafka_network_requestmetrics_errors_total{request="JoinGroup, error=NOT_COORDINATOR",} 2.0
**** kafka_network_requestmetrics_errors_total{request="LeaderAndIsr, error=NONE",} 346.0
kafka_network_requestmetrics_errors_total{request="OffsetForLeaderEpoch, error=NONE",} 62.0
**** kafka_network_requestmetrics_errors_total{request="FindCoordinator, error=COORDINATOR_NOT_AVAILABLE",} 104.0
kafka_network_requestmetrics_errors_total{request="ListOffsets, error=NOT_LEADER_OR_FOLLOWER",} 15.0
kafka_network_requestmetrics_errors_total{request="SyncGroup, error=UNKNOWN_MEMBER_ID",} 3.0
**** kafka_network_requestmetrics_errors_total{request="OffsetCommit, error=NONE",} 1883.0
kafka_network_requestmetrics_errors_total{request="Heartbeat, error=NOT_COORDINATOR",} 2.0
**** kafka_network_requestmetrics_errors_total{request="Metadata, error=NONE",} 1091.0
kafka_network_requestmetrics_errors_total{request="Heartbeat, error=UNKNOWN_MEMBER_ID",} 5.0
kafka_network_requestmetrics_errors_total{request="ListOffsets, error=FENCED_LEADER_EPOCH",} 5.0
**** kafka_network_requestmetrics_errors_total{request="DeleteRecords, error=NONE",} 756.0
kafka_network_requestmetrics_errors_total{request="OffsetFetch, error=NONE",} 134.0
kafka_network_requestmetrics_errors_total{request="FindCoordinator, error=NONE",} 19.0
kafka_network_requestmetrics_errors_total{request="ListOffsets, error=NONE",} 321.0
kafka_network_requestmetrics_errors_total{request="SyncGroup, error=NONE",} 6.0
kafka_network_requestmetrics_errors_total{request="JoinGroup, error=MEMBER_ID_REQUIRED",} 9.0
kafka_network_requestmetrics_errors_total{request="UpdateMetadata, error=NONE",} 13.0
kafka_network_requestmetrics_errors_total{request="Fetch, error=UNKNOWN_LEADER_EPOCH",} 88.0
**** kafka_network_requestmetrics_errors_total{request="Fetch, error=NONE",} 16927.0
**** kafka_network_requestmetrics_errors_total{request="Heartbeat, error=NONE",} 18353.0
kafka_network_requestmetrics_errors_total{request="OffsetForLeaderEpoch, error=UNKNOWN_TOPIC_OR_PARTITION",} 24.0
kafka_network_requestmetrics_errors_total{request="OffsetForLeaderEpoch, error=NOT_LEADER_OR_FOLLOWER",} 17.0
**** kafka_network_requestmetrics_errors_total{request="Produce, error=NONE",} 4450.0
# HELP jmx_scrape_error Non-zero if this scrape failed.
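
To make the scrape above easier to read, here is a minimal Python sketch that ranks the counters by size. It relies on one assumption about the metric: as far as I understand, series labelled error=NONE count requests that completed without an error, so the sketch drops them and keeps only the other error codes. The function name `real_errors` and the `SAMPLE` constant are made up for illustration, and `SAMPLE` holds only a few lines copied from the output above.

```python
import re

# A few lines copied from the scrape above; paste the full output in practice.
SAMPLE = '''
kafka_network_requestmetrics_errors_total{request="Fetch, error=NOT_LEADER_OR_FOLLOWER",} 53.0
**** kafka_network_requestmetrics_errors_total{request="Fetch, error=FENCED_LEADER_EPOCH",} 334507.0
**** kafka_network_requestmetrics_errors_total{request="Fetch, error=NONE",} 16927.0
kafka_network_requestmetrics_errors_total{request="JoinGroup, error=COORDINATOR_LOAD_IN_PROGRESS",} 252.0
kafka_network_requestmetrics_errors_total{request="JoinGroup, error=NONE",} 9.0
**** kafka_network_requestmetrics_errors_total{request="FindCoordinator, error=COORDINATOR_NOT_AVAILABLE",} 104.0
**** kafka_network_requestmetrics_errors_total{request="Heartbeat, error=NONE",} 18353.0
'''

# Matches the label layout used in this scrape: {request="<api>, error=<code>",} <count>
LINE_RE = re.compile(r'\{request="(\w+), error=(\w+)",\}\s+([0-9.]+)')


def real_errors(text):
    """Return (request, error, count) for every error code except NONE, largest first."""
    rows = [(req, err, float(count)) for req, err, count in LINE_RE.findall(text)]
    # Assumption: error=NONE means the request succeeded, so it is not a failure.
    failures = [row for row in rows if row[1] != "NONE"]
    return sorted(failures, key=lambda row: row[2], reverse=True)


if __name__ == "__main__":
    for req, err, count in real_errors(SAMPLE):
        print(f"{count:>10.0f}  {req:<16} {err}")
```

Run against the full output, this puts Fetch with FENCED_LEADER_EPOCH (334507) at the top, far ahead of every other non-NONE code.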

These are the Kafka configurations that I used:

rm -f /var/lib/kafka/kafka-0/.lock; rm -f /var/lib/kafka/kafka-0/meta.properties;
exec kafka-server-start.sh /opt/kafka/config/server.properties \
  --override unclean.leader.election.enable=true \
  --override broker.id=0 \
  --override listeners=PLAINTEXT://\${LOCAL_POD_IP}:9093 \
  --override host.name=#[[${HOSTNAME}]]# \
  --override advertised.listeners=PLAINTEXT://\${LOCAL_POD_IP}:9093 \
  --override log.dirs=/var/lib/kafka/kafka-0 \
  --override auto.create.topics.enable=true \
  --override auto.leader.rebalance.enable=true \
  --override compression.type=producer \
  --override delete.topic.enable=false \
  --override offsets.topic.replication.factor=2 \
  --override broker.id.generation.enable=true \
  --override default.replication.factor=2 \
  --override num.partitions=10 \
  --override log.retention.bytes=536870912000 \
  --override socket.request.max.bytes=1195725856 \
  --override log.retention.hours=360 \
  --override log.roll.hours=360 \
  --override max.message.bytes=5242880 \
  --override zookeeper.ssl.endpoint.identification.algorithm \
  --override zookeeper.ssl.client.enable=true \
  --override zookeeper.clientCnxnSocket=org.apache.zookeeper.ClientCnxnSocketNetty \
  --override zookeeper.ssl.keystore.type=PEM \
  --override zookeeper.ssl.truststore.type=PEM \
  --override zookeeper.ssl.keystore.location=/zoo/cert_key \
  --override zookeeper.ssl.truststore.location=/zoo/caBundle \
  --override zookeeper.set.acl=false \
  --override zookeeper.connect=zookeeper-svc:4095/kafka-brokers/kafka

Please have a look at the Kafka configurations and metrics, and let me know what changes I should make to keep the Kafka cluster stable. Thanks in advance for looking into this.

--
Thanks & Regards,
Prasad.