[ https://issues.apache.org/jira/browse/KAFKA-13572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566872#comment-17566872 ]
Haruki Okada commented on KAFKA-13572:
--------------------------------------

We experienced a similar phenomenon in our Kafka cluster, and we found that the following scenario can cause a negative metric.

Let's say there are two topics, topic-A and topic-B.
 # Initiate topic deletion of topic-A
 ** TopicDeletionManager#enqueueTopicsForDeletion is called with argument Set(topic-A)
 *** [https://github.com/apache/kafka/blob/3.2.0/core/src/main/scala/kafka/controller/KafkaController.scala#L1771]
 # During topic-A's deletion procedure, all of topic-A's partitions are marked as Offline (Leader = -1)
 ** [https://github.com/apache/kafka/blob/3.2.0/core/src/main/scala/kafka/controller/ReplicaStateMachine.scala#L368]
 # Before topic-A's deletion procedure completes, initiate topic deletion of topic-B
 ** Since topic-A's ZK delete-topic node still exists, TopicDeletionManager#enqueueTopicsForDeletion is called with argument Set(topic-A, topic-B)
 ** ControllerContext#cleanPreferredReplicaImbalanceMetric is called for both topic-A and topic-B
 *** [https://github.com/apache/kafka/blob/3.2.0/core/src/main/scala/kafka/controller/ControllerContext.scala#L496]
 *** Since topic-A's partitions now have no leader, `!hasPreferredLeader(replicaAssignment, leadershipInfo)` evaluates to true, so `preferredReplicaImbalanceCount` is decremented unexpectedly

> Negative value for 'Preferred Replica Imbalance' metric
> -------------------------------------------------------
>
>                 Key: KAFKA-13572
>                 URL: https://issues.apache.org/jira/browse/KAFKA-13572
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 2.7.0
>            Reporter: Siddharth Ahuja
>            Priority: Major
>         Attachments: kafka_negative_preferred-replica-imbalance-count_jmx_2.JPG
>
>
> A negative value (-822) for the metric
> {{kafka_controller_kafkacontroller_preferredreplicaimbalancecount}} has been
> observed. Please see the attached screenshot and the output below:
> {code:java}
> $ curl -s http://localhost:9101/metrics | fgrep 'kafka_controller_kafkacontroller_preferredreplicaimbalancecount'
> # HELP kafka_controller_kafkacontroller_preferredreplicaimbalancecount Attribute exposed for management (kafka.controller<type=KafkaController, name=PreferredReplicaImbalanceCount><>Value)
> # TYPE kafka_controller_kafkacontroller_preferredreplicaimbalancecount gauge
> kafka_controller_kafkacontroller_preferredreplicaimbalancecount -822.0
> {code}
> The issue appeared after an operation in which the number of partitions for
> some topics was increased, and some topics were deleted and re-created in
> order to decrease the number of their partitions.
> Ran the following command to check whether there are any instances where the
> preferred leader (1st broker in the Replica list) is not the current Leader:
>
> {code:java}
> % grep ".*Topic:.*Partition:.*Leader:.*Replicas:.*Isr:.*Offline:.*" kafka-topics_describe.out | awk '{print $6 " " $8}' | cut -d "," -f1 | awk '{print $0, ($1==$2?_:"NOT") "MATCHED"}'|grep NOT | wc -l
> 0
> {code}
> but could not find any such instances.
> {{leader.imbalance.per.broker.percentage=2}} is set for all the brokers in
> the cluster, which means that we are allowed to have an imbalance of up to 2%
> for preferred leaders. This seems to be a valid value; as such, this setting
> should not contribute towards a negative metric.
> The metric seems to be getting subtracted in the code
> [here|https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/controller/ControllerContext.scala#L474-L503],
> however it is not clear when it can become negative (i.e. subtracted more
> than added) in the absence of any comments or debug/trace level logs in the
> code.
> However, one thing is for sure: you either have no imbalance (0) or have
> imbalance (> 0); it doesn’t make sense for the metric to be < 0.
> FWIW, no other anomalies besides this have been detected.
> Considering these metrics get actively monitored, we should look at adding
> DEBUG/TRACE logging around the addition/subtraction of these metrics (and
> elsewhere where appropriate) to identify any potential issues.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
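
For illustration, the three-step scenario in the comment above can be replayed with a small toy model. This is only a sketch in Python, not the Kafka controller code: the class {{ControllerModel}}, its methods and the replica lists are hypothetical, and it assumes that the leader change to -1 during deletion does not itself adjust the counter.

```python
NO_LEADER = -1  # leader id of an offline partition, as in step 2 of the scenario


class ControllerModel:
    """Toy model of the imbalance counter; NOT the real ControllerContext."""

    def __init__(self):
        self.assignment = {}      # (topic, partition) -> replica list
        self.leader = {}          # (topic, partition) -> current leader id
        self.imbalance_count = 0  # stands in for preferredReplicaImbalanceCount

    def add_partition(self, topic, partition, replicas):
        # New partitions start on their preferred leader (first replica).
        self.assignment[(topic, partition)] = replicas
        self.leader[(topic, partition)] = replicas[0]

    def has_preferred_leader(self, tp):
        return self.leader[tp] == self.assignment[tp][0]

    def mark_offline(self, topic):
        # Assumption: this leader change does not touch the counter.
        for tp in self.leader:
            if tp[0] == topic:
                self.leader[tp] = NO_LEADER

    def enqueue_topics_for_deletion(self, topics):
        # Clean-up as described in the comment: one decrement for every
        # partition that is not currently led by its preferred replica.
        for tp in self.assignment:
            if tp[0] in topics and not self.has_preferred_leader(tp):
                self.imbalance_count -= 1


def run_scenario():
    ctx = ControllerModel()
    for p in range(3):
        ctx.add_partition("topic-A", p, [1, 2, 3])
        ctx.add_partition("topic-B", p, [2, 3, 1])
    ctx.enqueue_topics_for_deletion({"topic-A"})             # step 1: nothing to subtract
    ctx.mark_offline("topic-A")                              # step 2: Leader = -1
    ctx.enqueue_topics_for_deletion({"topic-A", "topic-B"})  # step 3: topic-A cleaned again
    return ctx.imbalance_count


print(run_scenario())  # -3: one unexpected decrement per topic-A partition
```

In this model the counter starts at 0 and nothing is ever added, so the second clean-up pass over topic-A is the only source of the negative value.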