[jira] [Commented] (KAFKA-13572) Negative value for 'Preferred Replica Imbalance' metric

Haruki Okada (Jira) Thu, 14 Jul 2022 07:48:08 -0700


    [ 
https://issues.apache.org/jira/browse/KAFKA-13572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566872#comment-17566872
 ]


Haruki Okada commented on KAFKA-13572:
--------------------------------------

We experienced a similar phenomenon in our Kafka cluster, and we found that the 
following scenario can cause a negative metric.

Let's say there are two topics, topic-A and topic-B.

 
 # Initiate topic deletion of topic-A
 ** TopicDeletionManager#enqueueTopicsForDeletion is called with argument 
Set(topic-A)
 *** 
[https://github.com/apache/kafka/blob/3.2.0/core/src/main/scala/kafka/controller/KafkaController.scala#L1771]
 # During topic-A's deletion procedure, all of topic-A's partitions are marked 
as Offline (Leader = -1)
 ** 
[https://github.com/apache/kafka/blob/3.2.0/core/src/main/scala/kafka/controller/ReplicaStateMachine.scala#L368]
 # Before topic-A's deletion procedure completes, initiate topic deletion of 
topic-B
 ** Since topic-A's ZK delete-topic node still exists, 
TopicDeletionManager#enqueueTopicsForDeletion is called with argument 
Set(topic-A, topic-B)
 ** ControllerContext#cleanPreferredReplicaImbalanceMetric is called for both 
topic-A, topic-B
 *** 
[https://github.com/apache/kafka/blob/3.2.0/core/src/main/scala/kafka/controller/ControllerContext.scala#L496]
 *** Since topic-A's partitions now have no leader, 
`!hasPreferredLeader(replicaAssignment, leadershipInfo)` evaluates to true, so 
`preferredReplicaImbalanceCount` is decremented unexpectedly
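
The sequence above can be reproduced with a small toy model. This is only a 
sketch in Python, not Kafka code: the class and method names are hypothetical, 
and it assumes that leadership changes for a topic already queued for deletion 
do not update the counter, which is what the scenario implies.

```python
NO_LEADER = -1


class ToyControllerContext:
    """Toy stand-in for the controller's bookkeeping (hypothetical, not Kafka code)."""

    def __init__(self):
        self.assignments = {}          # (topic, partition) -> replica list
        self.leaders = {}              # (topic, partition) -> current leader
        self.topics_to_delete = set()
        self.imbalance_count = 0

    def _is_imbalanced(self, tp):
        # A partition is imbalanced when its leader is not the first replica.
        return self.leaders[tp] != self.assignments[tp][0]

    def add_partition(self, topic, partition, replicas, leader):
        tp = (topic, partition)
        self.assignments[tp] = list(replicas)
        self.leaders[tp] = leader
        if self._is_imbalanced(tp):
            self.imbalance_count += 1

    def set_leader(self, tp, leader):
        # Assumption: the counter is left alone for topics queued for deletion.
        if tp[0] in self.topics_to_delete:
            self.leaders[tp] = leader
            return
        if self._is_imbalanced(tp):
            self.imbalance_count -= 1
        self.leaders[tp] = leader
        if self._is_imbalanced(tp):
            self.imbalance_count += 1

    def queue_topic_deletion(self, topics):
        # Every call cleans the metric for all given topics, including topics
        # that were already queued by an earlier call.
        self.topics_to_delete |= set(topics)
        for tp in self.assignments:
            if tp[0] in topics and self._is_imbalanced(tp):
                self.imbalance_count -= 1


def simulate():
    ctx = ToyControllerContext()
    ctx.add_partition("topic-A", 0, [1, 2], leader=1)
    ctx.add_partition("topic-A", 1, [2, 1], leader=2)
    ctx.add_partition("topic-B", 0, [1, 2], leader=1)

    ctx.queue_topic_deletion({"topic-A"})             # step 1: count stays 0
    for tp in [("topic-A", 0), ("topic-A", 1)]:
        ctx.set_leader(tp, NO_LEADER)                 # step 2: count stays 0
    ctx.queue_topic_deletion({"topic-A", "topic-B"})  # step 3: topic-A cleaned again
    return ctx.imbalance_count


print(simulate())  # -2
```

In this model every partition starts on its preferred leader, so the counter 
goes negative by exactly the number of topic-A partitions.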

> Negative value for 'Preferred Replica Imbalance' metric
> -------------------------------------------------------
>
>                 Key: KAFKA-13572
>                 URL: https://issues.apache.org/jira/browse/KAFKA-13572
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 2.7.0
>            Reporter: Siddharth Ahuja
>            Priority: Major
>         Attachments: 
> kafka_negative_preferred-replica-imbalance-count_jmx_2.JPG
>
>
> A negative value (-822) for the metric - 
> {{kafka_controller_kafkacontroller_preferredreplicaimbalancecount}} has been 
> observed - please see the attached screenshot and the output below:
> {code:java}
> $ curl -s http://localhost:9101/metrics | fgrep 
> 'kafka_controller_kafkacontroller_preferredreplicaimbalancecount'
> # HELP kafka_controller_kafkacontroller_preferredreplicaimbalancecount 
> Attribute exposed for management (kafka.controller<type=KafkaController, 
> name=PreferredReplicaImbalanceCount><>Value)
> # TYPE kafka_controller_kafkacontroller_preferredreplicaimbalancecount gauge
> kafka_controller_kafkacontroller_preferredreplicaimbalancecount -822.0
> {code}
> The issue appeared after an operation in which the number of partitions for 
> some topics was increased, and some topics were deleted/created in order to 
> decrease the number of their partitions.
> I ran the following command to check whether there are any instances where the 
> preferred leader (1st broker in the Replica list) is not the current Leader:
>  
> {code:java}
> % grep ".*Topic:.*Partition:.*Leader:.*Replicas:.*Isr:.*Offline:.*" 
> kafka-topics_describe.out | awk '{print $6 " " $8}' | cut -d "," -f1 | awk 
> '{print $0, ($1==$2?_:"NOT") "MATCHED"}'|grep NOT | wc -l
>      0
> {code}
> but could not find any such instances.
> {{leader.imbalance.per.broker.percentage=2}} is set for all the brokers in 
> the cluster which means that we are allowed to have an imbalance of up to 2% 
> for preferred leaders. This seems to be a valid value, as such, this setting 
> should not contribute towards a negative metric.
> The metric seems to be decremented in the code 
> [here|https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/controller/ControllerContext.scala#L474-L503].
> However, it is not clear when it can become negative (i.e. decremented more 
> often than incremented), given the absence of comments or debug/trace level 
> logs in the code. One thing is certain: you either have no imbalance (0) or 
> have imbalance (> 0); it doesn’t make sense for the metric to be < 0. 
> FWIW, no other anomalies besides this have been detected.
> Considering these metrics get actively monitored, we should look at adding 
> DEBUG/TRACE logging around the addition/subtraction of these metrics (and 
> elsewhere where appropriate) to identify any potential issues.



