Steven Blumenthal created KAFKA-8936:
----------------------------------------

             Summary: Connect metrics have a chance to disappear on rebalance
                 Key: KAFKA-8936
                 URL: https://issues.apache.org/jira/browse/KAFKA-8936
             Project: Kafka
          Issue Type: Bug
          Components: KafkaConnect, metrics
    Affects Versions: 2.0.0
            Reporter: Steven Blumenthal


We encountered an interesting problem with our Connect cluster. At seemingly 
random times, some Connect sink task metrics would disappear from Datadog 
(which is where we send these metrics). After some investigation, I noticed 
that the metrics in question weren't being reported by the Connect workers 
themselves.

After some more investigation, I noticed that the metrics stopped reporting 
after a rebalance was triggered. Our logs were filled with "Graceful stop of 
task ... failed". Digging into what the code does when this error occurs, it 
appears to mean that stopping the tasks timed out for whatever reason, and the 
Connect worker will no longer wait for them to stop. They will still stop 
eventually, but in the meantime new tasks can be spun up. 
([Worker.java|https://github.com/apache/kafka/blob/2.0.0/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/Worker.java#L587], 
which calls 
[WorkerTask.cancel()|https://github.com/apache/kafka/blob/2.0.0/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerTask.java#L120])
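
Roughly, the pattern looks like this (a simplified sketch of the stop/cancel behavior described above, not the actual Worker/WorkerTask code):

{code:java}
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Sketch: the worker waits a bounded time for a task to stop and, on timeout,
// only marks it cancelled and moves on, so the old task thread keeps running
// for a while after the rebalance.
public class GracefulStopSketch {

    static class Task implements Runnable {
        private final CountDownLatch shutdownLatch = new CountDownLatch(1);
        private volatile boolean stopping = false;
        private volatile boolean cancelled = false;

        @Override
        public void run() {
            try {
                while (!stopping) {
                    Thread.sleep(100); // stand-in for the task's poll/put loop
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                shutdownLatch.countDown(); // released only once the task has fully stopped
            }
        }

        void stop()   { stopping = true; }
        void cancel() { cancelled = true; } // like WorkerTask.cancel(): flag it, don't wait

        boolean awaitStop(long timeoutMs) throws InterruptedException {
            return shutdownLatch.await(timeoutMs, TimeUnit.MILLISECONDS);
        }
    }

    public static void main(String[] args) throws Exception {
        Task task = new Task();
        Thread thread = new Thread(task);
        thread.start();

        task.stop();
        // Roughly what task.shutdown.graceful.timeout.ms governs: if the task has
        // not stopped within the timeout, it is cancelled and the worker carries
        // on, free to start a replacement task while the old one winds down.
        if (!task.awaitStop(10)) {
            System.out.println("Graceful stop of task failed; cancelling");
            task.cancel();
        }
        thread.join(); // the old task still finishes eventually, just later
    }
}
{code}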

So, new tasks are spun up and begin consuming records and doing work. Then, at 
some point, the old task is finally removed, and the very last thing that 
happens during that removal is that the metric group associated with the task 
is removed. 
([WorkerTask.java|https://github.com/apache/kafka/blob/2.0.0/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerTask.java#L232], 
which, in this case, calls 
[WorkerSinkTask.java|https://github.com/apache/kafka/blob/2.0.0/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSinkTask.java#L179])
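
In other words, the shutdown ordering looks roughly like this (hypothetical names, not the real WorkerTask/WorkerSinkTask code): removing the task's metric group is the final step of the old task's close, so it only happens once the delayed stop actually completes.

{code:java}
// Sketch of the close ordering: whatever else close() does, removing the
// task's metric group is the last action, possibly long after the
// replacement task has already started on the same worker.
public class TaskCloseOrderingSketch {

    interface MetricGroup {
        void close(); // unregisters all sensors/metrics in the group
    }

    static class SinkTaskShutdown {
        private final MetricGroup taskMetrics;

        SinkTaskShutdown(MetricGroup taskMetrics) {
            this.taskMetrics = taskMetrics;
        }

        void close() {
            try {
                // stop the user task, commit offsets, close the consumer, etc.
                System.out.println("releasing task resources");
            } finally {
                // the very last act of the old task: its metric group is removed
                taskMetrics.close();
            }
        }
    }

    public static void main(String[] args) {
        SinkTaskShutdown shutdown =
            new SinkTaskShutdown(() -> System.out.println("metric group removed"));
        shutdown.close();
    }
}
{code}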

The issue is that task-based metrics are registered under a set of tags that 
one would expect not to change at runtime. That means that when the old task is 
eventually removed, it removes the metric group that the new task is now using 
(if the new task came up on the same Connect node that the old task was running 
on). 
([WorkerSinkTask.java|https://github.com/apache/kafka/blob/2.0.0/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSinkTask.java#L721])
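
To make the collision concrete, here is a toy sketch (a hypothetical registry, not the actual ConnectMetrics API) of how a group keyed purely by connector/task tags gets wiped out by the old task's delayed close:

{code:java}
import java.util.HashMap;
import java.util.Map;

// Sketch: metric groups are keyed only by connector/task tags, which are
// identical for the old task and its replacement, so the old task's delayed
// removal also deletes the new task's group.
public class MetricGroupCollisionSketch {

    static final Map<String, Map<String, Long>> GROUPS = new HashMap<>();

    static String key(String connector, int task) {
        return "connector=" + connector + ",task=" + task;
    }

    static Map<String, Long> registerGroup(String connector, int task) {
        return GROUPS.computeIfAbsent(key(connector, task), k -> new HashMap<>());
    }

    static void removeGroup(String connector, int task) {
        GROUPS.remove(key(connector, task)); // the old task's close ends with a call like this
    }

    public static void main(String[] args) {
        // Old task registered its group before the rebalance.
        registerGroup("my-sink", 0);

        // Rebalance: graceful stop times out, old task is cancelled but still alive.
        // The replacement task starts on the same worker, registers a group under
        // the exact same tags, and starts recording.
        Map<String, Long> newGroup = registerGroup("my-sink", 0);
        newGroup.put("sink-record-read-total", 42L);

        // The old task finally finishes stopping and removes "its" group,
        // which is keyed identically, so the new task's metrics vanish.
        removeGroup("my-sink", 0);

        System.out.println(GROUPS.containsKey(key("my-sink", 0))); // false: metrics gone
    }
}
{code}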

I tried increasing the "task.shutdown.graceful.timeout.ms" config to 3 times 
what it had previously been set to; however, that did not completely remove the 
problem. And even if it had, it doesn't change the fact that a minor network 
blip in our Connect cluster could force us to redeploy the code simply because 
metrics went missing due to task shutdowns taking longer than intended.
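
For reference, this is the worker configuration knob we tuned (the value below is only an illustration, not what we actually ran with):

{code}
# Connect worker properties
# How long the worker waits for a task to stop gracefully before
# cancelling it and moving on (example value only).
task.shutdown.graceful.timeout.ms=15000
{code}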


