Sagar Rao created KAFKA-15229:
---------------------------------

             Summary: Increase default value of task.shutdown.graceful.timeout.ms
                 Key: KAFKA-15229
                 URL: https://issues.apache.org/jira/browse/KAFKA-15229
             Project: Kafka
          Issue Type: Improvement
          Components: KafkaConnect
            Reporter: Sagar Rao
            Assignee: Sagar Rao

The Kafka Connect config [task.shutdown.graceful.timeout.ms|https://kafka.apache.org/documentation/#connectconfigs_task.shutdown.graceful.timeout.ms] has a default value of 5s. As per its definition:

{noformat}
Amount of time to wait for tasks to shutdown gracefully. This is the total amount of time, not per task. All task have shutdown triggered, then they are waited on sequentially.
{noformat}

it is the total timeout for all tasks to shut down, and when multiple tasks are being shut down they are waited upon sequentially.

The default value is fine for smaller clusters with fewer tasks, but on a larger cluster the timeout can elapse before all tasks have stopped, and we then see a lot of messages of the form

```
Graceful stop of task <task-id> failed.
```

When the graceful stop of a task fails, the task is cancelled, which means it won't send out a status update, so no `UNASSIGNED` status message is posted for that task. Let's say the task stop was triggered by a worker going down. If the cluster is configured to use the Incremental Cooperative Assignor, the task won't be reassigned until the scheduled.rebalance.max.delay.ms interval elapses. For that duration, the task shows up with status RUNNING whenever its status is queried, which can be confusing for users. The problem can be exacerbated in cloud environments (such as Kubernetes pods), because there is a high chance that the RUNNING status is associated with an older worker_id that doesn't even exist in the cluster anymore.

The net effect of all this is not catastrophic, i.e. it won't lead to any processing delays or loss of data, but the reported status of the task would be off. And if there are fast rebalances happening under the Incremental Cooperative Assignor, that duration could be high as well.

So, the proposal is to increase the default value.
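
To illustrate the "total amount of time, not per task" semantics quoted above, here is a minimal simulation sketch. It is not the Connect worker code: `FakeTask`, `await_stop_tasks`, and the stop durations are all hypothetical, and it assumes every task starts stopping at the same moment and is then waited on in order against one shared deadline.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FakeTask:
    """Hypothetical task that needs `stop_seconds` to shut down cleanly."""
    task_id: str
    stop_seconds: float


def await_stop_tasks(tasks: List[FakeTask], total_timeout_s: float) -> List[str]:
    """Return ids of tasks whose graceful stop would time out.

    Time is simulated: shutdown is triggered for all tasks at t=0, then
    each task is waited on sequentially with whatever is left of the
    single shared budget (assumed behaviour, per the config description).
    """
    now = 0.0
    deadline = total_timeout_s
    failed = []
    for task in tasks:
        remaining = max(0.0, deadline - now)
        # The task has been stopping since t=0, so only the part of its
        # shutdown that has not yet elapsed still has to be waited for.
        needed = max(0.0, task.stop_seconds - now)
        if needed <= remaining:
            now += needed
        else:
            # Budget exhausted while waiting on this task.
            now = deadline
            failed.append(task.task_id)
    return failed


if __name__ == "__main__":
    tasks = [FakeTask("t1", 2.0), FakeTask("t2", 4.0),
             FakeTask("t3", 6.0), FakeTask("t4", 3.0)]
    print(await_stop_tasks(tasks, 5.0))   # 5s budget
    print(await_stop_tasks(tasks, 60.0))  # 60s budget
```

With these made-up durations, the 5s budget runs out while waiting on `t3`, whereas a 60s budget lets every task stop gracefully. A task that had already finished by the time its turn comes (like `t4`) still counts as stopped even when no budget remains.
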
I am thinking we can set it to 60s because, as far as I can see, it doesn't interfere with any other timeout that we have. I am tagging this as need-kip because I believe we will need one.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)