Sagar Rao created KAFKA-15229:
---------------------------------

             Summary: Increase default value of task.shutdown.graceful.timeout.ms
                 Key: KAFKA-15229
                 URL: https://issues.apache.org/jira/browse/KAFKA-15229
             Project: Kafka
          Issue Type: Improvement
          Components: KafkaConnect
            Reporter: Sagar Rao
            Assignee: Sagar Rao


The Kafka Connect config 
[task.shutdown.graceful.timeout.ms|https://kafka.apache.org/documentation/#connectconfigs_task.shutdown.graceful.timeout.ms] 
has a default value of 5s. As per its definition:

 
{noformat}
Amount of time to wait for tasks to shutdown gracefully. This is the total 
amount of time, not per task. All task have shutdown triggered, then they are 
waited on sequentially.{noformat}

That is, it is the total timeout for all tasks to shut down, and if multiple 
tasks are to be shut down, they are waited upon sequentially. The default value 
of this config is fine for smaller clusters with fewer tasks, but on a larger 
cluster the timeout can elapse before every task has stopped, and we will see a 
lot of messages of the form 

```
Graceful stop of task <task-id> failed.
```
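
To make the "total, not per task" semantics concrete, here is a minimal sketch of the waiting logic as described above. It is a simplified model, not the actual Connect worker code: the function `await_stops`, its parameters, and the task names and stop durations are all hypothetical, and time is simulated rather than measured. It assumes every task has its stop triggered at time 0 and then contrasts a single shared deadline with a hypothetical per-task timeout.

```python
def await_stops(stop_durations_s, timeout_s, shared_deadline=True):
    """Simulate waiting sequentially on tasks whose stop was triggered at t=0.

    stop_durations_s: dict of task id -> seconds the task needs to stop
                      (hypothetical values, for illustration only).
    timeout_s:        the graceful shutdown timeout in seconds.
    shared_deadline:  True models one total timeout shared by all tasks;
                      False models a fresh timeout for every task.
    Returns a dict of task id -> True if the task stopped gracefully.
    """
    now = 0.0
    deadline = timeout_s
    results = {}
    for task_id, done_at in stop_durations_s.items():
        if shared_deadline:
            # Only the time left before the overall deadline may be used.
            remaining = max(deadline - now, 0.0)
        else:
            remaining = timeout_s
        wait_until = now + remaining
        results[task_id] = done_at <= wait_until
        # The wait ends when the task finishes or the wait times out;
        # a task that has already finished returns immediately.
        now = max(now, min(done_at, wait_until))
    return results


# Hypothetical stop durations in seconds.
durations = {"task-0": 2, "task-1": 4, "task-2": 6, "task-3": 9}

for task_id, ok in await_stops(durations, timeout_s=5).items():
    if not ok:
        print(f"Graceful stop of task {task_id} failed.")
```

With the 5s timeout in this model, any task that needs more than 5s in total is reported as failed, no matter how many tasks were waited on before it. Passing `timeout_s=60` lets all four hypothetical tasks stop gracefully.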

If the graceful stop of a task fails, the task is cancelled, which means that 
it won't send out a status update, so no `UNASSIGNED` status message is ever 
posted for that task. Let's say the task stop was triggered by a worker going 
down. If the cluster is configured to use the Incremental Cooperative Assignor, 
the task won't be reassigned until the scheduled.rebalance.delay.max.ms 
interval elapses. For that duration, the task shows up with status RUNNING 
whenever its status is queried. This can be confusing for users.

This problem can be exacerbated in cloud environments (like Kubernetes pods) 
because there is a high chance that the RUNNING status is associated with an 
older worker_id that doesn't even exist in the cluster anymore. 

The net effect of all of this is not catastrophic, i.e. it won't lead to any 
processing delays or loss of data, but the status of the task would be off. 
And if there are fast rebalances happening under the Incremental Cooperative 
Assignor, that duration could be high as well. 

So, the proposal is to increase the default value. I am thinking we can set it 
to 60s because, as far as I can see, it doesn't interfere with any other 
timeout that we have. 

I am tagging this as need-kip because I believe we will need one.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
