[ https://issues.apache.org/jira/browse/KAFKA-7790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Stanislav Kozlovski updated KAFKA-7790:
---------------------------------------
    Description: 
All Trogdor task specifications have a defined `startMs` and `durationMs`. Under conditions of task failure and restarts, it is intuitive to assume that a task would not be re-run after a certain time period.

The issue is best illustrated with an example:
{code:java}
startMs = 12PM; durationMs = 1hour;
# 12:02 - Coordinator schedules a task to run on agent-0
# 12:45 - agent-0 process dies. Coordinator's heartbeats to agent-0 fail.
# 12:47 - agent-0 comes back up. Coordinator's heartbeats pass and it re-schedules tasks that are not running in agent-0
# 13:20 - agent-0 process dies.
# 13:22 - agent-0 comes back up. Coordinator re-schedules task{code}
This can result in an endless loop of task rescheduling. If there are more tasks scheduled on agent-0 (e.g. a task scheduled to start on each hour), we can end up in a scenario where we overwhelm the agent with tasks that we would rather have dropped.
h2. Changes

We propose that the Trogdor Coordinator does not re-schedule a task if the current time of re-scheduling is greater than the start time of the task and its duration combined.

More specifically:
{code:java}
if (currentTimeMs > startTimeMs + durationTimeMs)
  scheduleTask()
else
  failTask(){code}

  was:
All Trogdor task specifications have a defined `startMs` and `durationMs`. Under conditions of task failure and restarts, it is intuitive to assume that a task would not be re-run after a certain time period.

The issue is best illustrated with an example:
{code:java}
startMs = 12PM; durationMs = 1hour;
# 12:02 - Coordinator schedules a task to run on agent-0
# 12:45 - agent-0 process dies. Coordinator's heartbeats to agent-0 fail.
# 12:47 - agent-0 comes back up. Coordinator's heartbeats pass and it re-schedules tasks that are not running in agent-0
# 13:20 - agent-0 process dies.
# 13:22 - agent-0 comes back up. Coordinator re-schedules task{code}
This can result in an endless loop of task rescheduling. If there are more tasks scheduled on agent-0 (e.g. a task scheduled to start on each hour), we can end up in a scenario where we overwhelm the agent with tasks that we would rather have dropped.
h2. Changes

We propose that the Trogdor Coordinator does not re-schedule a task if the current time of re-scheduling is greater than the start time of the task and its duration combined.

More specifically:
{code:java}
if (currentTimeMs < startTimeMs + durationTimeMs)
  scheduleTask()
else
  failTask(){code}


> Trogdor - Does not time out tasks in time
> -----------------------------------------
>
>                 Key: KAFKA-7790
>                 URL: https://issues.apache.org/jira/browse/KAFKA-7790
>             Project: Kafka
>          Issue Type: Improvement
>            Reporter: Stanislav Kozlovski
>            Assignee: Stanislav Kozlovski
>            Priority: Major
>
> All Trogdor task specifications have a defined `startMs` and `durationMs`.
> Under conditions of task failure and restarts, it is intuitive to assume that
> a task would not be re-run after a certain time period.
> The issue is best illustrated with an example:
> {code:java}
> startMs = 12PM; durationMs = 1hour;
> # 12:02 - Coordinator schedules a task to run on agent-0
> # 12:45 - agent-0 process dies. Coordinator's heartbeats to agent-0 fail.
> # 12:47 - agent-0 comes back up. Coordinator's heartbeats pass and it
> re-schedules tasks that are not running in agent-0
> # 13:20 - agent-0 process dies.
> # 13:22 - agent-0 comes back up. Coordinator re-schedules task{code}
> This can result in an endless loop of task rescheduling. If there are more
> tasks scheduled on agent-0 (e.g. a task scheduled to start on each hour), we
> can end up in a scenario where we overwhelm the agent with tasks that we
> would rather have dropped.
> h2. Changes
> We propose that the Trogdor Coordinator does not re-schedule a task if the
> current time of re-scheduling is greater than the start time of the task and
> its duration combined.
> More specifically:
> {code:java}
> if (currentTimeMs > startTimeMs + durationTimeMs)
>   scheduleTask()
> else
>   failTask(){code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
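
For reference, a minimal sketch of the rule proposed under "Changes", following the prose statement: a task is not re-scheduled once the current time is past `startMs + durationMs`. The comparison in the updated snippet points the other way from that prose, so this sketch sides with the prose and with the earlier version of the snippet. The class and method names are hypothetical and are not Trogdor's actual API.

```java
// Hypothetical helper, not part of Trogdor: decides whether a task whose
// agent came back up should still be re-scheduled.
final class TaskWindow {
    private TaskWindow() {}

    /**
     * Returns true while the task's window is still open, i.e. the current
     * time is not greater than the task's start time plus its duration.
     */
    static boolean shouldReschedule(long currentTimeMs, long startMs, long durationMs) {
        return currentTimeMs <= startMs + durationMs;
    }

    /** Converts a wall-clock hour and minute into milliseconds since midnight. */
    static long atMs(int hour, int minute) {
        return (hour * 60L + minute) * 60_000L;
    }
}
```

Applied to the timeline in the description (start at 12:00, one-hour duration), the restart at 12:47 would still re-schedule the task, while the restart at 13:22 would fail it instead.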