[jira] [Updated] (KAFKA-7790) Trogdor - Does not time out tasks in time

Stanislav Kozlovski (JIRA) Mon, 07 Jan 2019 07:19:22 -0800


     [ 
https://issues.apache.org/jira/browse/KAFKA-7790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Stanislav Kozlovski updated KAFKA-7790:
---------------------------------------
    Description: 
All Trogdor task specifications define a `startMs` and a `durationMs`. 
When tasks fail and are restarted, it is intuitive to assume that a task 
would not be re-run once its time window (`startMs` + `durationMs`) has passed.

The issue is best illustrated with an example:
{code:java}
startMs = 12PM; durationMs = 1hour;
# 12:02 - Coordinator schedules a task to run on agent-0
# 12:45 - agent-0 process dies. Coordinator's heartbeats to agent-0 fail.
# 12:47 - agent-0 comes back up. Coordinator's heartbeats pass and it 
re-schedules tasks that are not running in agent-0
# 13:20 - agent-0 process dies.
# 13:22 - agent-0 comes back up. Coordinator re-schedules task{code}
This can result in an endless loop of task rescheduling. If more tasks are 
scheduled on agent-0 (e.g. a task scheduled to start on each hour), we can 
end up overwhelming the agent with tasks that we would rather have dropped.
h2. Changes

We propose that the Trogdor Coordinator does not re-schedule a task if the 
current time at re-scheduling is later than the task's start time plus its 
duration. More specifically:
{code:java}
if (currentTimeMs > startTimeMs + durationTimeMs)
  failTask()      // the task's window has already passed: do not re-schedule
else
  scheduleTask(){code}
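
A minimal Java sketch of the rule described in the prose above (do not re-schedule once the current time is past the start time plus the duration), replayed against the two agent restarts from the example timeline. `TaskExpiry`, `Decision`, and `decide` are hypothetical names chosen for illustration and are not taken from the Trogdor codebase. Times are expressed as milliseconds since midnight only to mirror the example.

```java
import java.time.Duration;

// Hypothetical helper for illustration; not part of the Trogdor codebase.
final class TaskExpiry {
    enum Decision { SCHEDULE, FAIL }

    private TaskExpiry() {}

    // Returns FAIL when the task's window (startMs + durationMs) has already
    // passed at the moment of re-scheduling, and SCHEDULE otherwise.
    static Decision decide(long currentTimeMs, long startMs, long durationMs) {
        if (currentTimeMs > startMs + durationMs) {
            return Decision.FAIL;
        }
        return Decision.SCHEDULE;
    }

    public static void main(String[] args) {
        // Assumed encoding: milliseconds since midnight, matching the example.
        long startMs = Duration.ofHours(12).toMillis();     // 12:00
        long durationMs = Duration.ofHours(1).toMillis();   // 1 hour

        long firstRestart = Duration.ofHours(12).plusMinutes(47).toMillis();
        long secondRestart = Duration.ofHours(13).plusMinutes(22).toMillis();

        // 12:47 is inside the window, so the task is re-scheduled.
        System.out.println(decide(firstRestart, startMs, durationMs));  // SCHEDULE
        // 13:22 is past 13:00, so the task is failed instead of re-scheduled.
        System.out.println(decide(secondRestart, startMs, durationMs)); // FAIL
    }
}
```

With this check in place, the 13:22 restart in the example no longer re-schedules the task, which breaks the rescheduling loop.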
 

 

 

  was:
All Trogdor task specifications define a `startMs` and a `durationMs`. 
When tasks fail and are restarted, it is intuitive to assume that a task 
would not be re-run once its time window (`startMs` + `durationMs`) has passed.

The issue is best illustrated with an example:
{code:java}
startMs = 12PM; durationMs = 1hour;
# 12:02 - Coordinator schedules a task to run on agent-0
# 12:45 - agent-0 process dies. Coordinator's heartbeats to agent-0 fail.
# 12:47 - agent-0 comes back up. Coordinator's heartbeats pass and it 
re-schedules tasks that are not running in agent-0
# 13:20 - agent-0 process dies.
# 13:22 - agent-0 comes back up. Coordinator re-schedules task{code}
This can result in an endless loop of task rescheduling. If more tasks are 
scheduled on agent-0 (e.g. a task scheduled to start on each hour), we can 
end up overwhelming the agent with tasks that we would rather have dropped.
h2. Changes


We propose that the Trogdor Coordinator does not re-schedule a task if the 
current time at re-scheduling is later than the task's start time plus its 
duration. More specifically:
{code:java}
if (currentTimeMs < startTimeMs + durationTimeMs)
  scheduleTask()
else
  failTask(){code}
 

 

 


> Trogdor - Does not time out tasks in time
> -----------------------------------------
>
>                 Key: KAFKA-7790
>                 URL: https://issues.apache.org/jira/browse/KAFKA-7790
>             Project: Kafka
>          Issue Type: Improvement
>            Reporter: Stanislav Kozlovski
>            Assignee: Stanislav Kozlovski
>            Priority: Major



