Re: RestartStrategy failure count when losing a Task Manager

2020-07-15 Thread Chesnay Schepler
1) A restart in one region increments the count by only 1, regardless 
of how many tasks from that region fail at the same time.
If tasks from different regions fail at the same time, the count is 
incremented by the number of affected regions.
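
The counting rule above can be illustrated with a small sketch. This is not Flink code: the task names, region names and the helper count_restart_failures are hypothetical and exist only to show that the count grows with the number of affected regions, not the number of failed tasks.

```python
from typing import Dict, Iterable, Set


def count_restart_failures(task_to_region: Dict[str, str],
                           failed_tasks: Iterable[str]) -> int:
    """Return how much one failure event adds to the restart-strategy count.

    Illustrative model only: each distinct region that contains at least
    one failed task contributes exactly 1, no matter how many of its
    tasks failed at the same time.
    """
    affected_regions: Set[str] = {task_to_region[task] for task in failed_tasks}
    return len(affected_regions)


# Hypothetical job with two failover regions.
task_to_region = {
    "source-0": "region-A", "map-0": "region-A",
    "source-1": "region-B", "map-1": "region-B",
}

# A lost task manager hosted two tasks of the same region: counts once.
same_region = count_restart_failures(task_to_region, ["source-0", "map-0"])

# A lost task manager hosted one task from each region: counts twice.
two_regions = count_restart_failures(task_to_region, ["map-0", "map-1"])

print(same_region, two_regions)
```

In this model, how many failures a single task manager loss produces depends on how many different regions had tasks deployed on it.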


2)

I would consider what failure rate would be acceptable if there were no 
regions, and then multiply that by the number of slots to offset task 
executor failures.
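
As a rough sketch of that arithmetic, assuming "number of slots" means the slots per task executor (in the worst case each slot of a lost task executor hosts a task from a different region): the helper names and the base value of 3 are illustrative, taken from the setup in the question rather than a recommendation. The keys are the standard failure-rate restart strategy options from the Flink configuration.

```python
def scaled_failure_rate(base_max_failures: int, slots_per_task_executor: int) -> int:
    """Scale the failure rate that would be acceptable without regions.

    Assumption: losing one task executor can affect up to one region per
    slot, so the acceptable rate is multiplied by the slot count.
    """
    return base_max_failures * slots_per_task_executor


def failure_rate_config(max_failures: int, interval: str, delay: str) -> dict:
    """Build the failure-rate restart strategy entries for flink-conf.yaml."""
    return {
        "restart-strategy": "failure-rate",
        "restart-strategy.failure-rate.max-failures-per-interval": str(max_failures),
        "restart-strategy.failure-rate.failure-rate-interval": interval,
        "restart-strategy.failure-rate.delay": delay,
    }


# Values from the question: 3 failures per 30 min, 1 min delay, 4 slots per TM.
max_failures = scaled_failure_rate(base_max_failures=3, slots_per_task_executor=4)
config = failure_rate_config(max_failures, interval="30 min", delay="1 min")

for key, value in config.items():
    print(f"{key}: {value}")
```

Deriving the value from the slot count this way avoids hard-coding a separate number for every pipeline, as long as the base rate is chosen once per class of job.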



Failures in the application (e.g., a source failing for some reason) 
will generally behave, failure-rate wise, as if regions did not exist. 
They are sporadic, and the chance of them appearing in different regions 
at the same time seems rather small.



On 15/07/2020 00:16, Jiahui Jiang wrote:
Hello Flink, I have some questions regarding the guidelines for 
configuring the restart strategy.


I was testing a job with the following setup:

 1. There are many tasks, but currently I'm running with a parallelism
of only 2, with plenty of task slots (4 TMs with 4 task slots
each).
 2. It runs on k8s with HA enabled.
 3. The current restart strategy is 'failure-rate' with a 30 min failure
interval, a 1 min delay and a max failure rate of 3.

When a TM got removed by k8s, it looked like that caused multiple 
failures to happen all at once. In the job manager log, I'm seeing 
different tasks fail with the same stack trace 'Heartbeat of 
taskManager with id {SOME_ID} timed out' around the same time.


I understand that all the tasks that were running on this taskManager 
would fail, but I still have the following questions:


Questions:

 1. What counts as one failure for a restartStrategy? Judging by my
other jobs, it doesn't look like every failed task counts as one
failure. Is it because the failover strategy defaults to region,
so each failure only triggers part of the job graph to restart,
and the rest of the job graph that was not retriggered can still
cause more failures that count towards the failure rate?
 2. If that's the case, what would be the recommended way to set the
restart strategy? If we don't want to hard-code a number for every
single pipeline we are running, is there a good way to infer how to
set the failure rate?

Thank you so much!
Jiahui





RestartStrategy failure count when losing a Task Manager

2020-07-14 Thread Jiahui Jiang
Hello Flink, I have some questions regarding the guidelines for configuring 
the restart strategy.

I was testing a job with the following setup:

  1.  There are many tasks, but currently I'm running with a parallelism of 
only 2, with plenty of task slots (4 TMs with 4 task slots each).
  2.  It runs on k8s with HA enabled.
  3.  The current restart strategy is 'failure-rate' with a 30 min failure 
interval, a 1 min delay and a max failure rate of 3.

When a TM got removed by k8s, it looked like that caused multiple failures to 
happen all at once. In the job manager log, I'm seeing different tasks fail 
with the same stack trace 'Heartbeat of taskManager with id {SOME_ID} timed out' 
around the same time.

I understand that all the tasks that were running on this taskManager would 
fail, but I still have the following questions:

Questions:

  1.  What counts as one failure for a restartStrategy? Judging by my other 
jobs, it doesn't look like every failed task counts as one failure. Is it 
because the failover strategy defaults to region, so each failure only 
triggers part of the job graph to restart, and the rest of the job graph that 
was not retriggered can still cause more failures that count towards the 
failure rate?
  2.  If that's the case, what would be the recommended way to set the restart 
strategy? If we don't want to hard-code a number for every single pipeline we 
are running, is there a good way to infer how to set the failure rate?

Thank you so much!
Jiahui