Hwanju Kim created FLINK-14949:
----------------------------------

             Summary: Task cancellation can be stuck against out-of-thread error
                 Key: FLINK-14949
                 URL: https://issues.apache.org/jira/browse/FLINK-14949
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Coordination
    Affects Versions: 1.8.2
            Reporter: Hwanju Kim


Task cancellation 
([_cancelOrFailAndCancelInvokable_|https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/taskmanager/Task.java#L991])
 relies on three separate threads: _TaskCanceler_, _TaskInterrupter_, and 
_TaskCancelerWatchdog_. TaskCanceler performs the cancellation itself, 
TaskInterrupter periodically interrupts a task that does not react, and 
TaskCancelerWatchdog kills the JVM if cancellation has not finished within a 
configured timeout (3 minutes by default). Together they ensure that 
cancellation either completes or is aborted, so the task reaches a terminal 
state in finite time (FLINK-4715).
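
To make the watchdog's role concrete, here is a minimal sketch of that idea. It is not Flink's TaskCancelerWatchdog; the class name, the FatalErrorHandler interface, and the constructor parameters are assumptions made for illustration only.

```java
// Illustrative only: a simplified watchdog, not Flink's TaskCancelerWatchdog.
final class CancellationWatchdogSketch implements Runnable {

    /** Hypothetical stand-in for whatever escalates to a JVM kill. */
    interface FatalErrorHandler {
        void notifyFatalError(String message);
    }

    private final Thread taskThread;
    private final long timeoutMillis; // must be > 0; join(0) would wait forever
    private final FatalErrorHandler handler;

    CancellationWatchdogSketch(Thread taskThread, long timeoutMillis, FatalErrorHandler handler) {
        this.taskThread = taskThread;
        this.timeoutMillis = timeoutMillis;
        this.handler = handler;
    }

    @Override
    public void run() {
        try {
            // Wait for the task thread to terminate, but only up to the timeout.
            taskThread.join(timeoutMillis);
        } catch (InterruptedException e) {
            // The watchdog itself was interrupted: restore the flag and give up.
            Thread.currentThread().interrupt();
            return;
        }
        if (taskThread.isAlive()) {
            // The task did not terminate in time, so escalate.
            handler.notifyFatalError("Task did not exit within " + timeoutMillis + " ms");
        }
    }
}
```

The point relevant to this issue is that the watchdog only protects the task if its own thread can be started in the first place.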

However, if creating any of these asynchronous threads fails, for example 
because the process has run out of threads (_java.lang.OutOfMemoryError: 
unable to create new native thread_), the task has already transitioned to 
CANCELING, but nothing performs the cancellation and no watchdog monitors it. 
Currently, the jobmanager does [retry 
cancellation|https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/Execution.java#L1121]
 after any returned error, but the next retry [returns success as soon as it 
sees 
CANCELING|https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/taskmanager/Task.java#L997],
 assuming that cancellation is in progress. The task therefore stays in 
CANCELING forever; since CANCELING is a non-terminal state, the state machine 
makes no further progress.
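
The sequence above can be sketched with a small model. This is a simplification written for this report, not Flink's Task class: the State enum, the ThreadStarter hook, and the cancel() method are hypothetical and exist only so that a thread-creation failure can be simulated.

```java
import java.util.concurrent.atomic.AtomicReference;

// Simplified model of the problem; not Flink's actual Task class.
final class StuckCancellationSketch {

    enum State { RUNNING, CANCELING, CANCELED }

    /** Hypothetical hook standing in for thread creation, so a failure can be simulated. */
    interface ThreadStarter {
        void start(Runnable work);
    }

    private final AtomicReference<State> state = new AtomicReference<>(State.RUNNING);
    private final ThreadStarter starter;

    StuckCancellationSketch(ThreadStarter starter) {
        this.starter = starter;
    }

    State state() {
        return state.get();
    }

    /** Returns normally when cancellation is believed to be under way. */
    void cancel() {
        if (state.get() != State.RUNNING) {
            // A retry lands here: CANCELING is taken to mean "in progress".
            return;
        }
        if (state.compareAndSet(State.RUNNING, State.CANCELING)) {
            // If this throws, the state stays CANCELING and nothing drives it further.
            starter.start(() -> state.set(State.CANCELED));
        }
    }
}
```

With a starter that throws OutOfMemoryError, the first cancel() call fails after the state change, and every later call returns without doing anything, which mirrors the retry behaviour described above.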

One solution: if a task has transitioned to CANCELING but then hits a fatal 
error or OOM (i.e., _isJvmFatalOrOutOfMemoryError_ is true), which indicates 
that it may never have spawned TaskCancelerWatchdog, it could immediately treat 
this as a fatal error (not safely cancellable) and call _notifyFatalError_, 
just as TaskCancelerWatchdog does, but eagerly and synchronously. That way, the 
task at least leaves the non-terminal state, and restarting the JVM also 
clears any leaked threads or memory. The same cancellation method is also 
invoked by _failExternally_, but handling the transition to FAILED seems less 
critical because FAILED is already a terminal state.
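
A rough sketch of this proposal, under the same simplifications as any toy model: the class, the ThreadStarter and FatalErrorHandler interfaces, and the local isJvmFatalOrOutOfMemoryError check are stand-ins written for illustration and are not Flink's implementation.

```java
import java.util.concurrent.atomic.AtomicReference;

// Simplified model of the proposed handling; not Flink's actual Task class.
final class EagerFatalCancellationSketch {

    enum State { RUNNING, CANCELING, CANCELED }

    /** Hypothetical hook standing in for spawning the cancellation threads. */
    interface ThreadStarter {
        void start(Runnable work);
    }

    /** Hypothetical stand-in for whatever receives notifyFatalError. */
    interface FatalErrorHandler {
        void notifyFatalError(String message, Throwable cause);
    }

    private final AtomicReference<State> state = new AtomicReference<>(State.RUNNING);
    private final ThreadStarter starter;
    private final FatalErrorHandler fatalErrorHandler;

    EagerFatalCancellationSketch(ThreadStarter starter, FatalErrorHandler fatalErrorHandler) {
        this.starter = starter;
        this.fatalErrorHandler = fatalErrorHandler;
    }

    State state() {
        return state.get();
    }

    /** Simplified stand-in for the check named in the report. */
    static boolean isJvmFatalOrOutOfMemoryError(Throwable t) {
        return t instanceof OutOfMemoryError
                || t instanceof InternalError
                || t instanceof UnknownError;
    }

    void cancel() {
        if (!state.compareAndSet(State.RUNNING, State.CANCELING)) {
            return; // already canceling or finished
        }
        try {
            starter.start(() -> state.set(State.CANCELED));
        } catch (Throwable t) {
            if (isJvmFatalOrOutOfMemoryError(t)) {
                // Escalate eagerly and synchronously, since no watchdog may exist to do it later.
                fatalErrorHandler.notifyFatalError("Could not start cancellation threads", t);
            }
            throw t;
        }
    }
}
```

The error is still rethrown so that the caller sees the failure; the only change is that a fatal error after the transition to CANCELING is escalated right away instead of being left to a watchdog that may never have started.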

Reproducing this is straightforward: run an application that keeps creating 
threads in a loop, none of which ever finishes, and that has multiple tasks, 
so that one task triggers a failure and the full fail-over then attempts to 
cancel the others. In the web UI dashboard, some tasks on a task manager where 
any of the cancellation-related threads failed to spawn remain stuck in 
CANCELING indefinitely.
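
The thread-leaking part of such an application might look like the sketch below. It is plain Java rather than a Flink job, and the class name, the maxThreads cap, and the release latch are assumptions added so that the loop can be run safely; in an actual reproduction the loop would presumably run inside a user function with no effective cap.

```java
import java.util.concurrent.CountDownLatch;

// Illustrative thread leaker; the cap and latch exist only to keep the sketch safe to run.
final class ThreadLeakSketch {

    /**
     * Starts up to maxThreads threads that block until 'release' is counted down.
     * Returns the number of threads that were actually started.
     */
    static int leakThreads(int maxThreads, CountDownLatch release) {
        int started = 0;
        try {
            while (started < maxThreads) {
                Thread t = new Thread(() -> {
                    try {
                        release.await(); // never returns unless the latch is released
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }, "leaked-" + started);
                t.setDaemon(true);
                t.start();
                started++;
            }
        } catch (OutOfMemoryError e) {
            // Typically "unable to create new native thread": the process is out of threads.
        }
        return started;
    }
}
```

Calling leakThreads with a very large cap and a latch that is never released exhausts the thread limit of the process, after which any attempt to start the cancellation threads in the same JVM is expected to fail in the way described above.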



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
