[jira] [Commented] (FLINK-16728) Taskmanager dies after job got stuck and canceling fails
[ https://issues.apache.org/jira/browse/FLINK-16728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17076949#comment-17076949 ]

Zhu Zhu commented on FLINK-16728:
---------------------------------

Flink forces the TM to shut down because some tasks on the TM are out of control: the task cannot be canceled and its slot cannot be released. This is a safety net that is not expected to take effect in normal cases. In this case, I think you need to investigate why the task is not responding, which is the root cause.

> Taskmanager dies after job got stuck and canceling fails
> --------------------------------------------------------
>
>                 Key: FLINK-16728
>                 URL: https://issues.apache.org/jira/browse/FLINK-16728
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Task
>    Affects Versions: 1.10.0
>            Reporter: Leonid Ilyevsky
>            Priority: Major
>         Attachments: taskmanager.log.20200323.gz
>
>
> At some point I noticed that a few jobs got stuck (they basically stopped processing the messages; I could detect this by watching the expected output), so I tried to cancel them.
> The cancel operation failed, complaining that the job got stuck at StreamTaskActionExecutor$SynchronizedStreamTaskActionExecutor.run(StreamTaskActionExecutor.java:86), and then the whole taskmanager shut down.
> See the attached log.
> This is actually happening practically every day in our staging environment where we are testing Flink 1.10.0.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
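[Editor's note] The safety net described above is governed by Flink's task-cancellation options. A minimal flink-conf.yaml sketch, assuming the option names and defaults from the Flink configuration reference (the values shown are illustrative, not recommendations):

```yaml
# Sketch of the relevant flink-conf.yaml knobs (task cancellation options;
# values are illustrative defaults, not tuning advice).

# How often (in ms) Flink re-attempts to interrupt a canceling task.
task.cancellation.interval: 30000

# How long (in ms) a cancellation may run before the watchdog declares the
# task "out of control" and terminates the TaskManager as a safety net.
# A value of 0 disables the watchdog entirely.
task.cancellation.timeout: 180000
```

Disabling the watchdog keeps the TaskManager alive, but a permanently stuck task then also keeps its slot occupied indefinitely.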
[jira] [Commented] (FLINK-16728) Taskmanager dies after job got stuck and canceling fails
[ https://issues.apache.org/jira/browse/FLINK-16728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17071367#comment-17071367 ]

Leonid Ilyevsky commented on FLINK-16728:
-----------------------------------------

Hi [~zhuzh], thanks for the explanation. Of course, that job had its own problem and it got stuck. However, in my specific scenario I would really prefer somewhat different behavior. Here is what happened: I knew this job had a problem, and I tried to cancel it; I did not want automatic recovery in this case. I actually managed to cancel it, but in the process it brought down two of the five taskmanagers where this job was running. Those taskmanagers hosted other jobs, and suddenly there were not enough available slots to run all the jobs.

So maybe it is possible to optionally provide behavior like this: if the job is being canceled, then even in case of a timeout, just cancel the job and clean up all resources associated with it, but keep the taskmanager up. What do you think?
[jira] [Commented] (FLINK-16728) Taskmanager dies after job got stuck and canceling fails
[ https://issues.apache.org/jira/browse/FLINK-16728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17067346#comment-17067346 ]

Zhu Zhu commented on FLINK-16728:
---------------------------------

Hi [~lilyevsky], it is intentional to shut down a TaskManager if task cancellation cannot finish within the timeout. This triggers a failure and forces the job to recover from it rather than stay stuck on it. So what matters is actually why the tasks are stuck, both in data processing and in task cancellation.

From the log you attached, I think the deeper frames of the blocked tasks are:

org.elasticsearch.action.bulk.BulkProcessor.flush(BulkProcessor.java:433)
org.apache.flink.streaming.connectors.elasticsearch.ElasticsearchSinkBase.snapshotState(ElasticsearchSinkBase.java:322)
app//org.apache.flink.streaming.util.functions.StreamingFunctionUtils.trySnapshotFunctionState(StreamingFunctionUtils.java:118)

So it looks like the task is blocked in the Elasticsearch flush and therefore does not respond to the task cancellation request.
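[Editor's note] A quick way to confirm this diagnosis on future occurrences is to filter a TaskManager thread dump for the blocking frames. A small sketch; the inline excerpt below stands in for the attached taskmanager.log.20200323.gz, and the temp-file path is hypothetical:

```shell
# Sketch: count the frames in a TaskManager thread dump that indicate the
# Elasticsearch flush is blocking cancellation. The here-doc excerpt stands
# in for the real attached log file.
dump=$(mktemp)
cat > "$dump" <<'EOF'
at org.elasticsearch.action.bulk.BulkProcessor.flush(BulkProcessor.java:433)
at org.apache.flink.streaming.connectors.elasticsearch.ElasticsearchSinkBase.snapshotState(ElasticsearchSinkBase.java:322)
at app//org.apache.flink.streaming.util.functions.StreamingFunctionUtils.trySnapshotFunctionState(StreamingFunctionUtils.java:118)
EOF
grep -cE 'BulkProcessor\.flush|ElasticsearchSinkBase\.snapshotState' "$dump"  # → 2
```

If the count is nonzero at the moment a cancel request times out, the sink flush is the likely blocker.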