[jira] [Commented] (FLINK-17560) No Slots available exception in Apache Flink Job Manager while Scheduling
[ https://issues.apache.org/jira/browse/FLINK-17560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17336195#comment-17336195 ] Flink Jira Bot commented on FLINK-17560:

This issue was labeled "stale-major" 7 days ago and has not received any updates, so it is being deprioritized. If this ticket is actually Major, please raise the priority and ask a committer to assign you the issue, or revive the public discussion.

> No Slots available exception in Apache Flink Job Manager while Scheduling
> -------------------------------------------------------------------------
>
>                 Key: FLINK-17560
>                 URL: https://issues.apache.org/jira/browse/FLINK-17560
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.8.3
>         Environment: Flink version 1.8.3
>                      Session cluster
>            Reporter: josson paul kalapparambath
>            Priority: Major
>              Labels: stale-major
>         Attachments: jobmgr.log, threaddump-tm.txt, tm.log
>
> Set up
> ------
> Flink version 1.8.3
> Zookeeper HA cluster
> 1 ResourceManager/Dispatcher (same node)
> 1 TaskManager
> 4 pipelines running with various parallelisms
>
> Issue
> -----
> Occasionally, when the Job Manager gets restarted, we noticed that none of
> the pipelines are getting scheduled. The error reported by the Job Manager
> is 'not enough slots are available'. This should not be the case, because
> the task manager was deployed with sufficient slots for the number of
> pipelines/parallelisms we have.
> We further noticed that the slot report sent by the task manager contains
> slots filled with old CANCELLED job IDs. I am not sure why the task manager
> still holds the details of the old jobs. A thread dump on the task manager
> confirms that the old pipelines are not running.
> I am aware of https://issues.apache.org/jira/browse/FLINK-12865, but that is
> not the issue happening in this case.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-17560) No Slots available exception in Apache Flink Job Manager while Scheduling
[ https://issues.apache.org/jira/browse/FLINK-17560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17327841#comment-17327841 ] Flink Jira Bot commented on FLINK-17560:

This major issue is unassigned, and neither it nor any of its Sub-Tasks have been updated for 30 days, so it has been labeled "stale-major". If this ticket is indeed "major", please either assign yourself or give an update, and afterwards remove the label. In 7 days the issue will be deprioritized.
[jira] [Commented] (FLINK-17560) No Slots available exception in Apache Flink Job Manager while Scheduling
[ https://issues.apache.org/jira/browse/FLINK-17560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17283612#comment-17283612 ] Matthias commented on FLINK-17560:

[~josson] Did you have a chance to reproduce the issue with an official Flink release without modifications?
[jira] [Commented] (FLINK-17560) No Slots available exception in Apache Flink Job Manager while Scheduling
[ https://issues.apache.org/jira/browse/FLINK-17560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17135609#comment-17135609 ] Till Rohrmann commented on FLINK-17560:

I have created an issue to track the problem of re-offering slots which can contain unfinished {{Tasks}}: FLINK-18293
[jira] [Commented] (FLINK-17560) No Slots available exception in Apache Flink Job Manager while Scheduling
[ https://issues.apache.org/jira/browse/FLINK-17560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17135587#comment-17135587 ] Till Rohrmann commented on FLINK-17560:

Thanks for reporting this issue [~josson]. The root problem in your case is that the Task is not properly shutting down. From a resource management perspective it makes sense that this slot cannot be reused, because it still has a resource consumer which might not have properly freed its resources.

The {{TaskCancelerWatchDog}} should kill the task after the configured {{task.cancellation.timeout}}; by default it is 3 minutes. But maybe you are right that this task is stuck because of a JVM bug. This could explain why the {{TaskManager}} process is not being killed by the {{TaskCancelerWatchDog}}.

The thing which actually worries me a bit is that we are offering the slot to the new leader (before the JM fails) even though it still contains a {{Task}} which has not properly stopped yet. The problem is that the new leader does not know about the not-yet-stopped {{Task}} and will think that the slot is empty. Hence, when the {{JM}} deploys a new {{Task}} into this slot, it might exceed the resources of this slot (e.g. network buffers, memory, etc.). So maybe we should only offer slots to {{JMs}} if they are empty. Once the {{JM}} is able to reconcile its state with a slot which can contain running tasks, we could change this behavior.
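To make the watchdog behavior concrete, here is a minimal, hypothetical Java sketch of the cancellation pattern described above. This is not Flink's actual {{TaskCancelerWatchDog}} code; the class and method names are illustrative. The point is that the watchdog can only report a stuck task; a task blocked in native code never observes the interrupt, so the join times out and the process has to be killed externally.

```java
public class CancellationWatchdogSketch {

    /**
     * Interrupts the task thread and waits up to timeoutMillis for it to die.
     * Returns true if the task terminated; false means the watchdog would have
     * to escalate (in Flink, this is where the whole process gets killed).
     */
    public static boolean cancelAndAwait(Thread taskThread, long timeoutMillis)
            throws InterruptedException {
        taskThread.interrupt();          // ask the task to stop
        taskThread.join(timeoutMillis);  // wait for the cancellation timeout
        return !taskThread.isAlive();
    }

    public static void main(String[] args) throws Exception {
        // A cooperative task: exits as soon as it is interrupted.
        Thread cooperative = new Thread(() -> {
            try {
                Thread.sleep(60_000);
            } catch (InterruptedException e) {
                // interrupted: fall through and terminate
            }
        });
        cooperative.start();
        System.out.println("terminated: " + cancelAndAwait(cooperative, 5_000));
    }
}
```

A task stuck in native code (as in the suspected JVM bug) would never see the interrupt, `join` would time out, and `cancelAndAwait` would return false.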
[jira] [Commented] (FLINK-17560) No Slots available exception in Apache Flink Job Manager while Scheduling
[ https://issues.apache.org/jira/browse/FLINK-17560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17127464#comment-17127464 ] Xintong Song commented on FLINK-17560:

bq. Anybody reported this issue before.

Not that I'm aware of. Have you tried an official Flink release? Can this problem be reproduced?
[jira] [Commented] (FLINK-17560) No Slots available exception in Apache Flink Job Manager while Scheduling
[ https://issues.apache.org/jira/browse/FLINK-17560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17127426#comment-17127426 ] josson paul kalapparambath commented on FLINK-17560:

[~xintongsong] [~chesnay] We are using a customized Flink. I do have modifications in the JobManager scheduler code: I fixed the ConcurrentModificationException which was happening in the Job Manager code, but the original issue still happens. If you see my message above (I have also attached the thread dump), you can see that Tasks are getting stuck forever in the JVM and the 'finally' block for those Tasks is never called. Because of this, the slots will never go into the 'FREE' state. Has anybody reported this issue before? This happens only if the number of threads in the Task Manager is around 950.
[jira] [Commented] (FLINK-17560) No Slots available exception in Apache Flink Job Manager while Scheduling
[ https://issues.apache.org/jira/browse/FLINK-17560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17127203#comment-17127203 ] Xintong Song commented on FLINK-17560:

[~josson], sorry for the late response. I've no idea how I overlooked the update notification on this ticket.

As [~chesnay] already mentioned, there's a {{ConcurrentModificationException}} on the JM side when it tries to accept the slots offered by the TM. This is not a known issue that I'm aware of. The error stack does not seem to match the code base of Flink 1.8.3, so I have the same question: is this a customized Flink version?
[jira] [Commented] (FLINK-17560) No Slots available exception in Apache Flink Job Manager while Scheduling
[ https://issues.apache.org/jira/browse/FLINK-17560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17126933#comment-17126933 ] Chesnay Schepler commented on FLINK-17560:

There's a ConcurrentModificationException in the TM logs when the slots are being offered. If this is a bug in the slot allocation protocol, then the only option I see is to try a later Flink version. Are you running a customized Flink version?
[jira] [Commented] (FLINK-17560) No Slots available exception in Apache Flink Job Manager while Scheduling
[ https://issues.apache.org/jira/browse/FLINK-17560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17126857#comment-17126857 ] josson paul kalapparambath commented on FLINK-17560:

[~xintongsong] Did you get a chance to see the logs?
[jira] [Commented] (FLINK-17560) No Slots available exception in Apache Flink Job Manager while Scheduling
[ https://issues.apache.org/jira/browse/FLINK-17560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116940#comment-17116940 ] josson paul kalapparambath commented on FLINK-17560:

[~xintongsong] I was able to root-cause this problem to tasks getting stuck in a highly threaded environment. Below I have explained how I was able to reproduce this issue (not consistently, though).

*Scenario 1*

Job ID: JobID123 (parallelism 1). The TM has only a single slot.

h5. Step 1: Schedule the job

JobID123 is running on the TM in a slot with allocation ID *AllocationID123*. Now *JobID123* is mapped to allocation ID *AllocationID123*.

h5. Step 2: Zookeeper stop

The task manager tries to cancel/fail all the tasks on *AllocationID123*, but some of the tasks get stuck and never stop, which means the *finally* block that cleans things up is never called. At this point, the above-mentioned tasks are in the *CANCELLING* state, but the slot is still in the *ALLOCATED* state, allocated to job ID *JobID123*.

I can see that Flink has a cancel-task thread / interrupter thread / watchdog thread, but this task is still stuck. Below I have pasted lines from the thread dump of one such instance where a task was stuck. Are we hitting this issue: [https://bugs.openjdk.java.net/browse/JDK-8227375]? We are using Java 8.

Note: it is not always this task which is stuck; it can be any task from the pipelines.
{code:java}
"OutputFlusher for Source: KUS/snmp_trap_proto/Kafka-Read/Read(KafkaUnboundedSource) -> FlatMap -> KUS/snmp_trap_proto/KafkaRecordToCTuple/ParMultiDo(KafkaRecordToCTuple) -> HSTrapParDoInit117/ParMultiDo(HealthScoreTrapProcParDo) -> ParDo(GroupDataByEntityId)/ParMultiDo(GroupDataByEntityId) -> ApplyHSEventFixedWindow118/Window.Assign.out -> ToKeyedWorkItem" #1542 daemon prio=5 os_prio=0 tid=0x7f5c0daff000 nid=0x638 sleeping [0x7f5b91267000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
	at java.lang.Thread.sleep(Native Method)
	at org.apache.flink.runtime.io.network.api.writer.RecordWriter$OutputFlusher.run(RecordWriter.java:362)

   Locked ownable synchronizers:
	- None

"deviceprocessor" #1538 prio=5 os_prio=0 tid=0x7f5c0ed1a000 nid=0x634 runnable [0x7f5bb45d6000]
   java.lang.Thread.State: TIMED_WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for <0x00049fc13cd8> (a java.util.concurrent.SynchronousQueue$TransferStack)
	at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
	at java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(SynchronousQueue.java:460)
	at java.util.concurrent.SynchronousQueue$TransferStack.transfer(SynchronousQueue.java:362)
	at java.util.concurrent.SynchronousQueue.poll(SynchronousQueue.java:941)
	at org.apache.beam.sdk.io.kafka.KafkaUnboundedReader.nextBatch(KafkaUnboundedReader.java:613)
	at org.apache.beam.sdk.io.kafka.KafkaUnboundedReader.advance(KafkaUnboundedReader.java:228)
	at org.apache.beam.runners.flink.metrics.ReaderInvocationUtil.invokeAdvance(ReaderInvocationUtil.java:64)
	at org.apache.beam.runners.flink.translation.wrappers.streaming.io.UnboundedSourceWrapper.run(UnboundedSourceWrapper.java:281)
	- locked <0x000499000420> (a java.lang.Object)
	at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:93)
	at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:57)
	at org.apache.flink.streaming.runtime.tasks.SourceStreamTask.run(SourceStreamTask.java:97)
	at org.apache.flink.streaming.runtime.tasks.StoppableSourceStreamTask.run(StoppableSourceStreamTask.java:45)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:302)
	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:711)
	at java.lang.Thread.run(Thread.java:745)

   Locked ownable synchronizers:
	- None
{code}

h5. Step 3: Zookeeper start

Even though the tasks are not fully cancelled, the slot is still in the 'ALLOCATED' state. Now the TM offers this slot (*AllocationID123*) to the JM. The JM doesn't care and happily deploys the tasks (*is this intended?*). All is well and the job runs fine.

h5. Step 4: Job manager restart (the problem happens at this stage)

The TM offers the same slot (with allocation ID *AllocationID123*). Now, unfortunately, the JM throws an exception (this is a rare scenario and happened because of our internal change in the scheduling part; we are fixing it). When this happens, the TM sees that it still has tasks running (the old stuck task) and changes the status of the slot to *RELEASING*. Once this happens, the TM doesn't have any more slots to offer to the JM, and the pipeline stays in a 'no slots available' state indefinitely.
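The "stuck task" in Step 2 can be simulated in a few lines of Java. This is a hypothetical repro sketch, not Flink code: a task that swallows interrupts keeps running after cancellation, so its finally-block (standing in for the cleanup that frees the slot) never executes, and the slot stays ALLOCATED from the TM's point of view.

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class StubbornTaskSketch {

    /**
     * Starts a task that ignores interrupts, mimicking user code (or a JVM bug)
     * that never reacts to cancellation. It only stops when 'done' is flipped.
     */
    public static Thread startStubbornTask(AtomicBoolean done, AtomicBoolean slotFreed) {
        Thread t = new Thread(() -> {
            try {
                while (!done.get()) {
                    try {
                        Thread.sleep(10);
                    } catch (InterruptedException ignored) {
                        // interrupt swallowed: cancellation never takes effect
                    }
                }
            } finally {
                slotFreed.set(true); // the cleanup that would free the slot
            }
        });
        t.start();
        return t;
    }

    public static void main(String[] args) throws Exception {
        AtomicBoolean done = new AtomicBoolean(false);
        AtomicBoolean slotFreed = new AtomicBoolean(false);
        Thread task = startStubbornTask(done, slotFreed);

        task.interrupt();        // "cancel" the task
        Thread.sleep(200);
        System.out.println("still running after interrupt: " + task.isAlive());
        System.out.println("slot freed: " + slotFreed.get());

        done.set(true);          // only an external state change stops it
        task.join();
    }
}
```

After the interrupt the task is still alive and the slot is still "allocated"; only flipping `done` lets the finally-block run.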
[jira] [Commented] (FLINK-17560) No Slots available exception in Apache Flink Job Manager while Scheduling
[ https://issues.apache.org/jira/browse/FLINK-17560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17113678#comment-17113678 ] Xintong Song commented on FLINK-17560:

If you can reproduce this issue, it would be helpful to provide the logs.
[jira] [Commented] (FLINK-17560) No Slots available exception in Apache Flink Job Manager while Scheduling
[ https://issues.apache.org/jira/browse/FLINK-17560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17113567#comment-17113567 ] josson paul kalapparambath commented on FLINK-17560:

[~xintongsong] I am able to reproduce this issue (not consistently) if the number of threads in the Task Manager is very high. If the number of threads on the TM is high and the Job Manager is restarted, we sometimes get into this issue. To me it looks like some piece of code on the path to notifyFinalState() is not executed. Some thread contention?
[jira] [Commented] (FLINK-17560) No Slots available exception in Apache Flink Job Manager while Scheduling
[ https://issues.apache.org/jira/browse/FLINK-17560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17112755#comment-17112755 ] Xintong Song commented on FLINK-17560:

I'm not entirely sure about this, but AFAIK this should not happen.
- If {{notifyFinalState}} is not executed because there's an exception thrown before it, then the {{notifyFatalError}} in the following catch-block should be called and the JVM process should be terminated.
- I'm not aware of any potential problem that may block the thread. If you look at {{cancelOrFailAndCancelInvokableInternal}}, there's an {{interruptingThread}} that will interrupt the canceling thread if the user code does something blocking. However, this is the part I'm not very familiar with, thus not entirely sure about.
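The first point above can be illustrated with a simplified model of the tail of a task's run() method. This is a hypothetical sketch, not Flink's actual {{Task}} code, and the names are illustrative: both a normal return and a thrown exception reach the finally-block that releases the slot, so the slot can only stay ALLOCATED if the invokable never returns at all (a stuck thread).

```java
import java.util.concurrent.atomic.AtomicReference;

public class TaskRunTailSketch {

    public enum SlotState { ALLOCATED, FREE }

    /**
     * Simplified shape of the end of a task's run(): the catch-block stands in
     * for notifyFatalError, the finally-block for the path that eventually
     * frees the slot. Only a thread that never returns skips both.
     */
    public static SlotState runTask(Runnable invokable, AtomicReference<Throwable> fatalError) {
        SlotState slot = SlotState.ALLOCATED;
        try {
            invokable.run();
        } catch (Throwable t) {
            fatalError.set(t);         // notifyFatalError analogue
        } finally {
            slot = SlotState.FREE;     // notifyFinalState analogue: slot released
        }
        return slot;
    }

    public static void main(String[] args) {
        AtomicReference<Throwable> err = new AtomicReference<>();
        System.out.println(runTask(() -> { }, err));                                  // normal exit
        System.out.println(runTask(() -> { throw new RuntimeException("boom"); }, err)); // exceptional exit
    }
}
```

Both calls end with the slot FREE; the exceptional path additionally records the fatal error, matching the "either cleanup or fatal error" reasoning above.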
[jira] [Commented] (FLINK-17560) No Slots available exception in Apache Flink Job Manager while Scheduling
[ https://issues.apache.org/jira/browse/FLINK-17560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17112667#comment-17112667 ] josson paul kalapparambath commented on FLINK-17560:

[~xintongsong] What are the chances that the code below doesn't execute? If this piece of code doesn't execute, there is a chance the slot is not released. Can this happen if there is something wrong in the user code (the actual transformations), or because of blocked threads?

https://github.com/apache/flink/blob/d54807ba10d0392a60663f030f9fe0bfa1c66754/flink-runtime/src/main/java/org/apache/flink/runtime/taskmanager/Task.java#L840
[ https://issues.apache.org/jira/browse/FLINK-17560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17104186#comment-17104186 ] Xintong Song commented on FLINK-17560: -- [~josson], The slot is marked FREE by the same code you linked. The same method will be called again once all the remaining tasks are canceled and removed. Take a look at {{TaskSlotTable#removeTask}}: the logic in that method causes {{TaskSlotTable#freeSlot}} to be called again when all tasks are removed. From its call hierarchy you can see that it is invoked whenever a task is finished/failed/canceled.
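A minimal sketch of the two-phase release pattern described above, under assumed, simplified names (this is not Flink's actual {{TaskSlotTable}}): {{freeSlot}} only marks the slot RELEASING while tasks are still registered, and {{removeTask}} re-invokes {{freeSlot}} once the last task is gone, which is when the slot finally becomes FREE.

```java
// Hypothetical sketch of the freeSlot/removeTask interplay; all names
// are simplified stand-ins, not Flink's real API.
import java.util.HashSet;
import java.util.Set;

class SlotSketch {
    enum State { ALLOCATED, RELEASING, FREE }

    private State state = State.ALLOCATED;
    private final Set<String> tasks = new HashSet<>();

    void addTask(String taskId) { tasks.add(taskId); }

    // freeSlot: if tasks are still running, only mark the slot RELEASING;
    // it becomes FREE once no tasks remain.
    State freeSlot() {
        state = tasks.isEmpty() ? State.FREE : State.RELEASING;
        return state;
    }

    // removeTask: called when a task finishes/fails/cancels; if the slot
    // was already marked RELEASING, re-attempt the free once empty.
    State removeTask(String taskId) {
        tasks.remove(taskId);
        if (state == State.RELEASING && tasks.isEmpty()) {
            return freeSlot();
        }
        return state;
    }
}
```

The question in this thread then reduces to: what happens if a task is never removed (e.g. its thread is stuck), so the slot stays RELEASING forever?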
[ https://issues.apache.org/jira/browse/FLINK-17560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17104103#comment-17104103 ] josson paul kalapparambath commented on FLINK-17560: [~xintongsong] The Job Manager is completely restarted, e.g. as part of an upgrade process. [https://github.com/apache/flink/blob/d54807ba10d0392a60663f030f9fe0bfa1c66754/flink-runtime/src/main/java/org/apache/flink/runtime/taskexecutor/slot/TaskSlotTable.java#L320] If taskSlot.markFree() does not return true, at what point is the taskSlot marked as free? I am not able to find the code where the slot is set to FREE.
[ https://issues.apache.org/jira/browse/FLINK-17560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17103064#comment-17103064 ] Xintong Song commented on FLINK-17560: -- [~josson] The ZK entry should not be a problem. From what you described, it seems the job cancellation was not completed. But again, without the logs we cannot really tell what's going on. BTW, when you say "Job Manager gets restarted", how exactly is it restarted? Is the whole JM process restarted, or only the job? Is it restarted automatically, or is the job manually stopped and re-submitted?
[ https://issues.apache.org/jira/browse/FLINK-17560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17102802#comment-17102802 ] josson paul kalapparambath commented on FLINK-17560: [~xintongsong] Unfortunately I don't have a log for this. I am trying to reproduce the issue, but it doesn't happen very often. It is not just one or two slot reports that are wrong: once the issue occurs, every slot report sent by the TM is wrong and contains old job IDs, and this continues until I restart the TM. I also noticed that when we cancel a job, the leader/leaderlatch entries in ZooKeeper don't get cleared for that job. Is that expected?
{code:java}
/leader/d8beed9c9261dcf191cc7fde46869b64/job_manager_lock
{code}
[ https://issues.apache.org/jira/browse/FLINK-17560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17102710#comment-17102710 ] Andrey Zagrebin commented on FLINK-17560: - I would also try the latest Flink 1.10 version, if you haven't yet, because there have been other fixes that might affect slot management. I am not sure the community will decide to release further fixes for 1.8.
[ https://issues.apache.org/jira/browse/FLINK-17560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17102192#comment-17102192 ] Xintong Song commented on FLINK-17560: -- Hi [~josson], Could you provide the complete logs for this issue? It is somewhat expected that a slot report contains an old job ID, due to the asynchronism between JM/RM/TM: it can happen when the report the RM received was sent by the TM before the TM received the JM's message to release the slots. However, this usually should not cause scheduling failures, because the RM should notice that the slots have become available when it receives the next slot report. It would help to look at the entire jobmanager/taskmanager logs to understand what goes wrong.
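The self-correcting behavior described above can be sketched as follows, under the assumption (mine, not stated in this thread) that the RM simply overwrites its view of a TM's slots with each periodic report, so a single stale report is repaired by the next fresh one; it is only a problem if, as in this issue, every subsequent report is also stale.

```java
// Hypothetical sketch of report-based reconciliation; not Flink's actual
// SlotManager. A null jobId stands for a free slot.
import java.util.HashMap;
import java.util.Map;

class SlotReportSketch {
    // slotId -> owning jobId, or null when the slot is free
    private final Map<String, String> view = new HashMap<>();

    // Each periodic report fully replaces the previous view ("last report
    // wins"), so stale state survives only as long as the reports stay stale.
    void applyReport(Map<String, String> report) {
        view.clear();
        view.putAll(report);
    }

    long freeSlots() {
        return view.values().stream().filter(j -> j == null).count();
    }
}
```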