[jira] [Commented] (FLINK-17560) No Slots available exception in Apache Flink Job Manager while Scheduling

2020-06-15 Thread Till Rohrmann (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17135609#comment-17135609
 ] 

Till Rohrmann commented on FLINK-17560:
---

I have created an issue to track the problem of reoffering slots which can 
contain unfinished {{Tasks}}: FLINK-18293

> No Slots available exception in Apache Flink Job Manager while Scheduling
> -
>
> Key: FLINK-17560
> URL: https://issues.apache.org/jira/browse/FLINK-17560
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.8.3
> Environment: Flink version 1.8.3
> Session cluster
>Reporter: josson paul kalapparambath
>Priority: Major
> Attachments: jobmgr.log, threaddump-tm.txt, tm.log
>
>
> Set up
> --
> Flink version 1.8.3
> Zookeeper HA cluster
> 1 ResourceManager/Dispatcher (Same Node)
> 1 TaskManager
> 4 pipelines running with various parallelisms
> Issue
> --
> Occasionally when the Job Manager gets restarted we noticed that the 
> pipelines are not getting scheduled. The error reported by the Job 
> Manager is 'not enough slots are available'. This should not be the case 
> because the task manager was deployed with sufficient slots for the number of 
> pipelines/parallelism we have.
> We further noticed that the slot report sent by the task manager contains slots 
> filled with old CANCELLED job IDs. I am not sure why the task manager still 
> holds the details of the old jobs. A thread dump on the task manager confirms 
> that the old pipelines are not running.
> I am aware of https://issues.apache.org/jira/browse/FLINK-12865, but that is 
> not the issue happening in this case.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-17560) No Slots available exception in Apache Flink Job Manager while Scheduling

2020-06-15 Thread Till Rohrmann (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17135587#comment-17135587
 ] 

Till Rohrmann commented on FLINK-17560:
---

Thanks for reporting this issue [~josson]. The root problem in your case is 
that the Task is not properly shutting down. From a resource management 
perspective it makes sense that this slot cannot be reused because it still has 
a resource consumer which might not have properly freed the resources.

The {{TaskCancelerWatchDog}} should kill the task after the configured 
{{task.cancellation.timeout}}. By default it is 3 minutes. But maybe you are 
right that this task is stuck because of a JVM bug. This could explain why the 
{{TaskManager}} process is not being killed by the {{TaskCancelerWatchDog}}.
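For reference, these cancellation settings live in {{flink-conf.yaml}}; the values below are the defaults as of the 1.8.x line (worth verifying against the documentation for your exact version):

```yaml
# Time in milliseconds after which a stuck task cancellation is treated as
# a fatal error (leading to the TaskManager process being killed).
task.cancellation.timeout: 180000

# Interval in milliseconds between repeated interrupts sent to a canceling task.
task.cancellation.interval: 30000
```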

What worries me a bit is that we offer the slot to the new leader (before the JM 
fails) even though it still contains a {{Task}} which has not properly stopped 
yet. The problem is that the new leader does not know about the not-yet-stopped 
{{Task}} and will think that the slot is empty. Hence, when the {{JM}} deploys a 
new {{Task}} into this slot, it might exceed the slot's resources (e.g. network 
buffers, memory, etc.). Maybe we should therefore only offer slots to {{JMs}} if 
they are empty. Once the {{JM}} is able to reconcile its state with a slot which 
can contain running tasks, we could change this behavior.
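A rough sketch of the guard suggested above, filtering the offer set down to slots that no longer contain any task. All names here are made up for illustration; this is not Flink's actual {{TaskExecutor}} code:

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative only: a TaskManager-side filter that withholds non-empty slots
// from the slot offers sent to a (newly elected) JobMaster leader.
class SlotOfferGuard {

    static final class TaskSlot {
        final String allocationId;
        // Execution IDs of tasks still registered in this slot (possibly stuck).
        final Set<String> runningTasks = new HashSet<>();

        TaskSlot(String allocationId) {
            this.allocationId = allocationId;
        }

        boolean isEmpty() {
            return runningTasks.isEmpty();
        }
    }

    /** Only slots without any remaining task are safe to (re-)offer. */
    static List<TaskSlot> offerableSlots(Collection<TaskSlot> slots) {
        List<TaskSlot> offerable = new ArrayList<>();
        for (TaskSlot slot : slots) {
            if (slot.isEmpty()) {
                offerable.add(slot);
            }
        }
        return offerable;
    }
}
```

With the stuck {{Task}} from this issue still registered in its slot, that slot would simply not be offered until the task really terminates.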



[jira] [Commented] (FLINK-17560) No Slots available exception in Apache Flink Job Manager while Scheduling

2020-06-06 Thread Xintong Song (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17127464#comment-17127464
 ] 

Xintong Song commented on FLINK-17560:
--

bq. Anybody reported this issue before.
Not that I'm aware of.

Have you tried an official Flink release? Can this problem be reproduced?



[jira] [Commented] (FLINK-17560) No Slots available exception in Apache Flink Job Manager while Scheduling

2020-06-06 Thread josson paul kalapparambath (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17127426#comment-17127426
 ] 

josson paul kalapparambath commented on FLINK-17560:


[~xintongsong] [~chesnay]

We are using a customized Flink. I do have modifications in the JobManager 
scheduler code. I fixed the ConcurrentModificationException which was happening 
in the Job Manager code, but the original issue still happens. 

If you see my above message (I have also attached the thread dump), you can see 
that Tasks are getting stuck forever in the JVM and the 'finally' block for 
those 'Tasks' is never called. Because of this, the slots will never go into 
the 'FREE' state. 

Has anybody reported this issue before? This happens only if the number of 
threads in the Task Manager is around 950. 
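The failure mode can be illustrated in miniature: if the task thread keeps running despite being interrupted (whether because user code swallows the interrupt, or because of a JVM bug), its {{finally}} block is never reached, and that is where the slot-freeing cleanup would happen. A toy demo, not Flink code:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicBoolean;

// Toy reproduction of the symptom: a task thread that survives interruption,
// so the cleanup in its finally block (where Flink would free the slot and
// report the final task state) is never reached while the thread lives.
class StuckTaskDemo {
    static final AtomicBoolean cleanupRan = new AtomicBoolean(false);
    // Escape hatch so this demo can terminate; the real stuck task had none.
    static final AtomicBoolean escapeHatch = new AtomicBoolean(false);

    static Thread startStuckTask(CountDownLatch started) {
        Thread t = new Thread(() -> {
            try {
                started.countDown();
                while (!escapeHatch.get()) {
                    try {
                        Thread.sleep(10);
                    } catch (InterruptedException ignored) {
                        // The cancellation interrupt is swallowed and the
                        // loop continues: the task can never be cancelled.
                    }
                }
            } finally {
                cleanupRan.set(true); // the slot would be freed here
            }
        }, "stuck-task");
        t.start();
        return t;
    }
}
```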

 



[jira] [Commented] (FLINK-17560) No Slots available exception in Apache Flink Job Manager while Scheduling

2020-06-06 Thread Xintong Song (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17127203#comment-17127203
 ] 

Xintong Song commented on FLINK-17560:
--

[~josson],

Sorry for the late response. I've no idea how I overlooked the update 
notification on this ticket.

As [~chesnay] already mentioned, there's a {{ConcurrentModificationException}} 
on the JM side when it tries to accept the slots offered by the TM.

This is not a known issue that I'm aware of. The error stack does not seem to 
match the code base of Flink 1.8.3. So I have the same question: is this a 
customized Flink version?



[jira] [Commented] (FLINK-17560) No Slots available exception in Apache Flink Job Manager while Scheduling

2020-06-05 Thread Chesnay Schepler (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17126933#comment-17126933
 ] 

Chesnay Schepler commented on FLINK-17560:
--

There's a ConcurrentModificationException in the TM logs when the slots are 
being offered. If this is a bug in the slot allocation protocol then the only 
option I see is to try a later Flink version.

Are you running a customized Flink version?



[jira] [Commented] (FLINK-17560) No Slots available exception in Apache Flink Job Manager while Scheduling

2020-06-05 Thread josson paul kalapparambath (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17126857#comment-17126857
 ] 

josson paul kalapparambath commented on FLINK-17560:


[~xintongsong] Did you get a chance to look at the logs?



[jira] [Commented] (FLINK-17560) No Slots available exception in Apache Flink Job Manager while Scheduling

2020-05-26 Thread josson paul kalapparambath (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116940#comment-17116940
 ] 

josson paul kalapparambath commented on FLINK-17560:


[~xintongsong]

I was able to root-cause this problem to stuck tasks in a highly threaded 
environment.

Below I have explained how I was able to reproduce this issue (not 
consistently, though).

 

*Scenario 1*

Job ID: JobID123 (parallelism 1) 

The TM has only a single slot.
h5. Step 1: Schedule the job

JobID123 is running on the TM in a slot with allocation ID *'AllocationID123'*. 

Now *JobID123* is mapped to allocation ID *'AllocationID123'*.
h5. Step 2: Zookeeper stop 

The Task Manager tries to cancel/fail all the tasks on *'AllocationID123'*, but 
some of the tasks get stuck and never stop. This means the *'finally'* block 
which cleans things up never gets called.

At this point, the above-mentioned tasks are in the *CANCELLING* state, but the 
slot is still in the *'ALLOCATED'* state, allocated to job ID *JobID123*.

I can see that Flink has a cancel-task thread / interrupter thread / watchdog 
thread. So why is this task still stuck? Below I have pasted lines from the 
thread dump from one such instance where we had a stuck Task. Are we hitting 
this issue: [https://bugs.openjdk.java.net/browse/JDK-8227375]?

We are using Java 8.

Note: it is not always this task which is stuck. It can be any task from the 
pipelines.
{code:java}
"OutputFlusher for Source: KUS/snmp_trap_proto/Kafka-Read/Read(KafkaUnboundedSource) -> FlatMap -> KUS/snmp_trap_proto/KafkaRecordToCTuple/ParMultiDo(KafkaRecordToCTuple) -> HSTrapParDoInit117/ParMultiDo(HealthScoreTrapProcParDo) -> ParDo(GroupDataByEntityId)/ParMultiDo(GroupDataByEntityId) -> ApplyHSEventFixedWindow118/Window.Assign.out -> ToKeyedWorkItem" #1542 daemon prio=5 os_prio=0 tid=0x7f5c0daff000 nid=0x638 sleeping[0x7f5b91267000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Method)
        at org.apache.flink.runtime.io.network.api.writer.RecordWriter$OutputFlusher.run(RecordWriter.java:362)

   Locked ownable synchronizers:
        - None


"deviceprocessor" #1538 prio=5 os_prio=0 tid=0x7f5c0ed1a000 nid=0x634 runnable [0x7f5bb45d6000]
   java.lang.Thread.State: TIMED_WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00049fc13cd8> (a java.util.concurrent.SynchronousQueue$TransferStack)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
        at java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(SynchronousQueue.java:460)
        at java.util.concurrent.SynchronousQueue$TransferStack.transfer(SynchronousQueue.java:362)
        at java.util.concurrent.SynchronousQueue.poll(SynchronousQueue.java:941)
        at org.apache.beam.sdk.io.kafka.KafkaUnboundedReader.nextBatch(KafkaUnboundedReader.java:613)
        at org.apache.beam.sdk.io.kafka.KafkaUnboundedReader.advance(KafkaUnboundedReader.java:228)
        at org.apache.beam.runners.flink.metrics.ReaderInvocationUtil.invokeAdvance(ReaderInvocationUtil.java:64)
        at org.apache.beam.runners.flink.translation.wrappers.streaming.io.UnboundedSourceWrapper.run(UnboundedSourceWrapper.java:281)
        - locked <0x000499000420> (a java.lang.Object)
        at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:93)
        at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:57)
        at org.apache.flink.streaming.runtime.tasks.SourceStreamTask.run(SourceStreamTask.java:97)
        at org.apache.flink.streaming.runtime.tasks.StoppableSourceStreamTask.run(StoppableSourceStreamTask.java:45)
        at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:302)
        at org.apache.flink.runtime.taskmanager.Task.run(Task.java:711)
        at java.lang.Thread.run(Thread.java:745)

   Locked ownable synchronizers:
        - None {code}
h5. Step 3: Zookeeper start

Even though the tasks are not fully cancelled, the slot is still in the 
'ALLOCATED' state. Now the TM offers this slot (*AllocationID123*) to the JM. 
The JM doesn't care and happily deploys the tasks. (*Is this intended?*) All is 
well and the job runs fine.
h5. Step 4: Job Manager restart (the problem happens at this stage)

The TM offers the same slot (with allocation ID *AllocationID123*). Now, 
unfortunately, the JM throws an exception (this is a rare scenario and happened 
because of our internal change in the scheduling part; we are fixing it). When 
this happens the TM sees that it still has Tasks running (the old stuck task) 
and changes the status of the slot to *'RELEASING'*. Once this happens, the TM 
doesn't have any more slots to offer to the JM and the pipeline will be in a 

[jira] [Commented] (FLINK-17560) No Slots available exception in Apache Flink Job Manager while Scheduling

2020-05-21 Thread Xintong Song (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17113678#comment-17113678
 ] 

Xintong Song commented on FLINK-17560:
--

If you can reproduce this issue, it would be helpful to provide the logs.



[jira] [Commented] (FLINK-17560) No Slots available exception in Apache Flink Job Manager while Scheduling

2020-05-21 Thread josson paul kalapparambath (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17113567#comment-17113567
 ] 

josson paul kalapparambath commented on FLINK-17560:


[~xintongsong]

I am able to reproduce this issue (not consistently) if the number of threads 
in the Task Manager is very high. If the number of threads on the TM is high 
and the Job Manager is restarted, we sometimes get into this issue. To me it 
looks like some piece of code is not executed in the path of 
notifyFinalState(). Some thread contention? 



[jira] [Commented] (FLINK-17560) No Slots available exception in Apache Flink Job Manager while Scheduling

2020-05-20 Thread Xintong Song (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17112755#comment-17112755
 ] 

Xintong Song commented on FLINK-17560:
--

I'm not entirely sure about this, but AFAIK this should not happen.
- If {{notifyFinalState}} is not executed because there's an exception thrown 
before it, then {{notifyFatalError}} in the following catch-block should be 
called and the JVM process should be terminated.
- I'm not aware of any potential problem that may block the thread. If you look 
at {{cancelOrFailAndCancelInvokableInternal}}, there's an 
{{interruptingThread}} that will interrupt the canceling thread if the user 
code does something blocking. However, this is the part I'm not very familiar 
with, thus not entirely sure about.
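The interrupter [~xintongsong] mentions boils down to a thread that repeatedly interrupts the canceling task until it dies or a deadline passes. A simplified sketch of that pattern (illustrative only; Flink's real logic lives in {{Task}} and its canceler/watchdog classes):

```java
// Simplified version of the "interrupter" pattern: keep interrupting a
// canceling task thread until it terminates or a deadline passes. In Flink
// the watchdog escalates to a fatal TaskManager error on timeout.
class TaskInterrupter {

    /**
     * Interrupt {@code taskThread} every {@code intervalMs} milliseconds until
     * it terminates or {@code timeoutMs} elapses; true if it died in time.
     */
    static boolean interruptUntilDead(Thread taskThread, long intervalMs, long timeoutMs)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (taskThread.isAlive() && System.currentTimeMillis() < deadline) {
            taskThread.interrupt();      // wake the task out of blocking calls
            taskThread.join(intervalMs); // give it a moment to exit
        }
        return !taskThread.isAlive();
    }
}
```

A task parked in a native call that ignores interruption (as speculated with the JDK bug in this ticket) defeats exactly this loop: the interrupts never take effect, and only the timeout escalation remains.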



[jira] [Commented] (FLINK-17560) No Slots available exception in Apache Flink Job Manager while Scheduling

2020-05-20 Thread josson paul kalapparambath (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17112667#comment-17112667
 ] 

josson paul kalapparambath commented on FLINK-17560:


[~xintongsong]

What are the chances that the code below doesn't execute? If this piece of code 
doesn't execute, there is a chance that the slot is not released. 

Can this happen if there is something wrong in the user code (the actual 
transformations)? Or issues with blocked threads?

https://github.com/apache/flink/blob/d54807ba10d0392a60663f030f9fe0bfa1c66754/flink-runtime/src/main/java/org/apache/flink/runtime/taskmanager/Task.java#L840



[jira] [Commented] (FLINK-17560) No Slots available exception in Apache Flink Job Manager while Scheduling

2020-05-11 Thread Xintong Song (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17104186#comment-17104186
 ] 

Xintong Song commented on FLINK-17560:
--

[~josson],
The slot is marked FREE by the same code you linked. The same method will be 
called again when all the remaining tasks are canceled and removed.
You can take a look at {{TaskSlotTable#removeTask}}. The logic in this method 
causes {{TaskSlotTable#freeSlot}} to be called again when all tasks are 
removed. From its call hierarchy you can find that this method is called when a 
task finishes, fails, or is canceled.
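Condensed into a toy model, the lifecycle described above looks roughly like this (not Flink's actual {{TaskSlotTable}}, just the shape of its state transitions):

```java
import java.util.HashSet;
import java.util.Set;

// Toy model of the slot lifecycle; names loosely follow TaskSlotTable/TaskSlot
// but this is not the actual Flink implementation.
class TaskSlotModel {
    enum State { ALLOCATED, RELEASING, FREE }

    State state = State.ALLOCATED;
    final Set<String> tasks = new HashSet<>();

    /** Succeeds only when no tasks remain in the slot. */
    boolean markFree() {
        if (!tasks.isEmpty()) {
            return false;
        }
        state = State.FREE;
        return true;
    }

    /** Free the slot, or park it in RELEASING while tasks are still present. */
    void freeSlot() {
        if (!markFree()) {
            state = State.RELEASING;
        }
    }

    /** Removing the last task retries the free, completing the release. */
    void removeTask(String executionId) {
        tasks.remove(executionId);
        if (state == State.RELEASING && tasks.isEmpty()) {
            freeSlot(); // now markFree() succeeds and the slot becomes FREE
        }
    }
}
```

The connection to this ticket: if one task is stuck forever, {{removeTask}} is never called for it, so the slot stays in RELEASING indefinitely and is never offered again, matching the 'no slots available' symptom.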



[jira] [Commented] (FLINK-17560) No Slots available exception in Apache Flink Job Manager while Scheduling

2020-05-11 Thread josson paul kalapparambath (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17104103#comment-17104103
 ] 

josson paul kalapparambath commented on FLINK-17560:


[~xintongsong]

The Job Manager is completely restarted, for example as part of an upgrade process.

[https://github.com/apache/flink/blob/d54807ba10d0392a60663f030f9fe0bfa1c66754/flink-runtime/src/main/java/org/apache/flink/runtime/taskexecutor/slot/TaskSlotTable.java#L320]

If taskSlot.markFree() does not return true, at what point is the taskSlot 
marked as free? I am not able to find the code where the slot is set to 
FREE.

 



[jira] [Commented] (FLINK-17560) No Slots available exception in Apache Flink Job Manager while Scheduling

2020-05-08 Thread Xintong Song (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17103064#comment-17103064
 ] 

Xintong Song commented on FLINK-17560:
--

[~josson]

The ZK entry should not be a problem.

From what you described, it seems the job cancellation is not completed. But 
again, without the logs we cannot really understand what's going on.

BTW, when you say "Job Manager gets restarted", how exactly is it restarted? Is 
the whole JM process restarted or only the job? Is it restarted automatically, 
or is the job manually stopped and re-submitted?



[jira] [Commented] (FLINK-17560) No Slots available exception in Apache Flink Job Manager while Scheduling

2020-05-08 Thread josson paul kalapparambath (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17102802#comment-17102802
 ] 

josson paul kalapparambath commented on FLINK-17560:


[~xintongsong] Unfortunately I don't have a log for this. I am trying to 
reproduce the issue, but it does not happen very often.

It is not just one or two slot reports that are wrong. When the issue occurs, 
all the slot reports sent by the TM are wrong and contain the old job IDs. This 
continues until I restart the TM.

Also, I noticed that when we cancel a job, the leader/leaderlatch entries in 
ZooKeeper don't get cleared for that job. Is that expected?
{code:java}
/leader/d8beed9c9261dcf191cc7fde46869b64/job_manager_lock
{code}
 



[jira] [Commented] (FLINK-17560) No Slots available exception in Apache Flink Job Manager while Scheduling

2020-05-08 Thread Andrey Zagrebin (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17102710#comment-17102710
 ] 

Andrey Zagrebin commented on FLINK-17560:
-

I would also try the latest Flink 1.10 version, if you have not yet, because 
there have been some other fixes which might affect slot management. I am not 
sure that the community will decide to release further fixes for 1.8.



[jira] [Commented] (FLINK-17560) No Slots available exception in Apache Flink Job Manager while Scheduling

2020-05-07 Thread Xintong Song (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17102192#comment-17102192
 ] 

Xintong Song commented on FLINK-17560:
--

Hi [~josson],
Could you provide the complete logs for this issue?

It is somewhat expected that the slot report contains the old job ID, due to the 
asynchrony between the JM/RM/TM. This can happen when the report the RM received 
was sent by the TM before the TM processed the JM's message to release the 
slots. However, this should usually not cause a scheduling failure, because the 
RM should notice that the slots have become available when it receives the next 
slot report.
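A rough sketch of the reconciliation described above (a toy model, not Flink's SlotManager; all names are illustrative): if the RM's view of slot occupancy is simply overwritten by each incoming slot report, a stale report carrying an old job ID is corrected as soon as the next report arrives.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the JM/RM/TM asynchrony: the ResourceManager's view of slot
// occupancy is overwritten by every slot report, so a stale report (old job
// ID) is self-correcting on the next heartbeat.
public class SlotReportSketch {
    // slotId -> jobId occupying it, or null if the slot is free
    private final Map<Integer, String> rmView = new HashMap<>();

    void applySlotReport(Map<Integer, String> report) {
        rmView.clear();
        rmView.putAll(report);
    }

    long freeSlots(int totalSlots) {
        return totalSlots - rmView.values().stream().filter(j -> j != null).count();
    }

    public static void main(String[] args) {
        SlotReportSketch rm = new SlotReportSketch();

        // Stale report, sent before the TM processed the JM's release message:
        Map<Integer, String> stale = new HashMap<>();
        stale.put(0, "cancelled-job");
        rm.applySlotReport(stale);
        System.out.println(rm.freeSlots(1)); // 0 -- slot still looks occupied

        // Next heartbeat report, after the TM actually freed the slot:
        Map<Integer, String> fresh = new HashMap<>();
        fresh.put(0, null);
        rm.applySlotReport(fresh);
        System.out.println(rm.freeSlots(1)); // 1 -- slot available again
    }
}
```

The bug report describes the opposite behaviour: every subsequent report still carries the cancelled job IDs, i.e. the TM itself never freed the slots, so this overwrite never helps.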

It would be helpful to look into the entire jobmanager/taskmanager logs to 
understand what goes wrong. 
