Re: Flink State Query Server threads stuck in infinite loop with high GC activity on CopyOnWriteStateMap get

2021-03-30 Thread Till Rohrmann
Hi Aashutosh,

The queryable state feature is no longer actively maintained by the
community. I would recommend writing the aggregate counts via a sink to a
key-value store, which you can then query to obtain the results.
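
A minimal sketch of that idea, in plain Java rather than against the Flink
API. Everything here is hypothetical: KeyValueStoreStub stands in for the
client of whatever external store you pick, and AggregateCountSink only shows
the write path that a real sink in the job would implement. The key layout
(event type plus window start) is an assumption as well.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for the client of an external key-value store.
final class KeyValueStoreStub {
    private final Map<String, Long> data = new ConcurrentHashMap<>();

    void put(String key, long value) {
        data.put(key, value);
    }

    // Returns null when the key has never been written.
    Long get(String key) {
        return data.get(key);
    }
}

// Hypothetical sink logic: writes one aggregate count per key and window.
final class AggregateCountSink {
    private final KeyValueStoreStub store;

    AggregateCountSink(KeyValueStoreStub store) {
        this.store = store;
    }

    // Called once per finished aggregate; overwrites any earlier value.
    void invoke(String eventType, long windowStartMillis, long count) {
        store.put(eventType + "@" + windowStartMillis, count);
    }
}
```

External services then read from the store instead of from the task
managers, so no query ever touches the job's state objects.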

Looking at the implementation of CopyOnWriteStateMap, it does not look as
though this class is meant to be accessed concurrently. I suspect that
this is the cause of the infinite loop you are seeing. I think the problem
is that this class was implemented after development of queryable
state had stopped. Sorry for the inconvenience.
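
To illustrate the general point (this is not how Flink implements its state
maps, and the class name below is made up for the example): a map that is
not thread safe can still serve readers on other threads if only one thread
ever mutates it and the readers only ever see immutable copies. A rough
sketch under those assumptions:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical example class; not part of Flink.
final class SnapshotPublishingCounts {
    // Mutated only by the owning writer thread; never handed to other threads.
    private final Map<String, Long> live = new HashMap<>();

    // Unmodifiable copy that any thread may read concurrently.
    private final AtomicReference<Map<String, Long>> published =
            new AtomicReference<>(Collections.<String, Long>emptyMap());

    // Writer thread only: increment the count for a key.
    void increment(String key) {
        live.merge(key, 1L, Long::sum);
    }

    // Writer thread only: copy the live map and publish the copy.
    void publish() {
        published.set(Collections.unmodifiableMap(new HashMap<>(live)));
    }

    // Safe from any thread: reads the most recently published copy.
    long query(String key) {
        Long value = published.get().get(key);
        return value == null ? 0L : value;
    }
}
```

Readers may see slightly stale counts, and each publish copies the whole
map, so this only suits small maps or infrequent publishing.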

I have also pulled in PengFei Li, the author of CopyOnWriteStateMap, who
might be able to give more details.

Cheers,
Till


Flink State Query Server threads stuck in infinite loop with high GC activity on CopyOnWriteStateMap get

2021-03-29 Thread Aashutosh Swarnakar
Hi Folks,



I've recently started using Flink for a pilot project where I need to
aggregate event counts on a per-minute window basis. The state has been made
queryable so that external services can query it via the Flink State
Query API. I am using the memory state backend with a keyed process function
and map state.



I have a simple job running on a 6-node standalone Flink cluster: 1 job
manager and 5 task managers. External services can query the 5 task manager
nodes for Flink state.



The job operates fine as long as external clients are not querying Flink
state, but once external clients start querying the state via the Flink
queryable state client, I observe that the Flink query server threads and the
aggregate task thread get stuck in an infinite loop in the
CopyOnWriteStateMap.get() method. GC activity also peaks at 100%, along
with 100% CPU usage. The task manager nodes are unable to recover from this
situation, and I have to restart the cluster. Let me know if anybody has
faced this issue before.



Any information regarding the questions below would be very helpful.



1. Is this a thread synchronisation issue?

2. Is the CopyOnWriteStateMap class thread safe?

3. Is there a possibility of race conditions when incremental
rehashing is done for CopyOnWriteStateMap?

4. Can this be an issue with state usage in my job implementation? (I am
doing a get and a put on map state for each element processed in the stream.)





I have added the thread dump below, along with the code snippet where the
threads go into an infinite loop.



Task thread:



"aggregates-stream -> Map -> Sink: Cassandra Sink (2/10)#0" - Thread t@76
   java.lang.Thread.State: RUNNABLE
        at org.apache.flink.runtime.state.heap.CopyOnWriteStateMap.get(CopyOnWriteStateMap.java:275)
        at org.apache.flink.runtime.state.heap.StateTable.get(StateTable.java:262)
        at org.apache.flink.runtime.state.heap.StateTable.get(StateTable.java:136)
        at org.apache.flink.runtime.state.heap.HeapMapState.get(HeapMapState.java:86)
        at org.apache.flink.runtime.state.UserFacingMapState.get(UserFacingMapState.java:47)
        at com.cybersource.risk.operator.ProcessAggregatesFunction.processElement(ProcessAggregatesFunction.java:44)
        at com.cybersource.risk.operator.ProcessAggregatesFunction.processElement(ProcessAggregatesFunction.java:20)
        at org.apache.flink.streaming.api.operators.KeyedProcessOperator.processElement(KeyedProcessOperator.java:83)
        at org.apache.flink.streaming.runtime.tasks.OneInputStreamTask$StreamTaskNetworkOutput.emitRecord(OneInputStreamTask.java:187)
        at org.apache.flink.streaming.runtime.io.StreamTaskNetworkInput.processElement(StreamTaskNetworkInput.java:204)
        at org.apache.flink.streaming.runtime.io.StreamTaskNetworkInput.emitNext(StreamTaskNetworkInput.java:174)
        at org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:65)
        at org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:395)
        at org.apache.flink.streaming.runtime.tasks.StreamTask$$Lambda$202/2001022910.runDefaultAction(Unknown Source)
        at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:191)
        at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:609)
        at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:573)
        at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:755)
        at org.apache.flink.runtime.taskmanager.Task.run(Task.java:570)
        at java.lang.Thread.run(Thread.java:748)

   Locked ownable synchronizers:
        - None



Flink State Query Server Threads:



"Flink Queryable State Server Thread 3" - Thread t@136
   java.lang.Thread.State: RUNNABLE
        at org.apache.flink.runtime.state.heap.CopyOnWriteStateMap.incrementalRehash(CopyOnWriteStateMap.java:680)
        at org.apache.flink.runtime.state.heap.CopyOnWriteStateMap.computeHashForOperationAndDoIncrementalRehash(CopyOnWriteStateMap.java:645)
        at org.apache.flink.runtime.state.heap.CopyOnWriteStateMap.get(CopyOnWriteStateMap.java:270)
        at org.apache.flink.runtime.state.heap.StateTable.get(StateTable.java:262)
        at org.apache.flink.runtime.state.heap.StateTable.get(StateTable.java:222)

at