Hi Aashutosh,
The queryable state feature is no longer actively maintained by the
community. What I would recommend instead is to write the aggregate counts
via a sink to a key-value store and to query that store for the results.
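
To make the suggestion concrete, here is a minimal sketch of the pattern in
plain Java. It is not Flink code: KeyValueStore is a hypothetical in-memory
stand-in for whatever external store the sink writes to, and the key layout
(event key plus window start minute) and all method names are assumptions
made for illustration only.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class AggregateStoreSketch {

    /** Hypothetical stand-in for an external key-value store; safe for concurrent readers and writers. */
    static final class KeyValueStore {
        private final Map<String, Long> data = new ConcurrentHashMap<>();

        void put(String key, long value) {
            data.put(key, value);
        }

        Long get(String key) {
            return data.get(key);
        }
    }

    /** Assumed key layout: "<eventKey>@<minute index since epoch>". */
    static String storeKey(String eventKey, long timestampMillis) {
        return eventKey + "@" + (timestampMillis / 60_000L);
    }

    /** What the sink would do for each per-minute aggregate emitted by the job. */
    static void writeAggregate(KeyValueStore store, String eventKey, long windowStartMillis, long count) {
        store.put(storeKey(eventKey, windowStartMillis), count);
    }

    /** What an external service would call instead of the queryable state client. */
    static long queryCount(KeyValueStore store, String eventKey, long timestampMillis) {
        Long count = store.get(storeKey(eventKey, timestampMillis));
        return count == null ? 0L : count;
    }

    public static void main(String[] args) {
        KeyValueStore store = new KeyValueStore();
        // The job emits a count of 42 for the minute starting at t = 120000 ms.
        writeAggregate(store, "login", 120_000L, 42L);
        // Any timestamp inside that minute resolves to the same key.
        System.out.println(queryCount(store, "login", 150_000L));
        // A minute with no aggregate written yet yields 0.
        System.out.println(queryCount(store, "login", 60_000L));
    }
}
```

The point of the design is that external readers only ever touch the store,
never the task's internal state, so queries cannot interfere with the thread
that processes the stream.
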
Looking at the implementation of CopyOnWriteStateMap, it does not look as
if this class is meant to be accessed concurrently. I suspect that this is
the cause of the infinite loop you are seeing. I think the problem is that
this class was implemented after the development of queryable state had
stopped. Sorry for the inconvenience.
I have also pulled in PengFei Li, the author of CopyOnWriteStateMap, who
might be able to give more details.
Cheers,
Till
On Mon, Mar 29, 2021 at 2:59 PM Aashutosh Swarnakar
wrote:
> Hi Folks,
>
>
>
> I've recently started using Flink for a pilot project where I need to
> aggregate event counts on per minute window basis. The state has been made
> queryable so that external services can query the state via Flink State
> Query API. I am using memory state backend with a keyed process function
> and map state.
>
>
>
> I've a simple job running on a 6 node flink standalone cluster. 1 job
> manager and 5 task managers. External services can query the 5 task manager
> nodes for flink state.
>
>
>
> The job operates fine whenever external clients are not querying flink
> state, but once the external clients start querying the flink state via the
> flink queryable client, I observe that the flink query server threads and
> the aggregate task thread get stuck in an infinite loop in the
> CopyOnWriteStateMap.get() method. Also, the GC activity peaks at 100% along
> with 100% CPU usage. The task manager nodes are unable to recover from this
> situation and I have to restart the cluster. Let me know if anybody has
> faced this issue before.
>
>
>
> Any information with regard to the queries below will be very helpful.
>
>
>
> 1. Is this a thread synchronisation issue ?
>
> 2. Is CopyOnWriteStateMap class thread safe ?
>
> 3. Is there a possibility for any race conditions when incremental
> rehashing is done for CopyOnWriteStateMap ?
>
> 4. Can this be an issue with state usage in my job implementation (I am
> doing a get and put on map state for processing each element in the stream)
> ?
>
>
>
>
>
> I have added the thread dump below along with the code snippet where the
> threads go into infinite loop.
>
>
>
> Task thread:
>
>
>
> "aggregates-stream -> Map -> Sink: Cassandra Sink
> (2/10)#0" - Thread t@76
>
>    java.lang.Thread.State: RUNNABLE
>     at org.apache.flink.runtime.state.heap.CopyOnWriteStateMap.get(CopyOnWriteStateMap.java:275)
>     at org.apache.flink.runtime.state.heap.StateTable.get(StateTable.java:262)
>     at org.apache.flink.runtime.state.heap.StateTable.get(StateTable.java:136)
>     at org.apache.flink.runtime.state.heap.HeapMapState.get(HeapMapState.java:86)
>     at org.apache.flink.runtime.state.UserFacingMapState.get(UserFacingMapState.java:47)
>     at com.cybersource.risk.operator.ProcessAggregatesFunction.processElement(ProcessAggregatesFunction.java:44)
>     at com.cybersource.risk.operator.ProcessAggregatesFunction.processElement(ProcessAggregatesFunction.java:20)
>     at org.apache.flink.streaming.api.operators.KeyedProcessOperator.processElement(KeyedProcessOperator.java:83)
>     at org.apache.flink.streaming.runtime.tasks.OneInputStreamTask$StreamTaskNetworkOutput.emitRecord(OneInputStreamTask.java:187)
>     at org.apache.flink.streaming.runtime.io.StreamTaskNetworkInput.processElement(StreamTaskNetworkInput.java:204)
>     at org.apache.flink.streaming.runtime.io.StreamTaskNetworkInput.emitNext(StreamTaskNetworkInput.java:174)
>     at org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:65)
>     at org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:395)
>     at org.apache.flink.streaming.runtime.tasks.StreamTask$$Lambda$202/2001022910.runDefaultAction(Unknown Source)
>     at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:191)
>     at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:609)
>     at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:573)
>     at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:755)
>     at org.apache.flink.runtime.taskmanager.Task.run(Task.java:570)
>     at java.lang.Thread.run(Thread.java:748)