Flink 1.10
________________________________
From: Kyle Weaver <[email protected]>
Sent: Thursday, September 17, 2020 9:34 AM
To: [email protected] <[email protected]>
Subject: Re: flink runner 1.10 checkpoint timeout issue

What is the version of your Flink cluster?

On Wed, Sep 16, 2020 at 9:10 PM Deshpande, Omkar 
<[email protected]> wrote:
Hello,

I recently upgraded to beam-flink-runner-1.10:2.23.0 from 
beam-flink-runner-1.9:2.23.0. My application was working as expected with the 
1.9 runner, but after upgrading, the checkpoints are timing out. Even after 
increasing the timeout significantly, the checkpoints keep failing. I looked 
at the stack dump to check for deadlocks and found none, but this thread 
seems to have been awaiting notification for a long time:

Legacy Source Thread - Source: read/KafkaIO.Read/Read(KafkaUnboundedSource) -> 
Flat Map -> read/Remove Kafka Metadata/ParMultiDo(Anonymous) -> Random key 
assignment SPP/ParMultiDo(RandomPartitioner) -> Window for repartitioning 
SPP/Window.Assign.out -> ToKeyedWorkItem (1/4) awaiting notification on 
[ 0x00000007b83b7958 ], holding [ 0x00000007bc786fd8 ]
at sun.misc.Unsafe.park(Native Method)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1693)
at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
at java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1729)
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895)
at org.apache.flink.runtime.io.network.buffer.LocalBufferPool.requestMemorySegmentBlocking(LocalBufferPool.java:231)


My application is I/O bound, i.e., every record makes a REST call that takes a 
few seconds to complete.
I did not face this issue with the 1.9 runner. What has changed in the 1.10 
runner? Any pointers for debugging?

Omkar
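
For readers trying to interpret the trace above, here is a minimal, 
self-contained sketch of one possible mechanism. It is not Beam or Flink code. 
It assumes that the monitor the source thread is holding is the checkpoint 
lock (the trace does not confirm this) and that the downstream operators are 
too slow to free network buffers. The class name, the one-slot queue standing 
in for the buffer pool, and the method checkpointCanStart are all hypothetical 
names chosen for illustration.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical simulation, not Beam or Flink code.
class BackPressureSketch {

    /** Returns true if the simulated checkpoint gets the lock within waitMillis. */
    static boolean checkpointCanStart(long waitMillis) throws InterruptedException {
        // Stand-in for the network buffer pool: a single free buffer.
        BlockingQueue<Integer> buffers = new ArrayBlockingQueue<>(1);
        // Stand-in for the lock a checkpoint needs (assumption, see lead-in).
        ReentrantLock checkpointLock = new ReentrantLock();
        CountDownLatch sourceHoldsLock = new CountDownLatch(1);

        Thread source = new Thread(() -> {
            checkpointLock.lock();
            try {
                sourceHoldsLock.countDown();
                for (int record = 0; ; record++) {
                    // Nothing drains the queue here, which models a downstream
                    // stage stuck in slow calls: the second put() blocks while
                    // the lock is still held.
                    buffers.put(record);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                checkpointLock.unlock();
            }
        }, "source");
        source.setDaemon(true);
        source.start();

        // Wait until the source owns the lock, then try to "checkpoint".
        sourceHoldsLock.await();
        boolean acquired = checkpointLock.tryLock(waitMillis, TimeUnit.MILLISECONDS);
        if (acquired) {
            checkpointLock.unlock();
        }

        // Clean up the simulated source thread.
        source.interrupt();
        source.join();
        return acquired;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("checkpoint could start: " + checkpointCanStart(200));
        // prints "checkpoint could start: false"
    }
}
```

In this sketch a longer wait does not help, because the source never releases 
the lock while it is blocked on a buffer. If the real job behaves similarly, 
the useful things to check would be back pressure in the downstream operators 
rather than the timeout value, but that is an inference from the sketch, not 
something the trace proves.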
