Hi Yun,

I have captured the heap dump, which includes the thread stacks. There is a lock held by a thread in the Elasticsearch sink operator. Screenshot from JProfiler:
https://github.com/sushantbprise/flink-dashboard/tree/master/thread-blocked
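To illustrate the pattern I suspect, here is a minimal hypothetical sketch (not our actual sink code; BlockingFlushSink, BATCH_SIZE and flushToElasticsearch are made up for illustration): a sink that performs a synchronous flush inside invoke() keeps the task thread, and with it the checkpoint lock, blocked for as long as the flush takes.

import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

import java.util.ArrayList;
import java.util.List;

public class BlockingFlushSink extends RichSinkFunction<String> {

    private static final int BATCH_SIZE = 1000;
    private final List<String> batch = new ArrayList<>();

    @Override
    public void invoke(String value, Context context) throws Exception {
        batch.add(value);
        if (batch.size() >= BATCH_SIZE) {
            // Synchronous bulk request: if the cluster is slow or a retry
            // backoff kicks in, this can block the task thread for minutes,
            // and no checkpoint barrier can be processed in the meantime.
            flushToElasticsearch(batch);
            batch.clear();
        }
    }

    // Hypothetical placeholder for a blocking bulk call to Elasticsearch.
    private void flushToElasticsearch(List<String> records) throws Exception {
    }
}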
And I see many threads in the waiting state. I found this link, which describes a problem similar to mine:
http://mail-archives.apache.org/mod_mbox/flink-user/201710.mbox/%3cfdd0a701-d6fe-433f-b343-19fd24cb3...@data-artisans.com%3e
How can I overcome this condition?

Thanks & Regards,
Sushant Sawant

On Fri, 30 Aug 2019, 12:48 Yun Tang, <myas...@live.com> wrote:

> Hi Sushant
>
> What confuses me is why the source task cannot complete its checkpoint
> within 3 minutes [1]. If no sub-task has ever completed the checkpoint,
> that means even the source task cannot complete it, although a source
> task does not need to buffer any data. From what I can see, it might be
> blocked on acquiring the lock that the stream task's main thread holds
> while processing elements [2]. Could you use jstack to capture your Java
> process's threads, so we can see what is happening when the checkpoint
> fails?
>
> [1]
> https://github.com/sushantbprise/flink-dashboard/blob/master/failed-checkpointing/state2.png
> [2]
> https://github.com/apache/flink/blob/ccc7eb431477059b32fb924104c17af953620c74/flink-streaming-java/src/main/java/org/apache/flink/streaming/runtime/tasks/StreamTask.java#L758
>
> Best
> Yun Tang
> ------------------------------
> *From:* Sushant Sawant <sushantsawant7...@gmail.com>
> *Sent:* Tuesday, August 27, 2019 15:01
> *To:* user <user@flink.apache.org>
> *Subject:* Re: checkpoint failure suddenly even state size less than 1 mb
>
> Hi team,
> Can anyone help or offer a suggestion? We have now stopped all input into
> Kafka; there is no processing and no sinking, but checkpointing still
> fails.
> Is it the case that once a checkpoint fails, it keeps failing forever
> until the job is restarted?
>
> Help appreciated.
>
> Thanks & Regards,
> Sushant Sawant
>
> On 23 Aug 2019 12:56 p.m., "Sushant Sawant" <sushantsawant7...@gmail.com>
> wrote:
>
> Hi all,
> I am facing two issues which I believe are related:
> 1. The Kafka source shows high back pressure.
> 2. Sudden checkpoint failures for an entire day, until restart.
>
> My job does the following:
> a. Reads from Kafka
> b. Makes async I/O calls to an external system
> c. Sinks to Cassandra and Elasticsearch
>
> Checkpointing uses the filesystem state backend.
> This Flink job is proven under high load, at a throughput of around
> 5000 records/sec. But recently we scaled down the parallelism because
> there wasn't any load in production, and then these issues started.
>
> Please find the status shown by the Flink dashboard.
> This folder contains images from when there was high back pressure and
> checkpoint failure:
> https://github.com/sushantbprise/flink-dashboard/tree/master/failed-checkpointing
> and this folder contains the "everything is fine" images from after the
> restart:
> https://github.com/sushantbprise/flink-dashboard/tree/master/working-checkpointing
>
> --
> Could anyone point me in the right direction as to what could have gone
> wrong / how to troubleshoot?
>
> Thanks & Regards,
> Sushant Sawant
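For anyone who finds this thread later: below is a minimal sketch of the legacy SourceFunction contract that [2] in Yun's mail is about (the class and field names are mine, for illustration only). Elements are emitted under the checkpoint lock, and triggering a checkpoint needs that same lock, so a back-pressured collect() call stalls checkpointing.

import org.apache.flink.streaming.api.functions.source.SourceFunction;

public class LockHoldingSource implements SourceFunction<Long> {

    private volatile boolean running = true;

    @Override
    public void run(SourceContext<Long> ctx) throws Exception {
        long counter = 0;
        while (running) {
            // The legacy source contract: emit under the checkpoint lock.
            synchronized (ctx.getCheckpointLock()) {
                // collect() can block on back pressure while the lock is
                // held, so the checkpoint trigger has to wait right here.
                ctx.collect(counter++);
            }
        }
    }

    @Override
    public void cancel() {
        running = false;
    }
}

Once back pressure clears, for example when the Elasticsearch sink unblocks, the lock is released and checkpoints can go through again.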