Hi,

One more thing. This looks like it is not a Flink issue but a JDK bug. Others have reported that upgrading the JDK (for example to jdk1.8.0_251) seemed to solve the problem. Which JDK version are you using?
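In case it helps, here is a minimal sketch for checking which JDK a JVM is actually running on. The class name `JdkVersionCheck` and the helper `updateNumber` are made up for illustration; only the standard `java.version`, `java.vendor` and `java.home` system properties are used. It would need to be run with the same `java` binary that the YARN containers use, which may differ from the one on the client machine.

```java
public class JdkVersionCheck {

    // Hypothetical helper: extracts the update number from a legacy
    // version string such as "1.8.0_251". Returns -1 when the string
    // has no "_<update>" part (for example "11.0.8").
    static int updateNumber(String version) {
        int idx = version.indexOf('_');
        if (idx < 0) {
            return -1;
        }
        String tail = version.substring(idx + 1);
        // Keep only the leading digits, so a suffix like "-b03" is ignored.
        int end = 0;
        while (end < tail.length() && Character.isDigit(tail.charAt(end))) {
            end++;
        }
        return end == 0 ? -1 : Integer.parseInt(tail.substring(0, end));
    }

    public static void main(String[] args) {
        String version = System.getProperty("java.version");
        System.out.println("java.version = " + version);
        System.out.println("java.vendor  = " + System.getProperty("java.vendor"));
        System.out.println("java.home    = " + System.getProperty("java.home"));

        int update = updateNumber(version);
        if (update >= 0) {
            // 251 is only the example mentioned above, not a confirmed threshold.
            System.out.println("JDK 8 style update number: " + update);
        }
    }
}
```

Running `java -version` on one of the worker nodes gives the same information if that is easier.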
Piotrek

On Fri, 9 Oct 2020 at 17:59, Piotr Nowojski <pnowoj...@apache.org> wrote:

> Hi,
>
> Thanks for reporting the problem. I think this is a known issue [1] that
> we are working to fix.
>
> Piotrek
>
> [1] https://issues.apache.org/jira/browse/FLINK-18196
>
> On Mon, 5 Oct 2020 at 08:54, Binh Nguyen Van <binhn...@gmail.com> wrote:
>
>> Hi,
>>
>> I have a streaming job that is written in Apache Beam and uses Flink as
>> its runner. The job worked as expected for about 15 hours and then
>> started to have checkpointing errors. The error message looks like this:
>>
>> java.lang.Exception: Could not perform checkpoint 910 for operator Source: <source-name> (8/60).
>>     at org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpoint(StreamTask.java:785)
>>     at org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$triggerCheckpointAsync$3(StreamTask.java:760)
>>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>     at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$SynchronizedStreamTaskActionExecutor.run(StreamTaskActionExecutor.java:87)
>>     at org.apache.flink.streaming.runtime.tasks.mailbox.Mail.run(Mail.java:78)
>>     at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMail(MailboxProcessor.java:261)
>>     at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:186)
>>     at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:485)
>>     at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:469)
>>     at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:708)
>>     at org.apache.flink.runtime.taskmanager.Task.run(Task.java:533)
>>     at java.lang.Thread.run(Thread.java:745)
>> Caused by: java.lang.NullPointerException
>>     at org.apache.flink.streaming.runtime.tasks.StreamTask$CheckpointingOperation.executeCheckpointing(StreamTask.java:1394)
>>     at org.apache.flink.streaming.runtime.tasks.StreamTask.checkpointState(StreamTask.java:974)
>>     at org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$performCheckpoint$5(StreamTask.java:870)
>>     at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$SynchronizedStreamTaskActionExecutor.runThrowing(StreamTaskActionExecutor.java:94)
>>     at org.apache.flink.streaming.runtime.tasks.StreamTask.performCheckpoint(StreamTask.java:843)
>>     at org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpoint(StreamTask.java:776)
>>     ... 11 more
>>
>> When this happens, I have to stop the job and start it again, and
>> then 15 hours later the issue happens again.
>>
>> Here is some additional information:
>>
>> - Flink version is 1.10.1
>> - The job reads data from Kafka, transforms it, and then writes to Kafka
>> - There are 6 tasks with a parallelism of 60 each (each task reads
>>   from 1 Kafka topic)
>> - The job is deployed on YARN with 60 task managers, and each
>>   task manager has 1 slot
>> - The state backend is filesystem and HDFS is the storage (this doesn’t
>>   seem to be related to the type of state backend, since the issue also
>>   happened when I used memory as the state backend)
>> - The checkpointing interval is 60 seconds (the longest duration of
>>   a normal checkpoint as shown in the Flink UI is 14 seconds)
>> - The minimum pause between checkpoints is 30 seconds
>> - The Hadoop cluster is Kerberized but Kafka is not. Keytab and principal
>>   are set in the Flink configuration file
>>
>> Can someone please help?
>>
>> Thanks
>> -Binh
>>
>