Hi Yun, Thanks for the response. I checked the mounts and only the JM's and TM's are mounted with this EFS. Not sure how to debug this.
Thanks On Sun, Mar 7, 2021 at 8:29 PM Yun Gao <yungao...@aliyun.com> wrote: > Hi Navneeth, > > It seems from the stack that the exception is caused by the underlying EFS > problems ? Have you checked > if there are errors reported for EFS, or if there might be duplicate > mounting for the same EFS and others > have ever deleted the directory? > > Best, > Yun > > > ------------------Original Mail ------------------ > *Sender:*Navneeth Krishnan <reachnavnee...@gmail.com> > *Send Date:*Sun Mar 7 15:44:59 2021 > *Recipients:*user <user@flink.apache.org> > *Subject:*Re: Checkpoint Error > >> Hi All, >> >> Any suggestions? >> >> Thanks >> >> On Mon, Jan 18, 2021 at 7:38 PM Navneeth Krishnan < >> reachnavnee...@gmail.com> wrote: >> >>> Hi All, >>> >>> We are running our streaming job on flink 1.7.2 and we are noticing the >>> below error. Not sure what's causing it, any pointers would help. We have >>> 10 TM's checkpointing to AWS EFS. >>> >>> AsynchronousException{java.lang.Exception: Could not materialize checkpoint >>> 11 for operator Processor -> Sink: KafkaSink (34/42).}at >>> org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointExceptionHandler.tryHandleCheckpointException(StreamTask.java:1153)at >>> >>> org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.handleExecutionException(StreamTask.java:947)at >>> >>> org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:884)at >>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)at >>> java.util.concurrent.FutureTask.run(FutureTask.java:266)at >>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)at >>> >>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)at >>> java.lang.Thread.run(Thread.java:748)Caused by: java.lang.Exception: Could >>> not materialize checkpoint 11 for operator Processor -> Sink: KafkaSink >>> (34/42).at >>> org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.handleExecutionException(StreamTask.java:942)... >>> 6 moreCaused by: java.util.concurrent.ExecutionException: >>> java.io.IOException: Could not flush and close the file system output >>> stream to >>> file:/mnt/checkpoints/a300d1b0fd059f3f83ce35a8042e89c8/chk-11/1cd768bd-3408-48a9-ad48-b005f66b130d >>> in order to obtain the stream state handleat >>> java.util.concurrent.FutureTask.report(FutureTask.java:122)at >>> java.util.concurrent.FutureTask.get(FutureTask.java:192)at >>> org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:53)at >>> org.apache.flink.streaming.api.operators.OperatorSnapshotFinalizer.<init>(OperatorSnapshotFinalizer.java:53)at >>> >>> org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:853)... >>> 5 moreCaused by: java.io.IOException: Could not flush and close the file >>> system output stream to >>> file:/mnt/checkpoints/a300d1b0fd059f3f83ce35a8042e89c8/chk-11/1cd768bd-3408-48a9-ad48-b005f66b130d >>> in order to obtain the stream state handleat >>> org.apache.flink.runtime.state.filesystem.FsCheckpointStreamFactory$FsCheckpointStateOutputStream.closeAndGetHandle(FsCheckpointStreamFactory.java:326)at >>> >>> org.apache.flink.runtime.state.DefaultOperatorStateBackend$DefaultOperatorStateBackendSnapshotStrategy$1.callInternal(DefaultOperatorStateBackend.java:767)at >>> >>> org.apache.flink.runtime.state.DefaultOperatorStateBackend$DefaultOperatorStateBackendSnapshotStrategy$1.callInternal(DefaultOperatorStateBackend.java:696)at >>> >>> org.apache.flink.runtime.state.AsyncSnapshotCallable.call(AsyncSnapshotCallable.java:76)at >>> java.util.concurrent.FutureTask.run(FutureTask.java:266)at >>> org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:50)... >>> 7 moreCaused by: java.io.IOException: Stale file handleat >>> java.io.FileOutputStream.close0(Native Method)at >>> java.io.FileOutputStream.access$000(FileOutputStream.java:53)at >>> java.io.FileOutputStream$1.close(FileOutputStream.java:356)at >>> java.io.FileDescriptor.closeAll(FileDescriptor.java:212)at >>> java.io.FileOutputStream.close(FileOutputStream.java:354)at >>> org.apache.flink.core.fs.local.LocalDataOutputStream.close(LocalDataOutputStream.java:62)at >>> >>> org.apache.flink.core.fs.ClosingFSDataOutputStream.close(ClosingFSDataOutputStream.java:64)at >>> >>> org.apache.flink.runtime.state.filesystem.FsCheckpointStreamFactory$FsCheckpointStateOutputStream.closeAndGetHandle(FsCheckpointStreamFactory.java:312)... >>> 12 more >>> >>> >>> Thanks >>> >>>