We fixed this by upgrading Flink. We are limited to the versions of Flink that EMR supports. The version that failed for us was 1.12.2, which depends on RocksDB 5.17.2-artisans-2. That RocksDB version is affected by this bug:
https://github.com/facebook/rocksdb/issues/5488

You can see the version of RocksDB used by Flink 1.12.2 here:
https://github.com/apache/flink/blob/release-1.12.2/flink-state-backends/flink-statebackend-rocksdb/pom.xml

The version that works is here:
https://github.com/apache/flink/blob/release-1.14.2/flink-state-backends/flink-statebackend-rocksdb/pom.xml

Thus, Flink 1.12.2 was using a version of RocksDB with a known bug.

> On Sep 2, 2022, at 10:49 AM, Marco Villalobos <mvillalo...@kineteque.com> wrote:
>
> What is the recommended solution for this error of too many files open during a checkpoint?
>
> 2022-09-02 10:04:56
> java.io.IOException: Could not perform checkpoint 119366 for operator tag enrichment (3/4)#104.
>     at org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpointOnBarrier(StreamTask.java:968)
>     at org.apache.flink.streaming.runtime.io.CheckpointBarrierHandler.notifyCheckpoint(CheckpointBarrierHandler.java:115)
>     at org.apache.flink.streaming.runtime.io.SingleCheckpointBarrierHandler.processBarrier(SingleCheckpointBarrierHandler.java:156)
>     at org.apache.flink.streaming.runtime.io.CheckpointedInputGate.handleEvent(CheckpointedInputGate.java:178)
>     at org.apache.flink.streaming.runtime.io.CheckpointedInputGate.pollNext(CheckpointedInputGate.java:155)
>     at org.apache.flink.streaming.runtime.io.StreamTaskNetworkInput.emitNext(StreamTaskNetworkInput.java:179)
>     at org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:65)
>     at org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:395)
>     at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:191)
>     at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:609)
>     at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:573)
>     at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:755)
>     at org.apache.flink.runtime.taskmanager.Task.run(Task.java:570)
>     at java.lang.Thread.run(Thread.java:750)
> Caused by: org.apache.flink.runtime.checkpoint.CheckpointException: Could not complete snapshot 119366 for operator tag enrichment (3/4)#104. Failure reason: Checkpoint was declined.
>     at org.apache.flink.streaming.api.operators.StreamOperatorStateHandler.snapshotState(StreamOperatorStateHandler.java:241)
>     at org.apache.flink.streaming.api.operators.StreamOperatorStateHandler.snapshotState(StreamOperatorStateHandler.java:162)
>     at org.apache.flink.streaming.api.operators.AbstractStreamOperator.snapshotState(AbstractStreamOperator.java:371)
>     at org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.checkpointStreamOperator(SubtaskCheckpointCoordinatorImpl.java:685)
>     at org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.buildOperatorSnapshotFutures(SubtaskCheckpointCoordinatorImpl.java:606)
>     at org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.takeSnapshotSync(SubtaskCheckpointCoordinatorImpl.java:571)
>     at org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.checkpointState(SubtaskCheckpointCoordinatorImpl.java:298)
>     at org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$performCheckpoint$9(StreamTask.java:1003)
>     at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing(StreamTaskActionExecutor.java:50)
>     at org.apache.flink.streaming.runtime.tasks.StreamTask.performCheckpoint(StreamTask.java:993)
>     at org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpointOnBarrier(StreamTask.java:951)
>     ... 13 more
> Caused by: org.rocksdb.RocksDBException: While open a file for appending: /mnt/yarn/usercache/hadoop/appcache/application_1631124824249_0061/flink-io-7f392e48-d086-492b-960b-1c56d0f864a0/job_a5b70dea0d3c27b2798c53df49065433_op_KeyedProcessOperator_a91e7e58fb0d0cb4a427ff0c6489016c__3_4__uuid_252bcc06-8857-4153-a866-2e6b3f50c4bb/chk-119366.tmp/MANIFEST-423131: Too many open files
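
P.S. For anyone who wants to compare the RocksDB dependency across Flink releases without reading each pom.xml by hand, here is a rough Python sketch. It only parses POM text that you have already downloaded. The function name `rocksdb_dependencies`, the substring match on "rocksdb", and the sample POM (with made-up coordinates and version) are my own assumptions for illustration, not anything taken from Flink.

```python
import xml.etree.ElementTree as ET

# Maven POM files declare this default XML namespace.
POM_NS = {"m": "http://maven.apache.org/POM/4.0.0"}


def rocksdb_dependencies(pom_xml):
    """Return (groupId, artifactId, version) for each dependency whose
    artifactId contains "rocksdb" (an assumed naming convention)."""
    root = ET.fromstring(pom_xml.strip())
    found = []
    for dep in root.iterfind(".//m:dependency", POM_NS):
        artifact = dep.findtext("m:artifactId", default="", namespaces=POM_NS)
        if "rocksdb" in artifact.lower():
            group = dep.findtext("m:groupId", default="", namespaces=POM_NS)
            # The version may be a literal or a ${property} placeholder;
            # this sketch returns it unresolved either way.
            version = dep.findtext("m:version", default="", namespaces=POM_NS)
            found.append((group, artifact, version))
    return found


# Hypothetical POM fragment; coordinates and versions are invented.
SAMPLE_POM = """
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <dependencies>
    <dependency>
      <groupId>org.example</groupId>
      <artifactId>example-rocksdbjni</artifactId>
      <version>0.0.0-example</version>
    </dependency>
    <dependency>
      <groupId>org.example</groupId>
      <artifactId>unrelated-lib</artifactId>
      <version>1.2.3</version>
    </dependency>
  </dependencies>
</project>
"""

if __name__ == "__main__":
    for group, artifact, version in rocksdb_dependencies(SAMPLE_POM):
        print(f"{group}:{artifact}:{version}")
```

Run it against the text of each release's pom.xml and compare the results. It does not resolve Maven properties or parent POMs, so treat the output as a quick check rather than the resolved dependency tree.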