Thanks Stephan. We can confirm that turning off RocksDB incremental checkpointing seems to help and greatly reduces the number of files (from tens of thousands to low thousands).
We still see an inflection point: running more than 50 jobs causes the appmaster to stop deleting files from S3 and leads to unbounded growth of the S3 recovery directory (slower without incremental checkpointing). We would be willing to try a patched version (we already have a fork)... just to confirm, are you suggesting we delete the line `fs.delete(filePath, false);` from discardState()?

```
@Override
public void discardState() throws Exception {
    FileSystem fs = getFileSystem();
    fs.delete(filePath, false);
    try {
        FileUtils.deletePathIfEmpty(fs, filePath.getParent());
    } catch (Exception ignored) {}
}
```
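For illustration, here is a minimal standalone sketch (plain `java.nio.file`, not the Flink `FileSystem` API; the class and method names are hypothetical) of what the method's behavior would be with that delete line removed: the state file itself is left in place, and only an already-empty parent directory gets pruned.

```java
import java.io.IOException;
import java.nio.file.DirectoryNotEmptyException;
import java.nio.file.Files;
import java.nio.file.Path;

// Standalone sketch, NOT Flink code: mimics discardState() after removing
// the `fs.delete(filePath, false);` line.
public class DiscardStateSketch {

    static void discardState(Path filePath) {
        // fs.delete(filePath, false);  <-- the line under discussion, removed:
        // the state file is left for the checkpoint cleanup path to handle.
        try {
            // Delete the parent directory only if it is already empty,
            // mirroring FileUtils.deletePathIfEmpty(fs, filePath.getParent()).
            // Files.delete refuses to remove a non-empty directory.
            Files.delete(filePath.getParent());
        } catch (DirectoryNotEmptyException | java.nio.file.NoSuchFileException ignored) {
            // Parent not empty (or already gone): nothing to do.
        } catch (IOException ignored) {
            // Swallow, as the original method does.
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("recovery");
        Path state = Files.createTempFile(dir, "state", ".bin");
        discardState(state);
        // The state file survives, so its parent is non-empty and survives too.
        System.out.println(Files.exists(state) + " " + Files.exists(dir));
    }
}
```

With the delete removed, individual discards no longer touch the file in S3, so cleanup would have to happen at a coarser granularity (e.g. deleting the whole checkpoint directory).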