Hi Gabor.

Thanks for chiming in. I think it is failing, but I could be mistaken. There are no errors in the log; everything looks fine. However, when I inspect the _metadata file, I can see references to other files that are not present at the given locations. Here is an example.

Flink.log (time order is newer first):
2025-06-10 15:25:40.983 2025-06-10 13:25:40,983 INFO org.apache.hadoop.fs.azure.AzureFileSystemThreadPoolExecutor [] - Time taken for Delete operation is: 0 ms with threads: 0
2025-06-10 15:25:40.983 2025-06-10 13:25:40,983 WARN org.apache.hadoop.fs.azure.AzureFileSystemThreadPoolExecutor [] - Disabling threads for Delete operation as thread count 0 is <= 1
2025-06-10 15:25:40.936 2025-06-10 13:25:40,936 INFO org.apache.flink.runtime.source.coordinator.SourceCoordinator [] - Marking checkpoint 1425 as completed for source Source: Kafka source.
2025-06-10 15:25:40.936 2025-06-10 13:25:40,936 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed checkpoint 1425 for job 3acd203bc1b74b65803d14c9cad2df32 (3397 bytes, checkpointDuration=134 ms, finalizationTime=188 ms).
2025-06-10 15:25:40.669 2025-06-10 13:25:40,669 INFO org.apache.flink.runtime.state.filesystem.FsCheckpointMetadataOutputStream [] - Cannot create recoverable writer due to Recoverable writers on AzureBlob are only supported for ABFS, will use the ordinary writer.
2025-06-10 15:25:40.628 2025-06-10 13:25:40,628 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering checkpoint 1425 (type=CheckpointType{name='Checkpoint', sharingFilesStrategy=FORWARD_BACKWARD}) @ 1749561940614 for job 3acd203bc1b74b65803d14c9cad2df32.

References in the _metadata file:
wasbs://flink-storage@${account}.blob.core.windows.net/flink-checkpoints/3acd203bc1b74b65803d14c9cad2df32/chk-1425/c628d0ed-bbdd-4edd-bfa5-c53c60da5d43
wasbs://flink-storage@${account}.blob.core.windows.net/flink-checkpoints/3acd203bc1b74b65803d14c9cad2df32/chk-1425/6eab4448-6080-4fef-8503-7342dc407b9c
wasbs://flink-storage@${account}.blob.core.windows.net/flink-checkpoints/3acd203bc1b74b65803d14c9cad2df32/chk-1425/685aa9e3-0260-4240-b5de-249f8d9a2683
wasbs://flink-storage@${account}.blob.core.windows.net/flink-checkpoints/3acd203bc1b74b65803d14c9cad2df32/chk-1425/9ebd7108-0073-4dd0-b047-56b69e21179b

So, if I understand things correctly, those 4 files should be in the chk-1425 folder, but it contains only the _metadata file. And this really is all there is in the logs. The Task Manager is emitting some warnings about metric name collisions, but those should be irrelevant.

Am I raising a false alarm here? Would you need to inspect the _metadata file as well? Or can I do a better job of analyzing it?

Nikola.

From: Gabor Somogyi <gabor.g.somo...@gmail.com>
Date: Tuesday, June 10, 2025 at 10:52 AM
To: Nikola Milutinovic <n.milutino...@levi9.com>
Cc: Flink Users <user@flink.apache.org>
Subject: Re: Savepoints and Checkpoints missing files

Hi Nikola,

Fails how? A stack trace or error message would be helpful.

G

On Tue, Jun 10, 2025 at 10:48 AM Nikola Milutinovic <n.milutino...@levi9.com> wrote:

Hello.

We are running Flink 1.20.1 on Kubernetes (AKS). We have observed a consistent error: both checkpoints and savepoints save only the “_metadata” file and nothing else. Sometimes this is OK, because all the data is in that one file. But sometimes “_metadata” holds references to other files, which are not present.

I understand that if the state is smaller than a configured limit, it is stored only in that one file, and if it is larger, it spills over into additional files.
Our state is generally minuscule, so it should always fit into _metadata, but sometimes I can inspect the _metadata file and see references to those additional files. Trying to restore from such a savepoint or checkpoint always fails.

Does anyone know of a reason for this behavior?

This is our configuration (relevant parts; I have substituted our account with a variable):

high-availability.type: kubernetes
high-availability.cluster-id: flink-cluster-session-cluster
high-availability.storageDir: wasbs://flink-storage@${account}.blob.core.windows.net/data
high-availability.jobmanager.port: 6123
state.backend.type: rocksdb
execution.checkpointing.num-retained: 3
execution.checkpointing.savepoint-dir: wasbs://flink-storage@${account}.blob.core.windows.net/flink-savepoints
execution.checkpointing.mode: EXACTLY_ONCE
execution.checkpointing.incremental: true
execution.checkpointing.interval: 60000
execution.checkpointing.timeout: 300000
$internal.flink.version: v1_20
execution.checkpointing.storage: filesystem
execution.checkpointing.dir: wasbs://flink-storage@${account}.blob.core.windows.net/flink-checkpoints
execution.checkpointing.externalized-checkpoint-retention: RETAIN_ON_CANCELLATION
execution.checkpointing.min-pause: 5000
execution.target: kubernetes-session
fs.azure.account.keyprovider.${account}.blob.core.windows.net: org.apache.flink.fs.azurefs.EnvironmentVariableKeyProvider
env.java.opts.all: --add-exports=java.base/sun.net.util=ALL-UNNAMED --add-exports=java.rmi/sun.rmi.registry=ALL-UNNAMED --add-exports=jdk.compiler/com.sun.tools.javac.api=ALL-UNNAMED --add-exports=jdk.compiler/com.sun.tools.javac.file=ALL-UNNAMED --add-exports=jdk.compiler/com.sun.tools.javac.parser=ALL-UNNAMED --add-exports=jdk.compiler/com.sun.tools.javac.tree=ALL-UNNAMED --add-exports=jdk.compiler/com.sun.tools.javac.util=ALL-UNNAMED --add-exports=java.security.jgss/sun.security.krb5=ALL-UNNAMED --add-opens=java.base/java.lang=ALL-UNNAMED --add-opens=java.base/java.net=ALL-UNNAMED --add-opens=java.base/java.io=ALL-UNNAMED --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/sun.nio.ch=ALL-UNNAMED --add-opens=java.base/java.lang.reflect=ALL-UNNAMED --add-opens=java.base/java.text=ALL-UNNAMED --add-opens=java.base/java.time=ALL-UNNAMED --add-opens=java.base/java.util=ALL-UNNAMED --add-opens=java.base/java.util.concurrent=ALL-UNNAMED --add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED --add-opens=java.base/java.util.concurrent.locks=ALL-UNNAMED

Nikola.
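
For anyone who wants to repeat the check described in this thread (which files does a _metadata file point at, and which of them are missing), here is a minimal Python sketch. It is only a heuristic and not Flink's own metadata reader: it assumes the wasbs:// paths are stored as plain ASCII strings inside the binary _metadata file and simply scans the raw bytes for them. The function names and the `present_blobs` parameter are made up for illustration; `present_blobs` is assumed to be a set of blob names from a container listing obtained by whatever means you prefer.

```python
import re
from urllib.parse import urlsplit

# Heuristic (assumption): state handle paths show up as printable ASCII
# runs starting with "wasbs://" inside the binary _metadata file.
URI_PATTERN = re.compile(rb"wasbs://[\x21-\x7e]+")


def referenced_files(metadata_bytes: bytes) -> list[str]:
    """Return the distinct wasbs:// URIs found in the raw bytes, in first-seen order."""
    matches = (m.group().decode("ascii") for m in URI_PATTERN.finditer(metadata_bytes))
    return list(dict.fromkeys(matches))


def missing_files(metadata_bytes: bytes, present_blobs: set[str]) -> list[str]:
    """Return referenced URIs whose blob name (the URI path without the
    leading slash) is not in present_blobs."""
    missing = []
    for uri in referenced_files(metadata_bytes):
        blob_name = urlsplit(uri).path.lstrip("/")
        if blob_name not in present_blobs:
            missing.append(uri)
    return missing
```

To use it, read the file with `open(path, "rb").read()` and pass the bytes together with the set of blob names from the container. Because a printable byte that happens to follow a path would be swallowed by the pattern, treat the output as a hint rather than proof. Also note that with incremental checkpoints enabled, references may point outside the chk-N folder (for example into a shared directory), so the listing should cover the whole job directory and not just one checkpoint folder.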