Hi Nikola,

State files that are inlined into the _metadata file keep the same keys they would have as separate files; the only difference is the type. A file with path A that is not inlined has exactly the same path as the reference recorded for it when it is inlined into _metadata. So as long as there are no errors when recovering from such checkpoints/savepoints, there is no issue.
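To make the point above concrete, here is a rough sketch of the idea. It is a toy model, not Flink's actual code: the class names, the function and the 20 KiB threshold are all made up for illustration. It only shows how a small piece of state can end up inside _metadata while still carrying a path-like name for a file that is never written.

```python
import uuid
from dataclasses import dataclass
from typing import Dict, Union

# Assumed threshold for illustration only; the real option and its value may differ.
INLINE_THRESHOLD_BYTES = 20 * 1024


@dataclass(frozen=True)
class InlineHandle:
    """Hypothetical handle: bytes live inside _metadata, the path is only a name."""
    name: str
    data: bytes


@dataclass(frozen=True)
class FileHandle:
    """Hypothetical handle: bytes live in a separate file at `path`."""
    path: str
    size: int


def close_and_get_handle(
    checkpoint_dir: str,
    data: bytes,
    written_files: Dict[str, bytes],
    threshold: int = INLINE_THRESHOLD_BYTES,
) -> Union[InlineHandle, FileHandle]:
    """Pick a path for the state, then either inline it or write it out."""
    path = f"{checkpoint_dir}/{uuid.uuid4()}"
    if len(data) <= threshold:
        # Small state: nothing is written, but the path is kept as the name.
        return InlineHandle(name=path, data=data)
    # Large state: the file is actually created (modelled here as a dict entry).
    written_files[path] = data
    return FileHandle(path=path, size=len(data))


files: Dict[str, bytes] = {}
small = close_and_get_handle("chk-1425", b"x" * 100, files)
large = close_and_get_handle("chk-1425", b"x" * 50_000, files)
print(type(small).__name__, small.name in files)  # InlineHandle False
print(type(large).__name__, large.path in files)  # FileHandle True
```

In this model both handles mention a path under chk-1425, but only the second one has a file behind it, which would match a folder that contains nothing except _metadata.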
Aleksandr

On Tue, 10 Jun 2025 at 15:05, Nikola Milutinovic <n.milutino...@levi9.com> wrote:

> Hi Gabor.
>
> Thanks for chiming in. I think it is failing but I could be mistaken.
> There are no errors in the log, everything looks fine. However, when I
> inspect the _metadata file, I can see references to other files which are
> not present at the given locations. Here is an example.
>
> Flink.log (time order is newer first)
>
> 2025-06-10 15:25:40.983
> 2025-06-10 13:25:40,983 INFO org.apache.hadoop.fs.azure.AzureFileSystemThreadPoolExecutor [] - Time taken for Delete operation is: 0 ms with threads: 0
>
> 2025-06-10 15:25:40.983
> 2025-06-10 13:25:40,983 WARN org.apache.hadoop.fs.azure.AzureFileSystemThreadPoolExecutor [] - Disabling threads for Delete operation as thread count 0 is <= 1
>
> 2025-06-10 15:25:40.936
> 2025-06-10 13:25:40,936 INFO org.apache.flink.runtime.source.coordinator.SourceCoordinator [] - Marking checkpoint 1425 as completed for source Source: Kafka source.
>
> 2025-06-10 15:25:40.936
> 2025-06-10 13:25:40,936 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed checkpoint 1425 for job 3acd203bc1b74b65803d14c9cad2df32 (3397 bytes, checkpointDuration=134 ms, finalizationTime=188 ms).
>
> 2025-06-10 15:25:40.669
> 2025-06-10 13:25:40,669 INFO org.apache.flink.runtime.state.filesystem.FsCheckpointMetadataOutputStream [] - Cannot create recoverable writer due to Recoverable writers on AzureBlob are only supported for ABFS, will use the ordinary writer.
>
> 2025-06-10 15:25:40.628
> 2025-06-10 13:25:40,628 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering checkpoint 1425 (type=CheckpointType{name='Checkpoint', sharingFilesStrategy=FORWARD_BACKWARD}) @ 1749561940614 for job 3acd203bc1b74b65803d14c9cad2df32.
>
> References in the _metadata file:
>
> wasbs://flink-storage@${account}.blob.core.windows.net/flink-checkpoints/3acd203bc1b74b65803d14c9cad2df32/chk-1425/c628d0ed-bbdd-4edd-bfa5-c53c60da5d43
> wasbs://flink-storage@${account}.blob.core.windows.net/flink-checkpoints/3acd203bc1b74b65803d14c9cad2df32/chk-1425/6eab4448-6080-4fef-8503-7342dc407b9c
> wasbs://flink-storage@${account}.blob.core.windows.net/flink-checkpoints/3acd203bc1b74b65803d14c9cad2df32/chk-1425/685aa9e3-0260-4240-b5de-249f8d9a2683
> wasbs://flink-storage@${account}.blob.core.windows.net/flink-checkpoints/3acd203bc1b74b65803d14c9cad2df32/chk-1425/9ebd7108-0073-4dd0-b047-56b69e21179b
>
> So, if I understand things correctly, there should be those 4 files in the
> chk-1425 folder, but it contains only the _metadata file. And this really
> is all there is in the logs, Task Manager is spitting some warnings about
> metric name collision, but that should be irrelevant.
>
> Am I making a false alarm here? Would you need to inspect the _metadata
> file, as well? Or can I do a better job of analyzing it?
>
> Nikola.
>
> *From: *Gabor Somogyi <gabor.g.somo...@gmail.com>
> *Date: *Tuesday, June 10, 2025 at 10:52 AM
> *To: *Nikola Milutinovic <n.milutino...@levi9.com>
> *Cc: *Flink Users <user@flink.apache.org>
> *Subject: *Re: Savepoints and Checkpoints missing files
>
> Hi Nikola,
>
> Fails on how? Some stack trace or error would be beneficial.
>
> G
>
> On Tue, Jun 10, 2025 at 10:48 AM Nikola Milutinovic <n.milutino...@levi9.com> wrote:
>
> Hello.
>
> We are running Flink 1.20.1 on Kubernetes (AKS). We have observed a
> consistent error situation: both checkpoints and savepoints only save
> “_metadata” file and nothing else. Sometimes this is OK, where all data is
> in that one file. But sometimes “_metadata” holds references to other
> files, which are not present.
>
> I understand that if the size of the state is smaller than a set limit, it
> will be stored only in that one file. And if it is larger, it would be
> spilled over to additional files. Our state is generally minuscule, so it
> should always fit into _metadata, but sometimes I can inspect the _metadata
> file and see references to those additional files. Trying to restore from
> such a save/check-point always fails.
>
> Does anyone know of a reason for this behavior?
>
> This is our configuration (relevant parts, I have substituted our account
> with a variable):
>
> high-availability.type: kubernetes
> high-availability.cluster-id: flink-cluster-session-cluster
> high-availability.storageDir: wasbs://flink-storage@${account}.blob.core.windows.net/data
> high-availability.jobmanager.port: 6123
>
> state.backend.type: rocksdb
> execution.checkpointing.num-retained: 3
> execution.checkpointing.savepoint-dir: wasbs://flink-storage@${account}.blob.core.windows.net/flink-savepoints
> execution.checkpointing.mode: EXACTLY_ONCE
> execution.checkpointing.incremental: true
> execution.checkpointing.interval: 60000
> execution.checkpointing.timeout: 300000
> $internal.flink.version: v1_20
> execution.checkpointing.storage: filesystem
> execution.checkpointing.dir: wasbs://flink-storage@${account}.blob.core.windows.net/flink-checkpoints
> execution.checkpointing.externalized-checkpoint-retention: RETAIN_ON_CANCELLATION
> execution.checkpointing.min-pause: 5000
> execution.target: kubernetes-session
>
> fs.azure.account.keyprovider.${account}.blob.core.windows.net: org.apache.flink.fs.azurefs.EnvironmentVariableKeyProvider
>
> env.java.opts.all: --add-exports=java.base/sun.net.util=ALL-UNNAMED
> --add-exports=java.rmi/sun.rmi.registry=ALL-UNNAMED
> --add-exports=jdk.compiler/com.sun.tools.javac.api=ALL-UNNAMED
> --add-exports=jdk.compiler/com.sun.tools.javac.file=ALL-UNNAMED
> --add-exports=jdk.compiler/com.sun.tools.javac.parser=ALL-UNNAMED
> --add-exports=jdk.compiler/com.sun.tools.javac.tree=ALL-UNNAMED
> --add-exports=jdk.compiler/com.sun.tools.javac.util=ALL-UNNAMED
> --add-exports=java.security.jgss/sun.security.krb5=ALL-UNNAMED
> --add-opens=java.base/java.lang=ALL-UNNAMED
> --add-opens=java.base/java.net=ALL-UNNAMED
> --add-opens=java.base/java.io=ALL-UNNAMED
> --add-opens=java.base/java.nio=ALL-UNNAMED
> --add-opens=java.base/sun.nio.ch=ALL-UNNAMED
> --add-opens=java.base/java.lang.reflect=ALL-UNNAMED
> --add-opens=java.base/java.text=ALL-UNNAMED
> --add-opens=java.base/java.time=ALL-UNNAMED
> --add-opens=java.base/java.util=ALL-UNNAMED
> --add-opens=java.base/java.util.concurrent=ALL-UNNAMED
> --add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED
> --add-opens=java.base/java.util.concurrent.locks=ALL-UNNAMED
>
> Nikola.
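For anyone who wants to repeat the inspection described in the quoted message, the following is a small sketch that scans a local copy of a _metadata file for path references and reports which of them have no file next to it. It is a heuristic, not a parser of the metadata format: it assumes the references appear as plain ASCII `wasbs://` URIs ending in a UUID, as in the listing quoted above, and that the checkpoint folder has been downloaded to local disk. The function names are made up for this example.

```python
import re
from pathlib import Path

# Assumed shape of a reference: a wasbs:// URI whose last segment is a UUID.
URI_RE = re.compile(
    rb"wasbs://[\x21-\x7e]+?/"
    rb"([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})"
)


def referenced_names(metadata: bytes) -> list[str]:
    """Return the sorted, de-duplicated UUID file names found in the raw bytes."""
    return sorted({m.group(1).decode("ascii") for m in URI_RE.finditer(metadata)})


def missing_files(metadata_path: Path) -> list[str]:
    """Return referenced names that have no file in the same folder as _metadata."""
    present = {p.name for p in metadata_path.parent.iterdir()}
    names = referenced_names(metadata_path.read_bytes())
    return [name for name in names if name not in present]
```

A call such as `missing_files(Path("chk-1425/_metadata"))` would list the referenced names that are absent from the folder. A non-empty result does not prove the checkpoint is broken by itself: per the reply at the top of this thread, inlined state keeps a path as its reference as well, so the restore behaviour is what decides.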