Hi Nikola,

State files that are inlined into the _metadata file keep the same keys
they would have as separate files; only the type differs. A file that
would be written to path A when it is not inlined still appears under
exactly that path, as a reference, when it is inlined into _metadata.
So as long as recovery from such checkpoints/savepoints produces no
errors, there is no issue.
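
To illustrate, here is a minimal Python sketch of the idea. It is a toy
model, not Flink's code: the names FileReference, InlinedState and
make_handle, and the 20 KB threshold, are all made up for illustration.
It only shows why a path found in _metadata does not by itself mean a
file must exist at that path.

```python
from dataclasses import dataclass
from typing import List, Union

# Hypothetical threshold, for illustration only; the real limit is
# whatever the cluster configuration sets.
INLINE_THRESHOLD_BYTES = 20 * 1024


@dataclass(frozen=True)
class FileReference:
    """State written as its own file; _metadata stores only the path."""
    path: str


@dataclass(frozen=True)
class InlinedState:
    """State embedded in _metadata; 'path' survives only as a key."""
    path: str
    data: bytes


Handle = Union[FileReference, InlinedState]


def make_handle(path: str, data: bytes,
                threshold: int = INLINE_THRESHOLD_BYTES) -> Handle:
    """Pick the handle type by size; the path/key is the same either way."""
    if len(data) < threshold:
        return InlinedState(path=path, data=data)
    return FileReference(path=path)


def files_expected_on_storage(handles: List[Handle]) -> List[str]:
    """Only FileReference entries correspond to files next to _metadata."""
    return [h.path for h in handles if isinstance(h, FileReference)]


if __name__ == "__main__":
    small = make_handle("chk-1/aaaa", b"x" * 100)
    large = make_handle("chk-1/bbbb", b"x" * (64 * 1024))
    # Both handles carry a path, but only the large one needs a real file.
    print(type(small).__name__, small.path)
    print(type(large).__name__, large.path)
    print(files_expected_on_storage([small, large]))
```

In this model both entries show a path inside _metadata, yet only the
second one requires a separate file in the checkpoint directory.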

Aleksandr

On Tue, 10 Jun 2025 at 15:05, Nikola Milutinovic <n.milutino...@levi9.com>
wrote:

> Hi Gabor.
>
>
>
> Thanks for chiming in. I think it is failing but I could be mistaken.
> There are no errors in the log, everything looks fine. However, when I
> inspect the _metadata file, I can see references to other files which are
> not present at the given locations. Here is an example.
>
>
>
> Flink.log (time order is newer first)
>
>
>
> 2025-06-10 15:25:40.983
>
> 2025-06-10 13:25:40,983 INFO
> org.apache.hadoop.fs.azure.AzureFileSystemThreadPoolExecutor [] - Time
> taken for Delete operation is: 0 ms with threads: 0
>
> 2025-06-10 15:25:40.983
>
> 2025-06-10 13:25:40,983 WARN
> org.apache.hadoop.fs.azure.AzureFileSystemThreadPoolExecutor [] - Disabling
> threads for Delete operation as thread count 0 is <= 1
>
> 2025-06-10 15:25:40.936
>
> 2025-06-10 13:25:40,936 INFO
> org.apache.flink.runtime.source.coordinator.SourceCoordinator [] - Marking
> checkpoint 1425 as completed for source Source: Kafka source.
>
> 2025-06-10 15:25:40.936
>
> 2025-06-10 13:25:40,936 INFO
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed
> checkpoint 1425 for job 3acd203bc1b74b65803d14c9cad2df32 (3397 bytes,
> checkpointDuration=134 ms, finalizationTime=188 ms).
>
> 2025-06-10 15:25:40.669
>
> 2025-06-10 13:25:40,669 INFO
> org.apache.flink.runtime.state.filesystem.FsCheckpointMetadataOutputStream
> [] - Cannot create recoverable writer due to Recoverable writers on
> AzureBlob are only supported for ABFS, will use the ordinary writer.
>
> 2025-06-10 15:25:40.628
>
> 2025-06-10 13:25:40,628 INFO
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering
> checkpoint 1425 (type=CheckpointType{name='Checkpoint',
> sharingFilesStrategy=FORWARD_BACKWARD}) @ 1749561940614 for job
> 3acd203bc1b74b65803d14c9cad2df32.
>
>
>
> References in the _metadata file:
>
>
>
> wasbs://flink-storage@${account}.blob.core.windows.net/flink-checkpoints/3acd203bc1b74b65803d14c9cad2df32/chk-1425/c628d0ed-bbdd-4edd-bfa5-c53c60da5d43
>
> wasbs://flink-storage@${account}.blob.core.windows.net/flink-checkpoints/3acd203bc1b74b65803d14c9cad2df32/chk-1425/6eab4448-6080-4fef-8503-7342dc407b9c
>
> wasbs://flink-storage@${account}.blob.core.windows.net/flink-checkpoints/3acd203bc1b74b65803d14c9cad2df32/chk-1425/685aa9e3-0260-4240-b5de-249f8d9a2683
>
> wasbs://flink-storage@${account}.blob.core.windows.net/flink-checkpoints/3acd203bc1b74b65803d14c9cad2df32/chk-1425/9ebd7108-0073-4dd0-b047-56b69e21179b
>
>
>
> So, if I understand things correctly, there should be those 4 files in the
> chk-1425 folder, but it contains only the _metadata file. And this really
> is all there is in the logs; the Task Manager is emitting some warnings
> about metric name collisions, but those should be irrelevant.
>
>
>
> Am I raising a false alarm here? Would you need to inspect the _metadata
> file as well? Or can I do a better job of analyzing it?
>
>
>
> Nikola.
>
>
>
> *From: *Gabor Somogyi <gabor.g.somo...@gmail.com>
> *Date: *Tuesday, June 10, 2025 at 10:52 AM
> *To: *Nikola Milutinovic <n.milutino...@levi9.com>
> *Cc: *Flink Users <user@flink.apache.org>
> *Subject: *Re: Savepoints and Checkpoints missing files
>
> Hi Nikola,
>
>
>
> Fails how? A stack trace or error message would help.
>
>
>
> G
>
>
>
>
>
> On Tue, Jun 10, 2025 at 10:48 AM Nikola Milutinovic <
> n.milutino...@levi9.com> wrote:
>
> Hello.
>
>
>
> We are running Flink 1.20.1 on Kubernetes (AKS). We have observed a
> consistent error situation: both checkpoints and savepoints save only the
> “_metadata” file and nothing else. Sometimes this is OK, when all the data
> is in that one file. But sometimes “_metadata” holds references to other
> files, which are not present.
>
>
>
> I understand that if the size of the state is smaller than a set limit, it
> will be stored only in that one file. And if it is larger, it will be
> spilled over to additional files. Our state is generally minuscule, so it
> should always fit into _metadata, but sometimes I can inspect the _metadata
> file and see references to those additional files. Trying to restore from
> such a savepoint or checkpoint always fails.
>
>
>
> Does anyone know of a reason for this behavior?
>
>
>
> This is our configuration (relevant parts, I have substituted our account
> with a variable):
>
>
>
> high-availability.type: kubernetes
>
> high-availability.cluster-id: flink-cluster-session-cluster
>
> high-availability.storageDir: wasbs://flink-storage@${account}.blob.core.windows.net/data
>
> high-availability.jobmanager.port: 6123
>
>
>
> state.backend.type: rocksdb
>
> execution.checkpointing.num-retained: 3
>
> execution.checkpointing.savepoint-dir: wasbs://flink-storage@${account}.blob.core.windows.net/flink-savepoints
>
> execution.checkpointing.mode: EXACTLY_ONCE
>
> execution.checkpointing.incremental: true
>
> execution.checkpointing.interval: 60000
>
> execution.checkpointing.timeout: 300000
>
> $internal.flink.version: v1_20
>
> execution.checkpointing.storage: filesystem
>
> execution.checkpointing.dir: wasbs://flink-storage@${account}.blob.core.windows.net/flink-checkpoints
>
> execution.checkpointing.externalized-checkpoint-retention: RETAIN_ON_CANCELLATION
>
> execution.checkpointing.min-pause: 5000
>
> execution.target: kubernetes-session
>
>
>
> fs.azure.account.keyprovider.${account}.blob.core.windows.net:
> org.apache.flink.fs.azurefs.EnvironmentVariableKeyProvider
>
>
>
> env.java.opts.all: --add-exports=java.base/sun.net.util=ALL-UNNAMED
> --add-exports=java.rmi/sun.rmi.registry=ALL-UNNAMED
> --add-exports=jdk.compiler/com.sun.tools.javac.api=ALL-UNNAMED
> --add-exports=jdk.compiler/com.sun.tools.javac.file=ALL-UNNAMED
> --add-exports=jdk.compiler/com.sun.tools.javac.parser=ALL-UNNAMED
> --add-exports=jdk.compiler/com.sun.tools.javac.tree=ALL-UNNAMED
> --add-exports=jdk.compiler/com.sun.tools.javac.util=ALL-UNNAMED
> --add-exports=java.security.jgss/sun.security.krb5=ALL-UNNAMED
> --add-opens=java.base/java.lang=ALL-UNNAMED
> --add-opens=java.base/java.net=ALL-UNNAMED
> --add-opens=java.base/java.io=ALL-UNNAMED
> --add-opens=java.base/java.nio=ALL-UNNAMED
> --add-opens=java.base/sun.nio.ch=ALL-UNNAMED
> --add-opens=java.base/java.lang.reflect=ALL-UNNAMED
> --add-opens=java.base/java.text=ALL-UNNAMED
> --add-opens=java.base/java.time=ALL-UNNAMED
> --add-opens=java.base/java.util=ALL-UNNAMED
> --add-opens=java.base/java.util.concurrent=ALL-UNNAMED
> --add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED
> --add-opens=java.base/java.util.concurrent.locks=ALL-UNNAMED
>
>
>
> Nikola.
>
>
