Hi Gabor.

Thanks for chiming in. I think it is failing, but I could be mistaken. There
are no errors in the log; everything looks fine. However, when I inspect the
_metadata file, I see references to other files that are not present at the
given locations. Here is an example.

Flink.log (newest entries first):

2025-06-10 15:25:40.983
2025-06-10 13:25:40,983 INFO org.apache.hadoop.fs.azure.AzureFileSystemThreadPoolExecutor [] - Time taken for Delete operation is: 0 ms with threads: 0
2025-06-10 15:25:40.983
2025-06-10 13:25:40,983 WARN org.apache.hadoop.fs.azure.AzureFileSystemThreadPoolExecutor [] - Disabling threads for Delete operation as thread count 0 is <= 1
2025-06-10 15:25:40.936
2025-06-10 13:25:40,936 INFO org.apache.flink.runtime.source.coordinator.SourceCoordinator [] - Marking checkpoint 1425 as completed for source Source: Kafka source.
2025-06-10 15:25:40.936
2025-06-10 13:25:40,936 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed checkpoint 1425 for job 3acd203bc1b74b65803d14c9cad2df32 (3397 bytes, checkpointDuration=134 ms, finalizationTime=188 ms).
2025-06-10 15:25:40.669
2025-06-10 13:25:40,669 INFO org.apache.flink.runtime.state.filesystem.FsCheckpointMetadataOutputStream [] - Cannot create recoverable writer due to Recoverable writers on AzureBlob are only supported for ABFS, will use the ordinary writer.
2025-06-10 15:25:40.628
2025-06-10 13:25:40,628 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering checkpoint 1425 (type=CheckpointType{name='Checkpoint', sharingFilesStrategy=FORWARD_BACKWARD}) @ 1749561940614 for job 3acd203bc1b74b65803d14c9cad2df32.

References in the _metadata file:

wasbs://flink-storage@${account}.blob.core.windows.net/flink-checkpoints/3acd203bc1b74b65803d14c9cad2df32/chk-1425/c628d0ed-bbdd-4edd-bfa5-c53c60da5d43
wasbs://flink-storage@${account}.blob.core.windows.net/flink-checkpoints/3acd203bc1b74b65803d14c9cad2df32/chk-1425/6eab4448-6080-4fef-8503-7342dc407b9c
wasbs://flink-storage@${account}.blob.core.windows.net/flink-checkpoints/3acd203bc1b74b65803d14c9cad2df32/chk-1425/685aa9e3-0260-4240-b5de-249f8d9a2683
wasbs://flink-storage@${account}.blob.core.windows.net/flink-checkpoints/3acd203bc1b74b65803d14c9cad2df32/chk-1425/9ebd7108-0073-4dd0-b047-56b69e21179b

So, if I understand things correctly, there should be those 4 files in the
chk-1425 folder, but it contains only the _metadata file. And this really is
all there is in the logs. The Task Manager is emitting some warnings about
metric name collisions, but those should be irrelevant.
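
The comparison described above (references in _metadata versus files actually
present in the checkpoint folder) can be scripted. The following is only a
minimal sketch. It assumes the file URIs appear as plain ASCII strings inside
the binary _metadata file, which matches what is visible when inspecting it by
hand. The function names are hypothetical, and the list of existing blob names
has to be obtained separately (for example from a storage listing). A
byte-level regex may also pick up stray printable bytes that happen to follow
a path, so treat the output as a hint rather than proof.

```python
import re

# Assumption: state-file URIs are stored as readable ASCII inside _metadata.
# The pattern accepts wasb/wasbs/abfs/abfss URIs made of printable characters.
URI_PATTERN = re.compile(rb"(?:wasbs?|abfss?)://[\x21-\x7e]+")


def extract_references(metadata: bytes) -> list[str]:
    """Return the distinct storage URIs found in a _metadata blob, in order."""
    seen: list[str] = []
    for match in URI_PATTERN.finditer(metadata):
        uri = match.group().decode("ascii")
        if uri not in seen:
            seen.append(uri)
    return seen


def missing_references(references: list[str], existing_blob_names: list[str]) -> list[str]:
    """Return the references whose last path segment is not among the blobs
    that actually exist in the checkpoint folder."""
    existing = set(existing_blob_names)
    return [ref for ref in references if ref.rsplit("/", 1)[-1] not in existing]
```

Usage would be along the lines of
`missing_references(extract_references(open("_metadata", "rb").read()), ["_metadata"])`,
which for the folder described above would list all four referenced files as
missing.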

Am I raising a false alarm here? Would you need to inspect the _metadata file
as well? Or is there a better way for me to analyze it?

Nikola.

From: Gabor Somogyi <gabor.g.somo...@gmail.com>
Date: Tuesday, June 10, 2025 at 10:52 AM
To: Nikola Milutinovic <n.milutino...@levi9.com>
Cc: Flink Users <user@flink.apache.org>
Subject: Re: Savepoints and Checkpoints missing files

Hi Nikola,

Fails how? A stack trace or error message would help.

G


On Tue, Jun 10, 2025 at 10:48 AM Nikola Milutinovic
<n.milutino...@levi9.com> wrote:
Hello.

We are running Flink 1.20.1 on Kubernetes (AKS) and consistently see the same
problem: both checkpoints and savepoints write only the “_metadata” file and
nothing else. Sometimes this is fine, because all the data is in that one file.
But sometimes “_metadata” holds references to other files, which are not
present.

I understand that if the state is smaller than a configured limit, it is
stored entirely in that one file, and if it is larger, it spills over into
additional files. Our state is generally minuscule, so it should always fit
into _metadata, but sometimes when I inspect the _metadata file I see
references to those additional files. Trying to restore from such a
savepoint or checkpoint always fails.
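
The size rule described above can be pictured with a deliberately simplified
model. This is a sketch and not Flink's actual code. The function name and the
20 KiB default are assumptions for illustration (in Flink 1.20 the limit is,
as far as I know, set by `state.storage.fs.memory-threshold`), and the exact
behavior at the boundary may differ from this model.

```python
# Assumption: 20 KiB is an illustrative threshold, not a value read from the
# cluster configuration shown in this thread.
INLINE_THRESHOLD_BYTES = 20 * 1024


def placement(state_size_bytes: int, threshold: int = INLINE_THRESHOLD_BYTES) -> str:
    """Simplified model: a state chunk below the threshold is embedded in
    _metadata, anything else is written to its own file and referenced by path."""
    if state_size_bytes < threshold:
        return "inline in _metadata"
    return "separate file"
```

Under this model a chunk of a few kilobytes stays inside _metadata, so a
reference to an external file implies that some writer decided the data
belonged in its own file, and that file should then exist.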

Does anyone know of a reason for this behavior?

This is our configuration (relevant parts, I have substituted our account with 
a variable):



high-availability.type: kubernetes
high-availability.cluster-id: flink-cluster-session-cluster
high-availability.storageDir: wasbs://flink-storage@${account}.blob.core.windows.net/data
high-availability.jobmanager.port: 6123

state.backend.type: rocksdb
execution.checkpointing.num-retained: 3
execution.checkpointing.savepoint-dir: wasbs://flink-storage@${account}.blob.core.windows.net/flink-savepoints
execution.checkpointing.mode: EXACTLY_ONCE
execution.checkpointing.incremental: true
execution.checkpointing.interval: 60000
execution.checkpointing.timeout: 300000
$internal.flink.version: v1_20
execution.checkpointing.storage: filesystem
execution.checkpointing.dir: wasbs://flink-storage@${account}.blob.core.windows.net/flink-checkpoints
execution.checkpointing.externalized-checkpoint-retention: RETAIN_ON_CANCELLATION
execution.checkpointing.min-pause: 5000
execution.target: kubernetes-session

fs.azure.account.keyprovider.${account}.blob.core.windows.net: org.apache.flink.fs.azurefs.EnvironmentVariableKeyProvider



env.java.opts.all: --add-exports=java.base/sun.net.util=ALL-UNNAMED  
--add-exports=java.rmi/sun.rmi.registry=ALL-UNNAMED  
--add-exports=jdk.compiler/com.sun.tools.javac.api=ALL-UNNAMED  
--add-exports=jdk.compiler/com.sun.tools.javac.file=ALL-UNNAMED  
--add-exports=jdk.compiler/com.sun.tools.javac.parser=ALL-UNNAMED  
--add-exports=jdk.compiler/com.sun.tools.javac.tree=ALL-UNNAMED  
--add-exports=jdk.compiler/com.sun.tools.javac.util=ALL-UNNAMED  
--add-exports=java.security.jgss/sun.security.krb5=ALL-UNNAMED  
--add-opens=java.base/java.lang=ALL-UNNAMED  
--add-opens=java.base/java.net=ALL-UNNAMED  
--add-opens=java.base/java.io=ALL-UNNAMED  
--add-opens=java.base/java.nio=ALL-UNNAMED  
--add-opens=java.base/sun.nio.ch=ALL-UNNAMED  
--add-opens=java.base/java.lang.reflect=ALL-UNNAMED  
--add-opens=java.base/java.text=ALL-UNNAMED  
--add-opens=java.base/java.time=ALL-UNNAMED  
--add-opens=java.base/java.util=ALL-UNNAMED  
--add-opens=java.base/java.util.concurrent=ALL-UNNAMED  
--add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED  
--add-opens=java.base/java.util.concurrent.locks=ALL-UNNAMED

Nikola.
