Hi Nikola,

How exactly does the restore fail? A stack trace or error message would help.

G

On Tue, Jun 10, 2025 at 10:48 AM Nikola Milutinovic <n.milutino...@levi9.com> wrote:

> Hello.
>
> We are running Flink 1.20.1 on Kubernetes (AKS). We have observed a
> consistent error situation: both checkpoints and savepoints only save
> the “_metadata” file and nothing else. Sometimes this is OK, when all
> data is in that one file. But sometimes “_metadata” holds references to
> other files, which are not present.
>
> I understand that if the size of the state is smaller than a set limit,
> it will be stored only in that one file. And if it is larger, it will
> spill over to additional files. Our state is generally minuscule, so it
> should always fit into _metadata, but sometimes I can inspect the
> _metadata file and see references to those additional files. Trying to
> restore from such a savepoint or checkpoint always fails.
>
> Does anyone know of a reason for this behavior?
>
> This is our configuration (relevant parts, I have substituted our
> account with a variable):
>
> high-availability.type: kubernetes
> high-availability.cluster-id: flink-cluster-session-cluster
> high-availability.storageDir: wasbs://flink-storage@${account}.blob.core.windows.net/data
> high-availability.jobmanager.port: 6123
>
> state.backend.type: rocksdb
> execution.checkpointing.num-retained: 3
> execution.checkpointing.savepoint-dir: wasbs://flink-storage@${account}.blob.core.windows.net/flink-savepoints
> execution.checkpointing.mode: EXACTLY_ONCE
> execution.checkpointing.incremental: true
> execution.checkpointing.interval: 60000
> execution.checkpointing.timeout: 300000
> $internal.flink.version: v1_20
> execution.checkpointing.storage: filesystem
> execution.checkpointing.dir: wasbs://flink-storage@${account}.blob.core.windows.net/flink-checkpoints
> execution.checkpointing.externalized-checkpoint-retention: RETAIN_ON_CANCELLATION
> execution.checkpointing.min-pause: 5000
> execution.target: kubernetes-session
>
> fs.azure.account.keyprovider.${account}.blob.core.windows.net: org.apache.flink.fs.azurefs.EnvironmentVariableKeyProvider
>
> env.java.opts.all: --add-exports=java.base/sun.net.util=ALL-UNNAMED
>     --add-exports=java.rmi/sun.rmi.registry=ALL-UNNAMED
>     --add-exports=jdk.compiler/com.sun.tools.javac.api=ALL-UNNAMED
>     --add-exports=jdk.compiler/com.sun.tools.javac.file=ALL-UNNAMED
>     --add-exports=jdk.compiler/com.sun.tools.javac.parser=ALL-UNNAMED
>     --add-exports=jdk.compiler/com.sun.tools.javac.tree=ALL-UNNAMED
>     --add-exports=jdk.compiler/com.sun.tools.javac.util=ALL-UNNAMED
>     --add-exports=java.security.jgss/sun.security.krb5=ALL-UNNAMED
>     --add-opens=java.base/java.lang=ALL-UNNAMED
>     --add-opens=java.base/java.net=ALL-UNNAMED
>     --add-opens=java.base/java.io=ALL-UNNAMED
>     --add-opens=java.base/java.nio=ALL-UNNAMED
>     --add-opens=java.base/sun.nio.ch=ALL-UNNAMED
>     --add-opens=java.base/java.lang.reflect=ALL-UNNAMED
>     --add-opens=java.base/java.text=ALL-UNNAMED
>     --add-opens=java.base/java.time=ALL-UNNAMED
>     --add-opens=java.base/java.util=ALL-UNNAMED
>     --add-opens=java.base/java.util.concurrent=ALL-UNNAMED
>     --add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED
>     --add-opens=java.base/java.util.concurrent.locks=ALL-UNNAMED
>
> Nikola.
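
To attach concrete evidence alongside a stack trace, the "references to other files" observation could be captured with a small script. The following is only a rough sketch, not a Flink API: it scans the raw bytes of a locally downloaded _metadata file for printable strings that contain a filesystem scheme, much like the Unix `strings` tool, and compares them with a directory listing. The function names, the scheme list, and the 8-character minimum are assumptions made for illustration, and the heuristic may pick up stray trailing characters.

```python
import re

# Runs of at least 8 printable ASCII bytes, similar to the Unix `strings` tool.
_PRINTABLE_RUN = re.compile(rb"[\x20-\x7e]{8,}")

# Schemes to look for; this list is an assumption, extend it as needed.
_SCHEMES = ("wasbs://", "wasb://", "abfss://", "abfs://", "hdfs://", "s3://", "file:/")


def referenced_paths(metadata: bytes, schemes=_SCHEMES):
    """Return path-like strings found in raw _metadata bytes (heuristic only)."""
    found = []
    for match in _PRINTABLE_RUN.finditer(metadata):
        text = match.group().decode("ascii")
        for scheme in schemes:
            idx = text.find(scheme)
            if idx != -1:
                # Drop any printable bytes that happen to precede the scheme,
                # such as a length prefix that falls in the ASCII range.
                found.append(text[idx:])
                break
    return found


def missing_files(paths, existing_names):
    """Return referenced paths whose last component is absent from a listing."""
    present = set(existing_names)
    return [p for p in paths if p.rsplit("/", 1)[-1] not in present]
```

A possible use is to read the file with `open("_metadata", "rb").read()`, pass the bytes to `referenced_paths`, and then call `missing_files` with the blob names listed in the same checkpoint or savepoint directory. Any paths it returns are candidates for the files the restore cannot find.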