Hi Nikola,

How exactly does it fail? A stack trace or error message would help.

G


On Tue, Jun 10, 2025 at 10:48 AM Nikola Milutinovic <n.milutino...@levi9.com>
wrote:

> Hello.
>
>
>
> We are running Flink 1.20.1 on Kubernetes (AKS). We have observed a
> consistent problem: both checkpoints and savepoints save only the
> “_metadata” file and nothing else. Sometimes this is OK, because all the
> data is in that one file. But sometimes “_metadata” holds references to
> other files, which are not present.
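
To narrow this down, it may help to list what each retained checkpoint directory actually contains. Below is a minimal sketch, assuming the job's checkpoint directory has been copied to a local path; `summarize_checkpoints` and `metadata_only` are hypothetical helper names, not Flink tooling. It relies on the directory layout documented for Flink, where each job directory holds `chk-N` directories next to `shared` and `taskowned`. With incremental RocksDB checkpoints, state files may be written to `shared` rather than to `chk-N`, so the sketch lists all three.

```python
import os


def summarize_checkpoints(job_dir):
    """Map each subdirectory of job_dir (chk-N, shared, taskowned) to its sorted file names."""
    summary = {}
    for entry in sorted(os.listdir(job_dir)):
        path = os.path.join(job_dir, entry)
        if not os.path.isdir(path):
            continue
        files = []
        # Walk recursively so nested files are reported relative to the subdirectory.
        for root, _dirs, names in os.walk(path):
            for name in names:
                files.append(os.path.relpath(os.path.join(root, name), path))
        summary[entry] = sorted(files)
    return summary


def metadata_only(summary):
    """Return the chk-* directories whose only file is _metadata."""
    return [
        name
        for name, files in summary.items()
        if name.startswith("chk-") and files == ["_metadata"]
    ]
```

Calling `metadata_only(summarize_checkpoints("/tmp/flink-checkpoints/<job-id>"))` (the path is a placeholder) shows which checkpoints consist of `_metadata` alone, and the `shared` entry of the summary shows whether any state files were written outside the `chk-N` directories.
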
>
>
>
> I understand that if the state is smaller than a configured limit, it is
> stored only in that one file, and if it is larger, it is spilled over to
> additional files. Our state is generally minuscule, so it should always
> fit into _metadata, but sometimes I can inspect the _metadata file and see
> references to those additional files. Trying to restore from such a
> savepoint or checkpoint always fails.
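
For reference, the size limit described above can be sketched as follows. This is only an illustration, not Flink's code: `placement` is a hypothetical helper, and the 20 KB figure assumes the default value of `state.storage.fs.memory-threshold` in Flink 1.20. The documentation describes the limit as applying to individual state chunks rather than to the total state size, so a job with little state overall could still produce separate files if a single chunk exceeds it.

```python
# Assumed default of state.storage.fs.memory-threshold (20 KB); check your own configuration.
DEFAULT_THRESHOLD_BYTES = 20 * 1024


def placement(chunk_sizes, threshold=DEFAULT_THRESHOLD_BYTES):
    """Classify each named state chunk as inlined in _metadata or written to its own file.

    chunk_sizes maps a chunk name to its size in bytes. How a chunk of exactly
    `threshold` bytes is handled is an assumption of this sketch.
    """
    return {
        name: ("inline" if size < threshold else "file")
        for name, size in chunk_sizes.items()
    }
```

For example, `placement({"operator-a": 100, "operator-b": 50_000})` classifies the first chunk as inline and the second as a separate file, which is the situation where `_metadata` ends up holding references to other files.
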
>
>
>
> Does anyone know of a reason for this behavior?
>
>
>
> This is our configuration (relevant parts, I have substituted our account
> with a variable):
>
>
>
> high-availability.type: kubernetes
> high-availability.cluster-id: flink-cluster-session-cluster
> high-availability.storageDir: wasbs://flink-storage@${account}.blob.core.windows.net/data
> high-availability.jobmanager.port: 6123
>
>
>
> state.backend.type: rocksdb
> execution.checkpointing.num-retained: 3
> execution.checkpointing.savepoint-dir: wasbs://flink-storage@${account}.blob.core.windows.net/flink-savepoints
> execution.checkpointing.mode: EXACTLY_ONCE
> execution.checkpointing.incremental: true
> execution.checkpointing.interval: 60000
> execution.checkpointing.timeout: 300000
> $internal.flink.version: v1_20
> execution.checkpointing.storage: filesystem
> execution.checkpointing.dir: wasbs://flink-storage@${account}.blob.core.windows.net/flink-checkpoints
> execution.checkpointing.externalized-checkpoint-retention: RETAIN_ON_CANCELLATION
> execution.checkpointing.min-pause: 5000
> execution.target: kubernetes-session
>
>
>
> fs.azure.account.keyprovider.${account}.blob.core.windows.net: org.apache.flink.fs.azurefs.EnvironmentVariableKeyProvider
>
>
>
> env.java.opts.all: --add-exports=java.base/sun.net.util=ALL-UNNAMED
> --add-exports=java.rmi/sun.rmi.registry=ALL-UNNAMED
> --add-exports=jdk.compiler/com.sun.tools.javac.api=ALL-UNNAMED
> --add-exports=jdk.compiler/com.sun.tools.javac.file=ALL-UNNAMED
> --add-exports=jdk.compiler/com.sun.tools.javac.parser=ALL-UNNAMED
> --add-exports=jdk.compiler/com.sun.tools.javac.tree=ALL-UNNAMED
> --add-exports=jdk.compiler/com.sun.tools.javac.util=ALL-UNNAMED
> --add-exports=java.security.jgss/sun.security.krb5=ALL-UNNAMED
> --add-opens=java.base/java.lang=ALL-UNNAMED
> --add-opens=java.base/java.net=ALL-UNNAMED
> --add-opens=java.base/java.io=ALL-UNNAMED
> --add-opens=java.base/java.nio=ALL-UNNAMED
> --add-opens=java.base/sun.nio.ch=ALL-UNNAMED
> --add-opens=java.base/java.lang.reflect=ALL-UNNAMED
> --add-opens=java.base/java.text=ALL-UNNAMED
> --add-opens=java.base/java.time=ALL-UNNAMED
> --add-opens=java.base/java.util=ALL-UNNAMED
> --add-opens=java.base/java.util.concurrent=ALL-UNNAMED
> --add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED
> --add-opens=java.base/java.util.concurrent.locks=ALL-UNNAMED
>
>
>
> Nikola.
>
