Re: Job fails to restore from checkpoint in Kubernetes with FileNotFoundException

2018-10-30 Thread Till Rohrmann
As Vino pointed out, you need to configure a checkpoint directory which is accessible from all TMs. Otherwise you won't be able to recover the state if the task gets scheduled to a different TaskManager. Usually, people use HDFS or S3 for that. Cheers, Till On Tue, Oct 30, 2018 at 9:50 AM vino
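
A minimal sketch of the reasoning above, assuming the checkpoint directory is given as a URI string. A path with no scheme or the `file` scheme is usually local to one TaskManager pod, so a restore scheduled on another pod cannot find the files. The class `CheckpointDirCheck`, its method `isLikelyShared`, and the list of "shared" schemes are hypothetical illustrations, not part of Flink. A `file://` path can still work if every pod mounts the same shared volume, which this heuristic does not detect.

```java
import java.net.URI;
import java.util.Set;

// Hypothetical helper (not a Flink API): flags checkpoint directories that
// are probably node-local and therefore unreadable from other TaskManagers.
public class CheckpointDirCheck {

    // Assumption: illustrative schemes for storage every TaskManager can reach
    // (HDFS and S3, as mentioned in the thread). Not an exhaustive list.
    private static final Set<String> SHARED_SCHEMES = Set.of("hdfs", "s3", "s3a");

    static boolean isLikelyShared(String checkpointDir) {
        String scheme = URI.create(checkpointDir).getScheme();
        // No scheme, or "file", means a path on the local disk of one pod.
        // It is only shared if the same volume is mounted everywhere,
        // which this check cannot see, so it is flagged as local.
        if (scheme == null || scheme.equalsIgnoreCase("file")) {
            return false;
        }
        return SHARED_SCHEMES.contains(scheme.toLowerCase());
    }

    public static void main(String[] args) {
        String[] dirs = {
            "file:///tmp/flink-checkpoints",
            "/tmp/flink-checkpoints",
            "hdfs://namenode:8020/flink/checkpoints",
            "s3://my-bucket/flink/checkpoints"
        };
        for (String dir : dirs) {
            String verdict = isLikelyShared(dir) ? "shared" : "probably node-local";
            System.out.println(dir + " -> " + verdict);
        }
    }
}
```

The bucket and namenode names in `main` are placeholders; substitute the directory actually configured for the job's checkpoints.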

Re: Job fails to restore from checkpoint in Kubernetes with FileNotFoundException

2018-10-30 Thread vino yang
Hi John, Is the file system configured for the RocksDBStateBackend HDFS?[1] Thanks, vino. [1]: https://ci.apache.org/projects/flink/flink-docs-release-1.6/ops/state/state_backends.html#the-rocksdbstatebackend On Tue, Oct 30, 2018 at 2:54 AM, John Stone wrote: > I am testing Flink in a Kubernetes cluster and am

Job fails to restore from checkpoint in Kubernetes with FileNotFoundException

2018-10-29 Thread John Stone
I am testing Flink in a Kubernetes cluster and am finding that a job gets caught in a recovery loop. Logs show that the issue is that a checkpoint cannot be found although checkpoints are being taken per the Flink web UI. Any advice on how to resolve this is most appreciated. Note on below: