Till and Stephan, thanks for the clarification. @Till One more question: from what I have read about checkpointing [1], list operations don't seem to be performed frequently, so storing state backend data on S3 shouldn't have any severe impact on Flink's performance. Is this assumption right?
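To make the question concrete, the setup I'm considering would be along these lines (a sketch only; the bucket name and path are placeholders, not a tested configuration):

```yaml
# flink-conf.yaml (sketch; bucket name and path are placeholders)
state.backend: rocksdb
# Shared, replicated location for checkpoint data, reachable from all nodes:
state.backend.fs.checkpointdir: s3://my-flink-bucket/checkpoints
```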
[1] https://ci.apache.org/projects/flink/flink-docs-release-1.2/internals/stream_checkpointing.html

-- Ayush

On Fri, May 12, 2017 at 1:05 AM Stephan Ewen <se...@apache.org> wrote:

> Small addition to Till's comment:
>
> In the case where file:// points to a mounted distributed file system
> (NFS, MapR-FS, ...), it actually works. The important thing is that the
> file system where the checkpoints go is replicated (fault tolerant) and
> accessible from all nodes.
>
> On Thu, May 11, 2017 at 2:16 PM, Till Rohrmann <trohrm...@apache.org> wrote:
>
>> Hi Ayush,
>>
>> You're right that RocksDB is the recommended state backend because of the
>> above-mentioned reasons. In order to make recovery work properly, you
>> have to configure a shared directory for the checkpoint data via
>> state.backend.fs.checkpointdir. You can basically configure any file
>> system which is supported by Hadoop (no HDFS required). The reason is that
>> we use Hadoop to bridge between different file systems. The only thing you
>> have to make sure is that you have the respective file system
>> implementation on your classpath.
>>
>> I think you can access Windows Azure Blob Storage via Hadoop [1],
>> similarly to how you access S3, for example.
>>
>> If you use S3 to store your checkpoint data, then you will benefit from
>> all the advantages of S3 but also suffer from its drawbacks (e.g. that
>> list operations are more costly). But these are not specific to Flink.
>>
>> A URL like file:// usually indicates a local file. Thus, if your Flink
>> cluster is not running on a single machine, this won't work.
>>
>> [1] https://hadoop.apache.org/docs/stable/hadoop-azure/index.html
>>
>> Cheers,
>> Till
>>
>> On Thu, May 11, 2017 at 10:41 AM, Ayush Goyal <ay...@helpshift.com> wrote:
>>
>>> Hello,
>>>
>>> I had a few questions regarding checkpoint storage options using
>>> RocksDBStateBackend.
>>> In the Flink 1.2 documentation, it is the recommended state backend
>>> due to its ability to store large states and its asynchronous
>>> snapshotting. For high availability, HDFS seems to be the recommended
>>> store for state backend data. In the AWS deployment section, it is
>>> also mentioned that S3 can be used for storing state backend data.
>>>
>>> We don't want to depend on a Hadoop cluster for the Flink deployment,
>>> so I had the following questions:
>>>
>>> 1. Can we use any storage backend supported by Flink for storing
>>> RocksDB StateBackend data with file URLs? There are quite a few
>>> supported, as mentioned here:
>>> https://ci.apache.org/projects/flink/flink-docs-release-1.3/internals/filesystems.html
>>> and here:
>>> https://github.com/apache/flink/blob/master/docs/dev/batch/connectors.md
>>>
>>> 2. Is there some work already done to support Windows Azure Blob
>>> Storage for storing state backend data? There are some docs here:
>>> https://github.com/apache/flink/blob/master/docs/dev/batch/connectors.md
>>> Can we utilize this for that?
>>>
>>> 3. If utilizing S3 for the state backend, is there any performance
>>> impact?
>>>
>>> 4. For high availability, can we use an NFS volume for the state
>>> backend, with "file://" URLs? Will there be any performance impact?
>>>
>>> PS: I posted this email earlier via nabble, but it's not showing up in
>>> the Apache archive, so I'm sending it again. Apologies if it results
>>> in multiple threads.
>>>
>>> -- Ayush
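PS: To make question 4 concrete, the NFS variant I have in mind would look roughly like this (the mount point is a placeholder, and this assumes the same path is mounted on every node):

```yaml
# flink-conf.yaml (sketch; mount point is a placeholder)
state.backend: rocksdb
# A file:// URL only works if this path is a shared, replicated mount
# (e.g. NFS) visible at the same location on all TaskManagers and the
# JobManager, per Stephan's comment above:
state.backend.fs.checkpointdir: file:///mnt/nfs/flink/checkpoints
```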