On 23 Sep 2015, at 07:10, Tathagata Das <t...@databricks.com> wrote:
> Responses inline.
>
> On Tue, Sep 22, 2015 at 8:35 PM, Michal Čizmazia <mici...@gmail.com> wrote:
>
>> Can checkpoints be stored to S3 (via S3/S3A Hadoop URL)?
>
> Yes, because each checkpoint is a single file by itself and does not
> require flush semantics to work. So S3 is fine.
>
>> Trying to answer this question, I looked into
>> Checkpoint.getCheckpointFiles [1]. It does a findFirstIn over the
>> directory listing, which would presumably translate into an S3 LIST
>> operation. S3 LIST is prone to eventual consistency [2]. What would
>> happen if getCheckpointFiles handed an incomplete list of files to
>> Checkpoint.read [1]?
>
> There is a non-zero chance of that happening. But in that case it will
> just use an older checkpoint file to recover the DAG of DStreams. That
> just means it will recover from an earlier point in time and redo more
> computation.

Yes, it's the listings that are most prone to inconsistency. US-East is
also the worst; the other regions all guarantee create consistency (I
think US-East does not for some endpoints). BTW, Google's and Microsoft's
object stores do offer consistency; OpenStack's Swift is pretty bad.

>> The pluggable WAL interface allows me to work around the eventual
>> consistency of S3 by storing an index of filenames in DynamoDB.
>> However, it seems that something similar is required for checkpoints
>> as well.

Netflix's s3mper extension to S3 claims to offer a consistent view of the
S3 filesystem, moving the directory listings into DynamoDB while leaving
the data in S3.
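For anyone wondering why a stale LIST is benign for checkpoint recovery: the recovery path is essentially a regex scan over a directory listing, so a missing entry can only hide the newest checkpoint. Here is a simplified sketch of that pattern against the Hadoop FileSystem API; the names and the regex are illustrative, not Spark's actual Checkpoint code.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object CheckpointListingSketch {

  // Matches names like "checkpoint-1443000000000" (illustrative pattern).
  private val CheckpointRe = """checkpoint-(\d+)""".r

  /** Lists checkpoint files under `dir`, newest first. On S3 the
   *  underlying LIST may be eventually consistent, so a just-written
   *  file can be invisible; recovery then simply starts from the newest
   *  checkpoint that *is* visible, i.e. an older one. */
  def getCheckpointFiles(dir: String, conf: Configuration): Seq[Path] = {
    val path = new Path(dir)
    val fs: FileSystem = path.getFileSystem(conf) // on S3A this means LIST calls
    fs.listStatus(path).toSeq
      .map(_.getPath)
      .filter(p => CheckpointRe.findFirstIn(p.getName).isDefined)
      .sortBy(p => CheckpointRe.findFirstMatchIn(p.getName).get.group(1).toLong)
      .reverse
  }
}
```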
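To make the DynamoDB workaround concrete: the WAL is pluggable through Spark's org.apache.spark.streaming.util.WriteAheadLog developer API (wired up via the spark.streaming.driver.writeAheadLog.class setting in Spark 1.5). Below is a rough sketch of the shape such a log could take. The FileNameIndex trait, the S3SegmentHandle class, and all S3/DynamoDB calls are hypothetical stand-ins, not anything Spark or the AWS SDK provides, and the reflection-based construction details are omitted.

```scala
import java.nio.ByteBuffer
import java.util.{Iterator => JIterator}

import scala.collection.JavaConverters._

import org.apache.spark.streaming.util.{WriteAheadLog, WriteAheadLogRecordHandle}

// Hypothetical index abstraction; a real implementation would back these
// three calls with DynamoDB put/query/delete. Spark does not ship this trait.
trait FileNameIndex {
  def add(segment: String, time: Long): Unit
  def allSegments(): Seq[String]
  def dropOlderThan(threshTime: Long): Unit
}

// Illustrative handle carrying the S3 object name of one log segment.
case class S3SegmentHandle(segment: String, time: Long)
  extends WriteAheadLogRecordHandle

// Sketch: records live in S3, but "which segments exist?" is always
// answered by the index, never by an S3 LIST.
class S3IndexedWriteAheadLog(index: FileNameIndex) extends WriteAheadLog {

  override def write(record: ByteBuffer, time: Long): WriteAheadLogRecordHandle = {
    val segment = s"wal-$time"
    // ... PUT `record` to S3 under key `segment` (elided) ...
    index.add(segment, time)          // make the name durably discoverable
    S3SegmentHandle(segment, time)
  }

  override def read(handle: WriteAheadLogRecordHandle): ByteBuffer = handle match {
    case S3SegmentHandle(segment, _) =>
      // ... GET `segment` from S3 (elided); a single-key GET does not
      // depend on listing consistency ...
      ByteBuffer.allocate(0)
  }

  override def readAll(): JIterator[ByteBuffer] =
    index.allSegments().map { segment =>
      // ... GET each `segment` from S3 (elided) ...
      ByteBuffer.allocate(0)
    }.iterator.asJava

  override def clean(threshTime: Long, waitForCompletion: Boolean): Unit = {
    // ... DELETE the expired segments from S3 (elided) ...
    index.dropOlderThan(threshTime)
  }

  override def close(): Unit = ()
}
```

The design point is that only enumeration is consistency-sensitive: readAll and clean ask the index which segments exist, while S3 is touched solely through single-key GET/PUT/DELETE, so no recovery decision ever rests on an eventually consistent listing.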