Hello,

Thanks all for considering our problem. We are doing transformations in Spark Streaming. We have also since learned that WAL to S3 on 1.4 is "not reliable" [1].
We are just going to wait for EMR to support 1.5 and hopefully this won't be a problem anymore [2].

Alan

1. https://mail-archives.apache.org/mod_mbox/spark-user/201508.mbox/%3CCA+AHuKkH9r0BwQMgQjDG+j=qdcqzpow1rw1u4d0nrcgmq5x...@mail.gmail.com%3E
2. https://issues.apache.org/jira/browse/SPARK-9215

On Fri, Sep 18, 2015 at 4:23 AM, Nick Pentreath <nick.pentre...@gmail.com> wrote:

> Are you doing actual transformations / aggregation in Spark Streaming? Or
> just using it to bulk write to S3?
>
> If the latter, then you could just use your AWS Lambda function to read
> directly from the Kinesis stream. If the former, then perhaps either look
> into the WAL option that Aniket mentioned, or perhaps you could write the
> processed RDD back to Kinesis, and have the Lambda function read the
> Kinesis stream and write to Redshift?
>
> On Thu, Sep 17, 2015 at 5:48 PM, Alan Dipert <a...@dipert.org> wrote:
>
>> Hello,
>> We are using Spark Streaming 1.4.1 in AWS EMR to process records from
>> Kinesis. Our Spark program saves RDDs to S3, after which the records are
>> picked up by a Lambda function that loads them into Redshift. That no data
>> is lost during processing is important to us.
>>
>> We have set our Kinesis checkpoint interval to 15 minutes, which is also
>> our window size.
>>
>> Unfortunately, checkpointing happens after receiving data from Kinesis,
>> not after we have successfully written to S3. If batches back up in Spark,
>> and the cluster is terminated, whatever data was in-memory will be lost
>> because it was checkpointed but not actually saved to S3.
>>
>> We are considering forking and modifying the kinesis-asl library with
>> changes that would allow us to perform the checkpoint manually and at the
>> right time. We'd rather not do this.
>>
>> Are we overlooking an easier way to deal with this problem? Thank you in
>> advance for your insight!
>>
>> Alan