Are you doing actual transformations / aggregation in Spark Streaming? Or
just using it to bulk write to S3?

If the latter, then you could just use your AWS Lambda function to read
directly from the Kinesis stream. If the former, then either look into the
WAL option that Aniket mentioned, or write the processed RDD back to
Kinesis and have the Lambda function read that stream and load into
Redshift.
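To make the first option concrete, here's a minimal sketch of a Python Lambda handler wired to a Kinesis event source. The event shape (`Records` / `kinesis.data` as base64) is the standard Kinesis-to-Lambda payload; the `load_into_redshift` call is a hypothetical stand-in for however you stage and COPY into Redshift:

```python
import base64
import json

def handler(event, context):
    """Lambda handler for a Kinesis event source mapping.

    Decodes each record's base64-encoded payload. Actually loading
    into Redshift (e.g. staging to S3 and issuing a COPY) is stubbed
    out below, since that part depends on your setup.
    """
    rows = []
    for record in event["Records"]:
        payload = base64.b64decode(record["kinesis"]["data"])
        rows.append(json.loads(payload))
    # load_into_redshift(rows)  # hypothetical loader: S3 staging + COPY
    return {"decoded": len(rows)}
```

Lambda's Kinesis integration only advances the shard iterator when your function succeeds, so a failed batch is retried rather than skipped, which addresses the at-least-once concern directly.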

On Thu, Sep 17, 2015 at 5:48 PM, Alan Dipert <a...@dipert.org> wrote:

> Hello,
> We are using Spark Streaming 1.4.1 on AWS EMR to process records from
> Kinesis.  Our Spark program saves RDDs to S3, after which the records are
> picked up by a Lambda function that loads them into Redshift.  It is
> important to us that no data is lost during processing.
>
> We have set our Kinesis checkpoint interval to 15 minutes, which is also
> our window size.
>
> Unfortunately, checkpointing happens after receiving data from Kinesis,
> not after we have successfully written to S3.  If batches back up in Spark,
> and the cluster is terminated, whatever data was in-memory will be lost
> because it was checkpointed but not actually saved to S3.
>
> We are considering forking and modifying the kinesis-asl library with
> changes that would allow us to perform the checkpoint manually and at the
> right time.  We'd rather not do this.
>
> Are we overlooking an easier way to deal with this problem?  Thank you in
> advance for your insight!
>
> Alan
>
