Hello,
We are using Spark Streaming 1.4.1 in AWS EMR to process records from
Kinesis.  Our Spark program saves RDDs to S3, after which the records are
picked up by a Lambda function that loads them into Redshift.  It is
important to us that no data is lost during processing.

We have set our Kinesis checkpoint interval to 15 minutes, which is also
our window size.
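
For reference, this is roughly how we set things up (the app, stream,
region, and S3 path below are placeholders, and the 15-minute batch
interval stands in for our window):

    import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream
    import org.apache.spark.SparkConf
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.{Minutes, StreamingContext}
    import org.apache.spark.streaming.kinesis.KinesisUtils

    val sparkConf = new SparkConf().setAppName("our-app")
    // 15-minute batches, matching the Kinesis checkpoint interval.
    val ssc = new StreamingContext(sparkConf, Minutes(15))

    val stream = KinesisUtils.createStream(
      ssc, "our-app", "our-stream", "kinesis.us-east-1.amazonaws.com",
      "us-east-1", InitialPositionInStream.TRIM_HORIZON,
      Minutes(15),                    // Kinesis checkpoint interval
      StorageLevel.MEMORY_AND_DISK_2)

    // Write each batch to S3; a Lambda function loads the files into Redshift.
    stream.foreachRDD { (rdd, time) =>
      rdd.map(bytes => new String(bytes, "UTF-8"))
         .saveAsTextFile(s"s3n://our-bucket/records/${time.milliseconds}")
    }

    ssc.start()
    ssc.awaitTermination()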

Unfortunately, the Kinesis checkpoint (the shard sequence numbers the KCL
stores in DynamoDB) is advanced once the receiver has received the data,
not after we have successfully written it to S3.  If batches back up in
Spark and the cluster is terminated, whatever data was in memory is lost:
it was checkpointed, so it will not be re-read from Kinesis, but it was
never actually saved to S3.

We are considering forking and modifying the kinesis-asl library with
changes that would allow us to perform the checkpoint manually and at the
right time.  We'd rather not do this.
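
Concretely, what we have in mind is something like the sketch below.  The
CheckpointMode argument and the manual checkpoint call are hypothetical
(nothing like them exists in the stock 1.4.1 kinesis-asl), which is why a
fork would be needed:

    // Hypothetical API: the stock receiver does not expose the KCL
    // checkpointer, so none of the "Forked*" names below exist today.
    val stream = ForkedKinesisUtils.createStream(
      ssc, "our-app", "our-stream", "kinesis.us-east-1.amazonaws.com",
      "us-east-1", InitialPositionInStream.TRIM_HORIZON,
      CheckpointMode.Manual,          // hypothetical: no timer-based checkpoints
      StorageLevel.MEMORY_AND_DISK_2)

    stream.foreachRDD { (rdd, time) =>
      rdd.map(bytes => new String(bytes, "UTF-8"))
         .saveAsTextFile(s"s3n://our-bucket/records/${time.milliseconds}")
      // Advance the Kinesis checkpoint only after the S3 write has succeeded.
      stream.checkpointProcessedRecords()   // hypothetical manual checkpoint
    }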

Are we overlooking an easier way to deal with this problem?  Thank you in
advance for your insight!

Alan
