We are using Kinesis with Spark Streaming 1.5 on a YARN cluster. When we
enable checkpointing in Spark, where in the Kinesis stream should a
restarted driver continue? I run a simple experiment as follows:
1. In the first driver run, Spark driver processes 1 million records
starting from
Your observation is correct! The current implementation of checkpointing to
DynamoDB is tied to the presence of new data from Kinesis (I think that
emulates the KCL behavior), if there is no data for while, the
checkpointing does not occur. That explains your observation.
I have filed a JIRA to