Hello,

I am trying to figure out an appropriate checkpoint interval for my Spark
Streaming application. It uses the Spark Kafka integration based on Direct
Streams.

If my *micro batch interval is 2 mins*, and let's say *each microbatch
takes only 15 secs to process* then shouldn't my checkpoint interval also
be exactly 2 mins?

Assuming my Spark Streaming application starts at t=0, the following will
be the state of my checkpoints:

*Case 1: checkpoint interval is less than microbatch interval*
If I keep my *checkpoint interval at, say, 1 minute* then:
*t=1m*: no incomplete batches in this checkpoint
*t=2m*: first microbatch is included as an incomplete microbatch in the
checkpoint and microbatch execution then begins
*t=3m*: no incomplete batches in the checkpoint, since the first microbatch
finished processing in just 15 secs
*t=4m*: second microbatch is included as an incomplete microbatch in the
checkpoint and microbatch execution then begins
*t=4m30s*: system breaks down; on restarting, the streaming application
finds the checkpoint at t=4m with the second microbatch as the incomplete
microbatch and processes it again. But what's the point of reprocessing it,
since the second microbatch's processing was completed at t=4m15s?
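
To make the Case 1 timeline concrete, here is a minimal sketch in plain
Python. It only encodes the assumptions stated in this post (a microbatch
is generated at every multiple of the batch interval, finishes a fixed
time later, and a checkpoint records every batch that has started but not
yet finished). The function names are hypothetical and this is not a
description of how Spark actually writes checkpoints.

```python
# Times are in seconds. Batch k is assumed to start at k * batch_interval
# and to finish at k * batch_interval + processing_time.

def incomplete_batches(t, batch_interval, processing_time):
    """Batches that have started by time t but have not finished by t."""
    pending = []
    k = 1
    while k * batch_interval <= t:
        if k * batch_interval + processing_time > t:
            pending.append(k)
        k += 1
    return pending


def checkpoint_timeline(checkpoint_interval, batch_interval,
                        processing_time, until):
    """List of (checkpoint time, incomplete batches) up to `until`."""
    return [
        (t, incomplete_batches(t, batch_interval, processing_time))
        for t in range(checkpoint_interval, until + 1, checkpoint_interval)
    ]


# Case 1: 1 minute checkpoints, 2 minute batches, 15 second processing.
for t, pending in checkpoint_timeline(60, 120, 15, 240):
    print(f"t={t}s incomplete={pending}")
```

Under these assumptions the checkpoints at t=2m and t=4m each carry one
incomplete microbatch, and the ones at t=1m and t=3m carry none.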

*Case 2: checkpoint interval is more than microbatch interval*
If I keep my *checkpoint interval at say 4 minutes* then:
*t=2m*: first microbatch execution begins
*t=4m*: first checkpoint with second microbatch included as the only
incomplete batch; second microbatch processing begins

*Sub case 1:* *system breaks down at t=2m30s:* the first microbatch
execution was completed at t=2m15s, but there is no checkpoint information
about this microbatch since the first checkpoint will happen at t=4m.
Consequently, when the streaming app restarts it will re-execute that
microbatch by fetching the offsets from Kafka.

*Sub case 2:* *system breaks down at t=5m:* The second microbatch was
already completed in 15 secs, i.e. at t=4m15s, which means at t=5m there
should ideally be no incomplete batches. When I restart my application, the
streaming application finds the second microbatch as incomplete from the
checkpoint made at t=4m, and re-executes that microbatch.
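
The crash scenarios above can be checked with a small plain-Python sketch.
It assumes, as this post does, that on restart the application replays
whatever the most recent checkpoint listed as incomplete. The helper names
are hypothetical, and the model is only my understanding, not Spark's
actual recovery logic.

```python
# Times are in seconds. Batch k is assumed to start at k * batch_interval
# and to finish at k * batch_interval + processing_time.

def last_checkpoint_time(crash_time, checkpoint_interval):
    """Time of the latest checkpoint at or before the crash, or None."""
    t = (crash_time // checkpoint_interval) * checkpoint_interval
    return t if t > 0 else None


def replayed_batches(crash_time, checkpoint_interval,
                     batch_interval, processing_time):
    """Batches the last checkpoint marks incomplete; None if no checkpoint."""
    ckpt = last_checkpoint_time(crash_time, checkpoint_interval)
    if ckpt is None:
        return None
    return [
        k for k in range(1, ckpt // batch_interval + 1)
        if k * batch_interval + processing_time > ckpt
    ]


def duplicate_batches(crash_time, checkpoint_interval,
                      batch_interval, processing_time):
    """Replayed batches that had already finished before the crash."""
    replayed = replayed_batches(crash_time, checkpoint_interval,
                                batch_interval, processing_time) or []
    return [
        k for k in replayed
        if k * batch_interval + processing_time <= crash_time
    ]


# Case 2, sub case 1: crash at 2m30s, before the first 4 minute checkpoint.
print(replayed_batches(150, 240, 120, 15))
# Case 2, sub case 2: crash at 5m, after the checkpoint at 4m.
print(replayed_batches(300, 240, 120, 15), duplicate_batches(300, 240, 120, 15))
```

In this model, sub case 1 has no checkpoint to recover from, while sub
case 2 replays the second microbatch even though it had already finished.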


Is my understanding right? If yes, then isn't my checkpoint interval
incorrectly set, resulting in duplicate processing in both of the cases
above? And if so, how do I choose an appropriate checkpoint interval?

Regards,
Sheel
