Hello, I am trying to figure an appropriate checkpoint interval for my spark streaming application. Its Spark Kafka integration based on Direct Streams.
If my *micro batch interval is 2 mins*, and let's say *each microbatch takes only 15 secs to process* then shouldn't my checkpoint interval also be exactly 2 mins? Assuming my spark streaming application starts at t=0, following will be the state of my checkpoint: *Case 1: checkpoint interval is less than microbatch interval* If I keep my *checkpoint interval at say 1 minute *then: *t=1m: *no incomplete batches in this checkpoint *t=2m*: first microbatch is included as an incomplete microbatch in the checkpoint and microbatch execution then begins *t=3m*: no incomplete batches in the checkpoint as the first microbatch is finished processing in just 15 secs *t=4m*: second microbatch is included as an incomplete microbatch in the checkpoint and microbatch execution then begins *t=4m30s*: system breaks down; on restarting the streaming application finds the checkpoint at t=4 with the second microbatch as the incomplete microbatch and processes it again. But what's the point of reprocessing it again since the second microbatch's processing was completed at the=4m15s *Case 2: checkpoint interval is more than microbatch interval* If I keep my *checkpoint interval at say 4 minutes* then: *t=2m* first microbatch execution begins *t=4m* first checkpoint with second microbatch included as the only incomplete batch; second microbatch processing begins *Sub case 1 :* *system breaks down at t=2m30s :* the first microbatch execution was completed at the=2m15s but there is no checkpoint information about this microbatch since the first checkpoint will happen at t=4m. Consequently, when the streaming app restarts it will re-execute by fetching the offsets from Kafka. *Sub case 2 :* *system breaks down at t=5m :* The second microbatch was already completed in 15 secs i.e. t=4m15s which means at t=5 there should ideally be no incomplete batches. When I restart my application, the streaming application finds the second microbatch as incomplete from the checkpoint made at t=4m, and re-executes that microbatch. Is my understanding right? If yes, then isn't my checkpoint interval incorrectly set resulting in duplicate processing in both the cases above? If yes, then how do I choose an appropriate checkpoint interval? Regards, Sheel