Where are you calling checkpointing? Metadata checkpointing for a Kafka direct stream should just be the offsets, not the data.
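To make that concrete: with the direct stream, each batch's RDD knows exactly which Kafka offset ranges it covers, and those ranges (not the records) are what the metadata checkpoint persists. A minimal sketch, assuming kafkaStream is the DStream returned by KafkaUtils.createDirectStream:

    import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

    kafkaStream.foreachRDD { rdd =>
      // Each RDD from the direct stream carries its offset ranges; this
      // per-partition bookkeeping is all that metadata checkpointing saves.
      val ranges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      ranges.foreach { o =>
        println(s"${o.topic} partition ${o.partition}: ${o.fromOffset} -> ${o.untilOffset}")
      }
    }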
TD can better speak to reduceByKeyAndWindow behavior when restoring from a checkpoint, but ultimately the only available choices are: replay the prior window data from Kafka; replay the prior window data from a checkpoint or other storage (there's not much reason for this, since it's already stored in Kafka); or lose the prior window data.

On Sat, Jan 23, 2016 at 3:47 PM, gaurav sharma <sharmagaura...@gmail.com> wrote:
> Hi Tathagata/Cody,
>
> I am facing a challenge in production with DAG behaviour during
> checkpointing in Spark Streaming -
>
> Step 1: Read data from Kafka every 15 min - call this KafkaStreamRDD,
> ~100 GB of data.
>
> Step 2: Repartition KafkaStreamRDD from 5 to 100 partitions to
> parallelise processing - call this RepartitionedKafkaStreamRDD.
>
> Step 3: On RepartitionedKafkaStreamRDD, run map and reduceByKeyAndWindow
> over a window of 2 hours - call this RDD1, ~100 MB of data.
>
> Checkpointing is enabled.
>
> If I restart my streaming context, it picks up from the last
> checkpointed state: it re-reads the data for all 8 successfully finished
> 15-minute batches from Kafka and re-performs the repartitioning of all
> the data of those 8 batches. Then it reads the data for the current
> 15-minute batch and runs map and reduceByKeyAndWindow over a window of
> 2 hours.
>
> Challenges -
>
> 1> I can't checkpoint KafkaStreamRDD or RepartitionedKafkaStreamRDD, as
> this is huge - around 800 GB for 2 hours. Reading and writing
> (checkpointing) this every 15 minutes would be very slow.
>
> 2> Given that I have checkpointed RDD1 every 15 minutes, that map and
> reduceByKeyAndWindow run over RDD1 only, and that I have a snapshot of
> each of the last 8 15-minute batches of RDD1, why does Spark read all
> the data for the last 8 successfully completed batches from Kafka again
> (Step 1), perform the repartitioning again (Step 2), and then run map
> and reduceByKeyAndWindow again over the newly fetched KafkaStreamRDD
> data of those 8 batches?
>
> Because of the above challenges, I am not able to exploit checkpointing
> when the streaming context is restarted at high load.
>
> Please help me understand if there is something I am missing.
>
> Regards,
> Gaurav
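For reference, here's a minimal sketch of the pipeline as described above (broker address, topic name, keying logic, and checkpoint path are all placeholder assumptions), using the Spark 1.x direct stream API. On restart, StreamingContext.getOrCreate restores the DStream graph from the checkpoint, and since only offsets were checkpointed for the direct stream, recomputing the 2-hour window means re-reading those batches from Kafka:

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Minutes, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    object WindowedPipeline {
      val checkpointDir = "hdfs:///tmp/streaming-checkpoint"  // placeholder path

      def createContext(): StreamingContext = {
        val conf = new SparkConf().setAppName("windowed-pipeline")
        val ssc = new StreamingContext(conf, Minutes(15))     // 15-minute batches
        ssc.checkpoint(checkpointDir)

        // Step 1: direct stream yields (key, value) pairs straight from Kafka.
        val kafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
          ssc, Map("metadata.broker.list" -> "broker1:9092"), Set("events"))

        // Step 2: repartition from 5 to 100 partitions to parallelise processing.
        val repartitioned = kafkaStream.repartition(100)

        // Step 3: map to (key, count) pairs, then aggregate over a 2-hour
        // window sliding every 15 minutes.
        val rdd1 = repartitioned
          .map { case (_, value) => (value, 1L) }             // placeholder keying logic
          .reduceByKeyAndWindow((a: Long, b: Long) => a + b, Minutes(120), Minutes(15))

        rdd1.print()
        ssc
      }

      def main(args: Array[String]): Unit = {
        // Restores from checkpointDir if present, otherwise builds a fresh context.
        val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
        ssc.start()
        ssc.awaitTermination()
      }
    }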