Where are you calling checkpoint()? Metadata checkpointing for a Kafka
direct stream should just be the offsets, not the data.
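
For reference, this is roughly what the driver setup should look like
(the broker list, topic, and checkpoint directory below are placeholders,
not taken from your job). The important part is that all stream setup
happens inside the factory function passed to getOrCreate, so that on
restart the context is rebuilt from the checkpoint and the direct stream
resumes from the saved offsets:

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Minutes, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    // All setup, including the stream and its transformations, belongs in
    // the factory; getOrCreate only invokes it when no checkpoint exists.
    def createContext(): StreamingContext = {
      val conf = new SparkConf().setAppName("example")
      val ssc = new StreamingContext(conf, Minutes(15))
      ssc.checkpoint("hdfs:///path/to/checkpoint")  // placeholder path

      val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
      val stream = KafkaUtils.createDirectStream[
        String, String, StringDecoder, StringDecoder](
        ssc, kafkaParams, Set("mytopic"))
      // ... transformations and output operations go here ...
      ssc
    }

    // Recovers from the checkpoint if one exists; only offset ranges and
    // metadata are restored, not the message data itself.
    val ssc = StreamingContext.getOrCreate("hdfs:///path/to/checkpoint",
      createContext _)
    ssc.start()
    ssc.awaitTermination()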

TD can better speak to reduceByKeyAndWindow behavior when restoring from a
checkpoint, but ultimately the only available choices are: replay the
prior window data from Kafka; replay it from checkpoint / other storage
(not much reason for this, since it's already stored in Kafka); or lose
the prior window data.
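
Purely as an illustration (the types and reduce functions below are
placeholder assumptions, not your code), the incremental form of
reduceByKeyAndWindow takes an inverse reduce function, which lets Spark
compute each window from the previous one and checkpoint the windowed
RDDs rather than re-reducing the whole window; TD can say how much that
changes what gets replayed on recovery:

    import org.apache.spark.streaming.Minutes
    import org.apache.spark.streaming.dstream.DStream

    // Assumes a stream of per-key counts, DStream[(String, Long)]; names
    // are hypothetical. Each new window = previous window + batch entering
    // - batch leaving, instead of re-reducing the full 2 hours of data.
    def windowedCounts(counts: DStream[(String, Long)]): DStream[(String, Long)] =
      counts.reduceByKeyAndWindow(
        (a: Long, b: Long) => a + b,  // add values entering the window
        (a: Long, b: Long) => a - b,  // subtract values leaving the window
        Minutes(120),                 // window length: 2 hours
        Minutes(15))                  // slide interval: 15 minutes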



On Sat, Jan 23, 2016 at 3:47 PM, gaurav sharma <sharmagaura...@gmail.com>
wrote:

> Hi Tathagata/Cody,
>
> I am facing a challenge in production with DAG behaviour during
> checkpointing in Spark Streaming -
>
> Step 1 : Read data from Kafka every 15 min - call this KafkaStreamRDD, ~
> 100 GB of data
>
> Step 2 : Repartition KafkaStreamRDD from 5 to 100 partitions to
> parallelise processing - call this RepartitionedKafkaStreamRDD
>
> Step 3 : On this RepartitionedKafkaStreamRDD I run map and
> reduceByKeyAndWindow over a window of 2 hours - call this RDD1, ~ 100 MB
> of data
>
> Checkpointing is enabled.
>
> If I restart my streaming context, it picks up from the last checkpointed
> state:
>
> it READS data for all 8 SUCCESSFULLY FINISHED 15-minute batches from
> Kafka and re-performs the repartition of all the data of these 8
> 15-minute batches.
>
> Then it reads data for the current 15-minute batch and runs map and
> reduceByKeyAndWindow over a window of 2 hours.
>
> Challenges -
> 1> I can't checkpoint KafkaStreamRDD or RepartitionedKafkaStreamRDD, as
> this is huge data, around 800 GB for 2 hours; reading and writing
> (checkpointing) this every 15 minutes would be very slow.
>
> 2> When I have checkpointed RDD1 every 15 minutes, map and
> reduceByKeyAndWindow are run over RDD1 only, and I have snapshots of all
> of the last 8 15-minute batches of RDD1,
> why is Spark reading all the data for the last 8 successfully completed
> batches from Kafka again (Step 1), performing the repartitioning again
> (Step 2), and then running map and reduceByKeyAndWindow again over this
> newly fetched KafkaStreamRDD data of the last 8 15-minute batches?
>
> Because of the above-mentioned challenges, I am not able to exploit
> checkpointing in case the streaming context is restarted under high load.
>
> Please help me understand if there is something I am missing.
>
> Regards,
> Gaurav
>
