First of all, if you are running batches of 15 minutes and you don't need
second-level latencies, it might just be easier to run batch jobs in a for
loop - you will have greater control over what is going on.
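
A minimal sketch of that loop, assuming the Kafka 0.8 direct API -
loadOffsets/saveOffsets are hypothetical placeholders you would back with
ZooKeeper or a database:

    import kafka.serializer.StringDecoder
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

    val sc = new SparkContext(new SparkConf().setAppName("batch-loop"))
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")

    // placeholders: persist consumed offsets yourself, e.g. in ZooKeeper or a DB
    def loadOffsets(): Array[OffsetRange] = ???
    def saveOffsets(ranges: Array[OffsetRange]): Unit = ???

    while (true) {
      val ranges = loadOffsets()
      // read exactly one 15-minute slice of Kafka as a plain batch RDD
      val batch = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
        sc, kafkaParams, ranges)
      // ... aggregate and write out results here ...
      saveOffsets(ranges)           // commit offsets only after the batch succeeds
      Thread.sleep(15 * 60 * 1000L)
    }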
And if you are using reduceByKeyAndWindow without the inverse reduce
function, then after recovering the driver from checkpoint files Spark has
to re-read all of the last 2 hours of data from Kafka to compute
aggregations over that window (the data read earlier is lost when the
driver dies).
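
For reference, the incremental variant takes a second "inverse" function
that subtracts the batches sliding out of the window, so Spark only needs
the previous window's result plus the newest batch. A sketch with
hypothetical String keys and Long counts (this variant requires
checkpointing to be enabled):

    import org.apache.spark.streaming.Minutes
    import org.apache.spark.streaming.dstream.DStream

    def windowedCounts(counts: DStream[(String, Long)]): DStream[(String, Long)] =
      counts.reduceByKeyAndWindow(
        (a: Long, b: Long) => a + b,   // fold in batches entering the window
        (a: Long, b: Long) => a - b,   // inverse: remove batches leaving the window
        Minutes(120),                  // window length: 2 hours
        Minutes(15)                    // slide interval: 15 minutes
      )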

On Sat, Jan 23, 2016 at 1:47 PM, gaurav sharma <sharmagaura...@gmail.com>
wrote:

> Hi Tathagata/Cody,
>
> I am facing a challenge in production with DAG behaviour during
> checkpointing in Spark Streaming -
>
> Step 1: Read data from Kafka every 15 min - call this KafkaStreamRdd, ~100
> GB of data
>
> Step 2: Repartition KafkaStreamRdd from 5 to 100 partitions to parallelise
> processing - call this RepartitionedKafkaStreamRdd
>
> Step 3: On this RepartitionedKafkaStreamRdd I run map and
> reduceByKeyAndWindow over a window of 2 hours - call this RDD1, ~100 MB of
> data
>
> Checkpointing is enabled.
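>
> A minimal sketch of this pipeline (broker, topic, and key extraction are
> hypothetical placeholders; the Spark 1.x Kafka direct API is assumed):
>
>     import kafka.serializer.StringDecoder
>     import org.apache.spark.SparkConf
>     import org.apache.spark.streaming.{Minutes, StreamingContext}
>     import org.apache.spark.streaming.kafka.KafkaUtils
>
>     val ssc = new StreamingContext(
>       new SparkConf().setAppName("kafka-windowed-aggregation"), Minutes(15))
>     ssc.checkpoint("hdfs:///checkpoints/app")
>
>     // Step 1: KafkaStreamRdd, ~100 GB per 15-minute batch
>     val kafkaStream = KafkaUtils.createDirectStream[String, String,
>       StringDecoder, StringDecoder](
>       ssc, Map("metadata.broker.list" -> "broker1:9092"), Set("events"))
>
>     // Step 2: RepartitionedKafkaStreamRdd, 5 -> 100 partitions
>     val repartitioned = kafkaStream.repartition(100)
>
>     // Step 3: RDD1, ~100 MB - 2-hour window sliding every 15 minutes
>     val rdd1 = repartitioned
>       .map { case (_, line) => (line.split(",")(0), 1L) }
>       .reduceByKeyAndWindow(_ + _, Minutes(120), Minutes(15))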
>
> If I restart my streaming context, it picks up from the last checkpointed
> state,
>
> READS data for all 8 SUCCESSFULLY FINISHED 15-minute batches from Kafka,
> and re-performs the repartitioning of all the data of those 8 15-minute
> batches.
>
> Then it reads data for the current 15-minute batch and runs map and
> reduceByKeyAndWindow over a window of 2 hours.
>
> Challenge -
> 1> I can't checkpoint KafkaStreamRdd or RepartitionedKafkaStreamRdd, as
> this is huge data - around 800 GB for 2 hours - and reading and writing
> (checkpointing) it every 15 minutes would be very slow.
>
> 2> When I have checkpointed RDD1 every 15 minutes, map and
> reduceByKeyAndWindow run over RDD1 only, and I have snapshots of all of the
> last 8 15-minute batches of RDD1,
> why is Spark reading all the data for the last 8 successfully completed
> batches from Kafka again (Step 1), performing the repartitioning again
> (Step 2), and then re-running map and reduceByKeyAndWindow over this newly
> fetched KafkaStreamRdd data of the last 8 15-minute batches?
>
> Because of the above challenges, I am not able to exploit checkpointing
> when the streaming context is restarted under high load.
>
> Please help me understand if there is something I am missing.
>
> Regards,
> Gaurav
>
