Hi, I have a question about achieving fault tolerance when counting with Spark and storing the aggregate counts in Cassandra.
Example input: RDD 1 = [a, a, a], RDD 2 = [a, a]. After aggregating RDD 1 (i.e., map + reduceByKey) we get Map(a -> 3), and after aggregating RDD 2 we get Map(a -> 2). Now let's store these in Cassandra.

Someone here, Pankaj, mentioned the common trick of storing the last transaction ID of the RDD in Cassandra along with the data. This works if, after a Spark node crash, the DStream is not replayed entirely but only from the last RDD onward. For example, after saving RDD 1 into Cassandra, the table holds name: a, count: 3, RDD_id: 1. Now let's crash the Spark node running my DStream. The DStream is recovered on other Spark nodes, and if it continues where it left off, RDD 2 is processed next, correctly giving name: a, count: 5, RDD_id: 2. But if the DStream is re-executed from the beginning, the tuples of RDD 1 are counted a second time, producing duplicates in the count value.

So I need to know how a node crash is handled:
a) the DStream is restarted entirely and replayed from the beginning, or
b) the DStream resumes from the last RDD processed,
so that I can design the database schema to make each update idempotent: for (a) I need to save an ID for each RDD, while for (b) all I need is the ID of the last RDD.

I also welcome schema suggestions for making the aggregation idempotent. Thanks a lot! -Adrian
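To make case (a) concrete, here is a minimal sketch of a schema that stays idempotent under full replay: one row per (name, RDD id), with the total computed across rows on read. A plain Python dict stands in for the Cassandra table (an assumption for testability); in CQL this would correspond to a table with PRIMARY KEY ((name), rdd_id), where re-inserting the same row is harmless because Cassandra writes are upserts.

```python
# Sketch of scheme (a): store one row per (name, rdd_id).
# `table` is a dict standing in for a Cassandra table whose
# primary key is (name, rdd_id).

def save_batch(table, rdd_id, counts):
    """Store the per-batch counts keyed by (name, rdd_id).

    Replaying the same RDD overwrites the same rows with the same
    values, so the write is idempotent even under full replay."""
    for name, count in counts.items():
        table[(name, rdd_id)] = count

def total(table, name):
    """Aggregate the running count across all batches on read."""
    return sum(c for (n, _), c in table.items() if n == name)

table = {}
save_batch(table, 1, {"a": 3})   # RDD 1 = [a, a, a]
save_batch(table, 2, {"a": 2})   # RDD 2 = [a, a]
save_batch(table, 1, {"a": 3})   # replay of RDD 1 after a crash
assert total(table, "a") == 5    # no double counting
```

The cost of this layout is a read-time aggregation over all batch rows per key; if that grows too large, old rows can be rolled up into a base count periodically.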
