Hi, I have a question about achieving fault tolerance when counting with Spark and storing the aggregate counts in Cassandra.
Example input: RDD 1 = [a, a, a], RDD 2 = [a, a]. After aggregating RDD 1 (i.e., map + reduceByKey) we get Map(a -> 3), and after aggregating RDD 2 we get Map(a -> 2). Now let's store these in Cassandra.

Someone here, Pankaj, mentioned the common trick of storing the last transaction ID of the RDD in Cassandra along with the data. This works if, after a Spark node crash, the DStream is not replayed entirely but only from the last RDD onward. For example, after saving RDD 1 into Cassandra, the table holds name: a, count: 3, RDD_id: 1. Now let's crash the Spark node running my DStream. The DStream is recovered on other Spark nodes, and if it continues where it left off, RDD 2 is processed next, correctly giving name: a, count: 5, RDD_id: 2. But if the DStream is re-executed from the beginning, the tuples of RDD 1 are counted a second time, producing duplicates in the count value.

So I need to know how a node crash is handled:
a) the DStream is restarted entirely and replayed from the beginning, or
b) the DStream resumes from the last RDD processed,
so that I can design the database schema to make each update idempotent: for (a) I need to save an ID for each RDD, while for (b) all I need is the ID of the last RDD.

I also welcome schema suggestions for making the aggregation idempotent. Thanks a lot! -Adrian
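To make case (a) concrete, here is a minimal sketch of a schema that stays idempotent under full replay: one row per (name, RDD id), with the total computed across rows on read. A plain Python dict stands in for the Cassandra table (an assumption for testability); in CQL this would correspond to a table with PRIMARY KEY ((name), rdd_id), where re-inserting the same row is harmless because Cassandra writes are upserts.

```python
# Sketch of scheme (a): store one row per (name, rdd_id).
# `table` is a dict standing in for a Cassandra table whose
# primary key is (name, rdd_id).

def save_batch(table, rdd_id, counts):
    """Store the per-batch counts keyed by (name, rdd_id).

    Replaying the same RDD overwrites the same rows with the same
    values, so the write is idempotent even under full replay."""
    for name, count in counts.items():
        table[(name, rdd_id)] = count

def total(table, name):
    """Aggregate the running count across all batches on read."""
    return sum(c for (n, _), c in table.items() if n == name)

table = {}
save_batch(table, 1, {"a": 3})   # RDD 1 = [a, a, a]
save_batch(table, 2, {"a": 2})   # RDD 2 = [a, a]
save_batch(table, 1, {"a": 3})   # replay of RDD 1 after a crash
assert total(table, "a") == 5    # no double counting
```

The cost of this layout is a read-time aggregation over all batch rows per key; if that grows too large, old rows can be rolled up into a base count periodically.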
