Hey, I actually have 2 questions.

(1) I want to generate unique IDs for each RDD element, and I want to assign them in parallel, so I do:

    rdd.mapPartitionsWithIndex { (index, s) =>
      s.zipWithIndex.map { case (t, i) =>
        // ID = partition index times a global max-partition-size constant,
        // plus the element's position within the partition
        (index * GLOBAL.MAX_PARTITION_SIZE + i, t)
      }
    }

This works OK, but we noticed that unless we checkpoint the RDD, the IDs get scrambled whenever a partition is recomputed. Question 1: is there a better way to create unique IDs in a distributed way?
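For context, here is a minimal, self-contained sketch of the scheme. The constant, app name, and toy data are made up for illustration; GLOBAL.MAX_PARTITION_SIZE is our own config value, assumed here to be 1,000,000 (any bound larger than the biggest partition keeps the IDs unique):

    import org.apache.spark.{SparkConf, SparkContext}

    object UniqueIdSketch {
      // assumed stand-in for GLOBAL.MAX_PARTITION_SIZE
      val MaxPartitionSize = 1000000L

      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("unique-id-sketch").setMaster("local[2]"))
        // 4 toy elements split across 2 partitions: (a, b) and (c, d)
        val rdd = sc.parallelize(Seq("a", "b", "c", "d"), 2)
        val withIds = rdd.mapPartitionsWithIndex { (index, s) =>
          s.zipWithIndex.map { case (t, i) =>
            (index * MaxPartitionSize + i, t) // e.g. partition 1, element 0 -> 1000000
          }
        }
        withIds.collect().foreach(println) // (0,a) (1,b) (1000000,c) (1000001,d)
        sc.stop()
      }
    }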
(2) To solve the stability issue in (1), we did:

    rdd.persist().checkpoint()

The Spark logs suggested that checkpointed RDDs should be persisted. Should the persist come before or after the checkpoint?

OK, I lied, I have 3 questions.

(3) We are checkpointing to HDFS. We've noticed that sometimes the checkpointing works and we see /RDD-1 etc. written in HDFS, but other times we only see the checkpoint directory created, with no data inside it (see the sketch in the P.S. below for roughly how we wire this up). I suspect this is related to (2), but I'm not certain what is really happening.

Any pointers would be appreciated. I'm using AWS r3.4xlarge machines with Spark 0.9.2.

tks,
shay
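P.S. In case it helps with (2) and (3), here is roughly how we wire it up end to end. This is a sketch, not our real job: the HDFS paths are illustrative, and the final count() is an action I've added here because, as I understand it, the checkpoint data is only written once an action materializes the RDD:

    import org.apache.spark.{SparkConf, SparkContext}

    object CheckpointSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("checkpoint-sketch"))
        sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints") // hypothetical path

        val rdd = sc.textFile("hdfs:///data/input")          // hypothetical input
        val ids = rdd.mapPartitionsWithIndex { (index, s) =>
          s.zipWithIndex.map { case (t, i) => (index * 1000000L + i, t) }
        }

        ids.persist()    // persist first, as the logs suggest...
        ids.checkpoint() // ...then mark the RDD for checkpointing
        ids.count()      // an action; the checkpoint files appear only after this runs
        sc.stop()
      }
    }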