Hey, I actually have 2 questions.

(1) I want to generate unique IDs for each RDD element, and I want to assign them in parallel, so I do:

    rdd.mapPartitionsWithIndex { (index, s) =>
      s.zipWithIndex.map { case (t, i) =>
        // ID = partition index times a global max-partition-size constant,
        // plus the element's position within the partition
        (index * GLOBAL.MAX_PARTITION_SIZE + i, t)
      }
    }

This works OK, but we noticed that unless we checkpoint the RDD, the IDs get scrambled whenever a partition is recomputed. Question 1: is there a better way to create unique IDs in a distributed way?
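For context, here is a minimal, self-contained sketch of the scheme. The constant, app name, and toy data are made up for illustration; GLOBAL.MAX_PARTITION_SIZE is our own config value, assumed here to be 1,000,000 (any bound larger than the biggest partition keeps the IDs unique):

    import org.apache.spark.{SparkConf, SparkContext}

    object UniqueIdSketch {
      // assumed stand-in for GLOBAL.MAX_PARTITION_SIZE
      val MaxPartitionSize = 1000000L

      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("unique-id-sketch").setMaster("local[2]"))
        // 4 toy elements split across 2 partitions: (a, b) and (c, d)
        val rdd = sc.parallelize(Seq("a", "b", "c", "d"), 2)
        val withIds = rdd.mapPartitionsWithIndex { (index, s) =>
          s.zipWithIndex.map { case (t, i) =>
            (index * MaxPartitionSize + i, t) // e.g. partition 1, element 0 -> 1000000
          }
        }
        withIds.collect().foreach(println) // (0,a) (1,b) (1000000,c) (1000001,d)
        sc.stop()
      }
    }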
(2) To solve the stability issue in (1), we did:

    rdd.persist().checkpoint()

The Spark logs suggested that checkpointed RDDs should be persisted. Should the persist come before or after the checkpoint?

OK, I lied, I have 3 questions.

(3) We are checkpointing to HDFS. We've noticed that sometimes the checkpointing works and we see /RDD-1 etc. written in HDFS, but other times we only see the checkpoint directory created, with no data inside it (see the sketch in the P.S. below for roughly how we wire this up). I suspect this is related to (2), but I'm not certain what is really happening.

Any pointers would be appreciated. I'm using AWS r3.4xlarge machines with Spark 0.9.2.

tks,
shay
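P.S. In case it helps with (2) and (3), here is roughly how we wire it up end to end. This is a sketch, not our real job: the HDFS paths are illustrative, and the final count() is an action I've added here because, as I understand it, the checkpoint data is only written once an action materializes the RDD:

    import org.apache.spark.{SparkConf, SparkContext}

    object CheckpointSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("checkpoint-sketch"))
        sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints") // hypothetical path

        val rdd = sc.textFile("hdfs:///data/input")          // hypothetical input
        val ids = rdd.mapPartitionsWithIndex { (index, s) =>
          s.zipWithIndex.map { case (t, i) => (index * 1000000L + i, t) }
        }

        ids.persist()    // persist first, as the logs suggest...
        ids.checkpoint() // ...then mark the RDD for checkpointing
        ids.count()      // an action; the checkpoint files appear only after this runs
        sc.stop()
      }
    }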