Good morning,

I have a typical iterator loop on a DataFrame loaded from a parquet data
source:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("Simple Application").setMaster("local")
val sc = new SparkContext(conf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val parquetDataFrame = sqlContext.read.parquet(parquetFilename.getAbsolutePath)
parquetDataFrame.foreachPartition { rowIterator =>
  rowIterator.foreach { row =>
    // ... do work
  }
}

My use case is quite simple: I would like to save a checkpoint during
processing so that, if the driver program fails, I can skip over the
already-processed records in the parquet file and continue from the
checkpoint. This would be analogous to storing the loop counter of a
standard C++/Java for loop.
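
To make that concrete, here is a rough sketch of what I imagine, assuming
row order is stable across runs (which is exactly what I am asking about
below). loadCheckpoint and saveCheckpoint are hypothetical helpers of mine
that persist a Long offset durably; since I run in local mode, everything
is one JVM and a simple file-based helper would do:

val checkpoint: Long = loadCheckpoint()    // index of the last row processed

parquetDataFrame.rdd
  .zipWithIndex()                          // attach a per-row index
  .filter { case (_, idx) => idx > checkpoint }  // skip what was already done
  .foreachPartition { it =>
    it.foreach { case (row, idx) =>
      // ... do work
      saveCheckpoint(idx)                  // record progress (batched in practice)
    }
  }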

My question is: are there any guarantees about the ordering of rows in the
"foreach" closure? Even if there are no guarantees in general (i.e., for a
DataFrame from an arbitrary source), are there any guarantees given that
this DataFrame is created from a parquet file?
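
If no such guarantee exists, I could imagine sidestepping the problem by
checkpointing a value rather than a position. A sketch, assuming my rows
carry a unique, sortable column (I'll call it "id" here, purely hypothetical
for illustration; loadCheckpoint is the same hypothetical helper as above):

val lastProcessedId: Long = loadCheckpoint()

parquetDataFrame
  .filter(parquetDataFrame("id") > lastProcessedId)  // skip finished rows by key
  .sort("id")                                        // impose an explicit order
  .foreachPartition { rowIterator =>
    rowIterator.foreach { row =>
      // ... do work, checkpointing row.getAs[Long]("id") as I go
    }
  }

That would make recovery depend on the data rather than on physical row
ordering, but of course it only works if such a key exists.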

Is it possible to implement my use case?

Thanks for your time.
