Good morning. I have a typical iterator loop over a DataFrame loaded from a Parquet data source:
```scala
val conf = new SparkConf().setAppName("Simple Application").setMaster("local")
val sc = new JavaSparkContext(conf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val parquetDataFrame = sqlContext.read.parquet(parquetFilename.getAbsolutePath)

parquetDataFrame.foreachPartition { rowIterator =>
  rowIterator.foreach { row =>
    // ... do work
  }
}
```

My use case is quite simple: I would like to save a checkpoint during processing, and if the driver program fails, skip over the initial records in the Parquet file and continue from the checkpoint. This would be analogous to storing the loop index in a standard C++/Java for loop.

My questions are:

- Are there any guarantees about the ordering of rows inside the `foreach` closure?
- Even if there are no guarantees in general (i.e., for a DataFrame from an arbitrary source), are there any guarantees given that the DataFrame is created from a Parquet file?
- Is it possible to implement my use case?

Thanks for your time.
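For concreteness, here is roughly the kind of resumable loop I have in mind. This is only a sketch: `loadCheckpoint` and `saveCheckpoint` are hypothetical helpers I would implement myself (e.g., persisting a `Long` to durable storage), and the whole approach assumes `zipWithIndex` assigns the same index to the same row on every run — which is exactly the ordering guarantee I am asking about:

```scala
// Hypothetical sketch of checkpoint-and-resume over the Parquet-backed DataFrame.
// loadCheckpoint/saveCheckpoint are placeholder helpers, not real Spark APIs.
val lastProcessed: Long = loadCheckpoint() // e.g., -1L if no checkpoint exists yet

parquetDataFrame.rdd
  .zipWithIndex()                              // attach a global index to each row
  .filter { case (_, idx) => idx > lastProcessed } // skip rows already processed
  .foreachPartition { iter =>
    iter.foreach { case (row, idx) =>
      // ... do work on row ...
      saveCheckpoint(idx)                      // record progress after each row
    }
  }
```

Checkpointing after every single row would obviously be expensive; batching the `saveCheckpoint` calls would be fine for my purposes. The open question is whether the index from `zipWithIndex` is stable across driver restarts when the source is a Parquet file.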