I am processing a single file and want to remove duplicate rows by some key, always keeping the first row that appears in the file for that key.
The best solution I could come up with is to zip each row with the partition index and local index, like this:

    rdd.mapPartitionsWithIndex { case (partitionIndex, rows) =>
      rows.zipWithIndex.map { case (row, localIndex) =>
        (row.key, ((partitionIndex, localIndex), row))
      }
    }

and then use reduceByKey with a min ordering on the (partitionIndex, localIndex) pair.

First, can I count on SparkContext.textFile to read the lines in order, so that the partition indexes are always increasing and the above works? And is there a better way to accomplish the same effect?

Thanks!

Philip
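
P.S. For concreteness, here is roughly what the whole pipeline looks like on my end. This is just a sketch: parseKey is a stand-in for however the key actually gets pulled out of a line, and the input/output paths are made up.

    import org.apache.spark.{SparkConf, SparkContext}

    object DedupKeepFirst {
      // Stand-in: however the key is actually extracted from a line.
      def parseKey(line: String): String = line.split(",", 2)(0)

      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("dedup-keep-first"))

        val tagged = sc.textFile("input.txt").mapPartitionsWithIndex { (partitionIndex, rows) =>
          // Tag each row with its position: (partition index, index within the partition).
          rows.zipWithIndex.map { case (row, localIndex) =>
            (parseKey(row), ((partitionIndex, localIndex), row))
          }
        }

        // Per key, keep the row whose (partitionIndex, localIndex) pair is smallest,
        // i.e. the earliest occurrence -- assuming partition order follows file order.
        val firstPerKey = tagged
          .reduceByKey((a, b) => if (Ordering[(Int, Int)].lteq(a._1, b._1)) a else b)
          .map { case (_, (_, row)) => row }

        firstPerKey.saveAsTextFile("deduped")
        sc.stop()
      }
    }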