Is it possible to retrieve a specific partition (e.g., the first partition) of a DataFrame and apply some function there? My data is too large, and I just want to get some approximate measures using the first few partitions in the data. I'll illustrate what I want to accomplish using the example below:
// create date val tmp = sc.parallelize(Seq( ("a", 1), ("b", 2), ("a", 1), ("b", 2), ("a", 1), ("b", 2)), 2).toDF("var", "value") // I want to get the first partition only, and do some calculation, for example, count by the value of "var" tmp1 = tmp.getPartition(0) tmp1.groupBy("var").count() The idea is not to go through all the data to save computational time. So I am not sure whether mapPartitionsWithIndex is helpful in this case, since it still maps all data. Regards, Wayne