Is it possible to retrieve a specific partition  (e.g., the first partition) of 
a DataFrame and apply some function there? My data is too large, and I just 
want to get some approximate measures using the first few partitions in the 
data. I'll illustrate what I want to accomplish using the example below:

// create date
val tmp = sc.parallelize(Seq( ("a", 1), ("b", 2), ("a", 1),
                                  ("b", 2), ("a", 1), ("b", 2)), 2).toDF("var", 
"value")
// I want to get the first partition only, and do some calculation, for 
example, count by the value of "var"
tmp1 = tmp.getPartition(0)
tmp1.groupBy("var").count()

The idea is not to go through all the data to save computational time. So I am 
not sure whether mapPartitionsWithIndex is helpful in this case, since it still 
maps all data.

Regards,
Wayne


Reply via email to