Hi,

I am a newbie to Spark. I looked for documentation or examples to answer my
question but came up empty-handed.

I don't know whether I am using the right terminology, but here goes.

I have a file of records. Initially, I had the following Spark program (I
am omitting all the surrounding code and focusing only on the Spark-related
code):

...
// Load the records, expand each into its combinations, and count occurrences.
val recordsRDD = sc.textFile(pathSpec, 2).cache()
val countsRDD: RDD[(String, Int)] = recordsRDD
  .flatMap(getCombinations)
  .map(e => (e, 1))
  .reduceByKey(_ + _)
...

Here getCombinations() is a function I have written that takes a record and
returns a List[String].
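
For concreteness, a simplified stand-in might look like this (the actual
combination logic does not matter for my question; here I just emit every
unordered pair of whitespace-separated tokens):

// Simplified stand-in for getCombinations(), only to make the example
// self-contained: emit every unordered pair of tokens as one string.
def getCombinations(record: String): List[String] =
  record.split("\\s+").toList.combinations(2).map(_.mkString(",")).toList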

This program works as expected.

Now I want to do the following: partition the records in recordsRDD by a
key extracted from each record. I do this as follows:

val keyValueRecordsRDD: RDD[(String, String)] =
  recordsRDD.map(getKeyValueRecord)

Here getKeyValueRecord() is a function I have written that takes a record
and returns a Tuple2 of a key and the original record.
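
For concreteness, a minimal stand-in might be (the actual key extraction
does not matter for my question; here the key is just the first
whitespace-separated field):

// Minimal stand-in for getKeyValueRecord(), only to make the example
// self-contained: the key is the first whitespace-separated field.
def getKeyValueRecord(record: String): (String, String) = {
  val key = record.split("\\s+").headOption.getOrElse("")
  (key, record)
}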

Now I want to do the same operations as before (getCombinations() and
counting occurrences), but on each partition as defined by the key.
Essentially, I want to apply the operations individually within each
partition. In a separate step, I want to recover the global counts across
all partitions while keeping the partition-based counts.

How can I do this in Spark?
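
To make the question concrete, here is a rough sketch of what I have in
mind, building on keyValueRecordsRDD and getCombinations() above. I am not
sure this is correct or idiomatic, which is really my question:

import org.apache.spark.rdd.RDD

// Per-key counts: make the key part of the aggregation key, so the
// combinations are counted separately within each key's group.
val perKeyCountsRDD: RDD[((String, String), Int)] =
  keyValueRecordsRDD
    .flatMap { case (key, record) =>
      getCombinations(record).map(comb => ((key, comb), 1))
    }
    .reduceByKey(_ + _)

// Global counts: drop the key and reduce again, while perKeyCountsRDD
// still holds the per-key counts.
val globalCountsRDD: RDD[(String, Int)] =
  perKeyCountsRDD
    .map { case ((_, comb), count) => (comb, count) }
    .reduceByKey(_ + _)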

Thanks!

arun
______________
Arun Swami
