Hi Marco,

In your case, since you don't need to perform an aggregation (such as
a sum or average) over each key, using groupByKey may perform better.
groupByKey internally uses CompactBuffer, which is more memory-efficient
than ArrayBuffer for small buffers.
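
A minimal sketch of what that could look like, assuming the same
(word_length, word) pairs as in your message. The local SparkContext
setup and the app name are assumptions for illustration only; in
spark-shell you would use the `sc` that is already provided.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: a local context for illustration; spark-shell already provides `sc`.
val conf = new SparkConf().setMaster("local[2]").setAppName("groupByKey-sketch")
val sc = SparkContext.getOrCreate(conf)

// The (word_length, word) pairs from the original message.
val kv = sc.parallelize(Seq(
  (2, "Hi"), (1, "i"), (2, "am"), (1, "a"), (4, "test"), (6, "string")))

// groupByKey collects every value for a key into one Iterable[String];
// mapValues(_.toList) converts it to the List[String] shape shown below.
// Sorting by key only makes the collected array easier to read.
val grouped: Array[(Int, List[String])] =
  kv.groupByKey().mapValues(_.toList).collect().sortBy(_._1)

grouped.foreach(println)
```

The order of the words inside each list is not guaranteed, so don't rely
on it.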

Thanks,
LIN Chen

Date: Tue, 5 Jan 2016 21:13:40 +0000
Subject: aggregateByKey vs combineByKey
From: mmistr...@gmail.com
To: user@spark.apache.org

Hi all,
 I have the following dataset:
 kv = [(2,Hi), (1,i), (2,am), (1,a), (4,test), (6,string)]

It's a simple list of tuples containing (word_length, word)

What I wanted to do was group the data by key, to get a result of
the form

[(word_length_1, [word1, word2, word3]), (word_length_2, [word4, word5, word6])]

So I browsed the Spark API and was able to get the result I wanted using two
different functions:
scala> kv.combineByKey(List(_), (x: List[String], y: String) => y :: x, (x: List[String], y: List[String]) => x ::: y).collect()
res86: Array[(Int, List[String])] = Array((1,List(i, a)), (2,List(Hi, am)), (4,List(test)), (6,List(string)))

and

scala> kv.aggregateByKey(List[String]())((acc, item) => item :: acc,
     |                    (acc1, acc2) => acc1 ::: acc2).collect()
res87: Array[(Int, List[String])] = Array((1,List(i, a)), (2,List(Hi, am)), (4,List(test)), (6,List(string)))

Now, the question is: are there any advantages to using one over the other?
Am I somehow misusing the API for what I want to do?

kind regards
 marco







                                          
