reduceByKey to get all associated values

2014-08-07 Thread Konstantin Kudryavtsev
Hi there,

I'd like to know whether it is possible to get the same behavior as the
reduce function in the MapReduce framework, i.e. for each key K, get the
list of associated values List<V>.

There is a reduceByKey function, but it works only with individual values
V, not the whole list. Is there any way to get the list? I need to sort it
in a particular way and apply some business logic.

Thank you in advance,
Konstantin Kudryavtsev


Re: reduceByKey to get all associated values

2014-08-07 Thread Cheng Lian
You may use groupByKey in this case.
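A minimal sketch of that (not from the original message; assumes sc is an
existing SparkContext and the data is a toy placeholder):

import org.apache.spark.rdd.RDD

val pairs: RDD[(String, String)] = sc.parallelize(Seq(
  ("k1", "b"), ("k1", "a"), ("k2", "c")))

// groupByKey gathers all values for each key into one Iterable,
// analogous to the Iterable<V> a MapReduce reducer receives.
val grouped: RDD[(String, Iterable[String])] = pairs.groupByKey()

// Sort each key's values, then apply the per-key business logic.
val sorted: RDD[(String, Seq[String])] = grouped.mapValues(_.toSeq.sorted)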

On Aug 7, 2014, at 9:18 PM, Konstantin Kudryavtsev kudryavtsev.konstan...@gmail.com wrote:

 Hi there,
 
 I'd like to know whether it is possible to get the same behavior as the
 reduce function in the MapReduce framework, i.e. for each key K, get the
 list of associated values List<V>.
 
 There is a reduceByKey function, but it works only with individual values
 V, not the whole list. Is there any way to get the list? I need to sort it
 in a particular way and apply some business logic.
 
 Thank you in advance,
 Konstantin Kudryavtsev





Re: reduceByKey to get all associated values

2014-08-07 Thread chutium
A long time ago, at Spark Summit 2013, Patrick Wendell said in his talk
about performance
(http://spark-summit.org/talk/wendell-understanding-the-performance-of-spark-applications/)
that reduceByKey is more efficient than groupByKey; he mentioned that
groupByKey copies all the data over the network.

Is that still true? Which one should we choose? Because we could actually
replace every groupByKey with reduceByKey.

For example, if we want to use groupByKey on an RDD[(String, String)] to
get an RDD[(String, Seq[String])], we can also do it with reduceByKey:
first, map the RDD[(String, String)] to an RDD[(String, Seq[String])];
then call reduceByKey(_ ++ _) on that RDD[(String, Seq[String])].
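A minimal sketch of that workaround (not from the original message; assumes
sc is an existing SparkContext and the data is a toy placeholder):

import org.apache.spark.rdd.RDD

val pairs: RDD[(String, String)] = sc.parallelize(Seq(
  ("k1", "b"), ("k1", "a"), ("k2", "c")))

// Wrap each value in a single-element Seq, then concatenate per key.
val grouped: RDD[(String, Seq[String])] =
  pairs.mapValues(v => Seq(v)).reduceByKey(_ ++ _)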







Re: reduceByKey to get all associated values

2014-08-07 Thread Cheng Lian
The point is that in many cases the operation passed to reduceByKey
aggregates data into a much smaller size, e.g. + or * on integers. Sequence
concatenation doesn’t actually “shrink” the data, so in your case
rdd.reduceByKey(_ ++ _) and rdd.groupByKey suffer from similar performance
issues. In general, don’t do either unless you have to.

And in Konstantin’s case, I guess he knows what he’s doing. At least we
can’t know whether we can help optimize until further information about the
“business logic” is provided.
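A sketch of the distinction (not from the original message; assumes sc is
an existing SparkContext):

val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))

// Concatenation: per-partition partial results grow with the data, so the
// shuffle moves roughly as many bytes as groupByKey would.
val concatenated = pairs.mapValues(Seq(_)).reduceByKey(_ ++ _)

// Sum: each partial result stays a single Int per key per partition, so
// map-side combining genuinely shrinks what gets shuffled.
val summed = pairs.reduceByKey(_ + _)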

On Aug 7, 2014, at 10:22 PM, chutium teng@gmail.com wrote:

 A long time ago, at Spark Summit 2013, Patrick Wendell said in his talk
 about performance
 (http://spark-summit.org/talk/wendell-understanding-the-performance-of-spark-applications/)
 that reduceByKey is more efficient than groupByKey; he mentioned that
 groupByKey copies all the data over the network.
 
 Is that still true? Which one should we choose? Because we could actually
 replace every groupByKey with reduceByKey.
 
 For example, if we want to use groupByKey on an RDD[(String, String)] to
 get an RDD[(String, Seq[String])], we can also do it with reduceByKey:
 first, map the RDD[(String, String)] to an RDD[(String, Seq[String])];
 then call reduceByKey(_ ++ _) on that RDD[(String, Seq[String])].
 
 
 
 
 





Re: reduceByKey to get all associated values

2014-08-07 Thread Evan R. Sparks
Specifically, reduceByKey expects a commutative and associative reduce
operation, and will automatically apply it locally within each partition
before the shuffle, which means it acts like a combiner in MapReduce terms -
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions
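For the original use case, a hedged sketch using aggregateByKey (also in
PairRDDFunctions), which keeps that combiner-style local aggregation while
building the per-key list explicitly. Note that collecting all values still
shuffles as much data as groupByKey; assumes sc is an existing SparkContext:

val pairs = sc.parallelize(Seq(("k", "b"), ("k", "a")))

val lists = pairs.aggregateByKey(List.empty[String])(
  (acc, v) => v :: acc,   // seqOp: fold a value into the partition-local list
  (l, r)   => l ::: r     // combOp: merge partition-local lists after the shuffle
).mapValues(_.sorted)     // sort per key, then apply the business logic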




On Thu, Aug 7, 2014 at 8:15 AM, Cheng Lian lian.cs@gmail.com wrote:

 The point is that in many cases the operation passed to reduceByKey
 aggregates data into a much smaller size, e.g. + or * on integers. Sequence
 concatenation doesn’t actually “shrink” the data, so in your case
 rdd.reduceByKey(_ ++ _) and rdd.groupByKey suffer from similar performance
 issues. In general, don’t do either unless you have to.

 And in Konstantin’s case, I guess he knows what he’s doing. At least we
 can’t know whether we can help optimize until further information about the
 “business logic” is provided.

 On Aug 7, 2014, at 10:22 PM, chutium teng@gmail.com wrote:

  A long time ago, at Spark Summit 2013, Patrick Wendell said in his talk
  about performance
  (http://spark-summit.org/talk/wendell-understanding-the-performance-of-spark-applications/)
  that reduceByKey is more efficient than groupByKey; he mentioned that
  groupByKey copies all the data over the network.
 
  Is that still true? Which one should we choose? Because we could actually
  replace every groupByKey with reduceByKey.
 
  For example, if we want to use groupByKey on an RDD[(String, String)] to
  get an RDD[(String, Seq[String])], we can also do it with reduceByKey:
  first, map the RDD[(String, String)] to an RDD[(String, Seq[String])];
  then call reduceByKey(_ ++ _) on that RDD[(String, Seq[String])].
 
 
 
 
 

