reduceByKey to get all associated values
Hi there, I'm interested in whether it is possible to get the same behavior as the reduce function from the MapReduce framework. I mean, for each key K, get the list of associated values List<V>. The reduceByKey function works only with individual values V, not the whole list. Is there any way to get the list? I have to sort it in a particular way and apply some business logic. Thank you in advance, Konstantin Kudryavtsev
Re: reduceByKey to get all associated values
You may use groupByKey in this case.

On Aug 7, 2014, at 9:18 PM, Konstantin Kudryavtsev kudryavtsev.konstan...@gmail.com wrote:

- To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
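For reference, a minimal sketch of the suggested approach. The semantics of groupByKey are shown here on a plain Scala collection (the names and sample data are illustrative); on an actual RDD the equivalent would be something like `pairs.groupByKey().mapValues(_.toSeq.sorted)`:

```scala
// Semantics of groupByKey, emulated on a local collection:
// collect every value associated with a key into one sequence,
// which can then be sorted and fed to custom business logic.
val pairs = Seq(("a", 3), ("b", 1), ("a", 1), ("a", 2))

// key -> all associated values, like the reduce input in MapReduce
val grouped: Map[String, Seq[Int]] =
  pairs.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2)) }

// sort each value list before applying business logic
val sortedValues: Map[String, Seq[Int]] =
  grouped.map { case (k, vs) => (k, vs.sorted) }

println(sortedValues("a")) // List(1, 2, 3)
```

Note that groupByKey materializes the full value list per key, which is exactly what the question asks for, but it means all values for a key must fit in memory on one executor.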
Re: reduceByKey to get all associated values
A long time ago, at Spark Summit 2013, Patrick Wendell said in his talk about performance (http://spark-summit.org/talk/wendell-understanding-the-performance-of-spark-applications/) that reduceByKey will be more efficient than groupByKey... he mentioned that groupByKey copies all data over the network. Is that still true? Which one should we choose? Because we can actually replace every groupByKey with reduceByKey. For example, if we want to use groupByKey on an RDD[(String, String)] to get an RDD[(String, Seq[String])], we can also do it with reduceByKey: first, map the RDD[(String, String)] to an RDD[(String, Seq[String])]; then call reduceByKey(_ ++ _) on that RDD[(String, Seq[String])]. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/reduceByKey-to-get-all-associated-values-tp11645p11652.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
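The two-step rewrite described above can be sketched as follows, with reduceByKey's per-key merging emulated on a local collection (on an RDD this would be roughly `rdd.mapValues(Seq(_)).reduceByKey(_ ++ _)`; the sample data is illustrative):

```scala
// An RDD[(String, String)] stand-in
val records = Seq(("x", "v1"), ("y", "v2"), ("x", "v3"))

// step 1: (String, String) -> (String, Seq[String]),
// wrapping each value in a one-element sequence
val wrapped = records.map { case (k, v) => (k, Seq(v)) }

// step 2: reduceByKey(_ ++ _), i.e. concatenate value lists per key
val reduced: Map[String, Seq[String]] =
  wrapped.foldLeft(Map.empty[String, Seq[String]]) {
    case (acc, (k, vs)) => acc.updated(k, acc.getOrElse(k, Seq.empty) ++ vs)
  }

println(reduced("x")) // List(v1, v3)
```

This produces the same result as groupByKey, which is exactly why (as the next reply points out) it offers no performance advantage here.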
Re: reduceByKey to get all associated values
The point is that in many cases the operation passed to reduceByKey aggregates the data into a much smaller size, say + and * for integers. String concatenation doesn't actually "shrink" the data, so in your case rdd.reduceByKey(_ ++ _) and rdd.groupByKey suffer from a similar performance issue. In general, don't do these unless you have to. And in Konstantin's case, I guess he knows what he's doing. At least we can't know whether we can help to optimize until further information about the "business logic" is provided.

On Aug 7, 2014, at 10:22 PM, chutium teng@gmail.com wrote:
Re: reduceByKey to get all associated values
Specifically, reduceByKey expects a commutative/associative reduce operation, and will automatically apply it locally before the shuffle, which means it acts like a combiner in MapReduce terms - http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions

On Thu, Aug 7, 2014 at 8:15 AM, Cheng Lian lian.cs@gmail.com wrote:
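A sketch of why that map-side combine matters when the reduce function genuinely shrinks the data (here +, summing per key). Partitions are simulated as plain Seqs, which is an illustrative simplification; the function must be associative and commutative for this to be correct:

```scala
// Two simulated partitions of (key, value) records
val partitions = Seq(
  Seq(("a", 1), ("a", 2), ("b", 5)),
  Seq(("a", 3), ("b", 5))
)

// Sum values per key within one batch of records
def sumByKey(kvs: Seq[(String, Int)]): Map[String, Int] =
  kvs.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).sum) }

// Map-side combine: each partition is pre-aggregated locally,
// so at most one (key, partial sum) per key leaves each partition
val combined: Seq[(String, Int)] = partitions.flatMap(p => sumByKey(p).toSeq)

// "Shuffle" then the final reduce, over the small combined records
val totals: Map[String, Int] = sumByKey(combined)

println(totals("a")) // 6
```

With groupByKey, every individual record would cross the "shuffle" boundary instead of one partial result per key per partition, which is the network-copy cost mentioned earlier in the thread.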