Re: Why are these operations slower than the equivalent on Hadoop?

2014-04-16 Thread Yanzhe Chen
Hi Eugen, Sorry if I haven't caught your point. In the second example:

    val result = data.mapPartitions(points => skyline(points.toArray).iterator)
                     .reduce { case (left, right) => skyline(left ++ right) }

In my understanding, if the data is of type RDD, then both left and right …
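To make the type question concrete, here is a sketch, assuming skyline has the signature Array[Array[Double]] => Array[Array[Double]] that the rest of the thread implies:

    // After mapPartitions, the local skyline points are flattened back into
    // the RDD one by one, so the element type is a single point:
    val partials: org.apache.spark.rdd.RDD[Array[Double]] =
      data.mapPartitions(points => skyline(points.toArray).iterator)
    // reduce has signature (T, T) => T with T = Array[Double]: left and right
    // are individual points, so left ++ right is an Array[Double], while
    // skyline expects an Array[Array[Double]]; the call does not type-check.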

Re: Why are these operations slower than the equivalent on Hadoop?

2014-04-16 Thread Cheng Lian
Hmm… The good part of reduce is that it performs local combining within a single partition automatically, but since you turned each partition into a single-value one, local combining is not applicable, and reduce simply degrades to a collect followed by a final skyline over all the collected partial results.
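A sketch of what the degraded plan is equivalent to (observable behaviour only, not Spark's internals):

    // With exactly one pre-computed skyline per partition, reduce's local
    // phase has nothing to combine, so the job behaves like collecting the
    // partial skylines and merging them on the driver:
    val partial: Array[Array[Array[Double]]] =
      data.mapPartitions(ps => Iterator.single(skyline(ps.toArray))).collect()
    val result: Array[Array[Double]] =
      partial.reduce((left, right) => skyline(left ++ right))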

Re: Why are these operations slower than the equivalent on Hadoop?

2014-04-15 Thread Eugen Cepoi
Yes, the second example does that. It transforms all the points of a partition into a single element, the skyline, thus reduce will run on the skylines of two partitions and not on single points.

On 16 Apr 2014 06:47, "Yanzhe Chen" wrote:
> Eugen,
>
> Thanks for your tip and I do want to merge the result of a partition with
> another one …
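Spelled out, the per-partition version Eugen describes might look like this sketch; Iterator.single is the detail that collapses a partition into one element:

    // Each partition becomes a single element: its local skyline.
    // reduce therefore merges whole skylines, never individual points.
    val result = data
      .mapPartitions(points => Iterator.single(skyline(points.toArray)))
      .reduce((left, right) => skyline(left ++ right))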

Re: Why are these operations slower than the equivalent on Hadoop?

2014-04-15 Thread Yanzhe Chen
Eugen, Thanks for your tip and I do want to merge the result of a partition with another one, but I am still not quite clear how to do it. Say the original data RDD has 32 partitions; since mapPartitions won't change the number of partitions, the result will still have 32 partitions, each of which contains …
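For intuition, a hypothetical local picture of the merge across those 32 partitions (the per-partition skylines below are stand-in values):

    // 32 stand-in partial skylines, one per partition; reduce folds them
    // pairwise into the global skyline.
    val partials: Seq[Array[Array[Double]]] =
      Seq.fill(32)(Array(Array(1.0, 2.0), Array(2.0, 1.0)))
    val global: Array[Array[Double]] =
      partials.reduce((left, right) => skyline(left ++ right))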

Re: Why are these operations slower than the equivalent on Hadoop?

2014-04-15 Thread Eugen Cepoi
It depends on your algorithm, but I guess you should probably use reduce (the code probably doesn't compile, but it shows you the idea):

    val result = data.reduce { case (left, right) => skyline(left ++ right) }

Or, in case you want to merge the result of a partition with another one, you can compute the skyline of each partition first and then reduce the partial skylines:

    val result = data.mapPartitions(points => skyline(points.toArray).iterator)
                     .reduce { case (left, right) => skyline(left ++ right) }
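If one wants the first one-liner to type-check on an RDD[Array[Double]], glom() is one option (a sketch, not from the original mail; glom packs each partition into a single array, so left and right become collections of points):

    // glom(): RDD[Array[Double]] becomes RDD[Array[Array[Double]]], one
    // array per partition; skyline runs locally per partition, then reduce
    // merges the partial skylines.
    val result = data.glom()
      .map(points => skyline(points))
      .reduce((left, right) => skyline(left ++ right))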

Re: Why are these operations slower than the equivalent on Hadoop?

2014-04-15 Thread Cheng Lian
Your Spark solution first reduces partial results into a single partition, computes the final result, and then collects it to the driver side. This involves a shuffle and two waves of network traffic. Instead, you can directly collect the partial results to the driver and then compute the final result there.
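That suggestion as a sketch (only the per-partition skyline points travel over the network, in a single wave):

    // Local skyline per partition, collect just those points, and finish
    // the merge on the driver: no shuffle, no single-partition bottleneck.
    val partial: Array[Array[Double]] =
      data.mapPartitions(points => skyline(points.toArray).iterator).collect()
    val result: Array[Array[Double]] = skyline(partial)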

Why are these operations slower than the equivalent on Hadoop?

2014-04-15 Thread Yanzhe Chen
Hi all, Following up on a previous thread, I am asking how to implement a divide-and-conquer algorithm (skyline) in Spark. Here is my current solution:

    val data = sc.textFile(…).map(line => line.split(",").map(_.toDouble))

    val result = data.mapPartitions(points => skyline(points.toArray).iterator)
                     .coalesce(1, true)
                     .mapPartitions(points => skyline(points.toArray).iterator)
                     .collect()
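The skyline helper itself never appears in the thread; for reference, a minimal sketch under an assumed dominance rule (minimize every coordinate) that matches the Array[Array[Double]] => Array[Array[Double]] signature used above:

    // Assumed helper, not from the original thread: keep each point that is
    // dominated by no other point (q dominates p if q is no worse in every
    // coordinate and strictly better in at least one, minimizing).
    def skyline(points: Array[Array[Double]]): Array[Array[Double]] =
      points.filter { p =>
        !points.exists { q =>
          q.zip(p).forall { case (a, b) => a <= b } &&
          q.zip(p).exists { case (a, b) => a < b }
        }
      }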