It would be even faster to load the data on the driver and sort it there without using Spark :). Using reduce() is cheating, because it only works as long as the data fits on one machine: the final merge materializes the entire sorted list in a single JVM. That is not the intended use case of a distributed computation system. You can repeat your test with more data (that doesn't fit on one machine) to see what I mean.
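For reference, the driver-side baseline I have in mind is just this (a sketch, assuming a local copy of /data.txt and the same sort key, column index 24, as your jobs):

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class DriverSort {
    public static void main(String[] args) throws Exception {
        // Assumes a local copy of the 100MB file; the point is that it
        // fits in one JVM's memory.
        List<String> lines =
            Files.readAllLines(Paths.get("/data.txt"), StandardCharsets.UTF_8);
        long start = System.currentTimeMillis();
        // Same sort key as the Spark versions: the 25th CSV field (index 24).
        Collections.sort(lines, new Comparator<String>() {
            @Override
            public int compare(String a, String b) {
                return Double.compare(Double.parseDouble(a.split(",")[24]),
                                      Double.parseDouble(b.split(",")[24]));
            }
        });
        long end = System.currentTimeMillis();
        System.out.println("Driver sort: " + (end - start));
    }
}

Everything, including the final sorted list, lives in one JVM, which is essentially what the reduce() in your second approach ends up doing anyway, just without the cluster round-trips.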
On Tue, Jun 9, 2015 at 8:30 AM, raggy <raghav0110...@gmail.com> wrote:
> For a research project, I tried sorting the elements in an RDD. I did this
> in two different ways.
>
> In the first approach, I applied a mapPartitions() function to the RDD so
> that it sorts the contents of each partition and returns the sorted list
> as that partition's only record. Then I applied a reduce function that
> merges sorted lists.
>
> I ran these experiments on an EC2 cluster containing 30 nodes, set up
> using the spark-ec2 script. The data file was stored in HDFS.
>
> In the second approach I used the sortBy method in Spark.
>
> I performed these operations on the US census data (100MB) found here.
>
> A single line looks like this:
>
> 9, Not in universe, 0, 0, Children, 0, Not in universe, Never married, Not
> in universe or children, Not in universe, White, All other, Female, Not in
> universe, Not in universe, Children or Armed Forces, 0, 0, 0, Nonfiler, Not
> in universe, Not in universe, Child <18 never marr not in subfamily, Child
> under 18 never married, 1758.14, Nonmover, Nonmover, Nonmover, Yes, Not in
> universe, 0, Both parents present, United-States, United-States,
> United-States, Native- Born in the United States, 0, Not in universe, 0, 0,
> 94, - 50000.
>
> I sorted based on the 25th value in the CSV; in this line that is 1758.14.
>
> I noticed that sortBy performs worse than the other method. Is this
> expected? If it is, why isn't the mapPartitions() plus reduce() approach
> the default way to sort?
>
> Here is my implementation:
>
> public static void sortBy(JavaSparkContext sc) {
>     JavaRDD<String> rdd = sc.textFile("/data.txt", 32);
>     long start = System.currentTimeMillis();
>     rdd.sortBy(new Function<String, Double>() {
>         @Override
>         public Double call(String v1) throws Exception {
>             String[] arr = v1.split(",");
>             return Double.parseDouble(arr[24]);
>         }
>     }, true, 9).collect();
>     long end = System.currentTimeMillis();
>     System.out.println("SortBy: " + (end - start));
> }
>
> public static void sortList(JavaSparkContext sc) {
>     JavaRDD<String> rdd = sc.textFile("/data.txt", 32);
>     long start = System.currentTimeMillis();
>     JavaRDD<LinkedList<Tuple2<Double, String>>> rdd3 =
>         rdd.mapPartitions(new FlatMapFunction<Iterator<String>,
>                 LinkedList<Tuple2<Double, String>>>() {
>             @Override
>             public Iterable<LinkedList<Tuple2<Double, String>>> call(
>                     Iterator<String> t) throws Exception {
>                 // Collect and sort all lines of this partition.
>                 LinkedList<Tuple2<Double, String>> lines =
>                     new LinkedList<Tuple2<Double, String>>();
>                 while (t.hasNext()) {
>                     String s = t.next();
>                     String[] arr1 = s.split(",");
>                     lines.add(new Tuple2<Double, String>(
>                         Double.parseDouble(arr1[24]), s));
>                 }
>                 Collections.sort(lines, new IncomeComparator());
>                 // Return the sorted list as the partition's only record.
>                 LinkedList<LinkedList<Tuple2<Double, String>>> list =
>                     new LinkedList<LinkedList<Tuple2<Double, String>>>();
>                 list.add(lines);
>                 return list;
>             }
>         });
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Different-Sorting-RDD-methods-in-Apache-Spark-tp23214.html
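By the way, the quoted code breaks off before the reduce step and never shows IncomeComparator. Going by the description above ("a reduce function which basically merges sorted lists"), the missing pieces presumably looked something like this sketch; the comparator and the merge logic here are my reconstruction, not the original code:

import java.io.Serializable;
import java.util.Comparator;
import java.util.Iterator;
import java.util.LinkedList;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function2;

import scala.Tuple2;

public class MergeStep {

    // Assumed shape of IncomeComparator: order records by the parsed
    // income field. Serializable so Spark can ship it to executors.
    public static class IncomeComparator
            implements Comparator<Tuple2<Double, String>>, Serializable {
        @Override
        public int compare(Tuple2<Double, String> a, Tuple2<Double, String> b) {
            return Double.compare(a._1(), b._1());
        }
    }

    // Merges the per-partition sorted lists pairwise into one sorted list.
    public static LinkedList<Tuple2<Double, String>> mergeAll(
            JavaRDD<LinkedList<Tuple2<Double, String>>> rdd3) {
        return rdd3.reduce(new Function2<LinkedList<Tuple2<Double, String>>,
                                         LinkedList<Tuple2<Double, String>>,
                                         LinkedList<Tuple2<Double, String>>>() {
            @Override
            public LinkedList<Tuple2<Double, String>> call(
                    LinkedList<Tuple2<Double, String>> a,
                    LinkedList<Tuple2<Double, String>> b) {
                // Standard two-way merge of two sorted lists.
                LinkedList<Tuple2<Double, String>> merged =
                        new LinkedList<Tuple2<Double, String>>();
                Iterator<Tuple2<Double, String>> ia = a.iterator();
                Iterator<Tuple2<Double, String>> ib = b.iterator();
                Tuple2<Double, String> x = ia.hasNext() ? ia.next() : null;
                Tuple2<Double, String> y = ib.hasNext() ? ib.next() : null;
                while (x != null && y != null) {
                    if (x._1() <= y._1()) {
                        merged.add(x);
                        x = ia.hasNext() ? ia.next() : null;
                    } else {
                        merged.add(y);
                        y = ib.hasNext() ? ib.next() : null;
                    }
                }
                while (x != null) { merged.add(x); x = ia.hasNext() ? ia.next() : null; }
                while (y != null) { merged.add(y); y = ib.hasNext() ? ib.next() : null; }
                return merged;
            }
        });
    }
}

Note that the pairwise merges funnel all records into a single final list, so the last merge has to hold the whole data set in memory. That is exactly why this approach cannot scale past one machine, while sortBy's shuffle-based sort can.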