At 2014-11-24 19:02:08 -0800, Harihar Nahak <hna...@wynyardgroup.com> wrote:
> According to documentation GraphX runs 10x faster than normal Spark. So I
> run Page Rank algorithm in both the applications:
> [...]
> Local Mode (Machine : 8 Core; 16 GB memory; 2.80 GHz Intel i7; Executor
> Memory: 4 GB, No. of Partition: 50; No. of Iterations: 2); ==>
>
> *Spark Page Rank took -> 21.29 mins
> GraphX Page Rank took -> 42.01 mins *
>
> Cluster Mode (Ubuntu 12.4; spark 1.1/hadoop 2.4 cluster ; 3 workers , 1
> driver , 8 cores, 30 GB memory) (Executor memory 4 GB; No. of edge partitions
> : 50, random vertex cut ; no. of iteration : 2) =>
>
> *Spark Page Rank took -> 10.54 mins
> GraphX Page Rank took -> 7.54 mins *
>
> Could you please help me to determine, when to use Spark and GraphX ? If
> GraphX took same amount of time than Spark then its better to use Spark
> because spark has variety of operators to deal with any type of RDD.

If you have a problem that's naturally expressible as a graph computation,
it makes sense to use GraphX in my opinion. In addition to the optimizations
that GraphX incorporates, which you would otherwise have to implement
manually, GraphX's programming model is likely a better fit. But even if you
start off by using pure Spark, you'll still have the flexibility to use
GraphX for other parts of the problem since it's part of the same system.

To address the benchmark results you got:

1. GraphX takes more time than Spark to load the graph, because it has to
   index it, but subsequent iterations should be faster. We benchmarked with
   20 iterations to show this effect, but you only used 2 iterations, which
   doesn't give much time to amortize the loading cost.

2. The benchmarks in the GraphX OSDI paper are against a naive
   implementation of PageRank in Spark, while the version you benchmarked
   against has some of the same optimizations as GraphX does. I believe we
   found that the optimized Spark PageRank was only 3x slower than GraphX.

3. When running those benchmarks, we used an experimental version of Spark
   with in-memory shuffle, which disproportionately benefits GraphX since
   its shuffle files are smaller due to specialized compression.

4. We haven't optimized GraphX for local mode, so it's not surprising that
   it's slower there.

Ankur
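
A rough way to picture the amortization argument in point 1 is a linear cost model: total time is a one-off load cost plus a fixed cost per iteration. The sketch below is only an illustration under that assumption. The function names and every number in it are hypothetical and are not measurements from this thread or from the GraphX paper.

```python
def total_time(load_cost, per_iteration_cost, iterations):
    """Total runtime under a simple linear model:
    a one-off load cost plus a constant cost for every iteration."""
    return load_cost + per_iteration_cost * iterations


def break_even_iterations(load_a, iter_a, load_b, iter_b):
    """Smallest whole number of iterations at which system A is strictly
    faster than system B, assuming integer costs (e.g. whole seconds).

    Returns None when A's iterations are not cheaper than B's, because
    this model then gives A no per-iteration saving to amortize its
    load cost with.
    """
    if iter_a >= iter_b:
        return None
    # Solve load_a + n * iter_a < load_b + n * iter_b for the smallest integer n.
    n = (load_a - load_b) // (iter_b - iter_a) + 1
    return max(0, n)


if __name__ == "__main__":
    # Hypothetical costs in seconds, chosen only to show the shape of the effect:
    # system A pays more up front (indexing) but has cheaper iterations.
    a_load, a_iter = 300, 20
    b_load, b_iter = 60, 60

    for n in (2, 20):
        print(n, total_time(a_load, a_iter, n), total_time(b_load, b_iter, n))
    print(break_even_iterations(a_load, a_iter, b_load, b_iter))
```

With numbers of this shape, the system with the heavier load step loses at 2 iterations and wins comfortably at 20, which is the effect described above. Real runs will not be this linear, so treat the model as a way to reason about where the crossover might sit, not as a prediction.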