Re: Is Spark? or GraphX runs fast? a performance comparison on Page Rank
Yes, I tried that too, using the pre-built Spark 1.1 release. If there are upcoming changes to the GraphX library, just let me know, or I can try again on Spark 1.2.

--Harihar

View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Is-Spark-or-GraphX-runs-fast-a-performance-comparison-on-Page-Rank-tp19710p20874.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
Re: Is Spark? or GraphX runs fast? a performance comparison on Page Rank
Did you try running PageRank.scala instead of LiveJournalPageRank.scala?
Re: Is Spark? or GraphX runs fast? a performance comparison on Page Rank
Thanks Ankur, that's really helpful. I have a few questions about optimization techniques. So far I have used the RandomVertexCut partition strategy. Which partition strategy should be used in each of these cases?

1. The edge list file has a very large number of edges, e.g. 50,000,000, with many duplicate edges between the same pair of vertices.
2. The number of unique vertices in that edge list file is very large, e.g. 10,000,000.
3. The number of unique vertices in that edge list file is small, e.g. fewer than 100,000.

--
Regards,
Harihar Nahak
BigData Developer
Wynyard
Email: hna...@wynyardgroup.com | Extn: 8019
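To make the duplicate-edge case in question 1 concrete: the GraphX documentation describes RandomVertexCut as hashing the source and destination vertex IDs, which colocates all same-direction edges between two vertices. The Python sketch below illustrates only that idea and is not GraphX code. The function names, the mixing constant, and the sample edges are made up for illustration, and GraphX's actual hash function differs.

```python
from collections import defaultdict


def edge_partition(src, dst, num_partitions):
    """Hypothetical stand-in for a partitioner that hashes the ordered (src, dst) pair."""
    # Any deterministic function of the ordered pair is enough for this illustration.
    return (src * 1000003 + dst) % num_partitions


def partition_edges(edges, num_partitions):
    """Group edges by the partition id computed from their (src, dst) pair."""
    parts = defaultdict(list)
    for src, dst in edges:
        parts[edge_partition(src, dst, num_partitions)].append((src, dst))
    return dict(parts)


# Made-up edge list with repeated edges between the same pairs of vertices.
edges = [(1, 2), (1, 2), (3, 4), (1, 2), (3, 4), (2, 1)]
parts = partition_edges(edges, num_partitions=50)

# Record which partitions each distinct ordered pair ended up in.
homes = defaultdict(set)
for pid, part in parts.items():
    for edge in part:
        homes[edge].add(pid)

for edge, pids in sorted(homes.items()):
    print(edge, "->", sorted(pids))
```

Every distinct ordered pair maps to exactly one partition, so duplicates of an edge sit together and can be merged locally. This says nothing about which strategy is fastest for the three cases above; that would need measuring.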
Re: Is Spark? or GraphX runs fast? a performance comparison on Page Rank
At 2014-11-24 19:02:08 -0800, Harihar Nahak wrote:
> According to documentation GraphX runs 10x faster than normal Spark. So I
> run Page Rank algorithm in both the applications:
> [...]
> Local Mode (Machine : 8 Core; 16 GB memory; 2.80 Ghz Intel i7; Executor
> Memory: 4Gb, No. of Partition: 50; No. of Iterations: 2); ==>
>
> Spark Page Rank took -> 21.29 mins
> GraphX Page Rank took -> 42.01 mins
>
> Cluster Mode (Ubuntu 12.4; spark 1.1/hadoop 2.4 cluster ; 3 workers , 1
> driver , 8 cores, 30 gb memory) (Executor memory 4gb; No. of edge
> partitions : 50, random vertex cut ; no. of iteration : 2) =>
>
> Spark Page Rank took -> 10.54 mins
> GraphX Page Rank took -> 7.54 mins
>
> Could you please help me to determine, when to use Spark and GraphX ? If
> GraphX took same amount of time than Spark then its better to use Spark
> because spark has variety of operators to deal with any type of RDD.

If you have a problem that's naturally expressible as a graph computation, it makes sense to use GraphX in my opinion. In addition to the optimizations that GraphX incorporates which you would otherwise have to implement manually, GraphX's programming model is likely a better fit. But even if you start off by using pure Spark, you'll still have the flexibility to use GraphX for other parts of the problem since it's part of the same system.

To address the benchmark results you got:

1. GraphX takes more time than Spark to load the graph, because it has to index it, but subsequent iterations should be faster. We benchmarked with 20 iterations to show this effect, but you only used 2 iterations, which doesn't give much time to amortize the loading cost.

2. The benchmarks in the GraphX OSDI paper are against a naive implementation of PageRank in Spark, while the version you benchmarked against has some of the same optimizations as GraphX does. I believe we found that the optimized Spark PageRank was only 3x slower than GraphX.

3. When running those benchmarks, we used an experimental version of Spark with in-memory shuffle, which disproportionately benefits GraphX since its shuffle files are smaller due to specialized compression.

4. We haven't optimized GraphX for local mode, so it's not surprising that it's slower there.

Ankur
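The amortization argument in point 1 can be put as simple arithmetic. The Python sketch below uses entirely hypothetical costs (none of these numbers were measured in this thread): system A pays a higher one-time load cost but has cheaper iterations, system B the reverse. It finds the smallest iteration count at which A's total time drops below B's.

```python
import math


def total_time(load_cost, per_iteration_cost, iterations):
    """Total run time: one-time load cost plus a fixed cost per iteration."""
    return load_cost + per_iteration_cost * iterations


def break_even_iterations(load_a, iter_a, load_b, iter_b):
    """Smallest iteration count n at which A is strictly faster than B.

    Returns None when A's iterations are not cheaper than B's, since A then
    never makes up a higher load cost.
    """
    if iter_a >= iter_b:
        return None
    # load_a + n * iter_a < load_b + n * iter_b
    #   <=>  n > (load_a - load_b) / (iter_b - iter_a)
    n = math.floor((load_a - load_b) / (iter_b - iter_a)) + 1
    return max(n, 0)


# Hypothetical costs in minutes, chosen only to illustrate the shape of the trade-off.
LOAD_A, ITER_A = 5.0, 0.5   # slow to load (e.g. builds an index), fast per iteration
LOAD_B, ITER_B = 1.0, 1.0   # fast to load, slow per iteration

for n in (2, 20):
    print(n, total_time(LOAD_A, ITER_A, n), total_time(LOAD_B, ITER_B, n))
print("break-even at", break_even_iterations(LOAD_A, ITER_A, LOAD_B, ITER_B), "iterations")
```

With these made-up costs, A is slower at 2 iterations and faster at 20, which is the pattern described above; the real crossover depends on the actual load and per-iteration times.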
Re: Is Spark? or GraphX runs fast? a performance comparison on Page Rank
Hi guys, has anyone experienced the same thing as above?

--Harihar
Is Spark? or GraphX runs fast? a performance comparison on Page Rank
Hi All,

I started exploring Spark two months ago. I'm looking for concrete differences between Spark and GraphX so that I can decide which to use, based on which gives the best performance. According to the documentation, GraphX runs 10x faster than normal Spark, so I ran the PageRank algorithm in both.

For Spark I used: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/SparkPageRank.scala
For GraphX I used: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/graphx/LiveJournalPageRank.scala
Input data: http://snap.stanford.edu/data/soc-LiveJournal1.html (1 GB in size)
No. of iterations: 2

Time taken:

Local mode (machine: 8 cores; 16 GB memory; 2.80 GHz Intel i7; executor memory: 4 GB; no. of partitions: 50; no. of iterations: 2):
  Spark PageRank took 21.29 mins
  GraphX PageRank took 42.01 mins

Cluster mode (Ubuntu 12.4; Spark 1.1/Hadoop 2.4 cluster; 3 workers, 1 driver, 8 cores, 30 GB memory; executor memory: 4 GB; no. of edge partitions: 50, random vertex cut; no. of iterations: 2):
  Spark PageRank took 10.54 mins
  GraphX PageRank took 7.54 mins

Could you please help me determine when to use Spark and when to use GraphX? If GraphX takes about the same amount of time as Spark, then it seems better to use Spark, because Spark has a variety of operators for dealing with any type of RDD.

Any suggestions, feedback, or pointers would be highly appreciated.

Thanks,
--Harihar
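For reference, the reported timings can be turned into speedup ratios. This small Python sketch assumes the figures are decimal minutes (the post does not say whether "21.29 mins" means that or 21 min 29 s); the dictionary layout and function name are just for illustration.

```python
# Timings as reported in the post, read as decimal minutes (an assumption).
timings = {
    "local": {"spark": 21.29, "graphx": 42.01},
    "cluster": {"spark": 10.54, "graphx": 7.54},
}


def graphx_speedup(mode):
    """Spark time divided by GraphX time; a value above 1.0 means GraphX was faster."""
    t = timings[mode]
    return t["spark"] / t["graphx"]


for mode in timings:
    print(f"{mode}: GraphX speedup over Spark = {graphx_speedup(mode):.2f}x")
```

Under that reading, GraphX ran at roughly 0.51x of Spark's speed in local mode (about twice as slow) and about 1.40x in cluster mode, well short of the 10x figure quoted from the documentation.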