Re: Is Spark? or GraphX runs fast? a performance comparison on Page Rank

2014-12-28 Thread Harihar Nahak
Yes, I tried that too. I used the pre-built Spark 1.1 release. If there are
upcoming changes to the GraphX library, just let me know, or I can try it on
Spark 1.2.

--Harihar



View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Is-Spark-or-GraphX-runs-fast-a-performance-comparison-on-Page-Rank-tp19710p20874.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Is Spark? or GraphX runs fast? a performance comparison on Page Rank

2014-12-22 Thread pradhandeep
Did you try running PageRank.scala instead of LiveJournalPageRank.scala?







Re: Is Spark? or GraphX runs fast? a performance comparison on Page Rank

2014-11-27 Thread Harihar Nahak
Thanks Ankur, that's really helpful. I have a few questions about optimization
techniques. For now I have used the RandomVertexCut partition strategy.

But which partition strategy should be used if:
1. The number of edges in the edge list file is very large, e.g. 50,000,000,
with many duplicate edges between the same pair of vertices
2. The number of unique vertices is very large, say 10,000,000 in the above
edge list file
3. The number of unique vertices is small, say fewer than 100,000 in the above
edge list file
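
For context on case 1, here is a rough Python sketch of the idea behind a random vertex cut: the partition is picked by hashing the edge's endpoints, so duplicate edges between the same pair of vertices end up together. This is a simplified illustration and not GraphX's actual hashing code. The function names are made up for the example, and the second function only mirrors the idea of GraphX's CanonicalRandomVertexCut strategy. It does not settle which strategy is best for the three cases above.

```python
def random_vertex_cut(src, dst, num_parts):
    # The partition depends only on the (src, dst) pair, so every copy of the
    # same directed edge lands in the same partition.
    return hash((src, dst)) % num_parts


def canonical_random_vertex_cut(src, dst, num_parts):
    # Order the endpoints first, so (a, b) and (b, a) are co-located as well.
    return hash((min(src, dst), max(src, dst))) % num_parts


# Hypothetical edge list with a duplicated edge and one reversed edge.
edges = [(1, 2), (1, 2), (2, 1), (3, 4)]
num_parts = 50

for src, dst in edges:
    print((src, dst),
          random_vertex_cut(src, dst, num_parts),
          canonical_random_vertex_cut(src, dst, num_parts))
```

Both copies of (1, 2) always share a partition. Only the canonical variant is guaranteed to also place (2, 1) with them.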








-- 
Regards,
Harihar Nahak
BigData Developer
Wynyard
Email:hna...@wynyardgroup.com | Extn: 8019





Re: Is Spark? or GraphX runs fast? a performance comparison on Page Rank

2014-11-26 Thread Ankur Dave
At 2014-11-24 19:02:08 -0800, Harihar Nahak wrote:
> According to the documentation, GraphX runs 10x faster than normal Spark. So I
> ran the PageRank algorithm in both:
> [...]
> Local mode (machine: 8 cores; 16 GB memory; 2.80 GHz Intel i7; executor
> memory: 4 GB; number of partitions: 50; number of iterations: 2) =>
>
> *Spark PageRank took -> 21.29 mins
> GraphX PageRank took -> 42.01 mins*
>
> Cluster mode (Ubuntu 12.04; Spark 1.1/Hadoop 2.4 cluster; 3 workers, 1
> driver, 8 cores, 30 GB memory) (executor memory: 4 GB; number of edge
> partitions: 50, random vertex cut; number of iterations: 2) =>
>
> *Spark PageRank took -> 10.54 mins
> GraphX PageRank took -> 7.54 mins*
>
> Could you please help me determine when to use Spark and when to use GraphX?
> If GraphX takes about the same amount of time as Spark, then it's better to
> use Spark, because Spark has a variety of operators to deal with any type of
> RDD.

If you have a problem that's naturally expressible as a graph computation, it 
makes sense to use GraphX in my opinion. In addition to the optimizations that 
GraphX incorporates which you would otherwise have to implement manually, 
GraphX's programming model is likely a better fit. But even if you start off by 
using pure Spark, you'll still have the flexibility to use GraphX for other 
parts of the problem since it's part of the same system.

To address the benchmark results you got:

1. GraphX takes more time than Spark to load the graph, because it has to index 
it, but subsequent iterations should be faster. We benchmarked with 20 
iterations to show this effect, but you only used 2 iterations, which doesn't 
give much time to amortize the loading cost.

2. The benchmarks in the GraphX OSDI paper are against a naive implementation 
of PageRank in Spark, while the version you benchmarked against has some of the 
same optimizations as GraphX does. I believe we found that the optimized Spark 
PageRank was only 3x slower than GraphX.

3. When running those benchmarks, we used an experimental version of Spark with 
in-memory shuffle, which disproportionately benefits GraphX since its shuffle 
files are smaller due to specialized compression.

4. We haven't optimized GraphX for local mode, so it's not surprising that it's 
slower there.
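
The amortization argument in point 1 can be sketched with a simple cost model: a one-time loading cost plus a fixed cost per iteration. The numbers below are made up purely for illustration and are not measurements from this thread or from the GraphX benchmarks. The function names are likewise hypothetical.

```python
import math


def total_time(load_cost, per_iteration_cost, iterations):
    # One-time loading/indexing cost plus a fixed cost per iteration.
    return load_cost + per_iteration_cost * iterations


def break_even_iterations(load_a, iter_a, load_b, iter_b):
    # Smallest iteration count at which system B is strictly faster than
    # system A, or None if B's iterations are not cheaper than A's
    # (no crossover is modelled in that case).
    if iter_b >= iter_a:
        return None
    extra_load = load_b - load_a
    if extra_load < 0:
        return 0  # B is already ahead before the first iteration
    return math.floor(extra_load / (iter_a - iter_b)) + 1


# Made-up costs in minutes: "plain" loads quickly but iterates slowly,
# "indexed" pays more up front (indexing) but iterates quickly.
plain = {"load": 2.0, "per_iter": 4.0}
indexed = {"load": 8.0, "per_iter": 1.0}

for n in (2, 20):
    a = total_time(plain["load"], plain["per_iter"], n)
    b = total_time(indexed["load"], indexed["per_iter"], n)
    print(f"{n} iterations: plain={a} indexed={b}")

print("break-even at",
      break_even_iterations(plain["load"], plain["per_iter"],
                            indexed["load"], indexed["per_iter"]),
      "iterations")
```

With these invented costs the two systems tie at 2 iterations and the indexed one pulls well ahead by 20, which is the shape of the effect described in point 1.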

Ankur




Re: Is Spark? or GraphX runs fast? a performance comparison on Page Rank

2014-11-26 Thread Harihar Nahak
Hi guys,

Has anyone else experienced the same thing as above?



-
--Harihar
--




Is Spark? or GraphX runs fast? a performance comparison on Page Rank

2014-11-24 Thread Harihar Nahak
Hi All,

I started exploring Spark two months ago. I'm looking for concrete
differences between Spark and GraphX so that I can decide which one to use,
based on which gives the best performance.

According to the documentation, GraphX runs 10x faster than normal Spark. So I
ran the PageRank algorithm in both:
For Spark I used:
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/SparkPageRank.scala
For GraphX I used:
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/graphx/LiveJournalPageRank.scala

Input data: http://snap.stanford.edu/data/soc-LiveJournal1.html (1 GB in
size)
Number of iterations: 2
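
For readers without the linked examples open, the per-iteration update used by the SparkPageRank.scala example (each vertex splits its rank evenly over its out-links, then new rank = 0.15 + 0.85 * sum of received contributions) can be sketched in plain Python. This is a minimal single-machine sketch, not the Spark or GraphX code. The function name `pagerank` and the tiny four-edge graph are made up for illustration.

```python
from collections import defaultdict


def pagerank(edges, iterations):
    # Adjacency list: source vertex -> list of destination vertices.
    links = defaultdict(list)
    for src, dst in edges:
        links[src].append(dst)

    # Every vertex with outgoing edges starts with rank 1.0.
    ranks = {src: 1.0 for src in links}

    for _ in range(iterations):
        contribs = defaultdict(float)
        for src, dsts in links.items():
            if src not in ranks:
                continue  # vertex received no contributions in the last round
            share = ranks[src] / len(dsts)
            for dst in dsts:
                contribs[dst] += share
        # Damped update, as in the linked Spark example.
        ranks = {v: 0.15 + 0.85 * total for v, total in contribs.items()}
    return ranks


# Hypothetical toy graph, two iterations as in the runs described here.
edges = [(1, 2), (1, 3), (2, 3), (3, 1)]
ranks = pagerank(edges, iterations=2)
for vertex in sorted(ranks):
    print(vertex, round(ranks[vertex], 5))
```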

*Time taken:*

Local mode (machine: 8 cores; 16 GB memory; 2.80 GHz Intel i7; executor
memory: 4 GB; number of partitions: 50; number of iterations: 2) =>

*Spark PageRank took -> 21.29 mins
GraphX PageRank took -> 42.01 mins*

Cluster mode (Ubuntu 12.04; Spark 1.1/Hadoop 2.4 cluster; 3 workers, 1
driver, 8 cores, 30 GB memory) (executor memory: 4 GB; number of edge
partitions: 50, random vertex cut; number of iterations: 2) =>

*Spark PageRank took -> 10.54 mins
GraphX PageRank took -> 7.54 mins*
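
To put the four timings on one scale, a small Python check of the ratios follows. It assumes the reported times are decimal minutes (they could also be read as minutes and seconds). The dictionary layout and the function name are only for illustration.

```python
# Wall-clock times as reported above, read as decimal minutes (an assumption).
timings = {
    "local": {"spark": 21.29, "graphx": 42.01},
    "cluster": {"spark": 10.54, "graphx": 7.54},
}


def graphx_speedup(mode):
    # A value above 1.0 means GraphX finished faster than the plain Spark job.
    t = timings[mode]
    return t["spark"] / t["graphx"]


for mode in timings:
    print(f"{mode}: GraphX speedup over Spark = {graphx_speedup(mode):.2f}x")
```

Under that reading, GraphX took roughly twice as long as Spark in local mode and was about 1.4 times faster on the cluster.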


Could you please help me determine when to use Spark and when to use GraphX?
If GraphX takes about the same amount of time as Spark, then it's better to
use Spark, because Spark has a variety of operators to deal with any type of
RDD.

Any suggestions, feedback, or pointers would be highly appreciated.

Thanks,


 



-
--Harihar
--
