Hi Ankur, hi Deb,

Thanks for the information and for the reference to the recent paper. I understand that GraphLab is highly optimized for graph algorithms and consistently outperforms GraphX on graph-related tasks. I'd like to further evaluate the cost of moving data between Spark and another graph processing framework (e.g. GraphLab). The paper touches on this briefly, citing serialization, replication, and disk I/O as the main factors.
Do you have any suggestions on how to further investigate the impact of these factors? For example, I suppose the impact of replication depends on cluster size and HDFS configuration.

Your help is greatly appreciated.

Best,
Niko

On Mon, Mar 24, 2014 at 8:35 PM, Debasish Das <debasish.da...@gmail.com> wrote:

> Hi Ankur,
>
> Given enough memory and proper caching, I don't understand why this is the
> case:
>
> GraphX may actually be slower when Spark is configured to launch many
> tasks per machine, because shuffle communication between Spark tasks on the
> same machine still occurs by reading and writing from disk, while GraphLab
> uses shared memory for same-machine communication.
>
> Could you please elaborate on it?
>
> Thanks.
> Deb
>
> On Mon, Mar 24, 2014 at 1:01 PM, Ankur Dave <ankurd...@gmail.com> wrote:
>
>> Hi Niko,
>>
>> The GraphX team recently wrote a longer paper with more benchmarks and
>> optimizations: http://arxiv.org/abs/1402.2394
>>
>> Regarding the performance of GraphX vs. GraphLab, I believe GraphX
>> currently outperforms GraphLab only in end-to-end benchmarks of pipelines
>> involving both graph-parallel operations (e.g. PageRank) and data-parallel
>> operations (e.g. ETL and data cleaning). This is due to the overhead of
>> moving data between GraphLab and a data-parallel system like Spark. There's
>> an example of a pipeline in Section 5.2 in the linked paper, and the
>> results are in Figure 10 on page 11.
>>
>> GraphX has a very similar architecture to GraphLab, so I wouldn't expect
>> it to have better performance on pure graph algorithms. GraphX may actually
>> be slower when Spark is configured to launch many tasks per machine,
>> because shuffle communication between Spark tasks on the same machine still
>> occurs by reading and writing from disk, while GraphLab uses shared memory
>> for same-machine communication.
>>
>> I've CC'd Joey and Reynold as well.
>>
>> Ankur <http://www.ankurdave.com/>
>>
>> On Mar 24, 2014 11:00 AM, "Niko Stahl" <r.niko.st...@gmail.com> wrote:
>>
>>> I'm interested in extending the comparison between GraphX and GraphLab
>>> presented in Xin et al. (2013). The evaluation presented there is rather
>>> limited, as it only compares the frameworks for one algorithm (PageRank) on
>>> a cluster with a fixed number of nodes. Are there any graph algorithms
>>> where one might expect GraphX to perform better than GraphLab? Do you
>>> expect the scaling properties (i.e. performance as a function of the
>>> number of worker nodes) to differ?