Re: Comparing GraphX and GraphLab

Niko Stahl Mon, 24 Mar 2014 16:25:19 -0700

Hi Ankur, hi Deb,

Thanks for the information and for the reference to the recent paper. I
understand that GraphLab is highly optimized for graph algorithms and
consistently outperforms GraphX for graph-related tasks. I'd like to
further evaluate the cost of moving data between Spark and some other graph
processing framework (e.g. GraphLab). The paper touches on this briefly,
citing serialization, replication, and disk I/O as the main factors.


Do you have any suggestions on how to further investigate the impact of
these factors? For example, I suppose the impact of replication depends on
cluster size and HDFS configuration. Your help is greatly appreciated.

Best,
Niko


On Mon, Mar 24, 2014 at 8:35 PM, Debasish Das <debasish.da...@gmail.com> wrote:

> Hi Ankur,
>
> Given enough memory and proper caching, I don't understand why this is the
> case:
>
> GraphX may actually be slower when Spark is configured to launch many
> tasks per machine, because shuffle communication between Spark tasks on the
> same machine still occurs by reading and writing from disk, while GraphLab
> uses shared memory for same-machine communication.
>
> Could you please elaborate on this?
>
> Thanks.
> Deb
>
>
>
> On Mon, Mar 24, 2014 at 1:01 PM, Ankur Dave <ankurd...@gmail.com> wrote:
>
>> Hi Niko,
>>
>> The GraphX team recently wrote a longer paper with more benchmarks and
>> optimizations: http://arxiv.org/abs/1402.2394
>>
>> Regarding the performance of GraphX vs. GraphLab, I believe GraphX
>> currently outperforms GraphLab only in end-to-end benchmarks of pipelines
>> involving both graph-parallel operations (e.g. PageRank) and data-parallel
>> operations (e.g. ETL and data cleaning). This is due to the overhead of
>> moving data between GraphLab and a data-parallel system like Spark. There's
>> an example of a pipeline in Section 5.2 in the linked paper, and the
>> results are in Figure 10 on page 11.
>>
>> GraphX has an architecture very similar to GraphLab's, so I wouldn't expect
>> it to have better performance on pure graph algorithms. GraphX may actually
>> be slower when Spark is configured to launch many tasks per machine,
>> because shuffle communication between Spark tasks on the same machine still
>> occurs by reading and writing from disk, while GraphLab uses shared memory
>> for same-machine communication.
>>
>> I've CC'd Joey and Reynold as well.
>>
>> Ankur <http://www.ankurdave.com/>
>>
>> On Mar 24, 2014 11:00 AM, "Niko Stahl" <r.niko.st...@gmail.com> wrote:
>>
>>> I'm interested in extending the comparison between GraphX and GraphLab
>>> presented in Xin et al. (2013). The evaluation presented there is rather
>>> limited as it only compares the frameworks for one algorithm (PageRank) on
>>> a cluster with a fixed number of nodes. Are there any graph algorithms
>>> where one might expect GraphX to perform better than GraphLab? Do you
>>> expect the scaling properties (i.e. performance as a function of the number
>>> of worker nodes) to differ?
>>>
>>
>
