Hi all, I have been testing GraphX on the soc-LiveJournal1 network from the SNAP repository. I am currently running on c3.8xlarge EC2 instances on Amazon, which have 32 cores and 60GB RAM per node. So far I have run SSSP, PageRank, and WCC on 1-, 4-, and 8-node clusters.
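For reference, my driver is essentially a minimal sketch like the following (the input path is a placeholder, and the partition strategy shown is just one illustrative choice, not necessarily what I ran):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{GraphLoader, PartitionStrategy}

// Illustrative benchmark driver: load soc-LiveJournal1 as an edge list,
// repartition the edges, and run 20 PageRank iterations.
val conf = new SparkConf()
  .setAppName("GraphXPageRankBenchmark")
  .set("spark.executor.memory", "40g") // per-executor heap, as in my 4-node runs
  .set("spark.cores.max", "128")       // 4 nodes x 32 cores
val sc = new SparkContext(conf)

// 64 edge partitions, as in my runs; the HDFS path is a placeholder.
val graph = GraphLoader
  .edgeListFile(sc, "hdfs:///data/soc-LiveJournal1.txt", numEdgePartitions = 64)
  .partitionBy(PartitionStrategy.EdgePartition2D) // illustrative strategy

// Fixed-iteration PageRank from graphx.lib.PageRank (via GraphOps).
val ranks = graph.staticPageRank(numIter = 20).vertices
```

This is a configuration-dependent sketch that needs a running Spark cluster; I include it only so the setup I am describing below is unambiguous.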
The issues I am having, which are present for all three algorithms, are that (1) GraphX does not improve between 4 and 8 nodes, and (2) the computation seems to be heavily unbalanced, with some machines doing the majority of the work.

PageRank (20 iterations) is the worst case. For 1-node, 4-node, and 8-node clusters I get the following wall-clock runtimes: 192s, 154s, and 154s. The lack of scaling is potentially understandable, but these times are significantly worse than the results in the paper https://amplab.cs.berkeley.edu/wp-content/uploads/2014/02/graphx.pdf, where this algorithm ran in ~75s on a less powerful cluster.

My main concern is the load imbalance. I measured the CPU time of all the processes associated with GraphX during its execution; on the 4-node cluster this yielded the following per-machine CPU times: 724s, 697s, 2216s, 694s. Is this normal? Should I expect a more even distribution of work across machines?

I am using the stock PageRank code found here: https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/lib/PageRank.scala. I use the configurations "spark.executor.memory=40g" and "spark.cores.max=128" for the 4-node case, and I set the number of edge partitions to 64.

Could you please let me know whether these results are reasonable, or whether there is a better way to ensure the computation is evenly distributed among the nodes in the cluster? I really appreciate the help.

Thanks,
Steve