Hi all, I have been testing GraphX on the soc-LiveJournal1 network from the SNAP repository. I am currently running on c3.8xlarge EC2 instances on Amazon, which have 32 cores and 60GB RAM per node. So far I have run SSSP, PageRank, and WCC on 1-, 4-, and 8-node clusters.
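For reference, my driver is essentially a minimal sketch like the following (the input path is a placeholder, and the partition strategy shown is just one illustrative choice, not necessarily what I ran):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{GraphLoader, PartitionStrategy}

// Illustrative benchmark driver: load soc-LiveJournal1 as an edge list,
// repartition the edges, and run 20 PageRank iterations.
val conf = new SparkConf()
  .setAppName("GraphXPageRankBenchmark")
  .set("spark.executor.memory", "40g") // per-executor heap, as in my 4-node runs
  .set("spark.cores.max", "128")       // 4 nodes x 32 cores
val sc = new SparkContext(conf)

// 64 edge partitions, as in my runs; the HDFS path is a placeholder.
val graph = GraphLoader
  .edgeListFile(sc, "hdfs:///data/soc-LiveJournal1.txt", numEdgePartitions = 64)
  .partitionBy(PartitionStrategy.EdgePartition2D) // illustrative strategy

// Fixed-iteration PageRank from graphx.lib.PageRank (via GraphOps).
val ranks = graph.staticPageRank(numIter = 20).vertices
```

This is a configuration-dependent sketch that needs a running Spark cluster; I include it only so the setup I am describing below is unambiguous.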
The issues I am having, which are present for all three algorithms, are that (1) GraphX does not improve between 4 and 8 nodes, and (2) the computation seems to be heavily unbalanced, with some machines doing the majority of the work.

PageRank (20 iterations) is the worst case. For 1-node, 4-node, and 8-node clusters I get the following wall-clock runtimes: 192s, 154s, and 154s. The lack of scaling is potentially understandable, but these times are significantly worse than the results in the paper https://amplab.cs.berkeley.edu/wp-content/uploads/2014/02/graphx.pdf, where this algorithm ran in ~75s on a less powerful cluster.

My main concern is the load imbalance. I measured the CPU time of all the processes associated with GraphX during its execution; on the 4-node cluster this yielded the following per-machine CPU times: 724s, 697s, 2216s, 694s. Is this normal? Should I expect a more even distribution of work across machines?

I am using the stock PageRank code found here: https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/lib/PageRank.scala. I use the configurations "spark.executor.memory=40g" and "spark.cores.max=128" for the 4-node case, and I set the number of edge partitions to 64.

Could you please let me know whether these results are reasonable, or whether there is a better way to ensure the computation is evenly distributed among the nodes in the cluster? I really appreciate the help.

Thanks,
Steve