Greetings,
We are looking into using the GraphX connected-components algorithm on Hadoop 
for grouping operations.  Our typical data is on the order of 50-200M vertices 
with an edge:vertex ratio between 2 and 30.  While there are pathological cases 
of very large groups, most groups tend to be small.  I am trying to get a handle on 
the level of performance and scaling we should expect, and how to best 
configure GraphX/Spark to get there.  After some experimentation, we cannot get to 100M 
vertices/edges without running out of memory on a small cluster (8 nodes, each with 4 
cores and 8GB available for YARN).  This limit seems low: the cluster has 8 x 8GB = 64GB, 
and 64GB / 100M is 640 bytes per vertex, which should be enough.  Is this within 
reason?  Does anyone have a sample they can share that has the right 
configurations for succeeding with this size of data and cluster?  What level 
of performance should we expect?  And what happens when the data set exceeds memory: 
does it spill to disk "nicely" or degrade catastrophically?
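
For context, the job is essentially just the stock GraphX connectedComponents call on an 
edge list.  Here is a minimal sketch of that kind of job (the input/output paths, 
partition count, and storage levels are placeholders, not our exact settings):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.GraphLoader
import org.apache.spark.storage.StorageLevel

object GroupingByConnectedComponents {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("graphx-cc-grouping"))

    // Load the edge list.  The partition count and storage levels below are
    // placeholders; MEMORY_AND_DISK is shown only because spill behavior is
    // part of the question.
    val graph = GraphLoader.edgeListFile(
      sc,
      "hdfs:///placeholder/edges",
      numEdgePartitions = 128,
      edgeStorageLevel = StorageLevel.MEMORY_AND_DISK,
      vertexStorageLevel = StorageLevel.MEMORY_AND_DISK)

    // connectedComponents labels each vertex with the smallest vertex id in
    // its component; that label is what we use as the group key.
    val groups = graph.connectedComponents().vertices

    groups.saveAsTextFile("hdfs:///placeholder/groups")
    sc.stop()
  }
}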

Thanks,
John Lilley
