Hi Spark Devs,

  As part of a project at work, I have written a graph generator for RMAT 
graphs consistent with the specifications in the Graph 500 benchmark 
(http://www.graph500.org/specifications). We had originally planned to use the 
rmatGenerator function in GraphGenerators, but found that it wasn't suitable 
for generating graphs with billions of edges: the edges are generated in a 
single thread and stored in a Set, so it can't generate a graph larger than 
the memory of a single JVM (and I think Sets are limited to Int.MaxValue 
elements anyway).

  The generator I have is essentially a more scalable version of rmatGenerator. 
We have used it to generate a graph with 2^32 vertices and 2^36 edges on our 
modestly-specced cluster of 16 machines. It seems like other people interested 
in Spark might want to play with some large RMAT graphs (or run the Graph 500 
benchmark), so I would like to contribute my generator. It does have some minor 
differences from the current generator, though:

  1.  Vertex IDs are shuffled after the graph structure is generated, so the 
degree of a vertex cannot be predicted from its ID (without this step vertex 0 
would always have the largest degree, followed by vertices 1, 2, 4, 8, etc.). 
This is per the Graph 500 spec. It could easily be made optional.
  2.  Duplicate edges are not removed from the resulting graph. This could 
easily be done with a call to distinct() on the resulting edge list, but then 
the graph would have slightly fewer edges than one generated by the current 
rmatGenerator. It would also be very slow on large graphs due to skew.
  3.  The out-degree is not set as the vertex attribute. Again, this would be 
simple to add, but it could be slow on the super vertices.
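
  To make the general idea concrete, here is a minimal sketch in plain Scala. 
It is not the generator described above, only an illustration under stated 
assumptions. Each partition draws its own share of edges lazily from a seeded 
RNG instead of collecting them into one Set, and vertex IDs are then relabeled. 
The names RmatSketch, rmatEdge, partitionEdges and scramble are hypothetical, 
as is the seed-plus-partition-ID seeding scheme. The relabeling uses a simple 
odd-multiplier bijection as a stand-in; it is not the random permutation the 
Graph 500 spec calls for. The initiator probabilities 0.57/0.19/0.19/0.05 are 
the Graph 500 values.

```scala
import scala.util.Random

object RmatSketch {
  // Graph 500 initiator probabilities; D = 1 - (A + B + C) = 0.05.
  val A = 0.57
  val B = 0.19
  val C = 0.19

  /** Draw one R-MAT edge by choosing an adjacency-matrix quadrant per bit. */
  def rmatEdge(scale: Int, rng: Random): (Long, Long) = {
    var src = 0L
    var dst = 0L
    var bit = 0
    while (bit < scale) {
      val r = rng.nextDouble()
      // A = top-left, B = top-right, C = bottom-left, D = bottom-right.
      val srcBit = if (r >= A + B) 1L else 0L
      val dstBit = if ((r >= A && r < A + B) || r >= A + B + C) 1L else 0L
      src = (src << 1) | srcBit
      dst = (dst << 1) | dstBit
      bit += 1
    }
    (src, dst)
  }

  /** Lazily produce one partition's edges; duplicates are kept, nothing is buffered. */
  def partitionEdges(partitionId: Int, edgesPerPartition: Long,
                     scale: Int, seed: Long): Iterator[(Long, Long)] = {
    // Assumption: a distinct RNG stream per partition, derived from seed + partitionId.
    val rng = new Random(seed + partitionId)
    new Iterator[(Long, Long)] {
      private var produced = 0L
      def hasNext: Boolean = produced < edgesPerPartition
      def next(): (Long, Long) = { produced += 1; rmatEdge(scale, rng) }
    }
  }

  /** Relabel a vertex ID with a bijection on [0, 2^scale); a stand-in for a true shuffle. */
  def scramble(v: Long, scale: Int, multiplier: Long, offset: Long): Long = {
    val mask = (1L << scale) - 1
    // An odd multiplier is invertible modulo 2^scale, so no two IDs collide.
    (v * (multiplier | 1L) + offset) & mask
  }
}
```

  In Spark, something along the lines of 
sc.parallelize(0 until numPartitions, numPartitions).flatMap(p => 
RmatSketch.partitionEdges(p, edgesPerPartition, scale, seed)) could drive this 
per partition, with scramble applied to both endpoints in a following map. 
numPartitions, edgesPerPartition, scale and seed are placeholders here.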

  My question for the Spark Devs is: Is this something you would want as part 
of GraphX (either as a replacement for the current rmatGenerator or a separate 
function in GraphGenerators)? Since it was developed at work I need to go 
through our legal department and QA processes to open-source it, and to fill 
out the paperwork I need to know whether I'll be submitting a pull request or 
standing it up as a separate project on GitHub.

Thanks!

-Ryan

--
J. Ryan Carr, Ph. D.

The Johns Hopkins University, Applied Physics Laboratory
11100 Johns Hopkins Rd., Laurel, MD 20723
Office: 240-228-9157
Cell: 443-744-1004
Email: ryan.c...@jhuapl.edu<mailto:ryan.c...@jhuapl.edu> or 
james.c...@jhuapl.edu<mailto:james.c...@jhuapl.edu>
