There are a number real (medium sized) graphs at http://snap.stanford.edu/data/index.html which we use for similar benchmarks. It has a good mix of graph types, sparse/dense, ground truth graphs (e.g. social networks that follow power law distribution etc.). So far we have observed that the type of graph has a high impact on the performance of algorithms that Claudio mentioned.
On Wed, Oct 2, 2013 at 8:22 AM, Claudio Martella <[email protected] > wrote: > Hi Wei, > > it depends on what you mean by workload for a batch processing system. I > believe we can split the problem in two: generating a realistic graph, and > using "representative" algorithms. > > To generate graphs we have two options in giraph: > > 1) random graph: you specify the number of vertices and the number of > edges for each vertex, and the edges will connect two random vertices. This > creates a graph with (i) low clustering coefficient, (ii) low average path > length, (ii) a uniform degree distribution > > 2) watts strogatz: you specify the number of vertices, the number of > edges, and a rewire probability beta. giraph will generate a ring lattice > (each vertex is connected to k preceeding vertices and k following > vertices) and rewire some of the edges randomly. This will create a graph > with (i) high clustering coefficient, (ii) low average path length, (iii) > poisson-like degree distribution (depends on beta). This graph will > resemble a small world graph such as a social network, except for the > degree distribution which will not a power law. > > To use representative algorithms you can choose: > > 1) PageRank: it's a ranking algorithm where all the vertices are active > and send messages along the edges at each superstep (hence you'll have O(V) > active vertices and O(E) messages) > > 2) Shortest Paths: starting from a random vertex you'll visit al the > vertices in the graph (some multiple times). This will have an aggregate > O(V) active vertices and O(E) messages, but this is only a lower bound. In > general you'l have different areas of the graph explored at each superstep, > and hence potentially a varying workload across different supersteps. > > 3) Connected Components: this will have something opposite to (2) as it > will have many active vertices at the beginning, where the detection is > refined towards the end. > > > Hope this helps, > Claudio > > > On Wed, Oct 2, 2013 at 4:59 PM, Wei Zhang <[email protected]> wrote: > >> Hi, >> >> I am interested in measuring some performance numbers of Giraph on my >> machine. >> >> I am wondering are there some pointers where I can get some >> (configurable) reasonably large workload to work on ? >> >> Thanks! >> >> Wei >> > > > > -- > Claudio Martella > [email protected] >
