There are a number real (medium sized) graphs at
http://snap.stanford.edu/data/index.html which we use for similar
benchmarks. It has a good mix of graph types, sparse/dense, ground truth
graphs (e.g. social networks that follow power law distribution etc.). So
far we have observed that the type of graph has a high impact on the
performance of algorithms that Claudio mentioned.


On Wed, Oct 2, 2013 at 8:22 AM, Claudio Martella <[email protected]
> wrote:

> Hi Wei,
>
> it depends on what you mean by workload for a batch processing system. I
> believe we can split the problem in two: generating a realistic graph, and
> using "representative" algorithms.
>
> To generate graphs we have two options in giraph:
>
> 1) random graph: you specify the number of vertices and the number of
> edges for each vertex, and the edges will connect two random vertices. This
> creates a graph with (i) low clustering coefficient, (ii) low average path
> length, (ii) a uniform degree distribution
>
> 2) watts strogatz: you specify the number of vertices, the number of
> edges, and a rewire probability beta. giraph will generate a ring lattice
> (each vertex is connected to k preceeding vertices and k following
> vertices) and rewire some of the edges randomly. This will create a graph
> with (i) high clustering coefficient, (ii) low average path length, (iii)
> poisson-like degree distribution (depends on beta). This graph will
> resemble a small world graph such as a social network, except for the
> degree distribution which will not a power law.
>
> To use representative algorithms you can choose:
>
> 1) PageRank: it's a ranking algorithm where all the vertices are active
> and send messages along the edges at each superstep (hence you'll have O(V)
> active vertices and O(E) messages)
>
> 2) Shortest Paths: starting from a random vertex you'll visit al the
> vertices in the graph (some multiple times). This will have an aggregate
> O(V) active vertices and O(E) messages, but this is only a lower bound. In
> general you'l have different areas of the graph explored at each superstep,
> and hence potentially a varying workload across different supersteps.
>
> 3) Connected Components: this will have something opposite to (2) as it
> will have many active vertices at the beginning, where the detection is
> refined towards the end.
>
>
> Hope this helps,
> Claudio
>
>
> On Wed, Oct 2, 2013 at 4:59 PM, Wei Zhang <[email protected]> wrote:
>
>> Hi,
>>
>> I am interested in measuring some performance numbers of Giraph on my
>> machine.
>>
>> I am wondering are there some pointers where I can get some
>> (configurable) reasonably large workload to work on ?
>>
>> Thanks!
>>
>> Wei
>>
>
>
>
> --
>    Claudio Martella
>    [email protected]
>

Reply via email to