Re: Scalability results for GoldenOrb and comparison with Giraph

2011-12-14 Thread Avery Ching
If you're going to do benchmarking, make sure to include 
https://issues.apache.org/jira/browse/GIRAPH-57 as it should provide a 
nice messaging boost!


Avery

On 12/14/11 2:16 PM, Jon Allen wrote:

Hi Claudio,

It looks like it might be a little tough to squeeze out scalability tests for 
hama and giraph by the FOSDEM deadline. We can try to put something together if 
you'd like though (not sure where I'll be able to procure time on a cluster for 
testing by then, but it won't hurt to try I suppose).

If you just want to present a technical discussion and background for 
scalability testing graph processing frameworks, I should have time this 
upcoming sunday to have a chat and help with presentation materials. Just drop 
me a line if you're interested and we'll set something up over Skype.

Thanks,
Jon

On Dec 12, 2011, at 2:44 PM, Claudio Martella wrote:


This is all very interesting. As I wrote a few weeks ago also on
golden orb's ML, i thought about discussing a nice benchmarking
toolset at the graph devroom of FOSDEM with hama, goldenorb and giraph
devs.

Apparently everything got quite anticipated, cool :)

I believe the SSSP and PageRank algorithms are great examples for
benchmarking as they have a completely different messaging pattern.
There are though other technicalities to test, such as the
scalability of graph mutation operations, graph load etc.

Jon, thanks for your nice contribution from my side as well.

On Mon, Dec 12, 2011 at 8:19 PM, Avery Chingach...@apache.org  wrote:

Thanks for the detail on your experiments.  I certainly agree that it would
be very useful to make some sort of scalability/performance testing
framework to evaluate improvements.  Definitely would appreciate your help
in putting one together.  We have a few benchmarks (PageRankBenchmark and
RandomMessageBenchmark), but would appreciate any help you would like to
provide.

Otherwise, if that doesn't interest you, please have a look at the open
JIRAs

https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=truejqlQuery=project+%3D+GIRAPH+AND+resolution+%3D+Unresolved+AND+assignee+is+EMPTY+ORDER+BY+priority+DESCmode=hide

and see what's interesting for you.  If nothing there interests you, feel
freel to discuss here or on giraph-dev or open up a JIRA.  =)

Avery


On 12/11/11 1:23 PM, Jon Allen wrote:

Hi Avery,

Thanks for the response. I reached out to the graph user mailing list
because I am quite interested in helping develop / execute standardized
scalability testing for Giraph, so I'm glad to see that there is interest!

Here's some follow up to some of the points you raised / questions you
asked:

Currently, the biggest limitation faced by GoldenOrb is the capacity
issue; it can't handle more than roughly 100,000 vertices per node. This low
maximum vertices per node limitation, coupled with instability issues,
obviously hampered the ability to conduct ideal scalability testing, but
even with graphs totaling 100,000 to 250,000 vertices a clear power law
slope can be found before hitting an inevitable communication bottleneck.
This can be seen by noting that the log-log slopes of the 20k, 50k, and 100k
graphs (for SSSP) remain fairly constant, and negative, as the number of
nodes in the cluster grows, unlike the slopes for 5k, 2k, and 1k graphs
which demonstrate a framework overhead bottleneck, corresponding to the
point where the slope changes from negative to roughly 0 or positive (which
appears to happen at around 1k vertices per node).

On to the second issue you brought up…

Graph problems can be notoriously difficult to implement scalability
testing for precisely for the reasons you brought up. A few things were done
to allow an apples-to-apples comparison with the Pregel results. First, the
single source shortest path algorithm used for testing comes directly from
the Pregel paper. Second, just as in the Pregel tests, binary tree graphs
were used to ensure that each vertex had the same fixed, low order,
outdegree. Last, the tests were repeated using non-binary tree graphs
(generated by a python script) with a non-constant, but low order, average
outdegree per vertex (average 10 edges per vertex, then again with graphs
averaging 90 edges per vertex), the results of which were seen to be quite
close to the binary tree graph data.

As mentioned in passing, the scalability test results allow for a direct
comparison with the Pregel results, but should also allow for a meaningful
comparison to your scalability results for Giraph precisely because the
edges per vertex have been fixed. While this is not ideal (I would prefer a
standardized set of tests which everybody runs in standardized
configurations), the proposition that the results can be meaningfully
compared is backed up by two points; First, the log-log slope of the data
you presented is right in line with the value reported by Pregel for their
SSSP tests, both of which are realistic values (and show very good
parallelization!), meaning that both 

Re: Scalability results for GoldenOrb and comparison with Giraph

2011-12-11 Thread Jon Allen
Hi Avery,

Thanks for the response. I reached out to the graph user mailing list because I 
am quite interested in helping develop / execute standardized scalability 
testing for Giraph, so I'm glad to see that there is interest!

Here's some follow up to some of the points you raised / questions you asked:

Currently, the biggest limitation faced by GoldenOrb is the capacity issue; it 
can't handle more than roughly 100,000 vertices per node. This low maximum 
vertices per node limitation, coupled with instability issues, obviously 
hampered the ability to conduct ideal scalability testing, but even with graphs 
totaling 100,000 to 250,000 vertices a clear power law slope can be found 
before hitting an inevitable communication bottleneck. This can be seen by 
noting that the log-log slopes of the 20k, 50k, and 100k graphs (for SSSP) 
remain fairly constant, and negative, as the number of nodes in the cluster 
grows, unlike the slopes for 5k, 2k, and 1k graphs which demonstrate a 
framework overhead bottleneck, corresponding to the point where the slope 
changes from negative to roughly 0 or positive (which appears to happen at 
around 1k vertices per node).

On to the second issue you brought up… 

Graph problems can be notoriously difficult to implement scalability testing 
for precisely for the reasons you brought up. A few things were done to allow 
an apples-to-apples comparison with the Pregel results. First, the single 
source shortest path algorithm used for testing comes directly from the Pregel 
paper. Second, just as in the Pregel tests, binary tree graphs were used to 
ensure that each vertex had the same fixed, low order, outdegree. Last, the 
tests were repeated using non-binary tree graphs (generated by a python script) 
with a non-constant, but low order, average outdegree per vertex (average 10 
edges per vertex, then again with graphs averaging 90 edges per vertex), the 
results of which were seen to be quite close to the binary tree graph data.

As mentioned in passing, the scalability test results allow for a direct 
comparison with the Pregel results, but should also allow for a meaningful 
comparison to your scalability results for Giraph precisely because the edges 
per vertex have been fixed. While this is not ideal (I would prefer a 
standardized set of tests which everybody runs in standardized configurations), 
the proposition that the results can be meaningfully compared is backed up by 
two points; First, the log-log slope of the data you presented is right in line 
with the value reported by Pregel for their SSSP tests, both of which are 
realistic values (and show very good parallelization!), meaning that both 
algorithms display similar properties for configurations in the regime not 
dominated by a framework overhead bottleneck. And second, the GoldenOrb SSSP 
results being compared are also from configurations which have reached a steady 
power law slope over the range of nodes considered, for runs using the same 
algorithm as the Pregel results. These two points, I feel, justify the 
comparisons made (though, again, it would be better to have a standardized set 
of configurations for testing to facilitate comparing results, even between 
algorithms). Since all three sets of scalability tests yield fairly linear 
complexity plots (execution time vs. number of vertices in the graph, slide 29 
of your talk), it also makes sense to compare weak scaling results, a 
proposition supported by the consistency of the observed GoldenOrb weak scaling 
results for SSSP across multiple test configurations.


As for the results found in your October 2011 talk, they are impressive and 
clearly demonstrate an ability to effectively scale to large graph problems 
(shown by the weak scaling slope of ~ 0.01) and to maximize the benefit of 
throwing additional computational resources at a known problem (shown by the 
strong scaling slope of ~ -0.93), so I'm interested to see the results of the 
improvements that have been made. I'm a big proponent of routine scalability 
testing using a fixed set of configurations as part of the software testing 
process, as the comparable results help to quantify improvement as the 
software is developed further and can often help to identify unintended side 
effects of changes / find optimal configurations for various regimes of 
problems, and would like to see Giraph succeed, so let me know if there's any 
open issues which I might be able to dig into (I'm on the dev mailing list as 
well, though haven't posted there).

Thanks,
Jon


On Dec 11, 2011, at 1:02 PM, Avery Ching wrote:

 Hi Jon,
 
 -golden...@googlegroups.com (so as to not clog up their mailing list 
 uninvited)
 
 First of all, thank you for sharing this comparison.  I would like to note a 
 few things.  The results I posted in October 2011 were actually a bit old 
 (done in June 2011) and do not have several improvements that reduce memory 
 usage significantly (i.e. GIRAPH-12 and 

Scalability results for GoldenOrb and comparison with Giraph

2011-12-10 Thread Jon Allen
Since GoldenOrb was released this past summer, a number of people have asked 
questions regarding scalability and performance testing, as well as a 
comparison of these results with those of Giraph ( 
http://incubator.apache.org/giraph/ ), so I went forward with running tests to 
help answer some of these questions.

A full report of the scalability testing results, along with methodology 
details, relevant information regarding testing and analysis, links to data 
points for Pregel and Giraph, scalability testing references, and background 
mathematics, can be found here:

http://wwwrel.ph.utexas.edu/Members/jon/golden_orb/

Since this data will also be of interest to the Giraph community (for 
methodology, background references, and analysis reasons), I am cross posting 
to the Giraph user mailing list.

A synopsis of the scalability results for GoldenOrb, and comparison data points 
for Giraph and Google's Pregel framework are provided below. 

The setup and execution of GoldenOrb scalability tests were conducted by three 
former Ravel (http://www.raveldata.com ) developers, including myself, with 
extensive knowledge of the GoldenOrb code base and optimal system 
configurations, ensuring the most optimal settings were used for scalability 
testing. 


RESULTS SUMMARY:


MAX CAPACITY: 

Pregel (at least): 166,666,667 vertices per node. 

Giraph (at least): 1,666,667 vertices per worker. 

GoldenOrb: ~ 100,000 vertices per node, 33,333 vertices per worker. 


STRONG SCALING (SSSP): 
Note: Optimal parallelization corresponds to the minimum value -1.0. Deviation 
from the minimum possible value of -1.0 corresponds to non-optimal 
parallelization. 

Pregel: -0.924 (1 billion total vertices) 

Giraph: -0.934 (250 Million total vertices) 

GoldenOrb: -0.031 Average, -0.631 Best (10 total vertices), 0.020 Worst 
(1000 total vertices) 


WEAK SCALING (SSSP): 
Note: Optimal weak scalability corresponds to the value 0.0. Deviation from the 
optimal value of 0.0, corresponds to non-optimal usage of computational 
resources as managed by the framework.

Pregel: No Data Available 

Giraph: 0.01 (1,666,667 vertices per worker) 

GoldenOrb: 0.37 Average, 0.23 Best (500 vertices per node), 0.48 Worst (12500 
vertices per node) 



I hope this answers some of the many questions which have been posted regarding 
scalability and performance. Be sure to check out the full scalability testing 
report at http://wwwrel.ph.utexas.edu/Members/jon/golden_orb/  Please let me 
know if you have any questions.

Thanks,
Jon