RE: [Accumulo Contrib Proposal] Graphulo: Server-side Matrix Math library

dlmarion Fri, 28 Aug 2015 05:09:13 -0700

    
Dylan,
  I am a little confused about whether you want to place this in the contrib 
area or create a sub-project, as both are mentioned in your proposal. Also, if 
you intend for this to be a sub-project, have you looked at the incubator 
process? From what I understand, given that this is a code contribution, it 
will have to go through that process.

-------- Original message --------
From: Dylan Hutchison <[email protected]> 
Date: 08/28/2015  2:43 AM  (GMT-05:00) 
To: Accumulo Dev List <[email protected]> 
Cc: Accumulo User List <[email protected]> 
Subject: [Accumulo Contrib Proposal] Graphulo: Server-side Matrix Math library 

Dear Accumulo community,
I humbly ask your consideration of Graphulo as a new contrib project to 
Accumulo.  Let's use this thread to discuss what Graphulo is, how it fits into 
the Accumulo community, where we can take it together as a new community, and 
how you can use it right now.  Please see the README at Graphulo's GitHub, and 
for a more in-depth look see the docs/ folder or the examples.
https://github.com/Accla/graphulo
Graphulo is a Java library for the Apache Accumulo database delivering 
server-side sparse matrix math primitives that enable higher-level graph 
algorithms and analytics.
Pitch: Organizations use Accumulo for high-performance indexed and distributed 
data storage.  What do they do after their data is stored?  Many use cases run 
analytics and algorithms on data in Accumulo, which, aside from simple iterator 
uses, requires scanning data out of Accumulo to a computation engine, only to 
write the computation results back to Accumulo.  Graphulo enables a class of 
algorithms to run inside the Accumulo server like a stored procedure, 
especially (but not restricted to) those written in the language of graphs and 
linear algebra.  Take breadth-first search as a simple use case and PageRank as 
a more complex one.  As a stretch goal, imagine analysts and mathematicians 
executing PageRank and other high-level algorithms on top of the Graphulo 
library on top of Accumulo at high performance.
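To make "the language of graphs and linear algebra" concrete, here is a minimal sketch of breadth-first search written as repeated sparse vector-matrix multiplication.  It is plain Java over in-memory maps, not Graphulo's API and not server-side code; the class and method names (BfsSketch, multiply, bfs, addEdge) are made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Illustrative only: an in-memory stand-in, not Graphulo's API.
class BfsSketch {

    /** Sparse product y = x^T * A; absent entries count as zero. */
    static Map<String, Integer> multiply(Map<String, Integer> x,
                                         Map<String, Map<String, Integer>> a) {
        Map<String, Integer> y = new TreeMap<>();
        for (Map.Entry<String, Integer> xe : x.entrySet()) {
            Map<String, Integer> row = a.get(xe.getKey());
            if (row == null) {
                continue; // this node has no outgoing edges
            }
            for (Map.Entry<String, Integer> ae : row.entrySet()) {
                y.merge(ae.getKey(), xe.getValue() * ae.getValue(), Integer::sum);
            }
        }
        return y;
    }

    /** Returns every node within `hops` edges of `start`, including `start`. */
    static Set<String> bfs(Map<String, Map<String, Integer>> a, String start, int hops) {
        Set<String> visited = new TreeSet<>();
        visited.add(start);
        Map<String, Integer> frontier = new HashMap<>();
        frontier.put(start, 1);
        for (int i = 0; i < hops && !frontier.isEmpty(); i++) {
            Map<String, Integer> next = multiply(frontier, a); // one BFS step
            next.keySet().removeAll(visited);                  // mask out seen nodes
            visited.addAll(next.keySet());
            frontier = next;
        }
        return visited;
    }

    /** Adds a directed edge with weight 1 to the sparse adjacency matrix. */
    static void addEdge(Map<String, Map<String, Integer>> a, String from, String to) {
        a.computeIfAbsent(from, k -> new TreeMap<>()).put(to, 1);
    }

    public static void main(String[] args) {
        Map<String, Map<String, Integer>> a = new TreeMap<>();
        addEdge(a, "a", "b");
        addEdge(a, "b", "c");
        addEdge(a, "c", "d");
        System.out.println(bfs(a, "a", 2)); // nodes within two hops of "a"
    }
}
```

Each loop iteration is one multiply followed by a mask, which is the kind of primitive the pitch describes pushing into the server instead of scanning the table out to a client.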
I have developed Graphulo at the MIT Lincoln Laboratory with support from the 
NSF since last March.  I owe thanks to Jeremy Kepner, Vijay Gadepally, and Adam 
Fuchs for high-level comments during the design and performance testing phases.  
I also proposed a now-obsolete design document to the Accumulo community last 
spring, which received good feedback.
The time is ripe for Graphulo to graduate from my personal development into 
larger participation.  Beyond myself and beyond the Lincoln Laboratory, 
Graphulo is for the Accumulo community.  Users need a place where they can 
interact, developers need a place where they can look at, comment on, and 
debate designs and diffs, and both users and developers need a place where they 
can interact and see Graphulo alongside its Accumulo base.
The following outlines a few reasons why I see contrib to Accumulo as 
Graphulo's proper place:

Establishing Graphulo as an Apache (sub-)project is a first step toward 
building a community.  The spirit of Apache--its mailing list discussions, low 
barrier to interactions between users and developers new and old, open 
meritocracy and more--is a surefire way to bring Graphulo to the people it will 
help and the people who want to help it in turn.

Committing to core Accumulo doesn't seem appropriate for all of Graphulo, 
because Graphulo uses Accumulo in a specific way (server-side computation) in 
support of algorithms and applications.  Parts of Graphulo that are useful for 
all Accumulo users (not just matrix math for algorithms) could be transferred 
from Graphulo to Accumulo, such as ApplyIterator or SmallLargeRowFilter or 
DynamicIterator.  

Leaving Graphulo as an external project would leave it too decoupled from 
Accumulo.  Graphulo has the potential to drive features in core Accumulo such 
as ACCUMULO-3978, ACCUMULO-3710, and ACCUMULO-3751.  By making Graphulo a 
contrib sub-project, the two can maintain a tight relationship while still 
keeping independent versions.

Historically, projects have gone into Accumulo contrib and become stale.  I 
assure you I do not intend this fate for Graphulo.  The Lincoln Laboratory has 
interests in Graphulo, and I will continue developing Graphulo at the very 
least to help transition it to greater community involvement.  However, since I 
will start a PhD program at UW next month, I cannot make Graphulo a full-time 
job as I have in recent history.  I do have ideas for using Graphulo as part of 
my PhD database research.
Thus, in the case of large community support, I can transition to a support 
role while others in the community step up.  With smaller community support, I 
can continue working on Graphulo as before at my own pace and perhaps more 
publicly.  There are only a few steps left before Graphulo could be considered 
"finished software":

Developing a new interface to Graphulo's core functions using immutable 
argument objects, which simplifies developer APIs, increases generalizability, 
and facilitates features like asynchronous and parallel operations.  It would 
be good if other developers weigh in with their opinions on designs as we 
propose them, since this decides how Graphulo users interact with Graphulo.

Instrumenting Graphulo for monitoring, profiling and benchmarking.  I have a 
blueprint for using HTrace to make these tasks as easy as browsing a web page.  
This needs careful thought and discussion before implementing, since the 
instrumentation will go everywhere.  It would be nice if Graphulo and Accumulo 
mirrored instrumentation strategies, so it would be good to have that 
discussion in the same venue.

Rigorous scale testing.  Good instrumentation is key.  With successful scale 
testing, we can paint a clear picture for potential adopters of which 
operations Graphulo excels at, ultimately plotting where Graphulo stands in the 
world of big data software.

Explicitly supporting the GraphBLAS spec, once it is agreed upon.  Graphulo was 
designed from the ground up with the GraphBLAS in mind, so this should be an 
easy task.  Aligning with this upcoming industry standard bodes well for ease 
of developing Graphulo algorithms.

Developing more algorithms and applications will follow too, and I imagine this 
as an excellent place where newcomer users can get involved.
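
As a rough illustration of the "immutable argument objects" step listed above, the sketch below shows one way such an object could look in Java.  Everything in it is hypothetical: MultiplyArgs, its fields, and its methods are invented for this example and do not come from Graphulo's code base or any agreed design.

```java
import java.util.Objects;

// Hypothetical argument object; none of these names are from Graphulo.
final class MultiplyArgs {
    private final String leftTable;
    private final String rightTable;
    private final String resultTable;
    private final int numThreads;

    private MultiplyArgs(String leftTable, String rightTable,
                         String resultTable, int numThreads) {
        this.leftTable = Objects.requireNonNull(leftTable, "leftTable");
        this.rightTable = Objects.requireNonNull(rightTable, "rightTable");
        this.resultTable = Objects.requireNonNull(resultTable, "resultTable");
        if (numThreads < 1) {
            throw new IllegalArgumentException("numThreads must be >= 1");
        }
        this.numThreads = numThreads;
    }

    /** Creates arguments with a default of one thread. */
    static MultiplyArgs of(String leftTable, String rightTable, String resultTable) {
        return new MultiplyArgs(leftTable, rightTable, resultTable, 1);
    }

    /** Returns a copy with a different result table; this object is unchanged. */
    MultiplyArgs withResultTable(String newResultTable) {
        return new MultiplyArgs(leftTable, rightTable, newResultTable, numThreads);
    }

    /** Returns a copy with a different thread count; this object is unchanged. */
    MultiplyArgs withNumThreads(int newNumThreads) {
        return new MultiplyArgs(leftTable, rightTable, resultTable, newNumThreads);
    }

    String leftTable() { return leftTable; }
    String rightTable() { return rightTable; }
    String resultTable() { return resultTable; }
    int numThreads() { return numThreads; }
}
```

Because every field is final and each `with...` method returns a new object, one instance can be handed to several asynchronous or parallel calls without any of them seeing a half-updated configuration.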
Some other places Graphulo needs work worth mentioning are creating a proper 
release framework (the release here could use improvement, starting with signed 
artifacts) and reviewing the way Graphulo runs tests (currently centered around 
a critical file called TEST_CONFIG.java which is great for one developer, 
whereas a config file probably works better).  Both of these are places more 
experienced developers could help.  I should also mention that Graphulo has 
groundwork in place for operations between Accumulo instances, but I doubt many 
users would need that level of control.
Regarding IP, I'm happy to donate my commits to the ASF, which covers 99% of 
the Graphulo code base.  I'm sure other issues will arise and we can sort them 
out.  Sean Busbey, perhaps I could ask your assistance as someone more 
knowledgeable in this area.  Regarding dependencies, effectively every direct 
dependency is org.apache, so nothing to worry about here.
I acknowledge that I will lose dictatorial power and gain some bureaucratic / 
discussion overhead by moving from sole developer to an Apache model.  The 
benefits of a community are well worth it.
If we as a community decide that contrib is the right place for Graphulo, then 
there are lots of logistical questions to decide like where the code will live, 
where JIRA will live, what mailing lists to use, what URL to give Graphulo 
within apache.org, etc.  We can tackle these at our leisure.  Let's discuss 
Graphulo and Accumulo here first.
Warmly, Dylan Hutchison
