[Neo4j] sub-graphs

Niels Hoogeveen Sun, 07 Aug 2011 20:37:42 -0700

While I am at it, let's post another brain dump.

A couple of weeks ago, I worked on SortedTree/IndexRelationships in an attempt 
to solve the densely-connected-node-problem.


SortedTree is a Btree layed-out in the graph, sorted by some function on a node 
(eg. the nodeId, or a property value).

This approach worked, to a degree, but at some point, load times decrease 
because of reorganizations of the tree. Too much memory is needed to keep the 
entire tree in memory and standard nodes and relationships are simply too fine 
grained for the job. Instead of loading each individual node and each 
individual relationships in an index block, it would be nice to be able to load 
the entire block with one read operation, and swap out an entire memory block 
when memory is needed. 

This brought me to the idea of sub-graphs. Let's say every node (and possibly 
relationship) is a graph, containing nodes and relationships. Each graph has 
its own store (if the contained graph is not empty). Relationships are 
lightweight (offset based) when associated with Nodes (and possibly 
Relationships) within the same graph, but require an extra store_id when 
associating with nodes (and possibly Relationship) outside that graph. This 
gives control over where things are stored, and what is stored together. 

Using RelationshipRoles, as I described in another post we can state which 
association of a Relationship is certainly stored local, and what is certainly 
stored in another another Graph and what is either stored local or in another 
Graph. This way we can have full control over the locality of the associations 
of a Relationship.

If we make each index block of SortedTree its own graph we can make sure that 
all Relationship associations are local, except the ones eventually pointing to 
the Nodes we want indexed, those are certainly stored in another Graph. This 
way the store will only contain Nodes and Relationships belonging to that index 
block, so we can load the entire store in a set of buffers and flush those 
buffers when no longer needed. 

This approach could also be used for sharding the database. Since each node in 
the graph can be a store of its own, we have a natural means to distribute 
graphs over different shards.

Lets define a shard as a set of graphs, which membership is decided by some 
rules defined on the RelationshipTypes used in the shard.

We could add the following options to the RelationshipRoles: must be shard, may 
be in shard and must not be in shard. 

This way the RelationshipRoles used in a store determine the dependencies of 
that store.

RelationshipRoles can form 9 possible combinations of settings over the 
locality of each Relationship association, one of which is mutually exclusive 
and some are tautological or inconsequential:

Must be in store and Must be in shard (is tautological).
Must be in store and May be in shard (is inconsequential).
Must be in store and Must not be in shard (is impossible)
May be in store and Must be in shard
May be in store and May be in shard (is inconsequential)
May be in store and Must not be in shard (is inconsequential)
Must not be in store and Must be in shard
Must not be in store and May be in shard (inconsequential)
Must not be in store and Must not be in shard (is tautological)

So that leaves the following RelationshipRole options:
Must be in store 
May be in store and Must be in shard
May be in shard 
Must not be in store and Must be in shard
Must not be in store 
Must not be in shard 

The default RelationshipRoles of a standard binary relationship are StartNode 
and EndNode, which both will have as default setting "Must be in store". This 
way an implementation of such an approach remains backwards compatible. 

When combining RelationshipRoles into a RelationshipType, at least one 
RelationshipRole in the set must not have the setting "Must not be in store", 
which is implied by "Must not be in shard". Any such combination cannot be 
stored, since no store can contain any of the associated Nodes.

When creating a Relationship, a store adds that Relationship when at least one 
associated Node is actually present in the database.

When adding NodeTypes to the mix, the distribution of Nodes and Relationships 
over the various stores can even be further controlled. If we would know for 
each created Node if it must have an associated RelationshipRole, may have an 
associated RelationshipRole, or must not have an associated RelationshipRole, 
it becomes possible to decide if a Node Must be added to a store, may be added 
to a store, or must not be added to a store. For the may be added to a store 
cases, a Coordinator can decide where to store those particular Nodes.

Finally, this approach allows for distributed traversals. Traversals are always 
local, when a traversal branch hits upon a relationship association that is 
external to the store, that traversal will asynchronously be continued on that 
other store. When the traversal ends its local search, it pulls the results 
from the asynchronous traversals from a queue, until all traversals have ended. 

With the proper settings of RelationshipRoles we can neatly control the flow of 
a Traveral to obtain parallelism where that is desired. 

--End of brain dump

Niels



                                          
_______________________________________________
Neo4j mailing list
[email protected]
https://lists.neo4j.org/mailman/listinfo/user

[Neo4j] sub-graphs

Reply via email to