While I am at it, let's post another brain dump.
A couple of weeks ago, I worked on SortedTree/IndexRelationships in an attempt
to solve the densely-connected-node-problem.
SortedTree is a Btree layed-out in the graph, sorted by some function on a node
(eg. the nodeId, or a property value).
This approach worked, to a degree, but at some point, load times decrease
because of reorganizations of the tree. Too much memory is needed to keep the
entire tree in memory and standard nodes and relationships are simply too fine
grained for the job. Instead of loading each individual node and each
individual relationships in an index block, it would be nice to be able to load
the entire block with one read operation, and swap out an entire memory block
when memory is needed.
This brought me to the idea of sub-graphs. Let's say every node (and possibly
relationship) is a graph, containing nodes and relationships. Each graph has
its own store (if the contained graph is not empty). Relationships are
lightweight (offset based) when associated with Nodes (and possibly
Relationships) within the same graph, but require an extra store_id when
associating with nodes (and possibly Relationship) outside that graph. This
gives control over where things are stored, and what is stored together.
Using RelationshipRoles, as I described in another post we can state which
association of a Relationship is certainly stored local, and what is certainly
stored in another another Graph and what is either stored local or in another
Graph. This way we can have full control over the locality of the associations
of a Relationship.
If we make each index block of SortedTree its own graph we can make sure that
all Relationship associations are local, except the ones eventually pointing to
the Nodes we want indexed, those are certainly stored in another Graph. This
way the store will only contain Nodes and Relationships belonging to that index
block, so we can load the entire store in a set of buffers and flush those
buffers when no longer needed.
This approach could also be used for sharding the database. Since each node in
the graph can be a store of its own, we have a natural means to distribute
graphs over different shards.
Lets define a shard as a set of graphs, which membership is decided by some
rules defined on the RelationshipTypes used in the shard.
We could add the following options to the RelationshipRoles: must be shard, may
be in shard and must not be in shard.
This way the RelationshipRoles used in a store determine the dependencies of
that store.
RelationshipRoles can form 9 possible combinations of settings over the
locality of each Relationship association, one of which is mutually exclusive
and some are tautological or inconsequential:
Must be in store and Must be in shard (is tautological).
Must be in store and May be in shard (is inconsequential).
Must be in store and Must not be in shard (is impossible)
May be in store and Must be in shard
May be in store and May be in shard (is inconsequential)
May be in store and Must not be in shard (is inconsequential)
Must not be in store and Must be in shard
Must not be in store and May be in shard (inconsequential)
Must not be in store and Must not be in shard (is tautological)
So that leaves the following RelationshipRole options:
Must be in store
May be in store and Must be in shard
May be in shard
Must not be in store and Must be in shard
Must not be in store
Must not be in shard
The default RelationshipRoles of a standard binary relationship are StartNode
and EndNode, which both will have as default setting "Must be in store". This
way an implementation of such an approach remains backwards compatible.
When combining RelationshipRoles into a RelationshipType, at least one
RelationshipRole in the set must not have the setting "Must not be in store",
which is implied by "Must not be in shard". Any such combination cannot be
stored, since no store can contain any of the associated Nodes.
When creating a Relationship, a store adds that Relationship when at least one
associated Node is actually present in the database.
When adding NodeTypes to the mix, the distribution of Nodes and Relationships
over the various stores can even be further controlled. If we would know for
each created Node if it must have an associated RelationshipRole, may have an
associated RelationshipRole, or must not have an associated RelationshipRole,
it becomes possible to decide if a Node Must be added to a store, may be added
to a store, or must not be added to a store. For the may be added to a store
cases, a Coordinator can decide where to store those particular Nodes.
Finally, this approach allows for distributed traversals. Traversals are always
local, when a traversal branch hits upon a relationship association that is
external to the store, that traversal will asynchronously be continued on that
other store. When the traversal ends its local search, it pulls the results
from the asynchronous traversals from a queue, until all traversals have ended.
With the proper settings of RelationshipRoles we can neatly control the flow of
a Traveral to obtain parallelism where that is desired.
--End of brain dump
Niels
_______________________________________________
Neo4j mailing list
[email protected]
https://lists.neo4j.org/mailman/listinfo/user