[Neo4j] Store refactoring suggestions

Niels Hoogeveen Sun, 07 Aug 2011 14:33:36 -0700

Over the last couple of weeks, I have been working on an Enhanced API for 
Neo4j, and would like to make some suggestions on how some changes to the store 
layout can improve Neo4j and make the implementation of the Enhanced API 
simpler and more transparent.


Let me start with a definition of a relationship. 

A relationship is a set of roles, where each role is a set of nodes.

This is a much more general definition of a relationship than the one used in 
the current incarnation of Neo4j, but the Neo4j Relationship fits within this 
definition.

The Neo4j Relationship has two distinct roles: StartNode and EndNode, which 
each has a NodeSet of size 1.

Roles can come in four variations:

unrestricted: a role can contain an unlimited number of nodes, and each node 
can be connected to an unlimited number of relationships with that particular 
role
injective (functional): a role can contain only one node, but each node can be 
connected to an unlimited number of relationships with that particular role.
surjective (inverse functional): a role can contain an unlimited number of 
nodes, but each node can only be connected to one relationship with that 
particular role
bijective (one on one): a role can contain only one node and each node can only 
be connected to one relationship with that particular role.

The StartNode and EndNode roles, as they are implicitly available for the Neo4j 
Relationship are both injective (functional) roles.

Even in the binary case, where a relationship has exactly two sides, this 
limits the possibilities.

Suppose we want to store the fact that company A fired 500 different people on 
a particular day.

Using Neo4j Relationships we would have to create 500 relationships from 
company A to each individual person, with the fire-date as a property.

This is of course highly redundant, having the same date in the database 500 
times.

It would be much better if we could have one relationship going from company A 
with the fire-date as a property, connecting to all 500 individual persons.

Another limitation: suppose we want to register the birthplace of a person, 
considering the constraint that every person can only have one place of birth.

Since StartNode and EndNode  are functional roles, we cannot guarantee that 
only one relationship is created from a person to a place with the relationship 
BIRTHPLACE. We would need a bijective (one on one) role instead of StartNode. 
EndNode is suitable for the other role in this case, because a place can be the 
birth place of many people.

As a result of this limitation, the method Node#getSingleRelationship can throw 
an exception when more than one relationship is found for particular node with 
a certain RelationshipType and Direction. This could be avoided by making sure 
that getSingleRelationship can only be called by a RelationshipType of a 
certain kind.

Graphically we could interpret the Neo4j Relationship as follows:

Node(1)--(N)Relationship(N)--(1)Node

while the following 15 binary case are not support:

Node(N)--(N)Relationship(N)--(N)Node
Node(N)--(N)Relationship(N)--(1)Node
Node(N)--(N)Relationship(1)--(N)Node
Node(N)--(N)Relationship(1)--(1)Node
Node(N)--(1)Relationship(N)--(N)Node
Node(N)--(1)Relationship(N)--(1)Node
Node(N)--(1)Relationship(1)--(N)Node
Node(N)--(1)Relationship(1)--(1)Node
Node(1)--(N)Relationship(N)--(N)Node
Node(1)--(N)Relationship(1)--(N)Node
Node(1)--(N)Relationship(1)--(1)Node
Node(1)--(1)Relationship(N)--(N)Node
Node(1)--(1)Relationship(N)--(1)Node
Node(1)--(1)Relationship(1)--(N)Node
Node(1)--(1)Relationship(1)--(1)Node

With some modifications, the Neo4j database could support all these cases, and 
in fact could support relationships of higher arity at the same time. To 
achieve that, only two levels of indirection need to be added to the Neo4j 
store. One of these levels of indirection would actually improve the 
performance of the database.

Let's look at the current implementation of the Neo4j store, and in particular 
to two files: the node store and the relationship store.

The node store contains records containing information pertaining to the node. 
The record contains an entry point to a linked list of Relationships (all 
relationships associated with this Node). 

Suggestion 1:

Replace the entry point of the Relationships linked list with an entry point to 
a linked list of NodeRelationshipTypeRole records.

The NodeRelationshipTypeRole record has the following layout:

Id
RelationshipTypeId
RelationshipRoleId
NextRelationship (linked list of relationships associated with a node that have 
the same RelationshipType and RelationshipRole)
NextNodeRelationshipTypeRole (needed to iterate over the list)
PreviousRelationshipTypeRole (needed when a NodeRelationshipTypeRole is removed)

To make this possible, three additional steps need to be take: 

Create a RelationshipRole store with the following record layout:

id
RelationshipRoleName
RelationshipRoleType (indicating if the role is unrestricted, injective, 
surjective or bijective)

Create a RelationshipTypeRole store with the following record layout:

id
RelationshipRoleId
NextRelationshipTypeRoleId
PreviousRelationshipTypeRoleId

Add an entry point to the first RelationshipTypeRole in the RelationshipType 
record.

Due to the partitioning of Relationships per RelationshipType per 
RelationshipRole, we can easily find out if there is a Relationship with that 
RelationshipType and RelationshipRole associated with the Node, so it becomes 
feasible to support bijective RelationshipRoles as well. 

Graphically, the following binary relationships can be supported using this 
change:

Node(1)--(N)Relationship(N)--(1)Node
Node(1)--(N)Relationship(1)--(1)Node
Node(1)--(1)Relationship(N)--(1)Node
Node(1)--(1)Relationship(1)--(1)Node

This suggestion not only support one more RelationshipRole, it also improves 
performance of the database. Right now when a relationship of a certain type 
and direction is sought, all relationships associated with a Node are fetched, 
and only those matching are returned. Partitioning Relationships per 
RelationshipType per RelationshipRole makes it possible to quickly find the 
linked list of requested Relationships. 

This suggestion solves the densely-connected-node problem, addressed in various 
other posts.

Suggestion 2:

Right now the Relationship record contains four fields to support two linked 
lists of relationships (one for each connected node), plus two fields 
containing the node id of the connected node.

Replace the support for the two linked lists to a separate store 
NodeRelationshipTypeRoleRelationship, with the following record layout:

id
RelationshipId
NextNodeRelationshipTypeRoleRelationshipId
PreviousNodeRelationshipTypeRoleRelationshipId

Each NodeRelationshipTypeRole as described under suggestion 1, no longer points 
to the first Relationship in the list, but to a 
NodeRelationshipTypeRoleRelationship, except in the case where the Role is 
surjective or bijective, in that case the RelationshipId can be stored.

Further, replace the two node id fields in Relationship with an entry point to 
a list of RelationshipRole records, which have the following record layout:

id
NextRelationshipRoleId
PreviousRelationshipRoleId
NextRelationshipRoleNodeId

The NextRelationshipRoleNodeId field points to a linked list of 
RelationshipRoleNode records, with the following record layout:

id
NodeId
NextRelationshipRoleNodeId
PreviousRelationshipRoleNodeId

If the Role is functional or bijective, we don't have to store an entry to the 
RelationshipRoleNode record, but can instead just store the node id, because 
there can only be one node for that Role.

With this change a Relationship can have an unlimited number of Roles attached 
to it and each Role can have an unlimited number of Nodes attached to it.

If we want to fetch the Relationships given a certain Node, RelationshipType 
and Role the number of operations remains almost the same as under scenario 1, 
but instead of fetching the Relationship record with each iteration, we now 
fetch the RelationshipId with each iteration. Fetching the Relationship itself 
would require one more read operation.

Fetching the nodes of a relationship becomes a bit more complicated. Once we 
have the relationship record in the current situation, we actually have the 
NodeIds of the associated Nodes for free. 

With this suggestions we have to do a little more work. We need to iterate over 
the RelationshipRole records (for the binary case this is at most two reads) 
and then iterate over the RelationshipRoleNodes. For the common binary cases, 
there will not be a RelationshipRoleNodes record, but simply a NodeId stored in 
the RelationshipRole record, reducing the number of reads to fetch the nodeIds 
to at most two.

With this suggestion, the other twelve scenarios we can have with binary 
relations can be maintained, and we can even maintain any type of N-ary 
relationship.

Suggestion 1 and 2 combined still give an average increase of performance (due 
to suggestion 1, though suggestion 2 adds very little additional overhead).

Suggestion 3

Fuse the Node store and the Relationship store. With suggestion 1 and 2, the 
difference between a Node and a Relationship becomes minimal. A Node has an 
additional entry point for the NodeRelationshipTypeRole list, something a 
Relationship would benefit from having too. A Relationship has two additional 
fields: RelationshipTypeId and NextRelationshipRoleId. Those two fields can 
remain empty in the Node record if it relates to a non-Relationship node.

Fusing the two stores makes Relationships to Relationships possible without 
having to create shadow nodes.

With these suggestions implemented in core, much of the Enhanced API could be 
moved to the standard API.

The impact on performance of the database is positive in the general case, 
slightly worse in some rare corner cases, and a huge improvement in some fairly 
common cases where a node has many relationships attached to it.

I will make further suggestions for dealing with properties in a separate post. 
This is already enough to chew on for now.

Niels                                     
_______________________________________________
Neo4j mailing list
[email protected]
https://lists.neo4j.org/mailman/listinfo/user

[Neo4j] Store refactoring suggestions

Reply via email to