Over the last couple of weeks, I have been working on an Enhanced API for Neo4j, and would like to make some suggestions on how some changes to the store layout can improve Neo4j and make the implementation of the Enhanced API simpler and more transparent.
Let me start with a definition of a relationship. A relationship is a set of roles, where each role is a set of nodes. This is a much more general definition of a relationship than the one used in the current incarnation of Neo4j, but the Neo4j Relationship fits within this definition. The Neo4j Relationship has two distinct roles: StartNode and EndNode, which each has a NodeSet of size 1. Roles can come in four variations: unrestricted: a role can contain an unlimited number of nodes, and each node can be connected to an unlimited number of relationships with that particular role injective (functional): a role can contain only one node, but each node can be connected to an unlimited number of relationships with that particular role. surjective (inverse functional): a role can contain an unlimited number of nodes, but each node can only be connected to one relationship with that particular role bijective (one on one): a role can contain only one node and each node can only be connected to one relationship with that particular role. The StartNode and EndNode roles, as they are implicitly available for the Neo4j Relationship are both injective (functional) roles. Even in the binary case, where a relationship has exactly two sides, this limits the possibilities. Suppose we want to store the fact that company A fired 500 different people on a particular day. Using Neo4j Relationships we would have to create 500 relationships from company A to each individual person, with the fire-date as a property. This is of course highly redundant, having the same date in the database 500 times. It would be much better if we could have one relationship going from company A with the fire-date as a property, connecting to all 500 individual persons. Another limitation: suppose we want to register the birthplace of a person, considering the constraint that every person can only have one place of birth. Since StartNode and EndNode are functional roles, we cannot guarantee that only one relationship is created from a person to a place with the relationship BIRTHPLACE. We would need a bijective (one on one) role instead of StartNode. EndNode is suitable for the other role in this case, because a place can be the birth place of many people. As a result of this limitation, the method Node#getSingleRelationship can throw an exception when more than one relationship is found for particular node with a certain RelationshipType and Direction. This could be avoided by making sure that getSingleRelationship can only be called by a RelationshipType of a certain kind. Graphically we could interpret the Neo4j Relationship as follows: Node(1)--(N)Relationship(N)--(1)Node while the following 15 binary case are not support: Node(N)--(N)Relationship(N)--(N)Node Node(N)--(N)Relationship(N)--(1)Node Node(N)--(N)Relationship(1)--(N)Node Node(N)--(N)Relationship(1)--(1)Node Node(N)--(1)Relationship(N)--(N)Node Node(N)--(1)Relationship(N)--(1)Node Node(N)--(1)Relationship(1)--(N)Node Node(N)--(1)Relationship(1)--(1)Node Node(1)--(N)Relationship(N)--(N)Node Node(1)--(N)Relationship(1)--(N)Node Node(1)--(N)Relationship(1)--(1)Node Node(1)--(1)Relationship(N)--(N)Node Node(1)--(1)Relationship(N)--(1)Node Node(1)--(1)Relationship(1)--(N)Node Node(1)--(1)Relationship(1)--(1)Node With some modifications, the Neo4j database could support all these cases, and in fact could support relationships of higher arity at the same time. To achieve that, only two levels of indirection need to be added to the Neo4j store. One of these levels of indirection would actually improve the performance of the database. Let's look at the current implementation of the Neo4j store, and in particular to two files: the node store and the relationship store. The node store contains records containing information pertaining to the node. The record contains an entry point to a linked list of Relationships (all relationships associated with this Node). Suggestion 1: Replace the entry point of the Relationships linked list with an entry point to a linked list of NodeRelationshipTypeRole records. The NodeRelationshipTypeRole record has the following layout: Id RelationshipTypeId RelationshipRoleId NextRelationship (linked list of relationships associated with a node that have the same RelationshipType and RelationshipRole) NextNodeRelationshipTypeRole (needed to iterate over the list) PreviousRelationshipTypeRole (needed when a NodeRelationshipTypeRole is removed) To make this possible, three additional steps need to be take: Create a RelationshipRole store with the following record layout: id RelationshipRoleName RelationshipRoleType (indicating if the role is unrestricted, injective, surjective or bijective) Create a RelationshipTypeRole store with the following record layout: id RelationshipRoleId NextRelationshipTypeRoleId PreviousRelationshipTypeRoleId Add an entry point to the first RelationshipTypeRole in the RelationshipType record. Due to the partitioning of Relationships per RelationshipType per RelationshipRole, we can easily find out if there is a Relationship with that RelationshipType and RelationshipRole associated with the Node, so it becomes feasible to support bijective RelationshipRoles as well. Graphically, the following binary relationships can be supported using this change: Node(1)--(N)Relationship(N)--(1)Node Node(1)--(N)Relationship(1)--(1)Node Node(1)--(1)Relationship(N)--(1)Node Node(1)--(1)Relationship(1)--(1)Node This suggestion not only support one more RelationshipRole, it also improves performance of the database. Right now when a relationship of a certain type and direction is sought, all relationships associated with a Node are fetched, and only those matching are returned. Partitioning Relationships per RelationshipType per RelationshipRole makes it possible to quickly find the linked list of requested Relationships. This suggestion solves the densely-connected-node problem, addressed in various other posts. Suggestion 2: Right now the Relationship record contains four fields to support two linked lists of relationships (one for each connected node), plus two fields containing the node id of the connected node. Replace the support for the two linked lists to a separate store NodeRelationshipTypeRoleRelationship, with the following record layout: id RelationshipId NextNodeRelationshipTypeRoleRelationshipId PreviousNodeRelationshipTypeRoleRelationshipId Each NodeRelationshipTypeRole as described under suggestion 1, no longer points to the first Relationship in the list, but to a NodeRelationshipTypeRoleRelationship, except in the case where the Role is surjective or bijective, in that case the RelationshipId can be stored. Further, replace the two node id fields in Relationship with an entry point to a list of RelationshipRole records, which have the following record layout: id NextRelationshipRoleId PreviousRelationshipRoleId NextRelationshipRoleNodeId The NextRelationshipRoleNodeId field points to a linked list of RelationshipRoleNode records, with the following record layout: id NodeId NextRelationshipRoleNodeId PreviousRelationshipRoleNodeId If the Role is functional or bijective, we don't have to store an entry to the RelationshipRoleNode record, but can instead just store the node id, because there can only be one node for that Role. With this change a Relationship can have an unlimited number of Roles attached to it and each Role can have an unlimited number of Nodes attached to it. If we want to fetch the Relationships given a certain Node, RelationshipType and Role the number of operations remains almost the same as under scenario 1, but instead of fetching the Relationship record with each iteration, we now fetch the RelationshipId with each iteration. Fetching the Relationship itself would require one more read operation. Fetching the nodes of a relationship becomes a bit more complicated. Once we have the relationship record in the current situation, we actually have the NodeIds of the associated Nodes for free. With this suggestions we have to do a little more work. We need to iterate over the RelationshipRole records (for the binary case this is at most two reads) and then iterate over the RelationshipRoleNodes. For the common binary cases, there will not be a RelationshipRoleNodes record, but simply a NodeId stored in the RelationshipRole record, reducing the number of reads to fetch the nodeIds to at most two. With this suggestion, the other twelve scenarios we can have with binary relations can be maintained, and we can even maintain any type of N-ary relationship. Suggestion 1 and 2 combined still give an average increase of performance (due to suggestion 1, though suggestion 2 adds very little additional overhead). Suggestion 3 Fuse the Node store and the Relationship store. With suggestion 1 and 2, the difference between a Node and a Relationship becomes minimal. A Node has an additional entry point for the NodeRelationshipTypeRole list, something a Relationship would benefit from having too. A Relationship has two additional fields: RelationshipTypeId and NextRelationshipRoleId. Those two fields can remain empty in the Node record if it relates to a non-Relationship node. Fusing the two stores makes Relationships to Relationships possible without having to create shadow nodes. With these suggestions implemented in core, much of the Enhanced API could be moved to the standard API. The impact on performance of the database is positive in the general case, slightly worse in some rare corner cases, and a huge improvement in some fairly common cases where a node has many relationships attached to it. I will make further suggestions for dealing with properties in a separate post. This is already enough to chew on for now. Niels _______________________________________________ Neo4j mailing list [email protected] https://lists.neo4j.org/mailman/listinfo/user

