Re: [Neo4j] traversing densely populated nodes

Niels Hoogeveen Fri, 01 Jul 2011 12:42:26 -0700

It seems I was looking in the wrong place. There actually is a Btree 
implementation that works on the basis of nodes instead of properties. I will 
test this implementation to see if it has the performance characteristics we 
are looking for.
Niels


> From: [email protected]
> To: [email protected]
> Date: Fri, 1 Jul 2011 15:24:03 +0200
> Subject: Re: [Neo4j] traversing densely populated nodes
> 
> 
> I delved a bit deeper into the BTree implementation in graph-collections and 
> learned that it is essentially a tree based version of a property container. 
> It is not possible to add nodes to the BTree, only primitive values. The 
> graph based Timeline uses the BTree to store some entries, but does so by 
> adding the ID of a node.
> The current implementation has a method addEntry(long key, Object value), 
> where value can be any of the types that can be stored in a property 
> container, while the key always has to be a long. 
> 
> This makes it impossible to use the current implementation of BTree for the 
> purpose of storing dense relationships, if we want to be able to simply 
> traverse from the tree root to the entries and vice versa.
> I will roll up my sleeves and refactor the current BTree implementation so it 
> can be used to store relationships to nodes. In phase 2 we can then find a 
> way to use keys other than long, but that is not required for the purpose of 
> storing dense relationships. Using the node ID of the node to store as the 
> key for the dense relationship BTree is almost ideal, since it allows us to 
> ask the question if there is a dense relationship for a given node. The only 
> issue here is that in practice node ID's will be monotonically increasing, 
> leading to many balancing acts while inserting in the BTree. I suggest we use 
> the reverse of the node ID, which still allows for easy lookups, but gives a 
> much more homogeneous filling pattern, leading to better insert times.
> Niels
> Niels
> 
> 
> > From: [email protected]
> > To: [email protected]
> > Date: Fri, 1 Jul 2011 02:06:24 +0200
> > Subject: Re: [Neo4j] traversing densely populated nodes
> > 
> > 
> > So far we have only considered the situation where one side of the 
> > relationship leads to a densely populated node, while the other side of the 
> > relationship leads to a non-densely populated node. This is likely to be 
> > the most prevalent situation (eg. category or class nodes), but it is 
> > conceivable that both sides of the relationship lead to densely populated 
> > nodes (eg. social networks with hugely popular users).
> > A single Btree per relationship type per direction may not suffice in such 
> > situation, but instead a double Btree should be used.
> > In the dense to non-dense scenario, the dense side will have a relationship 
> > to the root of the Btree for a particular combination of relationship type 
> > and direction, and the leave nodes of the Btree will have relations to the 
> > non-dense nodes.
> > A traversal listing all entries in the tree will look 
> > like:node-(TREE_ROOT)->node-(SUB_TREE)*->node-(KEY_ENTRY)->node
> > A traversal from an entry to to the related node is the 
> > reverse:node<-(KEY_ENTRY)-node<-(SUB_TREE)*-node<-(TREE_ROOT)-node
> > In the dense to dense scenario, both sides will have a relationship to the 
> > root of a Btree for a particular combination of relationship type and 
> > direction, and the leave nodes of both Btrees will have relations to one 
> > another.
> > The traversal in a dense to dense scenario 
> > is:node-(TREE_ROOT)->node-(SUB_TREE)*->node-(KEY_ENTRY)->node<-(KEY_ENTRY)-node<-(SUB_TREE)*-node<-(TREE_ROOT)-node
> > Note: Relationships that need to be traversed transitively are marked with 
> > a *
> > Niels
> > 
> > > From: [email protected]
> > > To: [email protected]
> > > Date: Fri, 1 Jul 2011 00:36:32 +0200
> > > Subject: Re: [Neo4j] traversing densely populated nodes
> > > 
> > > 
> > > The relationship expander approach could certainly work for me. I already 
> > > have a meta layer containing information about relationship types, so it 
> > > would be a minor upgrade to store the type of relationship expander per 
> > > relationship type and to access relationships through the expander 
> > > instead of the direct API. 
> > > What we need now is a solid balanced tree to partition those dense 
> > > relationships. The Btree implementation in graph-collections is the 
> > > probably our best option right now, though as I said earlier, it needs to 
> > > attention to make it production ready. We need to make sure the 
> > > implementation is thread safe and there are some other issues we need to 
> > > look into. For example, right now the order of the Btree is fixed at 9. 
> > > We could lift this value into the constructor of the Btree, so we can 
> > > create Btrees of various different orders and then do performance tests, 
> > > looking for the optimal order. The larger the order, the fewer the hops, 
> > > but too large an order may in fact create dense nodes, something we try 
> > > to avoid in the first place. 
> > > There will certainly be more issues we need to look into, but the 
> > > combination of the use of relationship expanders and a balanced tree 
> > > index looks very promising to me.
> > > As to the value for the Btree hash, how about trying the inverse of the 
> > > node id. Using the node id itself or a time stamp is actually 
> > > disadvantageous, because those values are in practice monotone (except 
> > > for the reuse of node ids) leading to lots of balancing acts. The inverse 
> > > of the node id is equally unique, but has a nicer distribution, making 
> > > inserts into the Btree probably a bit faster.
> > > Niels
> > > 
> > > 
> > > > Date: Thu, 30 Jun 2011 23:50:24 +0200
> > > > From: [email protected]
> > > > To: [email protected]
> > > > Subject: Re: [Neo4j] traversing densely populated nodes
> > > > 
> > > > In the amanzi-index I link all indexed nodes into the index tree, so
> > > > traversals are straight up the tree. Of course this also means that 
> > > > there
> > > > are at least as many relationships as indexed nodes.
> > > > 
> > > > I was reviewing Michaels code for the relationship expander, and think 
> > > > that
> > > > is a great idea, tranparently using an index instead of the normal
> > > > relationships API, and can imagine using the relationship expander to
> > > > instead traverse the BTree to the final relationship to the leaf nodes.
> > > > 
> > > > So if we imagine a BTree with perhaps 10 or 20 hops from the root to the
> > > > leaf node, the relationship expander Michael described would complete 
> > > > all
> > > > hops and return only the last relationship, giving the illusion of 
> > > > direct
> > > > connections from root to leaf. This would certainly perform well, 
> > > > especially
> > > > for cases where there are factors limiting the number of relationships 
> > > > we
> > > > want returned. I think the request for type and direction is the first
> > > > obvious case, but we could be even more explicit than that, if we pass
> > > > constraints based on the BTree's consistent hash.
> > > > 
> > > > On Thu, Jun 30, 2011 at 11:36 PM, Niels Hoogeveen 
> > > > <[email protected]
> > > > > wrote:
> > > > 
> > > > >
> > > > > In theory the approach I described earlier could work, though there 
> > > > > are
> > > > > some pitfalls to the current implementation that need ironing out 
> > > > > before
> > > > > this can become a recommended approach.
> > > > > The choice of Timeline instead of Btree may actually be the wrong 
> > > > > choice
> > > > > after all. I chose Timeline because of my familiarity with this 
> > > > > particular
> > > > > class, but its implementation may actually not be all that suitable 
> > > > > for this
> > > > > particular use case. This has to do with the fact that Timeline is 
> > > > > not just
> > > > >  a tree, but a list where entries with an interval of max. 1000 are 
> > > > > stored
> > > > > in a Btree index. This works reasonably well for a Timeline, but 
> > > > > makes the
> > > > > approach less ideal for storing dense relationships.
> > > > > The problem with the Timeline implementation is the ability to lookup 
> > > > > the
> > > > > tree root from a particular leave. In an ordinary Btree is would 
> > > > > simply be a
> > > > > traversal from the leave through the layers of block nodes to the 
> > > > > tree root.
> > > > > In Timeline the traversal will be different. It first has to move 
> > > > > through
> > > > > the Timeline list until it finds an entry that is stored in the Btree 
> > > > > (which
> > > > > worst case takes 1000 hops), and then it has to traverse the Btree up 
> > > > > to the
> > > > > tree root. To avoid this complicated traversal I ended up doing a 
> > > > > lookup
> > > > > through Lucene of the timeline URI (which is stored in all timeline 
> > > > > list
> > > > > entries). In fact I might as well have added the URI of the dense 
> > > > > node as a
> > > > > property and do the lookup through Lucene without the Timeline, it 
> > > > > just
> > > > > happens that I like the sort order of Timeline, making it a useful 
> > > > > approach
> > > > > anyway.
> > > > > I will experiment using Btree directly (without Timeline) and see if 
> > > > > that
> > > > > leads to a simpler and faster traversal from leave to root node.
> > > > > There is one more issue before this can become production ready. 
> > > > > Btree as
> > > > > it is implemented now is not thread safe (per the implementations 
> > > > > Javadocs),
> > > > > so it need some love and attention to make it work properly.
> > > > > Niels
> > > > >
> > > > > > Date: Thu, 30 Jun 2011 13:57:20 +0200
> > > > > > From: [email protected]
> > > > > > To: [email protected]
> > > > > > Subject: Re: [Neo4j] traversing densely populated nodes
> > > > > >
> > > > > > This topics has come up before, and the domain level solutions are
> > > > > usually
> > > > > > very similar, like Norbert's category/proxy nodes (to group by
> > > > > > type/direction) and Niels' TimeLineIndex (BTree). I wonder whether 
> > > > > > we can
> > > > > > build a generic user-level solution that can also be wrapped to 
> > > > > > appear as
> > > > > an
> > > > > > internal database solution?
> > > > > >
> > > > > > For example, consider Niels's solution of the TimeLine index. In 
> > > > > > this
> > > > > case
> > > > > > we group all the nodes based on a consistent hash. Usually the 
> > > > > > timeline
> > > > > > would use a timestamp, but really any reasonably variable property 
> > > > > > can
> > > > > do,
> > > > > > even the node-id itself. Then we have a BTree between the dense 
> > > > > > nodes and
> > > > > > the root node (node with too many relationships). How about this 
> > > > > > crazy
> > > > > idea,
> > > > > > create an API that mimics the normal node.getRelationship*() API, 
> > > > > > but
> > > > > > internally traverses the entire tree? And also for creating the
> > > > > > relationships? So for most cod we just do the usual
> > > > > > node.createRelationshipTo(node,type,direction) and 
> > > > > > node.traverse(...),
> > > > > but
> > > > > > internally we actually traverse the b-tree.
> > > > > >
> > > > > > This would solve the performance bottleneck being observed while 
> > > > > > keeping
> > > > > the
> > > > > > 'illusion' of directly connected relationships. The solution would 
> > > > > > be
> > > > > > implemented mostly in the application space, so will not need any 
> > > > > > changes
> > > > > to
> > > > > > the core database. I see this as being of the same kind of solution 
> > > > > > as
> > > > > the
> > > > > > auto-indexing. We setup some initial configuration that results in
> > > > > certain
> > > > > > structures being created on demand. With auto-indexing we are 
> > > > > > talking
> > > > > about
> > > > > > mostly automatically adding lucene indexes. With this idea we are 
> > > > > > talking
> > > > > > about automatically replacing direct relationships with b-trees to
> > > > > resolve a
> > > > > > specific performance issue.
> > > > > >
> > > > > > And when the relationship density is very low, if the b-tree is
> > > > > > auto-balancing, it could just be a direct relationship anyway.
> > > > > >
> > > > > > On Wed, Jun 29, 2011 at 6:56 PM, Agelos Pikoulas
> > > > > > <[email protected]>wrote:
> > > > > >
> > > > > > > My problem pattern is exactly the same as Niels's :
> > > > > > >
> > > > > > > A dense-node has millions of relations of a certain direction & 
> > > > > > > type,
> > > > > > > and only a few (sparse) relations of a different direction and 
> > > > > > > type.
> > > > > > > The traversing is usually following only those sparse 
> > > > > > > relationships on
> > > > > > > those
> > > > > > > dense-nodes.
> > > > > > >
> > > > > > > Now, even when traversing on these sparse relations, neo4j becomes
> > > > > > > extremely
> > > > > > > slow
> > > > > > > on a certainly non linear Order (the big cs O).
> > > > > > >
> > > > > > > Some tests I run (email me if u want the code) reveal that even 
> > > > > > > the
> > > > > number
> > > > > > > of those dense-nodes in the database greatly influences the 
> > > > > > > results.
> > > > > > >
> > > > > > > I just reported to Michael the runs with the latest M05 snapshot, 
> > > > > > > which
> > > > > are
> > > > > > > not very positive...
> > > > > > > I have suggested an (auto) indexing of relationship types / 
> > > > > > > direction
> > > > > that
> > > > > > > is used by traversing frameworks,
> > > > > > > but I ain't no graphdb-engine expert :-(
> > > > > > >
> > > > > > > A'
> > > > > > >
> > > > > > >
> > > > > > > Message: 5
> > > > > > > > Date: Wed, 29 Jun 2011 18:19:10 +0200
> > > > > > > > From: Niels Hoogeveen <[email protected]>
> > > > > > > > Subject: Re: [Neo4j] traversing densely populated nodes
> > > > > > > > To: <[email protected]>
> > > > > > > > Message-ID: <[email protected]>
> > > > > > > > Content-Type: text/plain; charset="iso-8859-1"
> > > > > > > >
> > > > > > > >
> > > > > > > > Michael,
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > The issue I am refering to does not pertain to traversing many
> > > > > relations
> > > > > > > at
> > > > > > > > once
> > > > > > > >
> > > > > > > > but the impact many relationship of one type have on 
> > > > > > > > relationships
> > > > > > > >
> > > > > > > > of another type on the same node.
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > Example:
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > A topic class has 2 million outgoing relationships of type
> > > > > "HAS_INSTANCE"
> > > > > > > > and
> > > > > > > >
> > > > > > > > has 3 outgoing relationships of type "SUB_CLASS_OF".
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > Fetching the 3 relations of type "SUB_CLASS_OF" takes very long,
> > > > > > > >
> > > > > > > > I presume due to the presence of the 2 million other 
> > > > > > > > relationships.
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > I have no need to ever fetch the "HAS_INSTANCE" relationships 
> > > > > > > > from
> > > > > > > >
> > > > > > > > the topic node. That relation is always traversed from the other
> > > > > > > direction.
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > I do want to know the class of a topic instance, leading to he 
> > > > > > > > topic
> > > > > > > class,
> > > > > > > >
> > > > > > > > but have no real interest ever to traverse all topic instance 
> > > > > > > > from
> > > > >  the
> > > > > > > > topic
> > > > > > > >
> > > > > > > > class (at least not directly.. i do want to know the most recent
> > > > > > > addition,
> > > > > > > >
> > > > > > > > and that's what I use the timeline index for).
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > Niels
> > > > > > > >
> > > > > > > >
> > > > > > > > > From: [email protected]
> > > > > > > > > Date: Wed, 29 Jun 2011 17:50:08 +0200
> > > > > > > > > To: [email protected]
> > > > > > > > > Subject: Re: [Neo4j] traversing densely populated nodes
> > > > > > > > >
> > > > > > > > > I think this is the same problem that Angelos is facing, we 
> > > > > > > > > are
> > > > > > > currently
> > > > > > > > evaluating options to improve the performance on those highly
> > > > > connected
> > > > > > > > supernodes.
> > > > > > > > >
> > > > > > > > > A traditional option is really to split them into group or 
> > > > > > > > > even
> > > > > kind of
> > > > > > > > shard their relationships to a second layer.
> > > > > > > > >
> > > > > > > > > We're looking into storage improvement options as well as
> > > > > modifications
> > > > > > > > to retrieval of that many relationships at once.
> > > > > > > > >
> > > > > > > > > Cheers
> > > > > > > > >
> > > > > > > > > Michael
> > > > > > > >
> > > > > > > _______________________________________________
> > > > > > > Neo4j mailing list
> > > > > > > [email protected]
> > > > > > > https://lists.neo4j.org/mailman/listinfo/user
> > > > > > >
> > > > > > _______________________________________________
> > > > > > Neo4j mailing list
> > > > > > [email protected]
> > > > > > https://lists.neo4j.org/mailman/listinfo/user
> > > > >
> > > > > _______________________________________________
> > > > > Neo4j mailing list
> > > > > [email protected]
> > > > > https://lists.neo4j.org/mailman/listinfo/user
> > > > >
> > > > _______________________________________________
> > > > Neo4j mailing list
> > > > [email protected]
> > > > https://lists.neo4j.org/mailman/listinfo/user
> > >                                     
> > > _______________________________________________
> > > Neo4j mailing list
> > > [email protected]
> > > https://lists.neo4j.org/mailman/listinfo/user
> >                                       
> > _______________________________________________
> > Neo4j mailing list
> > [email protected]
> > https://lists.neo4j.org/mailman/listinfo/user
>                                         
> _______________________________________________
> Neo4j mailing list
> [email protected]
> https://lists.neo4j.org/mailman/listinfo/user
                                          
_______________________________________________
Neo4j mailing list
[email protected]
https://lists.neo4j.org/mailman/listinfo/user

Re: [Neo4j] traversing densely populated nodes

Reply via email to