Re: [Neo4j] traversing densely populated nodes

Niels Hoogeveen Fri, 01 Jul 2011 06:24:19 -0700

I delved a bit deeper into the BTree implementation in graph-collections and 
learned that it is essentially a tree based version of a property container. It 
is not possible to add nodes to the BTree, only primitive values. The graph 
based Timeline uses the BTree to store some entries, but does so by adding the 
ID of a node.
The current implementation has a method addEntry(long key, Object value), where 
value can be any of the types that can be stored in a property container, while 
the key always has to be a long.


This makes it impossible to use the current implementation of BTree for the 
purpose of storing dense relationships, if we want to be able to simply 
traverse from the tree root to the entries and vice versa.
I will roll up my sleeves and refactor the current BTree implementation so it 
can be used to store relationships to nodes. In phase 2 we can then find a way 
to use keys other than long, but that is not required for the purpose of 
storing dense relationships. Using the node ID of the node to store as the key 
for the dense relationship BTree is almost ideal, since it allows us to ask the 
question if there is a dense relationship for a given node. The only issue here 
is that in practice node ID's will be monotonically increasing, leading to many 
balancing acts while inserting in the BTree. I suggest we use the reverse of 
the node ID, which still allows for easy lookups, but gives a much more 
homogeneous filling pattern, leading to better insert times.
Niels
Niels


> From: [email protected]
> To: [email protected]
> Date: Fri, 1 Jul 2011 02:06:24 +0200
> Subject: Re: [Neo4j] traversing densely populated nodes
> 
> 
> So far we have only considered the situation where one side of the 
> relationship leads to a densely populated node, while the other side of the 
> relationship leads to a non-densely populated node. This is likely to be the 
> most prevalent situation (eg. category or class nodes), but it is conceivable 
> that both sides of the relationship lead to densely populated nodes (eg. 
> social networks with hugely popular users).
> A single Btree per relationship type per direction may not suffice in such 
> situation, but instead a double Btree should be used.
> In the dense to non-dense scenario, the dense side will have a relationship 
> to the root of the Btree for a particular combination of relationship type 
> and direction, and the leave nodes of the Btree will have relations to the 
> non-dense nodes.
> A traversal listing all entries in the tree will look 
> like:node-(TREE_ROOT)->node-(SUB_TREE)*->node-(KEY_ENTRY)->node
> A traversal from an entry to to the related node is the 
> reverse:node<-(KEY_ENTRY)-node<-(SUB_TREE)*-node<-(TREE_ROOT)-node
> In the dense to dense scenario, both sides will have a relationship to the 
> root of a Btree for a particular combination of relationship type and 
> direction, and the leave nodes of both Btrees will have relations to one 
> another.
> The traversal in a dense to dense scenario 
> is:node-(TREE_ROOT)->node-(SUB_TREE)*->node-(KEY_ENTRY)->node<-(KEY_ENTRY)-node<-(SUB_TREE)*-node<-(TREE_ROOT)-node
> Note: Relationships that need to be traversed transitively are marked with a *
> Niels
> 
> > From: [email protected]
> > To: [email protected]
> > Date: Fri, 1 Jul 2011 00:36:32 +0200
> > Subject: Re: [Neo4j] traversing densely populated nodes
> > 
> > 
> > The relationship expander approach could certainly work for me. I already 
> > have a meta layer containing information about relationship types, so it 
> > would be a minor upgrade to store the type of relationship expander per 
> > relationship type and to access relationships through the expander instead 
> > of the direct API. 
> > What we need now is a solid balanced tree to partition those dense 
> > relationships. The Btree implementation in graph-collections is the 
> > probably our best option right now, though as I said earlier, it needs to 
> > attention to make it production ready. We need to make sure the 
> > implementation is thread safe and there are some other issues we need to 
> > look into. For example, right now the order of the Btree is fixed at 9. We 
> > could lift this value into the constructor of the Btree, so we can create 
> > Btrees of various different orders and then do performance tests, looking 
> > for the optimal order. The larger the order, the fewer the hops, but too 
> > large an order may in fact create dense nodes, something we try to avoid in 
> > the first place. 
> > There will certainly be more issues we need to look into, but the 
> > combination of the use of relationship expanders and a balanced tree index 
> > looks very promising to me.
> > As to the value for the Btree hash, how about trying the inverse of the 
> > node id. Using the node id itself or a time stamp is actually 
> > disadvantageous, because those values are in practice monotone (except for 
> > the reuse of node ids) leading to lots of balancing acts. The inverse of 
> > the node id is equally unique, but has a nicer distribution, making inserts 
> > into the Btree probably a bit faster.
> > Niels
> > 
> > 
> > > Date: Thu, 30 Jun 2011 23:50:24 +0200
> > > From: [email protected]
> > > To: [email protected]
> > > Subject: Re: [Neo4j] traversing densely populated nodes
> > > 
> > > In the amanzi-index I link all indexed nodes into the index tree, so
> > > traversals are straight up the tree. Of course this also means that there
> > > are at least as many relationships as indexed nodes.
> > > 
> > > I was reviewing Michaels code for the relationship expander, and think 
> > > that
> > > is a great idea, tranparently using an index instead of the normal
> > > relationships API, and can imagine using the relationship expander to
> > > instead traverse the BTree to the final relationship to the leaf nodes.
> > > 
> > > So if we imagine a BTree with perhaps 10 or 20 hops from the root to the
> > > leaf node, the relationship expander Michael described would complete all
> > > hops and return only the last relationship, giving the illusion of direct
> > > connections from root to leaf. This would certainly perform well, 
> > > especially
> > > for cases where there are factors limiting the number of relationships we
> > > want returned. I think the request for type and direction is the first
> > > obvious case, but we could be even more explicit than that, if we pass
> > > constraints based on the BTree's consistent hash.
> > > 
> > > On Thu, Jun 30, 2011 at 11:36 PM, Niels Hoogeveen 
> > > <[email protected]
> > > > wrote:
> > > 
> > > >
> > > > In theory the approach I described earlier could work, though there are
> > > > some pitfalls to the current implementation that need ironing out before
> > > > this can become a recommended approach.
> > > > The choice of Timeline instead of Btree may actually be the wrong choice
> > > > after all. I chose Timeline because of my familiarity with this 
> > > > particular
> > > > class, but its implementation may actually not be all that suitable for 
> > > > this
> > > > particular use case. This has to do with the fact that Timeline is not 
> > > > just
> > > >  a tree, but a list where entries with an interval of max. 1000 are 
> > > > stored
> > > > in a Btree index. This works reasonably well for a Timeline, but makes 
> > > > the
> > > > approach less ideal for storing dense relationships.
> > > > The problem with the Timeline implementation is the ability to lookup 
> > > > the
> > > > tree root from a particular leave. In an ordinary Btree is would simply 
> > > > be a
> > > > traversal from the leave through the layers of block nodes to the tree 
> > > > root.
> > > > In Timeline the traversal will be different. It first has to move 
> > > > through
> > > > the Timeline list until it finds an entry that is stored in the Btree 
> > > > (which
> > > > worst case takes 1000 hops), and then it has to traverse the Btree up 
> > > > to the
> > > > tree root. To avoid this complicated traversal I ended up doing a lookup
> > > > through Lucene of the timeline URI (which is stored in all timeline list
> > > > entries). In fact I might as well have added the URI of the dense node 
> > > > as a
> > > > property and do the lookup through Lucene without the Timeline, it just
> > > > happens that I like the sort order of Timeline, making it a useful 
> > > > approach
> > > > anyway.
> > > > I will experiment using Btree directly (without Timeline) and see if 
> > > > that
> > > > leads to a simpler and faster traversal from leave to root node.
> > > > There is one more issue before this can become production ready. Btree 
> > > > as
> > > > it is implemented now is not thread safe (per the implementations 
> > > > Javadocs),
> > > > so it need some love and attention to make it work properly.
> > > > Niels
> > > >
> > > > > Date: Thu, 30 Jun 2011 13:57:20 +0200
> > > > > From: [email protected]
> > > > > To: [email protected]
> > > > > Subject: Re: [Neo4j] traversing densely populated nodes
> > > > >
> > > > > This topics has come up before, and the domain level solutions are
> > > > usually
> > > > > very similar, like Norbert's category/proxy nodes (to group by
> > > > > type/direction) and Niels' TimeLineIndex (BTree). I wonder whether we 
> > > > > can
> > > > > build a generic user-level solution that can also be wrapped to 
> > > > > appear as
> > > > an
> > > > > internal database solution?
> > > > >
> > > > > For example, consider Niels's solution of the TimeLine index. In this
> > > > case
> > > > > we group all the nodes based on a consistent hash. Usually the 
> > > > > timeline
> > > > > would use a timestamp, but really any reasonably variable property can
> > > > do,
> > > > > even the node-id itself. Then we have a BTree between the dense nodes 
> > > > > and
> > > > > the root node (node with too many relationships). How about this crazy
> > > > idea,
> > > > > create an API that mimics the normal node.getRelationship*() API, but
> > > > > internally traverses the entire tree? And also for creating the
> > > > > relationships? So for most cod we just do the usual
> > > > > node.createRelationshipTo(node,type,direction) and node.traverse(...),
> > > > but
> > > > > internally we actually traverse the b-tree.
> > > > >
> > > > > This would solve the performance bottleneck being observed while 
> > > > > keeping
> > > > the
> > > > > 'illusion' of directly connected relationships. The solution would be
> > > > > implemented mostly in the application space, so will not need any 
> > > > > changes
> > > > to
> > > > > the core database. I see this as being of the same kind of solution as
> > > > the
> > > > > auto-indexing. We setup some initial configuration that results in
> > > > certain
> > > > > structures being created on demand. With auto-indexing we are talking
> > > > about
> > > > > mostly automatically adding lucene indexes. With this idea we are 
> > > > > talking
> > > > > about automatically replacing direct relationships with b-trees to
> > > > resolve a
> > > > > specific performance issue.
> > > > >
> > > > > And when the relationship density is very low, if the b-tree is
> > > > > auto-balancing, it could just be a direct relationship anyway.
> > > > >
> > > > > On Wed, Jun 29, 2011 at 6:56 PM, Agelos Pikoulas
> > > > > <[email protected]>wrote:
> > > > >
> > > > > > My problem pattern is exactly the same as Niels's :
> > > > > >
> > > > > > A dense-node has millions of relations of a certain direction & 
> > > > > > type,
> > > > > > and only a few (sparse) relations of a different direction and type.
> > > > > > The traversing is usually following only those sparse relationships 
> > > > > > on
> > > > > > those
> > > > > > dense-nodes.
> > > > > >
> > > > > > Now, even when traversing on these sparse relations, neo4j becomes
> > > > > > extremely
> > > > > > slow
> > > > > > on a certainly non linear Order (the big cs O).
> > > > > >
> > > > > > Some tests I run (email me if u want the code) reveal that even the
> > > > number
> > > > > > of those dense-nodes in the database greatly influences the results.
> > > > > >
> > > > > > I just reported to Michael the runs with the latest M05 snapshot, 
> > > > > > which
> > > > are
> > > > > > not very positive...
> > > > > > I have suggested an (auto) indexing of relationship types / 
> > > > > > direction
> > > > that
> > > > > > is used by traversing frameworks,
> > > > > > but I ain't no graphdb-engine expert :-(
> > > > > >
> > > > > > A'
> > > > > >
> > > > > >
> > > > > > Message: 5
> > > > > > > Date: Wed, 29 Jun 2011 18:19:10 +0200
> > > > > > > From: Niels Hoogeveen <[email protected]>
> > > > > > > Subject: Re: [Neo4j] traversing densely populated nodes
> > > > > > > To: <[email protected]>
> > > > > > > Message-ID: <[email protected]>
> > > > > > > Content-Type: text/plain; charset="iso-8859-1"
> > > > > > >
> > > > > > >
> > > > > > > Michael,
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > The issue I am refering to does not pertain to traversing many
> > > > relations
> > > > > > at
> > > > > > > once
> > > > > > >
> > > > > > > but the impact many relationship of one type have on relationships
> > > > > > >
> > > > > > > of another type on the same node.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > Example:
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > A topic class has 2 million outgoing relationships of type
> > > > "HAS_INSTANCE"
> > > > > > > and
> > > > > > >
> > > > > > > has 3 outgoing relationships of type "SUB_CLASS_OF".
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > Fetching the 3 relations of type "SUB_CLASS_OF" takes very long,
> > > > > > >
> > > > > > > I presume due to the presence of the 2 million other 
> > > > > > > relationships.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > I have no need to ever fetch the "HAS_INSTANCE" relationships from
> > > > > > >
> > > > > > > the topic node. That relation is always traversed from the other
> > > > > > direction.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > I do want to know the class of a topic instance, leading to he 
> > > > > > > topic
> > > > > > class,
> > > > > > >
> > > > > > > but have no real interest ever to traverse all topic instance from
> > > >  the
> > > > > > > topic
> > > > > > >
> > > > > > > class (at least not directly.. i do want to know the most recent
> > > > > > addition,
> > > > > > >
> > > > > > > and that's what I use the timeline index for).
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > Niels
> > > > > > >
> > > > > > >
> > > > > > > > From: [email protected]
> > > > > > > > Date: Wed, 29 Jun 2011 17:50:08 +0200
> > > > > > > > To: [email protected]
> > > > > > > > Subject: Re: [Neo4j] traversing densely populated nodes
> > > > > > > >
> > > > > > > > I think this is the same problem that Angelos is facing, we are
> > > > > > currently
> > > > > > > evaluating options to improve the performance on those highly
> > > > connected
> > > > > > > supernodes.
> > > > > > > >
> > > > > > > > A traditional option is really to split them into group or even
> > > > kind of
> > > > > > > shard their relationships to a second layer.
> > > > > > > >
> > > > > > > > We're looking into storage improvement options as well as
> > > > modifications
> > > > > > > to retrieval of that many relationships at once.
> > > > > > > >
> > > > > > > > Cheers
> > > > > > > >
> > > > > > > > Michael
> > > > > > >
> > > > > > _______________________________________________
> > > > > > Neo4j mailing list
> > > > > > [email protected]
> > > > > > https://lists.neo4j.org/mailman/listinfo/user
> > > > > >
> > > > > _______________________________________________
> > > > > Neo4j mailing list
> > > > > [email protected]
> > > > > https://lists.neo4j.org/mailman/listinfo/user
> > > >
> > > > _______________________________________________
> > > > Neo4j mailing list
> > > > [email protected]
> > > > https://lists.neo4j.org/mailman/listinfo/user
> > > >
> > > _______________________________________________
> > > Neo4j mailing list
> > > [email protected]
> > > https://lists.neo4j.org/mailman/listinfo/user
> >                                       
> > _______________________________________________
> > Neo4j mailing list
> > [email protected]
> > https://lists.neo4j.org/mailman/listinfo/user
>                                         
> _______________________________________________
> Neo4j mailing list
> [email protected]
> https://lists.neo4j.org/mailman/listinfo/user
                                          
_______________________________________________
Neo4j mailing list
[email protected]
https://lists.neo4j.org/mailman/listinfo/user

Re: [Neo4j] traversing densely populated nodes

Reply via email to