It seems I was looking in the wrong place. There actually is a Btree implementation that works on the basis of nodes instead of properties. I will test this implementation to see if it has the performance characteristics we are looking for. Niels
> From: [email protected] > To: [email protected] > Date: Fri, 1 Jul 2011 15:24:03 +0200 > Subject: Re: [Neo4j] traversing densely populated nodes > > > I delved a bit deeper into the BTree implementation in graph-collections and > learned that it is essentially a tree based version of a property container. > It is not possible to add nodes to the BTree, only primitive values. The > graph based Timeline uses the BTree to store some entries, but does so by > adding the ID of a node. > The current implementation has a method addEntry(long key, Object value), > where value can be any of the types that can be stored in a property > container, while the key always has to be a long. > > This makes it impossible to use the current implementation of BTree for the > purpose of storing dense relationships, if we want to be able to simply > traverse from the tree root to the entries and vice versa. > I will roll up my sleeves and refactor the current BTree implementation so it > can be used to store relationships to nodes. In phase 2 we can then find a > way to use keys other than long, but that is not required for the purpose of > storing dense relationships. Using the node ID of the node to store as the > key for the dense relationship BTree is almost ideal, since it allows us to > ask the question if there is a dense relationship for a given node. The only > issue here is that in practice node ID's will be monotonically increasing, > leading to many balancing acts while inserting in the BTree. I suggest we use > the reverse of the node ID, which still allows for easy lookups, but gives a > much more homogeneous filling pattern, leading to better insert times. > Niels > Niels > > > > From: [email protected] > > To: [email protected] > > Date: Fri, 1 Jul 2011 02:06:24 +0200 > > Subject: Re: [Neo4j] traversing densely populated nodes > > > > > > So far we have only considered the situation where one side of the > > relationship leads to a densely populated node, while the other side of the > > relationship leads to a non-densely populated node. This is likely to be > > the most prevalent situation (eg. category or class nodes), but it is > > conceivable that both sides of the relationship lead to densely populated > > nodes (eg. social networks with hugely popular users). > > A single Btree per relationship type per direction may not suffice in such > > situation, but instead a double Btree should be used. > > In the dense to non-dense scenario, the dense side will have a relationship > > to the root of the Btree for a particular combination of relationship type > > and direction, and the leave nodes of the Btree will have relations to the > > non-dense nodes. > > A traversal listing all entries in the tree will look > > like:node-(TREE_ROOT)->node-(SUB_TREE)*->node-(KEY_ENTRY)->node > > A traversal from an entry to to the related node is the > > reverse:node<-(KEY_ENTRY)-node<-(SUB_TREE)*-node<-(TREE_ROOT)-node > > In the dense to dense scenario, both sides will have a relationship to the > > root of a Btree for a particular combination of relationship type and > > direction, and the leave nodes of both Btrees will have relations to one > > another. > > The traversal in a dense to dense scenario > > is:node-(TREE_ROOT)->node-(SUB_TREE)*->node-(KEY_ENTRY)->node<-(KEY_ENTRY)-node<-(SUB_TREE)*-node<-(TREE_ROOT)-node > > Note: Relationships that need to be traversed transitively are marked with > > a * > > Niels > > > > > From: [email protected] > > > To: [email protected] > > > Date: Fri, 1 Jul 2011 00:36:32 +0200 > > > Subject: Re: [Neo4j] traversing densely populated nodes > > > > > > > > > The relationship expander approach could certainly work for me. I already > > > have a meta layer containing information about relationship types, so it > > > would be a minor upgrade to store the type of relationship expander per > > > relationship type and to access relationships through the expander > > > instead of the direct API. > > > What we need now is a solid balanced tree to partition those dense > > > relationships. The Btree implementation in graph-collections is the > > > probably our best option right now, though as I said earlier, it needs to > > > attention to make it production ready. We need to make sure the > > > implementation is thread safe and there are some other issues we need to > > > look into. For example, right now the order of the Btree is fixed at 9. > > > We could lift this value into the constructor of the Btree, so we can > > > create Btrees of various different orders and then do performance tests, > > > looking for the optimal order. The larger the order, the fewer the hops, > > > but too large an order may in fact create dense nodes, something we try > > > to avoid in the first place. > > > There will certainly be more issues we need to look into, but the > > > combination of the use of relationship expanders and a balanced tree > > > index looks very promising to me. > > > As to the value for the Btree hash, how about trying the inverse of the > > > node id. Using the node id itself or a time stamp is actually > > > disadvantageous, because those values are in practice monotone (except > > > for the reuse of node ids) leading to lots of balancing acts. The inverse > > > of the node id is equally unique, but has a nicer distribution, making > > > inserts into the Btree probably a bit faster. > > > Niels > > > > > > > > > > Date: Thu, 30 Jun 2011 23:50:24 +0200 > > > > From: [email protected] > > > > To: [email protected] > > > > Subject: Re: [Neo4j] traversing densely populated nodes > > > > > > > > In the amanzi-index I link all indexed nodes into the index tree, so > > > > traversals are straight up the tree. Of course this also means that > > > > there > > > > are at least as many relationships as indexed nodes. > > > > > > > > I was reviewing Michaels code for the relationship expander, and think > > > > that > > > > is a great idea, tranparently using an index instead of the normal > > > > relationships API, and can imagine using the relationship expander to > > > > instead traverse the BTree to the final relationship to the leaf nodes. > > > > > > > > So if we imagine a BTree with perhaps 10 or 20 hops from the root to the > > > > leaf node, the relationship expander Michael described would complete > > > > all > > > > hops and return only the last relationship, giving the illusion of > > > > direct > > > > connections from root to leaf. This would certainly perform well, > > > > especially > > > > for cases where there are factors limiting the number of relationships > > > > we > > > > want returned. I think the request for type and direction is the first > > > > obvious case, but we could be even more explicit than that, if we pass > > > > constraints based on the BTree's consistent hash. > > > > > > > > On Thu, Jun 30, 2011 at 11:36 PM, Niels Hoogeveen > > > > <[email protected] > > > > > wrote: > > > > > > > > > > > > > > In theory the approach I described earlier could work, though there > > > > > are > > > > > some pitfalls to the current implementation that need ironing out > > > > > before > > > > > this can become a recommended approach. > > > > > The choice of Timeline instead of Btree may actually be the wrong > > > > > choice > > > > > after all. I chose Timeline because of my familiarity with this > > > > > particular > > > > > class, but its implementation may actually not be all that suitable > > > > > for this > > > > > particular use case. This has to do with the fact that Timeline is > > > > > not just > > > > > a tree, but a list where entries with an interval of max. 1000 are > > > > > stored > > > > > in a Btree index. This works reasonably well for a Timeline, but > > > > > makes the > > > > > approach less ideal for storing dense relationships. > > > > > The problem with the Timeline implementation is the ability to lookup > > > > > the > > > > > tree root from a particular leave. In an ordinary Btree is would > > > > > simply be a > > > > > traversal from the leave through the layers of block nodes to the > > > > > tree root. > > > > > In Timeline the traversal will be different. It first has to move > > > > > through > > > > > the Timeline list until it finds an entry that is stored in the Btree > > > > > (which > > > > > worst case takes 1000 hops), and then it has to traverse the Btree up > > > > > to the > > > > > tree root. To avoid this complicated traversal I ended up doing a > > > > > lookup > > > > > through Lucene of the timeline URI (which is stored in all timeline > > > > > list > > > > > entries). In fact I might as well have added the URI of the dense > > > > > node as a > > > > > property and do the lookup through Lucene without the Timeline, it > > > > > just > > > > > happens that I like the sort order of Timeline, making it a useful > > > > > approach > > > > > anyway. > > > > > I will experiment using Btree directly (without Timeline) and see if > > > > > that > > > > > leads to a simpler and faster traversal from leave to root node. > > > > > There is one more issue before this can become production ready. > > > > > Btree as > > > > > it is implemented now is not thread safe (per the implementations > > > > > Javadocs), > > > > > so it need some love and attention to make it work properly. > > > > > Niels > > > > > > > > > > > Date: Thu, 30 Jun 2011 13:57:20 +0200 > > > > > > From: [email protected] > > > > > > To: [email protected] > > > > > > Subject: Re: [Neo4j] traversing densely populated nodes > > > > > > > > > > > > This topics has come up before, and the domain level solutions are > > > > > usually > > > > > > very similar, like Norbert's category/proxy nodes (to group by > > > > > > type/direction) and Niels' TimeLineIndex (BTree). I wonder whether > > > > > > we can > > > > > > build a generic user-level solution that can also be wrapped to > > > > > > appear as > > > > > an > > > > > > internal database solution? > > > > > > > > > > > > For example, consider Niels's solution of the TimeLine index. In > > > > > > this > > > > > case > > > > > > we group all the nodes based on a consistent hash. Usually the > > > > > > timeline > > > > > > would use a timestamp, but really any reasonably variable property > > > > > > can > > > > > do, > > > > > > even the node-id itself. Then we have a BTree between the dense > > > > > > nodes and > > > > > > the root node (node with too many relationships). How about this > > > > > > crazy > > > > > idea, > > > > > > create an API that mimics the normal node.getRelationship*() API, > > > > > > but > > > > > > internally traverses the entire tree? And also for creating the > > > > > > relationships? So for most cod we just do the usual > > > > > > node.createRelationshipTo(node,type,direction) and > > > > > > node.traverse(...), > > > > > but > > > > > > internally we actually traverse the b-tree. > > > > > > > > > > > > This would solve the performance bottleneck being observed while > > > > > > keeping > > > > > the > > > > > > 'illusion' of directly connected relationships. The solution would > > > > > > be > > > > > > implemented mostly in the application space, so will not need any > > > > > > changes > > > > > to > > > > > > the core database. I see this as being of the same kind of solution > > > > > > as > > > > > the > > > > > > auto-indexing. We setup some initial configuration that results in > > > > > certain > > > > > > structures being created on demand. With auto-indexing we are > > > > > > talking > > > > > about > > > > > > mostly automatically adding lucene indexes. With this idea we are > > > > > > talking > > > > > > about automatically replacing direct relationships with b-trees to > > > > > resolve a > > > > > > specific performance issue. > > > > > > > > > > > > And when the relationship density is very low, if the b-tree is > > > > > > auto-balancing, it could just be a direct relationship anyway. > > > > > > > > > > > > On Wed, Jun 29, 2011 at 6:56 PM, Agelos Pikoulas > > > > > > <[email protected]>wrote: > > > > > > > > > > > > > My problem pattern is exactly the same as Niels's : > > > > > > > > > > > > > > A dense-node has millions of relations of a certain direction & > > > > > > > type, > > > > > > > and only a few (sparse) relations of a different direction and > > > > > > > type. > > > > > > > The traversing is usually following only those sparse > > > > > > > relationships on > > > > > > > those > > > > > > > dense-nodes. > > > > > > > > > > > > > > Now, even when traversing on these sparse relations, neo4j becomes > > > > > > > extremely > > > > > > > slow > > > > > > > on a certainly non linear Order (the big cs O). > > > > > > > > > > > > > > Some tests I run (email me if u want the code) reveal that even > > > > > > > the > > > > > number > > > > > > > of those dense-nodes in the database greatly influences the > > > > > > > results. > > > > > > > > > > > > > > I just reported to Michael the runs with the latest M05 snapshot, > > > > > > > which > > > > > are > > > > > > > not very positive... > > > > > > > I have suggested an (auto) indexing of relationship types / > > > > > > > direction > > > > > that > > > > > > > is used by traversing frameworks, > > > > > > > but I ain't no graphdb-engine expert :-( > > > > > > > > > > > > > > A' > > > > > > > > > > > > > > > > > > > > > Message: 5 > > > > > > > > Date: Wed, 29 Jun 2011 18:19:10 +0200 > > > > > > > > From: Niels Hoogeveen <[email protected]> > > > > > > > > Subject: Re: [Neo4j] traversing densely populated nodes > > > > > > > > To: <[email protected]> > > > > > > > > Message-ID: <[email protected]> > > > > > > > > Content-Type: text/plain; charset="iso-8859-1" > > > > > > > > > > > > > > > > > > > > > > > > Michael, > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > The issue I am refering to does not pertain to traversing many > > > > > relations > > > > > > > at > > > > > > > > once > > > > > > > > > > > > > > > > but the impact many relationship of one type have on > > > > > > > > relationships > > > > > > > > > > > > > > > > of another type on the same node. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Example: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > A topic class has 2 million outgoing relationships of type > > > > > "HAS_INSTANCE" > > > > > > > > and > > > > > > > > > > > > > > > > has 3 outgoing relationships of type "SUB_CLASS_OF". > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Fetching the 3 relations of type "SUB_CLASS_OF" takes very long, > > > > > > > > > > > > > > > > I presume due to the presence of the 2 million other > > > > > > > > relationships. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I have no need to ever fetch the "HAS_INSTANCE" relationships > > > > > > > > from > > > > > > > > > > > > > > > > the topic node. That relation is always traversed from the other > > > > > > > direction. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I do want to know the class of a topic instance, leading to he > > > > > > > > topic > > > > > > > class, > > > > > > > > > > > > > > > > but have no real interest ever to traverse all topic instance > > > > > > > > from > > > > > the > > > > > > > > topic > > > > > > > > > > > > > > > > class (at least not directly.. i do want to know the most recent > > > > > > > addition, > > > > > > > > > > > > > > > > and that's what I use the timeline index for). > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Niels > > > > > > > > > > > > > > > > > > > > > > > > > From: [email protected] > > > > > > > > > Date: Wed, 29 Jun 2011 17:50:08 +0200 > > > > > > > > > To: [email protected] > > > > > > > > > Subject: Re: [Neo4j] traversing densely populated nodes > > > > > > > > > > > > > > > > > > I think this is the same problem that Angelos is facing, we > > > > > > > > > are > > > > > > > currently > > > > > > > > evaluating options to improve the performance on those highly > > > > > connected > > > > > > > > supernodes. > > > > > > > > > > > > > > > > > > A traditional option is really to split them into group or > > > > > > > > > even > > > > > kind of > > > > > > > > shard their relationships to a second layer. > > > > > > > > > > > > > > > > > > We're looking into storage improvement options as well as > > > > > modifications > > > > > > > > to retrieval of that many relationships at once. > > > > > > > > > > > > > > > > > > Cheers > > > > > > > > > > > > > > > > > > Michael > > > > > > > > > > > > > > > _______________________________________________ > > > > > > > Neo4j mailing list > > > > > > > [email protected] > > > > > > > https://lists.neo4j.org/mailman/listinfo/user > > > > > > > > > > > > > _______________________________________________ > > > > > > Neo4j mailing list > > > > > > [email protected] > > > > > > https://lists.neo4j.org/mailman/listinfo/user > > > > > > > > > > _______________________________________________ > > > > > Neo4j mailing list > > > > > [email protected] > > > > > https://lists.neo4j.org/mailman/listinfo/user > > > > > > > > > _______________________________________________ > > > > Neo4j mailing list > > > > [email protected] > > > > https://lists.neo4j.org/mailman/listinfo/user > > > > > > _______________________________________________ > > > Neo4j mailing list > > > [email protected] > > > https://lists.neo4j.org/mailman/listinfo/user > > > > _______________________________________________ > > Neo4j mailing list > > [email protected] > > https://lists.neo4j.org/mailman/listinfo/user > > _______________________________________________ > Neo4j mailing list > [email protected] > https://lists.neo4j.org/mailman/listinfo/user _______________________________________________ Neo4j mailing list [email protected] https://lists.neo4j.org/mailman/listinfo/user

