Michael, Is it possible/feasible to change the store-format so relationships are stored partitioned per relationshiptype per direction? What would be the impact of such change? Niels
> From: michael.hun...@neotechnology.com > Date: Thu, 30 Jun 2011 14:10:38 +0200 > To: user@lists.neo4j.org > Subject: Re: [Neo4j] traversing densely populated nodes > > The issue being that relationships on disk are not ordered, so that, even > when just accessing the few relationships of the one > type you still have to scan all rels. > > For supporting different approaches you either have to change the > store-format to handle the storage and loading of relationships of supernodes > differently. > > Or: > - if you'd rather want to avoid supernodes you can have an Expander in your > traversers that delays the traversals of supernodes (perhaps > the other nodes give you already enough results for the domain to consume). > (You could also limit/timebox that). > - You can also have Expanders that use an B-Tree or Relationship index > internally (for the "few"-rels). With autoindexing you could actually put a > property on those few "rels" that makes them automatically indexed. And then > in the Expander use > relIndex.get(startNode,null,null) to get all of them for expansion. > > It can be useful to get something like this implemented once and then offered > to all users either as part of the product or perhaps in a component like > neo4j-collections ? (these are just ideas) > > Cheers > > Michael > > Am 30.06.2011 um 13:57 schrieb Craig Taverner: > > > This topics has come up before, and the domain level solutions are usually > > very similar, like Norbert's category/proxy nodes (to group by > > type/direction) and Niels' TimeLineIndex (BTree). I wonder whether we can > > build a generic user-level solution that can also be wrapped to appear as an > > internal database solution? > > > > For example, consider Niels's solution of the TimeLine index. In this case > > we group all the nodes based on a consistent hash. Usually the timeline > > would use a timestamp, but really any reasonably variable property can do, > > even the node-id itself. Then we have a BTree between the dense nodes and > > the root node (node with too many relationships). How about this crazy idea, > > create an API that mimics the normal node.getRelationship*() API, but > > internally traverses the entire tree? And also for creating the > > relationships? So for most cod we just do the usual > > node.createRelationshipTo(node,type,direction) and node.traverse(...), but > > internally we actually traverse the b-tree. > > > > This would solve the performance bottleneck being observed while keeping the > > 'illusion' of directly connected relationships. The solution would be > > implemented mostly in the application space, so will not need any changes to > > the core database. I see this as being of the same kind of solution as the > > auto-indexing. We setup some initial configuration that results in certain > > structures being created on demand. With auto-indexing we are talking about > > mostly automatically adding lucene indexes. With this idea we are talking > > about automatically replacing direct relationships with b-trees to resolve a > > specific performance issue. > > > > And when the relationship density is very low, if the b-tree is > > auto-balancing, it could just be a direct relationship anyway. > > > > On Wed, Jun 29, 2011 at 6:56 PM, Agelos Pikoulas > > <agelos.pikou...@gmail.com>wrote: > > > >> My problem pattern is exactly the same as Niels's : > >> > >> A dense-node has millions of relations of a certain direction & type, > >> and only a few (sparse) relations of a different direction and type. > >> The traversing is usually following only those sparse relationships on > >> those > >> dense-nodes. > >> > >> Now, even when traversing on these sparse relations, neo4j becomes > >> extremely > >> slow > >> on a certainly non linear Order (the big cs O). > >> > >> Some tests I run (email me if u want the code) reveal that even the number > >> of those dense-nodes in the database greatly influences the results. > >> > >> I just reported to Michael the runs with the latest M05 snapshot, which are > >> not very positive... > >> I have suggested an (auto) indexing of relationship types / direction that > >> is used by traversing frameworks, > >> but I ain't no graphdb-engine expert :-( > >> > >> A' > >> > >> > >> Message: 5 > >>> Date: Wed, 29 Jun 2011 18:19:10 +0200 > >>> From: Niels Hoogeveen <pd_aficion...@hotmail.com> > >>> Subject: Re: [Neo4j] traversing densely populated nodes > >>> To: <user@lists.neo4j.org> > >>> Message-ID: <col110-w326b152552b8f7fbe1312d8b...@phx.gbl> > >>> Content-Type: text/plain; charset="iso-8859-1" > >>> > >>> > >>> Michael, > >>> > >>> > >>> > >>> The issue I am refering to does not pertain to traversing many relations > >> at > >>> once > >>> > >>> but the impact many relationship of one type have on relationships > >>> > >>> of another type on the same node. > >>> > >>> > >>> > >>> Example: > >>> > >>> > >>> > >>> A topic class has 2 million outgoing relationships of type "HAS_INSTANCE" > >>> and > >>> > >>> has 3 outgoing relationships of type "SUB_CLASS_OF". > >>> > >>> > >>> > >>> Fetching the 3 relations of type "SUB_CLASS_OF" takes very long, > >>> > >>> I presume due to the presence of the 2 million other relationships. > >>> > >>> > >>> > >>> I have no need to ever fetch the "HAS_INSTANCE" relationships from > >>> > >>> the topic node. That relation is always traversed from the other > >> direction. > >>> > >>> > >>> > >>> I do want to know the class of a topic instance, leading to he topic > >> class, > >>> > >>> but have no real interest ever to traverse all topic instance from the > >>> topic > >>> > >>> class (at least not directly.. i do want to know the most recent > >> addition, > >>> > >>> and that's what I use the timeline index for). > >>> > >>> > >>> > >>> Niels > >>> > >>> > >>>> From: michael.hun...@neotechnology.com > >>>> Date: Wed, 29 Jun 2011 17:50:08 +0200 > >>>> To: user@lists.neo4j.org > >>>> Subject: Re: [Neo4j] traversing densely populated nodes > >>>> > >>>> I think this is the same problem that Angelos is facing, we are > >> currently > >>> evaluating options to improve the performance on those highly connected > >>> supernodes. > >>>> > >>>> A traditional option is really to split them into group or even kind of > >>> shard their relationships to a second layer. > >>>> > >>>> We're looking into storage improvement options as well as modifications > >>> to retrieval of that many relationships at once. > >>>> > >>>> Cheers > >>>> > >>>> Michael > >>> > >> _______________________________________________ > >> Neo4j mailing list > >> User@lists.neo4j.org > >> https://lists.neo4j.org/mailman/listinfo/user > >> > > _______________________________________________ > > Neo4j mailing list > > User@lists.neo4j.org > > https://lists.neo4j.org/mailman/listinfo/user > > _______________________________________________ > Neo4j mailing list > User@lists.neo4j.org > https://lists.neo4j.org/mailman/listinfo/user _______________________________________________ Neo4j mailing list User@lists.neo4j.org https://lists.neo4j.org/mailman/listinfo/user