Re: [Neo4j] traversing densely populated nodes

Niels Hoogeveen Thu, 30 Jun 2011 05:48:06 -0700

Michael,
Is it possible/feasible to change the store-format so relationships are stored 
partitioned per relationshiptype per direction? What would be the impact of 
such change?
Niels


> From: michael.hun...@neotechnology.com
> Date: Thu, 30 Jun 2011 14:10:38 +0200
> To: user@lists.neo4j.org
> Subject: Re: [Neo4j] traversing densely populated nodes
> 
> The issue being that relationships on disk are not ordered, so that, even 
> when just accessing the few relationships of the one
> type you still have to scan all rels.
> 
> For supporting different approaches you either have to change the 
> store-format to handle the storage and loading of relationships of supernodes 
> differently.
> 
> Or: 
> - if you'd rather want to avoid supernodes you can have an Expander in your 
> traversers that delays the traversals of supernodes (perhaps
> the other nodes give you already enough results for the domain to consume). 
> (You could also limit/timebox that).
> - You can also have Expanders that use an B-Tree or Relationship index 
> internally (for the "few"-rels). With autoindexing you could actually put a 
> property on those few "rels" that makes them automatically indexed. And then 
> in the Expander use
> relIndex.get(startNode,null,null) to get all of them for expansion.
> 
> It can be useful to get something like this implemented once and then offered 
> to all users either as part of the product or perhaps in a component like 
> neo4j-collections ? (these are just ideas)
> 
> Cheers
> 
> Michael
> 
> Am 30.06.2011 um 13:57 schrieb Craig Taverner:
> 
> > This topics has come up before, and the domain level solutions are usually
> > very similar, like Norbert's category/proxy nodes (to group by
> > type/direction) and Niels' TimeLineIndex (BTree). I wonder whether we can
> > build a generic user-level solution that can also be wrapped to appear as an
> > internal database solution?
> > 
> > For example, consider Niels's solution of the TimeLine index. In this case
> > we group all the nodes based on a consistent hash. Usually the timeline
> > would use a timestamp, but really any reasonably variable property can do,
> > even the node-id itself. Then we have a BTree between the dense nodes and
> > the root node (node with too many relationships). How about this crazy idea,
> > create an API that mimics the normal node.getRelationship*() API, but
> > internally traverses the entire tree? And also for creating the
> > relationships? So for most cod we just do the usual
> > node.createRelationshipTo(node,type,direction) and node.traverse(...), but
> > internally we actually traverse the b-tree.
> > 
> > This would solve the performance bottleneck being observed while keeping the
> > 'illusion' of directly connected relationships. The solution would be
> > implemented mostly in the application space, so will not need any changes to
> > the core database. I see this as being of the same kind of solution as the
> > auto-indexing. We setup some initial configuration that results in certain
> > structures being created on demand. With auto-indexing we are talking about
> > mostly automatically adding lucene indexes. With this idea we are talking
> > about automatically replacing direct relationships with b-trees to resolve a
> > specific performance issue.
> > 
> > And when the relationship density is very low, if the b-tree is
> > auto-balancing, it could just be a direct relationship anyway.
> > 
> > On Wed, Jun 29, 2011 at 6:56 PM, Agelos Pikoulas
> > <agelos.pikou...@gmail.com>wrote:
> > 
> >> My problem pattern is exactly the same as Niels's :
> >> 
> >> A dense-node has millions of relations of a certain direction & type,
> >> and only a few (sparse) relations of a different direction and type.
> >> The traversing is usually following only those sparse relationships on
> >> those
> >> dense-nodes.
> >> 
> >> Now, even when traversing on these sparse relations, neo4j becomes
> >> extremely
> >> slow
> >> on a certainly non linear Order (the big cs O).
> >> 
> >> Some tests I run (email me if u want the code) reveal that even the number
> >> of those dense-nodes in the database greatly influences the results.
> >> 
> >> I just reported to Michael the runs with the latest M05 snapshot, which are
> >> not very positive...
> >> I have suggested an (auto) indexing of relationship types / direction that
> >> is used by traversing frameworks,
> >> but I ain't no graphdb-engine expert :-(
> >> 
> >> A'
> >> 
> >> 
> >> Message: 5
> >>> Date: Wed, 29 Jun 2011 18:19:10 +0200
> >>> From: Niels Hoogeveen <pd_aficion...@hotmail.com>
> >>> Subject: Re: [Neo4j] traversing densely populated nodes
> >>> To: <user@lists.neo4j.org>
> >>> Message-ID: <col110-w326b152552b8f7fbe1312d8b...@phx.gbl>
> >>> Content-Type: text/plain; charset="iso-8859-1"
> >>> 
> >>> 
> >>> Michael,
> >>> 
> >>> 
> >>> 
> >>> The issue I am refering to does not pertain to traversing many relations
> >> at
> >>> once
> >>> 
> >>> but the impact many relationship of one type have on relationships
> >>> 
> >>> of another type on the same node.
> >>> 
> >>> 
> >>> 
> >>> Example:
> >>> 
> >>> 
> >>> 
> >>> A topic class has 2 million outgoing relationships of type "HAS_INSTANCE"
> >>> and
> >>> 
> >>> has 3 outgoing relationships of type "SUB_CLASS_OF".
> >>> 
> >>> 
> >>> 
> >>> Fetching the 3 relations of type "SUB_CLASS_OF" takes very long,
> >>> 
> >>> I presume due to the presence of the 2 million other relationships.
> >>> 
> >>> 
> >>> 
> >>> I have no need to ever fetch the "HAS_INSTANCE" relationships from
> >>> 
> >>> the topic node. That relation is always traversed from the other
> >> direction.
> >>> 
> >>> 
> >>> 
> >>> I do want to know the class of a topic instance, leading to he topic
> >> class,
> >>> 
> >>> but have no real interest ever to traverse all topic instance from  the
> >>> topic
> >>> 
> >>> class (at least not directly.. i do want to know the most recent
> >> addition,
> >>> 
> >>> and that's what I use the timeline index for).
> >>> 
> >>> 
> >>> 
> >>> Niels
> >>> 
> >>> 
> >>>> From: michael.hun...@neotechnology.com
> >>>> Date: Wed, 29 Jun 2011 17:50:08 +0200
> >>>> To: user@lists.neo4j.org
> >>>> Subject: Re: [Neo4j] traversing densely populated nodes
> >>>> 
> >>>> I think this is the same problem that Angelos is facing, we are
> >> currently
> >>> evaluating options to improve the performance on those highly connected
> >>> supernodes.
> >>>> 
> >>>> A traditional option is really to split them into group or even kind of
> >>> shard their relationships to a second layer.
> >>>> 
> >>>> We're looking into storage improvement options as well as modifications
> >>> to retrieval of that many relationships at once.
> >>>> 
> >>>> Cheers
> >>>> 
> >>>> Michael
> >>> 
> >> _______________________________________________
> >> Neo4j mailing list
> >> User@lists.neo4j.org
> >> https://lists.neo4j.org/mailman/listinfo/user
> >> 
> > _______________________________________________
> > Neo4j mailing list
> > User@lists.neo4j.org
> > https://lists.neo4j.org/mailman/listinfo/user
> 
> _______________________________________________
> Neo4j mailing list
> User@lists.neo4j.org
> https://lists.neo4j.org/mailman/listinfo/user
                                          
_______________________________________________
Neo4j mailing list
User@lists.neo4j.org
https://lists.neo4j.org/mailman/listinfo/user

Re: [Neo4j] traversing densely populated nodes

Reply via email to