Good point :) In training our new devs (who typically have a RDBC background) it often takes a leap of understanding for them to realise that whilst a join is very expensive, creating and traversing a relationship is very cheap
T Sent from my iPhone so please excuse typos adn brevity Tim Langley +44 7989 539363 On 17 Mar 2010, at 21:50, Lincoln <linxbet...@gmail.com> wrote: > Ok I get it. I keep not thinking about relationships flexibly > enough. I > need to take a little time to get used to thinking about problems > this way. > Thanks for your help! > > On Wed, Mar 17, 2010 at 5:43 PM, Craig Taverner <cr...@amanzi.com> > wrote: > >> I guess I could say that the approach is totally different, but in >> reality >> you start at the same point, working out what query you want. But >> after >> that >> things change. Let's consider two cases, the huge network and the >> paging >> examples you used below. >> >> For the first, keep in mind that if you scale the total database >> size, >> relational db's don't scale as well as graph db's, because in >> relational >> db's performance is related to total table size, whereas graph db's >> performance is not. However, both db's are affected by the result >> set size. >> So for twitter type cases, graph db's scale better with total >> subscriber >> base, but not necessarily for an individual 'mega-user'. But let's >> look at >> that case. I believe a real human user would never be able to process >> millions of messages a day, so I must assume that this is a bot use >> case, >> perhaps a script that subscribes to enormous numbers of people >> scanning for >> particular patterns of information. Both relational and graph db's >> will >> feel >> the strain on a huge resultset. Personally I would probably follow >> twitters >> example and return only paginated result, breaking up the load so >> the bot >> would not negatively impact the network for human users. >> >> So, getting to pagination, the query you want is perhaps 'give me the >> ordered results, as a block of n messages starting at message x'. >> Obviously >> this is easily translatable into SQL for relational databases. For >> the >> graph >> database, it is a little more work, but again you start with that >> verbal >> query and figure out what it really means in a graph. There are two >> key >> components: >> >> - Ordered results >> - Starting point and number of results >> >> For SQL this translates to 'order by', 'where id > x' and 'limit >> n'. For >> graph databases this translates into adding ordering relationships >> like >> message(X)-->PREVIOUS-->message(X+1). On each pagination, you can >> take the >> PREVIOUS message as the begining of the next pagination (possibly >> passing >> the node id in the session). Each pagination is simply a traverser >> that >> runs >> down the PREVIOUS chain, exiting the loop after n messages. >> >> >> On Wed, Mar 17, 2010 at 10:19 PM, Lincoln <linxbet...@gmail.com> >> wrote: >> >>> Wow dude, this is blowing my mind just a little. >>> >>> Ok, sticking with the twitter example, I'm concerned about the edge >> cases. >>> I'd say it's easy to optimize with a relational db or any other >>> storage >>> for >>> that matter if I make the assumption that people only follow a few >> hundred >>> people and only want recent messages. However some people follow >> hundreds >>> of thousands of people. If Guy Kawasaki uses my app, I'd run into a >>> problem >>> quickly. >>> >>> However I see your point that I don't have to limit myself to just >>> the >>> obvious relationships, but can create relationships that serve >>> specific >>> purposes and use-cases such as your day example. I'm not sure how I >> would >>> want to model my use-case to allow for Guy Kawaski, I'll have to >>> think >> more >>> about it. Is there a threshold beyond which adding relationships >>> between >>> nodes causes problems? If not, or if it's high, you could create >>> custom >>> relationships for every type of query you'd want to do. >>> >>> However, a secondary question comes up. If we continue with the >>> twitter >>> example, and I want to be able to page through results, is that >>> directly >>> supported through Neo4j's API? Coming from a more traditional >>> storage >>> background I tend to think of what I'd want as a sort by time and >>> then a >>> skip and limit on the results (so I could say give me messages 1-100 >> sorted >>> by time descending). Is there anything equivalent in Neo4j or is >>> the >>> approach totally different? >>> >>> Thanks, >>> Lincoln >>> >>> >>> On Wed, Mar 17, 2010 at 12:41 PM, Craig Taverner <cr...@amanzi.com> >> wrote: >>> >>>> Hi Lincoln, >>>> >>>> So it sounds like you don't need the IS_VISIBLE relations after >>>> all. >> The >>>> traverser works by following all relationships of the specified >>>> types >> and >>>> directions from each current node (as you traverse, or walk the >>>> graph). >>> You >>>> can have a complex graph and traverse to high depth very fast >> (thousands >>> of >>>> relationships per second). The traverser will also automatically >>>> check >>> that >>>> the same node is not returned twice. The test for the >>>> relationship type >>> is >>>> efficient. Still reasonable, but less efficient is the custom >>>> test you >>>> might >>>> put in the returnable evaluator, but if the limiting factor is >>>> usually >>> the >>>> number of relationships traversed, and if that is kept managable, >>>> the >>>> evaluator test is no concern. >>>> >>>> I think twitter is a good case in point, even with many millions of >>> users, >>>> you will still only follow perhaps a hundred and they will tweet >> perhaps >>> a >>>> hundred, or a thousand times, so your traverser will find the >>>> 10k-100k >>>> messages quite quickly. This can be speeded up further, but the >>>> right >>>> approach depends again on your use case. The idea with using a >>>> graph >>>> database is that the actual usage probably maps very well to the >>>> graph >>>> structure, so when deciding how to speed up your search, consider >>>> how >> it >>>> will be used. In twitter one normally only cares about recent >>>> messages, >>> so >>>> how about not linking directly from the user to the message, but >>>> link >> to >>> an >>>> intermediate node representing time, for example, a day-node. >>>> Then each >>> new >>>> message is added to the day node for that day, and that will >>> automatically >>>> become yesterday the next day. Then your traversal can have a stop >>>> evaluator >>>> to not follow old messages (unless your query is looking for old >>> messages, >>>> of course). So the 100k messages might drop to only a few >>>> hundred, or >>> even >>>> just a few dozen. Certainly that will be a query of the order of >>>> milliseconds! >>>> >>>> Moving away from the traverser, you also have the option to call >> directly >>>> the getRelationships() methods from the node. If you structure is >>>> predictable, like viewer-->FOLLOWS-->user-->CREATED-->message, >>>> then two >>>> nested for loops would work, the outer iterating over the >>>> followers and >>> the >>>> inner iterating over the messages. If you changed to add a time- >>>> based >>>> interim node (which is a kind of graph-index), then you need to >>>> have >>> three >>>> loops. If you made your time index a deeper tree (months->days- >>>> >hours, >>>> etc.), then you would need to further refactor the code. However, >>>> if >> you >>>> stuck with a traverser, you might not need to change the >>>> traverser even >>> of >>>> the graph structure changed, as long as the same relationship types >> were >>>> maintained. Does that make sense? >>>> >>>> Cheers, Craig >>>> >>>> On Wed, Mar 17, 2010 at 4:00 PM, Lincoln <linxbet...@gmail.com> >>>> wrote: >>>> >>>>> Thanks Craig, >>>>> >>>>> I'd like to clarify my question (I don't think it changes your >>>>> answer >>>>> though). >>>>> >>>>> I wanted all messages visible to me created by users I follow. >>>>> Thus, >>> the >>>>> FOLLOWS relationship is not enough. I'd need to see messages that >> are >>>>> visible to me and then check if they were created by users I >>>>> follow, >> or >>>> I'd >>>>> need to see messages created by users I follow and then see if >> they're >>>>> visible to me. >>>>> >>>>> I assume your last example still yields the result I'm looking >>>>> for. >>>> Could >>>>> you describe what actually happens here though? I'm unclear on >>>>> what >>> the >>>>> traversal looks like. Would it first traverse every outgoing >>>>> FOLLOWS >>>>> relationship from the viewer, yielding other users, and then >>>>> traverse >>> all >>>>> the CREATED relationships to get to messages? >>>>> >>>>> Also, given very large numbers of FOLLOWS and CREATED >>>>> relationships >>> (with >>>>> say, a twitter graph), how is this made efficient? >>>>> >>>>> Sorry for all the basic questions but I couldn't find this >> information >>> in >>>>> the docs. If there's something I should be reading before posting >>> these >>>>> questions, please point me to it. >>>>> >>>>> Thanks! >>>>> >>>>> Lincoln >>>>> >>>>> On Wed, Mar 17, 2010 at 7:06 AM, Craig Taverner <cr...@amanzi.com> >>>> wrote: >>>>> >>>>>> I'm uncertain about one ambiguity in your model, you are able to >> find >>>>>> messages through FOLLOWS and IS_VISIBLE_BY. These will give two >>>> different >>>>>> sets, and my first impression was that FOLLOWS gives you the >>>>>> right >>>>> answer. >>>>>> In other words you want to query for 'all messages by users I >>> follow'? >>>> In >>>>>> that case you do not need IS_VISIBLE_BY. However, if there are >>> messages >>>>> by >>>>>> people you follow, but are not allowed to see, then you also need >> the >>>>>> IS_VISIBLE_BY. But I would still reconsider linking directly from >> the >>>>>> viewer >>>>>> to the message for that case. I'd rather have the messages linked >> to >>>> some >>>>>> categorization structure for things like 'public', 'private', >>>>>> etc. >>>>>> >>>>>> Anyway, here are some suggestions for the various approaches >>>>>> above: >>>>>> *'all messages by users I follow'* >>>>>> val msgs = viewer.traverse( >>>>>> Order.BREADTH_FIRST, StopEvaluator.END_OF_GRAPH, >>>>>> (tp: TraversalPosition) => IsMessage(tp.currentNode()), >>>>>> Rels.FOLLOWS, Direction.OUTGOING, >>>>>> Rels.CREATED, Direction.OUTGOING) >>>>>> >>>>>> *'all messages visible to me'* >>>>>> val msgs = viewer.traverse( >>>>>> Order.BREADTH_FIRST, StopEvaluator.END_OF_GRAPH, >>>>>> ReturnableEvaluator.ALL_BUT_START_NODE, >>>>>> Rels.IS_VISIBLE_BY, Direction.INCOMING) >>>>>> >>>>>> *'all messages, visible to me, by people I follow'* >>>>>> val msgs = viewer.traverse( >>>>>> Order.BREADTH_FIRST, StopEvaluator.END_OF_GRAPH, >>>>>> (tp: TraversalPosition) => { >>>>>> val msg = tp.currentNode() >>>>>> IsMessage(msg) && IsVisibleBy(msg,viewer) >>>>>> }, >>>>>> Rels.FOLLOWS, Direction.OUTGOING, >>>>>> Rels.CREATED, Direction.OUTGOING) >>>>>> >>>>>> Of course I assume you make the utility functions IsMessage(node: >>> Node) >>>>> and >>>>>> IsVisibleBy(msg: Node, user: Node), and these will test the >> existance >>>> of >>>>>> properties and relations as appropriate to make the decision. >>>>>> >>>>>> >>>>>> On Wed, Mar 17, 2010 at 6:32 AM, Lincoln <linxbet...@gmail.com> >>> wrote: >>>>>> >>>>>>> Hi, I've just started looking at Neo4j and I'm quite intrigued. >>>>> However, >>>>>>> the cognitive dissonance that I've grown so used to in modeling >>>> storage >>>>>> is >>>>>>> proving to be a bit difficult to let go at this early stage :) >>>>>>> >>>>>>> I was hoping that if someone could help me through an example I >>> would >>>>> be >>>>>>> able to grok how to properly structure my data and query it in >>> Neo4j. >>>>>>> >>>>>>> Nodes: >>>>>>> Message( text: String ) >>>>>>> User( id: Long ) >>>>>>> >>>>>>> Relationships: >>>>>>> CREATED >>>>>>> FOLLOWS >>>>>>> IS_VISIBLE_BY >>>>>>> >>>>>>> So I might have a graph with entries like so: >>>>>>> >>>>>>> User(1) --> CREATED --> Message("i woke up late today") >>>>>>> User(2) --> CREATED --> Message("hello") >>>>>>> User(3) --> CREATED --> Message("ugh, i hate mondays") >>>>>>> >>>>>>> User(1) --> FOLLOWS --> User(2) >>>>>>> >>>>>>> Let's also say all messages are visible to User 1. >>>>>>> >>>>>>> Message("i woke up late today") --> IS_VISIBLE_BY --> User(1) >>>>>>> Message("hello") --> IS_VISIBLE_BY --> User(1) >>>>>>> Message("ugh, i hate mondays") --> IS_VISIBLE_BY --> User(1) >>>>>>> >>>>>>> So, I can do a simple traversal for visible: >>>>>>> >>>>>>> val graphDb = new EmbeddedGraphDatabase( "path/to/neo4j-db" ) >>>>>>> val index = new LuceneIndexService( graphDb ) >>>>>>> val viewer = index.getSingleNode("id", 1) >>>>>>> val msgs = viewer.traverse( Order.BREADTH_FIRST, >>>>>>> StopEvaluator.END_OF_GRAPH, >>>>>>> ReturnableEvaluator.ALL_BUT_START_NODE, Rels.IS_VISIBLE_BY, >>>>>>> Direction.INCOMING) >>>>>>> msgs.toList.map(_.toJson).mkString("{ msgs : [", ",", "] }") // >>>>> assuming >>>>>> i >>>>>>> have the relevant functions >>>>>>> >>>>>>> But let's say that this is going to return too many messages. >> Just >>>>>> because >>>>>>> all the messages are possibly visible to me, doesn't mean I want >> to >>>> see >>>>>>> them >>>>>>> all. So, I'd like to additionally filter by the FOLLOWS >>>> relationship. >>>>>> I'd >>>>>>> like to express "get all messages that are visible and were >> created >>>> by >>>>> a >>>>>>> user that I follow." Can someone show me an example of how to >>>>>>> do >>>> that? >>>>>>> >>>>>>> I'm guessing that you need to implement a custom >>> ReturnableEvaluator, >>>>> but >>>>>> I >>>>>>> don't understand how you traverse multiple relationships at the >>> same >>>>>> time. >>>>>>> >>>>>>> Thanks, >>>>>>> Lincoln >>>>>>> _______________________________________________ >>>>>>> Neo mailing list >>>>>>> User@lists.neo4j.org >>>>>>> https://lists.neo4j.org/mailman/listinfo/user >>>>>>> >>>>>> _______________________________________________ >>>>>> Neo mailing list >>>>>> User@lists.neo4j.org >>>>>> https://lists.neo4j.org/mailman/listinfo/user >>>>>> >>>>> _______________________________________________ >>>>> Neo mailing list >>>>> User@lists.neo4j.org >>>>> https://lists.neo4j.org/mailman/listinfo/user >>>>> >>>> _______________________________________________ >>>> Neo mailing list >>>> User@lists.neo4j.org >>>> https://lists.neo4j.org/mailman/listinfo/user >>>> >>> _______________________________________________ >>> Neo mailing list >>> User@lists.neo4j.org >>> https://lists.neo4j.org/mailman/listinfo/user >>> >> _______________________________________________ >> Neo mailing list >> User@lists.neo4j.org >> https://lists.neo4j.org/mailman/listinfo/user >> > _______________________________________________ > Neo mailing list > User@lists.neo4j.org > https://lists.neo4j.org/mailman/listinfo/user _______________________________________________ Neo mailing list User@lists.neo4j.org https://lists.neo4j.org/mailman/listinfo/user