Hi! Interesting thread :-)
Craig Taverner wrote: > of original data. The developers answered that they would store the node-ids > of the original data in an int[] property. That answer totally took me by > surprise, and it took a few seconds for me to realize that it would work for > the initial use case, but was not a graph-like solution, and obviously more > like a relational database foreign key. > > Not being a graphy solution was not a good enough reason for me to turn down > the idea, so I had to think a bit further to argue my point. Simply > replacing the ids with relationships to the original nodes (as I originally > planned), was not going to give any further functionality. Using a relationship would guarantee that the node is still there as long as the relationship exists. A node id could point to a node that has been deleted, or the id could even have been reused for a new node after that. So when storing node ids somewhere, your application also has to make sure referential integrity isn't lost. /anders > Instead it would > slow down performance slightly (due to increased structure). Then it hit me, > the power of the graph was that we can continue to add structure, coping > with far more use cases than a simple foreign key. And the next key piece of > structure I immediately thought of was a layer of intermediate nodes > representing an intermediate level of statistics. Suddenly we could add any > number of levels of detail to the chain, arbitrarily, without affecting the > user interface or the initial use case. That was certainly not possible with > the foreign key approach. > > But I also thanked the developer for the suggestion, for without it I would > not have been forced to find a better argument for the graphy solution, and > thereby solved a few other requirements :-) > > On Wed, Mar 17, 2010 at 10:53 PM, Tim Langley <timlang...@me.com> wrote: > >> Good point :) >> In training our new devs (who typically have a RDBC background) it >> often takes a leap of understanding for them to realise that whilst a >> join is very expensive, creating and traversing a relationship is very >> cheap >> >> T >> >> Sent from my iPhone so please excuse typos adn brevity >> >> Tim Langley >> +44 7989 539363 >> >> On 17 Mar 2010, at 21:50, Lincoln <linxbet...@gmail.com> wrote: >> >>> Ok I get it. I keep not thinking about relationships flexibly >>> enough. I >>> need to take a little time to get used to thinking about problems >>> this way. >>> Thanks for your help! >>> >>> On Wed, Mar 17, 2010 at 5:43 PM, Craig Taverner <cr...@amanzi.com> >>> wrote: >>> >>>> I guess I could say that the approach is totally different, but in >>>> reality >>>> you start at the same point, working out what query you want. But >>>> after >>>> that >>>> things change. Let's consider two cases, the huge network and the >>>> paging >>>> examples you used below. >>>> >>>> For the first, keep in mind that if you scale the total database >>>> size, >>>> relational db's don't scale as well as graph db's, because in >>>> relational >>>> db's performance is related to total table size, whereas graph db's >>>> performance is not. However, both db's are affected by the result >>>> set size. >>>> So for twitter type cases, graph db's scale better with total >>>> subscriber >>>> base, but not necessarily for an individual 'mega-user'. But let's >>>> look at >>>> that case. I believe a real human user would never be able to process >>>> millions of messages a day, so I must assume that this is a bot use >>>> case, >>>> perhaps a script that subscribes to enormous numbers of people >>>> scanning for >>>> particular patterns of information. Both relational and graph db's >>>> will >>>> feel >>>> the strain on a huge resultset. Personally I would probably follow >>>> twitters >>>> example and return only paginated result, breaking up the load so >>>> the bot >>>> would not negatively impact the network for human users. >>>> >>>> So, getting to pagination, the query you want is perhaps 'give me the >>>> ordered results, as a block of n messages starting at message x'. >>>> Obviously >>>> this is easily translatable into SQL for relational databases. For >>>> the >>>> graph >>>> database, it is a little more work, but again you start with that >>>> verbal >>>> query and figure out what it really means in a graph. There are two >>>> key >>>> components: >>>> >>>> - Ordered results >>>> - Starting point and number of results >>>> >>>> For SQL this translates to 'order by', 'where id > x' and 'limit >>>> n'. For >>>> graph databases this translates into adding ordering relationships >>>> like >>>> message(X)-->PREVIOUS-->message(X+1). On each pagination, you can >>>> take the >>>> PREVIOUS message as the begining of the next pagination (possibly >>>> passing >>>> the node id in the session). Each pagination is simply a traverser >>>> that >>>> runs >>>> down the PREVIOUS chain, exiting the loop after n messages. >>>> >>>> >>>> On Wed, Mar 17, 2010 at 10:19 PM, Lincoln <linxbet...@gmail.com> >>>> wrote: >>>> >>>>> Wow dude, this is blowing my mind just a little. >>>>> >>>>> Ok, sticking with the twitter example, I'm concerned about the edge >>>> cases. >>>>> I'd say it's easy to optimize with a relational db or any other >>>>> storage >>>>> for >>>>> that matter if I make the assumption that people only follow a few >>>> hundred >>>>> people and only want recent messages. However some people follow >>>> hundreds >>>>> of thousands of people. If Guy Kawasaki uses my app, I'd run into a >>>>> problem >>>>> quickly. >>>>> >>>>> However I see your point that I don't have to limit myself to just >>>>> the >>>>> obvious relationships, but can create relationships that serve >>>>> specific >>>>> purposes and use-cases such as your day example. I'm not sure how I >>>> would >>>>> want to model my use-case to allow for Guy Kawaski, I'll have to >>>>> think >>>> more >>>>> about it. Is there a threshold beyond which adding relationships >>>>> between >>>>> nodes causes problems? If not, or if it's high, you could create >>>>> custom >>>>> relationships for every type of query you'd want to do. >>>>> >>>>> However, a secondary question comes up. If we continue with the >>>>> twitter >>>>> example, and I want to be able to page through results, is that >>>>> directly >>>>> supported through Neo4j's API? Coming from a more traditional >>>>> storage >>>>> background I tend to think of what I'd want as a sort by time and >>>>> then a >>>>> skip and limit on the results (so I could say give me messages 1-100 >>>> sorted >>>>> by time descending). Is there anything equivalent in Neo4j or is >>>>> the >>>>> approach totally different? >>>>> >>>>> Thanks, >>>>> Lincoln >>>>> >>>>> >>>>> On Wed, Mar 17, 2010 at 12:41 PM, Craig Taverner <cr...@amanzi.com> >>>> wrote: >>>>>> Hi Lincoln, >>>>>> >>>>>> So it sounds like you don't need the IS_VISIBLE relations after >>>>>> all. >>>> The >>>>>> traverser works by following all relationships of the specified >>>>>> types >>>> and >>>>>> directions from each current node (as you traverse, or walk the >>>>>> graph). >>>>> You >>>>>> can have a complex graph and traverse to high depth very fast >>>> (thousands >>>>> of >>>>>> relationships per second). The traverser will also automatically >>>>>> check >>>>> that >>>>>> the same node is not returned twice. The test for the >>>>>> relationship type >>>>> is >>>>>> efficient. Still reasonable, but less efficient is the custom >>>>>> test you >>>>>> might >>>>>> put in the returnable evaluator, but if the limiting factor is >>>>>> usually >>>>> the >>>>>> number of relationships traversed, and if that is kept managable, >>>>>> the >>>>>> evaluator test is no concern. >>>>>> >>>>>> I think twitter is a good case in point, even with many millions of >>>>> users, >>>>>> you will still only follow perhaps a hundred and they will tweet >>>> perhaps >>>>> a >>>>>> hundred, or a thousand times, so your traverser will find the >>>>>> 10k-100k >>>>>> messages quite quickly. This can be speeded up further, but the >>>>>> right >>>>>> approach depends again on your use case. The idea with using a >>>>>> graph >>>>>> database is that the actual usage probably maps very well to the >>>>>> graph >>>>>> structure, so when deciding how to speed up your search, consider >>>>>> how >>>> it >>>>>> will be used. In twitter one normally only cares about recent >>>>>> messages, >>>>> so >>>>>> how about not linking directly from the user to the message, but >>>>>> link >>>> to >>>>> an >>>>>> intermediate node representing time, for example, a day-node. >>>>>> Then each >>>>> new >>>>>> message is added to the day node for that day, and that will >>>>> automatically >>>>>> become yesterday the next day. Then your traversal can have a stop >>>>>> evaluator >>>>>> to not follow old messages (unless your query is looking for old >>>>> messages, >>>>>> of course). So the 100k messages might drop to only a few >>>>>> hundred, or >>>>> even >>>>>> just a few dozen. Certainly that will be a query of the order of >>>>>> milliseconds! >>>>>> >>>>>> Moving away from the traverser, you also have the option to call >>>> directly >>>>>> the getRelationships() methods from the node. If you structure is >>>>>> predictable, like viewer-->FOLLOWS-->user-->CREATED-->message, >>>>>> then two >>>>>> nested for loops would work, the outer iterating over the >>>>>> followers and >>>>> the >>>>>> inner iterating over the messages. If you changed to add a time- >>>>>> based >>>>>> interim node (which is a kind of graph-index), then you need to >>>>>> have >>>>> three >>>>>> loops. If you made your time index a deeper tree (months->days- >>>>>>> hours, >>>>>> etc.), then you would need to further refactor the code. However, >>>>>> if >>>> you >>>>>> stuck with a traverser, you might not need to change the >>>>>> traverser even >>>>> of >>>>>> the graph structure changed, as long as the same relationship types >>>> were >>>>>> maintained. Does that make sense? >>>>>> >>>>>> Cheers, Craig >>>>>> >>>>>> On Wed, Mar 17, 2010 at 4:00 PM, Lincoln <linxbet...@gmail.com> >>>>>> wrote: >>>>>> >>>>>>> Thanks Craig, >>>>>>> >>>>>>> I'd like to clarify my question (I don't think it changes your >>>>>>> answer >>>>>>> though). >>>>>>> >>>>>>> I wanted all messages visible to me created by users I follow. >>>>>>> Thus, >>>>> the >>>>>>> FOLLOWS relationship is not enough. I'd need to see messages that >>>> are >>>>>>> visible to me and then check if they were created by users I >>>>>>> follow, >>>> or >>>>>> I'd >>>>>>> need to see messages created by users I follow and then see if >>>> they're >>>>>>> visible to me. >>>>>>> >>>>>>> I assume your last example still yields the result I'm looking >>>>>>> for. >>>>>> Could >>>>>>> you describe what actually happens here though? I'm unclear on >>>>>>> what >>>>> the >>>>>>> traversal looks like. Would it first traverse every outgoing >>>>>>> FOLLOWS >>>>>>> relationship from the viewer, yielding other users, and then >>>>>>> traverse >>>>> all >>>>>>> the CREATED relationships to get to messages? >>>>>>> >>>>>>> Also, given very large numbers of FOLLOWS and CREATED >>>>>>> relationships >>>>> (with >>>>>>> say, a twitter graph), how is this made efficient? >>>>>>> >>>>>>> Sorry for all the basic questions but I couldn't find this >>>> information >>>>> in >>>>>>> the docs. If there's something I should be reading before posting >>>>> these >>>>>>> questions, please point me to it. >>>>>>> >>>>>>> Thanks! >>>>>>> >>>>>>> Lincoln >>>>>>> >>>>>>> On Wed, Mar 17, 2010 at 7:06 AM, Craig Taverner <cr...@amanzi.com> >>>>>> wrote: >>>>>>>> I'm uncertain about one ambiguity in your model, you are able to >>>> find >>>>>>>> messages through FOLLOWS and IS_VISIBLE_BY. These will give two >>>>>> different >>>>>>>> sets, and my first impression was that FOLLOWS gives you the >>>>>>>> right >>>>>>> answer. >>>>>>>> In other words you want to query for 'all messages by users I >>>>> follow'? >>>>>> In >>>>>>>> that case you do not need IS_VISIBLE_BY. However, if there are >>>>> messages >>>>>>> by >>>>>>>> people you follow, but are not allowed to see, then you also need >>>> the >>>>>>>> IS_VISIBLE_BY. But I would still reconsider linking directly from >>>> the >>>>>>>> viewer >>>>>>>> to the message for that case. I'd rather have the messages linked >>>> to >>>>>> some >>>>>>>> categorization structure for things like 'public', 'private', >>>>>>>> etc. >>>>>>>> >>>>>>>> Anyway, here are some suggestions for the various approaches >>>>>>>> above: >>>>>>>> *'all messages by users I follow'* >>>>>>>> val msgs = viewer.traverse( >>>>>>>> Order.BREADTH_FIRST, StopEvaluator.END_OF_GRAPH, >>>>>>>> (tp: TraversalPosition) => IsMessage(tp.currentNode()), >>>>>>>> Rels.FOLLOWS, Direction.OUTGOING, >>>>>>>> Rels.CREATED, Direction.OUTGOING) >>>>>>>> >>>>>>>> *'all messages visible to me'* >>>>>>>> val msgs = viewer.traverse( >>>>>>>> Order.BREADTH_FIRST, StopEvaluator.END_OF_GRAPH, >>>>>>>> ReturnableEvaluator.ALL_BUT_START_NODE, >>>>>>>> Rels.IS_VISIBLE_BY, Direction.INCOMING) >>>>>>>> >>>>>>>> *'all messages, visible to me, by people I follow'* >>>>>>>> val msgs = viewer.traverse( >>>>>>>> Order.BREADTH_FIRST, StopEvaluator.END_OF_GRAPH, >>>>>>>> (tp: TraversalPosition) => { >>>>>>>> val msg = tp.currentNode() >>>>>>>> IsMessage(msg) && IsVisibleBy(msg,viewer) >>>>>>>> }, >>>>>>>> Rels.FOLLOWS, Direction.OUTGOING, >>>>>>>> Rels.CREATED, Direction.OUTGOING) >>>>>>>> >>>>>>>> Of course I assume you make the utility functions IsMessage(node: >>>>> Node) >>>>>>> and >>>>>>>> IsVisibleBy(msg: Node, user: Node), and these will test the >>>> existance >>>>>> of >>>>>>>> properties and relations as appropriate to make the decision. >>>>>>>> >>>>>>>> >>>>>>>> On Wed, Mar 17, 2010 at 6:32 AM, Lincoln <linxbet...@gmail.com> >>>>> wrote: >>>>>>>>> Hi, I've just started looking at Neo4j and I'm quite intrigued. >>>>>>> However, >>>>>>>>> the cognitive dissonance that I've grown so used to in modeling >>>>>> storage >>>>>>>> is >>>>>>>>> proving to be a bit difficult to let go at this early stage :) >>>>>>>>> >>>>>>>>> I was hoping that if someone could help me through an example I >>>>> would >>>>>>> be >>>>>>>>> able to grok how to properly structure my data and query it in >>>>> Neo4j. >>>>>>>>> Nodes: >>>>>>>>> Message( text: String ) >>>>>>>>> User( id: Long ) >>>>>>>>> >>>>>>>>> Relationships: >>>>>>>>> CREATED >>>>>>>>> FOLLOWS >>>>>>>>> IS_VISIBLE_BY >>>>>>>>> >>>>>>>>> So I might have a graph with entries like so: >>>>>>>>> >>>>>>>>> User(1) --> CREATED --> Message("i woke up late today") >>>>>>>>> User(2) --> CREATED --> Message("hello") >>>>>>>>> User(3) --> CREATED --> Message("ugh, i hate mondays") >>>>>>>>> >>>>>>>>> User(1) --> FOLLOWS --> User(2) >>>>>>>>> >>>>>>>>> Let's also say all messages are visible to User 1. >>>>>>>>> >>>>>>>>> Message("i woke up late today") --> IS_VISIBLE_BY --> User(1) >>>>>>>>> Message("hello") --> IS_VISIBLE_BY --> User(1) >>>>>>>>> Message("ugh, i hate mondays") --> IS_VISIBLE_BY --> User(1) >>>>>>>>> >>>>>>>>> So, I can do a simple traversal for visible: >>>>>>>>> >>>>>>>>> val graphDb = new EmbeddedGraphDatabase( "path/to/neo4j-db" ) >>>>>>>>> val index = new LuceneIndexService( graphDb ) >>>>>>>>> val viewer = index.getSingleNode("id", 1) >>>>>>>>> val msgs = viewer.traverse( Order.BREADTH_FIRST, >>>>>>>>> StopEvaluator.END_OF_GRAPH, >>>>>>>>> ReturnableEvaluator.ALL_BUT_START_NODE, Rels.IS_VISIBLE_BY, >>>>>>>>> Direction.INCOMING) >>>>>>>>> msgs.toList.map(_.toJson).mkString("{ msgs : [", ",", "] }") // >>>>>>> assuming >>>>>>>> i >>>>>>>>> have the relevant functions >>>>>>>>> >>>>>>>>> But let's say that this is going to return too many messages. >>>> Just >>>>>>>> because >>>>>>>>> all the messages are possibly visible to me, doesn't mean I want >>>> to >>>>>> see >>>>>>>>> them >>>>>>>>> all. So, I'd like to additionally filter by the FOLLOWS >>>>>> relationship. >>>>>>>> I'd >>>>>>>>> like to express "get all messages that are visible and were >>>> created >>>>>> by >>>>>>> a >>>>>>>>> user that I follow." Can someone show me an example of how to >>>>>>>>> do >>>>>> that? >>>>>>>>> I'm guessing that you need to implement a custom >>>>> ReturnableEvaluator, >>>>>>> but >>>>>>>> I >>>>>>>>> don't understand how you traverse multiple relationships at the >>>>> same >>>>>>>> time. >>>>>>>>> Thanks, >>>>>>>>> Lincoln >>>>>>>>> _______________________________________________ >>>>>>>>> Neo mailing list >>>>>>>>> User@lists.neo4j.org >>>>>>>>> https://lists.neo4j.org/mailman/listinfo/user >>>>>>>>> >>>>>>>> _______________________________________________ >>>>>>>> Neo mailing list >>>>>>>> User@lists.neo4j.org >>>>>>>> https://lists.neo4j.org/mailman/listinfo/user >>>>>>>> >>>>>>> _______________________________________________ >>>>>>> Neo mailing list >>>>>>> User@lists.neo4j.org >>>>>>> https://lists.neo4j.org/mailman/listinfo/user >>>>>>> >>>>>> _______________________________________________ >>>>>> Neo mailing list >>>>>> User@lists.neo4j.org >>>>>> https://lists.neo4j.org/mailman/listinfo/user >>>>>> >>>>> _______________________________________________ >>>>> Neo mailing list >>>>> User@lists.neo4j.org >>>>> https://lists.neo4j.org/mailman/listinfo/user >>>>> >>>> _______________________________________________ >>>> Neo mailing list >>>> User@lists.neo4j.org >>>> https://lists.neo4j.org/mailman/listinfo/user >>>> >>> _______________________________________________ >>> Neo mailing list >>> User@lists.neo4j.org >>> https://lists.neo4j.org/mailman/listinfo/user >> _______________________________________________ >> Neo mailing list >> User@lists.neo4j.org >> https://lists.neo4j.org/mailman/listinfo/user >> > _______________________________________________ > Neo mailing list > User@lists.neo4j.org > https://lists.neo4j.org/mailman/listinfo/user _______________________________________________ Neo mailing list User@lists.neo4j.org https://lists.neo4j.org/mailman/listinfo/user