Ok I get it. I keep not thinking about relationships flexibly enough. I need to take a little time to get used to thinking about problems this way. Thanks for your help!
On Wed, Mar 17, 2010 at 5:43 PM, Craig Taverner <cr...@amanzi.com> wrote: > I guess I could say that the approach is totally different, but in reality > you start at the same point, working out what query you want. But after > that > things change. Let's consider two cases, the huge network and the paging > examples you used below. > > For the first, keep in mind that if you scale the total database size, > relational db's don't scale as well as graph db's, because in relational > db's performance is related to total table size, whereas graph db's > performance is not. However, both db's are affected by the result set size. > So for twitter type cases, graph db's scale better with total subscriber > base, but not necessarily for an individual 'mega-user'. But let's look at > that case. I believe a real human user would never be able to process > millions of messages a day, so I must assume that this is a bot use case, > perhaps a script that subscribes to enormous numbers of people scanning for > particular patterns of information. Both relational and graph db's will > feel > the strain on a huge resultset. Personally I would probably follow twitters > example and return only paginated result, breaking up the load so the bot > would not negatively impact the network for human users. > > So, getting to pagination, the query you want is perhaps 'give me the > ordered results, as a block of n messages starting at message x'. Obviously > this is easily translatable into SQL for relational databases. For the > graph > database, it is a little more work, but again you start with that verbal > query and figure out what it really means in a graph. There are two key > components: > > - Ordered results > - Starting point and number of results > > For SQL this translates to 'order by', 'where id > x' and 'limit n'. For > graph databases this translates into adding ordering relationships like > message(X)-->PREVIOUS-->message(X+1). On each pagination, you can take the > PREVIOUS message as the begining of the next pagination (possibly passing > the node id in the session). Each pagination is simply a traverser that > runs > down the PREVIOUS chain, exiting the loop after n messages. > > > On Wed, Mar 17, 2010 at 10:19 PM, Lincoln <linxbet...@gmail.com> wrote: > > > Wow dude, this is blowing my mind just a little. > > > > Ok, sticking with the twitter example, I'm concerned about the edge > cases. > > I'd say it's easy to optimize with a relational db or any other storage > > for > > that matter if I make the assumption that people only follow a few > hundred > > people and only want recent messages. However some people follow > hundreds > > of thousands of people. If Guy Kawasaki uses my app, I'd run into a > > problem > > quickly. > > > > However I see your point that I don't have to limit myself to just the > > obvious relationships, but can create relationships that serve specific > > purposes and use-cases such as your day example. I'm not sure how I > would > > want to model my use-case to allow for Guy Kawaski, I'll have to think > more > > about it. Is there a threshold beyond which adding relationships between > > nodes causes problems? If not, or if it's high, you could create custom > > relationships for every type of query you'd want to do. > > > > However, a secondary question comes up. If we continue with the twitter > > example, and I want to be able to page through results, is that directly > > supported through Neo4j's API? Coming from a more traditional storage > > background I tend to think of what I'd want as a sort by time and then a > > skip and limit on the results (so I could say give me messages 1-100 > sorted > > by time descending). Is there anything equivalent in Neo4j or is the > > approach totally different? > > > > Thanks, > > Lincoln > > > > > > On Wed, Mar 17, 2010 at 12:41 PM, Craig Taverner <cr...@amanzi.com> > wrote: > > > > > Hi Lincoln, > > > > > > So it sounds like you don't need the IS_VISIBLE relations after all. > The > > > traverser works by following all relationships of the specified types > and > > > directions from each current node (as you traverse, or walk the graph). > > You > > > can have a complex graph and traverse to high depth very fast > (thousands > > of > > > relationships per second). The traverser will also automatically check > > that > > > the same node is not returned twice. The test for the relationship type > > is > > > efficient. Still reasonable, but less efficient is the custom test you > > > might > > > put in the returnable evaluator, but if the limiting factor is usually > > the > > > number of relationships traversed, and if that is kept managable, the > > > evaluator test is no concern. > > > > > > I think twitter is a good case in point, even with many millions of > > users, > > > you will still only follow perhaps a hundred and they will tweet > perhaps > > a > > > hundred, or a thousand times, so your traverser will find the 10k-100k > > > messages quite quickly. This can be speeded up further, but the right > > > approach depends again on your use case. The idea with using a graph > > > database is that the actual usage probably maps very well to the graph > > > structure, so when deciding how to speed up your search, consider how > it > > > will be used. In twitter one normally only cares about recent messages, > > so > > > how about not linking directly from the user to the message, but link > to > > an > > > intermediate node representing time, for example, a day-node. Then each > > new > > > message is added to the day node for that day, and that will > > automatically > > > become yesterday the next day. Then your traversal can have a stop > > > evaluator > > > to not follow old messages (unless your query is looking for old > > messages, > > > of course). So the 100k messages might drop to only a few hundred, or > > even > > > just a few dozen. Certainly that will be a query of the order of > > > milliseconds! > > > > > > Moving away from the traverser, you also have the option to call > directly > > > the getRelationships() methods from the node. If you structure is > > > predictable, like viewer-->FOLLOWS-->user-->CREATED-->message, then two > > > nested for loops would work, the outer iterating over the followers and > > the > > > inner iterating over the messages. If you changed to add a time-based > > > interim node (which is a kind of graph-index), then you need to have > > three > > > loops. If you made your time index a deeper tree (months->days->hours, > > > etc.), then you would need to further refactor the code. However, if > you > > > stuck with a traverser, you might not need to change the traverser even > > of > > > the graph structure changed, as long as the same relationship types > were > > > maintained. Does that make sense? > > > > > > Cheers, Craig > > > > > > On Wed, Mar 17, 2010 at 4:00 PM, Lincoln <linxbet...@gmail.com> wrote: > > > > > > > Thanks Craig, > > > > > > > > I'd like to clarify my question (I don't think it changes your answer > > > > though). > > > > > > > > I wanted all messages visible to me created by users I follow. Thus, > > the > > > > FOLLOWS relationship is not enough. I'd need to see messages that > are > > > > visible to me and then check if they were created by users I follow, > or > > > I'd > > > > need to see messages created by users I follow and then see if > they're > > > > visible to me. > > > > > > > > I assume your last example still yields the result I'm looking for. > > > Could > > > > you describe what actually happens here though? I'm unclear on what > > the > > > > traversal looks like. Would it first traverse every outgoing FOLLOWS > > > > relationship from the viewer, yielding other users, and then traverse > > all > > > > the CREATED relationships to get to messages? > > > > > > > > Also, given very large numbers of FOLLOWS and CREATED relationships > > (with > > > > say, a twitter graph), how is this made efficient? > > > > > > > > Sorry for all the basic questions but I couldn't find this > information > > in > > > > the docs. If there's something I should be reading before posting > > these > > > > questions, please point me to it. > > > > > > > > Thanks! > > > > > > > > Lincoln > > > > > > > > On Wed, Mar 17, 2010 at 7:06 AM, Craig Taverner <cr...@amanzi.com> > > > wrote: > > > > > > > > > I'm uncertain about one ambiguity in your model, you are able to > find > > > > > messages through FOLLOWS and IS_VISIBLE_BY. These will give two > > > different > > > > > sets, and my first impression was that FOLLOWS gives you the right > > > > answer. > > > > > In other words you want to query for 'all messages by users I > > follow'? > > > In > > > > > that case you do not need IS_VISIBLE_BY. However, if there are > > messages > > > > by > > > > > people you follow, but are not allowed to see, then you also need > the > > > > > IS_VISIBLE_BY. But I would still reconsider linking directly from > the > > > > > viewer > > > > > to the message for that case. I'd rather have the messages linked > to > > > some > > > > > categorization structure for things like 'public', 'private', etc. > > > > > > > > > > Anyway, here are some suggestions for the various approaches above: > > > > > *'all messages by users I follow'* > > > > > val msgs = viewer.traverse( > > > > > Order.BREADTH_FIRST, StopEvaluator.END_OF_GRAPH, > > > > > (tp: TraversalPosition) => IsMessage(tp.currentNode()), > > > > > Rels.FOLLOWS, Direction.OUTGOING, > > > > > Rels.CREATED, Direction.OUTGOING) > > > > > > > > > > *'all messages visible to me'* > > > > > val msgs = viewer.traverse( > > > > > Order.BREADTH_FIRST, StopEvaluator.END_OF_GRAPH, > > > > > ReturnableEvaluator.ALL_BUT_START_NODE, > > > > > Rels.IS_VISIBLE_BY, Direction.INCOMING) > > > > > > > > > > *'all messages, visible to me, by people I follow'* > > > > > val msgs = viewer.traverse( > > > > > Order.BREADTH_FIRST, StopEvaluator.END_OF_GRAPH, > > > > > (tp: TraversalPosition) => { > > > > > val msg = tp.currentNode() > > > > > IsMessage(msg) && IsVisibleBy(msg,viewer) > > > > > }, > > > > > Rels.FOLLOWS, Direction.OUTGOING, > > > > > Rels.CREATED, Direction.OUTGOING) > > > > > > > > > > Of course I assume you make the utility functions IsMessage(node: > > Node) > > > > and > > > > > IsVisibleBy(msg: Node, user: Node), and these will test the > existance > > > of > > > > > properties and relations as appropriate to make the decision. > > > > > > > > > > > > > > > On Wed, Mar 17, 2010 at 6:32 AM, Lincoln <linxbet...@gmail.com> > > wrote: > > > > > > > > > > > Hi, I've just started looking at Neo4j and I'm quite intrigued. > > > > However, > > > > > > the cognitive dissonance that I've grown so used to in modeling > > > storage > > > > > is > > > > > > proving to be a bit difficult to let go at this early stage :) > > > > > > > > > > > > I was hoping that if someone could help me through an example I > > would > > > > be > > > > > > able to grok how to properly structure my data and query it in > > Neo4j. > > > > > > > > > > > > Nodes: > > > > > > Message( text: String ) > > > > > > User( id: Long ) > > > > > > > > > > > > Relationships: > > > > > > CREATED > > > > > > FOLLOWS > > > > > > IS_VISIBLE_BY > > > > > > > > > > > > So I might have a graph with entries like so: > > > > > > > > > > > > User(1) --> CREATED --> Message("i woke up late today") > > > > > > User(2) --> CREATED --> Message("hello") > > > > > > User(3) --> CREATED --> Message("ugh, i hate mondays") > > > > > > > > > > > > User(1) --> FOLLOWS --> User(2) > > > > > > > > > > > > Let's also say all messages are visible to User 1. > > > > > > > > > > > > Message("i woke up late today") --> IS_VISIBLE_BY --> User(1) > > > > > > Message("hello") --> IS_VISIBLE_BY --> User(1) > > > > > > Message("ugh, i hate mondays") --> IS_VISIBLE_BY --> User(1) > > > > > > > > > > > > So, I can do a simple traversal for visible: > > > > > > > > > > > > val graphDb = new EmbeddedGraphDatabase( "path/to/neo4j-db" ) > > > > > > val index = new LuceneIndexService( graphDb ) > > > > > > val viewer = index.getSingleNode("id", 1) > > > > > > val msgs = viewer.traverse( Order.BREADTH_FIRST, > > > > > > StopEvaluator.END_OF_GRAPH, > > > > > > ReturnableEvaluator.ALL_BUT_START_NODE, Rels.IS_VISIBLE_BY, > > > > > > Direction.INCOMING) > > > > > > msgs.toList.map(_.toJson).mkString("{ msgs : [", ",", "] }") // > > > > assuming > > > > > i > > > > > > have the relevant functions > > > > > > > > > > > > But let's say that this is going to return too many messages. > Just > > > > > because > > > > > > all the messages are possibly visible to me, doesn't mean I want > to > > > see > > > > > > them > > > > > > all. So, I'd like to additionally filter by the FOLLOWS > > > relationship. > > > > > I'd > > > > > > like to express "get all messages that are visible and were > created > > > by > > > > a > > > > > > user that I follow." Can someone show me an example of how to do > > > that? > > > > > > > > > > > > I'm guessing that you need to implement a custom > > ReturnableEvaluator, > > > > but > > > > > I > > > > > > don't understand how you traverse multiple relationships at the > > same > > > > > time. > > > > > > > > > > > > Thanks, > > > > > > Lincoln > > > > > > _______________________________________________ > > > > > > Neo mailing list > > > > > > User@lists.neo4j.org > > > > > > https://lists.neo4j.org/mailman/listinfo/user > > > > > > > > > > > _______________________________________________ > > > > > Neo mailing list > > > > > User@lists.neo4j.org > > > > > https://lists.neo4j.org/mailman/listinfo/user > > > > > > > > > _______________________________________________ > > > > Neo mailing list > > > > User@lists.neo4j.org > > > > https://lists.neo4j.org/mailman/listinfo/user > > > > > > > _______________________________________________ > > > Neo mailing list > > > User@lists.neo4j.org > > > https://lists.neo4j.org/mailman/listinfo/user > > > > > _______________________________________________ > > Neo mailing list > > User@lists.neo4j.org > > https://lists.neo4j.org/mailman/listinfo/user > > > _______________________________________________ > Neo mailing list > User@lists.neo4j.org > https://lists.neo4j.org/mailman/listinfo/user > _______________________________________________ Neo mailing list User@lists.neo4j.org https://lists.neo4j.org/mailman/listinfo/user