Re: [Neo] basic questions

Craig Taverner Wed, 17 Mar 2010 14:43:37 -0700

I guess I could say that the approach is totally different, but in reality
you start at the same point, working out what query you want. But after that
things change. Let's consider two cases, the huge network and the paging
examples you used below.


For the first, keep in mind that if you scale the total database size,
relational db's don't scale as well as graph db's, because in relational
db's performance is related to total table size, whereas graph db's
performance is not. However, both db's are affected by the result set size.
So for twitter type cases, graph db's scale better with total subscriber
base, but not necessarily for an individual 'mega-user'. But let's look at
that case. I believe a real human user would never be able to process
millions of messages a day, so I must assume that this is a bot use case,
perhaps a script that subscribes to enormous numbers of people scanning for
particular patterns of information. Both relational and graph db's will feel
the strain on a huge resultset. Personally I would probably follow twitters
example and return only paginated result, breaking up the load so the bot
would not negatively impact the network for human users.

So, getting to pagination, the query you want is perhaps 'give me the
ordered results, as a block of n messages starting at message x'. Obviously
this is easily translatable into SQL for relational databases. For the graph
database, it is a little more work, but again you start with that verbal
query and figure out what it really means in a graph. There are two key
components:

   - Ordered results
   - Starting point and number of results

For SQL this translates to 'order by', 'where id > x' and 'limit n'. For
graph databases this translates into adding ordering relationships like
message(X)-->PREVIOUS-->message(X+1). On each pagination, you can take the
PREVIOUS message as the begining of the next pagination (possibly passing
the node id in the session). Each pagination is simply a traverser that runs
down the PREVIOUS chain, exiting the loop after n messages.


On Wed, Mar 17, 2010 at 10:19 PM, Lincoln <[email protected]> wrote:

> Wow dude, this is blowing my mind just a little.
>
> Ok, sticking with the twitter example, I'm concerned about the edge cases.
>  I'd say it's easy to optimize with a relational db or any other storage
> for
> that matter if I make the assumption that people only follow a few hundred
> people and only want recent messages.  However some people follow hundreds
> of thousands of people.  If Guy Kawasaki uses my app, I'd run into a
> problem
> quickly.
>
> However I see your point that I don't have to limit myself to just the
> obvious relationships, but can create relationships that serve specific
> purposes and use-cases such as your day example.  I'm not sure how I would
> want to model my use-case to allow for Guy Kawaski, I'll have to think more
> about it.  Is there a threshold beyond which adding relationships between
> nodes causes problems?  If not, or if it's high, you could create custom
> relationships for every type of query you'd want to do.
>
> However, a secondary question comes up.  If we continue with the twitter
> example, and I want to be able to page through results, is that directly
> supported through Neo4j's API?  Coming from a more traditional storage
> background I tend to think of what I'd want as a sort by time and then a
> skip and limit on the results (so I could say give me messages 1-100 sorted
> by time descending).  Is there anything equivalent in Neo4j or is the
> approach totally different?
>
> Thanks,
> Lincoln
>
>
> On Wed, Mar 17, 2010 at 12:41 PM, Craig Taverner <[email protected]> wrote:
>
> > Hi Lincoln,
> >
> > So it sounds like you don't need the IS_VISIBLE relations after all. The
> > traverser works by following all relationships of the specified types and
> > directions from each current node (as you traverse, or walk the graph).
> You
> > can have a complex graph and traverse to high depth very fast (thousands
> of
> > relationships per second). The traverser will also automatically check
> that
> > the same node is not returned twice. The test for the relationship type
> is
> > efficient. Still reasonable, but less efficient is the custom test you
> > might
> > put in the returnable evaluator, but if the limiting factor is usually
> the
> > number of relationships traversed, and if that is kept managable, the
> > evaluator test is no concern.
> >
> > I think twitter is a good case in point, even with many millions of
> users,
> > you will still only follow perhaps a hundred and they will tweet perhaps
> a
> > hundred, or a thousand times, so your traverser will find the 10k-100k
> > messages quite quickly. This can be speeded up further, but the right
> > approach depends again on your use case. The idea with using a graph
> > database is that the actual usage probably maps very well to the graph
> > structure, so when deciding how to speed up your search, consider how it
> > will be used. In twitter one normally only cares about recent messages,
> so
> > how about not linking directly from the user to the message, but link to
> an
> > intermediate node representing time, for example, a day-node. Then each
> new
> > message is added to the day node for that day, and that will
> automatically
> > become yesterday the next day. Then your traversal can have a stop
> > evaluator
> > to not follow old messages (unless your query is looking for old
> messages,
> > of course). So the 100k messages might drop to only a few hundred, or
> even
> > just a few dozen. Certainly that will be a query of the order of
> > milliseconds!
> >
> > Moving away from the traverser, you also have the option to call directly
> > the getRelationships() methods from the node. If you structure is
> > predictable, like viewer-->FOLLOWS-->user-->CREATED-->message, then two
> > nested for loops would work, the outer iterating over the followers and
> the
> > inner iterating over the messages. If you changed to add a time-based
> > interim node (which is a kind of graph-index), then you need to have
> three
> > loops. If you made your time index a deeper tree (months->days->hours,
> > etc.), then you would need to further refactor the code. However, if you
> > stuck with a traverser, you might not need to change the traverser even
> of
> > the graph structure changed, as long as the same relationship types were
> > maintained. Does that make sense?
> >
> > Cheers, Craig
> >
> > On Wed, Mar 17, 2010 at 4:00 PM, Lincoln <[email protected]> wrote:
> >
> > > Thanks Craig,
> > >
> > > I'd like to clarify my question (I don't think it changes your answer
> > > though).
> > >
> > > I wanted all messages visible to me created by users I follow.  Thus,
> the
> > > FOLLOWS relationship is not enough.  I'd need to see messages that are
> > > visible to me and then check if they were created by users I follow, or
> > I'd
> > > need to see messages created by users I follow and then see if they're
> > > visible to me.
> > >
> > > I assume your last example still yields the result I'm looking for.
> >  Could
> > > you describe what actually happens here though?  I'm unclear on what
> the
> > > traversal looks like.  Would it first traverse every outgoing FOLLOWS
> > > relationship from the viewer, yielding other users, and then traverse
> all
> > > the CREATED relationships to get to messages?
> > >
> > > Also, given very large numbers of FOLLOWS and CREATED relationships
> (with
> > > say, a twitter graph), how is this made efficient?
> > >
> > > Sorry for all the basic questions but I couldn't find this information
> in
> > > the docs.  If there's something I should be reading before posting
> these
> > > questions, please point me to it.
> > >
> > > Thanks!
> > >
> > > Lincoln
> > >
> > > On Wed, Mar 17, 2010 at 7:06 AM, Craig Taverner <[email protected]>
> > wrote:
> > >
> > > > I'm uncertain about one ambiguity in your model, you are able to find
> > > > messages through FOLLOWS and IS_VISIBLE_BY. These will give two
> > different
> > > > sets, and my first impression was that FOLLOWS gives you the right
> > > answer.
> > > > In other words you want to query for 'all messages by users I
> follow'?
> > In
> > > > that case you do not need IS_VISIBLE_BY. However, if there are
> messages
> > > by
> > > > people you follow, but are not allowed to see, then you also need the
> > > > IS_VISIBLE_BY. But I would still reconsider linking directly from the
> > > > viewer
> > > > to the message for that case. I'd rather have the messages linked to
> > some
> > > > categorization structure for things like 'public', 'private', etc.
> > > >
> > > > Anyway, here are some suggestions for the various approaches above:
> > > > *'all messages by users I follow'*
> > > > val msgs = viewer.traverse(
> > > > Order.BREADTH_FIRST, StopEvaluator.END_OF_GRAPH,
> > > > (tp: TraversalPosition) => IsMessage(tp.currentNode()),
> > > > Rels.FOLLOWS, Direction.OUTGOING,
> > > > Rels.CREATED, Direction.OUTGOING)
> > > >
> > > > *'all messages visible to me'*
> > > > val msgs = viewer.traverse(
> > > > Order.BREADTH_FIRST, StopEvaluator.END_OF_GRAPH,
> > > > ReturnableEvaluator.ALL_BUT_START_NODE,
> > > > Rels.IS_VISIBLE_BY, Direction.INCOMING)
> > > >
> > > > *'all messages, visible to me, by people I follow'*
> > > > val msgs = viewer.traverse(
> > > > Order.BREADTH_FIRST, StopEvaluator.END_OF_GRAPH,
> > > > (tp: TraversalPosition) => {
> > > >  val msg = tp.currentNode()
> > > >  IsMessage(msg) && IsVisibleBy(msg,viewer)
> > > > },
> > > > Rels.FOLLOWS, Direction.OUTGOING,
> > > > Rels.CREATED, Direction.OUTGOING)
> > > >
> > > > Of course I assume you make the utility functions IsMessage(node:
> Node)
> > > and
> > > > IsVisibleBy(msg: Node, user: Node), and these will test the existance
> > of
> > > > properties and relations as appropriate to make the decision.
> > > >
> > > >
> > > > On Wed, Mar 17, 2010 at 6:32 AM, Lincoln <[email protected]>
> wrote:
> > > >
> > > > > Hi, I've just started looking at Neo4j and I'm quite intrigued.
> > >  However,
> > > > > the cognitive dissonance that I've grown so used to in modeling
> > storage
> > > > is
> > > > > proving to be a bit difficult to let go at this early stage :)
> > > > >
> > > > > I was hoping that if someone could help me through an example I
> would
> > > be
> > > > > able to grok how to properly structure my data and query it in
> Neo4j.
> > > > >
> > > > > Nodes:
> > > > > Message( text: String )
> > > > > User( id: Long )
> > > > >
> > > > > Relationships:
> > > > > CREATED
> > > > > FOLLOWS
> > > > > IS_VISIBLE_BY
> > > > >
> > > > > So I might have a graph with entries like so:
> > > > >
> > > > > User(1) --> CREATED --> Message("i woke up late today")
> > > > > User(2) --> CREATED --> Message("hello")
> > > > > User(3) --> CREATED --> Message("ugh, i hate mondays")
> > > > >
> > > > > User(1) --> FOLLOWS --> User(2)
> > > > >
> > > > > Let's also say all messages are visible to User 1.
> > > > >
> > > > > Message("i woke up late today") --> IS_VISIBLE_BY --> User(1)
> > > > > Message("hello") --> IS_VISIBLE_BY --> User(1)
> > > > > Message("ugh, i hate mondays") --> IS_VISIBLE_BY --> User(1)
> > > > >
> > > > > So, I can do a simple traversal for visible:
> > > > >
> > > > > val graphDb = new EmbeddedGraphDatabase( "path/to/neo4j-db" )
> > > > > val index = new LuceneIndexService( graphDb )
> > > > > val viewer = index.getSingleNode("id", 1)
> > > > > val msgs = viewer.traverse( Order.BREADTH_FIRST,
> > > > > StopEvaluator.END_OF_GRAPH,
> > > > > ReturnableEvaluator.ALL_BUT_START_NODE, Rels.IS_VISIBLE_BY,
> > > > > Direction.INCOMING)
> > > > > msgs.toList.map(_.toJson).mkString("{ msgs : [", ",", "] }")  //
> > > assuming
> > > > i
> > > > > have the relevant functions
> > > > >
> > > > > But let's say that this is going to return too many messages.  Just
> > > > because
> > > > > all the messages are possibly visible to me, doesn't mean I want to
> > see
> > > > > them
> > > > > all.  So, I'd like to additionally filter by the FOLLOWS
> > relationship.
> > > >  I'd
> > > > > like to express "get all messages that are visible and were created
> > by
> > > a
> > > > > user that I follow."  Can someone show me an example of how to do
> > that?
> > > > >
> > > > > I'm guessing that you need to implement a custom
> ReturnableEvaluator,
> > > but
> > > > I
> > > > > don't understand how you traverse multiple relationships at the
> same
> > > > time.
> > > > >
> > > > > Thanks,
> > > > > Lincoln
> > > > > _______________________________________________
> > > > > Neo mailing list
> > > > > [email protected]
> > > > > https://lists.neo4j.org/mailman/listinfo/user
> > > > >
> > > > _______________________________________________
> > > > Neo mailing list
> > > > [email protected]
> > > > https://lists.neo4j.org/mailman/listinfo/user
> > > >
> > > _______________________________________________
> > > Neo mailing list
> > > [email protected]
> > > https://lists.neo4j.org/mailman/listinfo/user
> > >
> > _______________________________________________
> > Neo mailing list
> > [email protected]
> > https://lists.neo4j.org/mailman/listinfo/user
> >
> _______________________________________________
> Neo mailing list
> [email protected]
> https://lists.neo4j.org/mailman/listinfo/user
>
_______________________________________________
Neo mailing list
[email protected]
https://lists.neo4j.org/mailman/listinfo/user

Re: [Neo] basic questions

Reply via email to