Re: [Neo] basic questions

Lincoln Wed, 17 Mar 2010 14:50:41 -0700

Ok, I get it.  I keep not thinking about relationships flexibly enough.  I
need to take a little time to get used to thinking about problems this way.
Thanks for your help!


On Wed, Mar 17, 2010 at 5:43 PM, Craig Taverner <[email protected]> wrote:

> I guess I could say that the approach is totally different, but in reality
> you start at the same point, working out what query you want. But after
> that things change. Let's consider two cases, the huge network and the
> paging examples you used below.
>
> For the first, keep in mind that as the total database size grows,
> relational db's don't scale as well as graph db's, because in relational
> db's performance is related to total table size, whereas in graph db's it
> is not. However, both are affected by the result set size. So for
> Twitter-type cases, graph db's scale better with the total subscriber
> base, but not necessarily for an individual 'mega-user'. But let's look at
> that case. I believe a real human user would never be able to process
> millions of messages a day, so I must assume that this is a bot use case,
> perhaps a script that subscribes to enormous numbers of people, scanning
> for particular patterns of information. Both relational and graph db's
> will feel the strain of a huge result set. Personally I would probably
> follow Twitter's example and return only paginated results, breaking up
> the load so the bot would not negatively impact the network for human
> users.
>
> So, getting to pagination, the query you want is perhaps 'give me the
> ordered results, as a block of n messages starting at message x'. Obviously
> this is easily translatable into SQL for relational databases. For the
> graph database, it is a little more work, but again you start with that
> verbal query and figure out what it really means in a graph. There are two
> key components:
>
>   - Ordered results
>   - Starting point and number of results
>
> For SQL this translates to 'order by', 'where id > x' and 'limit n'. For
> graph databases this translates into adding ordering relationships like
> message(X)-->PREVIOUS-->message(X+1). On each pagination, you can take the
> PREVIOUS message as the beginning of the next pagination (possibly passing
> the node id in the session). Each pagination is simply a traverser that
> runs down the PREVIOUS chain, exiting the loop after n messages.
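
To make the PREVIOUS-chain idea concrete, here is a minimal sketch in plain
Scala. It does not use the Neo4j API: the Message case class and the page
function are hypothetical stand-ins, with the `previous` field playing the
role of an outgoing PREVIOUS relationship to the next-older message.

```scala
object PreviousChainPaging {
  // Hypothetical stand-in for a message node; `previous` models the
  // PREVIOUS relationship pointing at the next-older message.
  final case class Message(id: Long, text: String, previous: Option[Message])

  // Walk the PREVIOUS chain from `start`, collecting at most `n` messages.
  // Returns the page and the message where the next page should begin
  // (None once the chain is exhausted).
  def page(start: Option[Message], n: Int): (List[Message], Option[Message]) = {
    @annotation.tailrec
    def loop(cur: Option[Message], left: Int,
             acc: List[Message]): (List[Message], Option[Message]) =
      cur match {
        case Some(m) if left > 0 => loop(m.previous, left - 1, m :: acc)
        case _                   => (acc.reverse, cur)
      }
    loop(start, n, Nil)
  }
}
```

The second element of the result is what one might keep in the session (for
example as a node id) so the next request can continue from there.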
>
>
> On Wed, Mar 17, 2010 at 10:19 PM, Lincoln <[email protected]> wrote:
>
> > Wow dude, this is blowing my mind just a little.
> >
> > Ok, sticking with the twitter example, I'm concerned about the edge
> > cases.  I'd say it's easy to optimize with a relational db or any other
> > storage for that matter if I make the assumption that people only follow
> > a few hundred people and only want recent messages.  However some people
> > follow hundreds of thousands of people.  If Guy Kawasaki uses my app, I'd
> > run into a problem quickly.
> >
> > However I see your point that I don't have to limit myself to just the
> > obvious relationships, but can create relationships that serve specific
> > purposes and use-cases such as your day example.  I'm not sure how I
> > would want to model my use-case to allow for Guy Kawasaki; I'll have to
> > think more about it.  Is there a threshold beyond which adding
> > relationships between nodes causes problems?  If not, or if it's high,
> > you could create custom relationships for every type of query you'd want
> > to do.
> >
> > However, a secondary question comes up.  If we continue with the twitter
> > example, and I want to be able to page through results, is that directly
> > supported through Neo4j's API?  Coming from a more traditional storage
> > background I tend to think of what I'd want as a sort by time and then a
> > skip and limit on the results (so I could say give me messages 1-100
> > sorted by time descending).  Is there anything equivalent in Neo4j or is
> > the approach totally different?
> >
> > Thanks,
> > Lincoln
> >
> >
> > On Wed, Mar 17, 2010 at 12:41 PM, Craig Taverner <[email protected]> wrote:
> >
> > > Hi Lincoln,
> > >
> > > So it sounds like you don't need the IS_VISIBLE_BY relations after all.
> > > The traverser works by following all relationships of the specified
> > > types and directions from each current node (as you traverse, or walk
> > > the graph). You can have a complex graph and traverse to high depth very
> > > fast (thousands of relationships per second). The traverser will also
> > > automatically check that the same node is not returned twice. The test
> > > for the relationship type is efficient. Less efficient, but still
> > > reasonable, is the custom test you might put in the returnable
> > > evaluator. The limiting factor is usually the number of relationships
> > > traversed, and if that is kept manageable, the evaluator test is no
> > > concern.
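
The behaviour described above (follow only the given relationship types,
never visit a node twice, filter what is returned) can be sketched in plain
Scala as follows. This is an illustration over a hypothetical in-memory edge
list, not the Neo4j traverser itself; Edge and traverse are made-up names.

```scala
import scala.collection.mutable

object TraverserSketch {
  // Hypothetical in-memory graph: each edge is (from, relationship type, to).
  final case class Edge(from: String, relType: String, to: String)

  // Breadth-first walk from `start`, following only outgoing edges whose
  // type is in `relTypes`. Each node is visited at most once; nodes that
  // pass `returnable` are collected in visit order (start node excluded).
  def traverse(edges: Seq[Edge], start: String, relTypes: Set[String])
              (returnable: String => Boolean): List[String] = {
    val outgoing = edges.filter(e => relTypes(e.relType)).groupBy(_.from)
    val visited  = mutable.Set(start)
    val queue    = mutable.Queue(start)
    val result   = List.newBuilder[String]
    while (queue.nonEmpty) {
      val node = queue.dequeue()
      if (node != start && returnable(node)) result += node
      outgoing.getOrElse(node, Nil).foreach { e =>
        // `add` returns false for nodes already seen, so cycles are harmless.
        if (visited.add(e.to)) queue.enqueue(e.to)
      }
    }
    result.result()
  }
}
```

Note that nothing in this sketch limits depth, so a walk over FOLLOWS and
CREATED also collects the start user's own messages and anything reachable
through chains of FOLLOWS edges.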
> > >
> > > I think twitter is a good case in point: even with many millions of
> > > users, you will still only follow perhaps a hundred, and they will each
> > > tweet perhaps a hundred or a thousand times, so your traverser will find
> > > the 10k-100k messages quite quickly. This can be sped up further, but
> > > the right approach depends again on your use case. The idea with using a
> > > graph database is that the actual usage probably maps very well to the
> > > graph structure, so when deciding how to speed up your search, consider
> > > how it will be used. In twitter one normally only cares about recent
> > > messages, so how about not linking directly from the user to the
> > > message, but linking to an intermediate node representing time, for
> > > example a day-node. Then each new message is added to the day node for
> > > that day, and that will automatically become yesterday the next day.
> > > Then your traversal can have a stop evaluator to not follow old messages
> > > (unless your query is looking for old messages, of course). So the 100k
> > > messages might drop to only a few hundred, or even just a few dozen.
> > > Certainly that will be a query of the order of milliseconds!
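
A rough sketch of the day-node idea, again in plain Scala rather than Neo4j:
a nested map stands in for user --> day-node --> message, and the `since`
cut-off plays the role of the stop evaluator. DayIndex, add and recent are
hypothetical names chosen for this illustration.

```scala
import java.time.LocalDate

object DayIndexSketch {
  final case class Message(author: String, text: String)

  // Hypothetical stand-in for user --> day-node --> message: per author,
  // messages are bucketed by the calendar day they were posted.
  type DayIndex = Map[String, Map[LocalDate, List[Message]]]

  // Attach a message to its author's bucket for `day`.
  def add(index: DayIndex, msg: Message, day: LocalDate): DayIndex = {
    val days = index.getOrElse(msg.author, Map.empty[LocalDate, List[Message]])
    index.updated(msg.author, days.updated(day, msg :: days.getOrElse(day, Nil)))
  }

  // Messages by followed authors, taken only from day buckets on or after
  // `since`; messages in older buckets are never looked at.
  def recent(index: DayIndex, follows: Set[String], since: LocalDate): List[Message] =
    for {
      author      <- follows.toList.sorted
      (day, msgs) <- index.getOrElse(author, Map.empty[LocalDate, List[Message]]).toList
      if !day.isBefore(since)
      m           <- msgs
    } yield m
}
```

The cost of `recent` depends on how many recent buckets the followed users
have, not on how many messages they have ever written.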
> > >
> > > Moving away from the traverser, you also have the option to call the
> > > getRelationships() methods directly on the node. If your structure is
> > > predictable, like viewer-->FOLLOWS-->user-->CREATED-->message, then two
> > > nested for loops would work, the outer iterating over the users you
> > > follow and the inner iterating over their messages. If you changed to
> > > add a time-based interim node (which is a kind of graph-index), then you
> > > need to have three loops. If you made your time index a deeper tree
> > > (months->days->hours, etc.), then you would need to further refactor the
> > > code. However, if you stuck with a traverser, you might not need to
> > > change the traverser even if the graph structure changed, as long as the
> > > same relationship types were maintained. Does that make sense?
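
The two-nested-loops variant could look roughly like this. The adjacency
maps are hypothetical stand-ins for what getRelationships() would return for
FOLLOWS and CREATED; no Neo4j types are used.

```scala
object NestedLoopsSketch {
  // viewer --> FOLLOWS --> user --> CREATED --> message, as two nested
  // loops: the outer over followed users, the inner over their messages.
  // `follows` and `created` are hypothetical adjacency maps.
  def messagesFor(viewer: String,
                  follows: Map[String, List[String]],
                  created: Map[String, List[String]]): List[String] =
    for {
      user <- follows.getOrElse(viewer, Nil)
      msg  <- created.getOrElse(user, Nil)
    } yield msg
}
```

Inserting a day level between user and message would mean adding a third
generator here, which is the refactoring cost mentioned above.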
> > >
> > > Cheers, Craig
> > >
> > > On Wed, Mar 17, 2010 at 4:00 PM, Lincoln <[email protected]> wrote:
> > >
> > > > Thanks Craig,
> > > >
> > > > I'd like to clarify my question (I don't think it changes your answer
> > > > though).
> > > >
> > > > I wanted all messages visible to me created by users I follow.  Thus,
> > > > the FOLLOWS relationship is not enough.  I'd need to see messages that
> > > > are visible to me and then check if they were created by users I
> > > > follow, or I'd need to see messages created by users I follow and then
> > > > see if they're visible to me.
> > > >
> > > > I assume your last example still yields the result I'm looking for.
> > > > Could you describe what actually happens here though?  I'm unclear on
> > > > what the traversal looks like.  Would it first traverse every outgoing
> > > > FOLLOWS relationship from the viewer, yielding other users, and then
> > > > traverse all the CREATED relationships to get to messages?
> > > >
> > > > Also, given very large numbers of FOLLOWS and CREATED relationships
> > > > (with say, a twitter graph), how is this made efficient?
> > > >
> > > > Sorry for all the basic questions but I couldn't find this information
> > > > in the docs.  If there's something I should be reading before posting
> > > > these questions, please point me to it.
> > > >
> > > > Thanks!
> > > >
> > > > Lincoln
> > > >
> > > > On Wed, Mar 17, 2010 at 7:06 AM, Craig Taverner <[email protected]> wrote:
> > > >
> > > > > I'm uncertain about one ambiguity in your model: you are able to find
> > > > > messages through FOLLOWS and IS_VISIBLE_BY. These will give two
> > > > > different sets, and my first impression was that FOLLOWS gives you the
> > > > > right answer. In other words, you want to query for 'all messages by
> > > > > users I follow'? In that case you do not need IS_VISIBLE_BY. However,
> > > > > if there are messages by people you follow that you are not allowed to
> > > > > see, then you also need IS_VISIBLE_BY. But I would still reconsider
> > > > > linking directly from the viewer to the message for that case. I'd
> > > > > rather have the messages linked to some categorization structure for
> > > > > things like 'public', 'private', etc.
> > > > > Anyway, here are some suggestions for the various approaches above:
> > > > > *'all messages by users I follow'*
> > > > > val msgs = viewer.traverse(
> > > > >   Order.BREADTH_FIRST, StopEvaluator.END_OF_GRAPH,
> > > > >   (tp: TraversalPosition) => IsMessage(tp.currentNode()),
> > > > >   Rels.FOLLOWS, Direction.OUTGOING,
> > > > >   Rels.CREATED, Direction.OUTGOING)
> > > > >
> > > > > *'all messages visible to me'*
> > > > > val msgs = viewer.traverse(
> > > > >   Order.BREADTH_FIRST, StopEvaluator.END_OF_GRAPH,
> > > > >   ReturnableEvaluator.ALL_BUT_START_NODE,
> > > > >   Rels.IS_VISIBLE_BY, Direction.INCOMING)
> > > > >
> > > > > *'all messages, visible to me, by people I follow'*
> > > > > val msgs = viewer.traverse(
> > > > >   Order.BREADTH_FIRST, StopEvaluator.END_OF_GRAPH,
> > > > >   (tp: TraversalPosition) => {
> > > > >     val msg = tp.currentNode()
> > > > >     IsMessage(msg) && IsVisibleBy(msg, viewer)
> > > > >   },
> > > > >   Rels.FOLLOWS, Direction.OUTGOING,
> > > > >   Rels.CREATED, Direction.OUTGOING)
> > > > >
> > > > > Of course I assume you make the utility functions IsMessage(node:
> > > > > Node) and IsVisibleBy(msg: Node, user: Node), and these will test the
> > > > > existence of properties and relations as appropriate to make the
> > > > > decision.
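
As an illustration of what such utility functions might check, here is a
small sketch. The Node case class below is a hypothetical in-memory model
and not the Neo4j Node interface; it assumes a message is recognised by its
`text` property and that IS_VISIBLE_BY edges are stored as a set of user ids.

```scala
object EvaluatorHelpers {
  // Hypothetical minimal node model (not org.neo4j.graphdb.Node):
  // `props` holds properties, `visibleTo` the ids of users reached by
  // IS_VISIBLE_BY relationships from this node.
  final case class Node(id: Long,
                        props: Map[String, Any],
                        visibleTo: Set[Long] = Set.empty)

  // Assumption: a node is a message when it carries a `text` property.
  def isMessage(node: Node): Boolean = node.props.contains("text")

  // A message is visible to `user` when an IS_VISIBLE_BY edge points at
  // them, modelled here as membership in `visibleTo`.
  def isVisibleBy(msg: Node, user: Node): Boolean =
    msg.visibleTo.contains(user.id)
}
```

The combined test from the third traversal above then corresponds to
`isMessage(n) && isVisibleBy(n, viewer)`.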
> > > > >
> > > > >
> > > > > On Wed, Mar 17, 2010 at 6:32 AM, Lincoln <[email protected]> wrote:
> > > > >
> > > > > > Hi, I've just started looking at Neo4j and I'm quite intrigued.
> > > > > > However, the cognitive dissonance that I've grown so used to in
> > > > > > modeling storage is proving to be a bit difficult to let go at this
> > > > > > early stage :)
> > > > > >
> > > > > > I was hoping that if someone could help me through an example I
> > > > > > would be able to grok how to properly structure my data and query it
> > > > > > in Neo4j.
> > > > > >
> > > > > > Nodes:
> > > > > > Message( text: String )
> > > > > > User( id: Long )
> > > > > >
> > > > > > Relationships:
> > > > > > CREATED
> > > > > > FOLLOWS
> > > > > > IS_VISIBLE_BY
> > > > > >
> > > > > > So I might have a graph with entries like so:
> > > > > >
> > > > > > User(1) --> CREATED --> Message("i woke up late today")
> > > > > > User(2) --> CREATED --> Message("hello")
> > > > > > User(3) --> CREATED --> Message("ugh, i hate mondays")
> > > > > >
> > > > > > User(1) --> FOLLOWS --> User(2)
> > > > > >
> > > > > > Let's also say all messages are visible to User 1.
> > > > > >
> > > > > > Message("i woke up late today") --> IS_VISIBLE_BY --> User(1)
> > > > > > Message("hello") --> IS_VISIBLE_BY --> User(1)
> > > > > > Message("ugh, i hate mondays") --> IS_VISIBLE_BY --> User(1)
> > > > > >
> > > > > > So, I can do a simple traversal for visible:
> > > > > >
> > > > > > val graphDb = new EmbeddedGraphDatabase( "path/to/neo4j-db" )
> > > > > > val index = new LuceneIndexService( graphDb )
> > > > > > val viewer = index.getSingleNode("id", 1)
> > > > > > val msgs = viewer.traverse( Order.BREADTH_FIRST,
> > > > > >   StopEvaluator.END_OF_GRAPH,
> > > > > >   ReturnableEvaluator.ALL_BUT_START_NODE, Rels.IS_VISIBLE_BY,
> > > > > >   Direction.INCOMING)
> > > > > > // assuming i have the relevant functions
> > > > > > msgs.toList.map(_.toJson).mkString("{ msgs : [", ",", "] }")
> > > > > >
> > > > > > But let's say that this is going to return too many messages.  Just
> > > > > > because all the messages are possibly visible to me, doesn't mean I
> > > > > > want to see them all.  So, I'd like to additionally filter by the
> > > > > > FOLLOWS relationship.  I'd like to express "get all messages that
> > > > > > are visible and were created by a user that I follow."  Can someone
> > > > > > show me an example of how to do that?
> > > > > >
> > > > > > I'm guessing that you need to implement a custom
> > > > > > ReturnableEvaluator, but I don't understand how you traverse
> > > > > > multiple relationships at the same time.
> > > > > >
> > > > > > Thanks,
> > > > > > Lincoln
> > > > > > _______________________________________________
> > > > > > Neo mailing list
> > > > > > [email protected]
> > > > > > https://lists.neo4j.org/mailman/listinfo/user
