Re: [Neo] basic questions

Tim Langley Wed, 17 Mar 2010 14:53:49 -0700

Good point :)
In training our new devs (who typically have a RDBC background) it  
often takes a leap of understanding for them to realise that whilst a  
join is very expensive, creating and traversing a relationship is very  
cheap


T

Sent from my iPhone so please excuse typos adn brevity

Tim Langley
+44 7989 539363

On 17 Mar 2010, at 21:50, Lincoln <[email protected]> wrote:

> Ok I get it.  I keep not thinking about relationships flexibly  
> enough.  I
> need to take a little time to get used to thinking about problems  
> this way.
> Thanks for your help!
>
> On Wed, Mar 17, 2010 at 5:43 PM, Craig Taverner <[email protected]>  
> wrote:
>
>> I guess I could say that the approach is totally different, but in  
>> reality
>> you start at the same point, working out what query you want. But  
>> after
>> that
>> things change. Let's consider two cases, the huge network and the  
>> paging
>> examples you used below.
>>
>> For the first, keep in mind that if you scale the total database  
>> size,
>> relational db's don't scale as well as graph db's, because in  
>> relational
>> db's performance is related to total table size, whereas graph db's
>> performance is not. However, both db's are affected by the result  
>> set size.
>> So for twitter type cases, graph db's scale better with total  
>> subscriber
>> base, but not necessarily for an individual 'mega-user'. But let's  
>> look at
>> that case. I believe a real human user would never be able to process
>> millions of messages a day, so I must assume that this is a bot use  
>> case,
>> perhaps a script that subscribes to enormous numbers of people  
>> scanning for
>> particular patterns of information. Both relational and graph db's  
>> will
>> feel
>> the strain on a huge resultset. Personally I would probably follow  
>> twitters
>> example and return only paginated result, breaking up the load so  
>> the bot
>> would not negatively impact the network for human users.
>>
>> So, getting to pagination, the query you want is perhaps 'give me the
>> ordered results, as a block of n messages starting at message x'.  
>> Obviously
>> this is easily translatable into SQL for relational databases. For  
>> the
>> graph
>> database, it is a little more work, but again you start with that  
>> verbal
>> query and figure out what it really means in a graph. There are two  
>> key
>> components:
>>
>>  - Ordered results
>>  - Starting point and number of results
>>
>> For SQL this translates to 'order by', 'where id > x' and 'limit  
>> n'. For
>> graph databases this translates into adding ordering relationships  
>> like
>> message(X)-->PREVIOUS-->message(X+1). On each pagination, you can  
>> take the
>> PREVIOUS message as the begining of the next pagination (possibly  
>> passing
>> the node id in the session). Each pagination is simply a traverser  
>> that
>> runs
>> down the PREVIOUS chain, exiting the loop after n messages.
>>
>>
>> On Wed, Mar 17, 2010 at 10:19 PM, Lincoln <[email protected]>  
>> wrote:
>>
>>> Wow dude, this is blowing my mind just a little.
>>>
>>> Ok, sticking with the twitter example, I'm concerned about the edge
>> cases.
>>> I'd say it's easy to optimize with a relational db or any other  
>>> storage
>>> for
>>> that matter if I make the assumption that people only follow a few
>> hundred
>>> people and only want recent messages.  However some people follow
>> hundreds
>>> of thousands of people.  If Guy Kawasaki uses my app, I'd run into a
>>> problem
>>> quickly.
>>>
>>> However I see your point that I don't have to limit myself to just  
>>> the
>>> obvious relationships, but can create relationships that serve  
>>> specific
>>> purposes and use-cases such as your day example.  I'm not sure how I
>> would
>>> want to model my use-case to allow for Guy Kawaski, I'll have to  
>>> think
>> more
>>> about it.  Is there a threshold beyond which adding relationships  
>>> between
>>> nodes causes problems?  If not, or if it's high, you could create  
>>> custom
>>> relationships for every type of query you'd want to do.
>>>
>>> However, a secondary question comes up.  If we continue with the  
>>> twitter
>>> example, and I want to be able to page through results, is that  
>>> directly
>>> supported through Neo4j's API?  Coming from a more traditional  
>>> storage
>>> background I tend to think of what I'd want as a sort by time and  
>>> then a
>>> skip and limit on the results (so I could say give me messages 1-100
>> sorted
>>> by time descending).  Is there anything equivalent in Neo4j or is  
>>> the
>>> approach totally different?
>>>
>>> Thanks,
>>> Lincoln
>>>
>>>
>>> On Wed, Mar 17, 2010 at 12:41 PM, Craig Taverner <[email protected]>
>> wrote:
>>>
>>>> Hi Lincoln,
>>>>
>>>> So it sounds like you don't need the IS_VISIBLE relations after  
>>>> all.
>> The
>>>> traverser works by following all relationships of the specified  
>>>> types
>> and
>>>> directions from each current node (as you traverse, or walk the  
>>>> graph).
>>> You
>>>> can have a complex graph and traverse to high depth very fast
>> (thousands
>>> of
>>>> relationships per second). The traverser will also automatically  
>>>> check
>>> that
>>>> the same node is not returned twice. The test for the  
>>>> relationship type
>>> is
>>>> efficient. Still reasonable, but less efficient is the custom  
>>>> test you
>>>> might
>>>> put in the returnable evaluator, but if the limiting factor is  
>>>> usually
>>> the
>>>> number of relationships traversed, and if that is kept managable,  
>>>> the
>>>> evaluator test is no concern.
>>>>
>>>> I think twitter is a good case in point, even with many millions of
>>> users,
>>>> you will still only follow perhaps a hundred and they will tweet
>> perhaps
>>> a
>>>> hundred, or a thousand times, so your traverser will find the  
>>>> 10k-100k
>>>> messages quite quickly. This can be speeded up further, but the  
>>>> right
>>>> approach depends again on your use case. The idea with using a  
>>>> graph
>>>> database is that the actual usage probably maps very well to the  
>>>> graph
>>>> structure, so when deciding how to speed up your search, consider  
>>>> how
>> it
>>>> will be used. In twitter one normally only cares about recent  
>>>> messages,
>>> so
>>>> how about not linking directly from the user to the message, but  
>>>> link
>> to
>>> an
>>>> intermediate node representing time, for example, a day-node.  
>>>> Then each
>>> new
>>>> message is added to the day node for that day, and that will
>>> automatically
>>>> become yesterday the next day. Then your traversal can have a stop
>>>> evaluator
>>>> to not follow old messages (unless your query is looking for old
>>> messages,
>>>> of course). So the 100k messages might drop to only a few  
>>>> hundred, or
>>> even
>>>> just a few dozen. Certainly that will be a query of the order of
>>>> milliseconds!
>>>>
>>>> Moving away from the traverser, you also have the option to call
>> directly
>>>> the getRelationships() methods from the node. If you structure is
>>>> predictable, like viewer-->FOLLOWS-->user-->CREATED-->message,  
>>>> then two
>>>> nested for loops would work, the outer iterating over the  
>>>> followers and
>>> the
>>>> inner iterating over the messages. If you changed to add a time- 
>>>> based
>>>> interim node (which is a kind of graph-index), then you need to  
>>>> have
>>> three
>>>> loops. If you made your time index a deeper tree (months->days- 
>>>> >hours,
>>>> etc.), then you would need to further refactor the code. However,  
>>>> if
>> you
>>>> stuck with a traverser, you might not need to change the  
>>>> traverser even
>>> of
>>>> the graph structure changed, as long as the same relationship types
>> were
>>>> maintained. Does that make sense?
>>>>
>>>> Cheers, Craig
>>>>
>>>> On Wed, Mar 17, 2010 at 4:00 PM, Lincoln <[email protected]>  
>>>> wrote:
>>>>
>>>>> Thanks Craig,
>>>>>
>>>>> I'd like to clarify my question (I don't think it changes your  
>>>>> answer
>>>>> though).
>>>>>
>>>>> I wanted all messages visible to me created by users I follow.   
>>>>> Thus,
>>> the
>>>>> FOLLOWS relationship is not enough.  I'd need to see messages that
>> are
>>>>> visible to me and then check if they were created by users I  
>>>>> follow,
>> or
>>>> I'd
>>>>> need to see messages created by users I follow and then see if
>> they're
>>>>> visible to me.
>>>>>
>>>>> I assume your last example still yields the result I'm looking  
>>>>> for.
>>>> Could
>>>>> you describe what actually happens here though?  I'm unclear on  
>>>>> what
>>> the
>>>>> traversal looks like.  Would it first traverse every outgoing  
>>>>> FOLLOWS
>>>>> relationship from the viewer, yielding other users, and then  
>>>>> traverse
>>> all
>>>>> the CREATED relationships to get to messages?
>>>>>
>>>>> Also, given very large numbers of FOLLOWS and CREATED  
>>>>> relationships
>>> (with
>>>>> say, a twitter graph), how is this made efficient?
>>>>>
>>>>> Sorry for all the basic questions but I couldn't find this
>> information
>>> in
>>>>> the docs.  If there's something I should be reading before posting
>>> these
>>>>> questions, please point me to it.
>>>>>
>>>>> Thanks!
>>>>>
>>>>> Lincoln
>>>>>
>>>>> On Wed, Mar 17, 2010 at 7:06 AM, Craig Taverner <[email protected]>
>>>> wrote:
>>>>>
>>>>>> I'm uncertain about one ambiguity in your model, you are able to
>> find
>>>>>> messages through FOLLOWS and IS_VISIBLE_BY. These will give two
>>>> different
>>>>>> sets, and my first impression was that FOLLOWS gives you the  
>>>>>> right
>>>>> answer.
>>>>>> In other words you want to query for 'all messages by users I
>>> follow'?
>>>> In
>>>>>> that case you do not need IS_VISIBLE_BY. However, if there are
>>> messages
>>>>> by
>>>>>> people you follow, but are not allowed to see, then you also need
>> the
>>>>>> IS_VISIBLE_BY. But I would still reconsider linking directly from
>> the
>>>>>> viewer
>>>>>> to the message for that case. I'd rather have the messages linked
>> to
>>>> some
>>>>>> categorization structure for things like 'public', 'private',  
>>>>>> etc.
>>>>>>
>>>>>> Anyway, here are some suggestions for the various approaches  
>>>>>> above:
>>>>>> *'all messages by users I follow'*
>>>>>> val msgs = viewer.traverse(
>>>>>> Order.BREADTH_FIRST, StopEvaluator.END_OF_GRAPH,
>>>>>> (tp: TraversalPosition) => IsMessage(tp.currentNode()),
>>>>>> Rels.FOLLOWS, Direction.OUTGOING,
>>>>>> Rels.CREATED, Direction.OUTGOING)
>>>>>>
>>>>>> *'all messages visible to me'*
>>>>>> val msgs = viewer.traverse(
>>>>>> Order.BREADTH_FIRST, StopEvaluator.END_OF_GRAPH,
>>>>>> ReturnableEvaluator.ALL_BUT_START_NODE,
>>>>>> Rels.IS_VISIBLE_BY, Direction.INCOMING)
>>>>>>
>>>>>> *'all messages, visible to me, by people I follow'*
>>>>>> val msgs = viewer.traverse(
>>>>>> Order.BREADTH_FIRST, StopEvaluator.END_OF_GRAPH,
>>>>>> (tp: TraversalPosition) => {
>>>>>> val msg = tp.currentNode()
>>>>>> IsMessage(msg) && IsVisibleBy(msg,viewer)
>>>>>> },
>>>>>> Rels.FOLLOWS, Direction.OUTGOING,
>>>>>> Rels.CREATED, Direction.OUTGOING)
>>>>>>
>>>>>> Of course I assume you make the utility functions IsMessage(node:
>>> Node)
>>>>> and
>>>>>> IsVisibleBy(msg: Node, user: Node), and these will test the
>> existance
>>>> of
>>>>>> properties and relations as appropriate to make the decision.
>>>>>>
>>>>>>
>>>>>> On Wed, Mar 17, 2010 at 6:32 AM, Lincoln <[email protected]>
>>> wrote:
>>>>>>
>>>>>>> Hi, I've just started looking at Neo4j and I'm quite intrigued.
>>>>> However,
>>>>>>> the cognitive dissonance that I've grown so used to in modeling
>>>> storage
>>>>>> is
>>>>>>> proving to be a bit difficult to let go at this early stage :)
>>>>>>>
>>>>>>> I was hoping that if someone could help me through an example I
>>> would
>>>>> be
>>>>>>> able to grok how to properly structure my data and query it in
>>> Neo4j.
>>>>>>>
>>>>>>> Nodes:
>>>>>>> Message( text: String )
>>>>>>> User( id: Long )
>>>>>>>
>>>>>>> Relationships:
>>>>>>> CREATED
>>>>>>> FOLLOWS
>>>>>>> IS_VISIBLE_BY
>>>>>>>
>>>>>>> So I might have a graph with entries like so:
>>>>>>>
>>>>>>> User(1) --> CREATED --> Message("i woke up late today")
>>>>>>> User(2) --> CREATED --> Message("hello")
>>>>>>> User(3) --> CREATED --> Message("ugh, i hate mondays")
>>>>>>>
>>>>>>> User(1) --> FOLLOWS --> User(2)
>>>>>>>
>>>>>>> Let's also say all messages are visible to User 1.
>>>>>>>
>>>>>>> Message("i woke up late today") --> IS_VISIBLE_BY --> User(1)
>>>>>>> Message("hello") --> IS_VISIBLE_BY --> User(1)
>>>>>>> Message("ugh, i hate mondays") --> IS_VISIBLE_BY --> User(1)
>>>>>>>
>>>>>>> So, I can do a simple traversal for visible:
>>>>>>>
>>>>>>> val graphDb = new EmbeddedGraphDatabase( "path/to/neo4j-db" )
>>>>>>> val index = new LuceneIndexService( graphDb )
>>>>>>> val viewer = index.getSingleNode("id", 1)
>>>>>>> val msgs = viewer.traverse( Order.BREADTH_FIRST,
>>>>>>> StopEvaluator.END_OF_GRAPH,
>>>>>>> ReturnableEvaluator.ALL_BUT_START_NODE, Rels.IS_VISIBLE_BY,
>>>>>>> Direction.INCOMING)
>>>>>>> msgs.toList.map(_.toJson).mkString("{ msgs : [", ",", "] }")  //
>>>>> assuming
>>>>>> i
>>>>>>> have the relevant functions
>>>>>>>
>>>>>>> But let's say that this is going to return too many messages.
>> Just
>>>>>> because
>>>>>>> all the messages are possibly visible to me, doesn't mean I want
>> to
>>>> see
>>>>>>> them
>>>>>>> all.  So, I'd like to additionally filter by the FOLLOWS
>>>> relationship.
>>>>>> I'd
>>>>>>> like to express "get all messages that are visible and were
>> created
>>>> by
>>>>> a
>>>>>>> user that I follow."  Can someone show me an example of how to  
>>>>>>> do
>>>> that?
>>>>>>>
>>>>>>> I'm guessing that you need to implement a custom
>>> ReturnableEvaluator,
>>>>> but
>>>>>> I
>>>>>>> don't understand how you traverse multiple relationships at the
>>> same
>>>>>> time.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Lincoln
>>>>>>> _______________________________________________
>>>>>>> Neo mailing list
>>>>>>> [email protected]
>>>>>>> https://lists.neo4j.org/mailman/listinfo/user
>>>>>>>
>>>>>> _______________________________________________
>>>>>> Neo mailing list
>>>>>> [email protected]
>>>>>> https://lists.neo4j.org/mailman/listinfo/user
>>>>>>
>>>>> _______________________________________________
>>>>> Neo mailing list
>>>>> [email protected]
>>>>> https://lists.neo4j.org/mailman/listinfo/user
>>>>>
>>>> _______________________________________________
>>>> Neo mailing list
>>>> [email protected]
>>>> https://lists.neo4j.org/mailman/listinfo/user
>>>>
>>> _______________________________________________
>>> Neo mailing list
>>> [email protected]
>>> https://lists.neo4j.org/mailman/listinfo/user
>>>
>> _______________________________________________
>> Neo mailing list
>> [email protected]
>> https://lists.neo4j.org/mailman/listinfo/user
>>
> _______________________________________________
> Neo mailing list
> [email protected]
> https://lists.neo4j.org/mailman/listinfo/user
_______________________________________________
Neo mailing list
[email protected]
https://lists.neo4j.org/mailman/listinfo/user

Re: [Neo] basic questions

Reply via email to