Re: [Neo] basic questions

Anders Nawroth Thu, 18 Mar 2010 03:39:06 -0700

Hi!

Interesting thread :-)


Craig Taverner wrote:
> of original data. The developers answered that they would store the node-ids
> of the original data in an int[] property. That answer totally took me by
> surprise, and it took a few seconds for me to realize that it would work for
> the initial use case, but was not a graph-like solution, and obviously more
> like a relational database foreign key.
> 
> Not being a graphy solution was not a good enough reason for me to turn down
> the idea, so I had to think a bit further to argue my point. Simply
> replacing the ids with relationships to the original nodes (as I originally
> planned), was not going to give any further functionality.

Using a relationship would guarantee that the node is still there as 
long as the relationship exists. A node id could point to a node that 
has been deleted, or the id could even have been reused for a new node 
after that. So when storing node ids somewhere, your application also 
has to make sure referential integrity isn't lost.


/anders

> Instead it would
> slow down performance slightly (due to increased structure). Then it hit me,
> the power of the graph was that we can continue to add structure, coping
> with far more use cases than a simple foreign key. And the next key piece of
> structure I immediately thought of was a layer of intermediate nodes
> representing an intermediate level of statistics. Suddenly we could add any
> number of levels of detail to the chain, arbitrarily, without affecting the
> user interface or the initial use case. That was certainly not possible with
> the foreign key approach.
> 
> But I also thanked the developer for the suggestion, for without it I would
> not have been forced to find a better argument for the graphy solution, and
> thereby solved a few other requirements :-)
> 
> On Wed, Mar 17, 2010 at 10:53 PM, Tim Langley <timlang...@me.com> wrote:
> 
>> Good point :)
>> In training our new devs (who typically have a RDBC background) it
>> often takes a leap of understanding for them to realise that whilst a
>> join is very expensive, creating and traversing a relationship is very
>> cheap
>>
>> T
>>
>> Sent from my iPhone so please excuse typos adn brevity
>>
>> Tim Langley
>> +44 7989 539363
>>
>> On 17 Mar 2010, at 21:50, Lincoln <linxbet...@gmail.com> wrote:
>>
>>> Ok I get it.  I keep not thinking about relationships flexibly
>>> enough.  I
>>> need to take a little time to get used to thinking about problems
>>> this way.
>>> Thanks for your help!
>>>
>>> On Wed, Mar 17, 2010 at 5:43 PM, Craig Taverner <cr...@amanzi.com>
>>> wrote:
>>>
>>>> I guess I could say that the approach is totally different, but in
>>>> reality
>>>> you start at the same point, working out what query you want. But
>>>> after
>>>> that
>>>> things change. Let's consider two cases, the huge network and the
>>>> paging
>>>> examples you used below.
>>>>
>>>> For the first, keep in mind that if you scale the total database
>>>> size,
>>>> relational db's don't scale as well as graph db's, because in
>>>> relational
>>>> db's performance is related to total table size, whereas graph db's
>>>> performance is not. However, both db's are affected by the result
>>>> set size.
>>>> So for twitter type cases, graph db's scale better with total
>>>> subscriber
>>>> base, but not necessarily for an individual 'mega-user'. But let's
>>>> look at
>>>> that case. I believe a real human user would never be able to process
>>>> millions of messages a day, so I must assume that this is a bot use
>>>> case,
>>>> perhaps a script that subscribes to enormous numbers of people
>>>> scanning for
>>>> particular patterns of information. Both relational and graph db's
>>>> will
>>>> feel
>>>> the strain on a huge resultset. Personally I would probably follow
>>>> twitters
>>>> example and return only paginated result, breaking up the load so
>>>> the bot
>>>> would not negatively impact the network for human users.
>>>>
>>>> So, getting to pagination, the query you want is perhaps 'give me the
>>>> ordered results, as a block of n messages starting at message x'.
>>>> Obviously
>>>> this is easily translatable into SQL for relational databases. For
>>>> the
>>>> graph
>>>> database, it is a little more work, but again you start with that
>>>> verbal
>>>> query and figure out what it really means in a graph. There are two
>>>> key
>>>> components:
>>>>
>>>>  - Ordered results
>>>>  - Starting point and number of results
>>>>
>>>> For SQL this translates to 'order by', 'where id > x' and 'limit
>>>> n'. For
>>>> graph databases this translates into adding ordering relationships
>>>> like
>>>> message(X)-->PREVIOUS-->message(X+1). On each pagination, you can
>>>> take the
>>>> PREVIOUS message as the begining of the next pagination (possibly
>>>> passing
>>>> the node id in the session). Each pagination is simply a traverser
>>>> that
>>>> runs
>>>> down the PREVIOUS chain, exiting the loop after n messages.
>>>>
>>>>
>>>> On Wed, Mar 17, 2010 at 10:19 PM, Lincoln <linxbet...@gmail.com>
>>>> wrote:
>>>>
>>>>> Wow dude, this is blowing my mind just a little.
>>>>>
>>>>> Ok, sticking with the twitter example, I'm concerned about the edge
>>>> cases.
>>>>> I'd say it's easy to optimize with a relational db or any other
>>>>> storage
>>>>> for
>>>>> that matter if I make the assumption that people only follow a few
>>>> hundred
>>>>> people and only want recent messages.  However some people follow
>>>> hundreds
>>>>> of thousands of people.  If Guy Kawasaki uses my app, I'd run into a
>>>>> problem
>>>>> quickly.
>>>>>
>>>>> However I see your point that I don't have to limit myself to just
>>>>> the
>>>>> obvious relationships, but can create relationships that serve
>>>>> specific
>>>>> purposes and use-cases such as your day example.  I'm not sure how I
>>>> would
>>>>> want to model my use-case to allow for Guy Kawaski, I'll have to
>>>>> think
>>>> more
>>>>> about it.  Is there a threshold beyond which adding relationships
>>>>> between
>>>>> nodes causes problems?  If not, or if it's high, you could create
>>>>> custom
>>>>> relationships for every type of query you'd want to do.
>>>>>
>>>>> However, a secondary question comes up.  If we continue with the
>>>>> twitter
>>>>> example, and I want to be able to page through results, is that
>>>>> directly
>>>>> supported through Neo4j's API?  Coming from a more traditional
>>>>> storage
>>>>> background I tend to think of what I'd want as a sort by time and
>>>>> then a
>>>>> skip and limit on the results (so I could say give me messages 1-100
>>>> sorted
>>>>> by time descending).  Is there anything equivalent in Neo4j or is
>>>>> the
>>>>> approach totally different?
>>>>>
>>>>> Thanks,
>>>>> Lincoln
>>>>>
>>>>>
>>>>> On Wed, Mar 17, 2010 at 12:41 PM, Craig Taverner <cr...@amanzi.com>
>>>> wrote:
>>>>>> Hi Lincoln,
>>>>>>
>>>>>> So it sounds like you don't need the IS_VISIBLE relations after
>>>>>> all.
>>>> The
>>>>>> traverser works by following all relationships of the specified
>>>>>> types
>>>> and
>>>>>> directions from each current node (as you traverse, or walk the
>>>>>> graph).
>>>>> You
>>>>>> can have a complex graph and traverse to high depth very fast
>>>> (thousands
>>>>> of
>>>>>> relationships per second). The traverser will also automatically
>>>>>> check
>>>>> that
>>>>>> the same node is not returned twice. The test for the
>>>>>> relationship type
>>>>> is
>>>>>> efficient. Still reasonable, but less efficient is the custom
>>>>>> test you
>>>>>> might
>>>>>> put in the returnable evaluator, but if the limiting factor is
>>>>>> usually
>>>>> the
>>>>>> number of relationships traversed, and if that is kept managable,
>>>>>> the
>>>>>> evaluator test is no concern.
>>>>>>
>>>>>> I think twitter is a good case in point, even with many millions of
>>>>> users,
>>>>>> you will still only follow perhaps a hundred and they will tweet
>>>> perhaps
>>>>> a
>>>>>> hundred, or a thousand times, so your traverser will find the
>>>>>> 10k-100k
>>>>>> messages quite quickly. This can be speeded up further, but the
>>>>>> right
>>>>>> approach depends again on your use case. The idea with using a
>>>>>> graph
>>>>>> database is that the actual usage probably maps very well to the
>>>>>> graph
>>>>>> structure, so when deciding how to speed up your search, consider
>>>>>> how
>>>> it
>>>>>> will be used. In twitter one normally only cares about recent
>>>>>> messages,
>>>>> so
>>>>>> how about not linking directly from the user to the message, but
>>>>>> link
>>>> to
>>>>> an
>>>>>> intermediate node representing time, for example, a day-node.
>>>>>> Then each
>>>>> new
>>>>>> message is added to the day node for that day, and that will
>>>>> automatically
>>>>>> become yesterday the next day. Then your traversal can have a stop
>>>>>> evaluator
>>>>>> to not follow old messages (unless your query is looking for old
>>>>> messages,
>>>>>> of course). So the 100k messages might drop to only a few
>>>>>> hundred, or
>>>>> even
>>>>>> just a few dozen. Certainly that will be a query of the order of
>>>>>> milliseconds!
>>>>>>
>>>>>> Moving away from the traverser, you also have the option to call
>>>> directly
>>>>>> the getRelationships() methods from the node. If you structure is
>>>>>> predictable, like viewer-->FOLLOWS-->user-->CREATED-->message,
>>>>>> then two
>>>>>> nested for loops would work, the outer iterating over the
>>>>>> followers and
>>>>> the
>>>>>> inner iterating over the messages. If you changed to add a time-
>>>>>> based
>>>>>> interim node (which is a kind of graph-index), then you need to
>>>>>> have
>>>>> three
>>>>>> loops. If you made your time index a deeper tree (months->days-
>>>>>>> hours,
>>>>>> etc.), then you would need to further refactor the code. However,
>>>>>> if
>>>> you
>>>>>> stuck with a traverser, you might not need to change the
>>>>>> traverser even
>>>>> of
>>>>>> the graph structure changed, as long as the same relationship types
>>>> were
>>>>>> maintained. Does that make sense?
>>>>>>
>>>>>> Cheers, Craig
>>>>>>
>>>>>> On Wed, Mar 17, 2010 at 4:00 PM, Lincoln <linxbet...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Thanks Craig,
>>>>>>>
>>>>>>> I'd like to clarify my question (I don't think it changes your
>>>>>>> answer
>>>>>>> though).
>>>>>>>
>>>>>>> I wanted all messages visible to me created by users I follow.
>>>>>>> Thus,
>>>>> the
>>>>>>> FOLLOWS relationship is not enough.  I'd need to see messages that
>>>> are
>>>>>>> visible to me and then check if they were created by users I
>>>>>>> follow,
>>>> or
>>>>>> I'd
>>>>>>> need to see messages created by users I follow and then see if
>>>> they're
>>>>>>> visible to me.
>>>>>>>
>>>>>>> I assume your last example still yields the result I'm looking
>>>>>>> for.
>>>>>> Could
>>>>>>> you describe what actually happens here though?  I'm unclear on
>>>>>>> what
>>>>> the
>>>>>>> traversal looks like.  Would it first traverse every outgoing
>>>>>>> FOLLOWS
>>>>>>> relationship from the viewer, yielding other users, and then
>>>>>>> traverse
>>>>> all
>>>>>>> the CREATED relationships to get to messages?
>>>>>>>
>>>>>>> Also, given very large numbers of FOLLOWS and CREATED
>>>>>>> relationships
>>>>> (with
>>>>>>> say, a twitter graph), how is this made efficient?
>>>>>>>
>>>>>>> Sorry for all the basic questions but I couldn't find this
>>>> information
>>>>> in
>>>>>>> the docs.  If there's something I should be reading before posting
>>>>> these
>>>>>>> questions, please point me to it.
>>>>>>>
>>>>>>> Thanks!
>>>>>>>
>>>>>>> Lincoln
>>>>>>>
>>>>>>> On Wed, Mar 17, 2010 at 7:06 AM, Craig Taverner <cr...@amanzi.com>
>>>>>> wrote:
>>>>>>>> I'm uncertain about one ambiguity in your model, you are able to
>>>> find
>>>>>>>> messages through FOLLOWS and IS_VISIBLE_BY. These will give two
>>>>>> different
>>>>>>>> sets, and my first impression was that FOLLOWS gives you the
>>>>>>>> right
>>>>>>> answer.
>>>>>>>> In other words you want to query for 'all messages by users I
>>>>> follow'?
>>>>>> In
>>>>>>>> that case you do not need IS_VISIBLE_BY. However, if there are
>>>>> messages
>>>>>>> by
>>>>>>>> people you follow, but are not allowed to see, then you also need
>>>> the
>>>>>>>> IS_VISIBLE_BY. But I would still reconsider linking directly from
>>>> the
>>>>>>>> viewer
>>>>>>>> to the message for that case. I'd rather have the messages linked
>>>> to
>>>>>> some
>>>>>>>> categorization structure for things like 'public', 'private',
>>>>>>>> etc.
>>>>>>>>
>>>>>>>> Anyway, here are some suggestions for the various approaches
>>>>>>>> above:
>>>>>>>> *'all messages by users I follow'*
>>>>>>>> val msgs = viewer.traverse(
>>>>>>>> Order.BREADTH_FIRST, StopEvaluator.END_OF_GRAPH,
>>>>>>>> (tp: TraversalPosition) => IsMessage(tp.currentNode()),
>>>>>>>> Rels.FOLLOWS, Direction.OUTGOING,
>>>>>>>> Rels.CREATED, Direction.OUTGOING)
>>>>>>>>
>>>>>>>> *'all messages visible to me'*
>>>>>>>> val msgs = viewer.traverse(
>>>>>>>> Order.BREADTH_FIRST, StopEvaluator.END_OF_GRAPH,
>>>>>>>> ReturnableEvaluator.ALL_BUT_START_NODE,
>>>>>>>> Rels.IS_VISIBLE_BY, Direction.INCOMING)
>>>>>>>>
>>>>>>>> *'all messages, visible to me, by people I follow'*
>>>>>>>> val msgs = viewer.traverse(
>>>>>>>> Order.BREADTH_FIRST, StopEvaluator.END_OF_GRAPH,
>>>>>>>> (tp: TraversalPosition) => {
>>>>>>>> val msg = tp.currentNode()
>>>>>>>> IsMessage(msg) && IsVisibleBy(msg,viewer)
>>>>>>>> },
>>>>>>>> Rels.FOLLOWS, Direction.OUTGOING,
>>>>>>>> Rels.CREATED, Direction.OUTGOING)
>>>>>>>>
>>>>>>>> Of course I assume you make the utility functions IsMessage(node:
>>>>> Node)
>>>>>>> and
>>>>>>>> IsVisibleBy(msg: Node, user: Node), and these will test the
>>>> existance
>>>>>> of
>>>>>>>> properties and relations as appropriate to make the decision.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Mar 17, 2010 at 6:32 AM, Lincoln <linxbet...@gmail.com>
>>>>> wrote:
>>>>>>>>> Hi, I've just started looking at Neo4j and I'm quite intrigued.
>>>>>>> However,
>>>>>>>>> the cognitive dissonance that I've grown so used to in modeling
>>>>>> storage
>>>>>>>> is
>>>>>>>>> proving to be a bit difficult to let go at this early stage :)
>>>>>>>>>
>>>>>>>>> I was hoping that if someone could help me through an example I
>>>>> would
>>>>>>> be
>>>>>>>>> able to grok how to properly structure my data and query it in
>>>>> Neo4j.
>>>>>>>>> Nodes:
>>>>>>>>> Message( text: String )
>>>>>>>>> User( id: Long )
>>>>>>>>>
>>>>>>>>> Relationships:
>>>>>>>>> CREATED
>>>>>>>>> FOLLOWS
>>>>>>>>> IS_VISIBLE_BY
>>>>>>>>>
>>>>>>>>> So I might have a graph with entries like so:
>>>>>>>>>
>>>>>>>>> User(1) --> CREATED --> Message("i woke up late today")
>>>>>>>>> User(2) --> CREATED --> Message("hello")
>>>>>>>>> User(3) --> CREATED --> Message("ugh, i hate mondays")
>>>>>>>>>
>>>>>>>>> User(1) --> FOLLOWS --> User(2)
>>>>>>>>>
>>>>>>>>> Let's also say all messages are visible to User 1.
>>>>>>>>>
>>>>>>>>> Message("i woke up late today") --> IS_VISIBLE_BY --> User(1)
>>>>>>>>> Message("hello") --> IS_VISIBLE_BY --> User(1)
>>>>>>>>> Message("ugh, i hate mondays") --> IS_VISIBLE_BY --> User(1)
>>>>>>>>>
>>>>>>>>> So, I can do a simple traversal for visible:
>>>>>>>>>
>>>>>>>>> val graphDb = new EmbeddedGraphDatabase( "path/to/neo4j-db" )
>>>>>>>>> val index = new LuceneIndexService( graphDb )
>>>>>>>>> val viewer = index.getSingleNode("id", 1)
>>>>>>>>> val msgs = viewer.traverse( Order.BREADTH_FIRST,
>>>>>>>>> StopEvaluator.END_OF_GRAPH,
>>>>>>>>> ReturnableEvaluator.ALL_BUT_START_NODE, Rels.IS_VISIBLE_BY,
>>>>>>>>> Direction.INCOMING)
>>>>>>>>> msgs.toList.map(_.toJson).mkString("{ msgs : [", ",", "] }")  //
>>>>>>> assuming
>>>>>>>> i
>>>>>>>>> have the relevant functions
>>>>>>>>>
>>>>>>>>> But let's say that this is going to return too many messages.
>>>> Just
>>>>>>>> because
>>>>>>>>> all the messages are possibly visible to me, doesn't mean I want
>>>> to
>>>>>> see
>>>>>>>>> them
>>>>>>>>> all.  So, I'd like to additionally filter by the FOLLOWS
>>>>>> relationship.
>>>>>>>> I'd
>>>>>>>>> like to express "get all messages that are visible and were
>>>> created
>>>>>> by
>>>>>>> a
>>>>>>>>> user that I follow."  Can someone show me an example of how to
>>>>>>>>> do
>>>>>> that?
>>>>>>>>> I'm guessing that you need to implement a custom
>>>>> ReturnableEvaluator,
>>>>>>> but
>>>>>>>> I
>>>>>>>>> don't understand how you traverse multiple relationships at the
>>>>> same
>>>>>>>> time.
>>>>>>>>> Thanks,
>>>>>>>>> Lincoln
>>>>>>>>> _______________________________________________
>>>>>>>>> Neo mailing list
>>>>>>>>> User@lists.neo4j.org
>>>>>>>>> https://lists.neo4j.org/mailman/listinfo/user
>>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> Neo mailing list
>>>>>>>> User@lists.neo4j.org
>>>>>>>> https://lists.neo4j.org/mailman/listinfo/user
>>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Neo mailing list
>>>>>>> User@lists.neo4j.org
>>>>>>> https://lists.neo4j.org/mailman/listinfo/user
>>>>>>>
>>>>>> _______________________________________________
>>>>>> Neo mailing list
>>>>>> User@lists.neo4j.org
>>>>>> https://lists.neo4j.org/mailman/listinfo/user
>>>>>>
>>>>> _______________________________________________
>>>>> Neo mailing list
>>>>> User@lists.neo4j.org
>>>>> https://lists.neo4j.org/mailman/listinfo/user
>>>>>
>>>> _______________________________________________
>>>> Neo mailing list
>>>> User@lists.neo4j.org
>>>> https://lists.neo4j.org/mailman/listinfo/user
>>>>
>>> _______________________________________________
>>> Neo mailing list
>>> User@lists.neo4j.org
>>> https://lists.neo4j.org/mailman/listinfo/user
>> _______________________________________________
>> Neo mailing list
>> User@lists.neo4j.org
>> https://lists.neo4j.org/mailman/listinfo/user
>>
> _______________________________________________
> Neo mailing list
> User@lists.neo4j.org
> https://lists.neo4j.org/mailman/listinfo/user
_______________________________________________
Neo mailing list
User@lists.neo4j.org
https://lists.neo4j.org/mailman/listinfo/user

Re: [Neo] basic questions

Reply via email to