Re: [Neo4j] Activity Streams and Twitter Sample App
Rene, you mean the fork that Mattias has been providing you with for handling millions of relationship types? It is possible to take that into master, however, I think Johan said we need to test and make sure all cases are covered. Don't have an ETA on that ATM. I think you could raise an issue and point to Mattias branch to keep track of the feature! Cheers, /peter neubauer GTalk: neubauer.peter Skype peter.neubauer Phone +46 704 106975 LinkedIn http://www.linkedin.com/in/neubauer Twitter http://twitter.com/peterneubauer http://www.neo4j.org - NOSQL for the Enterprise. http://startupbootcamp.org/- Öresund - Innovation happens HERE. 2011/11/26 René Pickhardt r.pickha...@googlemail.com @ qcon: I can't find the twitter videos on http://qconlondon.com/london-2011/ rick could you please point to further rescources? @ aseem: I gave a detailed answer to your questions on http://www.rene-pickhardt.de/graphity-an-efficient-graph-model-for-retrieving-the-top-k-news-feeds-for-users-in-social-networks/#comment-1295 but I think we have the same oppinion. The approach that you wrote down here is exactly what my baseline STOU does. you have your friendship relations as a (S)tar and the events as (T)emporal (O)rdered (U)pdeats ( = STOU ) The merging traversal is the same in STOU and Graphity. The only difference is that in STOU I have to sort the egonetwork O(d log d) where d is infact the outdegree. Once we understand this everything else is easy: You look at your kind of application and decide what your ratio of read to write operations is. Once reading happens far more frequently you better use graphity in the other case your method seems more sufficiant. What I have learnt from your comments that not only the ratio of reads to writes is interesting but the distribution of outdegrees should be considered when deciding which algorithm to use.. If the outdegree really stays small your approach will be more efficiently. @maxdemermazi: performance with bigger k drops in both cases. retrieval with graphity is O(k log k) where STOU is O(d log d + k log k) and for small k the d part is just dominating. So what's happening here is that graphity will always be faster since no sorting of the egonetwork is involved. Besides. Why would you want to have such a big k. Even if your twitter stream genereated that many updates when you come to the website you will only get to see a small k of them and as soon as you scroll down you will see the next k + l tweets... @custom neo4j: well let's ask Peter if neotechnology will include support to handle that many relationship types again to neo4j. I certainly would be happy about that. 2011/11/22 maxdemarzi maxdema...@gmail.com A couple of things bothered me about Rene's approach. 1. Custom Neo4j that could handle lots of relationship types. Not 100% sure where this fits? 2. Performance drops like a bomb when K (the number of items retrieved) increases. So it kind of works to get the top 15 items... A twitter stream could have 500 items easy per update (on my iphone every few hours). -- View this message in context: http://neo4j-community-discussions.438527.n3.nabble.com/Activity-Streams-and-Twitter-Sample-App-tp3477669p3526543.html Sent from the Neo4j Community Discussions mailing list archive at Nabble.com. ___ Neo4j mailing list User@lists.neo4j.org https://lists.neo4j.org/mailman/listinfo/user -- -- mobile: +49 (0)176 6433 2481 Skype: +49 (0)6131 / 4958926 Skype: rene.pickhardt www.rene-pickhardt.de http://www.beijing-china-blog.com ___ Neo4j mailing list User@lists.neo4j.org https://lists.neo4j.org/mailman/listinfo/user ___ Neo4j mailing list User@lists.neo4j.org https://lists.neo4j.org/mailman/listinfo/user
Re: [Neo4j] Activity Streams and Twitter Sample App
@ qcon: I can't find the twitter videos on http://qconlondon.com/london-2011/ rick could you please point to further rescources? @ aseem: I gave a detailed answer to your questions on http://www.rene-pickhardt.de/graphity-an-efficient-graph-model-for-retrieving-the-top-k-news-feeds-for-users-in-social-networks/#comment-1295 but I think we have the same oppinion. The approach that you wrote down here is exactly what my baseline STOU does. you have your friendship relations as a (S)tar and the events as (T)emporal (O)rdered (U)pdeats ( = STOU ) The merging traversal is the same in STOU and Graphity. The only difference is that in STOU I have to sort the egonetwork O(d log d) where d is infact the outdegree. Once we understand this everything else is easy: You look at your kind of application and decide what your ratio of read to write operations is. Once reading happens far more frequently you better use graphity in the other case your method seems more sufficiant. What I have learnt from your comments that not only the ratio of reads to writes is interesting but the distribution of outdegrees should be considered when deciding which algorithm to use.. If the outdegree really stays small your approach will be more efficiently. @maxdemermazi: performance with bigger k drops in both cases. retrieval with graphity is O(k log k) where STOU is O(d log d + k log k) and for small k the d part is just dominating. So what's happening here is that graphity will always be faster since no sorting of the egonetwork is involved. Besides. Why would you want to have such a big k. Even if your twitter stream genereated that many updates when you come to the website you will only get to see a small k of them and as soon as you scroll down you will see the next k + l tweets... @custom neo4j: well let's ask Peter if neotechnology will include support to handle that many relationship types again to neo4j. I certainly would be happy about that. 2011/11/22 maxdemarzi maxdema...@gmail.com A couple of things bothered me about Rene's approach. 1. Custom Neo4j that could handle lots of relationship types. Not 100% sure where this fits? 2. Performance drops like a bomb when K (the number of items retrieved) increases. So it kind of works to get the top 15 items... A twitter stream could have 500 items easy per update (on my iphone every few hours). -- View this message in context: http://neo4j-community-discussions.438527.n3.nabble.com/Activity-Streams-and-Twitter-Sample-App-tp3477669p3526543.html Sent from the Neo4j Community Discussions mailing list archive at Nabble.com. ___ Neo4j mailing list User@lists.neo4j.org https://lists.neo4j.org/mailman/listinfo/user -- -- mobile: +49 (0)176 6433 2481 Skype: +49 (0)6131 / 4958926 Skype: rene.pickhardt www.rene-pickhardt.de http://www.beijing-china-blog.com ___ Neo4j mailing list User@lists.neo4j.org https://lists.neo4j.org/mailman/listinfo/user
Re: [Neo4j] Activity Streams and Twitter Sample App
You might even be interested in Rene Pickards work on a full solution (albeit with some write-time overhead), see http://www.rene-pickhardt.de/graphity-an-efficient-graph-model-for-retrieving-the-top-k-news-feeds-for-users-in-social-networks/ Cheers, /peter neubauer GTalk: neubauer.peter Skype peter.neubauer Phone +46 704 106975 LinkedIn http://www.linkedin.com/in/neubauer Twitter http://twitter.com/peterneubauer http://www.neo4j.org - NOSQL for the Enterprise. http://startupbootcamp.org/ - Öresund - Innovation happens HERE. On Thu, Nov 3, 2011 at 10:36 PM, maxdemarzi maxdema...@gmail.com wrote: I had not considered imperfect solutions, and in some activity stream scenarios a sampling of the last few messages could work. The sample would have to be taken from all Person nodes because if we sample from the Tweets in general and we encounter a chatty person node early on, it would take up all the sample space. Person1 - Person 2 - 10 tweets Person1 - Person3 - 1000 tweets = Sample Size reached, traversal stops Person1 - Person4 - 10 tweets Person1 - PersonX - 10 tweets Person3 would prevent Person4 to PersonX's tweets from ever making it to the sample. Some applications in domains like Financials or Network Monitoring may require the last known status and sampling might not be acceptable. -- View this message in context: http://neo4j-community-discussions.438527.n3.nabble.com/Activity-Streams-and-Twitter-Sample-App-tp3477669p3478477.html Sent from the Neo4j Community Discussions mailing list archive at Nabble.com. ___ Neo4j mailing list User@lists.neo4j.org https://lists.neo4j.org/mailman/listinfo/user ___ Neo4j mailing list User@lists.neo4j.org https://lists.neo4j.org/mailman/listinfo/user
Re: [Neo4j] Activity Streams and Twitter Sample App
This is very interesting -- thanks Peter for the link, and thanks maxdemarzi for starting this conversation. In our social network -- which has extremely little load, it's just in beta -- we currently use basically (a), and it works just fine. We use Cypher to do the sorting/trimming on the server, e.g. roughly this query: START user=(123) MATCH (user) -[:FOLLOWS]- (friend) -[:EVENT]- (event) RETURN event ORDER BY event.timestamp DESC LIMIT 20 This obviously will not scale, since Neo4j has to read every single potential event from disk and process it in memory. We too have been thinking about this in the back of our minds, and the obvious solution in our minds is indeed to use a linked list timeline for events, and have a custom traverser that merge sorts in realtime. Since this'd be a linked list where you want the newest events at the head of the list, there will indeed be a small write-time cost of having to update two relationships whenever a new event is created (i.e. inserting it at the head of the list). Rene mentions this approach as the first (baseline) approach he tried. It's at 3:08 in the video in that post. He decides that approach isn't good enough, however, because merge sorting depends on the number of friends you have (# users you follow), which he wants to avoid at read-time. IMHO however, that's not worth optimizing for. It's rare that a user follows a very high number of other people. E.g. it's usually around let's say 100-1000 people. The flip side -- the # of *followers* -- can be very high. And interestingly, Rene's solution suffers from *that* at write-time. E.g. a new event has to update every *followers'* Graphity index. I'll provide Rene this feedback, but FWIW, I have the feeling that a simple timeline will easily be good enough, in fact optimal, for a social network on Neo4j: - O(1) writes - O(d) reads in space performance - O(d log d) reads in time performance Where d is the out-degree -- # of users you follow -- which is generally small. Cheers, Aseem On Mon, Nov 21, 2011 at 6:07 AM, Peter Neubauer peter.neuba...@neotechnology.com wrote: You might even be interested in Rene Pickards work on a full solution (albeit with some write-time overhead), see http://www.rene-pickhardt.de/graphity-an-efficient-graph-model-for-retrieving-the-top-k-news-feeds-for-users-in-social-networks/ Cheers, /peter neubauer GTalk: neubauer.peter Skype peter.neubauer Phone +46 704 106975 LinkedIn http://www.linkedin.com/in/neubauer Twitter http://twitter.com/peterneubauer http://www.neo4j.org - NOSQL for the Enterprise. http://startupbootcamp.org/- Öresund - Innovation happens HERE. On Thu, Nov 3, 2011 at 10:36 PM, maxdemarzi maxdema...@gmail.com wrote: I had not considered imperfect solutions, and in some activity stream scenarios a sampling of the last few messages could work. The sample would have to be taken from all Person nodes because if we sample from the Tweets in general and we encounter a chatty person node early on, it would take up all the sample space. Person1 - Person 2 - 10 tweets Person1 - Person3 - 1000 tweets = Sample Size reached, traversal stops Person1 - Person4 - 10 tweets Person1 - PersonX - 10 tweets Person3 would prevent Person4 to PersonX's tweets from ever making it to the sample. Some applications in domains like Financials or Network Monitoring may require the last known status and sampling might not be acceptable. -- View this message in context: http://neo4j-community-discussions.438527.n3.nabble.com/Activity-Streams-and-Twitter-Sample-App-tp3477669p3478477.html Sent from the Neo4j Community Discussions mailing list archive at Nabble.com. ___ Neo4j mailing list User@lists.neo4j.org https://lists.neo4j.org/mailman/listinfo/user ___ Neo4j mailing list User@lists.neo4j.org https://lists.neo4j.org/mailman/listinfo/user ___ Neo4j mailing list User@lists.neo4j.org https://lists.neo4j.org/mailman/listinfo/user
Re: [Neo4j] Activity Streams and Twitter Sample App
You might also want to take a look at the videos from the Twitter presentations from QCon London. -Original Message- From: user-boun...@lists.neo4j.org [mailto:user-boun...@lists.neo4j.org] On Behalf Of Aseem Kishore Sent: Monday, November 21, 2011 4:43 PM To: Neo4j user discussions Subject: Re: [Neo4j] Activity Streams and Twitter Sample App This is very interesting -- thanks Peter for the link, and thanks maxdemarzi for starting this conversation. In our social network -- which has extremely little load, it's just in beta -- we currently use basically (a), and it works just fine. We use Cypher to do the sorting/trimming on the server, e.g. roughly this query: START user=(123) MATCH (user) -[:FOLLOWS]- (friend) -[:EVENT]- (event) RETURN event ORDER BY event.timestamp DESC LIMIT 20 This obviously will not scale, since Neo4j has to read every single potential event from disk and process it in memory. We too have been thinking about this in the back of our minds, and the obvious solution in our minds is indeed to use a linked list timeline for events, and have a custom traverser that merge sorts in realtime. Since this'd be a linked list where you want the newest events at the head of the list, there will indeed be a small write-time cost of having to update two relationships whenever a new event is created (i.e. inserting it at the head of the list). Rene mentions this approach as the first (baseline) approach he tried. It's at 3:08 in the video in that post. He decides that approach isn't good enough, however, because merge sorting depends on the number of friends you have (# users you follow), which he wants to avoid at read-time. IMHO however, that's not worth optimizing for. It's rare that a user follows a very high number of other people. E.g. it's usually around let's say 100-1000 people. The flip side -- the # of *followers* -- can be very high. And interestingly, Rene's solution suffers from *that* at write-time. E.g. a new event has to update every *followers'* Graphity index. I'll provide Rene this feedback, but FWIW, I have the feeling that a simple timeline will easily be good enough, in fact optimal, for a social network on Neo4j: - O(1) writes - O(d) reads in space performance - O(d log d) reads in time performance Where d is the out-degree -- # of users you follow -- which is generally small. Cheers, Aseem On Mon, Nov 21, 2011 at 6:07 AM, Peter Neubauer peter.neuba...@neotechnology.com wrote: You might even be interested in Rene Pickards work on a full solution (albeit with some write-time overhead), see http://www.rene-pickhardt.de/graphity-an-efficient-graph-model-for-retrieving-the-top-k-news-feeds-for-users-in-social-networks/ Cheers, /peter neubauer GTalk: neubauer.peter Skype peter.neubauer Phone +46 704 106975 LinkedIn http://www.linkedin.com/in/neubauer Twitter http://twitter.com/peterneubauer http://www.neo4j.org - NOSQL for the Enterprise. http://startupbootcamp.org/- Öresund - Innovation happens HERE. On Thu, Nov 3, 2011 at 10:36 PM, maxdemarzi maxdema...@gmail.com wrote: I had not considered imperfect solutions, and in some activity stream scenarios a sampling of the last few messages could work. The sample would have to be taken from all Person nodes because if we sample from the Tweets in general and we encounter a chatty person node early on, it would take up all the sample space. Person1 - Person 2 - 10 tweets Person1 - Person3 - 1000 tweets = Sample Size reached, traversal stops Person1 - Person4 - 10 tweets Person1 - PersonX - 10 tweets Person3 would prevent Person4 to PersonX's tweets from ever making it to the sample. Some applications in domains like Financials or Network Monitoring may require the last known status and sampling might not be acceptable. -- View this message in context: http://neo4j-community-discussions.438527.n3.nabble.com/Activity-Streams-and-Twitter-Sample-App-tp3477669p3478477.html Sent from the Neo4j Community Discussions mailing list archive at Nabble.com. ___ Neo4j mailing list User@lists.neo4j.org https://lists.neo4j.org/mailman/listinfo/user ___ Neo4j mailing list User@lists.neo4j.org https://lists.neo4j.org/mailman/listinfo/user ___ Neo4j mailing list User@lists.neo4j.org https://lists.neo4j.org/mailman/listinfo/user ___ Neo4j mailing list User@lists.neo4j.org https://lists.neo4j.org/mailman/listinfo/user
Re: [Neo4j] Activity Streams and Twitter Sample App
A couple of things bothered me about Rene's approach. 1. Custom Neo4j that could handle lots of relationship types. Not 100% sure where this fits? 2. Performance drops like a bomb when K (the number of items retrieved) increases. So it kind of works to get the top 15 items... A twitter stream could have 500 items easy per update (on my iphone every few hours). -- View this message in context: http://neo4j-community-discussions.438527.n3.nabble.com/Activity-Streams-and-Twitter-Sample-App-tp3477669p3526543.html Sent from the Neo4j Community Discussions mailing list archive at Nabble.com. ___ Neo4j mailing list User@lists.neo4j.org https://lists.neo4j.org/mailman/listinfo/user
[Neo4j] Activity Streams and Twitter Sample App
Andreas Ronge created a new sample app called kvitter @ https://github.com/andreasronge/kvitter . This got me thinking about the Twitter clone done in Redis @ http://redis.io/topics/twitter-clone If you scroll down 2/3's of the way down you'll read this piece: After we create a post we obtain the post id. We need to LPUSH this post id in every user that's following the author of the post, and of course in the list of posts of the author. This is the bottleneck of the application (if you can call anything on Redis a bottleneck since it's so freaking fast). A tweet by http://twitter.com/#!/APLUSK will have to do 8 Million writes to each of Ashton's Followers. In Neo4j, we shouldn't need to do that since we can express that by relationships: # this should return the tweets of all the people I follow. Person1.outgoing(:follows).outgoing(:tweeted).depth(2).filter(position.length() == 2;) I don't want to get every tweet of every person I follow, just 100 of them. Person1.outgoing(:follows).outgoing(:tweeted).depth(2).filter(position.length() == 2;) .prune(position.returnedNodesCount() 100) But what about ordering them so I get the Latest 100 tweets from all of the tweets a persons followers have tweeted? What are some options here? A) Return all them, and then filter them so only the top 100 are displayed (See getLatestEvents from https://trac.neo4j.org/browser/examples/activity-stream/src/main/java/org/neo4j/examples/activitystream/ActivityStreamExample.java?rev=3888 ) B) A custom Evaluator that adds tweets to an ordered set and trims off the old tweets as it goes based on a tweet date property. (Will have to check every tweet property, like http://sujitpal.blogspot.com/2009/06/custom-traverser-for-neo4j.html ) C) Put the tweet date on the Tweeted Relationship, so we only check at the relationship property level D) Claim defeat and use a tweeted_recently relationship added from every new tweet to every follower (with trimming) sort of like Redis E) Claim defeat and use an index to store the latest tweets like Redis F) (assuming no id reuse) A custom Breadth First Traverser that evaluates just the last 100 Tweeted relationships ordered by id in conjunction with B. (is this possible?) Other ideas? How scalable are these solutions? -- View this message in context: http://neo4j-community-discussions.438527.n3.nabble.com/Activity-Streams-and-Twitter-Sample-App-tp3477669p3477669.html Sent from the Neo4j Community Discussions mailing list archive at Nabble.com. ___ Neo4j mailing list User@lists.neo4j.org https://lists.neo4j.org/mailman/listinfo/user
Re: [Neo4j] Activity Streams and Twitter Sample App
Came up with another possibility: G) Store the Latest 100 tweeted relationship ids with dates as a property of the User Node, and a custom Breadth First Traversal that evaluates the list of every follower's latest 100 tweets before deciding which relationships to follow. -- View this message in context: http://neo4j-community-discussions.438527.n3.nabble.com/Activity-Streams-and-Twitter-Sample-App-tp3477669p3477693.html Sent from the Neo4j Community Discussions mailing list archive at Nabble.com. ___ Neo4j mailing list User@lists.neo4j.org https://lists.neo4j.org/mailman/listinfo/user
Re: [Neo4j] Activity Streams and Twitter Sample App
one more solution. set a sampling_ratio, say 10: Person1.outgoing(:follows).outgoing(:tweeted).depth(2).filter(position.length()== 2;) .prune(position.returnedNodesCount() 100 * sampling_ratio) then do a sort based on timestamp. the goal is not to get the perfect result but *good enough* ones depends on your experience. On Thu, Nov 3, 2011 at 4:48 PM, maxdemarzi maxdema...@gmail.com wrote: Came up with another possibility: G) Store the Latest 100 tweeted relationship ids with dates as a property of the User Node, and a custom Breadth First Traversal that evaluates the list of every follower's latest 100 tweets before deciding which relationships to follow. -- View this message in context: http://neo4j-community-discussions.438527.n3.nabble.com/Activity-Streams-and-Twitter-Sample-App-tp3477669p3477693.html Sent from the Neo4j Community Discussions mailing list archive at Nabble.com. ___ Neo4j mailing list User@lists.neo4j.org https://lists.neo4j.org/mailman/listinfo/user -- Best wishes, Linan Wang Architect, Programmer, PhD ___ Neo4j mailing list User@lists.neo4j.org https://lists.neo4j.org/mailman/listinfo/user
Re: [Neo4j] Activity Streams and Twitter Sample App
I had not considered imperfect solutions, and in some activity stream scenarios a sampling of the last few messages could work. The sample would have to be taken from all Person nodes because if we sample from the Tweets in general and we encounter a chatty person node early on, it would take up all the sample space. Person1 - Person 2 - 10 tweets Person1 - Person3 - 1000 tweets = Sample Size reached, traversal stops Person1 - Person4 - 10 tweets Person1 - PersonX - 10 tweets Person3 would prevent Person4 to PersonX's tweets from ever making it to the sample. Some applications in domains like Financials or Network Monitoring may require the last known status and sampling might not be acceptable. -- View this message in context: http://neo4j-community-discussions.438527.n3.nabble.com/Activity-Streams-and-Twitter-Sample-App-tp3477669p3478477.html Sent from the Neo4j Community Discussions mailing list archive at Nabble.com. ___ Neo4j mailing list User@lists.neo4j.org https://lists.neo4j.org/mailman/listinfo/user