Re: [Neo4j] Activity Streams and Twitter Sample App

2011-11-27 Thread Peter Neubauer
Rene,
you mean the fork that Mattias has been providing you with for handling
millions of relationship types? It is possible to take that into master,
however, I think Johan said we need to test and make sure all cases are
covered. Don't have an ETA on that ATM. I think you could raise an issue
and point to Mattias branch to keep track of the feature!

Cheers,

/peter neubauer

GTalk:  neubauer.peter
Skype   peter.neubauer
Phone   +46 704 106975
LinkedIn   http://www.linkedin.com/in/neubauer
Twitter  http://twitter.com/peterneubauer

http://www.neo4j.org  - NOSQL for the Enterprise.
http://startupbootcamp.org/- Öresund - Innovation happens HERE.


2011/11/26 René Pickhardt r.pickha...@googlemail.com

 @ qcon: I can't find the twitter videos on
 http://qconlondon.com/london-2011/ rick could you please point to further
 rescources?

 @ aseem: I gave a detailed answer to your questions on

 http://www.rene-pickhardt.de/graphity-an-efficient-graph-model-for-retrieving-the-top-k-news-feeds-for-users-in-social-networks/#comment-1295

 but I think we have the same oppinion. The approach that you wrote down
 here is exactly what my baseline STOU does. you have your friendship
 relations as a (S)tar and the events as (T)emporal (O)rdered (U)pdeats ( =
 STOU ) The merging traversal is the same in STOU and Graphity. The only
 difference is that in STOU I have to sort the egonetwork O(d log d) where d
 is infact the outdegree.

 Once we understand this everything else is easy: You look at your kind of
 application and decide what your ratio of read to write operations is. Once
 reading happens far more frequently you better use graphity in the other
 case your method seems more sufficiant.

 What I have learnt from your comments that not only the ratio of reads to
 writes is interesting but the distribution of outdegrees should be
 considered when deciding which algorithm to use.. If the outdegree really
 stays small your approach will be more efficiently.

 @maxdemermazi:
 performance with bigger k drops in both cases. retrieval with graphity is
 O(k log k) where STOU is O(d log d + k log k) and for small k the d part is
 just dominating. So what's happening here is that graphity will always be
 faster since no sorting of the egonetwork is involved.
 Besides. Why would you want to have such a big k. Even if your twitter
 stream genereated that many updates when you come to the website you will
 only get to see a small k of them and as soon as you scroll down you will
 see the next k + l tweets...

 @custom neo4j: well let's ask Peter if neotechnology will include support
 to handle that many relationship types again to neo4j. I certainly would be
 happy about that.

 2011/11/22 maxdemarzi maxdema...@gmail.com

  A couple of things bothered me about Rene's approach.
 
  1. Custom Neo4j that could handle lots of relationship types.  Not 100%
  sure
  where this fits?
  2. Performance drops like a bomb when K (the number of items retrieved)
  increases.
 
  So it kind of works to get the top 15 items... A twitter stream could
 have
  500 items easy per update (on my iphone every few hours).
 
 
 
  --
  View this message in context:
 
 http://neo4j-community-discussions.438527.n3.nabble.com/Activity-Streams-and-Twitter-Sample-App-tp3477669p3526543.html
  Sent from the Neo4j Community Discussions mailing list archive at
  Nabble.com.
  ___
  Neo4j mailing list
  User@lists.neo4j.org
  https://lists.neo4j.org/mailman/listinfo/user
 



 --
 --
 mobile: +49 (0)176 6433 2481

 Skype: +49 (0)6131 / 4958926

 Skype: rene.pickhardt

 www.rene-pickhardt.de
  http://www.beijing-china-blog.com
 ___
 Neo4j mailing list
 User@lists.neo4j.org
 https://lists.neo4j.org/mailman/listinfo/user

___
Neo4j mailing list
User@lists.neo4j.org
https://lists.neo4j.org/mailman/listinfo/user


Re: [Neo4j] Activity Streams and Twitter Sample App

2011-11-26 Thread René Pickhardt
@ qcon: I can't find the twitter videos on
http://qconlondon.com/london-2011/ rick could you please point to further
rescources?

@ aseem: I gave a detailed answer to your questions on
http://www.rene-pickhardt.de/graphity-an-efficient-graph-model-for-retrieving-the-top-k-news-feeds-for-users-in-social-networks/#comment-1295

but I think we have the same oppinion. The approach that you wrote down
here is exactly what my baseline STOU does. you have your friendship
relations as a (S)tar and the events as (T)emporal (O)rdered (U)pdeats ( =
STOU ) The merging traversal is the same in STOU and Graphity. The only
difference is that in STOU I have to sort the egonetwork O(d log d) where d
is infact the outdegree.

Once we understand this everything else is easy: You look at your kind of
application and decide what your ratio of read to write operations is. Once
reading happens far more frequently you better use graphity in the other
case your method seems more sufficiant.

What I have learnt from your comments that not only the ratio of reads to
writes is interesting but the distribution of outdegrees should be
considered when deciding which algorithm to use.. If the outdegree really
stays small your approach will be more efficiently.

@maxdemermazi:
performance with bigger k drops in both cases. retrieval with graphity is
O(k log k) where STOU is O(d log d + k log k) and for small k the d part is
just dominating. So what's happening here is that graphity will always be
faster since no sorting of the egonetwork is involved.
Besides. Why would you want to have such a big k. Even if your twitter
stream genereated that many updates when you come to the website you will
only get to see a small k of them and as soon as you scroll down you will
see the next k + l tweets...

@custom neo4j: well let's ask Peter if neotechnology will include support
to handle that many relationship types again to neo4j. I certainly would be
happy about that.

2011/11/22 maxdemarzi maxdema...@gmail.com

 A couple of things bothered me about Rene's approach.

 1. Custom Neo4j that could handle lots of relationship types.  Not 100%
 sure
 where this fits?
 2. Performance drops like a bomb when K (the number of items retrieved)
 increases.

 So it kind of works to get the top 15 items... A twitter stream could have
 500 items easy per update (on my iphone every few hours).



 --
 View this message in context:
 http://neo4j-community-discussions.438527.n3.nabble.com/Activity-Streams-and-Twitter-Sample-App-tp3477669p3526543.html
 Sent from the Neo4j Community Discussions mailing list archive at
 Nabble.com.
 ___
 Neo4j mailing list
 User@lists.neo4j.org
 https://lists.neo4j.org/mailman/listinfo/user




-- 
--
mobile: +49 (0)176 6433 2481

Skype: +49 (0)6131 / 4958926

Skype: rene.pickhardt

www.rene-pickhardt.de
 http://www.beijing-china-blog.com
___
Neo4j mailing list
User@lists.neo4j.org
https://lists.neo4j.org/mailman/listinfo/user


Re: [Neo4j] Activity Streams and Twitter Sample App

2011-11-21 Thread Peter Neubauer
You might even be interested in Rene Pickards work on a full solution
(albeit with some write-time overhead), see
http://www.rene-pickhardt.de/graphity-an-efficient-graph-model-for-retrieving-the-top-k-news-feeds-for-users-in-social-networks/

Cheers,

/peter neubauer

GTalk:      neubauer.peter
Skype       peter.neubauer
Phone       +46 704 106975
LinkedIn   http://www.linkedin.com/in/neubauer
Twitter      http://twitter.com/peterneubauer

http://www.neo4j.org              - NOSQL for the Enterprise.
http://startupbootcamp.org/    - Öresund - Innovation happens HERE.



On Thu, Nov 3, 2011 at 10:36 PM, maxdemarzi maxdema...@gmail.com wrote:
 I had not considered imperfect solutions, and in some activity stream
 scenarios a sampling of the last few messages could work.  The sample would
 have to be taken from all Person nodes because if we sample from the Tweets
 in general and we encounter a chatty person node early on, it would take
 up all the sample space.

 Person1 - Person 2 - 10 tweets
 Person1 - Person3 - 1000 tweets  = Sample Size reached, traversal stops
 Person1 - Person4 - 10 tweets
 Person1 - PersonX - 10 tweets

 Person3 would prevent Person4 to PersonX's tweets from ever making it to the
 sample.

 Some applications in domains like Financials or Network Monitoring may
 require the last known status and sampling might not be acceptable.

 --
 View this message in context: 
 http://neo4j-community-discussions.438527.n3.nabble.com/Activity-Streams-and-Twitter-Sample-App-tp3477669p3478477.html
 Sent from the Neo4j Community Discussions mailing list archive at Nabble.com.
 ___
 Neo4j mailing list
 User@lists.neo4j.org
 https://lists.neo4j.org/mailman/listinfo/user

___
Neo4j mailing list
User@lists.neo4j.org
https://lists.neo4j.org/mailman/listinfo/user


Re: [Neo4j] Activity Streams and Twitter Sample App

2011-11-21 Thread Aseem Kishore
This is very interesting -- thanks Peter for the link, and thanks
maxdemarzi for starting this conversation.

In our social network -- which has extremely little load, it's just in beta
-- we currently use basically (a), and it works just fine. We use Cypher to
do the sorting/trimming on the server, e.g. roughly this query:

START user=(123)
MATCH (user) -[:FOLLOWS]- (friend) -[:EVENT]- (event)
RETURN event
ORDER BY event.timestamp DESC
LIMIT 20

This obviously will not scale, since Neo4j has to read every single
potential event from disk and process it in memory.

We too have been thinking about this in the back of our minds, and the
obvious solution in our minds is indeed to use a linked list timeline for
events, and have a custom traverser that merge sorts in realtime.

Since this'd be a linked list where you want the newest events at the head
of the list, there will indeed be a small write-time cost of having to
update two relationships whenever a new event is created (i.e. inserting it
at the head of the list).

Rene mentions this approach as the first (baseline) approach he tried.
It's at 3:08 in the video in that post.

He decides that approach isn't good enough, however, because merge
sorting depends on the number of friends you have (# users you follow),
which he wants to avoid at read-time.

IMHO however, that's not worth optimizing for. It's rare that a user
follows a very high number of other people. E.g. it's usually around let's
say 100-1000 people.

The flip side -- the # of *followers* -- can be very high. And
interestingly, Rene's solution suffers from *that* at write-time. E.g. a
new event has to update every *followers'* Graphity index.

I'll provide Rene this feedback, but FWIW, I have the feeling that a simple
timeline will easily be good enough, in fact optimal, for a social network
on Neo4j:

- O(1) writes
- O(d) reads in space performance
- O(d log d) reads in time performance

Where d is the out-degree -- # of users you follow -- which is generally
small.

Cheers,
Aseem


On Mon, Nov 21, 2011 at 6:07 AM, Peter Neubauer 
peter.neuba...@neotechnology.com wrote:

 You might even be interested in Rene Pickards work on a full solution
 (albeit with some write-time overhead), see

 http://www.rene-pickhardt.de/graphity-an-efficient-graph-model-for-retrieving-the-top-k-news-feeds-for-users-in-social-networks/

 Cheers,

 /peter neubauer

 GTalk:  neubauer.peter
 Skype   peter.neubauer
 Phone   +46 704 106975
 LinkedIn   http://www.linkedin.com/in/neubauer
 Twitter  http://twitter.com/peterneubauer

 http://www.neo4j.org  - NOSQL for the Enterprise.
 http://startupbootcamp.org/- Öresund - Innovation happens HERE.



 On Thu, Nov 3, 2011 at 10:36 PM, maxdemarzi maxdema...@gmail.com wrote:
  I had not considered imperfect solutions, and in some activity stream
  scenarios a sampling of the last few messages could work.  The sample
 would
  have to be taken from all Person nodes because if we sample from the
 Tweets
  in general and we encounter a chatty person node early on, it would
 take
  up all the sample space.
 
  Person1 - Person 2 - 10 tweets
  Person1 - Person3 - 1000 tweets  = Sample Size reached, traversal stops
  Person1 - Person4 - 10 tweets
  Person1 - PersonX - 10 tweets
 
  Person3 would prevent Person4 to PersonX's tweets from ever making it to
 the
  sample.
 
  Some applications in domains like Financials or Network Monitoring may
  require the last known status and sampling might not be acceptable.
 
  --
  View this message in context:
 http://neo4j-community-discussions.438527.n3.nabble.com/Activity-Streams-and-Twitter-Sample-App-tp3477669p3478477.html
  Sent from the Neo4j Community Discussions mailing list archive at
 Nabble.com.
  ___
  Neo4j mailing list
  User@lists.neo4j.org
  https://lists.neo4j.org/mailman/listinfo/user
 
 ___
 Neo4j mailing list
 User@lists.neo4j.org
 https://lists.neo4j.org/mailman/listinfo/user

___
Neo4j mailing list
User@lists.neo4j.org
https://lists.neo4j.org/mailman/listinfo/user


Re: [Neo4j] Activity Streams and Twitter Sample App

2011-11-21 Thread Rick Bullotta
You might also want to take a look at the videos from the Twitter presentations 
from QCon London.

-Original Message-
From: user-boun...@lists.neo4j.org [mailto:user-boun...@lists.neo4j.org] On 
Behalf Of Aseem Kishore
Sent: Monday, November 21, 2011 4:43 PM
To: Neo4j user discussions
Subject: Re: [Neo4j] Activity Streams and Twitter Sample App

This is very interesting -- thanks Peter for the link, and thanks
maxdemarzi for starting this conversation.

In our social network -- which has extremely little load, it's just in beta
-- we currently use basically (a), and it works just fine. We use Cypher to
do the sorting/trimming on the server, e.g. roughly this query:

START user=(123)
MATCH (user) -[:FOLLOWS]- (friend) -[:EVENT]- (event)
RETURN event
ORDER BY event.timestamp DESC
LIMIT 20

This obviously will not scale, since Neo4j has to read every single
potential event from disk and process it in memory.

We too have been thinking about this in the back of our minds, and the
obvious solution in our minds is indeed to use a linked list timeline for
events, and have a custom traverser that merge sorts in realtime.

Since this'd be a linked list where you want the newest events at the head
of the list, there will indeed be a small write-time cost of having to
update two relationships whenever a new event is created (i.e. inserting it
at the head of the list).

Rene mentions this approach as the first (baseline) approach he tried.
It's at 3:08 in the video in that post.

He decides that approach isn't good enough, however, because merge
sorting depends on the number of friends you have (# users you follow),
which he wants to avoid at read-time.

IMHO however, that's not worth optimizing for. It's rare that a user
follows a very high number of other people. E.g. it's usually around let's
say 100-1000 people.

The flip side -- the # of *followers* -- can be very high. And
interestingly, Rene's solution suffers from *that* at write-time. E.g. a
new event has to update every *followers'* Graphity index.

I'll provide Rene this feedback, but FWIW, I have the feeling that a simple
timeline will easily be good enough, in fact optimal, for a social network
on Neo4j:

- O(1) writes
- O(d) reads in space performance
- O(d log d) reads in time performance

Where d is the out-degree -- # of users you follow -- which is generally
small.

Cheers,
Aseem


On Mon, Nov 21, 2011 at 6:07 AM, Peter Neubauer 
peter.neuba...@neotechnology.com wrote:

 You might even be interested in Rene Pickards work on a full solution
 (albeit with some write-time overhead), see

 http://www.rene-pickhardt.de/graphity-an-efficient-graph-model-for-retrieving-the-top-k-news-feeds-for-users-in-social-networks/

 Cheers,

 /peter neubauer

 GTalk:  neubauer.peter
 Skype   peter.neubauer
 Phone   +46 704 106975
 LinkedIn   http://www.linkedin.com/in/neubauer
 Twitter  http://twitter.com/peterneubauer

 http://www.neo4j.org  - NOSQL for the Enterprise.
 http://startupbootcamp.org/- Öresund - Innovation happens HERE.



 On Thu, Nov 3, 2011 at 10:36 PM, maxdemarzi maxdema...@gmail.com wrote:
  I had not considered imperfect solutions, and in some activity stream
  scenarios a sampling of the last few messages could work.  The sample
 would
  have to be taken from all Person nodes because if we sample from the
 Tweets
  in general and we encounter a chatty person node early on, it would
 take
  up all the sample space.
 
  Person1 - Person 2 - 10 tweets
  Person1 - Person3 - 1000 tweets  = Sample Size reached, traversal stops
  Person1 - Person4 - 10 tweets
  Person1 - PersonX - 10 tweets
 
  Person3 would prevent Person4 to PersonX's tweets from ever making it to
 the
  sample.
 
  Some applications in domains like Financials or Network Monitoring may
  require the last known status and sampling might not be acceptable.
 
  --
  View this message in context:
 http://neo4j-community-discussions.438527.n3.nabble.com/Activity-Streams-and-Twitter-Sample-App-tp3477669p3478477.html
  Sent from the Neo4j Community Discussions mailing list archive at
 Nabble.com.
  ___
  Neo4j mailing list
  User@lists.neo4j.org
  https://lists.neo4j.org/mailman/listinfo/user
 
 ___
 Neo4j mailing list
 User@lists.neo4j.org
 https://lists.neo4j.org/mailman/listinfo/user

___
Neo4j mailing list
User@lists.neo4j.org
https://lists.neo4j.org/mailman/listinfo/user
___
Neo4j mailing list
User@lists.neo4j.org
https://lists.neo4j.org/mailman/listinfo/user


Re: [Neo4j] Activity Streams and Twitter Sample App

2011-11-21 Thread maxdemarzi
A couple of things bothered me about Rene's approach.

1. Custom Neo4j that could handle lots of relationship types.  Not 100% sure
where this fits?
2. Performance drops like a bomb when K (the number of items retrieved)
increases.

So it kind of works to get the top 15 items... A twitter stream could have
500 items easy per update (on my iphone every few hours).



--
View this message in context: 
http://neo4j-community-discussions.438527.n3.nabble.com/Activity-Streams-and-Twitter-Sample-App-tp3477669p3526543.html
Sent from the Neo4j Community Discussions mailing list archive at Nabble.com.
___
Neo4j mailing list
User@lists.neo4j.org
https://lists.neo4j.org/mailman/listinfo/user


[Neo4j] Activity Streams and Twitter Sample App

2011-11-03 Thread maxdemarzi
Andreas Ronge created a new sample app called kvitter @
https://github.com/andreasronge/kvitter .

This got me thinking about the Twitter clone done in Redis @
http://redis.io/topics/twitter-clone

If you scroll down 2/3's of the way down you'll read this piece:

After we create a post we obtain the post id. We need to LPUSH this post id
in every user that's following the author of the post, and of course in the
list of posts of the author.

This is the bottleneck of the application (if you can call anything on
Redis a bottleneck since it's so freaking fast).  A tweet by
http://twitter.com/#!/APLUSK will have to do 8 Million writes to each of
Ashton's Followers.

In Neo4j, we shouldn't need to do that since we can express that by
relationships:

# this should return the tweets of all the people I follow.
Person1.outgoing(:follows).outgoing(:tweeted).depth(2).filter(position.length()
== 2;) 

I don't want to get every tweet of every person I follow, just 100 of them.

Person1.outgoing(:follows).outgoing(:tweeted).depth(2).filter(position.length()
== 2;) .prune(position.returnedNodesCount()  100)

But what about ordering them so I get the Latest 100 tweets from all of the
tweets a persons followers have tweeted?

What are some options here?

A) Return all them, and then filter them so only the top 100 are displayed
(See getLatestEvents from
https://trac.neo4j.org/browser/examples/activity-stream/src/main/java/org/neo4j/examples/activitystream/ActivityStreamExample.java?rev=3888
)

B) A custom Evaluator that adds tweets to an ordered set and trims off the
old tweets as it goes based on a tweet date property.  (Will have to check
every tweet property, like
http://sujitpal.blogspot.com/2009/06/custom-traverser-for-neo4j.html )

C) Put the tweet date on the Tweeted Relationship, so we only check at the
relationship property level

D) Claim defeat and use a tweeted_recently relationship added from every
new tweet to every follower (with trimming) sort of like Redis

E) Claim defeat and use an index to store the latest tweets like Redis

F) (assuming no id reuse) A custom Breadth First Traverser that evaluates
just the last 100 Tweeted relationships ordered by id in conjunction with B.
(is this possible?)

Other ideas?

How scalable are these solutions?

--
View this message in context: 
http://neo4j-community-discussions.438527.n3.nabble.com/Activity-Streams-and-Twitter-Sample-App-tp3477669p3477669.html
Sent from the Neo4j Community Discussions mailing list archive at Nabble.com.
___
Neo4j mailing list
User@lists.neo4j.org
https://lists.neo4j.org/mailman/listinfo/user


Re: [Neo4j] Activity Streams and Twitter Sample App

2011-11-03 Thread maxdemarzi
Came up with another possibility:

G) Store the Latest 100 tweeted relationship ids with dates as a property of
the User Node, and a custom Breadth First Traversal that evaluates the list
of every follower's latest 100 tweets before deciding which relationships to
follow.

--
View this message in context: 
http://neo4j-community-discussions.438527.n3.nabble.com/Activity-Streams-and-Twitter-Sample-App-tp3477669p3477693.html
Sent from the Neo4j Community Discussions mailing list archive at Nabble.com.
___
Neo4j mailing list
User@lists.neo4j.org
https://lists.neo4j.org/mailman/listinfo/user


Re: [Neo4j] Activity Streams and Twitter Sample App

2011-11-03 Thread Linan Wang
one more solution. set a sampling_ratio, say 10:
Person1.outgoing(:follows).outgoing(:tweeted).depth(2).filter(position.length()==
2;) .prune(position.returnedNodesCount()  100 * sampling_ratio)
then do a sort based on timestamp.
the goal is not to get the perfect result but *good enough* ones
depends on your experience.

On Thu, Nov 3, 2011 at 4:48 PM, maxdemarzi maxdema...@gmail.com wrote:
 Came up with another possibility:

 G) Store the Latest 100 tweeted relationship ids with dates as a property of
 the User Node, and a custom Breadth First Traversal that evaluates the list
 of every follower's latest 100 tweets before deciding which relationships to
 follow.

 --
 View this message in context: 
 http://neo4j-community-discussions.438527.n3.nabble.com/Activity-Streams-and-Twitter-Sample-App-tp3477669p3477693.html
 Sent from the Neo4j Community Discussions mailing list archive at Nabble.com.
 ___
 Neo4j mailing list
 User@lists.neo4j.org
 https://lists.neo4j.org/mailman/listinfo/user




-- 
Best wishes,

Linan Wang
Architect, Programmer, PhD
___
Neo4j mailing list
User@lists.neo4j.org
https://lists.neo4j.org/mailman/listinfo/user


Re: [Neo4j] Activity Streams and Twitter Sample App

2011-11-03 Thread maxdemarzi
I had not considered imperfect solutions, and in some activity stream
scenarios a sampling of the last few messages could work.  The sample would
have to be taken from all Person nodes because if we sample from the Tweets
in general and we encounter a chatty person node early on, it would take
up all the sample space.

Person1 - Person 2 - 10 tweets
Person1 - Person3 - 1000 tweets  = Sample Size reached, traversal stops
Person1 - Person4 - 10 tweets
Person1 - PersonX - 10 tweets

Person3 would prevent Person4 to PersonX's tweets from ever making it to the
sample.

Some applications in domains like Financials or Network Monitoring may
require the last known status and sampling might not be acceptable.

--
View this message in context: 
http://neo4j-community-discussions.438527.n3.nabble.com/Activity-Streams-and-Twitter-Sample-App-tp3477669p3478477.html
Sent from the Neo4j Community Discussions mailing list archive at Nabble.com.
___
Neo4j mailing list
User@lists.neo4j.org
https://lists.neo4j.org/mailman/listinfo/user