Hi,

I have a proposal for a very different solution.
First, a disclaimer: it involves math, I'm not a mathematician, and I've
never tried anything like it.

Perhaps you could use Principal Component Analysis
<http://en.wikipedia.org/wiki/Principal_component_analysis#Computing_PCA_using_the_covariance_method>
to reduce the problem.
With this approach, you could put all the data into a matrix and calculate a
number of "stereotypes".
Each stereotype has its typical likes, and every actual user is a linear
combination of stereotypes.
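
A minimal sketch of what this could look like, assuming NumPy and a small
made-up users x items matrix of likes (1 = likes, 0 = doesn't). The function
name stereotypes and the sample data are hypothetical, and this is only one
way to compute PCA (via SVD of the centered matrix), not a tested solution:

```python
import numpy as np

def stereotypes(likes, k):
    """Split a users x items matrix into k stereotypes and per-user weights."""
    likes = np.asarray(likes, dtype=float)
    mean = likes.mean(axis=0)              # average taste over all users
    centered = likes - mean                # PCA works on centered data
    # Thin SVD: centered == U @ diag(S) @ Vt
    U, S, Vt = np.linalg.svd(centered, full_matrices=False)
    components = Vt[:k]                    # k stereotypes (rows of item likes)
    weights = U[:, :k] * S[:k]             # strength of each stereotype per user
    return mean, components, weights

# Hypothetical data: 5 users, 4 items.
likes = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [1, 0, 1, 0],
])

mean, components, weights = stereotypes(likes, k=2)

# Each user is approximated as the mean plus a linear combination
# of the stereotypes.
approx = mean + weights @ components
```

Keeping fewer stereotypes (smaller k) makes approx coarser; keeping all of
them reproduces the original matrix.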

So the results can be stored like this:
each user links to its most significant stereotypes, and each link holds the
importance (weight) of that stereotype for the user.

If you want to know the similarity between two users, check whether they
match the same stereotypes with the same strength.
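
One way to turn "same stereotypes, same strength" into a number is the cosine
similarity of the two users' stereotype weights. This is a sketch under
assumptions: the function name, the choice of cosine similarity, and the
weights below are all made up for illustration, and Euclidean distance would
be a reasonable alternative if absolute strength matters:

```python
import numpy as np

def stereotype_similarity(a, b):
    """Cosine similarity of two users' stereotype weight vectors (-1 to 1)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        return 0.0  # a user with no stereotype weights matches nobody
    return float(a @ b / denom)

# Hypothetical weights on three stereotypes, as they might be stored
# on the user -> stereotype links.
alice = [0.9, -0.1, 0.3]
bob = [0.8, 0.0, 0.4]
carol = [-0.7, 0.5, 0.0]

sim_ab = stereotype_similarity(alice, bob)    # similar tastes
sim_ac = stereotype_similarity(alice, carol)  # opposite tastes
```

With only the top few stereotypes stored per user, the missing weights can be
treated as zero before comparing.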


This approach has a few advantages I think:

   1. easy tuning: you can choose how many stereotypes per user you store
   2. nice global statistic: from this matrix you can estimate how accurate
   you are
   3. easy update: the singular value decomposition used for this method can
   reuse the result of the previous run as a starting value and "correct" it.
   Only the first run will take a lot of CPU time.
   4. you can use existing math libraries which are faster than anything you
   can write yourself
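
For points 1 and 2, a rough sketch of how the accuracy estimate could be read
off the singular values, assuming NumPy. The function names and the 90%
target are assumptions for illustration: the squared singular values give the
share of variance each stereotype explains, so their running sum shows how
much is kept with the first k stereotypes.

```python
import numpy as np

def explained_variance_ratio(likes):
    """Share of the total variance captured by each stereotype, largest first."""
    likes = np.asarray(likes, dtype=float)
    centered = likes - likes.mean(axis=0)
    # Singular values only; NumPy returns them in descending order.
    S = np.linalg.svd(centered, compute_uv=False)
    variance = S ** 2
    total = variance.sum()
    if total == 0.0:
        return np.zeros_like(variance)  # all users identical: nothing to explain
    return variance / total

def stereotypes_needed(likes, target=0.9):
    """Smallest number of stereotypes whose cumulative share reaches target."""
    cumulative = np.cumsum(explained_variance_ratio(likes))
    k = int(np.searchsorted(cumulative, target)) + 1
    return min(k, len(cumulative))  # guard against rounding just below target

# Hypothetical data: 5 users, 4 items.
likes = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [1, 0, 1, 0],
])

ratios = explained_variance_ratio(likes)
k = stereotypes_needed(likes, target=0.9)
```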

The big disadvantage is that it requires non-trivial math.

Wouter



On Wed, Jul 28, 2010 at 6:36 PM, David Montag <
[email protected]> wrote:

> Hi Alberto,
>
> On Wed, Jul 28, 2010 at 5:02 PM, Alberto Perdomo
> <[email protected]> wrote:
>
> > Hi David,
> >
> >
> > > But then you need to store the result. You can store these metrics as
> > > relationships in neo4j, and then just update them for each user when
> > > you recompute. You can find the user nodes via indexing. Maybe it's
> > > acceptable that some metrics are out of date, so you can just
> > > background process them continuously.
> >
> > I already have background processes that go through all users and
> > calculate new pairs. But then in order to do that I do need to
> > exclude the pairs I already have... because it would be silly and as
> > the relationship density grows the probability of calculating a pair
> > again would be higher and higher...
> > Would I be able to do that kind of query using indexing?
> >
>
> From your description it sounds like the factors that influence the metric
> don't change, so a single calculation per pair is enough. In this case, you
> could just determine the pairs in some way and then do the computation,
> storing the relationship in Neo4j. You can do it all in one go, nothing
> fancy. You would of course have to compute the metric to N peers for each
> new user.
>
> In other scenarios, the factors that influence the metric might change over
> time, e.g. a user's city or favorite movie. Then you actually need to keep
> recomputing the metric between existing users, and yes, then you probably
> want some scheme to make sure that you don't starve some users. You might
> for example want to prioritize the most active users first. Again, I don't
> know if this applies to your case though.
>
> As for the indexing, I'm not sure how you would use it here. Like, what kind
> of querying were you picturing?
>
>
> >
> > > Depending on your scenario, if your users know each other, it might be
> > > interesting to start computing in a foaf style order (breadth first).
> > > Remember, the power is in the relationships. Isolated nodes are not
> > > interesting.
> >
> > You mean I look first for possible pairs with users that are friends
> > of friends instead of randomly? We are also interested in storing
> > friendship relationships so that sounds interesting.
> > That would be a different type of query: Traverse the graph from node
> > A to nodes which are friends of friends of A and have no match
> > relationship with A. I guess that is not difficult to implement using
> > Neo4j?
> >
>
> Exactly, so you might want to start with the most relevant other people,
> i.e. people you can realistically meet IRL via friends. Don't know if that's
> relevant to your application though.
>
> Neo4j would be a perfect fit for storing friendship relationships between
> users. It opens up all kinds of interesting data mining possibilities.
>
> The FOAF query would be easy to write using the Neo4j APIs, or some other
> tool such as Gremlin on top of Neo4j.
>
> So you could combine the friendship relationships with your processing step
> and prioritize active users, and start by checking people close to them in
> their social network. Again, if it's relevant. And, as Mattias suggested, if
> you can leverage friendship relationships between users, you might be able
> to calculate your metric on the fly, given that you limit the search to the
> user's extended social network. Of course, if you go deep enough, you might
> reach all users this way too.
>
>
> >
> > Thanks for your input David!
> >
>
> Glad to be of service. Ask as much as you like! We're all learning here :)
>
>
> > _______________________________________________
> > Neo4j mailing list
> > [email protected]
> > https://lists.neo4j.org/mailman/listinfo/user
> >
