The Pearson correlation is not a distance metric; it goes the wrong way. A distance needs to be 0 when two things are identical, not 1. "1 - correlation" can work. I'm not sure that's a proper distance metric, but you often don't need it to be proper and obey the triangle inequality.
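A minimal sketch of that "1 - correlation" distance, assuming a small dense NumPy matrix (real rating data would be sparse). SciPy's "correlation" metric computes exactly this:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Toy rating rows; each row is one user's (or item's) preference vector.
ratings = np.array([
    [5.0, 4.0, 1.0, 0.0],
    [4.0, 5.0, 2.0, 1.0],
    [1.0, 0.0, 5.0, 4.0],
])

# pdist's "correlation" metric is 1 - Pearson correlation:
# 0 for identical (perfectly correlated) rows, up to 2 for
# perfectly anti-correlated rows.
dist = squareform(pdist(ratings, metric="correlation"))
print(dist.round(3))
```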
A point in space defines a vector / direction. You can project all your data onto that vector, and then you have put everything on a 1-D axis. It sounds like you want to start with one of those points, like "the centroid of all Star Trek movies". You can do this by dotting everyone's item pref vector with the vector that has a "1" for the Star Trek movies only. This will tell you, well, the extent to which people have rated the Star Trek movies.

It probably gets more interesting when you project the data intelligently into a lower-dimensional space. There you might pick up that people who rate Star Wars end up projecting near the Star Trek watchers. The SVD (more or less what PCA is) is good, even overkill, if this is all you want; ALS is fast and simple for a smart-ish projection into lower dimensions.

The things that project far from Star Trek may, or may not, be thematically coherent. There are many "things not like Star Trek". For example, if you operated in the original space, you'd find a huge clump of people at 0 -- these are people who have just never watched a Star Trek movie, and there are plenty of different types of those. But in some cases you may find an interesting thematic clump at the opposite end, sure.

So far nothing here even requires more than one axis or dimension. It sounds like you want to discover clusters, which is just clustering, and doesn't necessarily involve any projection or matrix factorization. Cluster with k-means to start, using the distance above or something similar, then see what things look like when you project onto the axis defined by each cluster in feature space. Rough sketches of the projection and the clustering follow below the quoted thread.

On Tue, Nov 27, 2012 at 1:55 AM, Lance Norskog <[email protected]> wrote:
> There is a problem I've wanted to solve for a long time. Suppose you want to
> find antipodes in preferences: "axes of interest". In movie preferences, Star
> Trek movies (male nerds) v.s. Sex In The City (middle class women) might be
> one axis v.s. historical documentaries v.s. 1950's Douglas Sirk melodramas
> (don't ask). These axes are not orthogonal. (I saw this analysis in a
> presentation by one of the Netflix Competition finalists. Unfortunately, I
> did not ask him how to make it.)
>
> Thank you for this hint. Negative correlations make this possible. Given an
> item-item matrix of Pearson distances, how would you isolate these axes? The
> minimum and maximum movies are easy to find. Each axis endpoint is a small
> cluster inside a genre. How would you find these small clusters? They're not
> orthogonal, so a naive SVD would not help. What is a good algorithm for this?
>
> Lance
>
> ----- Original Message -----
> | From: "Paulo Villegas" <[email protected]>
> | To: [email protected]
> | Sent: Monday, November 26, 2012 2:03:59 PM
> | Subject: Re: Recommender's formula
> |
> | [...]
> |
> | They can be negative for certain similarity metrics, most notably
> | Pearson (which has sign, negative similarities express negative
> | correlations), other similarity metrics are strictly positive and
> | therefore do not present that problem.
> |
> | [...]
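A rough sketch of the projection idea: score everyone against a "Star Trek" indicator vector in the raw item space, then do the same through a truncated SVD (standing in for ALS here). The data, item indices, and rank are made up for illustration, assuming plain NumPy:

```python
import numpy as np

# Made-up dense ratings matrix (users x items); real data would be sparse.
rng = np.random.default_rng(0)
ratings = rng.integers(0, 6, size=(100, 20)).astype(float)

# Indicator vector: 1 for the (hypothetical) Star Trek items, 0 elsewhere.
star_trek = np.zeros(20)
star_trek[[3, 7, 11]] = 1.0

# Projection in the original space: how much each user has rated the
# Star Trek items. People who never watched one pile up at 0.
raw_scores = ratings @ star_trek

# Smarter: go through a low-rank SVD first, so users who rate similar
# items (say, Star Wars) can land near the Star Trek watchers.
U, s, Vt = np.linalg.svd(ratings, full_matrices=False)
k = 5
user_factors = U[:, :k] * s[:k]   # users x k
item_factors = Vt[:k].T           # items x k

# The Star Trek "axis" in factor space: normalized sum of the
# Star Trek items' factor vectors; project users onto it.
axis = item_factors.T @ star_trek
axis /= np.linalg.norm(axis)
latent_scores = user_factors @ axis
```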

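And a minimal clustering sketch, assuming scikit-learn: vanilla k-means is Euclidean, but z-scoring each row first makes Euclidean distance a monotone function of the "1 - correlation" distance above, so this approximates clustering by correlation. Data and cluster count are made up:

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up ratings again (users x items).
rng = np.random.default_rng(1)
ratings = rng.integers(0, 6, size=(100, 20)).astype(float)

# Standardize each row: for unit-variance rows, squared Euclidean
# distance equals 2n * (1 - correlation), so k-means on z-scored
# rows clusters by correlation distance.
z = ratings - ratings.mean(axis=1, keepdims=True)
z /= z.std(axis=1, keepdims=True) + 1e-12

labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(z)

# Each cluster centroid then defines an axis to project onto, as above.
centroids = np.array([z[labels == c].mean(axis=0) for c in range(4)])
```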