"Also, if your original matrix is A, then it is usually a shorter path to results to analyze the word (item) cooccurrence matrix A'A. The methods below work either way."
The cooccurrence definitions I can find, e.g. on Wikipedia, only use the summation-based one. Are there any involving inverting the matrix instead? Or, as I suspect, are the two exactly the same but I don't know enough linear algebra?

On Mon, Aug 29, 2011 at 3:28 PM, Ted Dunning <[email protected]> wrote:

> Jeff,
>
> I think that this is a much simpler exposition:
> http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html
>
> It makes the connection with entropy clear and allows a very simple implementation for more than 2x2 situations.
>
> More comments in-line:
>
> On Mon, Aug 29, 2011 at 1:34 PM, Jeff Hansen <[email protected]> wrote:
> > ...
> > If we have an item/feature matrix (document/words, user/movies) then we can consider each row or column to be a sample from a larger reference population of the full matrix.
>
> Well, sort of. If you have a document (user) and a word (item), then you have a joint probability that any given interaction will be between this document and word. We pretend in this case that each interaction is independent of every other, which is patently not true, but very helpful.
>
> This joint probability can be approximated more or less accurately as the product of document (user) and word (item) popularities. In matrix terms, this is the same as approximating the joint probability as a rank-1 matrix that is the outer product of the user and item probability distributions, each of which is a vector. The point of what we are trying to do is to find a good and economical approximation of the joint probability that captures important deviations from this rank-1 approximation. There are two problems that come up in trying to find this approximation. The first is that we see counts rather than probabilities, and the second is that the matrix is big. Sometimes, it is very big.
>
> > In that case the log likelihood significance for any given observation within the sample will be based on the observation itself (1), the count of observations for the column (documents with this word / users that watched this movie), the count of observations for the row (words in this document, movies this user has watched), and the total number of any observation within the reference population. Given those numbers, G2 or log likelihood is just a calculation of a, b, c and d (the values described above, respectively).
>
> As the blog I mentioned above makes clearer than the original paper, you need to reduce the counts before taking them as entries in the contingency table. If you are looking at the typical problem of vocabulary in documents, then the discrepancy won't be huge, but if you have any very popular items, this could be a problem.
>
> > Given the log likelihood for each individual observation, is the best way to apply it simply to remove observations that don't have a high enough level of significance?
>
> Roughly. It is a bit misleading to talk about "significance" here since we aren't trying to do a classic frequentist hypothesis test. Instead, what we are trying to do is prioritize which joint terms we need to use to get a usefully better approximation of the underlying joint distribution.
>
> > Or is there a better way to normalize the values rather than removing less significant ones?
>
> There are a lot of ways of approaching this.
> Most don't do any better than the simple expedient of just keeping the highest scoring k words for each document (or items for each user). Having k be the same for all users is not a bad thing at all from a number of angles. I often use k = 50 to 300.
>
> > Also, in this case I feel like it makes more sense to treat Users like Documents and movies like observations of words within a document, so compare each User and their observations to the reference population rather than each Movie and the users that reviewed it.
>
> I think that you are on the right track here, but your second sentence made a distinction I didn't understand. User is to item as Document is to Word is a fine metaphor for recording your observations. I strongly prefer binary observations, so I usually either count all ratings as 1, or all ratings above a threshold as 1 with everything else being a zero.
>
> Whether you subsequently try to find users similar to each other or items similar to each other doesn't much matter. All are reasonable things to try. Whether they are useful is up to you and your customer to decide.
>
> Also, if your original matrix is A, then it is usually a shorter path to results to analyze the word (item) cooccurrence matrix A'A. The methods below work either way.
>
> > I realize this is a Mahout user list, so I feel somewhat guilty constantly pasting in R code, but here's how I ended up implementing it.
>
> I do it. Use the right tools for the right task. R is great for prototyping.
>
> > # rough equation that calculates log likelihood given a contingency table of counts
> > llr <- function(a,b,c,d){
> >   r=(a+b)/(c+d)
> >   e1=c*r
> >   e2=d*r
> >   g2=2*(a*log(a/e1)+b*log(b/e2))
> >   return(g2)
> > }
>
> So this code implies that you are arranging the elements of the contingency table this way:
>
>   a     b   |  a+b
>   c     d   |  c+d
>  -----------+---------
>  a+c   b+d  | a+b+c+d
>
> But in your later code you pass in data that uses this convention:
>
>   a     b-a    |  b
>   c     d-b-c  |  d-b
>  ---------------+------
>  a+c   d-a-c   |  d
>
> I don't like this form because it makes the code more complex overall and breaks important symmetries.
>
> > # get counts to calculate log likelihood for the function above
> > a <- 1
> > b <- apply(A,2,sum)
> > c <- apply(A,1,sum)
>
> You can also use rowSums and colSums.
>
> > d <- sum(A)
> >
> > # export sparse matrix A to a data frame for calculating llr
> > triples <- summary(A)
> > # create b and c value vectors that sync up with the data frame columns
> > bx <- b[triples$j]
> > cx <- c[triples$i]
> > # calculate the log likelihood for each entry in the data frame
> > ll <- llr(a,bx,cx,d)
>
> This probably needs to be something more like
>
> llr(a, bx-a, d-cx, d-bx-cx+a)
>
> (I think)
>
> Also, summary produces different values for different kinds of matrices. As you point out, it is very handy for sparse matrices, but it is nice to have code that works on sparse or dense matrices.
>
> > # create a logical vector that indicates whether each entry in the data frame is significant
> > significant <- ll > 3
>
> I think that a much higher cutoff is better. Commonly a cutoff of 10-30 is useful. Also, you may be better off if you can take the k-highest.
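
(An aside from me while quoting: this is roughly how I've been writing the entropy form of the LLR from the blog post linked above, in case it's useful. The 2x2 arrangement in the last line -- this entry vs. the rest of its row, the rest of its column, and everything else -- is my own reading of the correction, so treat it as a sketch rather than the canonical version.)

# negated Shannon entropy of a vector or matrix of counts, with 0*log(0) taken as 0
H <- function(k) {
  N <- sum(k)
  p <- k / N
  sum(p * log(p + (p == 0)))
}

# G2 / log-likelihood ratio for a 2x2 contingency table, entropy style:
# 2 * N * (H(table) - H(row sums) - H(column sums))
llr2 <- function(k11, k12, k21, k22) {
  k <- matrix(c(k11, k12, k21, k22), nrow = 2, byrow = TRUE)
  2 * sum(k) * (H(k) - H(rowSums(k)) - H(colSums(k)))
}

# applied to the triples above (a = 1, bx = column totals, cx = row totals, d = sum(A)):
# this entry, rest of this row, rest of this column, everything else
ll <- mapply(function(b, c) llr2(a, c - a, b - a, d - b - c + a), bx, cx)
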
> > # build a new sparse matrix that only has entries with a significant enough log likelihood
> > B <- sparseMatrix(i=triples$i[significant],j=triples$j[significant],x=1)
> >
> > By the way, I always used to struggle with matrix multiplication (I could never quite remember which side was rows and which side was columns without little mnemonics, but then again I've always struggled with remembering my left from my right). I recently realized it makes much more sense if you picture it as a cube in 3-space -- if you match up the lower left hand corner of the right matrix with the upper right hand corner of the left matrix and put the product in between them so you get a backwards capital L shape, then fold back the right and left matrices, you get three faces of a cube (or rectangular box). Then you just project inwards from the original matrices, multiply wherever the values cross, and sum up those products as you project them toward the face which is the resulting matrix. Once you visualize it this way, it makes slicing and dicing the operation into blockwise formulas trivial and obvious. If only somebody had explained it this way I would have found matrices much less intimidating -- has anybody ever seen it explained this way?
> >
> > On Fri, Aug 26, 2011 at 10:21 PM, Lance Norskog <[email protected]> wrote:
> > > http://www.nytimes.com/interactive/2010/01/10/nyregion/20100110-netflix-map.html
> > >
> > > Do not fear demographics. Yes, some people rent movies with all-black casts, and other people rent movies with all-white casts. And the Walmarts in the SF East Bay have pallets full of Tyler Perry videos, while most of the SF Peninsula don't know his name.
> > >
> > > And sometimes a bunch of Army Rangers have a great time watching "Confessions of a Shopaholic" in a tent, 120 degrees Fahrenheit, in Qatar.
> > >
> > > On Fri, Aug 26, 2011 at 8:29 AM, Jeff Hansen <[email protected]> wrote:
> > > > Thanks for the math, Ted -- that was very helpful.
> > > >
> > > > I've been using sparseMatrix() from library(Matrix) -- largely based on your response to somebody else's email. I've been playing with smaller matrices mainly for my own learning purposes -- it's much easier to read through 200 movies (most of which I've heard of) and get a gut feel than it is with 10,000 movies. It also means I don't have to shut down my session quite as often (I've been using RStudio) when I run something over the wrong dimension (column vs. row). I was running into some limitations as well, but I think some of those may have had more to do with my own typos and misunderstanding of that language.
> > > >
> > > > There's a second reason I've been avoiding the tail -- while working on this as a possible project to propose, I had a bit of a "duh" moment. Sometimes it's important to realize the real-world constraints. Picture a company with very physical locations and very limited shelf space (say a convenience store or even a vending machine...). Netflix and Amazon can make a lot of money in the long tail because they have centralized and/or digital inventories that service a large customer base.
> > > > Imagine a situation where you have a small, physical, distributed inventory with an equally distributed customer base. In that situation it doesn't pay to chase after the tail -- you just need to reduce your costs and target the head, where you get more volume with fewer items. So you really end up with a three-tier market segmentation -- one strategy works best for the head, another for the body, and a third for the tail.
> > > >
> > > > As far as clusters go -- I really wasn't finding any clusters at the edges of the data, but that could have more to do with not including the tail (and not normalizing appropriately for popularity).
> > > >
> > > > Incidentally, there was one amusingly cautionary anecdote I'd share -- I don't remember the exact spread, but at one point I had two extremes that included something like "The Johnson Family Vacation", "My Baby's Daddy", and "Barbershop 2" on the one side, and then titles like "Mean Girls", "50 First Dates" and "Along Came Polly" on the other side. You may have to look up those titles to see what I'm talking about, but I'd say the distinction will be pretty clear when you do -- and if you were simply automating all of this and not reviewing it with common sense, you could end up offending some of your users... All I'm saying is there may be trends in the real world that some people aren't comfortable having pointed out.
> > > >
> > > > On Fri, Aug 26, 2011 at 4:08 AM, Lance Norskog <[email protected]> wrote:
> > > > > This is a little meditation on user vs. item matrix density. The heavy users and heavy items can be subsampled, once they are identified. Hadoop's built-in sort does give a very simple "map-increase" way to do this sort.
> > > > >
> > > > > http://ultrawhizbang.blogspot.com/2011/08/sorted-recommender-data.html
> > > > >
> > > > > On Thu, Aug 25, 2011 at 5:57 PM, Ted Dunning <[email protected]> wrote:
> > > > > > In matrix terms the binary user x item matrix maps a set of items to users (A h = users who interacted with items in h). Similarly A' maps users to items. Thus A' (A h) is the classic "users who x'ed this also x'ed that" sort of operation. This can be rearranged to be (A'A) h.
> > > > > >
> > > > > > This is where the singular values come in. If
> > > > > >
> > > > > > A \approx U S V'
> > > > > >
> > > > > > then we know that U'U = V'V = I from the definition of the SVD. This means that
> > > > > >
> > > > > > A'A \approx (U S V')' (U S V') = V S U' U S V' = V S^2 V'
> > > > > >
> > > > > > But V' transforms an item vector into the reduced dimensional space, so we can view this as transforming the item vector into the reduced dimensional space, scaling element-wise using S^2, and then transforming back to item space.
> > > > > >
> > > > > > As you noted, the first right singular vector corresponds to popularity.
> > > > > > It is often not appropriate to just recommend popular things (since users have other means of discovering them), so you might want to drop that dimension from the analysis. This can be done in the SVD results, or it can be done using something like a log-likelihood ratio test so that we only model anomalously large values in A'A.
> > > > > >
> > > > > > Btw, if you continue to use R for experimentation, you should be able to dramatically increase the size of the analysis you do by using sparse matrices and a random projection algorithm. I was able to compute SVDs of matrices with millions of rows and columns and a few million non-zero elements in a few seconds. Holler if you would like details about how to do this.
> > > > > >
> > > > > > On Thu, Aug 25, 2011 at 4:07 PM, Jeff Hansen <[email protected]> wrote:
> > > > > > > One thing I found interesting (but not particularly surprising) is that the biggest singular value/vector was pretty much tied directly to volume. That makes sense because the best predictor of whether a given field's value was 1 was whether it belonged to a row with lots of 1s or a column with lots of 1s (I haven't quite figured out the best method to normalize the values with yet). When I plot the values of the largest singular vectors against the sum of values in the corresponding row or column, the correlation is very linear. That's not the case for any of the other singular vectors (which actually makes me wonder if just removing that first singular vector from the prediction might not be one of the best ways to normalize the data).
> > > > > > >
> > > > > > > I understand what you're saying about each singular vector corresponding to a feature though. Each left singular vector represents some abstract aspect of a movie, and each right singular vector represents users' leanings or inclinations with regard to that aspect of the movie. The singular value itself just seems to indicate how good a predictor the combination of one user's inclination toward that aspect of a movie is for coming up with the actual value. The issue I mentioned above is that the popularity of a movie, as well as how often a user watches movies, tend to be the best predictors of whether a user has seen or will see a movie.
> > > > > > >
> > > > > > > I had been picturing this with the idea of one k-dimensional space -- one where a user's location corresponds to their ideal prototypical movie's location. Not that there would be a movie right there, but there would be ones nearby, and the nearer they were, the more enjoyable they would be. That's a naive model, but that doesn't mean it wouldn't work well enough.
> > > > > > > My problem is I don't quite get how to map the user space over to the item space.
> > > > > > >
> > > > > > > I think that may be what Lance is trying to describe in his last response, but I'm falling short on reconstructing the math from his description.
> > > > > > >
> > > > > > > I get the following:
> > > > > > >
> > > > > > > U S V* = A
> > > > > > > U S = A V
> > > > > > > U = A V 1/S
> > > > > > >
> > > > > > > I think that last line is what Lance was describing. Of course my problem was bootstrapping in a user for whom I don't know A or V.
> > > > > > >
> > > > > > > I also think I may have missed a big step of the puzzle. For some reason I thought that just by loosening the rank, you could recompose the matrix A from the truncated SVD values/vectors and use the recomposed values themselves as the recommendation. I thought one of the ideas was that the recomposed matrix had less "noise" and could be a better representation of the underlying nature of the matrix than the original matrix itself. But that may have just been wishful thinking...
> > > > > > >
> > > > > > > On Thu, Aug 25, 2011 at 4:50 PM, Sean Owen <[email protected]> wrote:
> > > > > > > > The 200x10 matrix is indeed a matrix of 10 singular vectors, which are eigenvectors of AA'. It's the columns, not the rows, that are eigenvectors.
> > > > > > > >
> > > > > > > > The rows do mean something. I think it's fair to interpret the 10 singular values / vectors as corresponding to some underlying features of tastes. The rows say how much each user expresses those 10 tastes. The matrix of right singular vectors on the other side tells you the same thing about items. The diagonal matrix of singular values in the middle also comes into play -- it's like a set of multipliers that say how important each feature is. (This is why we cut out the singular vectors / values that have the smallest singular values; it's like removing the least-important features.) So really you'd have to stick those values somewhere; Ted says it's conventional to put "half" of each (their square roots) with each side if anything.
> > > > > > > >
> > > > > > > > I don't have as good a grasp on an intuition for the columns as eigenvectors. They're also a set of basis vectors, and I had understood them as being like the "real" bases of the reduced feature space expressed in user-item space. But I'd have to go back and think that intuition through again, since I can't really justify it from scratch again in my head just now.
> > > > > > > >
> > > > > > > > On Thu, Aug 25, 2011 at 10:21 PM, Jeff Hansen <[email protected]> wrote:
> > > > > > > > > Well, I think my problem may have had more to do with what I was calling the eigenvector... I was referring to the rows rather than the columns of U and V.
> > > > > > > > > While the columns may be characteristic of the overall matrix, the rows are characteristic of the user or item (in that they are a rank-reduced representation of that person or thing). I guess you could say I just had to tilt my head to the side and change my perspective 90 degrees =)
> > > > >
> > > > > --
> > > > > Lance Norskog
> > > > > [email protected]
> > >
> > > --
> > > Lance Norskog
> > > [email protected]

--
Lance Norskog
[email protected]
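
P.S. While re-reading the SVD part of the thread, I put together this small self-contained check of the identities being discussed. The data is made up and it just uses base R's svd() on a dense toy matrix (nothing as clever as the random projection approach Ted mentioned), so take it as a sketch:

set.seed(1)
A <- matrix(rbinom(20 * 8, 1, 0.3), nrow = 20)   # toy 20 x 8 binary user x item matrix

k <- 3
s <- svd(A, nu = k, nv = k)
U <- s$u
V <- s$v
S <- diag(s$d[1:k])

# A'A is approximately V S^2 V' (the rank-k version of the identity quoted above)
AtA_approx <- V %*% diag(s$d[1:k]^2) %*% t(V)
max(abs(crossprod(A) - AtA_approx))   # nonzero residual from truncation; exact at full rank

# compare the first right singular vector with plain item counts (the popularity effect above)
cor(colSums(A), abs(s$v[, 1]))

# "folding in" a new user from their row of item interactions, per the U = A V (1/S) rearrangement
a_new <- rbinom(8, 1, 0.3)
u_new <- a_new %*% V %*% solve(S)

# dropping the first (popularity-dominated) dimension just means using columns 2:k
# of U and V along with the matching singular values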
