In this text-only notation, I thought apostrophe meant inverse. What, then, is the notation for matrix inversion?
I see a fair amount of stuff here in what I think is MathML, but it displays raw in Gmail.

On Wed, Aug 31, 2011 at 8:04 AM, Ted Dunning <[email protected]> wrote:
> Uhh...
>
> A' is the transpose of A. Not the inverse.
>
> A' A *is* the summation version.
>
> On Wed, Aug 31, 2011 at 1:24 AM, Lance Norskog <[email protected]> wrote:
> > "Also, if your original matrix is A, then it is usually a shorter path to results to analyze the word (item) cooccurrence matrix A'A. The methods below work either way."
> >
> > The cooccurrence definitions I'm finding only use the summation-based one in wikipedia. Are there any involving inverting the matrix instead? Or, as I suspect, are the two exactly the same but I don't know enough linear algebra?
> >
> > On Mon, Aug 29, 2011 at 3:28 PM, Ted Dunning <[email protected]> wrote:
> > > Jeff,
> > >
> > > I think that this is a much simpler exposition: http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html
> > >
> > > It makes the connection with entropy clear and allows a very simple implementation for more than 2x2 situations.
> > >
> > > More comments in-line:
> > >
> > > On Mon, Aug 29, 2011 at 1:34 PM, Jeff Hansen <[email protected]> wrote:
> > > > ...
> > > > If we have an item/feature matrix (document/words, user/movies) then we can consider each row or column to be a sample from a larger reference population of the full matrix.
> > >
> > > Well, sort of. If you have a document (user) and a word (item), then you have a joint probability that any given interaction will be between this document and word. We pretend in this case that each interaction is independent of every other, which is patently not true, but very helpful.
> > >
> > > This joint probability can be approximated more or less accurately as the product of document (user) and word (item) popularities. In matrix terms, this is the same as approximating the joint probability as a rank-1 matrix that is the outer product of user and item probability distributions, each of which is a vector. The point of what we are trying to do is to find a good and economical approximation of the joint probability that captures important deviations from this rank-1 approximation. There are two problems that come up in trying to find this approximation. The first is that we see counts rather than probabilities, and the second is that the matrix is big. Sometimes, it is very big.
> > >
> > > > In that case the log likelihood significance for any given observation within the sample will be based on the observation itself (1), the count of observations for the column (documents with this word / users that watched this movie), the count of observations for the row (words in this document, movies this user has watched) and the total number of any observation within the reference population. Given those numbers, G2 or log likelihood is just a calculation of a, b, c and d (the values described above, respectively).
> > >
> > > As the blog I mentioned above makes clearer than the original paper, you need to reduce the counts before taking them as entries in the contingency table. If you are looking at the typical problem of vocabulary in documents, then the discrepancy won't be huge, but if you have any very popular items, this could be a problem.
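If I follow the "reduce the counts" point, for one user/item cell it means turning the raw cell count and its margins into four disjoint cells. A rough R sketch (the function and argument names are mine, and the exact convention may differ from the code later in the thread):

# Rough sketch: reduce one cell count plus its margins to a disjoint 2x2 table.
# a = count for this user/item pair, rowTotal/colTotal = the margins,
# total = grand total of the matrix. These names are mine, not from the thread.
reduce_counts <- function(a, rowTotal, colTotal, total) {
  matrix(c(a,            rowTotal - a,
           colTotal - a, total - rowTotal - colTotal + a),
         nrow = 2, byrow = TRUE)
}

reduce_counts(a = 20, rowTotal = 100, colTotal = 500, total = 100000)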
> > > > Given the log likelihood for each individual observation, is the best way to apply it simply to remove observations that don't have a high enough level of significance?
> > >
> > > Roughly. It is a bit misleading to talk about "significance" here since we aren't trying to do a classic frequentist hypothesis test. Instead, what we are trying to do is prioritize which joint terms we need to use to get a usefully better approximation of the underlying joint distribution.
> > >
> > > > Or is there a better way to normalize the values rather than removing less significant ones?
> > >
> > > There are a lot of ways of approaching this. Most don't do any better than the simple expedient of just keeping the highest scoring k words for each document (or items for each user). Having k be the same for all users is not a bad thing at all from a number of angles. I often use k = 50 to 300.
> > >
> > > > Also, in this case I feel like it makes more sense to treat Users like Documents and movies like observations of words within a document, so compare each User and their observations to the reference population rather than each Movie and the users that reviewed it.
> > >
> > > I think that you are on the right track here, but your second sentence made a distinction I didn't understand. User is to item as Document is to Word is a fine metaphor for recording your observations. I strongly prefer binary observations, so I usually either count all ratings as 1, or all ratings above a threshold as 1 with everything else being a zero.
> > >
> > > Whether you subsequently try to find users similar to each other or items similar to each other doesn't much matter. All are reasonable things to try. Whether they are useful is up to you and your customer to decide.
> > >
> > > Also, if your original matrix is A, then it is usually a shorter path to results to analyze the word (item) cooccurrence matrix A'A. The methods below work either way.
> > >
> > > > I realize this is a Mahout user list, so I feel somewhat guilty constantly pasting in R code, but here's how I ended up implementing it.
> > >
> > > I do it. Use the right tools for the right task. R is great for prototyping.
> > >
> > > > # rough function that calculates log likelihood given a contingency table of counts
> > > > llr <- function(a,b,c,d){
> > > >   r=(a+b)/(c+d)
> > > >   e1=c*r
> > > >   e2=d*r
> > > >   g2=2*(a*log(a/e1)+b*log(b/e2))
> > > >   return(g2)
> > > > }
> > >
> > > So this code implies that you are arranging the elements of the contingency table this way:
> > >
> > >   a    b    | a+b
> > >   c    d    | c+d
> > >   ----------+--------
> > >   a+c  b+d  | a+b+c+d
> > >
> > > But in your later code you pass in data that uses this convention:
> > >
> > >   a    b-a    | b
> > >   c    d-b-c  | d-b
> > >   ------------+-----
> > >   a+c  d-a-c  | d
> > >
> > > I don't like this form because it makes code more complex overall and breaks important symmetries.
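For what it's worth, the entropy formulation in that blog post fits in a few lines of R. This is only my sketch of it (the function names are mine), and it is symmetric in rows and columns, which is the point above:

# Sketch of the entropy-based G2 / log-likelihood ratio described in the blog
# post linked above. k is a matrix of counts; this works for tables bigger
# than 2x2 as well.
shannon_entropy <- function(k) {
  p <- k / sum(k)
  -sum(p * log(p + (p == 0)))   # zero-count cells contribute nothing
}

llr_entropy <- function(k) {
  # G2 = 2 * N * (mutual information between the row and column variables)
  2 * sum(k) * (shannon_entropy(rowSums(k)) +
                shannon_entropy(colSums(k)) -
                shannon_entropy(k))
}

llr_entropy(matrix(c(10, 100, 200, 10000), nrow = 2))   # arbitrary example counts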
> > > > # get counts to calculate log likelihood for the function above
> > > > a <- 1
> > > > b <- apply(A,2,sum)
> > > > c <- apply(A,1,sum)
> > >
> > > You can also use rowSums and colSums.
> > >
> > > > d <- sum(A)
> > > >
> > > > # export sparse matrix A to a data frame for calculating llr
> > > > triples <- summary(A)
> > > > # create b and c value vectors that sync up with the data frame columns
> > > > bx <- b[triples$j]
> > > > cx <- c[triples$i]
> > > > # calculate the log likelihood for each entry in the data frame
> > > > ll <- llr(a,bx,cx,d)
> > >
> > > This probably needs to be something more like
> > >
> > >   llr(a, bx-a, d-cx, d-bx-cx+a)
> > >
> > > (I think.)
> > >
> > > Also, summary produces different values for different kinds of matrices. As you point out, it is very handy for sparse matrices, but it is nice to have code that works on sparse or dense matrices.
> > >
> > > > # create a logical vector that indicates whether each entry in the data frame is significant
> > > > significant <- ll>3
> > >
> > > I think that a much higher cutoff is better. Commonly a cutoff of 10-30 is useful. Also, you may be better off if you can take the k-highest.
> > >
> > > > # build a new sparse matrix that only includes entries with a significant enough log likelihood
> > > > B <- sparseMatrix(i=triples$i[significant],j=triples$j[significant],x=1)
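Here is one way the "take the k-highest" variant could look over the same triples and ll objects, instead of a single global cutoff (k = 50 is arbitrary):

# Sketch: keep only the k highest-scoring entries per row (user) rather than
# applying one global threshold. Builds on A, triples and ll from the code
# above; requires library(Matrix) for sparseMatrix().
k <- 50
keep <- unlist(lapply(split(seq_along(ll), triples$i), function(idx) {
  idx[order(ll[idx], decreasing = TRUE)[seq_len(min(k, length(idx)))]]
}))
B <- sparseMatrix(i = triples$i[keep], j = triples$j[keep], x = 1, dims = dim(A))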
> > > > By the way, I always used to struggle with matrix multiplication (I could never quite remember which side was rows and which side was columns without little mnemonics, but then again I've always struggled with remembering my left from my right). I recently realized it makes much more sense if you picture it as a cube in 3-space -- if you match up the lower left hand corner of the right matrix with the upper right hand corner of the left matrix and put the product in between them so you get a backwards capital L shape, then fold back the right and left matrices, you get three faces of a cube (or rectangular box). Then you just project inwards from the original matrices, multiply wherever the values cross, and sum up those products as you project them toward the face which is the resulting matrix. Once you visualize it this way it makes slicing and dicing the operation into blockwise formulas trivial and obvious. If only somebody had explained it this way I would have found matrices much less intimidating -- has anybody ever seen it explained this way?
> > > >
> > > > On Fri, Aug 26, 2011 at 10:21 PM, Lance Norskog <[email protected]> wrote:
> > > > > http://www.nytimes.com/interactive/2010/01/10/nyregion/20100110-netflix-map.html
> > > > >
> > > > > Do not fear demographics. Yes, some people rent movies with all-black casts, and other people rent movies with all-white casts. And the Walmarts in the SF East Bay have pallets full of Tyler Perry videos, while most of the SF Peninsula don't know his name.
> > > > >
> > > > > And sometimes a bunch of Army Rangers have a great time watching "Confessions of a Shopaholic" in a tent, 120 degrees Fahrenheit, in Qatar.
> > > > >
> > > > > On Fri, Aug 26, 2011 at 8:29 AM, Jeff Hansen <[email protected]> wrote:
> > > > > > Thanks for the math Ted -- that was very helpful.
> > > > > >
> > > > > > I've been using sparseMatrix() from library(Matrix) -- largely based on your response to somebody else's email. I've been playing with smaller matrices mainly for my own learning purposes -- it's much easier to read through 200 movies (most of which I've heard of) and get a gut feel than 10,000 movies. It also means I don't have to shut down my session quite as often (I've been using RStudio) when I run something over the wrong dimension (column vs row). I was running into some limitations as well, but I think some of those may have had more to do with my own typos and misunderstanding of that language.
> > > > > >
> > > > > > There's a second reason I've been avoiding the tail -- while working on this as a possible project to propose, I had a bit of a "duh" moment. Sometimes it's important to realize the real-world constraints. Picture a company with very physical locations and very limited shelf space (say a convenience store or even a vending machine...). Netflix and Amazon can make a lot of money in the long tail because they have centralized and/or digital inventories that service a large customer base. Imagine a situation where you have a small, physically distributed inventory with an equally distributed customer base. In that situation it doesn't pay to chase after the tail -- you just need to reduce your costs and target the head, where you get more volume with fewer items. So you really end up with a three-tier market segmentation -- one strategy works best for the head, another for the body, and a third for the tail.
> > > > > >
> > > > > > As far as clusters go -- I really wasn't finding any clusters at the edges of the data, but that could have more to do with not including the tail (and not normalizing appropriately for popularity).
> > > > > >
> > > > > > Incidentally, there was one amusingly cautionary anecdote I'd share -- I don't remember the exact spread, but at one point I had two extremes that included something like "the johnson family vacation", "my baby's daddy", and "barbershop 2" on the one side, and then titles like "mean girls", "50 first dates" and "along came polly" on the other side. You may have to look up those titles to see what I'm talking about, but I'd say the distinction will be pretty clear when you do -- and if you were simply automating all of this and not reviewing it with common sense, you could end up offending some of your users... All I'm saying is there may be trends in the real world that some people aren't comfortable having pointed out.
> > > > > > On Fri, Aug 26, 2011 at 4:08 AM, Lance Norskog <[email protected]> wrote:
> > > > > > > This is a little meditation on user vs. item matrix density. The heavy users and heavy items can be subsampled, once they are identified. Hadoop's built-in sort does give a very simple "map-increase" way to do this sort.
> > > > > > >
> > > > > > > http://ultrawhizbang.blogspot.com/2011/08/sorted-recommender-data.html
> > > > > > >
> > > > > > > On Thu, Aug 25, 2011 at 5:57 PM, Ted Dunning <[email protected]> wrote:
> > > > > > > > In matrix terms the binary user x item matrix maps a set of items to users (A h = users who interacted with items in h). Similarly A' maps users to items. Thus A' (A h) is the classic "users who x'ed this also x'ed that" sort of operation. This can be rearranged to be (A'A) h.
> > > > > > > >
> > > > > > > > This is where the singular values come in. If
> > > > > > > >
> > > > > > > >   A \approx U S V'
> > > > > > > >
> > > > > > > > and we know that U'U = V'V = I from the definition of the SVD, this means that
> > > > > > > >
> > > > > > > >   A'A \approx (U S V')' (U S V') = V S U' U S V' = V S^2 V'
> > > > > > > >
> > > > > > > > But V' transforms an item vector into the reduced dimensional space, so we can view this as transforming the item vector into the reduced dimensional space, scaling element-wise using S^2 and then transforming back to item space.
> > > > > > > >
> > > > > > > > As you noted, the first right singular vector corresponds to popularity. It is often not appropriate to just recommend popular things (since users have other means of discovering them), so you might want to drop that dimension from the analysis. This can be done in the SVD results, or it can be done using something like a log-likelihood ratio test so that we only model anomalously large values in A'A.
> > > > > > > >
> > > > > > > > Btw, if you continue to use R for experimentation, you should be able to dramatically increase the size of the analysis you do by using sparse matrices and a random projection algorithm. I was able to compute SVDs of matrices with millions of rows and columns and a few million non-zero elements in a few seconds. Holler if you would like details about how to do this.
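My guess at the sparse-matrix-plus-random-projection recipe described here, sketched in R with a small random stand-in matrix (this is a generic randomized SVD, not necessarily the exact code Ted used):

# Sketch of a randomized SVD for a big sparse matrix -- a generic recipe only.
library(Matrix)

set.seed(1)
X <- rsparsematrix(20000, 10000, density = 0.005)   # ~1e6 non-zeros; stand-in for the user x item matrix A

k <- 20                                     # target rank
p <- 10                                     # oversampling

Omega <- matrix(rnorm(ncol(X) * (k + p)), ncol(X), k + p)   # random projection
Y <- as.matrix(X %*% Omega)                 # project into a small subspace
Q <- qr.Q(qr(Y))                            # orthonormal basis for (roughly) the range of X
B <- as.matrix(t(Q) %*% X)                  # small (k + p) x ncol(X) problem
s <- svd(B)

U <- Q %*% s$u[, 1:k]                       # approximate left singular vectors
d <- s$d[1:k]                               # approximate singular values
V <- s$v[, 1:k]                             # approximate right singular vectors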
> > > > > > > > On Thu, Aug 25, 2011 at 4:07 PM, Jeff Hansen <[email protected]> wrote:
> > > > > > > > > One thing I found interesting (but not particularly surprising) is that the biggest singular value/vector was pretty much tied directly to volume. That makes sense because the best predictor of whether a given field's value was 1 was whether it belonged to a row with lots of 1s or a column with lots of 1s (I haven't quite figured out the best method to normalize the values with yet). When I plot the values of the largest singular vectors against the sum of values in the corresponding row or column, the correlation is very linear. That's not the case for any of the other singular vectors (which actually makes me wonder if just removing that first singular vector from the prediction might not be one of the best ways to normalize the data).
> > > > > > > > >
> > > > > > > > > I understand what you're saying about each singular vector corresponding to a feature though. Each left singular vector represents some abstract aspect of a movie and each right singular vector represents users' leanings or inclinations with regard to that aspect of the movie. The singular value itself just seems to indicate how good a predictor the combination of one user's inclination toward that aspect of a movie is for coming up with the actual value. The issue I mentioned above is that the popularity of a movie, as well as how often a user watches movies, tend to be the best predictors of whether a user has seen or will see a movie.
> > > > > > > > >
> > > > > > > > > I had been picturing this with the idea of one k-dimensional space -- one where a user's location corresponds to their ideal prototypical movie's location. Not that there would be a movie right there, but there would be ones nearby, and the nearer they were, the more enjoyable they would be. That's a naive model, but that doesn't mean it wouldn't work well enough. My problem is I don't quite get how to map the user space over to the item space.
> > > > > > > > >
> > > > > > > > > I think that may be what Lance is trying to describe in his last response, but I'm falling short on reconstructing the math from his description.
> > > > > > > > >
> > > > > > > > > I get the following
> > > > > > > > >
> > > > > > > > >   U S V* = A
> > > > > > > > >   U S = A V
> > > > > > > > >   U = A V 1/S
> > > > > > > > >
> > > > > > > > > I think that last line is what Lance was describing. Of course my problem was bootstrapping in a user for whom I don't know A or V.
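As far as I understand it, that last line is also how you map a user into the reduced space: V and S come from the factorization of the existing data, and all you need from the (possibly new) user is their row of interactions. A rough sketch, reusing the V and d from the randomized sketch earlier (or from a plain svd() on a small matrix):

# Sketch: fold a (possibly new) user's interaction row into the reduced space
# via u = a V (1/S). V and d are the item-side factors and singular values
# already computed from the existing data.
fold_in <- function(a_row, V, d) {
  # a_row: a user's row of interactions (length = number of items)
  as.numeric(a_row %*% V) / d
}

u1 <- fold_in(as.numeric(X[1, ]), V, d)   # approximately recovers row 1 of U above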
> > > > > > > > > I also think I may have missed a big step of the puzzle. For some reason I thought that just by loosening the rank, you could recompose the matrix A from the truncated SVD values/vectors and use the recomposed values themselves as the recommendation. I thought one of the ideas was that the recomposed matrix had less "noise" and could be a better representation of the underlying nature of the matrix than the original matrix itself. But that may have just been wishful thinking...
> > > > > > > > >
> > > > > > > > > On Thu, Aug 25, 2011 at 4:50 PM, Sean Owen <[email protected]> wrote:
> > > > > > > > > > The 200x10 matrix is indeed a matrix of 10 singular vectors, which are eigenvectors of AA'. It's the columns, not rows, that are eigenvectors.
> > > > > > > > > >
> > > > > > > > > > The rows do mean something. I think it's fair to interpret the 10 singular values / vectors as corresponding to some underlying features of tastes. The rows say how much each user expresses those 10 tastes. The matrix of right singular vectors on the other side tells you the same thing about items. The diagonal matrix of singular values in the middle also comes into play -- it's like a set of multipliers that say how important each feature is. (This is why we cut out the singular vectors / values that have the smallest singular values; it's like removing the least-important features.) So really you'd have to stick those values somewhere; Ted says it's conventional to put "half" of each (their square roots) with each side, if anything.
> > > > > > > > > >
> > > > > > > > > > I don't have as good a grasp on an intuition for the columns as eigenvectors. They're also a set of basis vectors, and I had understood them as like the "real" bases of the reduced feature space expressed in user-item space. But I'd have to go back and think that intuition through again since I can't really justify it from scratch again in my head just now.
> > > > > > > > > >
> > > > > > > > > > On Thu, Aug 25, 2011 at 10:21 PM, Jeff Hansen <[email protected]> wrote:
> > > > > > > > > > > Well, I think my problem may have had more to do with what I was calling the eigenvector... I was referring to the rows rather than the columns of U and V. While the columns may be characteristic of the overall matrix, the rows are characteristic of the user or item (in that they are a rank-reduced representation of that person or thing).
> > > > > > > > > > > I guess you could say I just had to tilt my head to the side and change my perspective 90 degrees =)

--
Lance Norskog
[email protected]
