Friday I finally got around to reading Ted's paper "Accurate Methods for the
Statistics of Surprise and Coincidence" to get a better understanding of how
to apply the log likelihood ratio.

Can somebody validate if I'm understanding/applying the idea correctly in
this case?

If we have an item/feature matrix (documents/words, users/movies), then we
can consider each row or column to be a sample from the larger reference
population of the full matrix.  In that case the log likelihood significance
of any given observation within the sample is based on the observation itself
(a = 1), the count of observations for the column (b = documents containing
this word / users that watched this movie), the count of observations for the
row (c = words in this document / movies this user has watched), and the
total count of observations in the reference population (d).  Given those
four numbers, G2, or the log likelihood ratio, is just a calculation over a,
b, c and d.
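
If I'm reading it right, those four numbers are really just the margins needed
to fill in a 2x2 contingency table for a single cell, and G2 is then the usual
G-test statistic 2 * sum(O * ln(O / E)) over the four cells, with the expected
counts E taken from the margins.  Here's the mapping as I understand it (my
own sketch, not taken from the paper):

# hypothetical mapping from the a, b, c, d described above to the four
# cells of a 2x2 contingency table for one user/movie (document/word) pair
k11 <- a              # this user watched this movie
k12 <- c - a          # other movies this user watched
k21 <- b - a          # other users who watched this movie
k22 <- d - b - c + a  # everything else in the matrix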

Given the log likelihood for each individual observation, is the best way to
apply it simply to remove observations that don't have a high enough level of
significance?  Or is there a better way to normalize the values rather than
removing the less significant ones?  Also, in this case I feel like it makes
more sense to treat users like documents and movies like observations of
words within a document, so compare each user and their observations to the
reference population rather than each movie and the users that reviewed it.
It doesn't seem to make a very big difference, but I was wondering if there's
a reason that explains when and why one should be used over the other.

I realize this is a Mahout user list, so I feel somewhat guilty constantly
pasting in R code, but here's how I ended up implementing it.  I'm sure
there's a better way to prune the matrix, but this was the best I could come
up with.

# rough function that calculates the log likelihood ratio (G2) from the
# counts described above
llr <- function(a, b, c, d) {
    r <- (a + b) / (c + d)   # pooled rate
    e1 <- c * r              # expected count corresponding to a
    e2 <- d * r              # expected count corresponding to b
    g2 <- 2 * (a * log(a / e1) + b * log(b / e2))
    return(g2)
}
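
For comparison, here's a sketch of the full four-cell version computed via
entropies.  This is my own reconstruction rather than code from the paper, so
treat the names and details as assumptions:

# sum of p*log(p) over the cells of a count table (zero cells contribute 0)
H <- function(k) {
    N <- sum(k)
    sum(k / N * log(k / N + (k == 0)))
}
# full G2 for a single 2x2 table built from the k11, k12, k21, k22 above;
# you'd need to loop or vectorize to apply it to every matrix entry
llr_full <- function(k11, k12, k21, k22) {
    k <- matrix(c(k11, k12, k21, k22), nrow = 2)
    2 * sum(k) * (H(k) - H(rowSums(k)) - H(colSums(k)))
}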

# get the counts that feed the llr function above
a <- 1           # each observed entry in the binary matrix is a 1
b <- colSums(A)  # per-column totals
c <- rowSums(A)  # per-row totals
d <- sum(A)      # total number of observations in the reference population

# export sparse matrix A to a triplet dataframe for calculating llr
triples <- summary(A)
# create b and c value vectors that line up with the rows of the dataframe
bx <- b[triples$j]
cx <- c[triples$i]
# calculate the log likelihood for each entry in the dataframe
ll <- llr(a, bx, cx, d)
# logical vector indicating whether each entry is significant
significant <- ll > 3
# build a new sparse matrix with only the entries whose log likelihood is
# significant enough; dims keeps the original shape even if trailing
# rows/columns end up empty
B <- sparseMatrix(i = triples$i[significant], j = triples$j[significant],
                  x = 1, dims = dim(A))
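
To partly answer my own question about normalizing instead of pruning, one
option (just a sketch of an idea, not something from the paper) would be to
keep the LLR values themselves as the non-zero weights:

# same pruning, but keep the llr value as the weight instead of a flat 1
Bw <- sparseMatrix(i = triples$i[significant], j = triples$j[significant],
                   x = ll[significant], dims = dim(A))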

By the way, I always used to struggle with matrix multiplication (I could
never quite remember which side was rows and which side was columns without
little mnemonics, but then again I've always struggled with remembering my
left from my right).  I recently realized it makes much more sense if you
picture it as a cube in 3-space: if you match up the lower left hand corner
of the right matrix with the upper right hand corner of the left matrix and
put the product in between them so you get a backwards capital L shape, then
fold back the right and left matrices, you get three faces of a cube (or
rectangular box).  Then you just project inward from the original matrices,
multiply wherever the values cross, and sum up those products as you project
them toward the face that is the resulting matrix.  Once you visualize it
this way it makes slicing and dicing the operation into blockwise formulas
trivial and obvious.  If only somebody had explained it this way I would have
found matrices much less intimidating -- has anybody ever seen it explained
this way?
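
As a trivial check of the blockwise picture (toy matrices, nothing to do with
the movie data): splitting the left matrix into column blocks and the right
matrix into matching row blocks gives the same product.

# blockwise multiplication agrees with the full product
M <- matrix(1:6,  nrow = 2)                   # 2x3
N <- matrix(1:12, nrow = 3)                   # 3x4
M1 <- M[, 1:2]; M2 <- M[, 3, drop = FALSE]    # column blocks of M
N1 <- N[1:2, ]; N2 <- N[3, , drop = FALSE]    # matching row blocks of N
all.equal(M %*% N, M1 %*% N1 + M2 %*% N2)     # TRUE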




On Fri, Aug 26, 2011 at 10:21 PM, Lance Norskog <[email protected]> wrote:

>
> http://www.nytimes.com/interactive/2010/01/10/nyregion/20100110-netflix-map.html
>
> Do not fear demographics. Yes, some people rent movies with all-black
> casts,
> and other people rent movies with all-white casts. And the Walmarts in the
> SF East Bay have pallets full of Tyler Perry videos, while most of the SF
> Peninsula don't know his name.
>
> And sometimes a bunch of Army Rangers have a great time watching
> "Confessions of a Shopaholic" in a tent, 120o farenheit, in Qatar.
>
> On Fri, Aug 26, 2011 at 8:29 AM, Jeff Hansen <[email protected]> wrote:
>
> > Thanks for the math Ted -- that was very helpful.
> >
> > I've been using sparseMatrix() from library(Matrix) -- largely based on
> your
> > response to somebody else's email.  I've been playing with smaller
> matrices
> > mainly for my own learning purposes -- it's much easier to read through
> 200
> > movies (most of which I've heard of) and get a gut feel, than 10,000
> > movies.
> >  It also means I don't have to shut down my session quite as often (I've
> > been using rstudio) when I run something over the wrong dimension(column
> vs
> > row).  I was running into some limitations as well, but I think some of
> > those may have had more to do with my own typos and misunderstanding of
> > that
> > language.
> >
> > There's a second reason I've been avoiding the tail -- while working on
> > this
> > as a possible project to propose, I had a bit of a "duh" moment.
>  Sometimes
> > it's important to realize the real world constraints.  Picture a company
> > with very physical locations and very limited shelf space (say a
> > convenience
> > store or even a vending machine...)  Netflix and Amazon can make a lot of
> > money in the long tail because they have centralized and or digital
> > inventories that service a large customer base.  Imagine a situation
> where
> > you have a small physical distributed inventory with an equally
> distributed
> > customer base.  In that situation it doesn't pay to chase after the tail
> --
> > you just need to reduce your costs and target the head where you get more
> > volume with less items.  So you really end up with a three tier market
> > segmentation -- one strategy works best for the head, another for the
> body,
> > and a third for the tail.
> >
> > As far as clusters go -- I really wasn't finding any clusters at the
> edges
> > of the data, but that could have more to do with not including the tail
> > (and
> > not normalizing appropriately for popularity).
> >
> > Incidentally, there was one amusingly cautionary anecdote I'd share -- I
> > don't remember the exact spread, but at one point I had two extremes that
> > included something like "the johnson family vacation", "my baby's daddy",
> > and "barbershop 2" on the one side, and then titles like "mean girls",
> "50
> > first dates" and "along came polly" on the other side.  You may have to
> > look
> > up those titles to see what I'm talking about, but I'd say the
> distinction
> > will be pretty clear when you do -- and if you were simply automating all
> > of
> > this and not reviewing it with common sense, you could end up offending
> > some
> > of your users...  All I'm saying is there may be trends in the real world
> > that some people aren't comfortable having pointed out.
> >
> > On Fri, Aug 26, 2011 at 4:08 AM, Lance Norskog <[email protected]>
> wrote:
> >
> > > This is a little meditation on user v.s. item matrix density. The
> > > heavy users and heavy items can be subsampled, once they are
> > > identified. Hadoop's built-in sort does give a very simple
> > > "map-increase" way to do this sort.
> > >
> > > http://ultrawhizbang.blogspot.com/2011/08/sorted-recommender-data.html
> > >
> > > On Thu, Aug 25, 2011 at 5:57 PM, Ted Dunning <[email protected]>
> > > wrote:
> > > > In matrix terms the binary user x item matrix maps a set of items to
> > > users
> > > > (A h = users who interacted with items in h).  Similarly A' maps
> users
> > to
> > > > items.  Thus A' (A h) is the classic "users who x'ed this also x'ed
> > that"
> > > > sort of operation.  This can be rearranged to be (A'A) h.
> > > >
> > > > This is where the singular values come in.  If
> > > >
> > > >     A \approx U S V'
> > > >
> > > > and we know that U'U = V'V = I from the definition of the SVD.  This
> > > means
> > > > that
> > > >
> > > >    A' A \approx (U S V')' (U S V') = V S U' U S V' = V S^2 V'
> > > >
> > > > But V' transforms an item vector into the reduced dimensional space
> so
> > we
> > > > can view this as transforming the item vector into the reduced
> > > dimensional
> > > > space, scaling element-wise using S^2 and then transforming back to
> > item
> > > > space.
> > > >
> > > > As you noted, the first right singular vector corresponds to
> > popularity.
> > >  It
> > > > is often not appropriate to just recommend popular things (since
> users
> > > have
> > > > other means of discovering them) so you might want to drop that
> > dimension
> > > > from the analysis.  This can be done in the SVD results or it can be
> > done
> > > > using something like a log-likelihood ratio test so that we only
> model
> > > > anomalously large values in A'A.
> > > >
> > > > Btw, if you continue to use R for experimentation, you should be able
> > to
> > > > dramatically increase the size of the analysis you do by using sparse
> > > > matrices and a random projection algorithm.  I was able to compute
> > SVD's
> > > of
> > > > matrices with millions of rows and columns and a few million non-zero
> > > > elements in a few seconds.  Holler if you would like details about
> how
> > to
> > > do
> > > > this.
> > > >
> > > > On Thu, Aug 25, 2011 at 4:07 PM, Jeff Hansen <[email protected]>
> > wrote:
> > > >
> > > >> One thing I found interesting (but not particularly surprising) is
> > that
> > > the
> > > >> biggest singular value/vector was pretty much tied directly to
> volume.
> > > >>  That
> > > >> makes sense because the best predictor of whether a given fields
> value
> > > was
> > > >> 1
> > > >> was whether it belonged to a row with lots of 1s or a column with
> lots
> > > of
> > > >> 1s
> > > >> (I haven't quite figured out the best method to normalize the values
> > > with
> > > >> yet).  When I plot the values of the largest singular vectors
> against
> > > the
> > > >> sum of values in the corresponding row or column, the correlation is
> > > very
> > > >> linear.  That's not the case for any of the other singular vectors
> > > (which
> > > >> actually makes me wonder if just removing that first singular vector
> > > from
> > > >> the prediction might not be one of the best ways to normalize the
> > data).
> > > >>
> > > >> I understand what you're saying about each singular vector
> > corresponding
> > > to
> > > >> a feature though.  Each left singular vector represents some
> abstract
> > > >> aspect
> > > >> of a movie and each right singular vector represents users leanings
> or
> > > >> inclinations with regards to that aspect of the movie.  The singular
> > > value
> > > >> itself just seems to indicate how good a predictor the combination
> of
> > > one
> > > >> users inclination toward that aspect of a movie is for coming up
> with
> > > the
> > > >> actual value.  The issue I mentioned above is that popularity of a
> > movie
> > > as
> > > >> well as how often a user watches movies tend to be the best
> predictors
> > > of
> > > >> whether a user has seen or will see a movie.
> > > >>
> > > >> I had been picturing this with the idea of one k dimensional space
> --
> > > one
> > > >> where a users location corresponds to their ideal prototypical
> movies
> > > >> location. Not that there would be a movie right there, but there
> would
> > > be
> > > >> ones nearby, and the nearer they were, the more enjoyable they would
> > be.
> > > >>  That's a naive model, but that doesn't mean it wouldn't work well
> > > enough.
> > > >>  My problem is I don't quite get how to map the user space over to
> the
> > > item
> > > >> space.
> > > >>
> > > >> I think that may be what Lance is trying to describe in his last
> > > response,
> > > >> but I'm falling short on reconstructing the math from his
> description.
> > > >>
> > > >> I get the following
> > > >>
> > > >> U S V* = A
> > > >> U S = A V
> > > >> U = A V 1/S
> > > >>
> > > >> I think that last line is what Lance was describing.  Of course my
> > > problem
> > > >> was bootstrapping in a user for whom I don't know A or V.
> > > >>
> > > >> I also think I may have missed a big step of the puzzle.  For some
> > > reason I
> > > >> thought that just by loosening the rank, you could recompose the
> > Matrix
> > > A
> > > >> from the truncated SVD values/vectors and use the recomposed values
> > > >> themselves as the recommendation.  I thought one of the ideas was
> that
> > > the
> > > >> recomposed matrix had less "noise" and could be a better
> > representation
> > > of
> > > >> the underlying nature of the matrix than the original matrix itself.
> > >  But
> > > >> that may have just been wishful thinking...
> > > >>
> > > >> On Thu, Aug 25, 2011 at 4:50 PM, Sean Owen <[email protected]>
> wrote:
> > > >>
> > > >> > The 200x10 matrix is indeed a matrix of 10 singular vectors, which
> > are
> > > >> > eigenvectors of AA'. It's the columns, not rows, that are
> > > >> > eigenvectors.
> > > >> >
> > > >> > The rows do mean something. I think it's fair to interpret the 10
> > > >> > singular values / vectors as corresponding to some underlying
> > features
> > > >> > of tastes. The rows say how much each user expresses those 10
> > tastes.
> > > >> > The matrix of right singular vectors on the other side tells you
> the
> > > >> > same thing about items. The diagonal matrix of singular values in
> > the
> > > >> > middle also comes into play -- it's like a set of multipliers that
> > say
> > > >> > how important each feature is. (This is why we cut out the
> singular
> > > >> > vectors / values that have the smallest singular values; it's like
> > > >> > removing the least-important features.) So really you'd have to
> > stick
> > > >> > those values somewhere; Ted says it's conventional to put "half"
> of
> > > >> > each (their square roots) with each side if anything.
> > > >> >
> > > >> > I don't have as good a grasp on an intuition for the columns as
> > > >> > eigenvectors. They're also a set of basis vectors, and I had
> > > >> > understood them as like the "real" bases of the reduced feature
> > space
> > > >> > expressed in user-item space. But I'd have to go back and think
> that
> > > >> > intuition through again since I can't really justify it from
> scratch
> > > >> > again in my head just now.
> > > >> >
> > > >> >
> > > >> > On Thu, Aug 25, 2011 at 10:21 PM, Jeff Hansen <[email protected]
> >
> > > >> wrote:
> > > >> > > Well, I think my problem may have had more to do with what I was
> > > >> calling
> > > >> > the
> > > >> > > eigenvector...  I was referring to the rows rather than the
> > columns
> > > of
> > > >> U
> > > >> > and
> > > >> > > V.  While the columns may be characteristic of the overall
> matrix,
> > > the
> > > >> > rows
> > > >> > > are characteristic of the user or item (in that they are a rank
> > > reduced
> > > >> > > representation of that person or thing). I guess you could say I
> > > just
> > > >> had
> > > >> > to
> > > >> > > tilt my head to the side and change my perspective 90 degrees =)
> > > >> > >
> > > >> >
> > > >>
> > > >
> > >
> > >
> > >
> > > --
> > > Lance Norskog
> > > [email protected]
> > >
> >
>
>
>
> --
> Lance Norskog
> [email protected]
>
