Yes, Y' * Y becomes Y' * Cu * Y, where Cu is the diagonal matrix of one user's weights (and likewise per item on the other half of the sweep). See the preso I sent over for where it shows up, and the paper I referenced there for how you still compute this sparsely -- the way you've written it here will be dense and won't scale.
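To make it concrete, it's roughly something like this -- an untested sketch, not code lifted from the paper. It keeps your shapes (A is the sparse users x items matrix, X is users x rank, Y is rank x items), and it assumes A now holds the confidence weights c_ui for observed entries, with an implicit weight of 1 everywhere else, and that lambda is your regularization weight:

% One half-sweep: solve for X given Y. The other half is symmetric.
% Key identity: Y*Cu*Y' = Y*Y' + Y*(Cu - I)*Y', and (Cu - I) is nonzero
% only at the items user u actually touched, so the per-user work stays
% proportional to that user's nonzeros.
YYt = Y*Y';                          % rank x rank, computed once per sweep
X = zeros(size(A,1), rank);
for u = 1:size(A,1)
  idx = find(A(u,:));                % items user u interacted with
  cu  = full(A(u,idx));              % their confidence weights
  Yu  = Y(:,idx);                    % rank x |idx|
  M   = YYt + Yu*diag(cu - 1)*Yu' + lambda*eye(rank);
  X(u,:) = (M \ (Yu*cu'))';          % RHS is Y*Cu*p(u); p is 1 on idx, 0 elsewhere
end

You never form Cu or the dense weighted Gram matrix explicitly; the only dense object is the rank x rank system, which is tiny, and the loop over users parallelizes trivially. The item half of the iteration is the same with the roles of X and Y swapped.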
On Wed, Jan 9, 2013 at 1:28 PM, Koobas <[email protected]> wrote:
> Drilling just a bit more.
> If I just use simple Tikhonov regularization,
> I set both lambdas to identity, and iterate like this (MATLAB):
>
> rank = 50;
> for i=1:6,
>   Y = inv(X'*X+eye(rank))'*X'*A;
>   X = A*Y'*inv(Y*Y'+eye(rank));
> end
>
> Now, can I use weighted regularization and preserve the matrix notation?
> Because it seems to me that I have to go one row of X (one column of Y) at
> a time.
> Is that really so, or am I missing something?
>
>
> On Wed, Jan 9, 2013 at 10:13 AM, Koobas <[email protected]> wrote:
>
>>
>>
>> On Wed, Jan 9, 2013 at 12:40 AM, Sean Owen <[email protected]> wrote:
>>
>>> I think the model you're referring to can use explicit or implicit
>>> feedback. It's using the values -- however they are derived -- as
>>> weights in the loss function rather than values to be approximated
>>> directly. So you still use P even with implicit feedback.
>>>
>>> Of course you can also use ALS to factor R directly if you wanted, also.
>>>
>> Yes, I see it now.
>> It is weighted regression, whether explicit or implicit data.
>> Thank you so much.
>> I think I finally got the picture.
>>
>>
>>> Overfitting is as much an issue as in any ML algorithm. Hard to
>>> quantify it more than that but you certainly don't want to use lambda
>>> = 0.
>>>
>>> The right value of lambda depends on the data -- depends even more on
>>> what you mean by lambda! there are different usages in different
>>> papers. More data means you need less lambda. The effective weight on
>>> the overfitting / Tikhonov terms is about 1 in my experience -- these
>>> terms should be weighted roughly like the loss function terms. But
>>> that can mean using values for lambda much smaller than 1, since
>>> lambda is just one multiplier of those terms in many formulations.
>>>
>>> The rank has to be greater than the effective rank of the data (of
>>> course). It's also something you have to fit to the data
>>> experimentally. For normal-ish data sets of normal-ish size, the right
>>> number of features is probably 20 - 100. I'd test in that range to
>>> start.
>>>
>>> More features tends to let the model overfit more, so in theory you
>>> need more lambda with more features, all else equal.
>>>
>>> It's *really* something you just have to fit to representative sample
>>> data. The optimal answer is way too dependent on the nature,
>>> distribution and size of the data to say more than the above.
>>>
>>>
>>> On Tue, Jan 8, 2013 at 8:54 PM, Koobas <[email protected]> wrote:
>>> > Okay, I got a little bit further in my understanding.
>>> > The matrix of ratings R is replaced with the binary matrix P.
>>> > Then R is used again in regularization.
>>> > I get it.
>>> > This takes care of the situations when you have user-item interactions,
>>> > but you don't have the rating.
>>> > So, it can handle explicit feedback, implicit feedback, and mixed
>>> > (partial / missing feedback).
>>> > If I have implicit feedback, I just drop R altogether, right?
>>> >
>>> > Now the only remaining "trick" is Tikhonov regularization,
>>> > which leads to a couple of questions:
>>> > 1) How much of a problem overfitting is?
>>> > 2) How do I pick lambda?
>>> > 3) How do I pick the rank of the approximation in the first place?
>>> >    How does the overfitting problem depend on the rank of the
>>> >    approximation?
>>
>>
