Yes, Y' * Y becomes Y' * Cu * Y, where Cu is the diagonal matrix of one user's weights (and likewise per item on the other half of the sweep). See the preso I sent over for where it shows up, and the paper I referenced there for how you still compute this sparsely -- the way you've written it here will be dense and won't scale.
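To make it concrete, it's roughly something like this -- an untested sketch, not code lifted from the paper. It keeps your shapes (A is the sparse users x items matrix, X is users x rank, Y is rank x items), and it assumes A now holds the confidence weights c_ui for observed entries, with an implicit weight of 1 everywhere else, and that lambda is your regularization weight:

% One half-sweep: solve for X given Y. The other half is symmetric.
% Key identity: Y*Cu*Y' = Y*Y' + Y*(Cu - I)*Y', and (Cu - I) is nonzero
% only at the items user u actually touched, so the per-user work stays
% proportional to that user's nonzeros.
YYt = Y*Y';                          % rank x rank, computed once per sweep
X = zeros(size(A,1), rank);
for u = 1:size(A,1)
  idx = find(A(u,:));                % items user u interacted with
  cu  = full(A(u,idx));              % their confidence weights
  Yu  = Y(:,idx);                    % rank x |idx|
  M   = YYt + Yu*diag(cu - 1)*Yu' + lambda*eye(rank);
  X(u,:) = (M \ (Yu*cu'))';          % RHS is Y*Cu*p(u); p is 1 on idx, 0 elsewhere
end

You never form Cu or the dense weighted Gram matrix explicitly; the only dense object is the rank x rank system, which is tiny, and the loop over users parallelizes trivially. The item half of the iteration is the same with the roles of X and Y swapped.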
On Wed, Jan 9, 2013 at 1:28 PM, Koobas <[email protected]> wrote:
> Drilling just a bit more.
> If I just use simple Tikhonov regularization,
> I set both lambdas to identity, and iterate like this (MATLAB):
>
> rank = 50;
> for i=1:6,
>   Y = inv(X'*X+eye(rank))'*X'*A;
>   X = A*Y'*inv(Y*Y'+eye(rank));
> end
>
> Now, can I use weighted regularization and preserve the matrix notation?
> Because it seems to me that I have to go one row of X (one column of Y) at
> a time.
> Is that really so, or am I missing something?
>
>
> On Wed, Jan 9, 2013 at 10:13 AM, Koobas <[email protected]> wrote:
>
>>
>>
>> On Wed, Jan 9, 2013 at 12:40 AM, Sean Owen <[email protected]> wrote:
>>
>>> I think the model you're referring to can use explicit or implicit
>>> feedback. It's using the values -- however they are derived -- as
>>> weights in the loss function rather than values to be approximated
>>> directly. So you still use P even with implicit feedback.
>>>
>>> Of course you can also use ALS to factor R directly if you wanted, also.
>>>
>> Yes, I see it now.
>> It is weighted regression, whether explicit or implicit data.
>> Thank you so much.
>> I think I finally got the picture.
>>
>>
>>> Overfitting is as much an issue as in any ML algorithm. Hard to
>>> quantify it more than that but you certainly don't want to use lambda
>>> = 0.
>>>
>>> The right value of lambda depends on the data -- depends even more on
>>> what you mean by lambda! there are different usages in different
>>> papers. More data means you need less lambda. The effective weight on
>>> the overfitting / Tikhonov terms is about 1 in my experience -- these
>>> terms should be weighted roughly like the loss function terms. But
>>> that can mean using values for lambda much smaller than 1, since
>>> lambda is just one multiplier of those terms in many formulations.
>>>
>>> The rank has to be greater than the effective rank of the data (of
>>> course). It's also something you have to fit to the data
>>> experimentally. For normal-ish data sets of normal-ish size, the right
>>> number of features is probably 20 - 100. I'd test in that range to
>>> start.
>>>
>>> More features tends to let the model overfit more, so in theory you
>>> need more lambda with more features, all else equal.
>>>
>>> It's *really* something you just have to fit to representative sample
>>> data. The optimal answer is way too dependent on the nature,
>>> distribution and size of the data to say more than the above.
>>>
>>>
>>> On Tue, Jan 8, 2013 at 8:54 PM, Koobas <[email protected]> wrote:
>>> > Okay, I got a little bit further in my understanding.
>>> > The matrix of ratings R is replaced with the binary matrix P.
>>> > Then R is used again in regularization.
>>> > I get it.
>>> > This takes care of the situations when you have user-item interactions,
>>> > but you don't have the rating.
>>> > So, it can handle explicit feedback, implicit feedback, and mixed
>>> > (partial / missing feedback).
>>> > If I have implicit feedback, I just drop R altogether, right?
>>> >
>>> > Now the only remaining "trick" is Tikhonov regularization,
>>> > which leads to a couple of questions:
>>> > 1) How much of a problem overfitting is?
>>> > 2) How do I pick lambda?
>>> > 3) How do I pick the rank of the approximation in the first place?
>>> >    How does the overfitting problem depend on the rank of the
>>> >    approximation?
>>
>>
