Points from across several e-mails --

The initial item-feature matrix can be just random unit vectors too. I
have had slightly better results with that.
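
For concreteness, a minimal numpy sketch of that initialization (the
shapes and variable names here are just illustrative, not from any
particular implementation):

  import numpy as np

  n_items, k = 100, 10
  # Draw random Gaussian vectors and normalize each row, so every
  # item starts out as a random direction on the unit sphere.
  M = np.random.randn(n_items, k)
  M /= np.linalg.norm(M, axis=1, keepdims=True)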

You are finding the least-squares solution of A = U M' for U, given A
and M. Yes, you can derive that analytically as the zero of the
derivative of the error function.
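
Setting that derivative to zero gives the normal equations,
U = A M (M'M)^-1. A sketch in numpy, assuming a dense A and no
regularization yet:

  import numpy as np

  def solve_U(A, M):
      # Least-squares solution of A ~= U M' for U, via the normal
      # equations U = A M (M'M)^-1. Solve the k x k system rather
      # than forming the inverse explicitly.
      MtM = M.T.dot(M)                        # k x k
      return np.linalg.solve(MtM, M.T.dot(A.T)).T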

With m users, n items, and k features, where k = n, I suppose you
don't need any iterations at all, since there is a trivial solution:
U = A, M = I(n) (the n x n identity matrix). You would not find this
on the first iteration, however, if you followed the algorithm,
because you would be starting from some random starting point. But if
you initialized M to the identity matrix, yes, you'd find the exact
solution immediately.
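
A toy check of that claim (the names are mine):

  import numpy as np

  A = np.random.rand(5, 4)   # m=5 users, n=4 items
  M = np.eye(4)              # k = n, M initialized to the identity
  # One least-squares solve for U with M = I reproduces A exactly:
  U = np.linalg.solve(M.T.dot(M), M.T.dot(A.T)).T
  assert np.allclose(U, A)
  assert np.allclose(U.dot(M.T), A)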

Yes, it is an iterative algorithm, and you have to define a
convergence criterion. I use the average absolute difference in the
(U M') entries from one iteration to the next (well, over a sample of
them). It's possible that you reach your criterion in one iteration,
or not; it depends on the criterion. Usually, when you restart ALS on
updated data, you use the previous M matrix as a starting point. If
the change in the data is small, one iteration should usually leave
you still "converged" actually. But from a random starting point,
that is not typical.
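
A sketch of that criterion (the sampling scheme here is my own
simplification):

  import numpy as np

  def converged(U_prev, M_prev, U, M, tol=1e-3, n_samples=1000):
      # Average absolute change in a random sample of (U M') entries
      # between two iterations, compared against a tolerance.
      rng = np.random.default_rng(0)
      rows = rng.integers(0, U.shape[0], size=n_samples)
      cols = rng.integers(0, M.shape[0], size=n_samples)
      prev = np.einsum('ij,ij->i', U_prev[rows], M_prev[cols])
      curr = np.einsum('ij,ij->i', U[rows], M[cols])
      return np.mean(np.abs(curr - prev)) < tol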

ALS is similar to the SVD only in broad terms. The SVD is not always
used to make a low-rank factorization, and its outputs carry more
guarantees: they are orthonormal bases, because the scaling factors
have been factored out into the third matrix, Sigma. I think of ALS
as simpler and therefore possibly faster. With k features, I believe
(?) the SVD is necessarily at least a k-iteration process, whereas
ALS might be of use after one iteration. The SVD is not a "shortcut"
for ALS. If you go to the trouble of a full SVD, though, you can
certainly use that factorization as-is.

You need regularization; without it you will overfit the known
entries, and the least-squares solve can become ill-conditioned.
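
That only changes the per-solve system slightly; a sketch with an L2
(ridge) term, where lambda is a hyperparameter you would have to tune:

  import numpy as np

  def solve_U_regularized(A, M, lam=0.1):
      # Regularized normal equations: U = A M (M'M + lambda*I)^-1.
      # The lambda*I term keeps the k x k system well-conditioned
      # and damps overfitting.
      k = M.shape[1]
      G = M.T.dot(M) + lam * np.eye(k)
      return np.linalg.solve(G, M.T.dot(A.T)).T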


It should be pointed out that the "ALS" often spoken of here is not
one that factors the input matrix A. There's a variant, which I have
had good results with, for 'implicit' feedback. There, you are
actually factoring the matrix P, where P_ij = 1 if A_ij != 0 and 0
otherwise, and using the values in A as weights in the loss function.
You're reconstructing "interacted or not" and using the input value
as a confidence measure. This "works" for ratings, although the
interpretation in this variant doesn't line up with the nature of
ratings. It works quite nicely for things like clicks, etc.
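
A per-user solve in that variant looks roughly like the sketch below
(alpha is the confidence scaling; this is a naive dense version,
ignoring the sparsity tricks a real implementation would use):

  import numpy as np

  def solve_user_implicit(a_u, M, lam=0.1, alpha=40.0):
      # a_u: one user's row of A (e.g. click counts), shape (n,).
      # Reconstruct p_u = 1 where a_u != 0, with per-entry confidence
      # c = 1 + alpha * a_u:
      #   x_u = (M' C M + lambda*I)^-1 M' C p_u,  C = diag(c)
      p = (a_u != 0).astype(float)
      c = 1.0 + alpha * a_u
      k = M.shape[1]
      MtCM = M.T.dot(M * c[:, None])   # M' C M without forming diag(c)
      rhs = M.T.dot(c * p)
      return np.linalg.solve(MtCM + lam * np.eye(k), rhs)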

(Much more can be said on this point.)



On Mon, Mar 25, 2013 at 2:19 AM, Dominik Huebner <[email protected]> wrote:
> It's quite hard for me to grasp the mathematical concepts behind the ALS
> recommenders. It would be great if someone could help me figure out
> the details. This is my current status:
>
> 1. The item-feature (M) matrix is initialized using the average ratings
> and random values (explicit case)
>
> 2. The user-feature (U) matrix is solved using the partial derivative of
> the error function with respect to u_i (the row vectors of U)
>
> Suppose we use as many features as there are items, and the error
> function does not use any regularization. Would U be solved within the
> first iteration? If not, I do not understand why more than one iteration
> is needed.
> Furthermore, I believe I have understood that using fewer features than
> items, and also applying regularization, does not allow solving U in a
> way that meets the stopping criterion after only one iteration.
> Thus, iteration is required to converge gradually toward the stopping
> criterion.
>
> I hope I have pointed out my problems clearly enough.
>
